# Ovis VLM with RAG (Retrieval-Augmented Generation)

This notebook demonstrates how to add RAG capabilities to the Ovis Vision-Language Model. With a retrieval system in front of it, the model can ground its answers in an external knowledge base instead of relying only on what it learned during training, which can make responses more accurate and detailed.

In [None]:
!pip install Pillow sentence-transformers pandas openpyxl chromadb langchain torch transformers

## Step 1: Prepare Knowledge Base

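First, we'll create a simple knowledge base. We read station records from an Excel file, join the text columns of each row into a single document string, and convert those strings into vector embeddings with a sentence transformer model (`BAAI/bge-m3`). If the Excel file cannot be loaded, a small set of sample documents is used instead. The embeddings are then stored in a ChromaDB collection for fast retrieval.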

In [6]:
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer

EXCEL_PATH = '/home/aisw/Project/UST-ETRI-2025/VLM_RAG/data/korean_train_20250101.xlsx'

# Fallback documents (facts about Gwangjang Market) used when the Excel file cannot be loaded
SAMPLE_DOCUMENTS = [
    "광장시장은 대한민국 서울특별시 종로구에 위치한 전통 시장이다.",
    "1905년에 개설되었으며, 대한민국 최초의 상설 시장으로 알려져 있다.",
    "주요 판매 품목은 한복, 직물, 구제 의류, 그리고 다양한 먹거리이다.",
    "특히 빈대떡, 마약김밥, 육회 등이 유명하여 많은 관광객들이 찾는다."
]

# Columns: station name (Korean), station name (English), operating line,
# lot-number address, road-name address, notes
text_columns = [
    '역명(한글)', '역명(영어)', '운영노선', '역 주소(지번주소)', '역 주소(도로명 주소)', '참고사항'
]

loaded_from_excel = False
try:
    print('Step 1: Reading Excel file...')
    df = pd.read_excel(EXCEL_PATH)
    print('Step 2: Creating combined text columns...')
    # Note: astype(str) turns missing cells into the literal string 'nan'
    df['임베딩텍스트'] = df[text_columns].astype(str).agg(' / '.join, axis=1)
    documents = df['임베딩텍스트'].tolist()
    print(f'Step 3: Total documents to embed: {len(documents)}')
    loaded_from_excel = True
except FileNotFoundError:
    print("Error: korean_train_20250101.xlsx not found. Using sample data instead.")
    documents = SAMPLE_DOCUMENTS
except Exception as e:
    print(f"An error occurred: {e}. Using sample data.")
    documents = SAMPLE_DOCUMENTS

# (Optional) use only part of the data
# documents = documents[:1000]

print('Step 4: Loading embedding model...')
embedding_model = SentenceTransformer('BAAI/bge-m3')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Step 5: Starting embedding (this may take a while)...')
doc_embeddings = embedding_model.encode(documents, show_progress_bar=True, device=device)
print('Step 6: Embedding finished.')
if loaded_from_excel:
    print("Successfully loaded and combined text from Excel file.")
print("Document embeddings created successfully.")
print("Shape of embeddings:", doc_embeddings.shape)

Step 1: Reading Excel file...




Step 2: Creating combined text columns...
Step 3: Total documents to embed: 1070
Step 4: Loading embedding model...
Step 5: Starting embedding (this may take a while)...


Batches: 100%|██████████| 34/34 [00:01<00:00, 33.08it/s]

Step 6: Embedding finished.
Successfully loaded and combined text from Excel file.
Document embeddings created successfully.
Shape of embeddings: (1070, 1024)





In [8]:
import chromadb

# Initialize ChromaDB client (in-memory)
client = chromadb.Client()

# Create a new collection or get an existing one
collection_name = "korean_knowledge_base"
collection = client.get_or_create_collection(name=collection_name)

# Generate IDs for each document
doc_ids = [str(i) for i in range(len(documents))]

# Add documents to the collection
collection.add(
    embeddings=doc_embeddings.tolist(),
    documents=documents,
    ids=doc_ids
)

print(f"ChromaDB collection '{collection_name}' created/updated with {collection.count()} documents.")

ChromaDB collection 'korean_knowledge_base' created/updated with 1070 documents.
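
To make the retrieval step less of a black box, here is a minimal brute-force sketch of what a vector store does conceptually: score every stored vector against the query vector and keep the best `k`. This is an illustration only, not what ChromaDB runs internally. ChromaDB uses an approximate index and its own configured distance metric, while this sketch assumes cosine similarity and tiny hand-written vectors. The names `cosine_similarity`, `top_k`, and `toy_docs` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dimensional "embeddings" standing in for the 1024-dimensional bge-m3 vectors
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.1], toy_docs, k=2)
print(best)
```

With 1,070 documents a brute-force scan like this would also be fast enough; an index matters mainly once the collection grows much larger.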


## Step 2: Implement Retriever

Now, we'll create a retriever function. It takes a user's query, embeds it with the same sentence transformer model used for the documents, and queries the ChromaDB collection for the `k` most similar documents.

In [9]:
def retrieve_documents(query, k=2):
    # Embed the query
    query_embedding = embedding_model.encode([query]).tolist()
    
    # Query the collection
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=k
    )
    
    # Return the retrieved documents
    return results['documents'][0]

# Test the retriever with a Korean query
# ("Please tell me the latitude and longitude of Pyeongchon Station")
test_query = "평촌역의 위도와 경도를 알려주세요"
retrieved = retrieve_documents(test_query)
print(f"Query: {test_query}")
print("Retrieved documents:")
for doc in retrieved:
    print(f"- {doc}")

Query: 평촌역의 위도와 경도를 알려주세요
Retrieved documents:
- 평촌역 / Pyeongchon / 4호선 / nan / 경기도 안양시 동안구 부림로 지하 123 / nan
- 강촌역 / Gangchon / 경춘선 / 강원도 춘천시 남산면 방곡리 409 / 강원도 춘천시 남산면 강촌로 150 / nan
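
The retrieved documents above contain the literal text `nan` wherever a spreadsheet cell was empty, because `astype(str)` converts missing values to that string. One possible way to keep those placeholders out of the embedded text is to skip empty fields when joining a row. The helper below is a sketch under that assumption. `join_fields` is a hypothetical name, not part of the notebook, and the sample row is copied from the output above.

```python
import math

def join_fields(values, sep=" / "):
    """Join field values, skipping None, NaN, empty strings, and the literal 'nan'."""
    kept = []
    for value in values:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        if not text or text.lower() == "nan":
            continue
        kept.append(text)
    return sep.join(kept)

# Row for 평촌역 with its two missing cells represented as NaN and None
row = ["평촌역", "Pyeongchon", "4호선", float("nan"),
       "경기도 안양시 동안구 부림로 지하 123", None]
cleaned = join_fields(row)
print(cleaned)
```

Applied per row (for example with `df[text_columns].apply(lambda r: join_fields(r.tolist()), axis=1)`), this would replace the `astype(str).agg(...)` step. Whether it improves retrieval on this dataset has not been measured here.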


## Step 3: Load Ovis VLM Model

Next, we load the Ovis VLM model and its tokenizers. This code is adapted from your `main.py` script.

In [10]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_path = "AIDC-AI/Ovis2-8B"

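# Prefer bfloat16 when a CUDA device supports it; otherwise fall back to float16
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
torch_dtype = torch.bfloat16 if use_bf16 else torch.float16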

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch_dtype,
    trust_remote_code=True,
    cache_dir="./hf_cache",
    device_map="auto",
    low_cpu_mem_usage=True,
)

tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

print("Ovis VLM model and tokenizers loaded successfully.")

A new version of the following files was downloaded from https://huggingface.co/AIDC-AI/Ovis2-8B:
- configuration_ovis.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/AIDC-AI/Ovis2-8B:
- modeling_ovis.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]Error while downloading from https://cdn-lfs-us-1.hf.co/repos/1f/b4/1fb40860dac4e815d5d9936016bd3b1801c54c24cef5aed8132a9628677579d9/480011cf1313ddd3be5f9a755a199685520b213053c772ea79d60d1ae1375e3c?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model-00001-of-00004.safetensors%3B+filename%3D%22model-00001-of-00004.safetensors%22%3B&Expires=1753769174&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpd

Ovis VLM model and tokenizers loaded successfully.
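
The download log above suggests pinning a revision so that the remote code files (`configuration_ovis.py`, `modeling_ovis.py`) do not change between runs. A small sketch of how the loading arguments could be assembled with an optional `revision` follows. `build_load_kwargs` is a hypothetical helper, and the revision value is a placeholder, since this notebook does not record a specific commit.

```python
def build_load_kwargs(revision=None, cache_dir="./hf_cache"):
    """Collect keyword arguments for from_pretrained, optionally pinning a revision."""
    kwargs = {
        "trust_remote_code": True,
        "cache_dir": cache_dir,
        "device_map": "auto",
        "low_cpu_mem_usage": True,
    }
    if revision is not None:
        # A branch name, tag, or commit hash on the Hugging Face Hub
        kwargs["revision"] = revision
    return kwargs

load_kwargs = build_load_kwargs(revision="<commit-hash>")  # placeholder, not a real hash
print(load_kwargs)

# Assumed usage, with model_path and torch_dtype defined as in the cell above:
# model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch_dtype, **load_kwargs)
```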


## Step 4: RAG-Enhanced Inference

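Finally, we'll combine everything in three steps. First, Ovis generates a short description of the image. Second, that description is embedded and used as the query against the ChromaDB collection to retrieve related documents. Third, the retrieved documents are placed in a prompt together with the user's question and the image, and Ovis generates the RAG-enhanced answer.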

In [18]:
# Step 1: Have Ovis generate an initial description from the image
import os
image_path = '/home/aisw/Project/UST-ETRI-2025/VLM_RAG/data/Pyeongchon_station.jpg'  # replace with the actual image path

if not os.path.exists(image_path):
    print(f"Warning: Image not found at {image_path}. Please upload an image.")
    images = [Image.new('RGB', (512, 512), color='blue')]
else:
    images = [Image.open(image_path)]

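# First prompt: ask for a brief description based on the image alone
# (Korean prompt: "Please look at this image and describe it briefly.")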
image_only_prompt = "이 이미지를 보고 간단히 설명해 주세요. <image>"
max_partition = 9
prompt, input_ids, pixel_values = model.preprocess_inputs(image_only_prompt, images, max_partition=max_partition)
attention_mask = torch.ne(input_ids, tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]

with torch.inference_mode():
    gen_kwargs = dict(max_new_tokens=256, do_sample=False)
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    image_description = tokenizer.decode(output_ids, skip_special_tokens=True)

print("--- 이미지 설명 ---")
print(image_description)

# Step 2: Retrieve related entries from the vector DB, using the image description as the query
k = 3
query_embedding = embedding_model.encode([image_description], show_progress_bar=False, device='cuda' if torch.cuda.is_available() else 'cpu')
results = collection.query(query_embeddings=query_embedding.tolist(), n_results=k)
retrieved_context = results['documents'][0]
context_str = "\n".join(retrieved_context)

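# Step 3: Build the final RAG prompt and generate the answer
user_question = "이 역에서 환승이 가능한가요?"  # "Can I transfer at this station?"; replace with your own question

# Prompt hardening: the model must answer only from [문맥] (context) and otherwise reply
# 'VectorDB(지식베이스)에서 답변을 찾을 수 없습니다.' ("No answer could be found in the VectorDB (knowledge base).")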
rag_prompt = f"""
아래 [문맥]에 주어진 정보만을 근거로 [질문]에 답변하세요.
- 반드시 [문맥]의 내용을 참고하여 답변하세요.
- [문맥]에 답이 없으면, 'VectorDB(지식베이스)에서 답변을 찾을 수 없습니다.'라고 답하세요.
- [문맥]에 없는 내용을 상상하거나 지어내지 마세요.

[문맥]
{context_str if context_str.strip() else '없음'}

[질문]
{user_question}
<image>
"""

prompt, input_ids, pixel_values = model.preprocess_inputs(rag_prompt, images, max_partition=max_partition)
attention_mask = torch.ne(input_ids, tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]

with torch.inference_mode():
    gen_kwargs = dict(max_new_tokens=1024, do_sample=False)
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    output = tokenizer.decode(output_ids, skip_special_tokens=True)

print("--- RAG 기반 답변 ---")
print(f'결과:\n{output}')

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


--- 이미지 설명 ---
이 이미지는 서울 지하철 4호선의 평촌역을 나타내는 표지판입니다. 표지판은 파란색 테두리와 흰색 배경으로 구성되어 있으며, 왼쪽에는 파란색 원 안에 "441"이라는 숫자가 표시되어 있습니다. 오른쪽에는 "평촌 (한림대성심병원)"이라는 한글과 "Pyeongchon"이라는 영문, 그리고 "坪村"이라는 한자 표기가 있습니다. 표지판은 벽에 부착되어 있으며, 벽은 베이지색과 분홍색 타일로 구성되어 있습니다.


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


--- RAG 기반 답변 ---
결과:
평촌역은 4호선에 위치한 역으로, 서울역과 환승이 가능합니다.
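
The prompt assembly in the cell above can also be expressed as a small reusable function, which makes it easier to try different questions or contexts. The version below is a sketch, not the notebook's code: `build_rag_prompt` and `DEFAULT_INSTRUCTIONS` are hypothetical names, and the instructions are an English paraphrase of the Korean prompt used above.

```python
DEFAULT_INSTRUCTIONS = (
    "Answer the [Question] using only the information given in [Context].\n"
    "- If [Context] does not contain the answer, say that no answer was found "
    "in the knowledge base.\n"
    "- Do not invent anything that is not in [Context]."
)

def build_rag_prompt(question, contexts, instructions=DEFAULT_INSTRUCTIONS,
                     empty_marker="None"):
    """Assemble instructions, retrieved context, question, and the <image> placeholder."""
    kept = [c.strip() for c in contexts if c and c.strip()]
    # Fall back to an explicit marker so the model sees that nothing was retrieved
    context_str = "\n".join(kept) if kept else empty_marker
    return (
        f"{instructions}\n\n"
        f"[Context]\n{context_str}\n\n"
        f"[Question]\n{question}\n"
        "<image>\n"
    )

prompt_text = build_rag_prompt(
    "Can I transfer at this station?",
    ["평촌역 / Pyeongchon / 4호선"],
)
print(prompt_text)
```

The returned string could be passed to `model.preprocess_inputs` in place of `rag_prompt`.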


In [13]:
context_str

'언주 / Eonju / 9호선 / 서울특별시 강남구 논현동 279-165 / 서울특별시 강남구 봉은사로 201 / nan\n금남로5가 / Geumnamno 5(o)-ga  / 1호선 / 광주광역시 북구 북동 299 / 광주광역시 북구 금남로 지하 138(북동) / nan\n수서역 / Suseo / GTX-A / 서울 강남구 수서동 728 / 서울 강남구 광평로 지하270 / nan'
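
The `context_str` shown above lists entries for 언주, 금남로5가, and 수서역, none of which is 평촌역, so the retrieved context does not necessarily match the station in the image. One possible safeguard is to drop retrieved documents whose distance to the query is too large, so that the prompt falls back to the "no context" branch instead of passing unrelated text to the model. The sketch below assumes the result layout returned by `collection.query` (lists of lists under `documents` and `distances`). `filter_by_distance` is a hypothetical helper, and the threshold and distance values are made up for illustration.

```python
def filter_by_distance(results, max_distance):
    """Keep documents of the first query whose distance is at most max_distance."""
    docs = results["documents"][0]
    dists = results["distances"][0]
    return [doc for doc, dist in zip(docs, dists) if dist <= max_distance]

# Hand-written stand-in for a collection.query(...) result with one query
fake_results = {
    "documents": [["doc A", "doc B", "doc C"]],
    "distances": [[0.2, 0.9, 1.4]],
}
kept_docs = filter_by_distance(fake_results, max_distance=1.0)
print(kept_docs)
```

A suitable threshold depends on the embedding model and on the distance metric configured for the collection, so it would need to be tuned on real queries.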