# Hybrid Search Data Preparation Pipeline

This notebook prepares your OCR and object detection data for a hybrid search system. It produces two main outputs:

1.  `keyword_data.csv`: A file ready for a keyword/lexical search engine like **Elasticsearch** or a **BM25** index.
2.  `vector_data.pkl`: A file containing vector embeddings, ready for a vector database like **Milvus**.

You can run this notebook in an environment like Google Colab.

### Step 1: Install Dependencies

In [None]:
!pip install sentence-transformers pandas tqdm

### Step 2: Load Raw Data

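We'll load the data from the `ocr_data.jsonl` file, which should be uploaded to your Colab environment alongside this notebook. Each line is expected to be one JSON object with a `keyframe_id`, an `ocr_text` list of strings, and an `object` list of detected-object labels (these are the fields read in Step 3).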

In [None]:
import json
import pandas as pd

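data = []
# Read as UTF-8 explicitly, since the sample data contains Vietnamese text
with open('ocr_data.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines, which json.loads would reject
            data.append(json.loads(line))

print(f"Loaded {len(data)} records.")
if data:
    print("First record:", data[0])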

### Step 3: Process Data and Generate Embeddings

Now, we'll iterate through the data and prepare it for our two different search indexes.

In [None]:
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
import pickle

print("Loading sentence transformer model... (This may take a moment)")
# Using a multilingual model as the sample data contains Vietnamese
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
print("Model loaded.")

keyword_data = []
vector_data = []

for item in tqdm(data, desc="Processing records"):
    keyframe_id = item['keyframe_id']
    ocr_text = " ".join(item['ocr_text'])
    object_text = " ".join(item['object'])
    
    # 1. Prepare data for KEYWORD search (BM25/Elasticsearch)
    # We create a single, simple text field containing all keywords.
    combined_text_for_keyword = f"{ocr_text} {object_text}"
    keyword_data.append({
        'keyframe_id': keyframe_id,
        'text': combined_text_for_keyword
    })
    
    # 2. Prepare data for VECTOR search (Milvus)
    # We create a more descriptive string for better semantic embedding.
    combined_text_for_vector = f"OCR text: {ocr_text}. Detected objects: {object_text}."
    embedding = model.encode(combined_text_for_vector, convert_to_tensor=False).tolist()
    vector_data.append({
        'keyframe_id': keyframe_id,
        'embedding': embedding
    })

print("Processing complete.")
print(f"{len(keyword_data)} records prepared for keyword search.")
print(f"{len(vector_data)} records prepared for vector search.")

### Step 4: Save Processed Data for Migration

Now we save the processed data into two separate files. These files are what you will 'migrate' to your production search systems.

In [None]:
# Save data for keyword search engine
keyword_df = pd.DataFrame(keyword_data)
keyword_df.to_csv('keyword_data.csv', index=False)
print("Saved keyword data to keyword_data.csv")

# Save data for vector database
with open('vector_data.pkl', 'wb') as f:
    pickle.dump(vector_data, f)
print("Saved vector data to vector_data.pkl")

### Step 5: Verify Outputs and Next Steps

In [None]:
!ls -lh
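
Beyond listing the files, it can help to reload both outputs and confirm that they line up. The following is a minimal sketch rather than part of the original pipeline: `verify_outputs` is a hypothetical helper, and it assumes the file layouts written in Step 4 (a CSV with `keyframe_id` and `text` columns, and a pickled list of dicts with `keyframe_id` and `embedding` keys).

```python
import csv
import pickle


def verify_outputs(csv_path, pkl_path):
    """Reload both output files and run basic consistency checks.

    Hypothetical helper: assumes the layouts written in Step 4.
    Returns a tuple (record_count, embedding_dimension).
    """
    with open(csv_path, newline='', encoding='utf-8') as f:
        keyword_ids = [row['keyframe_id'] for row in csv.DictReader(f)]

    with open(pkl_path, 'rb') as f:
        vector_records = pickle.load(f)
    # CSV values are always read back as strings, so compare as strings.
    vector_ids = [str(rec['keyframe_id']) for rec in vector_records]

    # Both files should describe the same keyframes, in the same order.
    if keyword_ids != vector_ids:
        raise ValueError("keyframe_id mismatch between the two files")

    # Every embedding should have the same length.
    dims = {len(rec['embedding']) for rec in vector_records}
    if len(dims) > 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")

    return len(vector_ids), (dims.pop() if dims else 0)
```

In the notebook, this could be called as `verify_outputs('keyword_data.csv', 'vector_data.pkl')` once Step 4 has run.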

### Migration and Hybrid Search Guide

You now have `keyword_data.csv` and `vector_data.pkl`. Here’s how to use them:

**1. For the Keyword Search Engine (e.g., Elasticsearch):**
   - Create an index in Elasticsearch with a mapping for `keyframe_id` and `text`.
   - Use the Elasticsearch bulk ingest API or a client library to upload the data from `keyword_data.csv` into this index.
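
As one way to prepare that upload, the sketch below converts the CSV rows into the newline-delimited JSON body that the bulk API accepts: an action line followed by the document source for each record. The helper name `build_bulk_body` and the index name `keyframes` are illustrative assumptions, and reusing `keyframe_id` as the document `_id` is a design choice rather than a requirement. Sending the payload (with an HTTP client or a client library) is left out.

```python
import csv
import json


def build_bulk_body(csv_file, index_name="keyframes"):
    """Build an NDJSON payload for the Elasticsearch bulk API.

    Hypothetical helper; `index_name` is an assumed index name.
    `csv_file` is an open text file with `keyframe_id` and `text` columns.
    """
    lines = []
    for row in csv.DictReader(csv_file):
        # Action line: index this document, reusing keyframe_id as its _id.
        action = {"index": {"_index": index_name, "_id": row["keyframe_id"]}}
        lines.append(json.dumps(action))
        # Source line: the document itself. ensure_ascii=False keeps
        # Vietnamese text readable in the payload.
        source = {"keyframe_id": row["keyframe_id"], "text": row["text"]}
        lines.append(json.dumps(source, ensure_ascii=False))
    # Each line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

To use it, open `keyword_data.csv` with `newline=''` and `encoding='utf-8'` and pass the file object in. For large files, it is usually better to send the records in several smaller bulk requests than in one large body.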

**2. For the Vector Database (Milvus):**
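   - Create a collection in Milvus. Your schema should include a primary key (`keyframe_id`), the vector field (`embedding`), and any other metadata you want to store. The vector field's dimension must match the model's output size: 384 for `paraphrase-multilingual-MiniLM-L12-v2`, which you can confirm with `model.get_sentence_embedding_dimension()`.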
   - Write a Python script using `pymilvus` to:
     - Load the `vector_data.pkl` file with `pickle.load`, opening it in binary mode (`'rb'`) inside a `with` block so the file handle is closed afterwards.
     - Iterate through the list of records and insert them into your Milvus collection in batches.
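
The loading and batching steps can be sketched independently of the client API. In the example below, `load_vector_records`, `insert_in_batches`, and the `insert_fn` parameter are all hypothetical names: `insert_fn` stands in for whatever call your `pymilvus` setup uses to insert a list of rows, and the batch size of 1000 is an arbitrary starting point.

```python
import pickle


def load_vector_records(path):
    """Load the list of {'keyframe_id', 'embedding'} dicts saved in Step 4."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def insert_in_batches(records, insert_fn, batch_size=1000):
    """Pass `records` to `insert_fn` in slices of at most `batch_size`.

    `insert_fn` is a placeholder for the real insert call (an assumption,
    not a pymilvus function). Returns the number of records sent.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        insert_fn(batch)
        total += len(batch)
    return total
```

With the `MilvusClient` class from `pymilvus`, `insert_fn` could be something like `lambda rows: client.insert(collection_name="keyframes", data=rows)`, assuming a collection named `keyframes` whose field names match the record keys. Check this against the `pymilvus` version you have installed.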

**3. Implementing the Hybrid Search Query:**
   - Your application's search function will now query **both** systems with the user's input.
   - **Query 1:** Send the query to Elasticsearch to get a list of `keyframe_id`s based on keyword matches (BM25 score).
   - **Query 2:** Embed the user's query using the same sentence transformer model and send it to Milvus to get a list of `keyframe_id`s based on vector similarity (e.g., cosine similarity or L2 distance).
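   - **Fuse the Results:** Combine the two lists of results. A common technique is **Reciprocal Rank Fusion (RRF)**: each result scores 1 / (k + rank) in every list where it appears (rank counted from 1, with k a smoothing constant commonly set to 60), the scores are summed across lists, and the results are sorted by total score to give a final, blended ranking. Because RRF uses only ranks, the BM25 and vector-similarity scores do not need to be normalized against each other.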
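
The fusion step can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `reciprocal_rank_fusion`, the constant `k=60`, and the keyframe ids in the example are assumptions made for the sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores 1 / (k + rank) in every list where it appears
    (rank starts at 1); scores are summed across lists.
    Returns (id, score) pairs sorted from best to worst.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical result lists, best match first.
keyword_hits = ["kf_3", "kf_1", "kf_7"]  # e.g. from Elasticsearch
vector_hits = ["kf_1", "kf_9", "kf_3"]   # e.g. from Milvus

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
fused_ids = [doc_id for doc_id, _ in fused]
# fused_ids == ['kf_1', 'kf_3', 'kf_9', 'kf_7']
```

Items that appear in both lists (`kf_1`, `kf_3`) rise to the top because their scores add up, while items found by only one system still remain in the final ranking.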