## 06: Creating FAISS Indices from the Fine-Tuned Model

**Objective:** To generate a new set of vector embeddings and FAISS indices using our specialized, fine-tuned sentence transformer model.

**Why is this necessary?** Fine-tuning changed how the model maps text to vectors: the embedding space is now tuned to travel reviews. A query encoded with the fine-tuned model cannot be meaningfully compared against vectors produced by the original, generic model, so the old FAISS indices are no longer usable. We must regenerate all embeddings and rebuild the indices with the new model.
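
To make the incompatibility concrete, here is a toy NumPy sketch. It does not use the real models: the "old" vectors are random, and the "new" space is simulated by applying a random rotation, which deliberately exaggerates what fine-tuning does. All names (`old_vecs`, `new_vecs`, `self_retrieval_accuracy`) are illustrative assumptions, not part of this project.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 50, 64


def l2_normalize(x):
    """Scale each row to unit length so inner product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Stand-in for embeddings produced by the original model (random, for illustration only).
old_vecs = l2_normalize(rng.normal(size=(n_items, dim)))

# Stand-in for the fine-tuned model: the same items, moved by a random rotation.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
new_vecs = l2_normalize(old_vecs @ rotation)


def self_retrieval_accuracy(queries, indexed):
    """Fraction of queries whose best inner-product match is the same item."""
    scores = queries @ indexed.T
    best = scores.argmax(axis=1)
    return float((best == np.arange(len(queries))).mean())


matched = self_retrieval_accuracy(new_vecs, new_vecs)     # query and index share a space
mismatched = self_retrieval_accuracy(new_vecs, old_vecs)  # new queries against old index

print(f"same space: {matched:.2f}, mixed spaces: {mismatched:.2f}")
```

In this simulation, searching within one space always finds the right item, while mixing spaces does little better than chance. Real fine-tuning shifts the space less drastically, but the scores are still not comparable across models.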

### Step 1: Load Data and the Fine-Tuned Model

In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
import pickle
import faiss
import numpy as np
import json
import re

tqdm.pandas()

# --- Load the FINE-TUNED model ---
# This is the most important change. We are loading our specialized model.
print("Loading the fine-tuned model...")
model = SentenceTransformer('./fine_tuned_model')
print("Model loaded.")

# --- Load the raw data ---
PATH = "/Users/amanjaiswal/Work/hop_v3/backend/combined_results.csv"
df = pd.read_csv(PATH)

Loading the fine-tuned model...
Model loaded.


### Step 2: Preprocessing and Feature Engineering (Same as Before)

In [2]:
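# Despite its name, this helper parses JSON strings (via json.loads), not Python literals.
# Unparseable or missing values fall back to an empty list.
def safe_literal_eval(s):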
    try:
        return json.loads(s)
    except (json.JSONDecodeError, TypeError):
        return []

df['detailed_reviews'] = df['detailed_reviews'].apply(safe_literal_eval)
df_reviews_exploded = df[['place_id', 'detailed_reviews']].explode('detailed_reviews')
df_reviews_exploded_filtered = df_reviews_exploded[df_reviews_exploded['detailed_reviews'].apply(lambda x: isinstance(x, dict) and bool(x))]
df_reviews = pd.json_normalize(df_reviews_exploded_filtered['detailed_reviews'])
df_reviews['place_id'] = df_reviews_exploded_filtered['place_id'].values
place_features = ['place_id', 'name', 'main_category', 'rating', 'address', 'reviews']
df_places = df[place_features]
flat_df = df_reviews.merge(df_places, on='place_id', how='left', suffixes=('_review', '_place'))

# Keep only reviews rated 3 or higher
filtered_df = flat_df[flat_df['rating_review'] >= 3]
# Concatenate each place's remaining review texts into one string per place
vibe_df = filtered_df.groupby(
    ['place_id', 'name_place', 'main_category', 'rating_place', 'address', 'reviews']
)['review_text'].apply(lambda texts: ' '.join([str(t) for t in texts if pd.notna(t)])).reset_index()
vibe_df.rename(columns={
    'name_place': 'place_name',
    'rating_place': 'avg_place_rating',
    'review_text': 'combined_reviews'
}, inplace=True)

def extract_city_from_query(query):
    if pd.isna(query):
        return "unknown"
    query = query.lower().strip()
    # \b keeps "in" from matching inside a word (e.g. the tail of "cabin")
    match = re.search(r'\bin\s+([a-z\s]+)$', query)
    if match:
        return match.group(1).strip()
    return "unknown"

df['city'] = df['query'].apply(extract_city_from_query)
vibe_df = vibe_df.merge(df[['place_id', 'city']].drop_duplicates(), on='place_id', how='left')

print("Data processing complete.")

Data processing complete.
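
The city extraction relies on the query ending in `in <city>`. The standalone sketch below shows how a pattern with a `\b` word boundary behaves, and why an unanchored `in` can misfire on words that end in "in", such as "cabin". The sample queries and the helper name `city_from_query` are made up for illustration.

```python
import re

# Word-boundary version: "in" must be a standalone word.
ANCHORED = re.compile(r'\bin\s+([a-z\s]+)$')
# Unanchored version: "in" may match the tail of another word.
UNANCHORED = re.compile(r'in\s+([a-z\s]+)$')


def city_from_query(query, pattern=ANCHORED):
    """Return the text after the first matching 'in ' at the end of the query."""
    match = pattern.search(query.lower().strip())
    return match.group(1).strip() if match else "unknown"


samples = [
    "cafes in paris",
    "Best restaurants in New York",
    "cabin rentals in aspen",
    "hotels",
]

for q in samples:
    anchored = city_from_query(q)
    unanchored = city_from_query(q, UNANCHORED)
    print(f"{q!r}: anchored={anchored!r}, unanchored={unanchored!r}")
```

Both patterns accept only lowercase ASCII letters and whitespace after `in`, so queries whose city contains digits, punctuation, or accented characters fall through to `"unknown"` and are skipped when the indices are built.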


### Step 3: Generate New Embeddings and Build New FAISS Indices

In [3]:
# --- Generate new embeddings with the fine-tuned model ---
print("Generating new embeddings...")
vibe_df['embedding'] = vibe_df['combined_reviews'].progress_apply(lambda x: model.encode(x) if isinstance(x, str) else model.encode(""))
print("Embedding generation complete.")

# --- Build new FAISS index for each city ---
print("Building new FAISS indices...")
city_indices_finetuned = {}
for city, group in tqdm(vibe_df.groupby('city'), desc="Building city indices"):
    if city == 'unknown' or group.empty:
        continue
    
    embeddings = np.vstack(group['embedding'].values).astype('float32')
    faiss.normalize_L2(embeddings)
    
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    
    city_indices_finetuned[city] = {
        'index': index,
        'df': group.reset_index(drop=True)
    }

print("Index building complete.")

Generating new embeddings...


  0%|          | 0/328 [00:00<?, ?it/s]

Embedding generation complete.
Building new FAISS indices...


Building city indices:   0%|          | 0/3 [00:00<?, ?it/s]

Index building complete.
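
The cell above encodes one text per call. As an optional alternative, `SentenceTransformer.encode` also accepts a list of texts and batches them internally, which is usually faster. The sketch below is a minimal wrapper under that assumption; the helper name `encode_texts_batched` and the batch size of 32 are illustrative choices, not something this notebook already uses.

```python
import numpy as np


def encode_texts_batched(model, texts, batch_size=32):
    """Encode a list of texts in batches and return a float32 matrix.

    `model` is expected to behave like a SentenceTransformer, i.e. expose
    encode(sentences, batch_size=..., show_progress_bar=..., convert_to_numpy=...).
    Non-string entries (e.g. NaN) are replaced by an empty string, mirroring
    the fallback used in the per-row version.
    """
    cleaned = [t if isinstance(t, str) else "" for t in texts]
    vectors = model.encode(
        cleaned,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
    )
    return np.asarray(vectors, dtype="float32")


# Hypothetical usage with the objects from this notebook:
# matrix = encode_texts_batched(model, vibe_df['combined_reviews'].tolist())
# vibe_df['embedding'] = list(matrix)  # one 1-D vector per row
```

The returned matrix is already `float32`, which is the dtype FAISS expects, so the later `astype('float32')` becomes a no-op rather than a conversion.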


### Step 4: Save the New Indices

In [4]:
# --- Save the new indices to a new file ---
output_path = "city_faiss_indices_finetuned.pkl"
with open(output_path, "wb") as f:
    pickle.dump(city_indices_finetuned, f)

print(f"New fine-tuned indices saved to: {output_path}")

New fine-tuned indices saved to: city_faiss_indices_finetuned.pkl
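
For completeness, here is a hedged sketch of how the saved file might be loaded and queried later. The helper names `load_city_indices` and `search_city_index` are assumptions for illustration. Two caveats: unpickling requires `faiss` and `pandas` to be importable in the loading process, and whether a FAISS index can be pickled at all depends on the installed FAISS build (`faiss.write_index` and `faiss.read_index` are the library's native persistence functions). Only unpickle files you trust.

```python
import pickle

import numpy as np


def load_city_indices(path):
    """Load the pickled {city: {'index': ..., 'df': ...}} dict and check its shape."""
    with open(path, "rb") as f:
        city_indices = pickle.load(f)
    for city, entry in city_indices.items():
        missing = {"index", "df"} - set(entry)
        if missing:
            raise ValueError(f"Entry for {city!r} is missing keys: {sorted(missing)}")
    return city_indices


def search_city_index(city_indices, city, query_vector, k=5):
    """Return (row_ids, scores) of the top-k places for one city.

    The query is L2-normalized so that the inner-product scores from
    IndexFlatIP are cosine similarities, matching how the index was built.
    A zero query vector is not handled here.
    """
    entry = city_indices.get(city)
    if entry is None:
        return np.array([], dtype="int64"), np.array([], dtype="float32")
    query = np.asarray(query_vector, dtype="float32").reshape(1, -1)
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = min(k, entry["index"].ntotal)  # never ask for more results than stored vectors
    scores, ids = entry["index"].search(query, k)
    return ids[0], scores[0]


# Hypothetical usage (the city key must be one that was actually indexed):
# city_indices = load_city_indices("city_faiss_indices_finetuned.pkl")
# ids, scores = search_city_index(city_indices, "<city>", model.encode("quiet rooftop bar"))
# results = city_indices["<city>"]["df"].iloc[ids]
```

The returned `ids` are row positions into that city's `df`, which works because each group was stored with `reset_index(drop=True)`.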
