## 05: Fine-Tuning the Sentence Transformer Model

**Objective:** To improve the performance of our recommendation system by making the embedding model an "expert" on our specific travel review data.

### The "Why": Pre-trained vs. Fine-tuned

We are using a **pre-trained** model (`all-MiniLM-L6-v2`). It's been trained on a massive, general-purpose dataset from the internet. It's good at understanding general language, but it doesn't know the specific nuances of travel reviews. For example, words like "vibe," "insta-worthy," or "chill" might have very specific meanings in a travel context.

**Fine-tuning** is the process of taking this pre-trained model and training it a little more on our own specific dataset. This adapts the model to our domain, teaching it the specific vocabulary and semantic relationships in our data. The goal is that after fine-tuning, reviews for the *same place* will have more similar embeddings than they did before.

### The Approach: Contrastive Learning

We will use a form of **contrastive learning**, which trains the model on examples of what it should consider "similar" and "dissimilar."

1.  **Positive Pairs (Similar):** Two different reviews for the *same* place. The model should learn to pull their embeddings close together.
2.  **Negative Pairs (Dissimilar):** Two reviews for *different* places. The model should learn to push their embeddings apart.

We will use the `MultipleNegativesRankingLoss` loss function. It only needs positive pairs as input: for each pair, it treats the second texts of the other pairs in the same batch as negatives, so we never have to build negative pairs ourselves.
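
To make the idea of in-batch negatives concrete, here is a minimal pure-Python sketch of the kind of computation such a loss performs. It is not the library's implementation: the function names are hypothetical, the 2-D vectors are toy stand-ins for embeddings, and the cosine similarity with `scale=20.0` is assumed to mirror the library defaults.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    Every other entry of `positives` acts as a negative for anchors[i].
    """
    per_example = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every candidate in the batch.
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability assigned to the true match.
        per_example.append(log_z - logits[i])
    return sum(per_example) / len(per_example)

# Toy batch of two pairs (hypothetical 2-D "embeddings").
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each anchor is closest to its own positive
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each anchor is closest to the wrong positive

loss_aligned = in_batch_negatives_loss(anchors, aligned)
loss_swapped = in_batch_negatives_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each anchor is most similar to its own positive and large when another item in the batch looks like a better match, which is the pressure that pulls same-place reviews together during training.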

### Step 1: Data Preparation

In [1]:
import pandas as pd
from sentence_transformers import InputExample
from tqdm.notebook import tqdm
import random

# --- Load the pre-processed data from a previous notebook ---
# This is the dataframe that has one row per review
PATH = "/Users/amanjaiswal/Work/hop_v3/backend/combined_results.csv"
df = pd.read_csv(PATH)

import json

def safe_json_loads(s):
    """Parse a JSON string; return an empty list if the value is missing or malformed."""
    try:
        return json.loads(s)
    except (json.JSONDecodeError, TypeError):
        return []

df['detailed_reviews'] = df['detailed_reviews'].apply(safe_json_loads)
df_reviews_exploded = df[['place_id', 'name', 'detailed_reviews']].explode('detailed_reviews')
df_reviews_exploded_filtered = df_reviews_exploded[df_reviews_exploded['detailed_reviews'].apply(lambda x: isinstance(x, dict) and bool(x))]
df_reviews = pd.json_normalize(df_reviews_exploded_filtered['detailed_reviews'])
df_reviews['place_id'] = df_reviews_exploded_filtered['place_id'].values
df_reviews['place_name'] = df_reviews_exploded_filtered['name'].values

# Clean up the data: we only need reviews with actual text
flat_df = df_reviews[['place_id', 'place_name', 'review_text']].copy()
flat_df.dropna(subset=['review_text'], inplace=True)
flat_df = flat_df[flat_df['review_text'].str.strip() != '']

print(f"Loaded {len(flat_df)} reviews.")

Loaded 73722 reviews.


In [2]:
# --- Create Training Examples ---

# Group reviews by place
reviews_by_place = flat_df.groupby('place_id')['review_text'].apply(list)

# We only want places with at least 2 reviews to form pairs
places_with_multiple_reviews = reviews_by_place[reviews_by_place.apply(len) >= 2]

train_examples = []
for place_id, reviews in tqdm(places_with_multiple_reviews.items(), desc="Creating training pairs"):
    # For each place, pair each review with the next one in the list, so a place with
    # reviews [r1, r2, r3] yields the positive pairs (r1, r2) and (r2, r3).
    # MultipleNegativesRankingLoss then contrasts each pair with the reviews from
    # other pairs in the same batch, which serve as negatives.
    for i in range(len(reviews) - 1):
        train_examples.append(InputExample(texts=[reviews[i], reviews[i+1]]))

# For demonstration, we'll just use a sample of the data to keep training fast
random.shuffle(train_examples)
train_sample = train_examples[:10000] # Use up to 10,000 examples for training

print(f"Created {len(train_sample)} training examples.")

Creating training pairs: 0it [00:00, ?it/s]

Created 10000 training examples.
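
The loop in the cell above pairs each review only with its neighbour in the list. As a rough sketch of the trade-off, the snippet below compares that strategy with pairing every review against every other review of the same place. The helper names are hypothetical and the strings stand in for review texts.

```python
from itertools import combinations

def adjacent_pairs(reviews):
    """Pair each review with the next one: n reviews give n - 1 pairs."""
    return list(zip(reviews, reviews[1:]))

def all_pairs(reviews):
    """Pair every review with every other one: n reviews give n * (n - 1) / 2 pairs."""
    return list(combinations(reviews, 2))

reviews = ["r1", "r2", "r3", "r4"]
print(adjacent_pairs(reviews))
print(all_pairs(reviews))
```

Adjacent pairing keeps the number of examples linear in the number of reviews, while all-pairs grows quadratically and lets places with many reviews dominate the training set. Which one works better for this dataset is an open question that would need to be checked empirically.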


### Step 2: Model Training

In [None]:
from sentence_transformers import SentenceTransformer, losses
from torch.utils.data import DataLoader

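import os

# MPS memory setting; it has no effect while the model runs on the CPU (device='cpu' below).
os.environ['PYTORCH_MPS_HIGH_WATERMARK_RATIO'] = '0.0'
# Turn off Weights & Biases logging during training.
os.environ["WANDB_DISABLED"] = "true"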


# --- Load the pre-trained model ---
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(model_name, device='cpu')

# --- Define the loss function ---
# This loss is ideal for our task. It takes a batch of sentences and assumes that
# sentences from the same InputExample are positive pairs, and all others are negative.
loss = losses.MultipleNegativesRankingLoss(model)

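# --- Create a DataLoader ---
# The DataLoader will batch our training examples.
# With this loss, every other pair in the batch acts as a negative, so a batch of 4
# gives each pair only 3 negatives. Larger batches usually help if memory allows.
batch_size = 4
train_dataloader = DataLoader(train_sample, shuffle=True, batch_size=batch_size)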

# --- Start the training process ---
num_epochs = 1
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1) # 10% of training steps for warmup

print("Starting the fine-tuning process...")
model.fit(
    train_objectives=[(train_dataloader, loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path='./fine_tuned_model',
    show_progress_bar=True
)
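
To see what the warmup calculation in the cell above works out to, here is a small sketch of the arithmetic. The helper name is hypothetical, and it assumes the DataLoader keeps the last partial batch, so its length is the number of examples divided by the batch size, rounded up.

```python
import math

def training_schedule(num_examples, batch_size, num_epochs, warmup_fraction=0.1):
    """Return (steps_per_epoch, total_steps, warmup_steps) for a simple schedule."""
    # One optimizer step per batch; the last batch may be smaller than batch_size.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    # Same rule as the notebook: a fixed fraction of all steps is used for warmup.
    warmup_steps = int(total_steps * warmup_fraction)
    return steps_per_epoch, total_steps, warmup_steps

# Values from this notebook: 10,000 examples, batch size 4, 1 epoch.
print(training_schedule(10_000, 4, 1))  # (2500, 2500, 250)
```

With the settings used here, the learning rate ramps up over the first 250 of 2,500 steps.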

### Step 3: Saving and Using the Fine-Tuned Model

Because we passed `output_path='./fine_tuned_model'`, `model.fit()` has already saved the fine-tuned model to that directory. It now contains everything needed to load the new, specialized model.

You can now use this model in your API by changing the model name in your `main.py` from `'sentence-transformers/all-MiniLM-L6-v2'` to `'./fine_tuned_model'`. The API will then load the specialized model instead of the general-purpose one. Any review embeddings computed earlier with the original model must be regenerated with the fine-tuned model, because vectors produced by the two models are not comparable. Recommendations should then reflect the model's adapted understanding of travel reviews.

In [3]:
from sentence_transformers import SentenceTransformer
fine_tuned_model = SentenceTransformer('./fine_tuned_model')


# You can now use this `fine_tuned_model` object to encode sentences, and it will have a better
# understanding of your specific data.
print("Fine-tuned model loaded successfully.")

# For example, let's compare its embeddings for two reviews that describe a similar experience
review1 = "This place had such a great vibe, very chill and relaxing."
review2 = "Loved the atmosphere here, it was a perfect spot to unwind."

from sklearn.metrics.pairwise import cosine_similarity

embedding1 = fine_tuned_model.encode(review1)
embedding2 = fine_tuned_model.encode(review2)

similarity = cosine_similarity([embedding1], [embedding2])[0][0]
print(f"Similarity between two similar reviews: {similarity:.4f}")

Fine-tuned model loaded successfully.
Similarity between two similar reviews: 0.7595
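
A single similarity score does not show whether fine-tuning helped, since the base model might score the same pair just as highly. One possible check, sketched below, is to compare how much more similar same-place pairs are than different-place pairs, and to compute that gap for both models. The function names are hypothetical, and `toy_encode` is a stand-in encoder used only so the snippet runs without downloading a model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors, as a plain float."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return float(dot / (norm_u * norm_v))

def similarity_gap(encode, same_place_pairs, different_place_pairs):
    """Mean same-place similarity minus mean different-place similarity."""
    def mean_similarity(pairs):
        scores = [cosine(encode(a), encode(b)) for a, b in pairs]
        return sum(scores) / len(scores)
    return mean_similarity(same_place_pairs) - mean_similarity(different_place_pairs)

def toy_encode(text):
    """Stand-in encoder: counts two keywords (a real model would go here)."""
    lowered = text.lower()
    return [lowered.count("beach") + 0.1, lowered.count("museum") + 0.1]

same_place = [("Lovely beach", "The beach was clean")]
different_place = [("Lovely beach", "Great museum exhibits")]

gap = similarity_gap(toy_encode, same_place, different_place)
print(f"Similarity gap: {gap:.3f}")
```

With real models, `fine_tuned_model.encode` and the `encode` method of a freshly loaded `'sentence-transformers/all-MiniLM-L6-v2'` model could each be passed as `encode`, ideally with review pairs that were held out of training. A larger gap for the fine-tuned model would suggest the training had the intended effect.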


In [None]:
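# Zip the trained model and upload the archive to Google Drive (Colab only)
import shutil
from google.colab import drive

drive.mount('/content/drive')
# Create fine_tuned_model.zip from the ./fine_tuned_model directory
shutil.make_archive('fine_tuned_model', 'zip', 'fine_tuned_model')
destination_path = "/content/drive/MyDrive/fine_tuned_model.zip"
shutil.move('fine_tuned_model.zip', destination_path)
print(f"Zipped model successfully uploaded to Google Drive at: {destination_path}")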