# Sentiment Analysis Fine-tuning for Restaurant Reviews

This notebook fine-tunes DistilBERT for sentiment analysis on Zomato restaurant reviews.

**Steps:**
1. Load and preprocess Zomato reviews
2. Create a balanced dataset with binary labels
3. Fine-tune DistilBERT
4. Save model to `models/sentiment/`

## 1. Install Dependencies
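The install cell is empty in this export. As a minimal sketch, the check below reports which of the modules imported later in the notebook cannot be found in the current environment. The module list is inferred from the notebook's import statements, and `missing_packages` is a hypothetical helper, not part of the original notebook.

```python
from importlib.util import find_spec

# Import names used later in this notebook (inferred from its import cells).
# Note: the "sklearn" module is distributed as the "scikit-learn" package.
REQUIRED_MODULES = [
    "datasets", "transformers", "evaluate", "torch",
    "numpy", "pandas", "tqdm", "sklearn",
]

def missing_packages(modules):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]

missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Any module reported as missing would need to be installed before running the remaining cells.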

## 2. Imports

In [2]:
from datasets import Dataset, DatasetDict

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer
)

import evaluate
import torch
import numpy as np
import pandas as pd
import ast
import re
from tqdm import tqdm
from pathlib import Path

## 3. Configuration

In [7]:
# Paths
RAW_DATA_PATH = '/Users/swarnendubanik/Desktop/AI powered restaurant/data/raw/zomato2.csv'
MODEL_SAVE_PATH = 'models/sentiment/final_model'

# Model config
MODEL_CHECKPOINT = 'distilbert-base-uncased'

# Training config
MAX_LENGTH = 256
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
NUM_EPOCHS = 3
SAMPLE_SIZE = 10000  # Number of restaurants to sample

# Labels (binary classification like movie reviews)
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

print(f"Model checkpoint: {MODEL_CHECKPOINT}")
print(f"Save path: {MODEL_SAVE_PATH}")

Model checkpoint: distilbert-base-uncased
Save path: models/sentiment/final_model


## 4. Load and Preprocess Zomato Reviews

The `reviews_list` column stores each restaurant's reviews as the string representation of a list of tuples: `[("Rated X.X", "review text"), ...]`

We need to:
1. Parse the string representation into an actual list
2. Extract the rating and review text
3. Convert the rating to a binary label: rating ≥ 4 → Positive, rating < 3 → Negative (ratings with 3 ≤ rating < 4 are treated as neutral and skipped)
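The three steps above can be sketched on a single made-up cell value. The `raw` string is hypothetical data shaped like the format described here, and `rating_to_label` is an illustrative helper rather than a function from the notebook.

```python
import ast
import re

def rating_to_label(rating):
    """Map a numeric rating to 1 (Positive), 0 (Negative), or None (skipped)."""
    if rating >= 4:
        return 1
    if rating < 3:
        return 0
    return None  # 3 <= rating < 4 is treated as neutral and dropped

# Hypothetical cell value, shaped like the reviews_list format.
raw = (
    "[('Rated 4.5', 'RATED\\n  Loved the dosa here.'), "
    "('Rated 3.0', 'RATED\\n  It was okay.'), "
    "('Rated 1.0', 'RATED\\n  Cold food.')]"
)

labeled = []
for rating_str, text in ast.literal_eval(raw):        # step 1: parse the string
    rating = float(re.search(r"(\d+\.?\d*)", rating_str).group(1))  # step 2
    label = rating_to_label(rating)                    # step 3
    if label is not None:
        labeled.append((label, text.replace("RATED\n", "").strip()))

print(labeled)  # [(1, 'Loved the dosa here.'), (0, 'Cold food.')]
```

The 3.0 review is dropped because it falls in the neutral band.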

In [9]:
# Load raw data
print("Loading raw data...")
df = pd.read_csv(RAW_DATA_PATH)
print(f"Total restaurants: {len(df):,}")

# Sample for faster processing
if SAMPLE_SIZE and SAMPLE_SIZE < len(df):
    df = df.sample(n=SAMPLE_SIZE, random_state=42)
    print(f"Sampled: {len(df):,} restaurants")


Loading raw data...
Total restaurants: 51,717
Sampled: 10,000 restaurants


In [10]:
def parse_reviews(reviews_str):
    """
    Parse the reviews_list column and extract (rating, text) pairs.
    """
    if pd.isna(reviews_str) or reviews_str == '[]':
        return []
    
    try:
        # Safely evaluate the string representation
        reviews = ast.literal_eval(str(reviews_str))
        
        parsed = []
        for rating_str, review_text in reviews:
            # Extract numeric rating from "Rated 4.0"
            rating_match = re.search(r'(\d+\.?\d*)', rating_str)
            if not rating_match:
                continue
            rating = float(rating_match.group(1))
            
            # Clean review text - remove "RATED\n" prefix
            clean_text = str(review_text).replace('RATED\n', '').replace('RATED\\n', '').strip()
            
            # Skip empty or very short reviews
            if not clean_text or len(clean_text) < 20:
                continue
            
            parsed.append((rating, clean_text))
        
        return parsed
    except (ValueError, SyntaxError, TypeError):
        # Malformed or unexpected cell contents: treat as no reviews
        return []

# Test parsing
sample_reviews = df['reviews_list'].iloc[0]
parsed = parse_reviews(sample_reviews)
print(f"Sample parsed reviews: {len(parsed)} reviews")
if parsed:
    print(f"First review: Rating={parsed[0][0]}, Text={parsed[0][1][:100]}...")

Sample parsed reviews: 5 reviews
First review: Rating=3.0, Text=A pocket friendly food joint in the locality to have odia cuisine.

But it is very small and congest...


In [11]:
def extract_all_reviews(df):
    """
    Extract all reviews from dataframe and create labeled dataset.
    Binary labels: rating >= 4 → Positive (1), rating < 3 → Negative (0)
    Neutral reviews (3 <= rating < 4) are skipped to get a cleaner training signal.
    """
    all_reviews = []
    
    print("Extracting reviews from dataset...")
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        reviews = parse_reviews(row['reviews_list'])
        
        for rating, text in reviews:
            # Binary labeling (skip 3 <= rating < 4 for a clearer signal)
            if rating >= 4:
                label = 1  # Positive
            elif rating < 3:
                label = 0  # Negative
            else:
                continue  # Skip neutral (3 <= rating < 4)
            
            # Truncate very long reviews
            text = text[:1000]
            
            all_reviews.append({
                'text': text,
                'label': label
            })
    
    return pd.DataFrame(all_reviews)

# Extract all reviews
reviews_df = extract_all_reviews(df)
print(f"\nTotal reviews extracted: {len(reviews_df):,}")
print("Class distribution:")
print(f"  Positive (1): {(reviews_df['label'] == 1).sum():,}")
print(f"  Negative (0): {(reviews_df['label'] == 0).sum():,}")

Extracting reviews from dataset...


100%|██████████| 10000/10000 [00:01<00:00, 6554.00it/s]



Total reviews extracted: 196,890
Class distribution:
  Positive (1): 154,598
  Negative (0): 42,292


In [12]:
# Balance the dataset by undersampling the majority class
def balance_dataset(df):
    positive = df[df['label'] == 1]
    negative = df[df['label'] == 0]
    
    min_samples = min(len(positive), len(negative))
    print(f"Balancing to {min_samples:,} samples per class")
    
    positive_sampled = positive.sample(n=min_samples, random_state=42)
    negative_sampled = negative.sample(n=min_samples, random_state=42)
    
    balanced = pd.concat([positive_sampled, negative_sampled])
    balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)
    
    return balanced

balanced_df = balance_dataset(reviews_df)
print(f"\nBalanced dataset size: {len(balanced_df):,}")

Balancing to 42,292 samples per class

Balanced dataset size: 84,584


In [13]:
# View sample data
print("Sample positive review:")
pos_sample = balanced_df[balanced_df['label'] == 1].iloc[0]
print(f"  {pos_sample['text'][:200]}...")

print("\nSample negative review:")
neg_sample = balanced_df[balanced_df['label'] == 0].iloc[0]
print(f"  {neg_sample['text'][:200]}...")

Sample positive review:
  Great place in the midst of a busy city. Good view of the city with a great ambience and extremely couteous staff. We tried Mutton Chaap and it was really amazing. Awadhi chicken with kadak roti and t...

Sample negative review:
  They got parts of the order wrong, twice! The food was average. The decor looks fancy mainly in pics. Overall unimpressed with this place.. Highly rated for cocktails but expected great food for the p...


## 5. Create HuggingFace Dataset

In [14]:
# Split into train and validation
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(
    balanced_df, 
    test_size=0.2, 
    random_state=42, 
    stratify=balanced_df['label']
)

print(f"Training samples: {len(train_df):,}")
print(f"Validation samples: {len(val_df):,}")

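# Create HuggingFace datasets
# Note: from_pandas keeps the shuffled pandas index as an extra
# '__index_level_0__' column; pass preserve_index=False to drop it.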
train_dataset = Dataset.from_pandas(train_df[['text', 'label']])
val_dataset = Dataset.from_pandas(val_df[['text', 'label']])

dataset = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset
})

dataset

Training samples: 67,667
Validation samples: 16,917


DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 67667
    })
    validation: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 16917
    })
})

In [15]:
# Fraction of training examples with label=1 (about 0.5 after balancing)
np.array(dataset['train']['label']).sum() / len(dataset['train']['label'])

np.float64(0.5000073891261619)

## 6. Load Model and Tokenizer

In [16]:
# Load model for binary classification
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_CHECKPOINT, 
    num_labels=2, 
    id2label=id2label, 
    label2id=label2id
)

DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_transform.bias    | UNEXPECTED | 
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
vocab_transform.weight  | UNEXPECTED | 
vocab_layer_norm.weight | UNEXPECTED | 
pre_classifier.bias     | MISSING    | 
pre_classifier.weight   | MISSING    | 
classifier.bias         | MISSING    | 
classifier.weight       | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


In [17]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSelfAttention(
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [18]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

# Add pad token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

print(f"Vocab size: {len(tokenizer)}")

Vocab size: 30522


## 7. Tokenize Dataset

In [19]:
# Truncate from the left so the end of a long review is kept
tokenizer.truncation_side = "left"

def tokenize_function(examples):
    """Tokenize a batch of reviews; padding is left to the data collator."""
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=MAX_LENGTH
    )
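With `truncation_side = "left"`, a review longer than `MAX_LENGTH` tokens loses tokens from its start, so its ending is what the model sees. The sketch below illustrates the difference on a plain list of words. It is a simplification that ignores the special tokens the real tokenizer adds, and `truncate` is a hypothetical helper.

```python
def truncate(tokens, max_length, side="right"):
    """Keep at most max_length tokens, dropping the excess from the given side."""
    if len(tokens) <= max_length:
        return list(tokens)
    if side == "left":
        # Drop the beginning, keep the end
        return list(tokens[len(tokens) - max_length:])
    # Drop the end, keep the beginning
    return list(tokens[:max_length])

tokens = "the starter was fine but the main course was terrible".split()
print(truncate(tokens, 4, side="right"))  # ['the', 'starter', 'was', 'fine']
print(truncate(tokens, 4, side="left"))   # ['main', 'course', 'was', 'terrible']
```

In this made-up example the verdict comes at the end of the sentence, which is the kind of case where left truncation preserves the more informative part.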

In [20]:
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 67667
    })
    validation: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 16917
    })
})

In [21]:
# Data collator for padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## 8. Evaluation Metrics

In [32]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Compute accuracy from the logits and labels passed in by the Trainer."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)

    # accuracy.compute() returns a flat dict such as {"accuracy": 0.85}
    return accuracy.compute(predictions=predictions, references=labels)
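As a rough illustration of what this metric function computes, the sketch below reproduces the argmax-then-compare logic in plain Python. The logits and labels are made-up values, and `accuracy_from_logits` is a hypothetical stand-in for the `evaluate` metric, not its implementation.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Hypothetical logits for four reviews: columns are [Negative, Positive]
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
labels = [0, 1, 1, 1]

print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```

During evaluation the `Trainer` prefixes metric names with `eval_`, which is why the result is read back later as `eval_accuracy`.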


## 9. Test Untrained Model (Baseline)

In [35]:
# Test with sample texts
text_list = [
    "The food was absolutely delicious! Best restaurant ever.",
    "Terrible experience. Food was cold and service was rude.",
    "Amazing ambiance and the pasta was incredible.",
    "Worst biryani I've ever had. Never coming back.",
    "Loved the desserts! Will definitely visit again."
]

# Get model's device
device = next(model.parameters()).device
print(f"Model is on: {device}")

print("\nUntrained model predictions:")
for text in text_list:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=MAX_LENGTH)
    
    # Move inputs to same device as model
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=-1).item()
    
    print(f"  {id2label[prediction]}: {text[:50]}...")


Model is on: mps:0

Untrained model predictions:
  Positive: The food was absolutely delicious! Best restaurant...
  Negative: Terrible experience. Food was cold and service was...
  Positive: Amazing ambiance and the pasta was incredible....
  Negative: Worst biryani I've ever had. Never coming back....
  Positive: Loved the desserts! Will definitely visit again....


## 10. Training

In [39]:
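# Calculate steps so that validation runs every 0.25 epoch
train_samples = len(tokenized_dataset['train'])
# Floor division ignores the final partial batch; close enough for scheduling evaluation
steps_per_epoch = train_samples // BATCH_SIZE
eval_steps = int(steps_per_epoch * 0.25)  # Validate every 0.25 epoch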

print(f"Total training samples: {train_samples}")
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Eval every {eval_steps} steps (0.25 epoch)")

# Training arguments
training_args = TrainingArguments(
    output_dir=MODEL_SAVE_PATH + "_checkpoints",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=eval_steps,            # Validate every 0.25 epoch
    save_strategy="steps",            # Must match eval_strategy for load_best_model_at_end
    save_steps=eval_steps,            # Save at the same frequency as evaluation
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    logging_steps=50,
    warmup_ratio=0.1,
    report_to="none",
)


warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


Total training samples: 67667
Steps per epoch: 4229
Eval every 1057 steps (0.25 epoch)
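The step arithmetic can be checked by hand. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; under those assumptions the total matches the `global_step=12690` reported by the training run.

```python
import math

train_samples = 67_667   # training set size printed above
batch_size = 16
num_epochs = 3

floor_steps = train_samples // batch_size            # drops the partial batch
ceil_steps = math.ceil(train_samples / batch_size)   # counts the partial batch
eval_steps = int(floor_steps * 0.25)                 # evaluation interval
total_steps = ceil_steps * num_epochs                # optimizer steps overall

print(floor_steps, ceil_steps, eval_steps, total_steps)  # 4229 4230 1057 12690
```

The floor and ceiling counts differ by one step per epoch, which does not change the evaluation interval here.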


In [40]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [41]:
# Train the model
print("Starting training...")
trainer.train()

Starting training...


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 1057 | 0.037075 | 0.075195 | 0.988 |
| 2114 | 0.054145 | 0.063448 | 0.983567 |
| 3171 | 0.013227 | 0.069971 | 0.988118 |
| 4228 | 0.00036 | 0.058152 | 0.99001 |
| 5285 | 0.016014 | 0.047548 | 0.991783 |
| 6342 | 0.007035 | 0.048072 | 0.99267 |
| 7399 | 0.018118 | 0.053179 | 0.991665 |
| 8456 | 0.015376 | 0.045148 | 0.993793 |
| 9513 | 0.003839 | 0.052602 | 0.992552 |
| 10570 | 0.000409 | 0.054223 | 0.993793 |


There were missing keys in the checkpoint model loaded: ['distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias'].
There were unexpected keys in the checkpoint model loaded: ['distilbert.embeddings.LayerNorm.beta', 'distilbert.embeddings.LayerNorm.gamma'].


TrainOutput(global_step=12690, training_loss=0.014613968513390802, metrics={'train_runtime': 3200.5317, 'train_samples_per_second': 63.427, 'train_steps_per_second': 3.965, 'total_flos': 9892669729961676.0, 'train_loss': 0.014613968513390802, 'epoch': 3.0})

## 11. Evaluate Model

In [44]:
# Evaluate on validation set
eval_results = trainer.evaluate()
print("\nValidation Results:")
print(f"  Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"  Loss: {eval_results['eval_loss']:.4f}")



Validation Results:
  Accuracy: 0.9939
  Loss: 0.0509


## 12. Test Trained Model

In [45]:
# Test with sample texts
print("Trained model predictions:")
for text in text_list:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=MAX_LENGTH)
    
    # Move to same device as model
    device = next(model.parameters()).device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        prediction = torch.argmax(probs, dim=-1).item()
        confidence = probs[0][prediction].item()
    
    print(f"  {id2label[prediction]} ({confidence:.2%}): {text[:50]}...")

Trained model predictions:
  Positive (100.00%): The food was absolutely delicious! Best restaurant...
  Negative (100.00%): Terrible experience. Food was cold and service was...
  Positive (100.00%): Amazing ambiance and the pasta was incredible....
  Negative (100.00%): Worst biryani I've ever had. Never coming back....
  Positive (100.00%): Loved the desserts! Will definitely visit again....
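A confidence of 100.00% here is a rounding effect rather than literal certainty: once the gap between the two logits is large, softmax puts almost all of the probability on one class. The sketch below uses made-up logits (not values taken from the model) to show how a probability just under 1 is displayed as 100.00%.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-5.0, 6.0]  # hypothetical [Negative, Positive] logits
probs = softmax(logits)
confidence = max(probs)

print(confidence < 1.0)      # True: still strictly below 1
print(f"{confidence:.2%}")   # 100.00%
```

Printing more decimal places, for example `{confidence:.6%}`, would make differences between such predictions visible.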


In [46]:
# Test with Zomato-style reviews
zomato_test_reviews = [
    "The biryani was amazing! Perfect spices and tender meat. Must visit for biryani lovers.",
    "Pathetic service. Waited 45 minutes for cold food. Staff was extremely rude.",
    "Great ambiance for a romantic dinner. The pasta was creamy and delicious.",
    "Overpriced and underwhelming. The pizza was soggy and tasteless.",
    "Loved the live music and the cocktails. Perfect weekend hangout spot!"
]

print("\nZomato review predictions:")
for text in zomato_test_reviews:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=MAX_LENGTH)
    device = next(model.parameters()).device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        prediction = torch.argmax(probs, dim=-1).item()
        confidence = probs[0][prediction].item()
    
    print(f"  {id2label[prediction]} ({confidence:.2%}): {text[:60]}...")


Zomato review predictions:
  Positive (100.00%): The biryani was amazing! Perfect spices and tender meat. Mus...
  Negative (100.00%): Pathetic service. Waited 45 minutes for cold food. Staff was...
  Positive (100.00%): Great ambiance for a romantic dinner. The pasta was creamy a...
  Negative (100.00%): Overpriced and underwhelming. The pizza was soggy and tastel...
  Positive (100.00%): Loved the live music and the cocktails. Perfect weekend hang...


## 13. Save Model

In [48]:
# Create output directory
Path(MODEL_SAVE_PATH).mkdir(parents=True, exist_ok=True)

# Save the model and tokenizer
print(f"Saving model to: {MODEL_SAVE_PATH}")
trainer.save_model(MODEL_SAVE_PATH)
tokenizer.save_pretrained(MODEL_SAVE_PATH)

# Save training info
import json

training_info = {
    "base_model": MODEL_CHECKPOINT,
    "num_epochs": NUM_EPOCHS,
    "batch_size": BATCH_SIZE,
    "learning_rate": LEARNING_RATE,
    "max_length": MAX_LENGTH,
    "train_samples": len(train_df),
    "val_samples": len(val_df),
    "eval_accuracy": eval_results['eval_accuracy'],
    "id2label": id2label,
    "label2id": label2id
}

with open(f"{MODEL_SAVE_PATH}/training_info.json", "w") as f:
    json.dump(training_info, f, indent=2)

print("Model saved successfully!")

Saving model to: models/sentiment/final_model


Model saved successfully!
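One detail to keep in mind when reading `training_info.json` back: JSON object keys are always strings, so the integer keys of `id2label` come back as `"0"` and `"1"`. The sketch below shows the round trip in memory and one way to restore integer keys; it does not touch the saved file.

```python
import json

id2label = {0: "Negative", 1: "Positive"}

# Round-trip through JSON, as json.dump / json.load would do with the file
restored = json.loads(json.dumps({"id2label": id2label}))["id2label"]
print(restored)  # {'0': 'Negative', '1': 'Positive'}

# Convert the keys back to integers before indexing with a predicted class id
id2label_int = {int(k): v for k, v in restored.items()}
print(id2label_int[1])  # Positive
```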


## 14. Verify Saved Model

In [49]:
# Load and verify the saved model
print("Loading saved model for verification...")

loaded_tokenizer = AutoTokenizer.from_pretrained(MODEL_SAVE_PATH)
loaded_model = AutoModelForSequenceClassification.from_pretrained(MODEL_SAVE_PATH)

# Test prediction
test_text = "This restaurant has the best food I've ever tasted!"
inputs = loaded_tokenizer(test_text, return_tensors="pt", truncation=True, max_length=MAX_LENGTH)

with torch.no_grad():
    outputs = loaded_model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()

print(f"Loaded model prediction: {id2label[prediction]}")
print(f"   Test text: {test_text}")

Loading saved model for verification...


Loaded model prediction: Positive
   Test text: This restaurant has the best food I've ever tasted!


---

## Training Complete!

The sentiment model has been saved to `models/sentiment/final_model`

You can now use this model in the restaurant recommendation agents.