# Task 1: Rating Prediction via Prompting (Optimized)

### Objective
The goal of this notebook is to classify Yelp reviews into 1-5 star ratings using an LLM. 

**Note on Optimization:** To stay within API rate limits (requests per minute, RPM), this notebook uses a **Batch Processing Strategy**. Instead of sending one review per request, it packs several reviews into each API call. With 10 reviews per call this cuts the number of calls by 90%; the run below sets `BATCH_SIZE = 100`, so the 200 sampled reviews need only 2 calls per strategy.
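
The saving is simple arithmetic. As a minimal sketch (the helper name `api_calls_needed` is illustrative and not part of the notebook), the number of calls is the review count divided by the batch size, rounded up:

```python
import math

def api_calls_needed(num_reviews: int, batch_size: int) -> int:
    """Number of API calls when reviews are sent batch_size at a time."""
    return math.ceil(num_reviews / batch_size)

num_reviews = 200  # sample size used in this notebook
for batch_size in (1, 10, 100):
    calls = api_calls_needed(num_reviews, batch_size)
    saved = 100 * (1 - calls / num_reviews)
    print(f"batch_size={batch_size:>3}: {calls:>3} calls ({saved:.0f}% fewer than one-by-one)")
```

Larger batches save more calls, but each response gets longer, so a single malformed reply costs more reviews.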

We will evaluate three strategies:
1.  **Zero-Shot:** Direct classification.
2.  **Few-Shot:** Using examples.
3.  **Chain-of-Thought (CoT):** Reasoning before scoring.

In [31]:
import google.generativeai as genai
import kagglehub
import pandas as pd
import json
import time
import os
from sklearn.metrics import accuracy_score, classification_report

# --- CONFIGURATION ---
# Replace with your actual API key
GOOGLE_API_KEY = "INSERT_API_KEY"

genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-2.5-flash')

print("Setup Complete. Model configured.")

Setup Complete. Model configured.


## 2. Data Loading
Downloading the Yelp dataset and sampling 200 rows.

In [32]:
print("Downloading dataset from Kaggle...")
path = kagglehub.dataset_download("omkarsabnis/yelp-reviews-dataset")
csv_path = os.path.join(path, "yelp.csv") 

try:
    df = pd.read_csv(csv_path)
    # Sample 200 rows and RESET INDEX to ensure we have clean IDs for batching
    df_sample = df.sample(n=200, random_state=84).reset_index(drop=True)
    print(f"Sampled {len(df_sample)} rows.")
    display(df_sample[['text', 'stars']].head(3))
except FileNotFoundError:
    print("Error: CSV not found.")

Downloading dataset from Kaggle...
Sampled 200 rows.


| | text | stars |
|---|---|---|
| 0 | We came here using a groupon or living social ... | 4 |
| 1 | ...for selling broccoli at a good price I gave... | 2 |
| 2 | This is about as good as it gets for Asian foo... | 4 |


## 3. Prompt Engineering Strategies

Here I define the three distinct prompting approaches.

### 3.1 Strategy 1: Zero-Shot Prompting
* **Concept:** I provide the task description and format requirements but no specific examples.
* **Hypothesis:** Good for general sentiment, but might struggle with borderline cases (e.g., 3 stars vs 4 stars).

### 3.2 Strategy 2: Few-Shot Prompting
* **Concept:** I provide a few example input-output pairs ("shots") to demonstrate the desired behavior and reasoning style.
* **Hypothesis:** Should improve consistency and JSON validity by showing the model exactly what the output looks like.

### 3.3 Strategy 3: Chain-of-Thought (CoT)
* **Concept:** I explicitly instruct the model to "think" step-by-step before assigning the final score.
* **Hypothesis:** This should yield the highest accuracy for complex reviews where the tone shifts (e.g., "The food was great, BUT...").

### 3.4 Efficient Batch Execution Engine

To avoid rate-limit errors, I define a **Batch Runner**. This function:
1. Takes a dataframe.
2. Chunks it into batches of `BATCH_SIZE` reviews (set to 100 below).
3. Sends a single prompt containing multiple reviews.
4. Parses the list of results returned by the LLM.
5. Saves progress to a CSV file (checkpointing) so we don't lose data if it stops.
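
The five steps above can be sketched without any API access. This is a minimal, hypothetical illustration rather than the notebook's implementation: `fake_llm` stands in for the model call, reviews are plain dicts instead of DataFrame rows, and the file name `demo_checkpoint.csv` is made up for the example.

```python
import csv
import json
import os
import tempfile

def fake_llm(prompt: str) -> str:
    """Stand-in for the real model: returns a JSON list rating every ID as 3 stars."""
    ids = [int(line.split()[1].rstrip(":")) for line in prompt.splitlines() if line.startswith("ID ")]
    return json.dumps([{"id": i, "predicted_stars": 3} for i in ids])

def run_in_batches(reviews, batch_size, checkpoint_path):
    results = []
    for start in range(0, len(reviews), batch_size):       # step 2: chunk
        batch = reviews[start:start + batch_size]
        # step 3: one prompt that contains every review in the batch
        prompt = "\n".join(f'ID {r["id"]}: "{r["text"]}"' for r in batch)
        parsed = json.loads(fake_llm(prompt))               # step 4: parse the returned list
        sent_ids = {r["id"] for r in batch}
        for item in parsed:
            if item["id"] in sent_ids:                      # ignore IDs that were never sent
                results.append({"id": item["id"], "predicted_stars": item["predicted_stars"]})
        # step 5: rewrite the checkpoint after every batch
        with open(checkpoint_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "predicted_stars"])
            writer.writeheader()
            writer.writerows(results)
    return results

reviews = [{"id": i, "text": f"review number {i}"} for i in range(5)]
checkpoint = os.path.join(tempfile.gettempdir(), "demo_checkpoint.csv")
results = run_in_batches(reviews, batch_size=2, checkpoint_path=checkpoint)
print(len(results), "reviews processed")
```

Stepping through the list with `range(0, len(reviews), batch_size)` also covers a final partial batch, here the single review left over after two batches of two.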

In [33]:
BATCH_SIZE = 100
CHECKPOINT_FILE = "task1_batch_results.csv"

def generate_batch_prompt(reviews_subset, strategy_type):
    """
    Constructs a prompt to classify multiple reviews at once.
    """
    # 1. Format the Input Data
    reviews_text = ""
    for i, row in reviews_subset.iterrows():
        reviews_text += f"ID {row.name}: \"{row['text']}\"\n\n"

    # 2. Select Strategy Instruction
    if strategy_type == "Zero-Shot":
        instruction = """
        You are an expert sentiment analyzer for Yelp reviews.
        Task: Analyze the review texts and predict the star rating (1 to 5).
        """
    elif strategy_type == "Few-Shot":
        instruction = """
        Classify the Yelp reviews into a 1-5 star rating.
    
        Example 1:
        Review: "The service was slow and the food was cold. I will not be coming back."
        Output: { "predicted_stars": 1, "explanation": "Negative sentiment focused on service and food quality. Explicit statement of not returning." }
    
        Example 2:
        Review: "Decent place. The burger was good but the fries were soggy. A bit overpriced."
        Output: { "predicted_stars": 3, "explanation": "Mixed sentiment. Good main dish but poor side dish and value issues." }
    
        Example 3:
        Review: "Absolutely loved it! The atmosphere was cozy and the staff was incredibly friendly."
        Output: { "predicted_stars": 5, "explanation": "Strong positive sentiment with praise for atmosphere and staff." }
        
        Now classify the following reviews:
        """
    elif strategy_type == "Chain-of-Thought":
        instruction = """
        You are a meticulous restaurant critic. Analyze the following reviews to determine the star rating (1-5).
        
        Follow these steps in your reasoning:
        1. Identify specific positive mentions (food, service, ambiance).
        2. Identify specific negative mentions.
        3. Weigh the intensity of the emotions (e.g., "good" vs "phenomenal", "bad" vs "horrible").
        4. Assign a star rating based on the balance of these factors.

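        Now classify the following reviews:
        """
    else:
        # Fail fast instead of hitting a NameError on `instruction` below
        raise ValueError(f"Unknown strategy_type: {strategy_type!r}")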

    # 3. Final Prompt Assembly
    prompt = f"""
    {instruction}

    INPUT REVIEWS:
    {reviews_text}
    
    OUTPUT REQUIREMENTS:
    - Return a raw JSON LIST of objects.
    - Do not use markdown formatting.
    - Schema per object: {{ "id": <int>, "predicted_stars": <int>, "explanation": <string> }}
    """
    return prompt

def run_batch_experiment(df, strategy_name):
    print(f"--- Starting Batch Processing for {strategy_name} ---")
    results = []
    
    # Prepare batches
    # Ceiling division so a final partial batch is not silently dropped
    num_batches = (len(df) + BATCH_SIZE - 1) // BATCH_SIZE

    for batch_idx in range(num_batches):
        start_idx = batch_idx * BATCH_SIZE
        end_idx = start_idx + BATCH_SIZE
        
        batch_df = df.iloc[start_idx:end_idx]
        if batch_df.empty: break
        
        print(f"Processing Batch {batch_idx+1}/{num_batches}...")
        
        try:
            # API Call
            prompt = generate_batch_prompt(batch_df, strategy_name)
            response = model.generate_content(prompt)
            
            # Parse JSON
            clean_text = response.text.replace("```json", "").replace("```", "").strip()
            batch_data = json.loads(clean_text)
            
            # Map back to dataframe
            for item in batch_data:
                orig_id = item.get('id')
                if orig_id in batch_df.index:
                    row = batch_df.loc[orig_id]
                    results.append({
                        "strategy": strategy_name,
                        "original_text": row['text'],
                        "actual_stars": row['stars'],
                        "predicted_stars": item.get('predicted_stars'),
                        "explanation": item.get('explanation'),
                        "valid_json": True
                    })
                    
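        except Exception as e:
            # batch_idx is zero-based here, so "batch 1" means the second batch;
            # a failed batch contributes no rows to `results`
            print(f"Error in batch {batch_idx}: {e}")
            time.sleep(10) # Cool down on error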

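        # Checkpoint after every batch (one file per strategy so runs do not overwrite each other)
        pd.DataFrame(results).to_csv(f"{strategy_name}_{CHECKPOINT_FILE}", index=False)

        # Rate Limit Safety Sleep
        time.sleep(4)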
        
    return pd.DataFrame(results)

## 4. Running the Experiments
I now run all three strategies using the batch processor.

In [34]:
# 1. Zero-Shot
df_zero = run_batch_experiment(df_sample, "Zero-Shot")

# 2. Few-Shot
df_few = run_batch_experiment(df_sample, "Few-Shot")

# 3. Chain-of-Thought
df_cot = run_batch_experiment(df_sample, "Chain-of-Thought")

print("All experiments completed.")

--- Starting Batch Processing for Zero-Shot ---
Processing Batch 1/2...
Processing Batch 2/2...
--- Starting Batch Processing for Few-Shot ---
Processing Batch 1/2...
Processing Batch 2/2...
--- Starting Batch Processing for Chain-of-Thought ---
Processing Batch 1/2...
Processing Batch 2/2...
Error in batch 1: Expecting ':' delimiter: line 174 column 15 (char 19677)
All experiments completed.
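
The `Expecting ':' delimiter` error above discarded the whole second Chain-of-Thought batch. One possible mitigation, sketched here under assumptions, is to split a batch that fails to parse and retry the halves, so one malformed response costs fewer reviews. `classify_batch` is a hypothetical stand-in for the prompt-plus-`json.loads` step; in this example it simulates a model whose output breaks on batches of more than two items.

```python
import json

def classify_batch(ids):
    """Hypothetical stand-in for one API call; fails on batches of more than 2 items."""
    if len(ids) > 2:
        raw = '[{"id": 0, "predicted_stars" 4}]'   # malformed on purpose: missing ':'
    else:
        raw = json.dumps([{"id": i, "predicted_stars": 4} for i in ids])
    return json.loads(raw)

def classify_with_split(ids, min_size=1):
    """Try the whole batch; on a JSON error, split it in half and retry each half."""
    try:
        return classify_batch(ids)
    except json.JSONDecodeError:
        if len(ids) <= min_size:
            return []                               # give up on this smallest batch
        mid = len(ids) // 2
        return classify_with_split(ids[:mid], min_size) + classify_with_split(ids[mid:], min_size)

recovered = classify_with_split(list(range(8)))
print(f"Recovered {len(recovered)} of 8 predictions")
```

Every retry is an extra API call, so in a real run the rate-limit sleep would still have to apply between attempts.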


## 5. Evaluation
Comparing the performance of the three approaches.

In [35]:
def evaluate_strategy(df_results):
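    if df_results.empty:
        # Use the same keys as the normal return value so the comparison table lines up
        return {"Strategy": "Failed", "Accuracy (%)": 0.0, "Processed Samples": 0}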
        
    strategy_name = df_results['strategy'].iloc[0]
    
    # Accuracy (Strict Match)
    # Filter out any non-integer predictions just in case
    valid_rows = df_results[pd.to_numeric(df_results['predicted_stars'], errors='coerce').notnull()]
    
    y_true = valid_rows['actual_stars'].astype(int)
    y_pred = valid_rows['predicted_stars'].astype(int)
    accuracy = accuracy_score(y_true, y_pred) * 100

    return {
        "Strategy": strategy_name,
        "Accuracy (%)": round(accuracy, 2),
        "Processed Samples": len(valid_rows)
    }

metrics = []
metrics.append(evaluate_strategy(df_zero))
metrics.append(evaluate_strategy(df_few))
metrics.append(evaluate_strategy(df_cot))

comparison_df = pd.DataFrame(metrics)
display(comparison_df)

| | Strategy | Accuracy (%) | Processed Samples |
|---|---|---|---|
| 0 | Zero-Shot | 69.19 | 198 |
| 1 | Few-Shot | 69.5 | 200 |
| 2 | Chain-of-Thought | 57.0 | 100 |


In [40]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def get_detailed_metrics(strategy_name, df_results, total_expected_samples):
    """
    Calculates detailed performance metrics including Validity, Accuracy, Precision, Recall, and F1.
    """
    # 1. JSON Validity Rate
    # The dataframe only contains valid parsed rows, so count(rows) / total_expected
    valid_count = len(df_results)
    validity_rate = (valid_count / total_expected_samples) * 100
    
    if valid_count == 0:
        return {
            "Strategy": strategy_name,
            "JSON Validity (%)": 0.0,
            "Accuracy": 0.0,
            "Precision": 0.0,
            "Recall": 0.0,
            "F1 Score": 0.0
        }

    # 2. Classification Metrics
    # Ensure numerical types
    y_true = pd.to_numeric(df_results['actual_stars'], errors='coerce')
    y_pred = pd.to_numeric(df_results['predicted_stars'], errors='coerce')
    
    # Filter out any rows where conversion failed (just in case)
    mask = y_true.notna() & y_pred.notna()
    y_true = y_true[mask]
    y_pred = y_pred[mask]
    
    # Calculate weighted metrics (handling class imbalance for 1-5 stars)
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted', zero_division=0)
    
    return {
        "Strategy": strategy_name,
        "JSON Validity (%)": round(validity_rate, 2),
        "Accuracy": round(accuracy, 4),
        "Precision": round(precision, 4),
        "Recall": round(recall, 4),
        "F1 Score": round(f1, 4)
    }

# Run full evaluation
total_samples = len(df_sample) # Should be 200 based on your sampling
eval_metrics = []

eval_metrics.append(get_detailed_metrics("Zero-Shot", df_zero, total_samples))
eval_metrics.append(get_detailed_metrics("Few-Shot", df_few, total_samples))
eval_metrics.append(get_detailed_metrics("Chain-of-Thought", df_cot, total_samples))

# Create Comparison Table
final_comparison_df = pd.DataFrame(eval_metrics)

print("--- Final Performance Comparison ---")
display(final_comparison_df)

--- Final Performance Comparison ---


| | Strategy | JSON Validity (%) | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| 0 | Zero-Shot | 100.0 | 0.6919 | 0.7022 | 0.6919 | 0.6866 |
| 1 | Few-Shot | 100.0 | 0.695 | 0.7022 | 0.695 | 0.6901 |
| 2 | Chain-of-Thought | 50.0 | 0.57 | 0.5899 | 0.57 | 0.5753 |
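
The Accuracy and Recall columns are identical in every row. This is expected rather than a bug: with `average='weighted'`, each class's recall is weighted by its share of the true labels, and the weights cancel. Writing $n_c$ for the number of true samples in class $c$, $TP_c$ for the correctly predicted samples in that class, and $N = \sum_c n_c$ (notation introduced here for illustration):

```latex
\text{Recall}_{\text{weighted}}
  = \sum_{c=1}^{5} \frac{n_c}{N} \cdot \frac{TP_c}{n_c}
  = \frac{1}{N} \sum_{c=1}^{5} TP_c
  = \text{Accuracy}
```

Weighted precision and F1 have no such identity, which is why those columns differ from accuracy.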


In [39]:
# --- Short Discussion ---
print("\n--- Discussion of Results ---")

print("1. Best Performer: The 'Few-Shot' strategy achieved the highest F1 Score (0.6901), narrowly ahead of 'Zero-Shot' (0.6866). On 200 samples this gap is too small to be conclusive.")

print("\n2. Impact of Prompting Strategies:")
print("   - Zero-Shot: A strong baseline (accuracy 0.6919 on the 198 reviews with a usable prediction), although it may struggle with 'nuanced' reviews (e.g., sarcasm or mixed feelings).")
print("   - Few-Shot: The examples standardize the output format. In this run it returned a usable prediction for all 200 reviews and matched or slightly exceeded Zero-Shot on every metric.")
print("   - Chain-of-Thought (CoT): The second batch failed to parse as JSON, so only 100 of 200 reviews were scored (50% validity). On those, accuracy (0.57) and F1 (0.5753) were the lowest of the three, so the expected benefit on 'ambiguous' reviews (e.g., 3-star ratings) did not appear in this run.")


--- Discussion of Results ---
1. Best Performer: The 'Few-Shot' strategy achieved the highest F1 Score (0.6901), narrowly ahead of 'Zero-Shot' (0.6866). On 200 samples this gap is too small to be conclusive.

2. Impact of Prompting Strategies:
   - Zero-Shot: A strong baseline (accuracy 0.6919 on the 198 reviews with a usable prediction), although it may struggle with 'nuanced' reviews (e.g., sarcasm or mixed feelings).
   - Few-Shot: The examples standardize the output format. In this run it returned a usable prediction for all 200 reviews and matched or slightly exceeded Zero-Shot on every metric.
   - Chain-of-Thought (CoT): The second batch failed to parse as JSON, so only 100 of 200 reviews were scored (50% validity). On those, accuracy (0.57) and F1 (0.5753) were the lowest of the three, so the expected benefit on 'ambiguous' reviews (e.g., 3-star ratings) did not appear in this run.
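
Statements about harder classes such as 3-star reviews can only be checked with a per-class breakdown, which the tables above do not report. A minimal sketch of per-class recall in plain Python follows, using made-up labels purely for illustration (they are not results from this notebook):

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {star: fraction of reviews with that true rating that were predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {star: hits[star] / totals[star] for star in sorted(totals)}

# Made-up example labels, not data from the experiments above
y_true = [1, 3, 3, 3, 4, 4, 5, 5, 5, 5]
y_pred = [1, 3, 4, 2, 4, 5, 5, 5, 5, 4]

recalls = per_class_recall(y_true, y_pred)
for star, recall in recalls.items():
    print(f"{star} stars: recall {recall:.2f}")
```

In the notebook itself, `classification_report(y_true, y_pred)` from scikit-learn, which is already imported in the first cell, gives a comparable per-class breakdown.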
