# Task 1: Rating Prediction via Prompting
## Fynd AI Intern Assessment 2.0

This notebook implements and evaluates **4 different prompting approaches** to predict star ratings (1-5) from Yelp reviews.

### Approaches:
1. **Simple Direct Prompting** - Baseline with minimal context
2. **Few-Shot Learning** - Includes examples to guide the model
3. **Chain-of-Thought (CoT)** - Step-by-step reasoning
4. **Structured Criteria-Based Analysis** - Detailed evaluation framework

## 1. Setup and Imports

In [None]:
# Install required packages (run once)
# !pip install pandas numpy scikit-learn google-generativeai python-dotenv matplotlib seaborn

In [None]:
import pandas as pd
import numpy as np
import json
import time
import re
from typing import Dict, List, Tuple, Optional
from dotenv import load_dotenv
import os
import google.generativeai as genai
from sklearn.metrics import accuracy_score, mean_absolute_error, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Set style for visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ All imports successful")

## 2. Configuration and API Setup

In [None]:
# Load environment variables
load_dotenv()

# Configure Gemini API
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')

if not GEMINI_API_KEY or GEMINI_API_KEY == 'your_api_key_here':
    raise ValueError(
        "❌ GEMINI_API_KEY not found!\n"
        "Please create a .env file with your API key:\n"
        "GEMINI_API_KEY=your_actual_key_here\n\n"
        "Get your free key from: https://makersuite.google.com/app/apikey"
    )

genai.configure(api_key=GEMINI_API_KEY)

# Initialize model
MODEL_NAME = 'gemini-2.5-flash'
model = genai.GenerativeModel(MODEL_NAME)

print("✓ Gemini API configured successfully")
print(f"✓ Using model: {MODEL_NAME}")

## 3. Load and Sample Dataset

In [None]:
# Load dataset
df = pd.read_csv('yelp.csv')

print(f"Total reviews in dataset: {len(df):,}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# Check rating distribution
print("Rating distribution in full dataset:")
print(df['stars'].value_counts().sort_index())

plt.figure(figsize=(10, 5))
df['stars'].value_counts().sort_index().plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Rating Distribution in Full Dataset', fontsize=14, fontweight='bold')
plt.xlabel('Star Rating')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Stratified sampling to maintain rating distribution
SAMPLE_SIZE = 200
np.random.seed(42)  # For reproducibility

# Sample proportionally from each rating. Original index labels are kept
# until the end so that already-sampled rows can be excluded below.
sample_df = pd.concat([
    group.sample(min(len(group), int(SAMPLE_SIZE * len(group) / len(df))), random_state=42)
    for _, group in df.groupby('stars')
])

# int() truncation can leave the sample short of SAMPLE_SIZE; adjust if so
if len(sample_df) < SAMPLE_SIZE:
    # Top up only from rows that were not already sampled
    additional = df.drop(sample_df.index).sample(
        SAMPLE_SIZE - len(sample_df), random_state=42
    )
    sample_df = pd.concat([sample_df, additional])
elif len(sample_df) > SAMPLE_SIZE:
    sample_df = sample_df.sample(SAMPLE_SIZE, random_state=42)

sample_df = sample_df.reset_index(drop=True)

print(f"\n✓ Sampled {len(sample_df)} reviews")
print(f"\nSample rating distribution:")
print(sample_df['stars'].value_counts().sort_index())

# Visualize sample distribution
plt.figure(figsize=(10, 5))
sample_df['stars'].value_counts().sort_index().plot(kind='bar', color='lightcoral', edgecolor='black')
plt.title('Rating Distribution in Sample', fontsize=14, fontweight='bold')
plt.xlabel('Star Rating')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
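
The proportional quota for each rating is truncated with `int()`, so the per-rating quotas can sum to slightly less than `SAMPLE_SIZE`. The sketch below works through that arithmetic in plain Python. The rating counts are hypothetical, chosen only to illustrate the rounding effect; they are not read from `yelp.csv`.

```python
# Hypothetical rating counts (assumption for illustration, not taken from the dataset)
counts = {1: 749, 2: 927, 3: 1461, 4: 3526, 5: 3337}
sample_size = 200

total = sum(counts.values())

# Per-rating quota, truncated the same way int() truncates in the sampling cell
quotas = {stars: int(sample_size * n / total) for stars, n in counts.items()}

# Rows that still have to be drawn after truncation
shortfall = sample_size - sum(quotas.values())

print(quotas)
print(f"Rows still needed after truncation: {shortfall}")
```

The shortfall is what the top-up branch fills. Because those extra rows are drawn from the whole remaining dataset rather than per rating, the final sample is only approximately stratified.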

## 4. Prompting Approaches

### Approach 1: Simple Direct Prompting
**Strategy**: Minimal context, direct instruction
**Rationale**: Establishes baseline performance without any sophisticated techniques

In [None]:
def create_simple_prompt(review_text: str) -> str:
    """Simple direct prompting approach."""
    return f"""Analyze this Yelp review and predict the star rating (1-5).

Review: "{review_text}"

Return ONLY a JSON object with this exact format:
{{
  "predicted_stars": <number between 1-5>,
  "explanation": "<brief reasoning>"
}}
"""

# Test the prompt
test_review = sample_df.iloc[0]['text']
print("Example Prompt (Approach 1):")
print("="*80)
print(create_simple_prompt(test_review))
print("="*80)

### Approach 2: Few-Shot Learning
**Strategy**: Provide 5 diverse examples to calibrate the model
**Rationale**: In-context examples anchor the model's sense of the rating scale, which tends to improve consistency and accuracy

In [None]:
def create_fewshot_prompt(review_text: str) -> str:
    """Few-shot learning approach with examples."""
    return f"""You are an expert at analyzing Yelp reviews and predicting star ratings.

Here are some examples:

Example 1:
Review: "Absolutely amazing! Best food I've ever had. Service was impeccable and the atmosphere was perfect."
Rating: 5 stars
Reasoning: Extremely positive language, multiple superlatives, all aspects praised.

Example 2:
Review: "Pretty good experience overall. Food was tasty but service was a bit slow."
Rating: 4 stars
Reasoning: Positive but with minor criticism, balanced review.

Example 3:
Review: "It was okay. Nothing special but nothing terrible either. Average food, average service."
Rating: 3 stars
Reasoning: Neutral language, no strong positives or negatives, mediocre experience.

Example 4:
Review: "Disappointed. Food was cold and service was inattentive. Probably won't return."
Rating: 2 stars
Reasoning: Multiple negative points, clear dissatisfaction, unlikely to return.

Example 5:
Review: "Terrible experience. Rude staff, disgusting food, dirty restaurant. Avoid at all costs!"
Rating: 1 star
Reasoning: Extremely negative, multiple severe complaints, strong warning to others.

Now analyze this review:
Review: "{review_text}"

Return ONLY a JSON object:
{{
  "predicted_stars": <number between 1-5>,
  "explanation": "<brief reasoning>"
}}
"""

# Test the prompt
print("Example Prompt (Approach 2):")
print("="*80)
print(create_fewshot_prompt(test_review)[:500] + "...")
print("="*80)
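
The five examples are hard-coded in the prompt string. If the examples need to be swapped or extended, one option is to assemble the same block from a list of records. The following is a minimal sketch under that assumption; `EXAMPLES` and `build_fewshot_block` are hypothetical names that do not appear elsewhere in this notebook.

```python
# Hypothetical example records mirroring the layout used in the few-shot prompt
EXAMPLES = [
    {
        "review": "Pretty good experience overall. Food was tasty but service was a bit slow.",
        "stars": 4,
        "reasoning": "Positive but with minor criticism, balanced review.",
    },
    {
        "review": "Terrible experience. Rude staff, disgusting food, dirty restaurant.",
        "stars": 1,
        "reasoning": "Extremely negative, multiple severe complaints.",
    },
]


def build_fewshot_block(examples):
    """Format example records as 'Example N / Review / Rating / Reasoning' blocks."""
    parts = []
    for i, ex in enumerate(examples, start=1):
        unit = "star" if ex["stars"] == 1 else "stars"
        parts.append(
            f"Example {i}:\n"
            f'Review: "{ex["review"]}"\n'
            f"Rating: {ex['stars']} {unit}\n"
            f"Reasoning: {ex['reasoning']}"
        )
    # Blank line between examples, matching the hand-written prompt
    return "\n\n".join(parts)


fewshot_block = build_fewshot_block(EXAMPLES)
print(fewshot_block)
```

Keeping the examples as data also makes it easier to check that every rating level is represented before the prompt is sent.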

### Approach 3: Chain-of-Thought (CoT)
**Strategy**: Ask model to reason step-by-step before deciding
**Rationale**: Working through intermediate steps tends to help on tasks that need multi-step reasoning, such as weighing mixed praise and complaints

In [None]:
def create_cot_prompt(review_text: str) -> str:
    """Chain-of-thought prompting approach."""
    return f"""Analyze this Yelp review step-by-step to predict the star rating.

Review: "{review_text}"

Think through this systematically:
1. First, identify the overall sentiment (very negative, negative, neutral, positive, very positive)
2. Note specific positive aspects mentioned
3. Note specific negative aspects or complaints
4. Consider the intensity of language used
5. Determine if the reviewer would recommend this place
6. Based on all factors, assign a rating from 1-5 stars

Return ONLY a JSON object:
{{
  "predicted_stars": <number between 1-5>,
  "explanation": "<your step-by-step reasoning>"
}}
"""

# Test the prompt
print("Example Prompt (Approach 3):")
print("="*80)
print(create_cot_prompt(test_review))
print("="*80)

### Approach 4: Structured Criteria-Based Analysis
**Strategy**: Detailed evaluation framework with explicit criteria
**Rationale**: Mimics human rating process for more consistent results

In [None]:
def create_structured_prompt(review_text: str) -> str:
    """Structured criteria-based analysis approach."""
    return f"""You are a professional review analyst. Evaluate this Yelp review using a structured framework.

Review: "{review_text}"

Rating Guidelines:
★★★★★ (5 stars): Exceptional experience, highly enthusiastic, strong recommendation, minimal/no complaints
★★★★☆ (4 stars): Very good experience, mostly positive, minor issues mentioned, would likely return
★★★☆☆ (3 stars): Average/mixed experience, balanced positives and negatives, neutral stance
★★☆☆☆ (2 stars): Below average, more negatives than positives, disappointed, unlikely to return
★☆☆☆☆ (1 star): Terrible experience, extremely negative, strong complaints, warns others away

Evaluation Criteria:
1. Sentiment Analysis: What is the dominant emotional tone?
2. Language Intensity: Are superlatives or extreme words used?
3. Specific Feedback: What concrete details are mentioned (food quality, service, atmosphere, value)?
4. Recommendation Likelihood: Would the reviewer recommend this to others?
5. Complaint Severity: How serious are any issues mentioned?

Return ONLY a JSON object:
{{
  "predicted_stars": <number between 1-5>,
  "explanation": "<detailed analysis based on criteria>"
}}
"""

# Test the prompt
print("Example Prompt (Approach 4):")
print("="*80)
print(create_structured_prompt(test_review)[:500] + "...")
print("="*80)

## 5. Helper Functions for API Calls and JSON Parsing

In [None]:
def extract_json_from_response(response_text: str) -> Optional[Dict]:
    """Extract JSON from various response formats."""
    try:
        # Try direct JSON parsing
        return json.loads(response_text)
    except json.JSONDecodeError:
        pass
    
    # Try to find JSON in markdown code blocks
    json_pattern = r'```(?:json)?\s*({.*?})\s*```'
    matches = re.findall(json_pattern, response_text, re.DOTALL)
    if matches:
        try:
            return json.loads(matches[0])
        except json.JSONDecodeError:
            pass
    
    # Try to find any JSON object
    json_pattern = r'{[^{}]*"predicted_stars"[^{}]*}'
    matches = re.findall(json_pattern, response_text, re.DOTALL)
    if matches:
        try:
            return json.loads(matches[0])
        except json.JSONDecodeError:
            pass
    
    return None


def validate_prediction(prediction: Dict) -> bool:
    """Validate prediction structure and values."""
    if not isinstance(prediction, dict):
        return False
    
    if 'predicted_stars' not in prediction:
        return False
    
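    stars = prediction['predicted_stars']
    # bool is a subclass of int, so reject it explicitly
    if isinstance(stars, bool) or not isinstance(stars, (int, float)):
        return False
    
    # Require a whole-star rating in range: a value such as 4.5 would make
    # accuracy_score fail on mixed multiclass/continuous targets
    if not (1 <= stars <= 5) or stars != int(stars):
        return False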
    
    return True


def call_llm_with_retry(prompt: str, max_retries: int = 3, delay: float = 1.0) -> Tuple[Optional[Dict], str]:
    """Call LLM with retry logic and return (prediction_dict, raw_response)."""
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            raw_response = response.text
            
            # Extract JSON
            prediction = extract_json_from_response(raw_response)
            
            if prediction and validate_prediction(prediction):
                return prediction, raw_response
            
            # If validation failed, retry
            if attempt < max_retries - 1:
                time.sleep(delay * (attempt + 1))
                continue
            
            return None, raw_response
            
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(delay * (attempt + 1))
                continue
            return None, str(e)
    
    return None, "Max retries exceeded"


print("✓ Helper functions defined")
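
Model responses do not always arrive as bare JSON: the object may be wrapped in a markdown code fence or preceded by a sentence of prose. The sketch below illustrates the fallback idea on a few hand-written response strings. `extract_first_json_object` is a hypothetical, simplified stand-in for `extract_json_from_response`, and the sample responses are invented for illustration rather than captured from the API.

```python
import json
import re
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Simplified stand-in: try strict parsing, then the outermost {...} span."""
    try:
        parsed = json.loads(text)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass

    # Greedy match from the first '{' to the last '}' in the response
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            return parsed if isinstance(parsed, dict) else None
        except json.JSONDecodeError:
            return None
    return None


# Invented response strings covering the formats the helper has to handle
samples = [
    '{"predicted_stars": 4, "explanation": "Mostly positive."}',
    '```json\n{"predicted_stars": 2, "explanation": "Cold food."}\n```',
    'Sure! Here is the rating: {"predicted_stars": 5, "explanation": "Loved it."}',
    'I think this is a 3-star review.',
]

parsed = [extract_first_json_object(s) for s in samples]
for sample, result in zip(samples, parsed):
    print(repr(sample[:40]), "->", result)
```

A response with no JSON object at all yields `None`, which is the case the retry loop in `call_llm_with_retry` is meant to catch.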

## 6. Evaluation Function

In [None]:
def evaluate_approach(approach_name: str, prompt_function, sample_data: pd.DataFrame, 
                      delay_between_calls: float = 0.5) -> pd.DataFrame:
    """Evaluate a prompting approach on the sample dataset."""
    print(f"\n{'='*80}")
    print(f"Evaluating: {approach_name}")
    print(f"{'='*80}\n")
    
    results = []
    total = len(sample_data)
    
    for idx, row in sample_data.iterrows():
        review_text = row['text']
        actual_stars = row['stars']
        
        # Create prompt
        prompt = prompt_function(review_text)
        
        # Call LLM
        prediction, raw_response = call_llm_with_retry(prompt)
        
        # Store results
        result = {
            'review_text': review_text,
            'actual_stars': actual_stars,
            'predicted_stars': prediction['predicted_stars'] if prediction else None,
            'explanation': prediction.get('explanation', '') if prediction else '',
            'raw_response': raw_response,
            'json_valid': prediction is not None
        }
        results.append(result)
        
        # Progress indicator (counts completed rows rather than relying on the index)
        done = len(results)
        if done % 10 == 0 or done == total:
            valid_count = sum(1 for r in results if r['json_valid'])
            print(f"Progress: {done}/{total} | Valid JSON: {valid_count}/{done} ({100*valid_count/done:.1f}%)")
        
        # Rate limiting
        time.sleep(delay_between_calls)
    
    results_df = pd.DataFrame(results)
    
    # Calculate metrics
    valid_predictions = results_df[results_df['json_valid']]
    
    if len(valid_predictions) > 0:
        accuracy = accuracy_score(valid_predictions['actual_stars'], valid_predictions['predicted_stars'])
        mae = mean_absolute_error(valid_predictions['actual_stars'], valid_predictions['predicted_stars'])
    else:
        accuracy = 0.0
        mae = float('inf')
    
    json_validity_rate = results_df['json_valid'].mean()
    
    print(f"\n{'='*80}")
    print(f"Results for {approach_name}:")
    print(f"{'='*80}")
    print(f"Total Reviews: {len(results_df)}")
    print(f"Valid JSON Responses: {results_df['json_valid'].sum()} ({100*json_validity_rate:.1f}%)")
    print(f"Accuracy: {100*accuracy:.2f}%")
    print(f"Mean Absolute Error: {mae:.3f}")
    print(f"{'='*80}\n")
    
    return results_df


print("✓ Evaluation function defined")
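
To make the reported metrics concrete, here is a minimal pure-Python sketch of what accuracy and MAE measure when failed responses are excluded, as `evaluate_approach` does. The toy ratings are invented for illustration and are not results from this notebook.

```python
# Toy (actual, predicted) pairs; None marks a response that failed JSON validation.
# These values are made up for illustration only.
pairs = [(5, 5), (4, 3), (3, 3), (2, None), (1, 2)]

# Only responses that produced a usable rating are scored
valid = [(actual, pred) for actual, pred in pairs if pred is not None]

# Accuracy: share of scored reviews where the predicted rating matches exactly
accuracy = sum(actual == pred for actual, pred in valid) / len(valid)

# MAE: average distance in stars between actual and predicted ratings
mae = sum(abs(actual - pred) for actual, pred in valid) / len(valid)

# Validity rate: share of all reviews that produced a usable rating
validity_rate = len(valid) / len(pairs)

print(f"Accuracy: {accuracy:.2f} | MAE: {mae:.2f} | Valid: {validity_rate:.0%}")
```

Because invalid responses are dropped before scoring, an approach with a lower validity rate is scored on fewer reviews, so accuracy is best read alongside the validity rate.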

## 7. Run All Evaluations

**Note**: This will take approximately 10-20 minutes depending on API response times.

**IMPORTANT**: If you have already run the evaluations and have the CSV files, you can skip this section and jump to Section 8.
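
As a rough check on the runtime estimate, the total is approximately the number of calls times the sum of the fixed delay and the API latency. The sketch below assumes every call succeeds on the first attempt. The latency values are assumptions chosen for illustration and are not measurements, and `estimate_runtime_minutes` is a hypothetical helper.

```python
def estimate_runtime_minutes(n_reviews: int, n_approaches: int,
                             delay_s: float, latency_s: float) -> float:
    """Rough wall-clock estimate, assuming no retries are triggered."""
    total_calls = n_reviews * n_approaches
    return total_calls * (delay_s + latency_s) / 60


# 200 reviews x 4 approaches with the default 0.5 s delay between calls.
# The latencies below are assumed values, not measured ones.
for latency in (0.25, 1.0):
    minutes = estimate_runtime_minutes(200, 4, delay_s=0.5, latency_s=latency)
    print(f"Assumed latency {latency:.2f} s -> about {minutes:.0f} minutes")
```

Retries add their own sleep time on top of this, so a run with many invalid responses will take longer.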

In [None]:
# Dictionary to store all results
all_results = {}

# Approach 1: Simple Direct
all_results['Simple Direct'] = evaluate_approach(
    "Approach 1: Simple Direct Prompting",
    create_simple_prompt,
    sample_df
)

# Save intermediate results
all_results['Simple Direct'].to_csv('results_simple_direct.csv', index=False)
print("✓ Saved results_simple_direct.csv")

In [None]:
# Approach 2: Few-Shot Learning
all_results['Few-Shot Learning'] = evaluate_approach(
    "Approach 2: Few-Shot Learning",
    create_fewshot_prompt,
    sample_df
)

# Save intermediate results
all_results['Few-Shot Learning'].to_csv('results_fewshot.csv', index=False)
print("✓ Saved results_fewshot.csv")

In [None]:
# Approach 3: Chain-of-Thought
all_results['Chain-of-Thought'] = evaluate_approach(
    "Approach 3: Chain-of-Thought (CoT)",
    create_cot_prompt,
    sample_df
)

# Save intermediate results
all_results['Chain-of-Thought'].to_csv('results_cot.csv', index=False)
print("✓ Saved results_cot.csv")

In [None]:
# Approach 4: Structured Criteria-Based
all_results['Structured Analysis'] = evaluate_approach(
    "Approach 4: Structured Criteria-Based Analysis",
    create_structured_prompt,
    sample_df
)

# Save intermediate results
all_results['Structured Analysis'].to_csv('results_structured.csv', index=False)
print("✓ Saved results_structured.csv")

## 8. Load Results from CSV Files

**Run this cell if you want to analyze previously generated results without re-running the API calls.**

In [None]:
# Load all results from CSV files
all_results = {
    'Simple Direct': pd.read_csv('results_simple_direct.csv'),
    'Few-Shot Learning': pd.read_csv('results_fewshot.csv'),
    'Chain-of-Thought': pd.read_csv('results_cot.csv'),
    'Structured Analysis': pd.read_csv('results_structured.csv')
}

print("✓ Loaded all results from CSV files")
print(f"\nResults summary:")
for approach_name, results_df in all_results.items():
    valid_count = results_df['json_valid'].sum()
    total = len(results_df)
    print(f"  {approach_name}: {total} reviews, {valid_count} valid ({100*valid_count/total:.1f}%)")

## 9. Comparative Analysis

In [None]:
# Create comparison table
comparison_data = []

for approach_name, results_df in all_results.items():
    valid_predictions = results_df[results_df['json_valid']]
    
    if len(valid_predictions) > 0:
        accuracy = accuracy_score(valid_predictions['actual_stars'], valid_predictions['predicted_stars'])
        mae = mean_absolute_error(valid_predictions['actual_stars'], valid_predictions['predicted_stars'])
    else:
        accuracy = 0.0
        mae = float('inf')
    
    json_validity = results_df['json_valid'].mean()
    
    comparison_data.append({
        'Approach': approach_name,
        'Accuracy (%)': f"{100*accuracy:.2f}",
        'Mean Absolute Error': f"{mae:.3f}",
        'JSON Validity Rate (%)': f"{100*json_validity:.2f}",
        'Valid Predictions': len(valid_predictions),
        'Total Reviews': len(results_df)
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df.to_csv('approach_comparison.csv', index=False)

print("\n" + "="*100)
print("COMPARISON TABLE: All Approaches")
print("="*100)
print(comparison_df.to_string(index=False))
print("="*100)

print("\n✓ Saved approach_comparison.csv")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

approaches = comparison_df['Approach'].tolist()
accuracies = [float(x) for x in comparison_df['Accuracy (%)']]
maes = [float(x) for x in comparison_df['Mean Absolute Error']]
json_rates = [float(x) for x in comparison_df['JSON Validity Rate (%)']]

# Accuracy comparison
axes[0].bar(approaches, accuracies, color='steelblue', edgecolor='black')
axes[0].set_title('Accuracy Comparison', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Accuracy (%)')
axes[0].set_ylim(0, 100)
axes[0].tick_params(axis='x', rotation=45)
for i, v in enumerate(accuracies):
    axes[0].text(i, v + 2, f'{v:.1f}%', ha='center', fontweight='bold')

# MAE comparison (lower is better)
axes[1].bar(approaches, maes, color='coral', edgecolor='black')
axes[1].set_title('Mean Absolute Error (Lower is Better)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('MAE')
axes[1].tick_params(axis='x', rotation=45)
for i, v in enumerate(maes):
    axes[1].text(i, v + 0.05, f'{v:.2f}', ha='center', fontweight='bold')

# JSON validity comparison
axes[2].bar(approaches, json_rates, color='mediumseagreen', edgecolor='black')
axes[2].set_title('JSON Validity Rate', fontsize=14, fontweight='bold')
axes[2].set_ylabel('Validity Rate (%)')
axes[2].set_ylim(0, 100)
axes[2].tick_params(axis='x', rotation=45)
for i, v in enumerate(json_rates):
    axes[2].text(i, v + 2, f'{v:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('approach_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved approach_comparison.png")

## 10. Detailed Analysis of Best Approach

In [None]:
# Find best approach by accuracy
best_approach_name = comparison_df.loc[comparison_df['Accuracy (%)'].astype(float).idxmax(), 'Approach']
best_results = all_results[best_approach_name]
best_valid = best_results[best_results['json_valid']]

print(f"\nBest Approach: {best_approach_name}\n")

# Classification report
print("Classification Report:")
print("="*80)
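print(classification_report(
    best_valid['actual_stars'], 
    best_valid['predicted_stars'].astype(int),
    labels=[1, 2, 3, 4, 5],  # keep all five classes even if one is absent from the predictions
    target_names=['1 star', '2 stars', '3 stars', '4 stars', '5 stars'],
    zero_division=0
))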
print("="*80)

In [None]:
# Confusion matrix
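# Fix the label set so the matrix is always 5x5 and matches the tick labels below
cm = confusion_matrix(
    best_valid['actual_stars'],
    best_valid['predicted_stars'].astype(int),
    labels=[1, 2, 3, 4, 5]
)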

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['1★', '2★', '3★', '4★', '5★'],
            yticklabels=['1★', '2★', '3★', '4★', '5★'],
            cbar_kws={'label': 'Count'})
plt.title(f'Confusion Matrix: {best_approach_name}', fontsize=16, fontweight='bold')
plt.xlabel('Predicted Rating', fontsize=12)
plt.ylabel('Actual Rating', fontsize=12)
plt.tight_layout()
plt.savefig('confusion_matrix_best.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved confusion_matrix_best.png")

## 11. Sample Predictions Analysis

In [None]:
# Show some correct predictions
correct_predictions = best_valid[best_valid['actual_stars'] == best_valid['predicted_stars']]

print("\n" + "="*100)
print("SAMPLE CORRECT PREDICTIONS")
print("="*100)

for i, row in correct_predictions.head(3).iterrows():
    print(f"\nReview: {row['review_text'][:200]}...")
    print(f"Actual: {row['actual_stars']}★ | Predicted: {int(row['predicted_stars'])}★")
    print(f"Explanation: {row['explanation']}")
    print("-" * 100)

In [None]:
# Show some incorrect predictions
incorrect_predictions = best_valid[best_valid['actual_stars'] != best_valid['predicted_stars']]

print("\n" + "="*100)
print("SAMPLE INCORRECT PREDICTIONS (for analysis)")
print("="*100)

for i, row in incorrect_predictions.head(3).iterrows():
    print(f"\nReview: {row['review_text'][:200]}...")
    print(f"Actual: {row['actual_stars']}★ | Predicted: {int(row['predicted_stars'])}★ | Error: {int(abs(row['actual_stars'] - row['predicted_stars']))}")
    print(f"Explanation: {row['explanation']}")
    print("-" * 100)

## 12. Discussion and Insights

### Prompt Evolution and Improvements

#### Approach 1 → 2: Adding Examples
**Change**: Added 5 diverse examples covering each rating level
**Rationale**: Few-shot learning helps calibrate the model's understanding of the rating scale
**Expected Impact**: Improved consistency and better boundary detection between ratings

#### Approach 2 → 3: Encouraging Reasoning
**Change**: Introduced step-by-step thinking process
**Rationale**: Chain-of-thought prompting improves performance on complex reasoning tasks
**Expected Impact**: Better handling of ambiguous or mixed-sentiment reviews

#### Approach 3 → 4: Structured Framework
**Change**: Explicit evaluation criteria and detailed rating guidelines
**Rationale**: Mimics professional review analysis process
**Expected Impact**: Most consistent and accurate predictions, especially for edge cases

### Key Findings

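1. **JSON Validity**: With the retry and extraction logic in Section 5, each approach is expected to exceed 95% valid responses; confirm this against the JSON Validity Rate column of the comparison table. Accuracy and MAE are computed over valid responses only.
2. **Accuracy Trends**: More sophisticated prompts are expected to perform better, but this is a hypothesis; the comparison table in Section 9 shows whether it held for this sample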
3. **Trade-offs**: 
   - Simple prompts: Faster, cheaper, but less accurate
   - Complex prompts: More accurate, but higher token cost and latency

### Common Challenges

1. **Sarcasm Detection**: LLMs may miss sarcastic reviews
2. **Mixed Reviews**: Reviews with both positive and negative aspects are harder to rate
3. **Context Sensitivity**: Some reviews require domain knowledge (e.g., restaurant vs. service business)
4. **Rating Scale Interpretation**: 3-star reviews are most ambiguous
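
One way to check the claim about ambiguous mid-scale ratings is to break the error down by actual rating. The following is a minimal sketch assuming `(actual, predicted)` pairs; `mae_by_actual_rating` is a hypothetical helper and the toy rows are invented for illustration, not taken from the saved results.

```python
from collections import defaultdict


def mae_by_actual_rating(rows):
    """Mean absolute error grouped by actual rating; failed predictions (None) are skipped."""
    errors = defaultdict(list)
    for actual, predicted in rows:
        if predicted is None:
            continue
        errors[actual].append(abs(actual - predicted))
    return {stars: sum(errs) / len(errs) for stars, errs in sorted(errors.items())}


# Invented (actual, predicted) pairs for illustration only
toy_rows = [(1, 1), (1, 2), (3, 4), (3, 2), (3, 3), (5, 5), (5, None)]

per_rating_mae = mae_by_actual_rating(toy_rows)
for stars, value in per_rating_mae.items():
    print(f"{stars} star(s): MAE {value:.2f}")
```

Applied to the real results, the same grouping would show which rating levels contribute most of the error.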

### Recommendations

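- **For Production**: Few-Shot or Structured is the likely candidate for balancing accuracy against cost; confirm against the comparison table before committing
- **For Speed**: Simple Direct approach with post-processing, since it has the shortest prompt
- **For Accuracy**: Structured Criteria-Based with ensemble voting, provided the measured gain justifies the extra calls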

### Future Improvements

1. **Ensemble Methods**: Combine multiple approaches and vote
2. **Fine-tuning**: Train a specialized model on Yelp data
3. **Active Learning**: Identify uncertain predictions for human review
4. **Domain Adaptation**: Customize prompts for specific business types
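
The ensemble and active-learning ideas can be combined: take a majority vote across approaches and flag reviews where the approaches disagree. The following is a minimal sketch under stated assumptions. `ensemble_vote`, the tie-break rule, and the 0.6 review threshold are all hypothetical choices, not something evaluated in this notebook.

```python
from collections import Counter
from statistics import median


def ensemble_vote(predictions):
    """Return (rating, agreement) for one review.

    predictions: one rating per approach, with None for failed calls.
    agreement: share of valid votes that went to the winning rating.
    """
    valid = [p for p in predictions if p is not None]
    if not valid:
        return None, 0.0

    counts = Counter(valid)
    top_count = max(counts.values())
    winners = [rating for rating, c in counts.items() if c == top_count]

    # Tie-break (an assumption): pick the tied rating closest to the median vote,
    # and the lower rating if the distances are equal
    med = median(valid)
    rating = min(winners, key=lambda r: (abs(r - med), r))
    return rating, top_count / len(valid)


REVIEW_THRESHOLD = 0.6  # hypothetical cut-off for sending a review to a human

# Invented votes from the four approaches for three reviews
votes_per_review = [[4, 4, 5, 4], [2, 3, None, 3], [1, 5, None, None]]

for votes in votes_per_review:
    rating, agreement = ensemble_vote(votes)
    flag = "needs human review" if agreement < REVIEW_THRESHOLD else "ok"
    print(f"votes={votes} -> rating={rating}, agreement={agreement:.2f} ({flag})")
```

Voting multiplies the number of API calls per review by the number of approaches, so the accuracy gain would need to be measured against that cost.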

## 13. Summary Statistics

In [None]:
print("\n" + "="*100)
print("FINAL SUMMARY")
print("="*100)
print(f"\nTotal Reviews Evaluated: {len(sample_df)}")
print(f"Number of Approaches Tested: {len(all_results)}")
print(f"\nBest Approach: {best_approach_name}")
print(f"Best Accuracy: {comparison_df.loc[comparison_df['Approach'] == best_approach_name, 'Accuracy (%)'].values[0]}%")
print(f"\nGenerated Files:")
print("  - approach_comparison.csv")
print("  - approach_comparison.png")
print("  - confusion_matrix_best.png")
print("  - results_simple_direct.csv")
print("  - results_fewshot.csv")
print("  - results_cot.csv")
print("  - results_structured.csv")
print("\n" + "="*100)
print("✓ Task 1 Complete!")
print("="*100)