# Yelp Rating Drop Early-Warning System
This notebook orchestrates the full analytical pipeline.

**Stages:**
1. Data Preparation
2. Optimized Model Training
3. LLM Root Cause Analysis
4. Dashboard Generation
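
The cells below repeat the same pattern for each stage: print a timestamp, run the stage, record success, and stop the pipeline only when a critical stage fails. A minimal sketch of that pattern as a reusable helper follows. The names `run_stage` and `pipeline_complete` are hypothetical and do not appear in this notebook, and the stage is passed as a plain callable instead of a `%run` magic.

```python
from datetime import datetime
from typing import Callable, Dict, Set


def run_stage(name: str, stage: Callable[[], object], critical: bool = True) -> bool:
    """Run one stage and return True on success.

    A failing critical stage re-raises so the pipeline stops;
    a failing optional stage is reported and the pipeline continues.
    """
    print(f"{name} started: {datetime.now().strftime('%H:%M:%S')}")
    try:
        stage()
    except Exception as exc:
        print(f"[failed] {name}: {exc}")
        if critical:
            raise
        return False
    print(f"[ok] {name} completed: {datetime.now().strftime('%H:%M:%S')}")
    return True


def pipeline_complete(results: Dict[str, bool], optional: Set[str]) -> bool:
    """Overall success ignores the stages listed as optional."""
    return all(ok for name, ok in results.items() if name not in optional)
```

With this helper, the LLM stage would be called with `critical=False`, which matches how the summary cell treats it as optional.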


## Setup


In [None]:
import os
from datetime import datetime

print("=" * 70)
print("YELP RATING DROP EARLY-WARNING SYSTEM")
print("Full Pipeline Execution")
print("=" * 70)
print(f"\nStarted: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Working directory: {os.getcwd()}\n")


Full Pipeline Execution

Started: 2025-12-01 20:27:37
Working directory: /Users/valeira/Desktop/yelp-rating-alert-system



## Stage 1: Data Preparation

Loads the raw Yelp data, keeps only Philadelphia restaurants and their reviews, then cleans the reviews and builds features.

**Outputs:**
- `data/processed/reviews_clean.csv`
- `data/processed/reviews_features.csv`
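
The filtering step can be sketched as two passes: collect the Philadelphia restaurant IDs, then stream the review file and keep only matching reviews. This is an illustration only, not the code in `data_prep.ipynb`. It assumes the raw files are newline-delimited JSON with `business_id`, `city`, and `categories` fields, and the function names are hypothetical.

```python
import json
from typing import IO, Iterator, Set


def philadelphia_restaurant_ids(business_lines: IO[str]) -> Set[str]:
    """Collect IDs of businesses in Philadelphia tagged as restaurants."""
    ids = set()
    for line in business_lines:
        if not line.strip():
            continue  # skip blank lines
        biz = json.loads(line)
        categories = biz.get("categories") or ""  # assumed to be a string or null
        if biz.get("city") == "Philadelphia" and "Restaurants" in categories:
            ids.add(biz["business_id"])
    return ids


def filter_reviews(review_lines: IO[str], keep_ids: Set[str]) -> Iterator[dict]:
    """Yield matching reviews one at a time, so the whole file is never held in memory."""
    for line in review_lines:
        if not line.strip():
            continue
        review = json.loads(line)
        if review["business_id"] in keep_ids:
            yield review
```

Streaming line by line matters here because the review file is far larger than the business file, as the progress counts in the output below suggest.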


In [None]:
print("\n" + "=" * 70)
print("STAGE 1: DATA PREPARATION")
print("=" * 70)
print(f"Started: {datetime.now().strftime('%H:%M:%S')}\n")

try:
    %run data_prep.ipynb
    print(f"\n✓ Stage 1 completed: {datetime.now().strftime('%H:%M:%S')}")
    stage_1_success = True
except Exception as e:
    print(f"\n✗ Stage 1 failed: {e}")
    stage_1_success = False
    raise



STAGE 1: DATA PREPARATION
Started: 20:27:37

Directories created successfully!
Total businesses loaded: 150,346
Philadelphia restaurants: 5,852

Sample business IDs saved for filtering reviews
Unique business IDs: 5,852
Loading reviews (this may take a few minutes)...
Processed 100,000 reviews, found 12,498 Philadelphia reviews
Processed 200,000 reviews, found 25,327 Philadelphia reviews
Processed 300,000 reviews, found 37,858 Philadelphia reviews
Processed 400,000 reviews, found 47,970 Philadelphia reviews
Processed 500,000 reviews, found 55,951 Philadelphia reviews
Processed 600,000 reviews, found 63,135 Philadelphia reviews
Processed 700,000 reviews, found 70,113 Philadelphia reviews
Processed 800,000 reviews, found 82,501 Philadelphia reviews
Processed 900,000 reviews, found 95,975 Philadelphia reviews
Processed 1,000,000 reviews, found 109,595 Philadelphia reviews
Processed 1,100,000 reviews, found 121,428 Philadelphia reviews
Processed 1,200,000 reviews, found 130,773 Philadelph

## Stage 2: Optimized Model Training

Trains a Random Forest classifier with:
- 31 engineered features
- SMOTE oversampling to balance the classes
- An optimized decision threshold

**Outputs:**
- `models/rating_drop_model_optimized.pkl`
- `models/model_summary_optimized.csv`
- `figures/model_results_optimized.png`
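
One common way to choose a decision threshold is to scan a grid of candidates on held-out data and keep the one with the best score. The sketch below uses F1 on the positive class as the criterion. That choice is an assumption, since the criterion used in `end_to_end_pipeline.ipynb` is not shown here, and both function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def f1_at_threshold(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(
    y_true: Sequence[int],
    scores: Sequence[float],
    candidates: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Return (threshold, f1) with the highest F1; ties go to the lowest threshold."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 96)]  # 0.05, 0.06, ..., 0.95
    scored = ((t, f1_at_threshold(y_true, scores, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])
```

Here `scores` would be the model's predicted probabilities for the rating-drop class on a validation split. The threshold should be selected on data the model was not trained on, otherwise the chosen value tends to be optimistic.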


In [None]:
print("\n" + "=" * 70)
print("STAGE 2: OPTIMIZED MODEL TRAINING")
print("=" * 70)
print(f"Started: {datetime.now().strftime('%H:%M:%S')}\n")

try:
    %run end_to_end_pipeline.ipynb
    print(f"\n✓ Stage 2 completed: {datetime.now().strftime('%H:%M:%S')}")
    stage_2_success = True
except Exception as e:
    print(f"\n✗ Stage 2 failed: {e}")
    stage_2_success = False
    raise



STAGE 2: OPTIMIZED MODEL TRAINING
Started: 20:27:53





Libraries imported.
Loaded data from: data/processed/reviews_features.csv
Shape: (99997, 27)
              business_id                date
0  -0M0b-XhtFagyLmsBtOe8w 2012-02-21 18:24:45
1  -0M0b-XhtFagyLmsBtOe8w 2013-12-03 23:56:37
2  -0M0b-XhtFagyLmsBtOe8w 2015-06-24 22:10:39
3  -0PN_KFPtbnLQZEeb23XiA 2011-11-23 22:10:01
4  -0TffRSXXIlBYVbb5AwfTg 2013-06-01 01:47:50

UPGRADING LABEL — FUTURE 7-DAY RATING DROP
Label distribution (proportion):
label_rating_drop
0    0.802824
1    0.197176
Name: proportion, dtype: float64
Dropped 10119 rows due to NaNs in key fields.

PRIORITY 3: ENHANCED FEATURE ENGINEERING

Creating interaction features...
✓ Interaction features created

Creating time-based features...
✓ Time-based features created

Creating business-specific features...
✓ Business-specific features created

Creating acceleration features...
✓ Acceleration features created

Enhanced feature engineering complete.
Total numeric features used: 31
['stars_review', 'useful', 'funny', 'cool',

## Stage 3: LLM Root Cause Analysis

Analyzes negative reviews with Llama 3.2 via Ollama.

**Prerequisites:** Ollama must be running with the Llama 3.2 model available:
```bash
ollama run llama3.2
```

**Outputs:**
- `results/llm_complaint_analysis.csv`


In [None]:
print("\n" + "=" * 70)
print("STAGE 3: LLM ROOT CAUSE ANALYSIS")
print("=" * 70)
print(f"Started: {datetime.now().strftime('%H:%M:%S')}\n")

# Check Ollama availability
try:
    import requests

    response = requests.get("http://localhost:11434/api/tags", timeout=2)
    if response.status_code == 200:
        print("✓ Ollama is running\n")
        ollama_available = True
    else:
        print("⚠ Ollama not responding properly\n")
        ollama_available = False
except Exception:  # connection refused, timeout, or requests not installed
    print("⚠ Cannot connect to Ollama")
    print("  Start it with: ollama run llama3.2\n")
    ollama_available = False

if ollama_available:
    try:
        %run llm_root_cause_analysis.ipynb
        print(f"\n✓ Stage 3 completed: {datetime.now().strftime('%H:%M:%S')}")
        stage_3_success = True
    except Exception as e:
        print(f"\n✗ Stage 3 failed: {e}")
        print("Continuing to dashboards...")
        stage_3_success = False
else:
    print("Skipping LLM analysis (Ollama not available)")
    stage_3_success = False



STAGE 3: LLM ROOT CAUSE ANALYSIS
Started: 20:28:20

✓ Ollama is running

Libraries loaded!

Make sure Ollama is running with Llama 3.2:
  Terminal command: ollama run llama3.2
Testing Ollama connection...

✓ Ollama is working!

Test result:
{
  "food_quality": true,
  "service_speed": true,
  "staff_behavior": true,
  "cleanliness": false,
  "portion_size": false,
  "pricing": false,
  "order_accuracy": false,
  "severity": "high",
  "primary_issue": "The food was cold and the waiter was very rude. We waited 45 minutes for our order."
}
Loaded data: 89,878 reviews

Negative reviews (≤3 stars): 27,933
Selected 20 negative reviews for LLM analysis

Rating distribution in sample:
stars_review
1.0    10
2.0     1
3.0     9
Name: count, dtype: int64

Analyzing reviews with Llama 3.2...
This may take 1-2 minutes (about 1 second per review)

Analyzing review 19/20...

✓ Analysis complete!
  Successful: 19/20
  Errors: 1

Analyzed 19 reviews

DataFrame columns: ['food_quality', 'service_speed

  category_flags = complaint_df[available_cols].applymap(


## Stage 4: Dashboard Generation & Final Report

Creates interactive dashboards and a comprehensive report.

**Outputs:**
- `dashboards/dashboard_overview.html`
- `dashboards/dashboard_alerts.html`
- `dashboards/trend_analysis.html`
- `reports/final_project_report.txt`


In [None]:
print("\n" + "=" * 70)
print("STAGE 4: DASHBOARDS & FINAL REPORT")
print("=" * 70)
print(f"Started: {datetime.now().strftime('%H:%M:%S')}\n")

try:
    %run rating_drop_model.ipynb
    print(f"\n✓ Stage 4 completed: {datetime.now().strftime('%H:%M:%S')}")
    stage_4_success = True
except Exception as e:
    print(f"\n✗ Stage 4 failed: {e}")
    stage_4_success = False
    raise



STAGE 4: DASHBOARDS & FINAL REPORT
Started: 20:28:58





Libraries imported and output directories ensured.
Loaded optimized features from: data/processed/reviews_features_optimized.csv
Shape: (89878, 41)
Loaded optimized model from: models/rating_drop_model_optimized.pkl
Loaded 31 features from: models/feature_list_optimized.txt
Using 31 features present in dataset.
Loaded model summary from: models/model_summary_optimized.csv
Loaded LLM complaint analysis from: results/llm_complaint_analysis.csv
Computed risk scores and risk_flag using optimal threshold: 0.29
              business_id                                   name  \
0  -0M0b-XhtFagyLmsBtOe8w                         Paris Wine Bar   
1  -0TffRSXXIlBYVbb5AwfTg  IndeBlue Modern Indian Food & Spirits   
2  -0TffRSXXIlBYVbb5AwfTg  IndeBlue Modern Indian Food & Spirits   
3  -0TffRSXXIlBYVbb5AwfTg  IndeBlue Modern Indian Food & Spirits   
4  -0TffRSXXIlBYVbb5AwfTg  IndeBlue Modern Indian Food & Spirits   

                 date  stars_business  risk_score  risk_flag  
0 2015-06-24 22:1


DataFrame.applymap has been deprecated. Use DataFrame.map instead.



## Pipeline Summary


In [None]:
print("\n" + "=" * 70)
print("PIPELINE EXECUTION SUMMARY")
print("=" * 70)
print(f"\nCompleted: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\nStage Results:")
print(
    f"  1. Data Preparation:          {'✓ SUCCESS' if stage_1_success else '✗ FAILED'}"
)
print(
    f"  2. Model Training:            {'✓ SUCCESS' if stage_2_success else '✗ FAILED'}"
)
print(
    f"  3. LLM Analysis:              {'✓ SUCCESS' if stage_3_success else '⚠ SKIPPED/FAILED'}"
)
print(
    f"  4. Dashboards & Report:       {'✓ SUCCESS' if stage_4_success else '✗ FAILED'}"
)

all_critical_success = stage_1_success and stage_2_success and stage_4_success
print(
    f"\nOverall Status: {'✓ PIPELINE COMPLETE' if all_critical_success else '⚠ PIPELINE INCOMPLETE'}"
)
print("Note: LLM stage is optional and does not affect overall success.")



PIPELINE EXECUTION SUMMARY

Completed: 2025-12-01 20:28:59

Stage Results:
  1. Data Preparation:          ✓ SUCCESS
  2. Model Training:            ✓ SUCCESS
  3. LLM Analysis:              ✓ SUCCESS
  4. Dashboards & Report:       ✓ SUCCESS

Overall Status: ✓ PIPELINE COMPLETE
Note: LLM stage is optional and does not affect overall success.


## Output Verification


In [None]:
print("\n" + "=" * 70)
print("OUTPUT VERIFICATION")
print("=" * 70)

expected_files = {
    "Data Files": [
        "data/processed/reviews_clean.csv",
        "data/processed/reviews_features.csv",
        "data/processed/reviews_features_optimized.csv",
    ],
    "Model Files": [
        "models/rating_drop_model_optimized.pkl",
        "models/model_summary_optimized.csv",
        "models/feature_list_optimized.txt",
    ],
    "Figures": [
        "figures/model_results_optimized.png",
        "figures/threshold_optimization.png",
    ],
    "Dashboards": [
        "dashboards/dashboard_overview.html",
        "dashboards/dashboard_alerts.html",
    ],
    "Results": [
        "results/llm_complaint_analysis.csv",
    ],
}

for category, files in expected_files.items():
    print(f"\n{category}:")
    for filepath in files:
        exists = os.path.exists(filepath)
        if exists:
            size = os.path.getsize(filepath) / 1024
            size_str = f"{size:.1f} KB" if size < 1024 else f"{size / 1024:.1f} MB"
        else:
            size_str = "N/A"
        status = "✓" if exists else "✗"
        print(f"  {status} {filepath:50s} {size_str:>12s}")

print("\n" + "=" * 70)
print("✓ Pipeline verification complete")
print("=" * 70)



OUTPUT VERIFICATION

Data Files:
  ✓ data/processed/reviews_clean.csv                        70.9 MB
  ✓ data/processed/reviews_features.csv                     84.0 MB
  ✓ data/processed/reviews_features_optimized.csv           85.1 MB

Model Files:
  ✓ models/rating_drop_model_optimized.pkl                   9.9 MB
  ✓ models/model_summary_optimized.csv                       0.4 KB
  ✓ models/feature_list_optimized.txt                        0.4 KB

Figures:
  ✓ figures/model_results_optimized.png                     83.8 KB
  ✓ figures/threshold_optimization.png                      53.0 KB

Dashboards:
  ✓ dashboards/dashboard_overview.html                      12.2 MB
  ✓ dashboards/dashboard_alerts.html                         4.6 MB

Results:
  ✓ results/llm_complaint_analysis.csv                       6.7 KB

✓ Pipeline verification complete


## Next Steps

1. **View Dashboards:** Open HTML files in `dashboards/` with a browser
2. **Read Report:** Check `reports/final_project_report.txt`
3. **Review Metrics:** See `models/model_summary_optimized.csv`
4. **Analyze Results:** Examine `results/llm_complaint_analysis.csv`

---

**Project:** Yelp Rating Drop Early-Warning System  
**Author:** Fancheng  
**Institution:** UC San Diego, MSBA Program
