# Task 1: Yelp Rating Prediction via Prompting (OpenRouter)
Prompt-driven 1–5 star prediction on Yelp reviews with JSON output. Meets PDF requirements: >=3 prompt strategies, accuracy + JSON validity + reliability/consistency, comparison table, and documented prompt iterations.



## What this notebook does
- Loads a ~200-row Yelp sample (`data/yelp_sample.csv`).
- Runs at least 3 prompt strategies against an OpenRouter-hosted model.
- Validates strict JSON schema `{predicted_stars, explanation}` and enforces rating 1–5.
- Computes accuracy, JSON validity, and consistency (repeat-run agreement on a subset).
- Saves detailed predictions and summary tables to `results/`.
- Optional: ML baseline for reference.



## Environment and dependencies
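Set `OPENROUTER_API_KEY` (or `OR_API_KEY`) in `.env` or the environment. Optional: `MODEL_NAME` (default: `mistralai/mistral-7b-instruct`), `TEMPERATURE` (default: 0.0), `MAX_TOKENS` (default: 128), `USE_JSON_FORMAT` (default: true), and `CONSISTENCY_RUNS` (default: 0, which skips the consistency pass).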
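
The configuration cell reads these variables with `os.getenv` and string defaults. The sketch below mirrors that pattern with a hypothetical helper, `read_settings`, that takes a plain dict so it can be tried without touching the real environment. The helper name and the returned keys are illustrative assumptions, not part of the notebook.

```python
def read_settings(env):
    """Hypothetical helper mirroring the notebook's defaults; `env` is any dict-like mapping."""
    return {
        'model_name': env.get('MODEL_NAME', 'mistralai/mistral-7b-instruct'),
        'temperature': float(env.get('TEMPERATURE', '0.0')),
        'max_tokens': int(env.get('MAX_TOKENS', '128')),
        # Only the literal string "true" (any casing) enables JSON mode.
        'use_json_format': env.get('USE_JSON_FORMAT', 'true').lower() == 'true',
        'consistency_runs': int(env.get('CONSISTENCY_RUNS', '0')),
    }

defaults = read_settings({})
custom = read_settings({'TEMPERATURE': '0.2', 'USE_JSON_FORMAT': 'yes', 'CONSISTENCY_RUNS': '3'})
print(defaults)
print(custom)
```

Note that with this comparison a value such as `yes` or `1` turns JSON mode off, so `USE_JSON_FORMAT` should be set to exactly `true` or `false`.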



In [37]:
!pip install -q python-dotenv requests typing_extensions pydantic scikit-learn pandas numpy tqdm





In [38]:

# Imports and config
import os, json, random, time
from pathlib import Path
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import requests
from tqdm.auto import tqdm

load_dotenv()
OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY') or os.getenv('OR_API_KEY')
MODEL_NAME = os.getenv('MODEL_NAME', 'mistralai/mistral-7b-instruct')
SYSTEM_PROMPT = 'You return only JSON with keys predicted_stars (int 1-5) and explanation (string). No prose or markdown.'
TEMPERATURE = float(os.getenv('TEMPERATURE', '0.0'))
MAX_TOKENS = int(os.getenv('MAX_TOKENS', '128'))
USE_JSON_FORMAT = os.getenv('USE_JSON_FORMAT', 'true').lower() == 'true'

if not OPENROUTER_API_KEY:
    print('OpenRouter API key not set; LLM inference will be skipped.')

headers = {
    'Authorization': f'Bearer {OPENROUTER_API_KEY}' if OPENROUTER_API_KEY else '',
    'HTTP-Referer': os.getenv('OPENROUTER_REFERRER', 'https://localhost'),
    'X-Title': os.getenv('OPENROUTER_APP', 'fynd-task1'),
}

BASE_URL = os.getenv('OPENROUTER_BASE_URL', 'https://openrouter.ai/api/v1/chat/completions')

data_path = Path('data')
results_path = Path('results')
data_path.mkdir(exist_ok=True)
results_path.mkdir(exist_ok=True)
N_SAMPLE = 200
CONSISTENCY_ROWS = 30  # subset size for consistency check
CONSISTENCY_RUNS = int(os.getenv('CONSISTENCY_RUNS', '0'))  # repeat runs for the consistency check; 0 (the default) skips it
RANDOM_SEED = 42






## Dataset
Uses `data/yelp_sample.csv` with columns `text` (review) and `stars` (ground truth). Sample up to 200 rows for quick iteration.

In [39]:

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

dataset_path = data_path / 'yelp_sample.csv'
if not dataset_path.exists():
    raise FileNotFoundError('data/yelp_sample.csv missing. Provide columns: text, stars.')

df = pd.read_csv(dataset_path)
sample_size = min(N_SAMPLE, len(df))
df_sample = df.sample(sample_size, random_state=RANDOM_SEED) if len(df) > sample_size else df.copy()
print(f'Loaded {len(df)} rows; using sample of {len(df_sample)}.')
df_sample.head()




Loaded 200 rows; using sample of 200.


Unnamed: 0,stars,text
0,4,We got here around midnight last Friday... the...
1,5,Brought a friend from Louisiana here. She say...
2,3,"Every friday, my dad and I eat here. We order ..."
3,1,"My husband and I were really, really disappoin..."
4,5,Love this place! Was in phoenix 3 weeks for w...



## Prompt strategies (three+)
1) **Baseline**: Direct ask for JSON rating + brief reason.
2) **Chain-of-thought lite**: Brief reasoning steps before JSON.
3) **Guarded self-check**: Draft, validate range, fix JSON, then output.
4) **Rubric-guided**: Uses an explicit star rubric to anchor the class choice.




### Why these prompt versions
- Baseline: minimal JSON-only ask; sets the target schema.
- Chain-of-thought lite: nudges the model to reason about aspects (service/food/etc.) before answering to lift accuracy.
- Guarded self-check: adds explicit validation and repair steps to reduce out-of-range ratings and malformed JSON.
- Rubric-guided: anchors the class decision to an explicit 1-5 rubric to reduce ambiguity and improve accuracy.



In [40]:

baseline_prompt = """
You are a strict JSON generator. Given a Yelp review, output a JSON object with:
- predicted_stars: integer 1-5
- explanation: brief reason
Respond with JSON only (no markdown, no prose, no code fences).
Review: {review_text}
"""

cot_prompt = """
You reason briefly then output JSON only.
Steps:
1) Consider sentiment, service, food, ambiance, and price/value.
2) Decide 1-5 stars using this rubric: 1=awful, 2=poor, 3=mixed/ok, 4=good, 5=excellent.
3) Output JSON only (no text): {{"predicted_stars": <int>, "explanation": "..."}}
Review: {review_text}
"""

selfcheck_prompt = """
You draft a rating and self-check it.
- Draft rating 1-5 with a short reason.
- If rating outside 1-5 or JSON invalid, fix it using this rubric: 1=awful, 2=poor, 3=mixed/ok, 4=good, 5=excellent.
- Return final JSON only (no text): {{"predicted_stars": <int>, "explanation": "..."}}
Review: {review_text}
"""

rubric_prompt = """
Classify the review strictly using this rubric and respond with JSON only (no text):
1 = awful (strongly negative)
2 = poor (mostly negative)
3 = mixed/ok (balanced or neutral)
4 = good (mostly positive)
5 = excellent (strongly positive)
Output JSON: {{"predicted_stars": <int>, "explanation": "brief reason"}}
Review: {review_text}
"""
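
The templates above are filled in with `str.format`, which is why the literal JSON braces are doubled (`{{` and `}}`) while `{review_text}` stays single. A small sketch of that behaviour, using a shortened template and a made-up review as stand-ins:

```python
# Shortened stand-in for the notebook's templates.
template = """
Output JSON only: {{"predicted_stars": <int>, "explanation": "..."}}
Review: {review_text}
"""

# Braces inside the review are safe: str.format parses only the template,
# not the value substituted into it.
review = 'Great tacos {really}, slow service.'
prompt = template.format(review_text=review)
print(prompt)
```

Forgetting to double a literal brace in a template would make `format` raise a `KeyError` or `ValueError` at call time, so this is worth checking whenever a new prompt is added.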



## Helpers: JSON parsing, validation, and model call
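- Enforces the schema (both keys present) and the 1-5 rating range.
- Extracts JSON that the model wrapped in prose, re-prompts once (`MAX_REPAIRS`) if the reply is still invalid, and as a last resort pulls a rating digit out of the raw text.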

In [41]:

import re

MAX_REPAIRS = 1


def parse_json_str(text: str):
    try:
        return json.loads(text)
    except Exception:
        start = text.find('{')
        end = text.rfind('}')
        if start != -1 and end != -1 and end > start:
            try:
                return json.loads(text[start:end+1])
            except Exception:
                return None
    return None

def validate_payload(raw):
    if not isinstance(raw, dict):
        return None, False, 'not a dict'
    if 'predicted_stars' not in raw or 'explanation' not in raw:
        return None, False, 'missing keys'
    try:
        rating = int(round(float(raw['predicted_stars'])))
    except Exception:
        return None, False, 'bad rating type'
    if rating < 1 or rating > 5:
        return None, False, 'rating out of range'
    explanation = str(raw.get('explanation', '')).strip()
    payload = {'predicted_stars': rating, 'explanation': explanation}
    return payload, True, None

def parse_rating_from_text(text: str):
    # Last-resort fallback: take the first standalone digit 1-5,
    # so the "1" in "10" is not mistaken for a rating.
    m = re.search(r'(?<!\d)[1-5](?!\d)', text)
    if m:
        return int(m.group(0))
    return None

def build_body(prompt: str):
    body = {
        'model': MODEL_NAME,
        'messages': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': prompt},
        ],
        'temperature': TEMPERATURE,
        'max_tokens': MAX_TOKENS,
    }
    if USE_JSON_FORMAT:
        body['response_format'] = {'type': 'json_object'}
    return body

def call_model(prompt_template: str, review_text: str):
    if not OPENROUTER_API_KEY:
        raise RuntimeError('missing_openrouter_key')
    prompt = prompt_template.format(review_text=review_text)
    resp = requests.post(BASE_URL, headers=headers, json=build_body(prompt), timeout=60)
    try:
        resp.raise_for_status()
    except requests.HTTPError as e:
        # Keep the original exception chained for easier debugging.
        raise RuntimeError(f'http_error: {resp.status_code} {resp.text}') from e
    data = resp.json()
    return data['choices'][0]['message']['content']

def repair_response(review_text: str, raw: str):
    fix_prompt = f"""
You must return valid JSON only (no markdown, no extra text).
Keys: predicted_stars (int 1-5), explanation (string).
Use the rubric 1=awful, 2=poor, 3=mixed/ok, 4=good, 5=excellent.
Review: {review_text}
Previous invalid output: {raw}
Return ONLY corrected JSON.
"""
    try:
        resp = requests.post(BASE_URL, headers=headers, json=build_body(fix_prompt), timeout=60)
        resp.raise_for_status()
        data = resp.json()
        return data['choices'][0]['message']['content']
    except Exception:
        return None

def run_prompt(prompt_template: str, review_text: str):
    if not OPENROUTER_API_KEY:
        return {'predicted_stars': None, 'explanation': 'no_api_key'}
    attempts = 0
    content = None
    while attempts <= MAX_REPAIRS:
        try:
            if attempts == 0:
                content = call_model(prompt_template, review_text)
            else:
                content = repair_response(review_text, content or '')
        except Exception as e:
            return {'predicted_stars': None, 'explanation': f'call_error: {e}'}
        parsed = parse_json_str(content) if content else None
        payload, ok, err = validate_payload(parsed) if parsed is not None else (None, False, 'parse_error')
        if ok:
            return payload
        attempts += 1
    rating_guess = parse_rating_from_text(content or '')
    if rating_guess:
        return {'predicted_stars': rating_guess, 'explanation': 'parsed_from_raw', 'raw': content}
    return {'predicted_stars': None, 'explanation': 'parse_error', 'raw': content}
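
To see how the parse, validate, and fallback steps treat different replies without calling the API, here is a condensed offline sketch. It is a simplified stand-in, not the notebook's helpers: the function names `extract_json` and `classify_reply`, the canned replies, and the standalone-digit regex are assumptions, and the repair re-prompt is omitted.

```python
import json
import re

def extract_json(text):
    """Parse text as JSON, falling back to the outermost {...} span."""
    try:
        return json.loads(text)
    except ValueError:
        start, end = text.find('{'), text.rfind('}')
        if start != -1 and end > start:
            try:
                return json.loads(text[start:end + 1])
            except ValueError:
                return None
    return None

def classify_reply(text):
    """Return (rating, source) where source is 'json', 'regex', or 'none'."""
    parsed = extract_json(text)
    if isinstance(parsed, dict) and 'predicted_stars' in parsed and 'explanation' in parsed:
        try:
            rating = int(round(float(parsed['predicted_stars'])))
        except (TypeError, ValueError, OverflowError):
            rating = None
        if rating is not None and 1 <= rating <= 5:
            return rating, 'json'
    # Fallback: first standalone digit between 1 and 5 in the raw text.
    match = re.search(r'(?<!\d)[1-5](?!\d)', text)
    if match:
        return int(match.group(0)), 'regex'
    return None, 'none'

# Canned replies (made up for illustration).
replies = [
    '{"predicted_stars": 4, "explanation": "good food"}',
    'Sure! Here is the JSON: {"predicted_stars": "2", "explanation": "rude staff"} Hope this helps.',
    '{"predicted_stars": 7, "explanation": "amazing"}',
    'I would give this 3 stars.',
]
results = [classify_reply(r) for r in replies]
for reply, result in zip(replies, results):
    print(result, '<-', reply)
```

Tracking the source alongside the rating makes it easy to tell schema-valid replies apart from ratings salvaged out of free text.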



## Strategy registry
Tune temperature or other knobs per prompt if desired.

In [42]:

strategies = {
    'baseline': baseline_prompt,
    'cot': cot_prompt,
    'selfcheck': selfcheck_prompt,
    'rubric': rubric_prompt,
}
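
If per-prompt tuning is wanted, one possible shape for the registry pairs each template with optional overrides. This is a sketch under assumptions: `strategy_configs`, `BASE_PARAMS`, and `params_for` are hypothetical names, the templates are shortened placeholders, and `build_body` would need to accept the merged parameters for this to take effect.

```python
BASE_PARAMS = {'temperature': 0.0, 'max_tokens': 128}

# Hypothetical registry shape: each entry pairs a template with optional overrides.
strategy_configs = {
    'baseline': {'prompt': 'Review: {review_text}', 'params': {}},
    'cot': {'prompt': 'Think briefly, then answer. Review: {review_text}',
            'params': {'max_tokens': 256}},
}

def params_for(name):
    """Merge a strategy's overrides over the shared defaults without mutating them."""
    return {**BASE_PARAMS, **strategy_configs[name]['params']}

for name in strategy_configs:
    print(name, params_for(name))
```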



## Inference loop (main evaluation set)
Runs each strategy once over the sampled dataset. Skips if no API key.

In [43]:

from typing import Dict, List

def run_batch(df_in, tag: str, repeats: int = 1):
    rows: List[Dict] = []
    for rep in range(repeats):
        for strat, prompt in strategies.items():
            for idx, row in tqdm(df_in.iterrows(), total=len(df_in), leave=False):
                review = row['text']
                truth = row['stars']
                result = run_prompt(prompt, review)
                payload, ok, err = validate_payload(result)
                # A rating salvaged by the regex fallback is kept as a prediction,
                # but the reply was not valid JSON, so it must not count toward json_valid_rate.
                valid_json = ok and 'raw' not in result
                rows.append({
                    'strategy': strat,
                    'tag': tag,
                    'repeat': rep,
                    'sample_id': int(idx),
                    'review': review,
                    'ground_truth': truth,
                    'pred': payload['predicted_stars'] if ok else None,
                    'valid_json': valid_json,
                    'explanation': payload['explanation'] if ok else result.get('explanation', ''),
                    'raw': result.get('raw', None),
                })
    return pd.DataFrame(rows)

if not OPENROUTER_API_KEY:
    print('Skipping LLM inference; set OPENROUTER_API_KEY to run.')
    res_eval = pd.DataFrame()
else:
    res_eval = run_batch(df_sample, tag='eval', repeats=1)
    res_eval.head()



                                                 

In [44]:
if len(res_eval):
    print(res_eval['explanation'].value_counts().head(10))
else:
    print("res_eval empty")

explanation
The reviewer praises the barber shop as the best they've ever been to, highlighting the excellent service and attention to detail.                                                                             2
The reviewer had a wonderful time and highly recommends the performance, praising the actors, the music, and the overall experience.                                                                           2
The review is short and lacks detail, making it difficult to fully understand the reviewer's experience.                                                                                                       2
Mixed experience with good food but poor service.                                                                                                                                                              2
The reviewer found the food decent but overpriced, the service good, and the atmosphere pretentious. They would not return and only gave a higher rating

## Consistency measurement
Repeat runs on a smaller subset and compute agreement rates per strategy.

In [45]:

if OPENROUTER_API_KEY and CONSISTENCY_RUNS > 0:
    df_consistency = df_sample.head(min(CONSISTENCY_ROWS, len(df_sample)))
    res_consistency = run_batch(df_consistency, tag='consistency', repeats=CONSISTENCY_RUNS)
else:
    res_consistency = pd.DataFrame()
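
The agreement rule applied to these repeat runs can be illustrated without pandas. The sketch below uses made-up predictions and a hypothetical function name, `consistency_rates`. It treats a sample as consistent when its non-missing predictions contain exactly one distinct rating, which is the same rule as `nunique(dropna=True) == 1`.

```python
from collections import defaultdict

# (strategy, sample_id, predicted stars); None marks a failed or invalid output.
runs = [
    ('rubric', 0, 4), ('rubric', 0, 4), ('rubric', 0, 4),
    ('rubric', 1, 3), ('rubric', 1, 2), ('rubric', 1, 3),
    ('cot', 0, 5), ('cot', 0, 5), ('cot', 0, None),
    ('cot', 1, None), ('cot', 1, None), ('cot', 1, None),
]

def consistency_rates(rows):
    """Share of samples per strategy whose valid repeats all agree."""
    preds = defaultdict(list)
    for strategy, sample_id, pred in rows:
        preds[(strategy, sample_id)].append(pred)
    agreed = defaultdict(list)
    for (strategy, _), values in preds.items():
        distinct = {v for v in values if v is not None}
        agreed[strategy].append(len(distinct) == 1)
    return {s: sum(flags) / len(flags) for s, flags in agreed.items()}

rates = consistency_rates(runs)
print(rates)
```

Under this rule a sample with two agreeing runs and one failed run still counts as consistent, while a sample where every run failed counts as inconsistent.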



## Metrics and comparison table
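- `json_valid_rate`: share of outputs that passed schema/range checks.
- `accuracy`: exact-match rate vs ground truth over all samples; outputs without a valid prediction count as wrong.
- `consistency_rate`: fraction of subset samples where all repeats with a valid prediction agreed on the same rating.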
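
A small worked example of the first two metrics, using four made-up rows in plain Python. The extra `accuracy_on_valid` figure is not computed by the notebook; it is included only to show how invalid outputs pull the headline accuracy down.

```python
rows = [
    {'pred': 5, 'ground_truth': 5, 'valid_json': True},
    {'pred': 4, 'ground_truth': 5, 'valid_json': True},
    {'pred': None, 'ground_truth': 3, 'valid_json': False},  # invalid output
    {'pred': 1, 'ground_truth': 1, 'valid_json': True},
]

json_valid_rate = sum(r['valid_json'] for r in rows) / len(rows)
# Invalid outputs have pred=None, which never equals the ground truth.
accuracy = sum(r['pred'] == r['ground_truth'] for r in rows) / len(rows)

valid_rows = [r for r in rows if r['valid_json']]
accuracy_on_valid = sum(r['pred'] == r['ground_truth'] for r in valid_rows) / len(valid_rows)

print(json_valid_rate, accuracy, round(accuracy_on_valid, 3))
```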

In [46]:
summary = pd.DataFrame()
consistency_table = pd.DataFrame()

if len(res_eval):
    summary = res_eval.groupby('strategy').apply(
        lambda g: pd.Series({
            'samples': len(g),
            'json_valid_rate': g['valid_json'].mean(),
            'accuracy': (g['pred'] == g['ground_truth']).mean(),
        })
    )

if len(res_consistency):
    grouped = res_consistency.groupby(['strategy', 'sample_id'])['pred']
    agreement = grouped.apply(lambda s: s.nunique(dropna=True) == 1)
    consistency_table = agreement.groupby('strategy').mean().rename('consistency_rate').to_frame()

summary_all = summary.join(consistency_table, how='left')
summary_all



strategy,samples,json_valid_rate,accuracy
baseline,200.0,1.0,0.635
cot,200.0,1.0,0.685
rubric,200.0,1.0,0.7
selfcheck,200.0,1.0,0.675


## Save results
Writes detailed predictions and summaries to `results/`.

In [47]:
res_combined = pd.concat([res_eval, res_consistency], ignore_index=True)

if len(res_combined):
    preds_out = results_path / 'task1_predictions.csv'
    summary_out = results_path / 'task1_summary.csv'
    res_combined.to_csv(preds_out, index=False)
    summary_all.to_csv(summary_out)
    print('Saved', preds_out, 'and', summary_out)
else:
    print('No LLM results to save (likely missing API key).')

Saved results\task1_predictions.csv and results\task1_summary.csv


## Optional: ML baseline (fast, no API)
Trains two TF-IDF + logistic regression models (word n-grams and character n-grams) on a stratified 80/20 split of the same data, as a reference accuracy.

In [48]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

X = df['text']
y = df['stars']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y)

pipelines = {
    'word_tfidf_lr': make_pipeline(
        TfidfVectorizer(max_features=50000, ngram_range=(1, 3), min_df=2),
        LogisticRegression(max_iter=500, C=4.0, n_jobs=-1)
    ),
    'char_tfidf_lr': make_pipeline(
        TfidfVectorizer(analyzer='char', ngram_range=(3, 5), min_df=2),
        LogisticRegression(max_iter=400, C=2.0, n_jobs=-1)
    ),
}

ml_results = []
best_name, best_acc, best_clf, best_preds = None, -1, None, None

for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    preds = pipeline.predict(X_test)
    acc = accuracy_score(y_test, preds)
    ml_results.append({'model': name, 'accuracy': acc})
    print(f'Model {name}: accuracy {acc:.4f}')
    if acc > best_acc:
        best_name, best_acc, best_clf, best_preds = name, acc, pipeline, preds

print(f'Best ML model: {best_name} with accuracy {best_acc:.4f}')
print(classification_report(y_test, best_preds))

ml_pred_df = pd.DataFrame({
    'model': best_name,
    'review': X_test,
    'ground_truth': y_test,
    'pred': best_preds,
})
ml_summary = pd.DataFrame(ml_results)
ml_summary['best'] = ml_summary['model'] == best_name

ml_pred_out = results_path / 'task1_predictions_ml.csv'
ml_summary_out = results_path / 'task1_summary_ml.csv'
ml_pred_df.to_csv(ml_pred_out, index=False)
ml_summary.to_csv(ml_summary_out, index=False)
print('Saved ML outputs to', ml_pred_out, 'and', ml_summary_out)



Model word_tfidf_lr: accuracy 0.3500
Model char_tfidf_lr: accuracy 0.3750
Best ML model: char_tfidf_lr with accuracy 0.3750
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         3
           3       0.00      0.00      0.00         7
           4       0.38      0.88      0.53        16
           5       0.33      0.10      0.15        10

    accuracy                           0.38        40
   macro avg       0.14      0.20      0.14        40
weighted avg       0.23      0.38      0.25        40

Saved ML outputs to results\task1_predictions_ml.csv and results\task1_summary_ml.csv
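
As a sanity check on these numbers, the support column of the classification report above gives the accuracy of a trivial predictor that always outputs the most common rating. The sketch below is only arithmetic on those reported counts. It picks the majority class from the test split for simplicity, whereas in practice it would be chosen from the training split.

```python
# Test-set support per star rating, copied from the classification report above.
support = {1: 4, 2: 3, 3: 7, 4: 16, 5: 10}
best_ml_accuracy = 0.375  # char_tfidf_lr, from the output above

total = sum(support.values())
majority_class = max(support, key=support.get)
majority_accuracy = support[majority_class] / total

print(f'always predict {majority_class}: accuracy {majority_accuracy:.3f}')
print(f'best ML model: accuracy {best_ml_accuracy:.3f}')
```

On this 40-row test split the best ML pipeline sits slightly below the always-predict-4 baseline, so with so little training data these models are a weak reference point.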




## Discussion and prompt evolution
- Baseline → CoT → Self-check → Rubric: started with JSON-only ask, added lightweight reasoning (CoT), added self-correction for range/format, then anchored decisions to an explicit 1–5 rubric.
- Latest 200-row OpenRouter run (mistral-7b-instruct): rubric=0.70 acc, CoT=0.685, selfcheck=0.675, baseline=0.635; JSON valid rate 1.0 across strategies.
- Takeaway: rubric prompt is strongest; CoT close behind; self-check helps but trails; baseline lags.
- Next: if needed, rerun with only rubric/CoT to save calls, and add a consistency pass (`CONSISTENCY_RUNS>0`). Keep `USE_JSON_FORMAT=true`; drop it only if a model rejects JSON mode.
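
To put the "save calls" suggestion in numbers, a rough request count can be estimated as rows × strategies × repeats. The helper name `planned_calls` and the choice of 3 consistency repeats are assumptions for illustration. Repair re-prompts are not included and add at most one extra request per invalid reply.

```python
def planned_calls(rows, n_strategies, repeats=1):
    """First-attempt requests only; repair re-prompts are not counted."""
    return rows * n_strategies * repeats

full_eval = planned_calls(200, 4)               # all four strategies
trimmed_eval = planned_calls(200, 2)            # rubric and CoT only
consistency = planned_calls(30, 2, repeats=3)   # assumed CONSISTENCY_RUNS=3

print(full_eval, trimmed_eval, consistency)
```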

