In [2]:
!pip install pandas requests



In [3]:
import requests
from google.colab import userdata

# Get API key safely
OPENROUTER_API_KEY = userdata.get("OPENROUTER_API_KEY")

# OpenRouter endpoint
url = "https://openrouter.ai/api/v1/chat/completions"

# Test payload
payload = {
    "model": "mistralai/mistral-7b-instruct",
    "messages": [
        {"role": "user", "content": "Say hello to Disha in one friendly sentence."}
    ]
}

headers = {
    "Authorization": f"Bearer {OPENROUTER_API_KEY}",
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, json=payload)

print(response.json())

{'id': 'gen-1765719052-ZV5k390aq4pbnns7baHe', 'provider': 'DeepInfra', 'model': 'mistralai/mistral-7b-instruct', 'object': 'chat.completion', 'created': 1765719052, 'choices': [{'logprobs': None, 'finish_reason': 'stop', 'native_finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': " <s> [OUT] Hey Disha! Hope you're doing great today! 😊", 'refusal': None, 'reasoning': None}}], 'usage': {'prompt_tokens': 23, 'completion_tokens': 23, 'total_tokens': 46, 'cost': 1.886e-06, 'is_byok': False, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0, 'video_tokens': 0}, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 6.44e-07, 'upstream_inference_completions_cost': 1.242e-06}, 'completion_tokens_details': {'reasoning_tokens': 0, 'image_tokens': 0}}}


In [6]:
import pandas as pd
df = pd.read_csv("/content/yelp.csv")
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [7]:
df_sample = df.sample(n=200, random_state=42)
len(df_sample)

200

In [8]:
df_sample.iloc[0]

Unnamed: 0,6252
business_id,QVR7dsvBeg8xFt9B-vd1BA
date,2010-07-22
review_id,hwYVJs8Ko4PMjI19QcR57g
stars,4
text,We got here around midnight last Friday... the...
type,review
user_id,90a6z--_CUrl84aCzZyPsg
cool,5
useful,5
funny,2


# **PROMPT VERSION 1**
Prompt v1 was designed as a simple baseline to understand how a large language model performs when given minimal instructions. The goal was to test whether the model could infer a star rating from review text without strict formatting or behavioral guidance.

In [9]:
def prompt_v1(review_text):
    return f"""
    Read the following Yelp review and predict how many stars (1 to 5) the user would give.

    Review:
    {review_text}

    Respond in JSON format with:
    - predicted_stars (number from 1 to 5)
    - explanation (short reason)
    """


In [10]:
print(prompt_v1(df_sample.iloc[0]["text"]))


    Read the following Yelp review and predict how many stars (1 to 5) the user would give.

    Review:
    We got here around midnight last Friday... the place was dead. However, they were still serving food and we enjoyed some well made pub grub. Service was friendly, quality cocktails were served, and the atmosphere is derived from an old Uno's, which certainly works for a sports bar. It being located in a somewhat commercial area, I can see why it's empty so late on a Friday. From what my friends tell me - this is a great spot for happy hour, and it stays relatively busy thru 10pm.

*UPDATE - Great patio for day-drinking on the weekends!

    Respond in JSON format with:
    - predicted_stars (number from 1 to 5)
    - explanation (short reason)
    


In [11]:
import requests
from google.colab import userdata

def call_openrouter(prompt):
    api_key = userdata.get("OPENROUTER_API_KEY")

    url = "https://openrouter.ai/api/v1/chat/completions"

    payload = {
        "model": "mistralai/mistral-7b-instruct",
        "messages": [
            {"role": "user", "content": prompt}
        ]
    }

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

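    # Time out instead of hanging, and surface HTTP errors (e.g. 401, 429) explicitly
    response = requests.post(url, headers=headers, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()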
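
A batch loop over many reviews stops at the first transient failure (a rate limit, a dropped connection, or an error payload without a `choices` key). A minimal sketch of a retry wrapper with exponential backoff follows. It assumes the wrapped call raises an exception when it fails; `with_retries`, `max_attempts`, `base_delay` and `sleep` are hypothetical names, not part of the notebook or of `requests`.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, waiting base_delay * 2**attempt between tries.

    Hypothetical helper: re-raises the last exception once max_attempts is reached.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * (2 ** attempt))  # backoff: 1s, 2s, 4s, ...
```

It could be used as `ai_text = with_retries(lambda: call_openrouter(prompt)["choices"][0]["message"]["content"])`, so that a reply without `choices` also triggers a retry.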

In [12]:
# Take one review
sample_review = df_sample.iloc[0]["text"]

# Generate prompt
prompt_text = prompt_v1(sample_review)

# Call AI
result = call_openrouter(prompt_text)

print(result)

{'id': 'gen-1765719541-HOWkO61IOB4KHOjJlN1w', 'provider': 'DeepInfra', 'model': 'mistralai/mistral-7b-instruct', 'object': 'chat.completion', 'created': 1765719541, 'choices': [{'logprobs': None, 'finish_reason': 'stop', 'native_finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': ' <s> ```json\n    {\n        "predicted_stars": 4,\n        "explanation": "The reviewer mentions enjoying well-made pub grub, friendly service, and quality cocktails. They also highlight the atmosphere and the potential for a good experience during happy hour. The positive update about the patio for day-drinking further supports a favorable rating."\n    }\n    ``` ', 'refusal': None, 'reasoning': None}}], 'usage': {'prompt_tokens': 194, 'completion_tokens': 81, 'total_tokens': 275, 'cost': 9.806e-06, 'is_byok': False, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0, 'video_tokens': 0}, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_co

In [14]:
ai_text = result["choices"][0]["message"]["content"]

print(ai_text)

 <s> ```json
    {
        "predicted_stars": 4,
        "explanation": "The reviewer mentions enjoying well-made pub grub, friendly service, and quality cocktails. They also highlight the atmosphere and the potential for a good experience during happy hour. The positive update about the patio for day-drinking further supports a favorable rating."
    }
    ``` 


In [15]:
import json
import re

# Remove ```json and ``` and <s>
clean_text = re.sub(r"```json|```|<s>", "", ai_text).strip()

print(clean_text)

{
        "predicted_stars": 4,
        "explanation": "The reviewer mentions enjoying well-made pub grub, friendly service, and quality cocktails. They also highlight the atmosphere and the potential for a good experience during happy hour. The positive update about the patio for day-drinking further supports a favorable rating."
    }


In [16]:
parsed = json.loads(clean_text)

print(parsed)
print("Predicted stars:", parsed["predicted_stars"])

{'predicted_stars': 4, 'explanation': 'The reviewer mentions enjoying well-made pub grub, friendly service, and quality cocktails. They also highlight the atmosphere and the potential for a good experience during happy hour. The positive update about the patio for day-drinking further supports a favorable rating.'}
Predicted stars: 4


In [17]:
import json
import re

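def safe_parse_json(ai_text):
    """Strip code fences and the stray <s> token, then parse. Returns (parsed, is_valid)."""
    try:
        clean_text = re.sub(r"```json|```|<s>", "", ai_text).strip()
        parsed = json.loads(clean_text)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON, or ai_text was None / not a string
        return None, False
    # Callers use parsed.get(...), so only a JSON object counts as valid
    if not isinstance(parsed, dict):
        return None, False
    return parsed, True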
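
Stripping fences and `<s>` does not help when the model wraps the JSON in ordinary prose. One possible fallback, sketched here under the assumption that the reply contains a single top-level JSON object, is to parse only the span from the first `{` to the last `}`. `extract_json_object` is a hypothetical name, not something the notebook defines.

```python
import json

def extract_json_object(ai_text):
    """Return (parsed_dict, True) if a JSON object is found in ai_text, else (None, False)."""
    if not isinstance(ai_text, str):
        return None, False
    start = ai_text.find("{")
    end = ai_text.rfind("}")
    if start == -1 or end <= start:
        return None, False  # no brace-delimited span to try
    try:
        parsed = json.loads(ai_text[start:end + 1])
    except json.JSONDecodeError:
        return None, False
    if not isinstance(parsed, dict):
        return None, False
    return parsed, True
```

This is more lenient than the strict parser, so it would raise the measured JSON validity rate. Whether that leniency is acceptable depends on whether the goal is to measure the model's formatting discipline or to recover as many predictions as possible.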

In [20]:
df_small = df_sample.head(20)
len(df_small)

20

In [21]:
import time

In [32]:
results_v1 = []

for i, row in enumerate(df_small.iterrows(), start=1):
    review_text = row[1]["text"]
    actual_stars = row[1]["stars"]

    prompt = prompt_v1(review_text)
    response = call_openrouter(prompt)

    ai_text = response["choices"][0]["message"]["content"]
    parsed, is_valid = safe_parse_json(ai_text)

    predicted_stars = parsed.get("predicted_stars") if is_valid else None

    results_v1.append({
        "actual_stars": actual_stars,
        "predicted_stars": predicted_stars,
        "json_valid": is_valid
    })

    print(f"Done {i}/20")
    time.sleep(2)

Done 1/20
Done 2/20
Done 3/20
Done 4/20
Done 5/20
Done 6/20
Done 7/20
Done 8/20
Done 9/20
Done 10/20
Done 11/20
Done 12/20
Done 13/20
Done 14/20
Done 15/20
Done 16/20
Done 17/20
Done 18/20
Done 19/20
Done 20/20
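
The same loop is repeated for every prompt version, with only the prompt function and the delay changing. A minimal sketch of a reusable helper follows. `evaluate_prompt` and its parameters are hypothetical names; the model call and the parser are passed in as functions so the helper can be exercised without network access.

```python
import time

def evaluate_prompt(rows, prompt_fn, call_fn, parse_fn, delay=0.0):
    """Run prompt_fn over rows and collect actual vs. predicted stars.

    rows: iterable of dicts with "text" and "stars" keys.
    call_fn: takes a prompt string and returns the model's reply text.
    parse_fn: takes reply text and returns (parsed_dict_or_None, is_valid).
    """
    results = []
    for row in rows:
        ai_text = call_fn(prompt_fn(row["text"]))
        parsed, is_valid = parse_fn(ai_text)
        results.append({
            "actual_stars": row["stars"],
            "predicted_stars": parsed.get("predicted_stars") if is_valid else None,
            "json_valid": is_valid,
        })
        if delay:
            time.sleep(delay)  # pause between requests, as in the loop above
    return results
```

With the notebook's own functions this might be called as `evaluate_prompt(df_small.to_dict("records"), prompt_v1, lambda p: call_openrouter(p)["choices"][0]["message"]["content"], safe_parse_json, delay=2)`.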


In [33]:
results_v1_df = pd.DataFrame(results_v1)
results_v1_df

Unnamed: 0,actual_stars,predicted_stars,json_valid
0,4,,False
1,5,,False
2,3,3.0,True
3,1,,False
4,5,5.0,True
5,4,,False
6,4,,False
7,4,5.0,True
8,5,5.0,True
9,1,,False


In [34]:
# Remove rows where prediction failed
valid_predictions = results_v1_df.dropna(subset=["predicted_stars"])

accuracy = (
    valid_predictions["actual_stars"]
    == valid_predictions["predicted_stars"]
).mean()

accuracy

np.float64(0.5555555555555556)

In [35]:
json_validity_rate = results_v1_df["json_valid"].mean()
json_validity_rate

np.float64(0.45)
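
The accuracy above is computed only over rows that parsed, so it says nothing about the reviews the model failed to answer in valid JSON. A minimal sketch of a summary that reports both views, plus the mean absolute star error, follows. `summarize` and the metric names are hypothetical, and the function expects the plain list of result dicts (with `None` for failed predictions), not the DataFrame.

```python
def summarize(results):
    """Compute metrics from dicts with actual_stars, predicted_stars, json_valid keys."""
    total = len(results)
    if total == 0:
        raise ValueError("results must not be empty")
    parsed = [r for r in results if r["predicted_stars"] is not None]
    correct = sum(r["actual_stars"] == r["predicted_stars"] for r in parsed)
    abs_errors = [abs(r["actual_stars"] - r["predicted_stars"]) for r in parsed]
    return {
        "json_validity": sum(bool(r["json_valid"]) for r in results) / total,
        # Share of parsed rows that were right (the notebook's accuracy metric)
        "accuracy_on_parsed": correct / len(parsed) if parsed else None,
        # Share of all rows that were right, counting parse failures as misses
        "end_to_end_accuracy": correct / total,
        "mean_abs_error": sum(abs_errors) / len(parsed) if parsed else None,
    }
```

For example, `summarize(results_v1)` would give the two accuracy figures side by side, which makes prompt versions with different parse rates easier to compare.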

Prompt v1 showed that the model could reason about sentiment (0.56 accuracy on the 9 responses that parsed), but it was unreliable for structured output: only 45% of the 20 responses were valid JSON. This motivated stricter output control in the next version.

# **PROMPT VERSION 2**
Prompt v2 was created to address the JSON reliability issues observed in v1. The focus was on enforcing a strict output format suitable for programmatic evaluation.

In [36]:
def prompt_v2(review_text):
    return f"""
You are a rating classifier.

Task:
Given a Yelp review, predict the star rating from 1 to 5.

Rules:
- Respond with ONLY valid JSON
- Do NOT include markdown
- Do NOT include any extra text
- The JSON must have exactly two keys:
  - predicted_stars (integer between 1 and 5)
  - explanation (one short sentence)

Review:
{review_text}

Output:
"""


In [57]:
df_tiny = df_small.head(10)
len(df_tiny)

10

In [59]:
results_v2 = []

for i, row in enumerate(df_tiny.iterrows(), start=1):
    review_text = row[1]["text"]
    actual_stars = row[1]["stars"]

    prompt = prompt_v2(review_text)
    response = call_openrouter(prompt)

    ai_text = response["choices"][0]["message"]["content"]
    parsed, is_valid = safe_parse_json(ai_text)

    predicted_stars = parsed.get("predicted_stars") if is_valid else None

    results_v2.append({
        "actual_stars": actual_stars,
        "predicted_stars": predicted_stars,
        "json_valid": is_valid
    })

    print(f"Done {i}/10 (v2)")
    time.sleep(3)

Done 1/10 (v2)
Done 2/10 (v2)
Done 3/10 (v2)
Done 4/10 (v2)
Done 5/10 (v2)
Done 6/10 (v2)
Done 7/10 (v2)
Done 8/10 (v2)
Done 9/10 (v2)
Done 10/10 (v2)


In [60]:
results_v2_df = pd.DataFrame(results_v2)
results_v2_df

Unnamed: 0,actual_stars,predicted_stars,json_valid
0,4,4,True
1,5,5,True
2,3,4,True
3,1,1,True
4,5,5,True
5,4,4,True
6,4,5,True
7,4,5,True
8,5,5,True
9,1,1,True


In [61]:
valid_v2 = results_v2_df.dropna(subset=["predicted_stars"])

accuracy_v2 = (
    valid_v2["actual_stars"]
    == valid_v2["predicted_stars"]
).mean()

accuracy_v2

np.float64(0.6)
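
In the table above, 7 of the 10 displayed rows match, yet the computed accuracy is 0.6. The cause is not verified here. One possibility (an assumption) is that a prediction came back as a string such as `"4"`, which displays like the integer 4 in a DataFrame but compares unequal to it. A minimal sketch of normalizing the parsed value before comparing follows; `coerce_stars` is a hypothetical name.

```python
def coerce_stars(value):
    """Return value as an int in 1..5, or None if it cannot be read that way."""
    if isinstance(value, bool):  # bool is a subclass of int; reject it explicitly
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None  # None, lists, or text such as "four"
    if not number.is_integer() or not 1 <= number <= 5:
        return None  # fractional, out of range, NaN or infinity
    return int(number)
```

It could be applied when building each result row, for example `predicted_stars = coerce_stars(parsed.get("predicted_stars")) if is_valid else None`.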

In [62]:
json_validity_v2 = results_v2_df["json_valid"].mean()
json_validity_v2

np.float64(1.0)

While output reliability improved (all 10 responses parsed as valid JSON), the model still lacked clear guidance on how to map sentiment to star ratings, which limited semantic accuracy (0.60).
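
In v2 the rules and the review travel together in a single user message. One variation, sketched here as an assumption and not something this notebook tested, is to put the fixed rules in a `system` message and ask for low-randomness decoding with `temperature: 0`. Both are standard fields of OpenAI-style chat completion requests; how well this particular model honors them is untested. `SYSTEM_RULES` and `build_payload` are hypothetical names.

```python
SYSTEM_RULES = (
    "You are a rating classifier. Respond with ONLY valid JSON, no markdown, "
    "with exactly two keys: predicted_stars (integer 1-5) and "
    "explanation (one short sentence)."
)

def build_payload(review_text, model="mistralai/mistral-7b-instruct", temperature=0):
    """Build a chat-completions request body with rules and review in separate messages."""
    return {
        "model": model,
        "temperature": temperature,  # 0 asks for the least random decoding
        "messages": [
            {"role": "system", "content": SYSTEM_RULES},
            {"role": "user", "content": f"Review:\n{review_text}\n\nOutput:"},
        ],
    }
```

The returned dictionary would replace the `payload` built inside `call_openrouter`; the headers and endpoint stay the same.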

# **PROMPT VERSION 3.3**
Prompt v3.3 was designed to improve semantic accuracy by modeling real Yelp user behavior rather than strict sentiment analysis.

In [75]:
def prompt_v3_3(review_text):
    return f"""
You are predicting the star rating a real Yelp user would most likely give.

IMPORTANT FACTS ABOUT YELP RATINGS:
- 4-star ratings are very common, even when reviews mention minor complaints
- 3-star ratings are used only when positives and negatives are truly balanced
- Users rate based on overall feeling, not individual issues

STAR RATING RUBRIC:
1 star: Extremely negative, strong dissatisfaction.
2 stars: Mostly negative, more complaints than praise.
3 stars: Truly mixed, positives and negatives are roughly equal.
4 stars: Mostly positive overall, minor complaints allowed.
5 stars: Extremely positive, enthusiastic praise.

DECISION PROCESS:
1. Determine the overall sentiment.
2. Predict what a REAL Yelp user would choose.
3. If unsure between 3 and 4, prefer 4.
4. If unsure between 2 and 3, prefer 2.

STRICT OUTPUT REQUIREMENTS:
- Output ONLY valid JSON
- No markdown or extra text
- predicted_stars MUST be an integer from 1 to 5
- explanation MUST be exactly one short sentence

Review text:
{review_text}

Return exactly:
{{
  "predicted_stars": <integer 1-5>,
  "explanation": "<one short sentence>"
}}
"""

In [77]:
results_v3_3 = []

for i, row in enumerate(df_tiny.iterrows(), start=1):
    review_text = row[1]["text"]
    actual_stars = row[1]["stars"]

    prompt = prompt_v3_3(review_text)
    response = call_openrouter(prompt)

    ai_text = response["choices"][0]["message"]["content"]
    parsed, is_valid = safe_parse_json(ai_text)

    predicted_stars = parsed.get("predicted_stars") if is_valid else None

    results_v3_3.append({
        "actual_stars": actual_stars,
        "predicted_stars": predicted_stars,
        "json_valid": is_valid
    })

    print(f"Done {i}/10 (v3.3)")
    time.sleep(3)

Done 1/10 (v3.3)
Done 2/10 (v3.3)
Done 3/10 (v3.3)
Done 4/10 (v3.3)
Done 5/10 (v3.3)
Done 6/10 (v3.3)
Done 7/10 (v3.3)
Done 8/10 (v3.3)
Done 9/10 (v3.3)
Done 10/10 (v3.3)


In [78]:
results_v3_3_df = pd.DataFrame(results_v3_3)
results_v3_3_df

Unnamed: 0,actual_stars,predicted_stars,json_valid
0,4,4.0,True
1,5,5.0,True
2,3,4.0,True
3,1,,False
4,5,,False
5,4,4.0,True
6,4,4.0,True
7,4,5.0,True
8,5,5.0,True
9,1,,False


In [79]:
valid_v3_3 = results_v3_3_df.dropna(subset=["predicted_stars"])

accuracy_v3_3 = (
    valid_v3_3["actual_stars"]
    == valid_v3_3["predicted_stars"]
).mean()

json_validity_v3_3 = results_v3_3_df["json_valid"].mean()

accuracy_v3_3, json_validity_v3_3

(np.float64(0.7142857142857143), np.float64(0.7))

Prompt v3.3 reached the highest accuracy so far (0.71, measured on the 7 responses that parsed), but JSON validity fell to 0.70, suggesting a trade-off between a longer, reasoning-heavy prompt and strict output structure.

# **PROMPT VERSION 3.4**
Prompt v3.4 was designed to restore JSON reliability while preserving the improved accuracy logic introduced in v3.3.

In [81]:
def prompt_v3_4(review_text):
    return f"""
You are a system that predicts Yelp star ratings.

IMPORTANT:
You may reason internally, but you must NOT show your reasoning.
Your final answer must be ONLY valid JSON.

BACKGROUND ABOUT YELP RATINGS:
- Users often give 4 stars even when minor complaints exist
- 3 stars are used only when positives and negatives are truly balanced
- Overall feeling matters more than individual issues

STAR RATING RUBRIC:
1 star: Extremely negative, strong dissatisfaction.
2 stars: Mostly negative, more complaints than praise.
3 stars: Truly mixed, positives and negatives are equal.
4 stars: Mostly positive overall, minor complaints allowed.
5 stars: Extremely positive, enthusiastic praise.

DECISION RULES:
- Decide the most likely rating a REAL Yelp user would give
- If unsure between 3 and 4, choose 4
- If unsure between 2 and 3, choose 2

OUTPUT RULES (ABSOLUTE):
- Output ONLY valid JSON
- No markdown
- No explanations outside JSON
- No extra keys
- No extra text
- If output is not valid JSON, you have FAILED the task

Review text:
{review_text}

Return EXACTLY this JSON and nothing else:
{{
  "predicted_stars": <integer 1-5>,
  "explanation": "<one short sentence>"
}}
"""

In [82]:
results_v3_4 = []

for i, row in enumerate(df_tiny.iterrows(), start=1):
    review_text = row[1]["text"]
    actual_stars = row[1]["stars"]

    prompt = prompt_v3_4(review_text)
    response = call_openrouter(prompt)

    ai_text = response["choices"][0]["message"]["content"]
    parsed, is_valid = safe_parse_json(ai_text)

    predicted_stars = parsed.get("predicted_stars") if is_valid else None

    results_v3_4.append({
        "actual_stars": actual_stars,
        "predicted_stars": predicted_stars,
        "json_valid": is_valid
    })

    print(f"Done {i}/10 (v3.4)")
    time.sleep(3)

Done 1/10 (v3.4)
Done 2/10 (v3.4)
Done 3/10 (v3.4)
Done 4/10 (v3.4)
Done 5/10 (v3.4)
Done 6/10 (v3.4)
Done 7/10 (v3.4)
Done 8/10 (v3.4)
Done 9/10 (v3.4)
Done 10/10 (v3.4)


In [83]:
results_v3_4_df = pd.DataFrame(results_v3_4)
results_v3_4_df

Unnamed: 0,actual_stars,predicted_stars,json_valid
0,4,4.0,True
1,5,5.0,True
2,3,4.0,True
3,1,,False
4,5,5.0,True
5,4,4.0,True
6,4,4.0,True
7,4,5.0,True
8,5,5.0,True
9,1,2.0,True


In [85]:
valid_v3_4 = results_v3_4_df.dropna(subset=["predicted_stars"])

accuracy_v3_4 = (
    valid_v3_4["actual_stars"]
    == valid_v3_4["predicted_stars"]
).mean()

json_validity_v3_4 = results_v3_4_df["json_valid"].mean()

accuracy_v3_4, json_validity_v3_4

(np.float64(0.6666666666666666), np.float64(0.9))

Prompt v3.4 takes a more system-oriented approach: it prioritizes reliable, consistent output (0.90 JSON validity) while keeping reasonable accuracy (0.67 on the 9 parsed responses).
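
The output rules in v3.4 (exactly two keys, an integer rating from 1 to 5, a one-sentence explanation) are only requested in the prompt; `json_valid` checks that the reply parses, not that it follows them. A minimal sketch of checking the parsed object against those rules follows. `EXPECTED_KEYS` and `schema_problems` are hypothetical names, and the sketch does not verify that the explanation is a single sentence.

```python
EXPECTED_KEYS = {"predicted_stars", "explanation"}

def schema_problems(parsed):
    """Return a list of rule violations; an empty list means the reply conforms."""
    if not isinstance(parsed, dict):
        return ["reply is not a JSON object"]
    problems = []
    extra = set(parsed) - EXPECTED_KEYS
    missing = EXPECTED_KEYS - set(parsed)
    if extra:
        problems.append(f"extra keys: {sorted(extra)}")
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    stars = parsed.get("predicted_stars")
    if "predicted_stars" in parsed and (
        isinstance(stars, bool) or not isinstance(stars, int) or not 1 <= stars <= 5
    ):
        problems.append("predicted_stars is not an integer from 1 to 5")
    explanation = parsed.get("explanation")
    if "explanation" in parsed and (
        not isinstance(explanation, str) or not explanation.strip()
    ):
        problems.append("explanation is not a non-empty string")
    return problems
```

Recording `schema_problems(parsed)` next to `json_valid` would separate replies that are malformed JSON from replies that parse but break the stated rules.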

In [87]:
comparison_df = pd.DataFrame({
    "Prompt Version": ["v1 (Naive)", "v2 (Strict)", "v3.3 (Robust, better accuracy)", "v3.4 (Robust, better json validation)"],
    "Sample Size": [len(results_v1_df), len(results_v2_df), len(results_v3_3_df), len(results_v3_4_df)],
    "Accuracy": [accuracy, accuracy_v2, accuracy_v3_3, accuracy_v3_4],
    "JSON Validity Rate": [json_validity_rate, json_validity_v2, json_validity_v3_3, json_validity_v3_4]
})

comparison_df

Unnamed: 0,Prompt Version,Sample Size,Accuracy,JSON Validity Rate
0,v1 (Naive),20,0.555556,0.45
1,v2 (Strict),10,0.6,1.0
2,"v3.3 (Robust, better accuracy)",10,0.714286,0.7
3,"v3.4 (Robust, better json validation)",10,0.666667,0.9


The comparison table highlights the trade-offs between accuracy and output reliability across the prompt designs.

* Prompt v1 (Naive) serves as a baseline, achieving moderate accuracy (0.56) but suffering from poor JSON validity (0.45), demonstrating that minimal instructions are insufficient for structured, machine-readable outputs.

* Prompt v2 (Strict JSON) significantly improves output reliability, reaching perfect JSON validity (1.00) on its 10 reviews, while accuracy rises slightly (0.60). This suggests that explicitly constraining the output format helps stabilize model behavior, though semantic understanding remains limited without domain-specific guidance.

* Prompt v3.3 (Robust, accuracy-focused) achieves the highest accuracy (0.71) by adding a star-rating rubric and modeling real Yelp user rating behavior. However, JSON validity drops to 0.70 with the longer, more complex prompt, illustrating the trade-off between deeper semantic reasoning and strict output control.

* Prompt v3.4 (Robust, JSON-focused) restores high JSON validity (0.90) while maintaining competitive accuracy (0.67). This version demonstrates a more system-oriented design, prioritizing reliability and consistency while retaining much of the accuracy gained in v3.3.

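Overall, the results suggest that prompt engineering involves balancing semantic accuracy against output robustness; in this experiment, no single prompt was best on both. Accuracy is computed only over responses that parsed as valid JSON, and the samples are small (20 reviews for v1, 10 for the others), so the differences are indicative rather than conclusive. The final two versions (v3.3 and v3.4) illustrate how different design priorities can be addressed depending on whether accuracy or structured reliability is more critical for the application.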
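
To give a sense of how much uncertainty such small samples carry, a Wilson score interval can be computed for each accuracy figure. This is a sketch added for illustration, not part of the original analysis; `wilson_interval` is a hypothetical helper, and the counts are read off the v3.3 and v3.4 result tables (correct predictions over parsed responses).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Counts taken from the result tables: correct predictions / parsed responses
for name, correct, parsed in [("v3.3", 5, 7), ("v3.4", 6, 9)]:
    low, high = wilson_interval(correct, parsed)
    print(f"{name}: {correct}/{parsed} correct, 95% interval {low:.2f} to {high:.2f}")
```

With fewer than ten parsed responses per version, the intervals are wide and overlap substantially, so the accuracy ranking between v3.3 and v3.4 could change on a larger sample.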