In [6]:
!pip install groq

Collecting groq
  Downloading groq-1.0.0-py3-none-any.whl.metadata (16 kB)
Downloading groq-1.0.0-py3-none-any.whl (138 kB)
Installing collected packages: groq
Successfully installed groq-1.0.0


In [1]:
import pandas as pd

df = pd.read_csv("yelp.csv")  # adjust path if needed
df.head()


Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [2]:
df[['stars', 'useful', 'funny', 'cool']].groupby('stars').mean()

stars,useful,funny,cool
1,1.604806,1.056075,0.576769
2,1.563107,0.875944,0.719525
3,1.306639,0.69473,0.788501
4,1.395916,0.670448,0.954623
5,1.38178,0.608631,0.944261


In [53]:
df['stars'].value_counts(normalize=True).sort_index()

stars,proportion
1,0.0749
2,0.0927
3,0.1461
4,0.3526
5,0.3337
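
The class proportions above also give a reference point for the accuracies reported later: a classifier that always predicts the most frequent rating would be right as often as that rating occurs. A minimal sketch, with the proportions copied from the output above (`majority_baseline` is a hypothetical helper name, not part of the notebook):

```python
# Star-rating proportions copied from the value_counts output above.
proportions = {1: 0.0749, 2: 0.0927, 3: 0.1461, 4: 0.3526, 5: 0.3337}

def majority_baseline(props):
    """Return (label, accuracy) for always predicting the most common label."""
    label = max(props, key=props.get)
    return label, props[label]

label, baseline_acc = majority_baseline(proportions)
print(f"Always predicting {label} stars would score about {baseline_acc:.2%}")
```

This figure applies to the full dataset. On a small sample the baseline depends on which reviews were drawn, so it is probably best recomputed from `sample_df['stars']`.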


In [54]:
df['review_length'] = df['text'].str.split().apply(len)

df.groupby('stars')['review_length'].mean()

stars,review_length
1,153.953271
2,156.435814
3,140.714579
4,131.174135
5,114.46359


## Key Observations from Data Analysis

- Star ratings are subjective and not purely sentiment-based
Many 4-star and 5-star reviews contain minor complaints, while some 3-star reviews sound positive overall. This suggests that Yelp stars reflect overall satisfaction and intent (e.g., “would I return?”) rather than raw sentiment alone.

- Mixed-sentiment reviews are common, especially for 3-star ratings
Reviews with both positive and negative points frequently correspond to 3 stars. At about 141 words on average, 3-star reviews are longer than 4-star (131) and 5-star (114) reviews, though shorter than 1-star and 2-star reviews (154 and 156).

- Emotional intensity does not imply positivity
Negative reviews tend to be longer, more emotional, and more often marked as “useful” or “funny” (1-star reviews average 1.60 useful and 1.06 funny votes, versus 1.38 and 0.61 for 5-star reviews), while positive reviews are usually shorter and less expressive. This means emotion or verbosity should not be rewarded with higher ratings.

- There is label noise and overlap between star categories
Some 5-star reviews still include complaints, and some 1-star reviews are not strongly negative. This places a natural upper bound on achievable accuracy.

In [17]:
df = df[["text", "stars"]].dropna()
sample_df = df.sample(n=10, random_state=30)
sample_df.head()


Unnamed: 0,text,stars
8793,"Been there many times with many friends, not f...",4
1122,"I have not been bowling in 24 years,so you can...",5
1283,Consistency is an issue with the Chipotle chai...,2
9318,This is my first time using Groupon. It's one ...,2
7765,I absolutely love this sub shop! Its the only ...,5


In [None]:
#Direct Classification prompt
p1 = """
Classify the Yelp review into a star rating from 1 to 5.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "<brief reason>"
}}

Review:
"{review}"
"""

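The doubled braces in the template are there because the prompt is later filled in with `str.format`, which treats single braces as placeholders. A small sketch of how such a template renders (the cut-down `template` and the review text are made up for illustration):

```python
# A cut-down template in the same style as p1 (illustrative only).
template = """
Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>
}}

Review:
"{review}"
"""

# Braces inside the review are safe: format() only parses the template itself.
review = "Great tacos {seriously}, but slow service."
rendered = template.format(review=review)
print(rendered)
```

After formatting, `{{` and `}}` come out as single braces, so the model sees an ordinary JSON skeleton.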

In [None]:
#Criteria-Based Analysis
p2 = """
You are rating a Yelp review strictly based on the customer's overall satisfaction.

Rules:
- If the review mentions serious complaints, service issues, or disappointment, do NOT give 4 or 5 stars.
- If both positives and negatives are present, default to 3 stars unless praise clearly dominates.
- Use 5 stars ONLY if the review shows strong enthusiasm with no complaints.
- Use 4 stars ONLY if mostly positive with very minor issues.
- Be conservative: avoid inflating ratings.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "<short justification>"
}}

Review:
"{review}"
"""


In [None]:
#Criteria-Based Analysis (variant: either praise or criticism may dominate)
p3 = """
You are rating a Yelp review strictly based on the customer's overall satisfaction.

Rules:
- If the review mentions serious complaints, service issues, or disappointment, do NOT give 4 or 5 stars.
- If both positives and negatives are present, default to 3 stars unless praise or criticism clearly dominates.
- Use 5 stars ONLY if the review shows strong enthusiasm with no complaints.
- Use 4 stars ONLY if mostly positive with very minor issues.
- Be conservative: avoid inflating ratings.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "<short justification>"
}}

Review:
"{review}"
"""


In [29]:
# Sentiment Anchoring + Rating Rubric
prompt1= """
You are a Yelp review rating classifier.

First, determine the overall sentiment of the review as one of:
very negative, negative, neutral/mixed, positive, very positive.

Then assign a star rating using this rubric:
- very negative → 1 star
- negative → 2 stars
- neutral or mixed → 3 stars
- positive → 4 stars
- very positive → 5 stars

Do not reward humor, sarcasm, or emotional intensity alone.
Focus on overall customer satisfaction.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Short justification referencing sentiment."
}}

Review:
"{review}"
"""

In [30]:
# Hidden Chain-of-Thought + Constraint Rules
prompt2 = """
Analyze the Yelp review carefully.

Internally consider:
- Key positive points
- Key negative points
- Overall satisfaction outcome

Apply these rules:
- If praise and criticism are both present, default to 3 stars unless one clearly dominates.
- Do not assign 4 or 5 stars if meaningful complaints are present.
- 5 stars require clear enthusiasm with no complaints.

Do NOT show your reasoning steps.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Concise summary of why this rating was chosen."
}}

Review:
"{review}"
"""

In [31]:
# Self-Consistency + Calibration (best of prompts 1-3)

prompt3 = """
You are classifying a Yelp review into a 1–5 star rating.

Guidelines:
- Do not let emotional intensity, humor, or review length alone affect the rating.
- Sarcasm or exaggerated language often signals dissatisfaction.
- Mixed praise and criticism usually corresponds to 3 stars.
- 5 stars require strong enthusiasm with no meaningful complaints.
- 1 star requires strong dissatisfaction.

Step 1: Choose the most likely star rating.
Step 2: Ask yourself whether one star higher or lower would be more reasonable.
If yes, adjust the rating.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Brief justification based on overall satisfaction."
}}

Review:
"{review}"
"""

In [32]:
from groq import Groq
from google.colab import userdata

# Create an API key at https://console.groq.com/keys
# and store it as the Colab secret GROQ_API2.
client = Groq(api_key=userdata.get('GROQ_API2'))
MODEL_NAME = "openai/gpt-oss-120b"


In [33]:
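def call_llm(prompt):
    """Send one prompt to the model and return the raw response text."""
    completion = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # keep outputs as repeatable as possible across runs
        max_tokens=600,
        response_format={"type": "json_object"},  # ask for a JSON object
    )
    return completion.choices[0].message.content.strip()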
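
API calls can fail transiently (rate limits, timeouts), and a single failure would abort the whole experiment loop. One way to guard against that is a small retry wrapper with exponential backoff. This is a sketch under assumptions: `call_with_retry` is a hypothetical helper, it retries on any exception, and the delays are arbitrary. In practice you would likely catch only the client's rate-limit or connection errors.

```python
import time

def call_with_retry(fn, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(prompt), retrying with exponential backoff on failure.

    fn is any callable that takes a prompt string (for example call_llm).
    sleep is injectable so the wrapper can be exercised without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(prompt)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside the experiment loop this could be used as `call_with_retry(call_llm, prompt)`.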


In [40]:
import json

def run_experiment(prompt_template):
    """Run one prompt template over sample_df and collect predictions."""
    results = []

    for _, row in sample_df.iterrows():
        prompt = prompt_template.format(review=row["text"])
        response = call_llm(prompt)

        try:
            parsed = json.loads(response)
            results.append({
                "actual": row["stars"],
                "predicted": parsed["predicted_stars"],
                "valid_json": True
            })
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed JSON or a missing key counts as an invalid response.
            results.append({
                "actual": row["stars"],
                "predicted": None,
                "valid_json": False
            })

    return pd.DataFrame(results)
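
Valid JSON does not guarantee a usable label: the model could return `"4"` as a string, a value outside 1-5, or a fractional rating. A stricter parser, sketched here as a hypothetical `parse_prediction` helper (not part of the original notebook), would treat those cases explicitly instead of letting them show up as silent mismatches:

```python
import json

def parse_prediction(response):
    """Return the predicted star rating as an int in 1..5, or None if unusable."""
    try:
        parsed = json.loads(response)
        stars = parsed["predicted_stars"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    # Reject booleans explicitly: bool is a subclass of int in Python.
    if isinstance(stars, bool):
        return None
    try:
        stars_int = int(stars)
    except (TypeError, ValueError):
        return None
    # Reject non-integral numbers such as 3.5 and out-of-range values.
    if stars_int != float(stars) or not 1 <= stars_int <= 5:
        return None
    return stars_int
```

With this design a string such as `"4"` is accepted and coerced to an integer, which is a judgment call; a stricter variant could reject it.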


In [52]:
res_v1 = run_experiment(prompt1)
res_v2 = run_experiment(prompt2)
res_v3 = run_experiment(prompt3)


In [36]:
def evaluate(df):
    return {
        "Accuracy": (df["actual"] == df["predicted"]).mean(),
        "JSON_Validity": df["valid_json"].mean()
    }

summary = pd.DataFrame([
    {"Prompt": "1", **evaluate(res_v1)},
    {"Prompt": "2", **evaluate(res_v2)},
    {"Prompt": "3", **evaluate(res_v3)}
])

summary

Unnamed: 0,Prompt,Accuracy,JSON_Validity
0,1,0.7,1.0
1,2,0.6,1.0
2,3,0.8,1.0
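
With only 10 reviews per prompt, each accuracy carries wide uncertainty. A rough way to see this is a Wilson score interval for a binomial proportion. The sketch below uses the standard formula with z = 1.96 for an approximate 95% interval (`wilson_interval` is a helper name introduced here, and the correct counts are derived from the accuracies in the table above):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate confidence interval for a binomial proportion (Wilson score)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Correct counts for prompts 1-3 on the 10-review sample.
for name, correct in [("1", 7), ("2", 6), ("3", 8)]:
    low, high = wilson_interval(correct, 10)
    print(f"Prompt {name}: {correct}/10, 95% interval roughly {low:.2f} to {high:.2f}")
```

The intervals for the three prompts overlap heavily, so a difference of one or two reviews is within sampling noise at this sample size.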


In [37]:
# Star-Definition Grounding (Label Semantics Prompt)
# decision boundary anchoring
prompt4 = """
You are assigning a Yelp star rating. Use the following definitions strictly:

1 star: Very bad experience. Strong dissatisfaction. Would not return.
2 stars: Mostly negative experience with a few positives.
3 stars: Mixed or average experience. Acceptable but not great.
4 stars: Good experience overall. Would return, despite minor issues.
5 stars: Excellent experience. Very satisfied, no real complaints.

Rules:
- Minor issues do NOT reduce a 4-star rating.
- Any meaningful dissatisfaction prevents a 5-star rating.
- Mixed praise and criticism usually indicates 3 stars.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Justification based on the experience definition."
}}

Review:
"{review}"
"""


In [38]:
# Decision Tree Prompt (Structured Reasoning)
# Implicit decision tree / branching logic
prompt5 = """
Answer the following questions internally:

1. Is the reviewer clearly dissatisfied overall?
   - Yes → 1 or 2 stars
   - No → continue

2. Is the experience described as excellent with no real complaints?
   - Yes → 5 stars
   - No → continue

3. Does the review contain both praise and criticism?
   - Yes → 3 stars
   - No → continue

4. Is the experience mostly positive with only minor issues?
   - Yes → 4 stars

Choose the most appropriate rating.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Short justification based on decision path."
}}

Review:
"{review}"
"""


In [39]:
# Contrast-Aware Prompt
prompt6 = """
You are rating a Yelp review.

Important:
- Pay special attention to contrast words such as "but", "however", "although", "though".
- If a contrast introduces a complaint, do NOT assign 4 or 5 stars.
- Minor complaints introduced by contrast may still allow 4 stars.
- Strong dissatisfaction introduced by contrast indicates 1 or 2 stars.

Use this guidance to determine the rating.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Brief justification referencing overall satisfaction."
}}

Review:
"{review}"
"""


In [41]:
res_v4 = run_experiment(prompt4)
res_v5 = run_experiment(prompt5)
res_v6 = run_experiment(prompt6)


In [42]:
# evaluate() is unchanged from the earlier cell and is reused here.

summary = pd.DataFrame([
    {"Prompt": "4", **evaluate(res_v4)},
    {"Prompt": "5", **evaluate(res_v5)},
    {"Prompt": "6", **evaluate(res_v6)}
])

summary

Unnamed: 0,Prompt,Accuracy,JSON_Validity
0,4,0.8,1.0
1,5,0.3,1.0
2,6,0.7,1.0
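
Exact-match accuracy treats a 4 predicted as 5 the same as a 1 predicted as 5. Because star ratings are ordinal, it may help to also report mean absolute error and within-one-star accuracy. A minimal sketch on made-up lists (the function name and the example values are illustrative, not results from this notebook):

```python
def ordinal_metrics(actual, predicted):
    """Exact accuracy, within-one-star accuracy, and MAE over valid predictions.

    Pairs whose prediction is None are skipped, unlike evaluate(), which
    counts them as wrong.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if p is not None]
    if not pairs:
        return {"accuracy": 0.0, "within_one": 0.0, "mae": None}
    n = len(pairs)
    errors = [abs(a - p) for a, p in pairs]
    return {
        "accuracy": sum(e == 0 for e in errors) / n,
        "within_one": sum(e <= 1 for e in errors) / n,
        "mae": sum(errors) / n,
    }

# Illustrative values only.
actual = [4, 5, 2, 2, 5]
predicted = [4, 4, 2, 3, 1]
print(ordinal_metrics(actual, predicted))
```

A prompt with low exact accuracy but high within-one accuracy is mostly making boundary errors, which is a different failure mode from predicting the wrong end of the scale.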


In [43]:
# Two-Pass Internal Voting
# Instead of “think once → decide”, the model:
#   - evaluates twice internally
#   - resolves disagreement
prompt7 = """
You are classifying a Yelp review.

Step 1: Assign an initial star rating (1–5).
Step 2: Re-evaluate the review focusing only on overall satisfaction.
Step 3: If the two ratings differ, choose the more conservative one.

Guidelines:
- Do not reward emotional intensity alone.
- Mixed sentiment usually corresponds to 3 stars.
- 5 stars require strong enthusiasm with no complaints.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Final justification after reconsideration."
}}

Review:
"{review}"
"""


In [45]:
res_v7 = run_experiment(prompt7)

In [46]:
# evaluate() is unchanged from the earlier cell and is reused here.

summary = pd.DataFrame([
    {"Prompt": "7", **evaluate(res_v7)},
])

summary

Unnamed: 0,Prompt,Accuracy,JSON_Validity
0,7,0.7,1.0


In [47]:
# Star Semantics + Return Intent
# Technique: label semantics grounding + behavioral intent inference
# Why this works: Yelp stars are about “would I return / recommend?”

prompt8 = """
You are assigning a Yelp star rating based on customer experience.

Use these definitions strictly:
1 star: Very bad experience. Strong dissatisfaction. Would not return.
2 stars: Mostly negative experience with a few positives.
3 stars: Mixed or average experience. Acceptable but not memorable.
4 stars: Good experience overall. Would return, despite minor issues.
5 stars: Excellent experience. Very satisfied. No meaningful complaints.

Important rules:
- Minor issues do NOT reduce a 4-star rating.
- Any meaningful dissatisfaction prevents a 5-star rating.
- Mixed praise and criticism usually corresponds to 3 stars.
- If the reviewer suggests they would return or recommend, prefer 4 stars.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Justification based on overall experience and intent."
}}

Review:
"{review}"
"""


In [48]:
# Calibration + Severity Weighting

prompt9 = """
Classify the Yelp review into a 1–5 star rating.

Evaluate complaints by severity:
- Minor issues (e.g., slow service once, small inconvenience) do NOT lower a 4-star rating.
- Repeated or service-breaking issues lower the rating.
- Strong dissatisfaction leads to 1 or 2 stars.

Guidelines:
- Mixed praise and criticism usually indicates 3 stars.
- 5 stars require clear enthusiasm with no real complaints.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Brief justification focusing on complaint severity."
}}

Review:
"{review}"
"""


In [49]:
# Two-Pass Conservative Refinement
# Self-consistency + conservative tie-breaking
# (same three steps as prompt7, with reworded framing and rules)

prompt10 = """
You are rating a Yelp review.

Step 1: Assign an initial star rating (1–5).
Step 2: Re-evaluate focusing only on overall satisfaction.
Step 3: If the two ratings differ, choose the more conservative rating.

Rules:
- Emotional intensity alone should not affect the rating.
- Mixed sentiment usually corresponds to 3 stars.
- 5 stars require strong enthusiasm with no complaints.

Return ONLY valid JSON:
{{
  "predicted_stars": <1-5>,
  "explanation": "Final justification after reconsideration."
}}

Review:
"{review}"
"""


In [50]:
res_v8 = run_experiment(prompt8)
res_v9 = run_experiment(prompt9)
res_v10 = run_experiment(prompt10)

In [51]:
# evaluate() is unchanged from the earlier cell and is reused here.

summary = pd.DataFrame([
    {"Prompt": "8", **evaluate(res_v8)},
    {"Prompt": "9", **evaluate(res_v9)},
    {"Prompt": "10", **evaluate(res_v10)},
])

summary

Unnamed: 0,Prompt,Accuracy,JSON_Validity
0,8,0.9,1.0
1,9,0.7,1.0
2,10,0.6,1.0
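
Since the results are spread over four summary tables, it can help to see them side by side. The accuracies below are copied from the tables above, with the technique names taken from each prompt's comment:

```python
# Accuracy per prompt on the 10-review sample, copied from the summary tables.
results = {
    "1 Sentiment Anchoring + Rating Rubric": 0.7,
    "2 Hidden Chain-of-Thought + Constraint Rules": 0.6,
    "3 Self-Consistency + Calibration": 0.8,
    "4 Star-Definition Grounding": 0.8,
    "5 Decision Tree Prompt": 0.3,
    "6 Contrast-Aware Prompt": 0.7,
    "7 Two-Pass Internal Voting": 0.7,
    "8 Star Semantics + Return Intent": 0.9,
    "9 Calibration + Severity Weighting": 0.7,
    "10 Two-Pass Conservative Refinement": 0.6,
}

# sorted() is stable, so prompts with equal accuracy keep their original order.
ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(f"{acc:.1f}  prompt {name}")
```

Note that prompts 7 and 10 use the same three steps yet differ by one review, which gives a sense of how much wording alone can move the result on a sample this small.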


### Prompt 3: Self-Consistency and Calibration

Among the first three prompts, the best result came from adding a self-check and calibration step.

Instead of asking the model to immediately predict a rating, the prompt asks it to:

- Choose an initial rating
- Reconsider whether a rating one star higher or lower would be more reasonable

This is intended to reduce overconfident predictions and help correct borderline cases, especially between 3 and 4 stars. Additional guidelines were added to prevent emotional language, sarcasm, or review length from dominating the decision.

### Why it works:
Encouraging the model to reflect on its own decision appears to improve reliability. The prompt scored higher than the sentiment-rubric prompt (prompt 1, 0.70) and the hidden chain-of-thought prompt (prompt 2, 0.60), but still struggled with the 3-star vs 4-star boundary.

### Accuracy achieved: ~0.80 (8 of 10 sampled reviews)
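
The reconsideration step in prompt 3 happens inside a single model call. The term self-consistency more commonly refers to sampling several independent answers and taking a vote, which is a different technique from what prompt 3 does. A sketch of the voting step is shown here for comparison only, assuming the ratings were already collected from repeated calls at a non-zero temperature (`majority_vote` is a hypothetical helper, and the tie-breaking rule is one arbitrary choice):

```python
from collections import Counter

def majority_vote(ratings):
    """Return the most frequent rating; ties go to the rating closest to the median."""
    valid = sorted(r for r in ratings if r is not None)
    if not valid:
        return None
    counts = Counter(valid)
    top = max(counts.values())
    tied = [r for r, c in counts.items() if c == top]
    median = valid[len(valid) // 2]
    # The tuple key breaks any remaining tie toward the lower rating.
    return min(tied, key=lambda r: (abs(r - median), r))

print(majority_vote([4, 4, 3, 5, 4]))  # 4 appears three times
```

Voting across calls costs several requests per review, whereas the in-prompt reconsideration used here costs one.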

### Prompt 4: Star-Definition Grounding (Label Semantics)

To address confusion between star levels, I introduced explicit star definitions that describe what each rating means in practical terms (e.g., “would return”, “acceptable but not great”).

This prompt anchors the model to the semantics of Yelp stars, rather than treating them as abstract numeric labels.

### Why it works:
Explicitly defining each star level reduces ambiguity and improves consistency, especially for mid-range ratings. However, the prompt gives the model no explicit cue, such as return intent, for separating borderline 3-star and 4-star reviews.

### Accuracy achieved: ~0.80 (8 of 10 sampled reviews)

### Prompt 8: Star Semantics + Return Intent (Final Best Prompt)

The final and best-performing prompt builds on star-definition grounding and adds behavioral intent inference, specifically focusing on whether the reviewer would return or recommend the business.

This reflects how users tend to assign Yelp ratings:

- A 4-star review often includes small complaints but still indicates willingness to return.
- A 5-star review requires strong satisfaction with no meaningful dissatisfaction.

The prompt explicitly states that:

- Minor issues do not reduce a 4-star rating
- Meaningful dissatisfaction prevents a 5-star rating
- Mixed praise and criticism usually maps to 3 stars
- A reviewer who suggests they would return or recommend should, by preference, get 4 stars

### Why this works best:
This prompt aligns the model’s reasoning with how reviewers appear to use Yelp stars. Instead of optimizing for sentiment polarity, it optimizes for overall experience and intent, which matched the labels in this sample more closely.

### Accuracy achieved: ~0.90 (9 of 10 sampled reviews)
This was the highest accuracy achieved while maintaining 100% valid JSON output.

## Final Takeaway

Across experiments on the 10-review sample, prompts that relied on rigid rules or decision trees underperformed (0.30 for the decision-tree prompt, 0.60 for the hard-constraint prompt 2), which is consistent with the subjective and overlapping nature of Yelp reviews. The best result (0.90) was obtained by grounding the model in clear star semantics and inferring customer intent, rather than enforcing strict sentiment or complaint-based rules.

This suggests that for subjective classification tasks like review ratings, semantic grounding and intent-aware prompting are more effective than rigid rule enforcement.