## ⚠️ Execution Note (API Quota Limitation)

This notebook uses Google Gemini for prompt-based analysis.

At the time of submission, the Gemini free-tier API quota has been exhausted.
As a result, the cells cannot be re-executed without errors.

This notebook is submitted to demonstrate:
- prompt engineering approach
- reasoning and experimentation process
- qualitative evaluation methodology

To ensure reliability in Task 2, a different LLM provider (OpenRouter) was intentionally used.
This limitation and decision are documented transparently in the report.


# Prompt Engineering Analysis for Yelp Rating Prediction

This notebook accompanies Task 1 of the Fynd AI Intern take-home assignment.

The objective is to evaluate how different prompt designs affect:
- prediction accuracy of Yelp star ratings
- reliability of structured (JSON) outputs

Three prompting strategies are evaluated using the same dataset subset and metrics.


In [None]:
import json
import time

import pandas as pd
import google.generativeai as genai
from google.api_core.exceptions import TooManyRequests
from google.colab import userdata

In [None]:
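df = pd.read_csv("yelp.csv")

# Keep only the review text and the ground-truth rating.
df = df[['text', 'stars']].rename(columns={'text': 'review_text'})

# Fixed seed so the same 40 reviews are used for every prompt version.
df = df.sample(40, random_state=42).reset_index(drop=True)

# Check for missing values in the sampled subset.
df.isnull().sum()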

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Dataset Preparation

The Yelp reviews dataset originally contains a `text` column representing customer reviews
and a `stars` column representing the ground-truth rating (1–5).

For clarity within the analysis pipeline:
- `text` was renamed to `review_text`
- a random subset of 40 reviews was sampled

The sample was limited to 40 reviews because of free-tier API rate limits. It is enough to compare
prompt behavior qualitatively, though too small for precise accuracy estimates.


In [None]:
genai.configure(api_key=userdata.get("GOOGLE_API_KEY"))
model = genai.GenerativeModel("models/gemini-flash-latest")

In [None]:
def call_llm(prompt, max_retries=5):
    """Send a prompt to Gemini; return the response text, or "" if every attempt is rate-limited."""
    wait_time = 6  # seconds, fixed wait between retries
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            return response.text
        except TooManyRequests:
            print(f"Rate limit hit (attempt {attempt + 1}/{max_retries}). Waiting {wait_time}s...")
            time.sleep(wait_time)
    return ""


In [None]:
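def parse_llm_output(output):
    """Parse the model output as JSON; return None if it is not valid JSON."""
    try:
        cleaned = output.strip()

        # remove ```json and ``` if Gemini adds them
        if cleaned.startswith("```"):
            cleaned = cleaned.replace("```json", "").replace("```", "").strip()

        return json.loads(cleaned)
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Catch only parsing-related failures instead of using a bare except.
        return None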
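
The fence stripping above handles one common wrapper. As a hedged sketch of a more tolerant alternative, the hypothetical helper `extract_first_json_object` (not part of the original notebook) scans for the first `{` and lets `json.JSONDecoder.raw_decode` parse from that position, so leading or trailing text is ignored. It assumes the model emits a single JSON object somewhere in its reply.

```python
import json
from typing import Optional

_DECODER = json.JSONDecoder()


def extract_first_json_object(output) -> Optional[dict]:
    """Hypothetical helper: return the first JSON object found in `output`, else None."""
    if not isinstance(output, str):
        return None
    start = output.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _end = _DECODER.raw_decode(output, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not a valid object here; try the next opening brace.
        start = output.find("{", start + 1)
    return None


# Illustrative input only; not an actual model response from the notebook.
print(extract_first_json_object('Sure! {"predicted_stars": 2} Hope that helps.'))
```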


## Prompt Version 1 — Naive Baseline

This prompt uses minimal instructions and weak structural constraints.
It serves as a baseline to observe default LLM behavior.

In [None]:
def prompt_v1(review_text):
    return f"""
Predict the Yelp star rating from 1 to 5 for the following review.
Return the result strictly in JSON format with keys:
- predicted_stars
- explanation

Review:
{review_text}
"""

In [None]:
results_v1 = []

for _, row in df.iterrows():
    prompt = prompt_v1(row["review_text"])
    output = call_llm(prompt)
    parsed = parse_llm_output(output)

    results_v1.append({
        "actual_stars": row["stars"],
        "predicted_stars": parsed.get("predicted_stars") if isinstance(parsed, dict) else None,
        "valid_json": parsed is not None
    })

    time.sleep(1.5)

In [None]:
res_v1 = pd.DataFrame(results_v1)
res_v1.head()

### Prompt v1 Results

Accuracy and JSON validity are computed below.
In the recorded run, only 30% of baseline responses parsed as valid JSON, so output reliability is limited.


In [None]:
accuracy_v1 = (res_v1["actual_stars"] == res_v1["predicted_stars"]).mean()
json_validity_v1 = res_v1["valid_json"].mean()

accuracy_v1, json_validity_v1
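
Because unparsed responses have `predicted_stars` set to `None`, the overall accuracy above counts them as misses and cannot exceed the JSON validity rate. A minimal sketch of separating the two effects follows. `summarize_results` is a hypothetical helper, and it assumes the same record layout as `results_v1`.

```python
from typing import Dict, List, Optional


def summarize_results(results: List[dict]) -> Dict[str, Optional[float]]:
    """Hypothetical helper: report validity, overall accuracy, and accuracy on valid rows."""
    n = len(results)
    valid = [r for r in results if r["valid_json"]]
    correct = sum(1 for r in valid if r["predicted_stars"] == r["actual_stars"])
    return {
        "json_validity": len(valid) / n if n else None,
        "overall_accuracy": correct / n if n else None,
        # Accuracy restricted to responses that parsed; None when nothing parsed.
        "accuracy_given_valid": correct / len(valid) if valid else None,
    }


# Toy records for illustration only; these are not results from the notebook.
example = [
    {"actual_stars": 5, "predicted_stars": 5, "valid_json": True},
    {"actual_stars": 3, "predicted_stars": 4, "valid_json": True},
    {"actual_stars": 1, "predicted_stars": None, "valid_json": False},
    {"actual_stars": 2, "predicted_stars": None, "valid_json": False},
]
print(summarize_results(example))
```

Reporting `accuracy_given_valid` alongside the overall figure makes it easier to tell whether a prompt predicts poorly or simply fails to produce parseable output.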

## Prompt Version 2 — Strict JSON Enforcement

This prompt enforces JSON-only output with a fixed schema.
The goal is to improve parsing reliability.
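
Prompt-side rules can be paired with a check on the parsing side. The sketch below is an assumption about how such a check could look, not code from the original run. The hypothetical helper `validate_prediction` accepts a parsed object only if `explanation` is a string and `predicted_stars` is an integer from 1 to 5, also tolerating the digit strings "1" to "5" and whole-number floats such as 4.0.

```python
from typing import Any, Optional


def validate_prediction(parsed: Any) -> Optional[int]:
    """Hypothetical helper: return the rating as an int in 1..5, or None if the schema is violated."""
    if not isinstance(parsed, dict):
        return None
    if not isinstance(parsed.get("explanation"), str):
        return None

    stars = parsed.get("predicted_stars")
    if isinstance(stars, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        return None
    if isinstance(stars, str) and stars.strip() in {"1", "2", "3", "4", "5"}:
        stars = int(stars.strip())
    elif isinstance(stars, float) and stars.is_integer():
        stars = int(stars)

    if isinstance(stars, int) and 1 <= stars <= 5:
        return stars
    return None
```

Normalizing the rating to an `int` also avoids false mismatches when the model returns `"4"` and the ground truth is `4`.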

In [None]:
def prompt_v2(review_text):
    return f"""
You are a classification system.

Your task:
- Predict the Yelp star rating from 1 to 5 for the given review.

STRICT RULES:
- Respond with ONLY valid JSON.
- Do NOT include markdown.
- Do NOT include explanations outside JSON.
- Do NOT include any extra text.

JSON SCHEMA (follow exactly):
{{
  "predicted_stars": <integer from 1 to 5>,
  "explanation": "<one short sentence>"
}}

Review:
{review_text}
"""

In [None]:
results_v2 = []

for _, row in df.iterrows():
    prompt = prompt_v2(row["review_text"])
    output = call_llm(prompt)
    parsed = parse_llm_output(output)

    results_v2.append({
        "actual_stars": row["stars"],
        "predicted_stars": parsed.get("predicted_stars") if isinstance(parsed, dict) else None,
        "valid_json": parsed is not None
    })

    time.sleep(1.5)


In [None]:
res_v2 = pd.DataFrame(results_v2)
res_v2.head()

### Prompt v2 Observations

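Although the output format was more tightly constrained, both metrics dropped in the recorded run:
accuracy fell from 0.175 to 0.125 and JSON validity from 0.30 to 0.225.
With only 40 reviews, this suggests, but does not establish, a trade-off between structure enforcement and reasoning quality.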

In [None]:
accuracy_v2 = (res_v2["actual_stars"] == res_v2["predicted_stars"]).mean()
json_validity_v2 = res_v2["valid_json"].mean()

accuracy_v2, json_validity_v2

## Prompt Version 3 — Reasoned and Structured Prompt

This prompt embeds reasoning within the JSON output to balance
prediction quality and structured reliability.

In [None]:
def prompt_v3(review_text):
    return f"""
You are an expert sentiment analyst.

Step 1: Analyze the review and identify:
- overall sentiment (positive / neutral / negative)
- key positive points
- key negative points

Step 2: Based on this analysis, decide the most appropriate Yelp star rating (1 to 5).

STRICT OUTPUT RULES:
- Your final answer MUST be valid JSON only.
- Do NOT include markdown.
- Do NOT include analysis text outside JSON.

JSON FORMAT (follow exactly):
{{
  "predicted_stars": <integer from 1 to 5>,
  "explanation": "<brief justification based on positives and negatives>"
}}

Review:
{review_text}
"""

In [None]:
results_v3 = []
for _, row in df.iterrows():
    prompt = prompt_v3(row["review_text"])
    output = call_llm(prompt)
    parsed = parse_llm_output(output)

    results_v3.append({
        "actual_stars": row["stars"],
        "predicted_stars": parsed.get("predicted_stars") if isinstance(parsed, dict) else None,
        "valid_json": parsed is not None
    })

    time.sleep(1.5)

### Execution Note

Full execution of Prompt v3 was constrained by free-tier API rate limits.
Rate-limit responses kept recurring despite the retry logic in `call_llm` (a fixed 6-second wait, up to 5 attempts), so no v3 metrics are reported.

The prompt design is retained for completeness and comparison.
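
As a sketch of what backoff could look like beyond a fixed wait, the helper below doubles the delay after each failed attempt and adds a little random jitter. It is an illustration under stated assumptions, not the notebook's original logic: `retry_with_backoff`, `RateLimitError`, and all parameter names are hypothetical, and the `sleep` argument exists only so the behavior can be checked without real waiting.

```python
import random
import time
from typing import Callable, Tuple, Type


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def retry_with_backoff(
    fn: Callable[[], str],
    retry_on: Tuple[Type[BaseException], ...] = (RateLimitError,),
    max_retries: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn(); on a retryable error wait base_delay * 2**attempt (capped, plus jitter), then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                break  # no point in waiting after the final attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            delay += random.uniform(0, delay * 0.1)  # up to 10% jitter
            sleep(delay)
    return ""  # same fallback as call_llm: empty string when all attempts fail
```

In the notebook this could wrap the Gemini call as `retry_with_backoff(lambda: model.generate_content(prompt).text, retry_on=(TooManyRequests,))`. Backoff only helps with per-minute limits; it cannot recover from an exhausted daily quota.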


In [None]:
res_v3 = pd.DataFrame(results_v3)
res_v3.head()

In [None]:
accuracy_v3 = (res_v3["actual_stars"] == res_v3["predicted_stars"]).mean()
json_validity_v3 = res_v3["valid_json"].mean()
accuracy_v3, json_validity_v3

## Comparative Summary

| Prompt | Accuracy | JSON Validity | Key Observation |
|-------|----------|---------------|----------------|
| v1 | 0.175 | 0.30 | Weak structure |
| v2 | 0.125 | 0.225 | Structure reduced reasoning |
| v3 | N/A | N/A (expected highest) | Balanced design; not measured due to quota limits |
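
With 40 reviews, the accuracies in the table correspond to 7 and 5 correct predictions (0.175 × 40 and 0.125 × 40). As a rough check on how much weight that gap can bear, the sketch below computes a 95% Wilson score interval for each. The helper name `wilson_interval` is hypothetical, and the interval is an addition for illustration, not part of the original evaluation.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Counts derived from the table: 0.175 * 40 = 7 and 0.125 * 40 = 5.
for name, correct in [("v1", 7), ("v2", 5)]:
    low, high = wilson_interval(correct, 40)
    print(f"{name}: {correct}/40 correct, 95% interval {low:.3f} to {high:.3f}")
```

The two intervals overlap substantially, so the v1 versus v2 accuracy difference is within what sampling noise alone could produce at this sample size.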



## Key Takeaways

- Prompt design strongly influences both accuracy and reliability
- Strict structure alone does not guarantee better performance
- Reasoning must be explicitly structured for automated pipelines
- External constraints such as API limits are a real-world consideration