# **Sentiment Analysis on Restaurant Reviews using LLMs**

----

## **1. Business Context**  
Analyzing **customer sentiment** in **Yelp restaurant reviews** helps businesses improve service and manage their reputation. **LLMs** such as **Claude 3 Sonnet** and **LLaMA 3** can classify sentiment without labeled training data through **zero-shot and few-shot prompting**. This project compares both prompting methods and both models on **Amazon Bedrock**.

## **2. Target Variables**  
- **Input:** Review text  
- **Output (Sentiment Classification):**  
  - **Positive (1)** – Favorable experience  
  - **Negative (0)** – Unfavorable experience  
  - **Neutral** – Ignored in evaluation  
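
The mapping above can be sketched as a small parser for raw model replies. This is a minimal illustration, not the notebook's own code: `parse_sentiment` and `LABELS` are hypothetical names, and stripping non-letter characters is an assumption about how replies such as `Positive.` should be treated.

```python
import re
from typing import Optional

# Hypothetical label table mirroring the target variables above.
LABELS = {"positive": 1, "negative": 0}

def parse_sentiment(reply: str) -> Optional[int]:
    """Return 1 for Positive, 0 for Negative, None for anything else."""
    # Keep letters only, so "Positive." or " negative\n" still match.
    token = re.sub(r"[^a-z]", "", reply.lower())
    return LABELS.get(token)

for raw in ["Positive.", " negative\n", "Neutral", "Not positive at all"]:
    print(repr(raw), "->", parse_sentiment(raw))
```

Replies that contain more than the single label word return `None`, so they are excluded from evaluation rather than guessed at.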

## **3. Approach**  

### **Step 1: Data Preparation**  
- **Dataset:** Yelp restaurant reviews (Arizona)  
- **Sampling:** 100 reviews (50 positive, 50 negative) drawn at random with `random_state=42`  

### **Step 2: Zero-Shot Sentiment Analysis**  
- **Model:** Claude 3 Sonnet  
- **Evaluation:** Predictions, precision, recall, F1-score, accuracy  
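
For reference, the evaluation metrics listed above can be computed by hand from the four confusion counts. The sketch below uses toy labels, not data from this project, and `binary_metrics` is a hypothetical helper; the notebook itself relies on scikit-learn's `classification_report`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```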

### **Step 3: Few-Shot Sentiment Analysis**  
- **Method:** Uses 2 positive & 2 negative examples in prompts  
- **Evaluation:** Performance comparison with Zero-Shot  
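
One way to assemble such a prompt is sketched below. It assumes the few-shot examples are held out from the reviews being scored, which is a design choice of this sketch rather than something the notebook does; `split_examples` and `build_few_shot_prompt` are hypothetical helpers, and the reviews are toy data.

```python
import random

def split_examples(reviews, n_per_class=2, seed=42):
    """Pick n examples per label for the prompt; return the rest for scoring."""
    rng = random.Random(seed)
    chosen = set()
    for label in (1, 0):
        pool = [i for i, r in enumerate(reviews) if r["Sentiment"] == label]
        chosen.update(rng.sample(pool, n_per_class))
    examples = [reviews[i] for i in sorted(chosen)]
    held_out = [r for i, r in enumerate(reviews) if i not in chosen]
    return examples, held_out

def build_few_shot_prompt(examples, review_text):
    """Format labeled examples followed by the review to classify."""
    shots = "\n\n".join(
        f"Review: {r['text']}\n"
        f"Sentiment: {'Positive' if r['Sentiment'] == 1 else 'Negative'}"
        for r in examples
    )
    return (
        "Below are example reviews with their sentiment labels.\n\n"
        f"{shots}\n\n"
        "Classify the following review. Respond with one word: "
        '"Positive" or "Negative".\n\n'
        f"Review: {review_text}"
    )

# Toy reviews for illustration only.
reviews = [
    {"text": "Great tacos.", "Sentiment": 1},
    {"text": "Friendly staff.", "Sentiment": 1},
    {"text": "Loved the patio.", "Sentiment": 1},
    {"text": "Cold fries.", "Sentiment": 0},
    {"text": "Rude server.", "Sentiment": 0},
    {"text": "Waited an hour.", "Sentiment": 0},
]
examples, held_out = split_examples(reviews)
prompt = build_few_shot_prompt(examples, held_out[0]["text"])
print(prompt)
```

Keeping the examples out of the scored set avoids showing the model a review together with its label and then asking it to classify that same review.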

### **Step 4: Multi-LLM Sentiment Analysis**  
- **Models:** Claude 3 Sonnet vs. LLaMA 3  
- **Comparison:** Performance metrics, prediction differences  
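
Prediction differences between two models can be surfaced with a simple paired scan. The sketch below works on plain lists with toy values; `find_disagreements` is a hypothetical helper, not part of the notebook.

```python
def find_disagreements(texts, preds_a, preds_b, truth):
    """Return the reviews on which the two models predicted different labels."""
    rows = []
    for text, a, b, t in zip(texts, preds_a, preds_b, truth):
        if a != b:
            rows.append({"text": text, "model_a": a, "model_b": b, "label": t})
    return rows

# Toy predictions for illustration only.
texts = ["Great tacos.", "It was okay.", "Cold fries.", "Best in town, sure."]
truth = [1, 0, 0, 0]
preds_claude = [1, 0, 0, 1]
preds_llama = [1, 1, 0, 0]

diffs = find_disagreements(texts, preds_claude, preds_llama, truth)
for row in diffs:
    print(row)
```

Reading the disagreeing reviews directly is usually more informative than comparing aggregate scores alone, especially on a sample this small.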

### **Step 5: Discussion & Observations**  
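- **Few-shot prompting** did not improve accuracy over zero-shot (0.95 vs. 0.96 with Claude 3 Sonnet)  
- **LLaMA 3 slightly outperforms Claude 3 Sonnet** in accuracy (0.95 vs. 0.93), recall & precision  
- **Misclassifications** likely stem from sarcasm, mixed sentiment, and ambiguous language  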

## **4. Business Insights**  
- **Automated sentiment analysis** helps track service trends  
- **Identifying recurring complaints** improves operations  
- **Competitive benchmarking** aids market positioning  
- **LLMs reduce manual review classification, enhancing efficiency**  


---

# Code Cell 1 - Loading Data

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import requests
import json
import boto3
import time
from sklearn.metrics import classification_report, accuracy_score
import random
from tqdm import tqdm
random.seed(42) # set the seed

In [None]:
df = pd.read_csv("restaurant_reviews_az.csv")

df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48147 entries, 0 to 48146
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_id    48147 non-null  object
 1   user_id      48147 non-null  object
 2   business_id  48147 non-null  object
 3   stars        48147 non-null  int64 
 4   useful       48147 non-null  int64 
 5   funny        48147 non-null  int64 
 6   cool         48147 non-null  int64 
 7   text         48147 non-null  object
 8   date         48147 non-null  object
 9   Sentiment    48147 non-null  int64 
dtypes: int64(5), object(5)
memory usage: 3.7+ MB


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,Sentiment
0,IVS7do_HBzroiCiymNdxDg,fdFgZQQYQJeEAshH4lxSfQ,sGy67CpJctjeCWClWqonjA,3,1,1,0,"OK, the hype about having Hatch chili in your ...",1/27/2020 22:59,1
1,QP2pSzSqpJTMWOCuUuyXkQ,JBLWSXBTKFvJYYiM-FnCOQ,3w7NRntdQ9h0KwDsksIt5Q,5,1,1,1,Pandemic pit stop to have an ice cream.... onl...,4/19/2020 5:33,1
2,oK0cGYStgDOusZKz9B1qug,2_9fKnXChUjC5xArfF8BLg,OMnPtRGmbY8qH_wIILfYKA,5,1,0,0,I was lucky enough to go to the soft opening a...,2/29/2020 19:43,1
3,E_ABvFCNVLbfOgRg3Pv1KQ,9MExTQ76GSKhxSWnTS901g,V9XlikTxq0My4gE8LULsjw,5,0,0,0,I've gone to claim Jumpers all over the US and...,3/14/2020 21:47,1
4,Rd222CrrnXkXukR2iWj69g,LPxuausjvDN88uPr-Q4cQA,CA5BOxKRDPGJgdUQ8OUOpw,4,1,0,0,"If you haven't been to Maynard's kitchen, it'...",1/17/2020 20:32,1
...,...,...,...,...,...,...,...,...,...,...
48142,C0uxsLrbiwQ8a6S9P3bj2A,IVm1abIK9OWCsrkmEAM7iQ,kldw3rf8_T6J2LxsetQ5UA,4,1,0,0,"Wow, there are a lot of 1 star reviews for thi...",11/21/2021 0:16,1
48143,Zu4ng3tjf_2oa9LlnEvUmQ,ItQzeC91hJF6qvvE7-OZmQ,Yzh7Xo1_JBDWUl2BzRiYaQ,5,0,0,0,"Fresh and delicious food, served fast. What mo...",10/30/2021 20:17,1
48144,jacDcaIWSPdZq2bDq1GD_g,Y-mwrjOx29pnJX0MCBb2Yg,9VRmMY9vGhGKGz9hiGoEUw,1,0,0,0,If I could leave no stars I would. I understan...,11/28/2021 14:23,0
48145,dkGbETTcSQZTwHSnAMnLUw,_RmG_5kxRPgTWP7RptaFgQ,Bq0CQcwk5R8yhm-MGfHxCA,5,6,2,4,Bosnian food?? \n\n--- location. This is a HID...,1/5/2020 4:20,1


---

# Code Cell 2 - Data Preprocessing

In [None]:
positive_reviews = df[df['Sentiment'] == 1].sample(50, random_state=42)
negative_reviews = df[df['Sentiment'] == 0].sample(50, random_state=42)

# Combine selected reviews into a new dataset
selected_reviews = pd.concat([positive_reviews, negative_reviews]).reset_index(drop=True)

display(selected_reviews.head())

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,Sentiment
0,nAN_rYmPh82T1WzlROCcsw,8XeTv8Js_8um5Ht1Qnb0qw,n9kqlp48MzXB--LKoRjQhA,5,1,0,0,"First time ordering online, easy transaction, ...",1/4/2020 21:54,1
1,B0hgC22SWvPBStXv8jzSmw,8IPSQT6yPWmqxafauO4LrA,isrmmF6K_OZC2maNStwYNQ,5,0,0,0,This might be my favorite restaurant in Tucson...,7/18/2020 2:32,1
2,8j3H4k2gthWI3-AuqgqWzw,J7qboaD38ra2I0EMb3dqHA,eN-Zrz1orLoqIb7D6mUMbg,4,1,1,0,Ordered at the window! Because I ordered the b...,8/5/2020 21:52,1
3,KDtcFDryEmyJ0xZG_yy4rQ,vEFMvU78DrLZVKV1h-ZOpg,RSrBPqSze2HJkx5DZsm7FA,5,2,0,1,"Super good food, most of the menu is made from...",10/19/2020 4:16,1
4,I09Lr_K3DofaIHSiylVTFQ,DgtXZzWytUNWKyjq9i4dhw,l7FBm3yxW0dx0WqQVlcQ1Q,5,0,0,0,This place has amazing wings! The mild and buf...,1/10/2020 1:57,1


In [None]:
# AWS Bedrock runtime client setup
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Define Model ID for Claude 3 Sonnet
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

# Load dataset
df = pd.read_csv("restaurant_reviews_az.csv")

# Convert Sentiment column to integer if necessary
df["Sentiment"] = df["Sentiment"].astype(int)

# Select 50 positive and 50 negative reviews randomly
positive_reviews = df[df["Sentiment"] == 1].sample(n=50, random_state=42)
negative_reviews = df[df["Sentiment"] == 0].sample(n=50, random_state=42)

# Combine into a dataset
selected_reviews = pd.concat([positive_reviews, negative_reviews]).reset_index(drop=True)

# Display first few rows
print("Sample Data")
display(selected_reviews.head())

Sample Data


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,Sentiment
0,nAN_rYmPh82T1WzlROCcsw,8XeTv8Js_8um5Ht1Qnb0qw,n9kqlp48MzXB--LKoRjQhA,5,1,0,0,"First time ordering online, easy transaction, ...",1/4/2020 21:54,1
1,B0hgC22SWvPBStXv8jzSmw,8IPSQT6yPWmqxafauO4LrA,isrmmF6K_OZC2maNStwYNQ,5,0,0,0,This might be my favorite restaurant in Tucson...,7/18/2020 2:32,1
2,8j3H4k2gthWI3-AuqgqWzw,J7qboaD38ra2I0EMb3dqHA,eN-Zrz1orLoqIb7D6mUMbg,4,1,1,0,Ordered at the window! Because I ordered the b...,8/5/2020 21:52,1
3,KDtcFDryEmyJ0xZG_yy4rQ,vEFMvU78DrLZVKV1h-ZOpg,RSrBPqSze2HJkx5DZsm7FA,5,2,0,1,"Super good food, most of the menu is made from...",10/19/2020 4:16,1
4,I09Lr_K3DofaIHSiylVTFQ,DgtXZzWytUNWKyjq9i4dhw,l7FBm3yxW0dx0WqQVlcQ1Q,5,0,0,0,This place has amazing wings! The mild and buf...,1/10/2020 1:57,1


---

# Code Cell 3 - Sentiment Analysis using Zero-Shot Learning

In [None]:
# AWS Bedrock runtime client setup
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Define Model ID for Claude 3 Sonnet
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

# Load dataset
df = pd.read_csv("restaurant_reviews_az.csv")

# Convert Sentiment column to integer if necessary
df["Sentiment"] = df["Sentiment"].astype(int)

# Select 50 positive and 50 negative reviews randomly
positive_reviews = df[df["Sentiment"] == 1].sample(n=50, random_state=42)
negative_reviews = df[df["Sentiment"] == 0].sample(n=50, random_state=42)

# Combine into a dataset
selected_reviews = pd.concat([positive_reviews, negative_reviews]).reset_index(drop=True)

# Function for Zero-Shot Sentiment Analysis using Claude 3 Sonnet
def get_claude_sentiment(review_text):
    prompt = f"""Perform sentiment analysis on the given review.
    Only respond with one word: "Positive" or "Negative".
    Review: {review_text}
    """

    native_request = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 5,
        "temperature": 0.4,
        "messages": [{"role": "user", "content": prompt}]
    }

    request = json.dumps(native_request)
    response = client.invoke_model(modelId=model_id, body=request)

    model_response = json.loads(response["body"].read())
    # Default to [{}] so the chained .get() does not fail on a str when "content" is missing
    prediction = model_response.get("content", [{}])[0].get("text", "").strip().lower()

    # Ensure only "positive" or "negative" is accepted
    if prediction in ["positive", "negative"]:
        sentiment_value = 1 if prediction == "positive" else 0
    else:
        sentiment_value = None  # Ignore invalid or Neutral responses

    return sentiment_value

# Apply Zero-Shot Sentiment Analysis
selected_reviews["Predicted_Sentiment_Claude"] = selected_reviews["text"].apply(get_claude_sentiment)

# Drop Neutral or invalid responses
selected_reviews = selected_reviews.dropna(subset=["Predicted_Sentiment_Claude"]).copy()  # copy avoids SettingWithCopyWarning below
selected_reviews["Predicted_Sentiment_Claude"] = selected_reviews["Predicted_Sentiment_Claude"].astype(int)

# Count the number of Positive & Negative predictions
pos_count = (selected_reviews["Predicted_Sentiment_Claude"] == 1).sum()
neg_count = (selected_reviews["Predicted_Sentiment_Claude"] == 0).sum()

# Print counts
print(f"\nCount (Ignoring Neutral cases) ~ ")
print(f"\nTotal Positive Predictions = {pos_count}")
print(f"\nTotal Negative Predictions = {neg_count}")

# Print sample of 5 reviews for each sentiment
print("\n === Sample of 5 Positive Predictions === ")
display(selected_reviews[selected_reviews["Predicted_Sentiment_Claude"] == 1].head(5)[["text", "Predicted_Sentiment_Claude"]])

print("\n === Sample of 5 Negative Predictions === ")
display(selected_reviews[selected_reviews["Predicted_Sentiment_Claude"] == 0].head(5)[["text", "Predicted_Sentiment_Claude"]])



Count (Ignoring Neutral cases) ~ 

Total Positive Predictions = 50

Total Negative Predictions = 49

 === Sample of 5 Positive Predictions === 


Unnamed: 0,text,Predicted_Sentiment_Claude
0,"First time ordering online, easy transaction, ...",1
1,This might be my favorite restaurant in Tucson...,1
2,Ordered at the window! Because I ordered the b...,1
3,"Super good food, most of the menu is made from...",1
4,This place has amazing wings! The mild and buf...,1



 === Sample of 5 Negative Predictions === 


Unnamed: 0,text,Predicted_Sentiment_Claude
6,This place was ok. Service not very good but w...,0
10,You have to know that you can't get the Bento ...,0
50,Not sure about the food but definitely crappy ...,0
51,Last time I went to this McDonald's the French...,0
52,"This place wasn't terrible, but it also wasn't...",0


In [None]:
# Evaluate Model (Ignoring Neutral cases)
y_true = selected_reviews["Sentiment"]
y_pred = selected_reviews["Predicted_Sentiment_Claude"]

# Print results
accuracy = accuracy_score(y_true, y_pred)
print(f"\nClaude 3 Sonnet Zero-Shot Accuracy = {accuracy:.2f}")
print("\nCLASSIFICATION REPORT\n", classification_report(y_true, y_pred, target_names=["Negative", "Positive"]))


Claude 3 Sonnet Zero-Shot Accuracy = 0.96

CLASSIFICATION REPORT
               precision    recall  f1-score   support

    Negative       0.96      0.96      0.96        49
    Positive       0.96      0.96      0.96        50

    accuracy                           0.96        99
   macro avg       0.96      0.96      0.96        99
weighted avg       0.96      0.96      0.96        99



---

# Code Cell 4 - Sentiment Analysis using Few-Shot Learning

In [None]:
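# Select 2 positive and 2 negative reviews as few-shot examples
# Note: the examples are sampled from selected_reviews, the same set evaluated below,
# so these 4 reviews appear in the prompt with their labels when they are classified.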
few_shot_examples = "\n\n".join(
    f"Review: {row['text']}\nSentiment: {'Positive' if row['Sentiment'] == 1 else 'Negative'}"
    for _, row in selected_reviews.groupby("Sentiment").apply(lambda x: x.sample(2, random_state=42)).iterrows()
)

# Function for Few-Shot Sentiment Analysis using Claude 3 Sonnet
def get_claude_few_shot_sentiment(review_text, examples):
    prompt = f"""
    Below are a few example reviews with their sentiment labels. Use these examples to classify the sentiment of the new review.

    {examples}

    Now classify the sentiment of the following review. Only respond with one word: "Positive" or "Negative".

    Review: {review_text}
    """

    native_request = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 5,
        "temperature": 0.4,
        "messages": [{"role": "user", "content": prompt}]
    }

    request = json.dumps(native_request)
    response = client.invoke_model(modelId=model_id, body=request)

    model_response = json.loads(response["body"].read())
    # Default to [{}] so the chained .get() does not fail on a str when "content" is missing
    prediction = model_response.get("content", [{}])[0].get("text", "").strip().lower()

    # Ensure only "positive" or "negative" is accepted
    if prediction in ["positive", "negative"]:
        sentiment_value = 1 if prediction == "positive" else 0
    else:
        sentiment_value = None  # Ignore invalid responses

    return sentiment_value

# Apply Few-Shot Sentiment Analysis
selected_reviews["Predicted_Sentiment_FewShot"] = selected_reviews["text"].apply(
    lambda x: get_claude_few_shot_sentiment(x, few_shot_examples)
)

# Drop Neutral or invalid responses
selected_reviews = selected_reviews.dropna(subset=["Predicted_Sentiment_FewShot"]).copy()  # copy avoids SettingWithCopyWarning below
selected_reviews["Predicted_Sentiment_FewShot"] = selected_reviews["Predicted_Sentiment_FewShot"].astype(int)

# Count the number of Positive & Negative predictions
pos_count = (selected_reviews["Predicted_Sentiment_FewShot"] == 1).sum()
neg_count = (selected_reviews["Predicted_Sentiment_FewShot"] == 0).sum()

# Print counts
print(f"\n ===== FEW-SHOT ===== ")
print(f"\nCount (Ignoring Neutral cases) ~ ")
print(f"\nTotal Positive Predictions = {pos_count}")
print(f"\nTotal Negative Predictions = {neg_count}")

# Print sample of 5 reviews for each sentiment
print("\n === Sample of 5 Positive Predictions === ")
display(selected_reviews[selected_reviews["Predicted_Sentiment_FewShot"] == 1].head(5)[["text", "Predicted_Sentiment_FewShot"]])

print("\n === Sample of 5 Negative Predictions === ")
display(selected_reviews[selected_reviews["Predicted_Sentiment_FewShot"] == 0].head(5)[["text", "Predicted_Sentiment_FewShot"]])


 ===== FEW-SHOT ===== 

Count (Ignoring Neutral cases) ~ 

Total Positive Predictions = 51

Total Negative Predictions = 48

 === Sample of 5 Positive Predictions === 


Unnamed: 0,text,Predicted_Sentiment_FewShot
0,"First time ordering online, easy transaction, ...",1
1,This might be my favorite restaurant in Tucson...,1
2,Ordered at the window! Because I ordered the b...,1
3,"Super good food, most of the menu is made from...",1
4,This place has amazing wings! The mild and buf...,1



 === Sample of 5 Negative Predictions === 


Unnamed: 0,text,Predicted_Sentiment_FewShot
6,This place was ok. Service not very good but w...,0
10,You have to know that you can't get the Bento ...,0
50,Not sure about the food but definitely crappy ...,0
51,Last time I went to this McDonald's the French...,0
52,"This place wasn't terrible, but it also wasn't...",0


In [None]:
# Evaluate Model (Ignoring Neutral cases)
y_true = selected_reviews["Sentiment"]
y_pred = selected_reviews["Predicted_Sentiment_FewShot"]

# Print results
accuracy = accuracy_score(y_true, y_pred)
print(f"\nClaude 3 Sonnet Few-Shot Accuracy = {accuracy:.2f}")
print("\nCLASSIFICATION REPORT\n", classification_report(y_true, y_pred, target_names=["Negative", "Positive"]))


Claude 3 Sonnet Few-Shot Accuracy = 0.95

CLASSIFICATION REPORT
               precision    recall  f1-score   support

    Negative       0.96      0.94      0.95        49
    Positive       0.94      0.96      0.95        50

    accuracy                           0.95        99
   macro avg       0.95      0.95      0.95        99
weighted avg       0.95      0.95      0.95        99



---

# Code Cell 5 - Experiment with Multiple LLMs

## 5.1 - Claude 3 Sonnet

In [None]:
# Define Model IDs
model_id_claude = "anthropic.claude-3-sonnet-20240229-v1:0"
model_id_llama = "meta.llama3-70b-instruct-v1:0"

# Function to Invoke LLM Models
def get_model_response(review_text, model_id):
    prompt = f"""Review: {review_text}
    Predict the sentiment of this review. Respond only with 'Positive' or 'Negative'. No explanations.
    """

    if model_id == model_id_llama:
        # LLaMA prompt formatting
        formatted_prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|> "
        native_request = {
            "prompt": formatted_prompt,
            "max_gen_len": 10,  # Reduce output length to prevent verbose responses
            "temperature": 0.3,
        }
    else:
        # Claude prompt formatting
        native_request = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 5,
            "temperature": 0.3,
            "messages": [{"role": "user", "content": prompt}]
        }

    request = json.dumps(native_request)
    response = client.invoke_model(modelId=model_id, body=request)

    model_response = json.loads(response["body"].read())

    # Extract prediction and enforce strict sentiment classification
    if model_id == model_id_llama:
        prediction = model_response.get("generation", "").strip().lower()
    else:
        # Default to [{}] so the chained .get() does not fail on a str when "content" is missing
        prediction = model_response.get("content", [{}])[0].get("text", "").strip().lower()

    if "positive" in prediction:
        return 1
    elif "negative" in prediction:
        return 0
    else:
        return None  # Considered neutral

# Apply Sentiment Analysis using both Claude 3 and LLaMA 3
selected_reviews["Sentiment_Claude"] = selected_reviews["text"].apply(lambda x: get_model_response(x, model_id_claude))
selected_reviews["Sentiment_LLaMA"] = selected_reviews["text"].apply(lambda x: get_model_response(x, model_id_llama))

# Count neutral responses before removing them
neutral_count_claude = selected_reviews["Sentiment_Claude"].isna().sum()
neutral_count_llama = selected_reviews["Sentiment_LLaMA"].isna().sum()

# Drop only neutral responses
selected_reviews = selected_reviews.dropna(subset=["Sentiment_Claude", "Sentiment_LLaMA"]).copy()  # copy avoids SettingWithCopyWarning below
selected_reviews["Sentiment_Claude"] = selected_reviews["Sentiment_Claude"].astype(int)
selected_reviews["Sentiment_LLaMA"] = selected_reviews["Sentiment_LLaMA"].astype(int)

# Count sentiment predictions per model
claude_pos = (selected_reviews["Sentiment_Claude"] == 1).sum()
claude_neg = (selected_reviews["Sentiment_Claude"] == 0).sum()
llama_pos = (selected_reviews["Sentiment_LLaMA"] == 1).sum()
llama_neg = (selected_reviews["Sentiment_LLaMA"] == 0).sum()

# Print sentiment counts
print(f"\n ===== CLAUDE 3 SONNET ===== ")
print(f"\nCount ~ ")
print(f"\nTotal Positive Predictions = {claude_pos}")
print(f"\nTotal Negative Predictions = {claude_neg}")
print(f"\nNeutral Predictions = {neutral_count_claude}")

# Print sample of 5 positive and 5 negative reviews
print("\n === Sample of 5 Positive Predictions === ")
display(selected_reviews[selected_reviews["Sentiment_Claude"] == 1].head(5)[["text", "Sentiment_Claude"]])

print("\n === Sample of 5 Negative Predictions === ")
display(selected_reviews[selected_reviews["Sentiment_Claude"] == 0].head(5)[["text", "Sentiment_Claude"]])



 ===== CLAUDE 3 SONNET ===== 

Count ~ 

Total Positive Predictions = 51

Total Negative Predictions = 47

Neutral Predictions = 0

 === Sample of 5 Positive Predictions === 


Unnamed: 0,text,Sentiment_Claude
0,"First time ordering online, easy transaction, ...",1
1,This might be my favorite restaurant in Tucson...,1
2,Ordered at the window! Because I ordered the b...,1
3,"Super good food, most of the menu is made from...",1
4,This place has amazing wings! The mild and buf...,1



 === Sample of 5 Negative Predictions === 


Unnamed: 0,text,Sentiment_Claude
6,This place was ok. Service not very good but w...,0
10,You have to know that you can't get the Bento ...,0
26,This place is one of the best Chinese restaura...,0
50,Not sure about the food but definitely crappy ...,0
51,Last time I went to this McDonald's the French...,0


---

## 5.2 - LLaMA 3

In [None]:
# Print sentiment counts
print(f"\n ===== LLaMA 3 ===== ")
print(f"\nCount ~ ")
print(f"\nTotal Positive Predictions (LLaMA 3) = {llama_pos}")
print(f"\nTotal Negative Predictions (LLaMA 3) = {llama_neg}")
print(f"\nNeutral Predictions (LLaMA 3) = {neutral_count_llama}")


# Print sample of 5 positive and 5 negative reviews
print("\n === Sample of 5 Positive Predictions === ")
display(selected_reviews[selected_reviews["Sentiment_LLaMA"] == 1].head(5)[["text", "Sentiment_LLaMA"]])

print("\n === Sample of 5 Negative Predictions === ")
display(selected_reviews[selected_reviews["Sentiment_LLaMA"] == 0].head(5)[["text", "Sentiment_LLaMA"]])



 ===== LLaMA 3 ===== 

Count ~ 

Total Positive Predictions (LLaMA 3) = 51

Total Negative Predictions (LLaMA 3) = 47

Neutral Predictions (LLaMA 3) = 1

 === Sample of 5 Positive Predictions === 


Unnamed: 0,text,Sentiment_LLaMA
0,"First time ordering online, easy transaction, ...",1
1,This might be my favorite restaurant in Tucson...,1
2,Ordered at the window! Because I ordered the b...,1
3,"Super good food, most of the menu is made from...",1
4,This place has amazing wings! The mild and buf...,1



 === Sample of 5 Negative Predictions === 


Unnamed: 0,text,Sentiment_LLaMA
6,This place was ok. Service not very good but w...,0
10,You have to know that you can't get the Bento ...,0
50,Not sure about the food but definitely crappy ...,0
51,Last time I went to this McDonald's the French...,0
52,"This place wasn't terrible, but it also wasn't...",0


---

## 5.3 Comparison

In [None]:
# Claude 3 Sonnet
y_true = selected_reviews["Sentiment"]
y_pred_claude = selected_reviews["Sentiment_Claude"]

accuracy_claude = accuracy_score(y_true, y_pred_claude)
report_claude = classification_report(y_true, y_pred_claude, target_names=["Negative", "Positive"])

# LLaMA 3
y_pred_llama = selected_reviews["Sentiment_LLaMA"]

accuracy_llama = accuracy_score(y_true, y_pred_llama)
report_llama = classification_report(y_true, y_pred_llama, target_names=["Negative", "Positive"])

# Display Evaluation Results
print("\n ===== CLAUDE 3 SONNET ~ MODEL PERFORMANCE =====")
print(f"Accuracy = {accuracy_claude:.2f}")
print(report_claude)

print("\n ===== LLaMA 3 ~ MODEL PERFORMANCE =====")
print(f"Accuracy = {accuracy_llama:.2f}")
print(report_llama)

# Compare Model Performance
print("\n ===== MODEL COMPARISON SUMMARY ===== ")
print(f" Accuracy ~")
print(f"CLAUDE 3 SONNET = {accuracy_claude:.2f}")
print(f"LLaMA 3 = {accuracy_llama:.2f}")

# Decision based on accuracy only
if accuracy_claude > accuracy_llama:
    print("\nCLAUDE 3 SONNET performs better overall in sentiment classification.")
elif accuracy_llama > accuracy_claude:
    print("\nLLaMA 3 performs better overall in sentiment classification.")
else:
    print("\nBoth models have similar performance.")



 ===== CLAUDE 3 SONNET ~ MODEL PERFORMANCE =====
Accuracy = 0.93
              precision    recall  f1-score   support

    Negative       0.94      0.92      0.93        48
    Positive       0.92      0.94      0.93        50

    accuracy                           0.93        98
   macro avg       0.93      0.93      0.93        98
weighted avg       0.93      0.93      0.93        98


 ===== LLaMA 3 ~ MODEL PERFORMANCE =====
Accuracy = 0.95
              precision    recall  f1-score   support

    Negative       0.96      0.94      0.95        48
    Positive       0.94      0.96      0.95        50

    accuracy                           0.95        98
   macro avg       0.95      0.95      0.95        98
weighted avg       0.95      0.95      0.95        98


 ===== MODEL COMPARISON SUMMARY ===== 
 Accuracy ~
CLAUDE 3 SONNET = 0.93
LLaMA 3 = 0.95

LLaMA 3 performs better overall in sentiment classification.


---

# Text Cell 6 - Observations


## **1. Zero-Shot vs. Few-Shot Learning Performance**
- Accuracy with Claude 3 Sonnet is **0.96** for **zero-shot** and **0.95** for **few-shot** prompting, a difference of one review out of 99.
- **Precision, Recall, and F1-scores** are nearly identical:
  - **Zero-shot**: 0.96 precision, recall, and F1-score for both classes
  - **Few-shot, Negative Sentiment**: 0.96 (precision), 0.94 (recall), 0.95 (F1-score)
  - **Few-shot, Positive Sentiment**: 0.94 (precision), 0.96 (recall), 0.95 (F1-score)
- **Few-shot learning did not improve performance**, likely because most reviews in the sample carry clear sentiment indicators.
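
To judge whether a gap of this size is meaningful, a confidence interval around each accuracy helps. The sketch below computes a Wilson score interval; the counts of 95 and 94 correct out of 99 are inferred from the rounded accuracies in the classification reports and should be treated as an assumption, and `wilson_interval` is a hypothetical helper.

```python
from math import sqrt

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Assumed counts: 0.96 and 0.95 accuracy on 99 reviews.
zero_lo, zero_hi = wilson_interval(95, 99)
few_lo, few_hi = wilson_interval(94, 99)

print(f"zero-shot: [{zero_lo:.3f}, {zero_hi:.3f}]")
print(f"few-shot:  [{few_lo:.3f}, {few_hi:.3f}]")
```

When the two intervals overlap heavily, a one-review difference is better read as noise than as evidence that one prompting method is stronger.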

## **2. Misclassification Cases**
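- **Responses other than Positive or Negative** were treated as neutral and omitted from evaluation (1 review in the zero-shot run, 1 more in the LLaMA 3 run).
- The notebook does not inspect individual errors; **incorrect predictions** likely involve:
  - **Mixed opinions** (e.g., "Good food, but terrible service").
  - **Sarcasm or implicit sentiment** leading to misclassification.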
- **LLaMA 3 outperformed Claude 3 Sonnet**, with **0.95 accuracy vs. 0.93**.

## **3. Reasons for Misclassifications**
- **Neutral Sentiments**: Reviews without strong sentiment were forced into **positive or negative** categories.
- **Ambiguous Language**: Phrases like *"It was okay"* caused uncertainty.
- **Short Reviews**: Limited context affected predictions.
- **Industry-Specific Terms**: LLMs may struggle with restaurant-specific language.

## **4. Claude 3 Sonnet vs. LLaMA 3**
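- **LLaMA 3 had slightly better accuracy (0.95) than Claude 3 Sonnet (0.93)**, a gap of two reviews out of 98.
- LLaMA 3 also showed **better precision and recall** on **negative reviews** (0.96 / 0.94 vs. 0.94 / 0.92).
- Differences may stem from **model architecture and training data variations**, though on a sample this small, with sampling at temperature 0.3, they may also reflect run-to-run noise.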
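
Because both models scored the same reviews, a paired test such as McNemar's exact test is one way to check whether the gap is distinguishable from chance. The sketch below is an illustration on toy predictions; `mcnemar_exact` is a hypothetical helper, and the notebook does not report the per-review disagreement counts the test needs.

```python
from math import comb

def mcnemar_exact(truth, preds_a, preds_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts reviews only model A got right
    and c counts reviews only model B got right.
    """
    triples = list(zip(truth, preds_a, preds_b))
    b = sum(1 for t, pa, pb in triples if pa == t and pb != t)
    c = sum(1 for t, pa, pb in triples if pa != t and pb == t)
    n = b + c
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # Two-sided binomial tail with success probability 0.5.
    p_value = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, p_value)

# Toy paired predictions for illustration only.
truth   = [1, 1, 1, 0, 0, 0, 1, 0]
model_a = [1, 1, 0, 0, 0, 1, 1, 0]
model_b = [1, 1, 1, 0, 0, 0, 1, 1]

b, c, p_value = mcnemar_exact(truth, model_a, model_b)
print(f"only A correct: {b}, only B correct: {c}, p = {p_value:.3f}")
```

As a point of reference, if the two-review gap came entirely from two reviews that only LLaMA 3 classified correctly (an assumption), the exact p-value would be 0.5, far from conventional significance thresholds.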

## **5. Conclusion & Improvements**
- **Few-shot learning did not improve on zero-shot prompting** (0.95 vs. 0.96 accuracy).
- **LLaMA 3 slightly outperformed Claude 3 Sonnet**.
- **Challenges** include ambiguous, mixed, and neutral reviews.
- **Potential Enhancements**:
  - **Refining prompts** for clearer context.
  - **Using diverse few-shot examples** for better generalization.
  - **Testing alternative LLMs** for improved accuracy.

---
