# 1. **Choose a Dataset**
You can pick from the following (or try others you find interesting):
- IMDB Movie Reviews (sentiment labels: positive/negative).
- Yelp Reviews (sentiment labels: star ratings or binary positive/negative).

For now, complete the rest of the steps (2-4) below with the above two datasets.

Come back to do the following task after you're done with the above (time permitting):

Use the Amazon Product Reviews dataset (various rating categories, which can be collapsed into positive/negative). You are free to decide how to collapse the categories into a binary label, and you can also compare different ways of doing so.
Feel free to use the datasets library (e.g., load_dataset("imdb")).

In [42]:
from datasets import load_dataset

# IMDB Movie Review Dataset
imdb_ds = load_dataset("imdb")
# Yelp Reviews Dataset
yelp_ds = load_dataset("yelp_polarity")

print("IMDB Dataset")
print(imdb_ds)

print("Yelp Dataset")
print(yelp_ds)

IMDB Dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
Yelp Dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 38000
    })
})
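
For the Amazon task described above, one possible way to collapse star ratings into a binary label is sketched below. The function name `collapse_rating`, the thresholds, and the options for handling 3-star reviews are assumptions made for illustration; the assignment leaves this decision open, and the actual rating column depends on which Amazon dataset you load.

```python
from typing import Optional


def collapse_rating(stars: int, neutral: str = "drop") -> Optional[int]:
    """Collapse a 1-5 star rating into 1 (positive), 0 (negative), or None.

    `neutral` is a hypothetical switch for 3-star reviews:
      "drop"     -> return None so the caller can filter the review out
      "negative" -> count 3 stars as negative
      "positive" -> count 3 stars as positive
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating from 1 to 5, got {stars}")
    if stars >= 4:
        return 1
    if stars <= 2:
        return 0
    # Only 3-star reviews reach this point
    if neutral == "drop":
        return None
    if neutral not in ("negative", "positive"):
        raise ValueError(f"unknown neutral option: {neutral}")
    return 1 if neutral == "positive" else 0


ratings = [5, 4, 3, 2, 1]
dropped = [collapse_rating(r) for r in ratings]
as_negative = [collapse_rating(r, neutral="negative") for r in ratings]
print(dropped)
print(as_negative)
```

Running the evaluation once per `neutral` setting is one way to compare collapsing approaches.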


# 2. **Select Two or More Pretrained Models**
Pick at least two from the Hugging Face Hub and compare them:
- DistilBERT (e.g., distilbert-base-uncased-finetuned-sst-2-english)
- BERT (e.g., bert-base-uncased-finetuned-sst-2-english)
- RoBERTa (e.g., cardiffnlp/twitter-roberta-base-sentiment-latest or roberta-base-openai-detector)

Note that roberta-base-openai-detector is a GPT-2 output detector rather than a sentiment classifier, so check each model card before using it.
Feel free to explore the Hugging Face Model Hub if you find something else interesting!

In [43]:
from transformers import pipeline

# RoBERTa
roberta_pipeline = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
# DistilBERT
distilbert_pipeline = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

sample_review_1 = "This movie was great!"
sample_review_2 = "I didn't like the actors."
sample_review_3 = "This movie was okay."

roberta_result_1 = roberta_pipeline(sample_review_1)
roberta_result_2 = roberta_pipeline(sample_review_2)
roberta_result_3 = roberta_pipeline(sample_review_3)

distilbert_result_1 = distilbert_pipeline(sample_review_1)
distilbert_result_2 = distilbert_pipeline(sample_review_2)
distilbert_result_3 = distilbert_pipeline(sample_review_3)

print("RoBERTa Predictions:")
print(f"Review 1: {roberta_result_1}")
print(f"Review 2: {roberta_result_2}")
print(f"Review 3: {roberta_result_3}")

print("\nDistilBERT Predictions:")
print(f"Review 1: {distilbert_result_1}")
print(f"Review 2: {distilbert_result_2}")
print(f"Review 3: {distilbert_result_3}")



Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Device set to use cuda:0


RoBERTa Predictions:
Review 1: [{'label': 'positive', 'score': 0.9858884215354919}]
Review 2: [{'label': 'negative', 'score': 0.8923878073692322}]
Review 3: [{'label': 'positive', 'score': 0.9490081071853638}]

DistilBERT Predictions:
Review 1: [{'label': 'POSITIVE', 'score': 0.9998677968978882}]
Review 2: [{'label': 'NEGATIVE', 'score': 0.9989005327224731}]
Review 3: [{'label': 'POSITIVE', 'score': 0.9997859597206116}]
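
The two models do not share a label set: the RoBERTa checkpoint is a three-class model (negative, neutral, positive) with lowercase labels, while the DistilBERT checkpoint is binary with uppercase labels. One way to compare them on a binary dataset is to ask each pipeline for all class scores (for example by passing `top_k=None` when calling it) and keep whichever of positive or negative scores higher. The helper below is a sketch under that assumption; `binary_label` is a hypothetical name, and the scores in the usage example are made up for illustration.

```python
def binary_label(scores):
    """Return 1 (positive) or 0 (negative) from a list of {'label', 'score'} dicts.

    Any other class, such as 'neutral', is ignored.
    """
    by_label = {s["label"].lower(): s["score"] for s in scores}
    positive = by_label.get("positive", 0.0)
    negative = by_label.get("negative", 0.0)
    return 1 if positive > negative else 0


# Made-up scores in the shape a pipeline returns when all classes are requested
three_class = [
    {"label": "neutral", "score": 0.60},
    {"label": "positive", "score": 0.30},
    {"label": "negative", "score": 0.10},
]
two_class = [
    {"label": "NEGATIVE", "score": 0.90},
    {"label": "POSITIVE", "score": 0.10},
]
print(binary_label(three_class))
print(binary_label(two_class))
```

With only the top label, the first example would be "neutral", which a mapping that treats every non-positive label as 0 would count as negative.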


# 3. **Measure Performance**
Implement an evaluation method on a test or validation split. At minimum:\
Accuracy: The fraction of examples predicted correctly.\
F1 Score: The harmonic mean of precision and recall (an explanation is given below, after the instructions).\
You can use the Hugging Face evaluate or datasets library or write your own small function for computing these metrics.
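
If you write your own function, both metrics reduce to counting. The sketch below treats label 1 as the positive class and computes precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). The function name and the convention of returning 0.0 when a denominator is zero are choices made here, not part of the assignment.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and positive-class F1 for two equal-length label lists."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")

    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)

    # Fall back to 0.0 when a ratio is undefined (zero denominator)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


metrics = accuracy_and_f1([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```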

In [46]:
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

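def label_to_num(sentiment):
  # Map a pipeline label to 1 (positive) or 0 (everything else).
  # Any other label, such as a "neutral" class, therefore counts as 0.
  if sentiment.lower() == "positive":
    return 1
  return 0

# Evaluate on the first 1000 rows of the IMDB test split
test_split_imdb = imdb_ds["test"].select(range(1000))
# Crude length cap: keep only the first 500 characters of each review
imdb_text = [text[:500] for text in test_split_imdb["text"]]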

# RoBERTa predictions for the test sample, converted to 1s and 0s
roberta_predictions = roberta_pipeline(imdb_text)
roberta_prediction_nums = [label_to_num(pred["label"]) for pred in roberta_predictions]

# DistilBERT predictions for the test sample, converted to 1s and 0s
distilbert_predictions = distilbert_pipeline(imdb_text)
distilbert_prediction_nums = [label_to_num(pred["label"]) for pred in distilbert_predictions]

# Hugging Face Accuracy Documentation: https://huggingface.co/spaces/evaluate-metric/accuracy
# Hugging Face F1 Score Documentation: https://huggingface.co/spaces/evaluate-metric/f1

roberta_accuracy = accuracy.compute(predictions = roberta_prediction_nums, references = test_split_imdb["label"])
roberta_f1_score = f1.compute(predictions = roberta_prediction_nums, references = test_split_imdb["label"])

distilbert_accuracy = accuracy.compute(predictions = distilbert_prediction_nums, references = test_split_imdb["label"])
distilbert_f1_score = f1.compute(predictions = distilbert_prediction_nums, references = test_split_imdb["label"])

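# Note: F1 comes out as 0.0 below because the IMDB test split is ordered by label,
# so the first 1000 rows are all negative (label 0). With no positive references
# there are no true positives. Shuffling the split before select() would give a mixed sample.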

print("RoBERTa Model Evaluation for IMDB Set:")
print(f"Accuracy: {roberta_accuracy['accuracy']}")
print(f"F1 Score: {roberta_f1_score['f1']}")
print("DistilBERT Model Evaluation for IMDB Set:")
print(f"Accuracy: {distilbert_accuracy['accuracy']}")
print(f"F1 Score: {distilbert_f1_score['f1']}")

RoBERTa Model Evaluation for IMDB Set:
Accuracy: 0.868
F1 Score: 0.0
DistilBERT Model Evaluation for IMDB Set:
Accuracy: 0.837
F1 Score: 0.0
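
An F1 of exactly 0.0 alongside accuracy above 0.8 points at the sample rather than the models. The IMDB test split is stored ordered by label, so taking its first 1000 rows returns only negative reviews, and the positive-class F1 is then 0 regardless of model quality. The sketch below reproduces the effect on synthetic labels and contrasts it with a random draw. The data and variable names are made up for illustration; with the `datasets` library, the analogous step would be calling `.shuffle(seed=...)` on the split before `.select(...)`.

```python
import random

# Synthetic stand-in for a label-sorted split: negatives first, then positives
labels = [0] * 12500 + [1] * 12500

# Taking the first 1000 rows yields a single class
head = labels[:1000]
print(sorted(set(head)))

# Drawing 1000 row indices at random (fixed seed for repeatability) mixes both classes
rng = random.Random(42)
indices = rng.sample(range(len(labels)), 1000)
shuffled = [labels[i] for i in indices]
print(sorted(set(shuffled)))
```

After resampling this way, the metrics need to be recomputed; the numbers printed above apply only to the all-negative sample.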


# 4. **Compare Models & Do a Short Error Analysis**
After running inference on your test set:
- **Compare Metrics**: Which model is most accurate overall? Does one have a higher F1?
- **Identify Edge Cases**:
  - Look at ~5 examples that were misclassified by at least one model.
  - What patterns do you see (e.g., tricky wording, sarcasm, short text)? If you don't see any pattern, that's fine, but make sure you've looked hard enough (maybe you need more than 5 examples).
  - Which examples do all models get wrong? Which mistakes are unique to a particular model? (Again, if you don't find a pattern, that's fine, but make sure you've tried a lot of things.)
- Write a short paragraph or make a small table summarizing your findings:
  - Which model performed best overall?
  - Any surprising differences?
  - How might you improve performance further?
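
The edge-case questions above can be approached by grouping errors by which models made them. The sketch below is one way to do that; `compare_errors` is a hypothetical helper, and the texts, labels, and predictions are toy values, not results from the notebook.

```python
def compare_errors(references, predictions_by_model):
    """Return (shared, unique) error indices.

    shared: indices every model got wrong.
    unique: for each model, indices that only that model got wrong.
    Errors made by some but not all of three or more models are not reported.
    """
    shared = []
    unique = {name: [] for name in predictions_by_model}
    for i, ref in enumerate(references):
        wrong = [name for name, preds in predictions_by_model.items() if preds[i] != ref]
        if len(wrong) == len(predictions_by_model):
            shared.append(i)
        elif len(wrong) == 1:
            unique[wrong[0]].append(i)
    return shared, unique


# Toy data for illustration only
texts = ["Great fun.", "Not bad at all.", "Oh sure, a real masterpiece.", "Dull and slow."]
references = [1, 1, 0, 0]
predictions_by_model = {
    "roberta": [1, 0, 1, 0],
    "distilbert": [1, 1, 1, 0],
}

shared, unique = compare_errors(references, predictions_by_model)
for i in shared:
    print(f"All models wrong: {texts[i]!r} (true label {references[i]})")
for name, indices in unique.items():
    for i in indices:
        print(f"Only {name} wrong: {texts[i]!r} (true label {references[i]})")
```

Printing the text next to the true label makes it easier to spot patterns such as negation or sarcasm by eye.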

# **Deliverables**
**Code**: A Python script or Jupyter notebook showing how you:
- Load data (and potentially preprocess it).
- Instantiate the Hugging Face pipelines.
- Run predictions and calculate metrics.

**Short text on findings** (in a text box in your Colab notebook):
- Which dataset(s) you chose and why.
- The models you compared and a table/plot of accuracy or F1 (use matplotlib for graphs).
- Example misclassified cases and your hypothesis for why they failed.
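
For the results table, a small formatter like the one below may be enough. It is a sketch only: `results_table` is a hypothetical helper, and the numbers are the IMDB values printed earlier in this notebook. The same dictionary could also feed a matplotlib bar chart (for example via `matplotlib.pyplot.bar`) if you prefer a plot.

```python
def results_table(results):
    """Format {model: {'accuracy': ..., 'f1': ...}} as a Markdown table."""
    lines = ["| Model | Accuracy | F1 |", "| --- | --- | --- |"]
    for model, scores in results.items():
        lines.append(f"| {model} | {scores['accuracy']:.3f} | {scores['f1']:.3f} |")
    return "\n".join(lines)


# Values taken from the IMDB evaluation output above
results = {
    "RoBERTa": {"accuracy": 0.868, "f1": 0.0},
    "DistilBERT": {"accuracy": 0.837, "f1": 0.0},
}
table = results_table(results)
print(table)
```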