Homework 4: Sentiment Analysis - Task 2
----

Names 
----
Names: __Adrian Criollo__ (Write these in every notebook you submit.)

Task 2: Train a Naive Bayes Model (30 points)
----

Using `nltk`'s `NaiveBayesClassifier` class, train a Naive Bayes classifier using a Bag of Words as features.

Learn more about Naive Bayes here: https://www.nltk.org/_modules/nltk/classify/naivebayes.html 

Naive Bayes classifiers use Bayes’ theorem for predictions. Naive Bayes can be a good baseline for NLP applications in particular. You can use it as a baseline for your project!
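
To make the role of Bayes' theorem concrete, here is a minimal sketch of the computation in plain Python. It is not NLTK's implementation: the toy reviews, the helper names (`train_toy_nb`, `log_posteriors`, `classify_toy`), and the add-one smoothing are all assumptions chosen for illustration. Each label is scored as log P(label) plus the sum of log P(word | label), and the highest-scoring label wins.

```python
import math
from collections import Counter, defaultdict


def train_toy_nb(examples):
    """Count labels and per-label word occurrences from (words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        for word in words:
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def log_posteriors(words, label_counts, word_counts, vocab):
    """Return an unnormalized log posterior for each label."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # log prior: fraction of training documents with this label
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocab:
                continue  # skip words never seen in training
            # add-one smoothed likelihood P(word | label) (illustrative choice)
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return scores


def classify_toy(words, model):
    """Pick the label with the highest log posterior."""
    scores = log_posteriors(words, *model)
    return max(scores, key=scores.get)


# hypothetical toy data: 1 = positive review, 0 = negative review
toy_examples = [
    ("great fun movie".split(), 1),
    ("great acting".split(), 1),
    ("boring movie".split(), 0),
    ("boring dull plot".split(), 0),
]
toy_model = train_toy_nb(toy_examples)
print(classify_toy("great fun".split(), toy_model))
print(classify_toy("dull boring".split(), toy_model))
```

The "naive" part is the assumption that words are independent given the label, which is what lets the per-word log probabilities simply be added.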

**10 points in Task 5 will be allocated for all 9 graphs (including the one generated here in Task 2 for the Naive Bayes classifier) being:**
- Legible
- Present below
- Properly labeled
     - x and y axes labeled
     - Legend for accuracy measures plotted
     - Plot Title with which model and run number the graph represents

In [25]:
# our utility functions
# RESTART your jupyter notebook kernel if you make changes to this file
import sentiment_utils as sutils

# nltk for Naive Bayes and metrics
import nltk
import nltk.classify.util
from nltk.metrics.scores import (precision, recall, f_measure, accuracy)
from nltk.classify import NaiveBayesClassifier
import pandas as pd

# some potentially helpful data structures from collections
from collections import defaultdict, Counter

# so that we can make plots
import matplotlib.pyplot as plt
# if you want to use seaborn to make plots
#import seaborn as sns

In [27]:
# define constants for the files we are using
TRAIN_FILE = "movie_reviews_train.txt"
DEV_FILE = "movie_reviews_dev.txt"

In [29]:
# load in your data and make sure you understand the format
# Do not print out too much so as to impede readability of your notebook
train_tups = sutils.generate_tuples_from_file(TRAIN_FILE)
dev_tups = sutils.generate_tuples_from_file(DEV_FILE)


In [40]:
# set up a sentiment classifier using NLTK's NaiveBayesClassifier and 
# a bag of words as features
# take a look at the function in lecture notebook 7 (feel free to copy + paste that function)
# the nltk classifier expects a dictionary of features as input where the key is the feature name
# and the value is the feature value

# need to return a dict to work with the NLTK classifier
# Possible problem for students: evaluate the difference 
# between using binarized features and using counts (non binarized features)
def word_feats(words) -> dict:
    return {word: True for word in words}


# set up & train a sentiment classifier using NLTK's NaiveBayesClassifier and
# classify the first example in the dev set as an example
# make sure your output is well-labeled

df_train = pd.read_csv(TRAIN_FILE, header=None, sep='\t', names = ['ID', 'Review', 'Label'])
df_dev = pd.read_csv(DEV_FILE, header=None, sep='\t', names = ['ID', 'Review', 'Label'])

trained_data = []
for review, label in zip(df_train['Review'], df_train['Label']):
    words = review.split()
    trained_data.append((word_feats(words), label))

classifier = nltk.NaiveBayesClassifier.train(trained_data)

# test to make sure that you can train the classifier and use it to classify a new example

test_review = df_dev['Review'].iloc[0].split()
test_label = df_dev['Label'].iloc[0]
prediction = classifier.classify(word_feats(test_review))
print(f"Test label: {test_label}")
print(f"Predicted label: {prediction}")


Test label: 0
Predicted label: 0


In [None]:
# Using the provided dev set, evaluate your model with precision, recall, and f1 score as well as accuracy
# You may use nltk's implemented `precision`, `recall`, `f_measure`, and `accuracy` functions
# (make sure to look at the documentation for these functions!)
# you will be creating a similar graph for logistic regression and neural nets, so make sure
# you use functions wisely so that you do not have excessive repeated code
# write any helper functions you need in sentiment_utils.py (functions that you'll use in your other notebooks as well)



# create a graph of your classifier's performance on the dev set as a function of the amount of training data
# the x-axis should be the amount of training data (as a percentage of the total training data)
# NOTE : make sure one of your experiments uses 10% of the data, you will need this to answer the first question in task 5
# the y-axis should be the performance of the classifier on the dev set
# the graph should have 4 lines, one for each of precision, recall, f1, and accuracy
# the graph should have a legend, title, and axis labels
def train_and_evaluate_classifier(train_data, dev_data, dev_labels, data_percentages):
    """
    Trains a classifier on increasing portions of the training data and evaluates it on the dev set.
    
    Args:
    - train_data: list of tuples (features, label) for training
    - dev_data: list of dev examples for evaluation
    - dev_labels: list of true dev labels
    - data_percentages: list of percentages of training data to use for experiments
    
    Returns:
    - A dictionary of performance metrics (precision, recall, f1, accuracy) at each percentage
    """
    metrics = {'precision': [], 'recall': [], 'f1': [], 'accuracy': []}
    
    for percentage in data_percentages:
        # Calculate the amount of training data to use
        data_size = int(len(train_data) * (percentage / 100))
        current_train_data = train_data[:data_size]
        
        # Train the classifier
        classifier = nltk.NaiveBayesClassifier.train(current_train_data)
        
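        # Get predictions on the dev set
        dev_preds = [classifier.classify(word_feats(review.split())) for review in dev_data]

        # nltk's precision, recall, and f_measure compare sets of items, so
        # group the dev example indices by gold label and by predicted label
        refsets, testsets = defaultdict(set), defaultdict(set)
        for i, (gold, pred) in enumerate(zip(dev_labels, dev_preds)):
            refsets[gold].add(i)
            testsets[pred].add(i)

        # Evaluate the classifier, treating label 1 as the positive class
        # (assumption: the labels in the data files are 0 and 1)
        precision_score = precision(refsets[1], testsets[1])
        recall_score = recall(refsets[1], testsets[1])
        f1_score = f_measure(refsets[1], testsets[1])
        accuracy_score = accuracy(dev_labels, dev_preds)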
        
        # Store the metrics for this percentage
        metrics['precision'].append(precision_score)
        metrics['recall'].append(recall_score)
        metrics['f1'].append(f1_score)
        metrics['accuracy'].append(accuracy_score)
    
    return metrics

# Example data percentages to experiment with
data_percentages = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Get dev reviews and labels from the dev DataFrame loaded above
dev_data = df_dev['Review'].tolist()
dev_labels = df_dev['Label'].tolist()

# trained_data was built above as a list of (features, label) tuples
metrics = train_and_evaluate_classifier(trained_data, dev_data, dev_labels, data_percentages)

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(data_percentages, metrics['precision'], label='Precision', marker='o')
plt.plot(data_percentages, metrics['recall'], label='Recall', marker='o')
plt.plot(data_percentages, metrics['f1'], label='F1 Score', marker='o')
plt.plot(data_percentages, metrics['accuracy'], label='Accuracy', marker='o')

# Add labels and title
plt.title('Naive Bayes Performance on Dev Set as a Function of Training Data')
plt.xlabel('Training Data Percentage (%)')
plt.ylabel('Performance')
plt.legend(loc='best')

# Show the plot
plt.grid(True)
plt.show()

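NLTK's `precision`, `recall`, and `f_measure` compare sets of items rather than lists of labels, so the gold and predicted labels have to be turned into sets of example indices first. The sketch below shows that bookkeeping in plain Python so the arithmetic is visible. The helper name `prf_from_labels` and the choice of `1` as the positive label are assumptions for illustration, and unlike NLTK (which returns `None` when a set is empty) this version falls back to `0.0`.

```python
from collections import defaultdict


def prf_from_labels(gold_labels, pred_labels, positive_label=1):
    """Compute precision, recall, F1, and accuracy for one positive label."""
    refsets, testsets = defaultdict(set), defaultdict(set)
    for i, (gold, pred) in enumerate(zip(gold_labels, pred_labels)):
        refsets[gold].add(i)    # indices whose gold label is `gold`
        testsets[pred].add(i)   # indices predicted as `pred`

    ref, test = refsets[positive_label], testsets[positive_label]
    true_pos = len(ref & test)

    precision = true_pos / len(test) if test else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    correct = sum(g == p for g, p in zip(gold_labels, pred_labels))
    accuracy = correct / len(gold_labels)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# hypothetical gold and predicted labels for five dev examples
example_scores = prf_from_labels([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(example_scores)
```

A helper of this shape could live in `sentiment_utils.py` and be reused for the logistic regression and neural net notebooks.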

Test your model using both a __binarized__ representation (a bag of words where each word gets 1 [true] if it is present and 0 [false] otherwise) and a __multinomial__ representation (a bag of words where each word gets its count if it occurs, and 0 otherwise). Use whichever one gives you a better final f1 score on the dev set to produce your graphs.

- f1 score binarized: __YOUR ANSWER HERE__
- f1 score multinomial: __YOUR ANSWER HERE__
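
The two representations differ only in the value stored for each word. A minimal sketch of both feature functions follows; the names `binarized_feats` and `count_feats` are hypothetical and not part of the assignment code. As far as I know, NLTK's `NaiveBayesClassifier` treats each feature value as a category, so a count of 2 is handled as a different value from a count of 1 rather than as "twice as much".

```python
from collections import Counter


def binarized_feats(words) -> dict:
    """Map each distinct word to True, ignoring how often it occurs."""
    return {word: True for word in words}


def count_feats(words) -> dict:
    """Map each distinct word to the number of times it occurs."""
    return dict(Counter(words))


tokens = "good good bad".split()
print(binarized_feats(tokens))
print(count_feats(tokens))
```

Either function can be passed over the tokenized reviews in place of `word_feats` to build the `(features, label)` training tuples, which makes the two f1 scores directly comparable.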