Homework 4: Sentiment Analysis - Task 4
----

Names
----
Names: __Adrian Criollo__ (Write these in every notebook you submit)

Task 4: Neural Networks (20 points)
----

Next, we'll train a feedforward neural net to work with this data. You'll train one neural net which takes the same input as your Logistic Regression model - a sparse vector representing documents as bags of words.
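
As a rough illustration of what "bags of words" means here, the sketch below builds multinomial (count) and binarized (0/1) vectors for two toy documents. The documents and the `vectorize` helper are hypothetical and only mirror the idea; the notebook itself relies on `CountVectorizer`, whose tokenization may differ.

```python
from collections import Counter

# Hypothetical toy documents, already tokenized (not taken from the dataset).
docs = [["good", "good", "movie"], ["bad", "movie"]]

# Build a sorted vocabulary so each word maps to a fixed column index.
vocab = sorted({word for doc in docs for word in doc})

def vectorize(doc, vocab, binary=False):
    """Return a count vector (multinomial) or a 0/1 presence vector (binarized)."""
    counts = Counter(doc)
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocab]
    return [counts[word] for word in vocab]

multinomial = [vectorize(doc, vocab) for doc in docs]
binarized = [vectorize(doc, vocab, binary=True) for doc in docs]

print(vocab)        # ['bad', 'good', 'movie']
print(multinomial)  # [[0, 2, 1], [1, 0, 1]]
print(binarized)    # [[0, 1, 1], [1, 0, 1]]
```

The only difference between the two representations is whether repeated words are counted or merely marked as present.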

Take a look at these videos to understand forward and backward propagation in neural networks:
* https://www.youtube.com/watch?v=HHbjpDHcJVw
* https://youtu.be/-Lavz_I4l2U?si=zi20DB3qKPLMEPt1
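
To make the forward pass concrete, here is a minimal sketch of one hidden layer (ReLU) feeding a sigmoid output with binary cross-entropy loss, written in plain Python. All weights, inputs, and names are made up for illustration; this is not how Keras implements it internally, and the last line only shows the starting point of backpropagation.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical tiny network: 2 inputs -> 2 hidden units -> 1 output.
x = [1.0, 2.0]
W1 = [[0.5, -1.0],   # weights from input 0 to each hidden unit
      [0.25, 0.5]]   # weights from input 1 to each hidden unit
b1 = [0.0, 0.0]
W2 = [1.0, -1.0]     # weights from each hidden unit to the output
b2 = 0.0
y = 1                # gold label

# Forward pass: weighted sums, then activations.
hidden = [relu(sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
          for j in range(len(b1))]
z_out = sum(h * w for h, w in zip(hidden, W2)) + b2
prob = sigmoid(z_out)

# Binary cross-entropy loss for a single example.
loss = -(y * math.log(prob) + (1 - y) * math.log(1 - prob))

# Backward pass starts here: for sigmoid + BCE, dLoss/dz_out = prob - y.
grad_z_out = prob - y

print(hidden, round(prob, 4), round(loss, 4))
```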
  
**10 points in Task 5 will be allocated for all 9 graphs (including the one generated here in Task 4 for Neural Networks) being:**
- Legible
- Present below
- Properly labeled
     - x and y axes labeled
     - Legend for accuracy measures plotted
     - Plot title stating which model and run number the graph represents

In [None]:
!pip install tensorflow
import sentiment_utils as sutils
import numpy as np

from keras.models import Sequential
from keras.layers import Dense

# you can experiment with having some Dropout layers if you'd like to
# this is not required
from keras.layers import Dropout

# if you want to use this again
from sklearn.feature_extraction.text import CountVectorizer



In [None]:
# define constants for the files we are using
TRAIN_FILE = "movie_reviews_train.txt"
DEV_FILE = "movie_reviews_dev.txt"

# load in your data and make sure you understand the format
# Do not print out too much so as to impede readability of your notebook
train_docs, train_labels = sutils.generate_tuples_from_file(TRAIN_FILE)
dev_docs, dev_labels = sutils.generate_tuples_from_file(DEV_FILE)

# you may use either your sparse vectors or sklearn's CountVectorizer's sparse vectors
# you will experiment with multinomial and binarized representations later

# Join tokenized words into a single string for each document (for CountVectorizer)
train_docs_joined = [' '.join(doc) for doc in train_docs]
dev_docs_joined = [' '.join(doc) for doc in dev_docs]

# Vectorize the data using CountVectorizer
vectorizer = CountVectorizer(binary=False)  # Set binary=True for binarized representation later
X_train = vectorizer.fit_transform(train_docs_joined).toarray()
X_dev = vectorizer.transform(dev_docs_joined).toarray()

# Convert labels to numpy arrays
y_train = np.array(train_labels)
y_dev = np.array(dev_labels)

In [None]:
# Create a feedforward neural network model
# that takes a sparse BoW representation of the data as input
# and makes a binary classification of positive/negative sentiment as output
# you may use any number of hidden layers >= 1 and any number of units in each hidden layer (we recommend between 50-200)
# you may use any activation function on the hidden layers 
# you should use a sigmoid activation function on the output layer
# you should use binary cross-entropy as your loss function
# sgd is an appropriate optimizer for this task
# you should report accuracy as your metric
# you may add Dropout layers if you'd like to

# create/compile your model in this cell

model = Sequential()
model.add(Dense(100, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dropout(0.3))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))

model.summary()
# call compile here
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])


How many trainable parameters does your model have? __2,269,901__
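
One way to sanity-check this number by hand: each `Dense` layer has `inputs * units` weights plus `units` biases, and `Dropout` layers add no trainable parameters. The sketch below assumes a vocabulary size of 22,596, which is inferred from the reported total rather than read from the notebook output; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

# Assumption: vocabulary size inferred from the reported total,
# not taken from X_train.shape[1] directly.
vocab_size = 22_596

# Input -> Dense(100) -> Dense(100) -> Dense(1); Dropout contributes nothing.
layer_sizes = [vocab_size, 100, 100, 1]
total = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

print(f"{total:,}")  # 2,269,901
```

The total should match the "Trainable params" line printed by `model.summary()`.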

In [None]:
# train your model
# reports an accuracy of 0.78 at that point using the sgd optimizer
# Ensure labels are converted to NumPy arrays
y_train = np.array(train_labels)
y_dev = np.array(dev_labels)

# Train your model
history = model.fit(
    X_train,
    y_train,
    validation_data=(X_dev, y_dev),
    epochs=10,
    batch_size=32,
    verbose=1
)

# After training, evaluate the model
loss, accuracy = model.evaluate(X_dev, y_dev, verbose=0)
print(f"Validation Accuracy: {accuracy:.2f}")
# Failed to find data adapter that can handle input: <class 'numpy.ndarray'>, (<class 'list'> containing values of types {"<class 'int'>"})
# indicates you should change a list into a numpy array



In [None]:
# make a prediction on the dev set
# then make a classification decision based on that prediction

predictions = model.predict(X_dev)
# model.predict returns an array of shape (n_samples, 1); flatten it before thresholding at 0.5
predicted_labels = (predictions.ravel() > 0.5).astype(int)


In [None]:
# use the model.evaluate function to report the loss and accuracy on the dev set
loss, accuracy = model.evaluate(X_dev, y_dev, verbose=1)
print(f"Dev Set Loss: {loss}")
print(f"Dev Set Accuracy: {accuracy}")

In [None]:
# create the same graph as with NB and LR, with your neural network model instead!
# make sure to re-create your model each time you train it — you don't want to start with
# an already trained network!

# you should experiment with different numbers of epochs to see how performance varies
# you need not create an experiment that takes > 10 min to run (gradescope will run out of computing resources and give you a 0)

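def prepare_features(train_docs, dev_docs, train_labels, dev_labels, binary=False):
    """Vectorize tokenized docs into lists of (feature_vector, label) tuples.

    binary=False gives multinomial (count) features; binary=True gives binarized (0/1) features.
    """
    # CountVectorizer expects one string per document, so re-join the tokens
    train_docs_joined = [' '.join(doc) for doc in train_docs]
    dev_docs_joined = [' '.join(doc) for doc in dev_docs]
    vectorizer = CountVectorizer(binary=binary)
    # fit the vocabulary on the training data only, then reuse it for the dev set
    X_train = vectorizer.fit_transform(train_docs_joined).toarray()
    X_dev = vectorizer.transform(dev_docs_joined).toarray()
    # pair each dense feature vector with its label
    train_feats = [(X_train[i], train_labels[i]) for i in range(len(train_labels))]
    dev_feats = [(X_dev[i], dev_labels[i]) for i in range(len(dev_labels))]
    return train_feats, dev_feats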


train_feats_multinomial, dev_feats_multinomial = prepare_features(train_docs, dev_docs, train_labels, dev_labels, binary=False)
train_feats_binarized, dev_feats_binarized = prepare_features(train_docs, dev_docs, train_labels, dev_labels, binary=True)
sutils.create_training_graph(sutils.nn_helper, train_feats_multinomial, dev_feats_multinomial, kind="Neural Network Multinomial", savepath="nn_training_graph_multinomial.png")
sutils.create_training_graph(sutils.nn_helper, train_feats_binarized, dev_feats_binarized, kind="Neural Network Binarized", savepath="nn_training_graph_binary3.png")

precision_multinomial, recall_multinomial, f1_multinomial, accuracy_multinomial = sutils.nn_helper(
    train_feats_multinomial, dev_feats_multinomial, epochs=10
)
precision_binarized, recall_binarized, f1_binarized, accuracy_binarized = sutils.nn_helper(
    train_feats_binarized, dev_feats_binarized, epochs=10
)

print("F1 score with multinomial features:", f1_multinomial)
print("F1 score with binarized features:", f1_binarized)

Report the f1 scores for your model with the following settings, using the same number of epochs to train in both cases:
- number of epochs used: __10__
- multinomial features: __.7378__ 
- binarized features: __.82__
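
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from gold and predicted labels in plain Python; the `precision_recall_f1` helper and the toy labels are hypothetical, and `sutils.nn_helper` may compute its scores differently.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy labels, not results from the notebook.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```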