Homework 4: Sentiment Analysis - Task 3
----

Names & Sections
----
Names: Julia Geller (4120) and Shae Marks (4120)

Task 3: Train a Logistic Regression Model (20 points)
----

Using `sklearn`'s implementation of `LogisticRegression`, conduct a similar analysis on the performance of a Logistic Regression classifier on the provided data set.

Using the `time` module, you'll compare how long your home-grown BoW vectorizing function takes to featurize the data vs. `sklearn`'s `CountVectorizer`.
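
As a rough illustration of the comparison described above, here is a minimal sketch of a bag-of-words featurizer wrapped in a timing helper. The names `build_index`, `bow_vectors`, and `timed` are hypothetical and are not the functions in `sentiment_utils`; the sketch also assumes each document is already a list of tokens. It uses `time.perf_counter`, which is generally better suited to measuring intervals than `time.time`.

```python
import time

def build_index(tokenized_docs):
    """Map each distinct token to a column index, in first-seen order."""
    index = {}
    for doc in tokenized_docs:
        for tok in doc:
            if tok not in index:
                index[tok] = len(index)
    return index

def bow_vectors(tokenized_docs, index, binary=False):
    """Return one count (or 0/1) vector per document; unknown tokens are skipped."""
    vectors = []
    for doc in tokenized_docs:
        vec = [0] * len(index)
        for tok in doc:
            col = index.get(tok)
            if col is not None:
                vec[col] = 1 if binary else vec[col] + 1
        vectors.append(vec)
    return vectors

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy data standing in for the tokenized reviews (an assumption for illustration).
docs = [["good", "good", "movie"], ["bad", "movie"]]
index = build_index(docs)
vectors, seconds = timed(bow_vectors, docs, index, binary=False)
print(vectors, f"({seconds:.6f} s)")
```

The same `timed` wrapper could be applied to a `CountVectorizer` call so that both measurements are taken the same way.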


In [1]:
from sklearn.linear_model import LogisticRegression
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer

import time
import sentiment_utils as sutils
import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shaem\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# define constants for the files we are using
TRAIN_FILE = "movie_reviews_train.txt"
DEV_FILE = "movie_reviews_dev.txt"

# load in your data and make sure you understand the format
# Do not print so much output that it impedes the readability of your notebook
train_tups = sutils.generate_tuples_from_file(TRAIN_FILE)
dev_tups = sutils.generate_tuples_from_file(DEV_FILE)

# some variables you may want to use
BINARIZED = True
# USE_COUNT_VECTORIZER = False

In [3]:
# split the train and development tuples into X and y
X_train = train_tups[0]
y_train = train_tups[1]

X_dev = dev_tups[0]
y_dev = dev_tups[1]

In [4]:
# how much time does it take to featurize all the data with your implementation?

start = time.time()

# YOUR CODE HERE
vocab = sutils.create_index(X_train)
X_train_vects = sutils.featurize('own', vocab, X_train, binary = False, verbose = False)
X_dev_vects = sutils.featurize('own', vocab, X_dev, binary = False, verbose = False)

end = time.time()
print("That took:", end - start, "seconds")



That took: 6.692219257354736 seconds


In [5]:
# how much time does it take to featurize all the data with sklearn's CountVectorizer?
start = time.time()


# YOUR CODE HERE
vocab = sutils.create_index(X_train)
X_train_flat = [' '.join(row) for row in X_train]
X_dev_flat = [' '.join(row) for row in X_dev]
X_train_CV = sutils.featurize('CV', vocab, X_train_flat, binary = False)
X_dev_CV = sutils.featurize('CV', vocab, X_dev_flat, binary = False)

end = time.time()
print("That took:", end - start, "seconds")



That took: 1.4688332080841064 seconds


1. How big is your vocabulary using your vectorization function(s)? The vocabulary using my vectorization function contains 30,705 words.
2. How big is your vocabulary using the `CountVectorizer`? The vocabulary using the `CountVectorizer` also contains 30,705 words.
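
For reference, the size of a `CountVectorizer` vocabulary can be read from its fitted `vocabulary_` attribute. The sketch below uses toy sentences (an assumption, not the review data) and shows that the default tokenizer and a vectorizer given a pre-built index can report different sizes. Whether `sutils.featurize('CV', ...)` passes the index this way is not shown in this notebook, so treat the second half as one possible setup.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a good good movie", "a bad movie"]

# Default settings: lowercases text and ignores one-character tokens such as "a".
default_cv = CountVectorizer()
default_cv.fit(docs)
print(len(default_cv.vocabulary_))  # number of distinct tokens kept

# Fixed vocabulary: the column order comes from the supplied index, and a
# whitespace token pattern keeps one-character tokens as well.
index = {"good": 0, "movie": 1, "bad": 2, "a": 3}
fixed_cv = CountVectorizer(vocabulary=index, token_pattern=r"\S+")
X = fixed_cv.fit_transform(docs)
print(X.shape)  # (number of documents, vocabulary size)
```

When a fixed vocabulary is supplied, the number of columns equals the size of that index, which is one way the two vocabulary sizes can come out identical.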

In [7]:
# write any code you need to analyze the relative sparsity of your vectorized representations of the data
# YOUR CODE HERE
def pct_zeros(X):
    cnt = [(x.shape[0]-np.count_nonzero(x))/x.shape[0] for x in X]
    return sum(cnt)/len(cnt)

# Print out the average % of entries that are zeros in each vector in the vectorized training data
# YOUR CODE HERE
print('The average percent of entries that are zero in each vector for my vectorized training data:', sutils.percent_zeros(X_train_vects))
print('The average percent of entries that are zero in each vector for sklearn vectorized training data:', sutils.percent_zeros(X_train_CV))

The average percent of entries that are zero in each vector for my vectorized training data: 0.995092452369321
The average percent of entries that are zero in each vector for sklearn vectorized training data: 0.9957819369809477
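
Note that the values printed above are fractions, so 0.995 corresponds to about 99.5% zeros. A minimal sketch of a sparsity check that handles both dense NumPy arrays and SciPy sparse matrices follows. The helper name `fraction_zeros` is hypothetical and is not the `sutils.percent_zeros` implementation. Because every row has the same length, the overall fraction of zeros equals the average of the per-row fractions.

```python
import numpy as np
from scipy import sparse

def fraction_zeros(X):
    """Fraction of entries equal to zero, for a dense array or a SciPy sparse matrix."""
    n_rows, n_cols = X.shape
    if sparse.issparse(X):
        nonzero = X.count_nonzero()   # counts stored entries that are not zero
    else:
        nonzero = np.count_nonzero(X)
    return 1.0 - nonzero / (n_rows * n_cols)

# Toy matrix: 3 nonzero entries out of 8.
dense = np.array([[2, 1, 0, 0],
                  [0, 0, 0, 1]])
print(fraction_zeros(dense))
print(fraction_zeros(sparse.csr_matrix(dense)))
```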


In [None]:
# Using the provided dev set, evaluate your model with precision, recall, and f1 score as well as accuracy
# You may use nltk's implemented `precision`, `recall`, `f_measure`, and `accuracy` functions
# (make sure to look at the documentation for these functions!)
# you will be creating a similar graph for logistic regression and neural nets, so make sure
# you use functions wisely so that you do not have excessive repeated code
# write any helper functions you need in sentiment_utils.py (functions that you'll use in your other notebooks as well)


# create a graph of your classifier's performance on the dev set as a function of the amount of training data
# the x-axis should be the amount of training data (as a percentage of the total training data)
# the y-axis should be the performance of the classifier on the dev set
# the graph should have 4 lines, one for each of precision, recall, f1, and accuracy
# the graph should have a legend, title, and axis labels

# takes approx 30 sec on Felix's computer

Test the following 4 combinations to determine which has the best final f1 score for your Logistic Regression model:
- your vectorized features, multinomial: __enter your final f1 score here__
- CountVectorizer features, multinomial: __enter your final f1 score here__
- your vectorized features, binarized: __enter your final f1 score here__
- CountVectorizer features, binarized: __enter your final f1 score here__

Produce your graph(s) for the combination with the best final f1 score.
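
One way to collect the numbers for such a graph is sketched below. It trains a fresh `LogisticRegression` on growing prefixes of the training data and scores each model on the dev set. The function name `learning_curve_scores` and the toy data are assumptions for illustration, and `sklearn.metrics` is used here in place of the `nltk` metric functions mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def learning_curve_scores(X_train, y_train, X_dev, y_dev, percentages):
    """Train on the first pct% of the training data and score on the dev set."""
    rows = []
    for pct in percentages:
        n = max(1, int(len(X_train) * pct / 100))
        # Each prefix must contain both classes, otherwise fit() raises an error.
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:n], y_train[:n])
        preds = model.predict(X_dev)
        rows.append({
            "pct": pct,
            "precision": precision_score(y_dev, preds, zero_division=0),
            "recall": recall_score(y_dev, preds, zero_division=0),
            "f1": f1_score(y_dev, preds, zero_division=0),
            "accuracy": accuracy_score(y_dev, preds),
        })
    return rows

# Toy count vectors with alternating labels, so every prefix has both classes.
X_train = np.array([[3, 0], [0, 3]] * 10)
y_train = np.array([1, 0] * 10)
X_dev = np.array([[2, 0], [0, 2], [4, 1], [1, 4]])
y_dev = np.array([1, 0, 1, 0])

rows = learning_curve_scores(X_train, y_train, X_dev, y_dev, [10, 50, 100])
for row in rows:
    print(row)
```

Each metric in `rows` can then be drawn as one line against `pct`, and the `f1` value at 100% is the "final f1 score" for that feature combination.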




6120 REQUIRED
----

Find the top 100 most important features to your Logistic Regression classifier when using 100% of the training data. To access the weights of your model, use the `model.coef_` attribute. You'll want to use a `StandardScaler` preprocessor. This will help us deal with the fact that we expect counts of certain words to be higher (e.g. stop words).

To find the importance of a feature, calculate the absolute value of each weight in the model, then order your features according to the absolute values of these weights. The feature with the highest absolute value weight has the most importance.

Use __your__ (not CountVectorizer) multinomial vectors for this analysis.
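
The ranking procedure described above might look roughly like the following sketch. The helper name `top_features`, the toy counts, and the three-word feature list are assumptions for illustration. For a binary problem, `model.coef_` has shape `(1, n_features)`, so the first row holds one weight per feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def top_features(X, y, feature_names, k):
    """Return the k (name, weight) pairs with the largest absolute weight."""
    # Scale each column to zero mean and unit variance so that frequent
    # words do not dominate simply because their raw counts are larger.
    X_scaled = StandardScaler().fit_transform(X)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_scaled, y)
    weights = model.coef_[0]
    order = np.argsort(-np.abs(weights))[:k]  # descending by |weight|
    return [(feature_names[i], float(weights[i])) for i in order]

# Toy counts: "the" appears in every review, "great" only in positive ones,
# "awful" only in negative ones.
feature_names = ["the", "great", "awful"]
X = np.array([[5, 2, 0], [5, 0, 2], [4, 3, 0],
              [4, 0, 3], [6, 1, 0], [6, 0, 1]], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])

top = top_features(X, y, feature_names, k=2)
print(top)
```

With a real vocabulary, `feature_names` would be the list of words ordered by their column index, and the returned column order could also be used to filter the inputs down to the most informative features.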

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [None]:
# YOUR CODE HERE
# train a model on the scaled inputs
# This takes Felix's computer about 6.5 sec to run




In [None]:
# print out the top 20 most informative features according to this model


In [None]:
# re-evaluate your LR model with inputs that have been filtered to only use the top 500 most informative features


In [None]:
# create the same graph as before, but with the filtered inputs
