# Assignment 2: Predicting sentiment
In this assignment, you will be using the same sentiment analysis dataset as for Assignment 1, but you'll be looking to actually predict sentiment based on a variety of text-derived features.

This dataset comes from [Maas et al. (2011)](https://www.aclweb.org/anthology/P11-1015.pdf) and the full version is available [here](http://ai.stanford.edu/~amaas/data/sentiment/).

In [1]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re
from numpy import log, mean

required = {'spacy', 'scikit-learn', 'pandas', 'transformers==2.4.1'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import spacy
import pandas as pd
import numpy as np
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

## Read in data
I've saved a subset of the data in the data directory on the repository.  It is available as a pickled dictionary.


In [2]:
# you will need to change this to wherever the file is stored
data_location = './data/assignment_1_reviews.pkl'
with open(data_location, 'rb') as f:
    all_text = pickle.load(f)
# corpora size
print([(k, len(all_text[k])) for k in all_text])
neg, pos = all_text.values()
# for this assignment, let's combine all our data, but maintain the labels
all_text = neg+pos
first_pos = len(neg)
# array makes for easier indexing
is_positive = np.array([False]*len(neg)+[True]*len(pos))
# check that they're equivalent
print(np.bincount(is_positive))

[('neg', 1233), ('pos', 1266)]
[1233 1266]


## Creating document feature vectors
In this section, process all of your text data in order to create the following document-level feature vectors:

- Word Counts (using `CountVectorizer`)
- TF-IDF vectors (using `TfidfVectorizer`)
- Non-Negative Matrix Factorization-based representations (using `NMF`)
- Latent Dirichlet Allocation-based representations (using `LatentDirichletAllocation`)

All of the design elements are up to you (e.g. tokenization, vocabulary limits, number of components).  It may make sense to try out a few different designs.  In the next section we'll do some evaluation of our different strategies.
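As a rough orientation, the four representations listed above can be sketched on a tiny corpus. The `toy_*` names and the example sentences are hypothetical stand-ins rather than part of the assignment data, and the built-in English stop word list is used here only to keep the sketch short.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Hypothetical toy corpus standing in for the reviews
toy_docs = [
    "great film with a wonderful story",
    "bad movie with terrible acting",
    "wonderful acting and a great cast",
    "terrible story and a bad script",
]

# Word counts and TF-IDF weights: one row per document, one column per term
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Topic-style representations: one row per document, one column per component
toy_nmf = NMF(n_components=2, init="nndsvd", random_state=0,
              max_iter=500).fit_transform(toy_tfidf)
toy_lda = LatentDirichletAllocation(n_components=2,
                                    random_state=0).fit_transform(toy_counts)

print(toy_counts.shape, toy_tfidf.shape, toy_nmf.shape, toy_lda.shape)
```

The count and TF-IDF matrices share a vocabulary-sized column dimension, while the NMF and LDA outputs have one column per component, which is why the number of components is a design choice worth exploring.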

In [3]:
# Import and instantiate English model from Spacy
from spacy.lang.en import English
en = English()

# Import and instantiate trained model from Spacy
nlp = spacy.load("en_core_web_sm")

In [4]:
def tokenizer(doc, model=en):
    '''Tokenizer based on the example from the week_1_intro notebook.
    Keeps alphabetic tokens that are neither URL-like nor stop words and returns their lemmas.'''
    parsed = model(doc)
    # Return the lemmas of tokens that are alphabetic, not URL-like, and not stop words
    return [t.lemma_ for t in parsed if t.is_alpha and not t.like_url and not t.is_stop]

In [5]:
cv = CountVectorizer(tokenizer=tokenizer)

In [6]:
# Helper function to get count vectors using fit transform on a count vectorizer
def get_count_vectors(text, cv=cv):
    return cv.fit_transform(text).toarray()

In [7]:
# Helper function to get count dictionary using count vectors and count vectorizer
def get_count_dict(count_vectors, cv=cv):
    return dict(zip(cv.get_feature_names(), count_vectors.sum(axis=0)))

In [8]:
# Get word counts and represent as dict
count_vectors = get_count_vectors(all_text)
count_dict = get_count_dict(count_vectors)

In [9]:
# Instantiate simple tfidf vectorizer
tfidf_v = TfidfVectorizer(tokenizer=tokenizer)

In [10]:
def get_tfidf_vectors(text, tfidf_v=tfidf_v):
    return tfidf_v.fit_transform(text).toarray()

In [11]:
# Helper function to get a dictionary mapping each word to its summed TF-IDF weight
def get_tfidf_dict(tfidf_vectors, tfidf_v=tfidf_v):
    return dict(zip(tfidf_v.get_feature_names(), tfidf_vectors.sum(axis=0)))

In [15]:
# Get tfidf vectors and represent as dict - each word is the key and the tfidf number is the value
tfidf_vectors = get_tfidf_vectors(all_text)
all_text_tfidf_dict = get_tfidf_dict(tfidf_vectors)

In [16]:
# NMF using the tfidf vectors
# Using 10 components - will explore different numbers later in the assignment
nmf_n = 10
nmf = NMF(n_components=nmf_n)
nmf_vectors = nmf.fit_transform(tfidf_vectors)

In [17]:
# LDA using the count_vectors
# Using 10 components - will explore different numbers later in the assignment
lda_n = 10
lda = LatentDirichletAllocation(n_components=lda_n)
lda_vectors = lda.fit_transform(count_vectors)

## Exploratory analysis on vectors
It's important to do some initial exploration of the features you've engineered.  Remember the goal is to get some information out of text, so you want to ensure your features are informative.  In this case, informative would mean it gives some information about sentiment.

Perform the following analysis and any additional checks that might be useful for creating a set of informative features:
- Top words for positive versus negative (Counts and TF-IDF)
- Topic model performance measures (NMF=Reconstruction error, LDA=Evidence Lower BOund (ELBO))
- Average cosine similarity between negative review vectors and positive review vectors (for all vectors you've created)

Tip: You can use the is_positive vector to subset your vectors.  You will likely need to have them in dense array format (use the `.toarray()` method.)
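The masking tip can be sketched on a small dense array. The vectors and the `toy_is_positive` labels below are made up for illustration, and `mean_cosine` is a hypothetical helper, not something provided by the course notebooks. Like a plain mean over a similarity matrix, the within-class averages here include each row's similarity with itself.

```python
import numpy as np

# Hypothetical document vectors: 3 negative rows followed by 2 positive rows
toy_vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.1, 0.8],
    [0.9, 0.0, 1.0],
    [0.0, 1.0, 0.2],
    [0.1, 1.0, 0.0],
])
toy_is_positive = np.array([False, False, False, True, True])


def mean_cosine(a, b):
    """Mean pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a_unit @ b_unit.T).mean())


# Boolean masks select the rows belonging to each class
neg_vecs = toy_vectors[~toy_is_positive]
pos_vecs = toy_vectors[toy_is_positive]

print("neg-to-neg:", mean_cosine(neg_vecs, neg_vecs))
print("neg-to-pos:", mean_cosine(neg_vecs, pos_vecs))
print("pos-to-pos:", mean_cosine(pos_vecs, pos_vecs))
```

Informative features should give a cross-class average that is lower than both within-class averages, as this toy data is constructed to do.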

In [18]:
def get_most_frequent_words(corpus, cv=cv, num_words=10):
    '''Gets the most frequent words in a corpus, using a count vectorizer on the generated corpus dict'''
    corpus_dict = get_corpus_dict(corpus, cv)
    return sorted(corpus_dict, key=corpus_dict.get, reverse=True)[:num_words]

In [19]:
def get_corpus_dict(corpus, cv=cv):
    '''Creates a dictionary of the words and their counts in the corpus using a count vectorizer'''
    v = cv.fit_transform(corpus).toarray()
    corpus_dict = dict(zip(cv.get_feature_names(), v.sum(axis=0)))
    return corpus_dict

In [20]:
# Top 10 words - counts
# Negative reviews
neg_words_top = get_most_frequent_words(neg)
print(f'The top words by count in negative reviews are: {neg_words_top}')

# Positive reviews
pos_words_top = get_most_frequent_words(pos)
print(f'The top words by count in positive reviews are: {pos_words_top}')

# Note: these are not very informative! As we've seen so far in the course, word counts are not the best 
# metric for determining sentiment, as evidenced by the overlap between these two lists.

The top words by count in negative reviews are: ['movie', 'film', 'like', 'bad', 'good', 'time', 'story', 'people', 'br', 'movies']
The top words by count in positive reviews are: ['film', 'movie', 'like', 'good', 'great', 'story', 'time', 'best', 'love', 'br']


In [21]:
# I tried using the trained nlp English model, but there wasn't a significant improvement in performance.
# However, there was a significant decrease in efficiency, so I'm removing it from the tf-idf computation below.
def nlp_tokenizer(doc, model=nlp):
    '''Tokenizer based on the example from the week_1_intro notebook.
    Keeps alphabetic tokens that are neither URL-like nor stop words and returns their lemmas.
    Uses the trained nlp model from spacy by default.'''
    parsed = model(doc)
    # Return the lemmas of tokens that are alphabetic, not URL-like, and not stop words
    return [t.lemma_ for t in parsed if t.is_alpha and not t.like_url and not t.is_stop]

In [22]:
# Impose a min and max df to ensure the words aren't too frequent or infrequent
tfidf_v2 = TfidfVectorizer(tokenizer=tokenizer, min_df=0.01, max_df=0.9)

def get_top_tfidf_words(text, tfidf_v=tfidf_v2, num_words=10):
    '''Gets the top words from tfidf vectors, based on tfidf values'''
    # Use the vectorizer passed in as an argument rather than always using tfidf_v2
    tfidf_vectors = get_tfidf_vectors(text, tfidf_v=tfidf_v)
    tfidf_dict = dict(zip(tfidf_v.get_feature_names(), tfidf_vectors.sum(axis=0)))
    return sorted(tfidf_dict, key=tfidf_dict.get, reverse=True)[:num_words]

In [23]:
# Top 10 words - TF-IDF
# Positive
pos_top_tfidf_words = get_top_tfidf_words(pos)
print(f'The top words by tfidf in positive reviews are: {pos_top_tfidf_words}')

# Negative
neg_top_tfidf_words = get_top_tfidf_words(neg)
print(f'The top words by tfidf in negative reviews are: {neg_top_tfidf_words}')

# Note: as you can see, these don't appear to be very helpful either! I'm thinking we will need a more 
# nuanced approach to understanding sentiment.

The top words by tfidf in positive reviews are: ['movie', 'film', 'good', 'like', 'great', 'story', 'time', 'best', 'love', 'watch']
The top words by tfidf in negative reviews are: ['movie', 'film', 'like', 'bad', 'good', 'time', 'story', 'br', 'people', 'movies']


In [24]:
# Code from week_2_vectors
def display_components(model, word_features, top_display=5):
    # utility for displaying representative words per component for topic models
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        top_words_idx = topic.argsort()[::-1][:top_display]
        top_words = [word_features[i] for i in top_words_idx]
        print(" ".join(top_words))

In [25]:
def init_nmf(text, n_components=10):
    '''Helper function that initializes an nmf'''
    # Set random state for reproducibility
    # Init with 'nndsvd' because according to the documentation it is better for sparseness
    # Added alpha value to regularize, so calculation is more stable
    nmf = NMF(n_components=n_components, alpha=0.1, random_state=101, init='nndsvd')
    tfidf_vectors = get_tfidf_vectors(text, tfidf_v=tfidf_v2)
    nmf_vectors = nmf.fit_transform(tfidf_vectors)
    return nmf

In [26]:
# Compare reconstruction error for different number of components
for n_components in [30,50,75,100,150]:
    nmf = init_nmf(all_text, n_components)
    # Topic model performance metric for NMF
    print(f'The reconstruction error for NMF with {n_components} components is: {nmf.reconstruction_err_}')
    
# Note: this takes a long time to run, so I've saved the results. I also tried 3, 7, 10, and 15 components, but
# the reconstruction error dropped noticeably with 30+ components. I ultimately chose 75 because there was a slight elbow,
# indicating that the gains from further increases were not necessarily worth the extra computation.

The reconstruction error for NMF with 30 components is: 46.08429797250744
The reconstruction error for NMF with 50 components is: 45.027862802972145
The reconstruction error for NMF with 75 components is: 43.891218393080685
The reconstruction error for NMF with 100 components is: 42.85723091916044
The reconstruction error for NMF with 150 components is: 40.95881272754923
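
For reference, with the default Frobenius loss the reported `reconstruction_err_` is the Frobenius norm of the difference between the input matrix and the product of the two learned factors. The sketch below checks this on a random non-negative matrix; `toy_X` and the component counts are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix standing in for the TF-IDF vectors
rng = np.random.RandomState(0)
toy_X = rng.rand(20, 8)

for k in (2, 4, 6):
    toy_nmf = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=1000)
    W = toy_nmf.fit_transform(toy_X)  # document-by-component weights
    H = toy_nmf.components_           # component-by-term weights
    # Frobenius norm of the residual between the data and its reconstruction
    manual_err = np.linalg.norm(toy_X - W @ H)
    print(k, manual_err, toy_nmf.reconstruction_err_)
```

Because the error is not normalized, it is only comparable across models fit on the same matrix, which is the case when the same vectorizer output is reused for every component count.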


In [28]:
nmf_n = 75
nmf = NMF(n_components=nmf_n, random_state=101)
tfidf_vectors_all = get_tfidf_vectors(all_text, tfidf_v=tfidf_v2)
nmf_vectors_all = nmf.fit_transform(tfidf_vectors_all)

# Print out the components and analyze them
display_components(nmf, tfidf_v2.get_feature_names(), 10)

# Note: for 10 components, they appear to be split along genres and/or types of movies/TV.
# This split might be helpful, but it's still unclear exactly how this might be useful. Again, ultimately I landed on
# 75 as the number of components, which is what I've included above. 75 components also appears to split along
# similar lines, but also includes production type (e.g. low-quality, camera styles, etc.). Much more informative.

Topic 0:
film heard directed quality makes making director previous hollywood festival
Topic 1:
movie kind theater horrible main makes amazing nudity sure friend
Topic 2:
wife police husband murder crime woman french law tries car
Topic 3:
like look feel looks trying guys totally let cut makes
Topic 4:
br audience opera okay money line director screenplay actor credit
Topic 5:
series episodes television final carry excellent usual end season release
Topic 6:
plot twists better acting cast audience sub twist holes surprising
Topic 7:
book read novel books adaptation completely version changes totally parts
Topic 8:
bad acting badly gave guys ridiculous low flick usually pretty
Topic 9:
funny comedy laugh jokes comedies humor humour comedic hilarious romantic
Topic 10:
seen absolutely worse having times know read maybe far possible
Topic 11:
horror gore blood scary creepy slasher suspense scared genre flick
Topic 12:
great greatest acting ending brilliant job wonderful casting makes not


In [29]:
# Create new count vectorizer with min and max df
cv2 = CountVectorizer(tokenizer=tokenizer, min_df=0.01, max_df=0.9)

In [30]:
def init_lda(text, n_components):
    '''Helper function to initialize an LDA'''
    # Increase the learning offset slightly, which down-weights early iterations in online learning
    # I played around with learning decay, but didn't find anything better than the default rate of 0.7
    lda = LatentDirichletAllocation(n_components=n_components, learning_offset=15, random_state=101)
    count_vectors = cv2.fit_transform(text).toarray()
    lda_vectors = lda.fit_transform(count_vectors)
    return lda

In [31]:
# Compare lda bounds for different number of components
for n_components in range(1,9):
    lda = init_lda(all_text, n_components)
    # Topic model performance metric for LDA
    print(f'The bound for LDA with {n_components} components is: {lda.bound_}')
    
# Note: `bound_` is the final perplexity on the training data, so lower is better. It appears that the value
# generally increases with the number of components. I tried 30, 50, 100, 150, 300, and higher, and the values
# kept getting worse. So I tried 1-15, then narrowed the search to 1-8 to tune hyperparameters and found that
# 2 was the best option.

The bound for LDA with 1 components is: 893.6286335855411
The bound for LDA with 2 components is: 885.4677455027298
The bound for LDA with 3 components is: 901.1347060471696
The bound for LDA with 4 components is: 911.5578782461239
The bound for LDA with 5 components is: 926.3874138501176
The bound for LDA with 6 components is: 938.8797631098181
The bound for LDA with 7 components is: 953.125259602031
The bound for LDA with 8 components is: 968.8322869166778
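
As a small illustration of how this number can be read, scikit-learn documents `bound_` as the final perplexity on the training data, and the `perplexity` method computes the same kind of score for any count matrix. The random `toy_counts` matrix below is a made-up stand-in for the `CountVectorizer` output, so the printed values mean nothing beyond showing the calls.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix standing in for the CountVectorizer output
rng = np.random.RandomState(0)
toy_counts = rng.randint(0, 5, size=(30, 12))

for k in (2, 4):
    toy_lda = LatentDirichletAllocation(n_components=k, random_state=0)
    toy_lda.fit(toy_counts)
    # bound_: perplexity recorded at the end of fitting (lower is better)
    # perplexity(): the same kind of score, computed on demand for a given matrix
    print(k, toy_lda.bound_, toy_lda.perplexity(toy_counts))
```

Calling `perplexity` on held-out counts rather than the training counts is one way to compare component numbers without rewarding overfitting.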


In [32]:
# LDA
lda_n = 2
lda_all = LatentDirichletAllocation(n_components=lda_n, random_state=101)
count_vectors_all = cv2.fit_transform(all_text).toarray()
lda_vectors_all = lda_all.fit_transform(count_vectors_all)

# Print out the components and analyze them
display_components(lda_all, cv2.get_feature_names(), 20)

# Note: while 2 components might have the lowest bound, it doesn't give much information when displayed with the
# above function. 
# I also analyzed 10 components and 20 components. The split was more helpful in this case, but still didn't seem
# very useful for sentiment analysis.

Topic 0:
film story like br good films great life time man love character best characters way young world scenes director little
Topic 1:
movie like film good bad time people watch movies think acting know seen plot great watching way better story funny


In [33]:
# Cosine similarity - counts
count_sims = cosine_similarity(count_vectors_all)

# Cosine similarity - tfidf
tfidf_sims = cosine_similarity(tfidf_vectors_all)

# Cosine similarity - nmf
nmf_sims = cosine_similarity(nmf_vectors_all)

# Cosine similarity - lda
lda_sims = cosine_similarity(lda_vectors_all)

sims_data = {'(COUNT)': count_sims, '(TF-IDF)': tfidf_sims, '(NMF)': nmf_sims, '(LDA)': lda_sims}

# compare average within-class and between-class similarity
# code from week_2_vectors
for key, s_matrix in sims_data.items():
    print(f'{key} neg-to-neg:', s_matrix[:first_pos, :first_pos].mean(axis=1).mean(),
          'neg-to-pos:', s_matrix[:first_pos, first_pos:].mean(axis=1).mean(),
          'pos-to-pos:', s_matrix[first_pos:, first_pos:].mean(axis=1).mean())

(COUNT) neg-to-neg: 0.12424753323103478 neg-to-pos: 0.11098543653407493 pos-to-pos: 0.11192883224959577
(TF-IDF) neg-to-neg: 0.056494811837617544 neg-to-pos: 0.048939370583884215 pos-to-pos: 0.05228368790762318
(NMF) neg-to-neg: 0.19259946727305174 neg-to-pos: 0.17267149168558535 pos-to-pos: 0.18170457858479086
(LDA) neg-to-neg: 0.7974850336027157 neg-to-pos: 0.6980764700131532 pos-to-pos: 0.7741226389173657


How do the above results look? Ideally you should see that your features give some information that might help a model discern negative from positive reviews.  That means lower inter-class similarity and different words showing up as most frequent/relevant.  Experiment with your design choices on the steps above.  Your goal should be to get to a set of vectors that have lower inter-class similarity than intra-class similarity (e.g. positive reviews should be more similar to positive reviews than to negative reviews).

In [34]:
# Generally speaking, for all of the comparison types, intra-class similarity is higher than inter-class,
# but not by much for pos-to-pos. This is especially true for the count comparison. It seems like negatives are
# more similar to each other than positives are to each other or to negatives. This might be explained by the type of review:
# people are more likely to be emphatic and passionate about their negative reviews than their positive reviews. It
# might be the case that this sentiment became encoded in the analysis.
# Overall, it appears to be a successful collection of calculations.

# I included my notes above on various approaches I explored to optimize the different vector outputs

## Predicting sentiment
As we did in week 2's notebook, we're now going to use these informative vectors to predict sentiment.  We'll be using `LinearSVC` in this exercise, but feel free to try out other models.

Start by creating a train/test split for the dataset (typically 70%/30%).  We'll use the same split for all feature vectors for comparability. 

Do the following steps for all the feature vectors you developed above:
- Train an SVM model on your feature vectors with the corresponding target values (positive/negative)
- Test the SVM model on the test set and output the accuracy

Tip: Sklearn has a train/test split functionality for generating train/test splits (`sklearn.model_selection.train_test_split`).  Since we want to use the same reviews, make sure you set a random_state (see the docs).
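
One possible way to keep the split identical across feature sets is to split row indices once and reuse them. This is only a sketch: the `toy_*` arrays are hypothetical, and passing `stratify` to keep the class balance similar in both parts is an added assumption, not something the assignment requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 10 "reviews" and their labels
toy_reviews = [f"review {i}" for i in range(10)]
toy_labels = np.array([0] * 5 + [1] * 5)

# Split row indices once; a fixed random_state makes the split repeatable
idx = np.arange(len(toy_reviews))
train_idx, test_idx = train_test_split(
    idx, test_size=0.3, random_state=101, stratify=toy_labels
)

# Any feature matrix built from the same reviews can be sliced the same way
toy_features = np.random.RandomState(0).rand(10, 4)
X_tr, X_te = toy_features[train_idx], toy_features[test_idx]
y_tr, y_te = toy_labels[train_idx], toy_labels[test_idx]
print(len(train_idx), len(test_idx))
```

Splitting the raw text directly with the same `random_state`, as `train_test_split(all_text, y, ...)` does, gives the same guarantee as long as every call receives the reviews in the same order.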

In [35]:
from sklearn.model_selection import train_test_split

In [36]:
y = np.array([0]*len(neg)+[1]*len(pos))

In [37]:
X_train, X_test, y_train, y_test = train_test_split(all_text, y, test_size=0.30, random_state=101)

Depending on how you've designed your vectors, you may find that the topic models perform worse than the count vectors.  You may want to try a couple different configurations.  

One key reason may be that we haven't properly used our test observations to simulate "new observations": we've fit our vectorizers on the FULL corpus.  If our test observations are truly "unseen", our vectorizers should be fit only on the training corpus.

Try this out: split the unprocessed reviews, fit the vectorizer and then the model on the training reviews, then transform the test observations and predict.  See how the accuracy changes.

Tip: You may want to explore sklearn's `Pipeline`, which is designed for exactly this purpose.
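
A minimal sketch of that idea, assuming a TF-IDF, NMF, and `LinearSVC` chain: the pipeline fits every step on the training reviews only and then applies the already-fitted steps to the test reviews. The toy sentences, labels, and step names are hypothetical and are not drawn from the assignment data.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for X_train / y_train / X_test
toy_train = [
    "great film wonderful story", "wonderful acting great cast",
    "great story wonderful cast", "bad movie terrible acting",
    "terrible story bad script", "bad script terrible movie",
]
toy_y_train = [1, 1, 1, 0, 0, 0]
toy_test = ["wonderful film great acting", "terrible movie bad story"]

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nmf", NMF(n_components=2, init="nndsvd", random_state=101, max_iter=500)),
    ("svc", LinearSVC(random_state=101)),
])

# fit() fits and transforms each step using the training reviews only
toy_pipe.fit(toy_train, toy_y_train)
# predict() only calls transform on the fitted steps before classifying
toy_preds = toy_pipe.predict(toy_test)
print(toy_preds)
```

Because the test reviews pass through the vocabulary and components learned from the training data, their feature columns mean the same thing as the columns the classifier was trained on.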

In [38]:
def get_vectorizer_accuracy(vectorizer, X_train, X_test, y_train, y_test):
    # Initialize linearsvc model
    model = LinearSVC(random_state=101)
    # Get vectors of training data
    train_vectors = vectorizer.fit_transform(X_train).toarray()
    # Fit the model to that data
    model.fit(train_vectors, y_train)
    # Get the test vectors for the data
    test_vectors = vectorizer.transform(X_test).toarray()
    # Get the test predictions
    test_preds = model.predict(test_vectors)
    return accuracy_score(y_test, test_preds)

In [39]:
svc_count = LinearSVC(random_state=101)
count_accuracy = get_vectorizer_accuracy(cv, X_train, X_test, y_train, y_test)
print(f'The accuracy for count is: {count_accuracy}')

The accuracy for count is: 0.8133333333333334


In [40]:
# Instantiate simple tfidf vectorizer
tfidf_v3 = TfidfVectorizer(tokenizer=tokenizer)
tfidf_accuracy = get_vectorizer_accuracy(tfidf_v3, X_train, X_test, y_train, y_test)
print(f'The accuracy for TF-IDF is: {tfidf_accuracy}')

The accuracy for TF-IDF is: 0.86


In [41]:
tfidf_v4 = TfidfVectorizer(tokenizer=tokenizer)

# Note: this function and the lda one can be combined with get_vectorizer_accuracy, I just ran out of time...

def get_nmf_accuracy(X_train, X_test, y_train, y_test):
    # Initialize svc model
    model = LinearSVC(random_state=101)
    # Initialize nmf
    nmf = NMF(n_components=7, random_state=101)
    # Get tfidf vectors for train data
    tfidf_train_vectors = tfidf_v4.fit_transform(X_train).toarray()
    # Use nmf to fit transform the tfidf vectors for train data
    train_vectors = nmf.fit_transform(tfidf_train_vectors)
    # Fit the svc model
    model.fit(train_vectors, y_train)
    # Refit TF-IDF on the test data (transform() would instead reuse the training vocabulary)
    tfidf_test_vectors = tfidf_v4.fit_transform(X_test).toarray()
    # Refit NMF on the test vectors (transform() would instead reuse the training components)
    test_vectors = nmf.fit_transform(tfidf_test_vectors)
    # Make test predictions
    test_preds = model.predict(test_vectors)
    return accuracy_score(y_test, test_preds)

In [42]:
nmf_accuracy = get_nmf_accuracy(X_train, X_test, y_train, y_test)
print(f'The accuracy for NMF is: {nmf_accuracy}')

The accuracy for NMF is: 0.5293333333333333


In [43]:
cv4 = CountVectorizer(tokenizer=tokenizer)

def get_lda_accuracy(X_train, X_test, y_train, y_test):
    # Initialize svc model
    model = LinearSVC(random_state=101)
    # Initialize lda
    lda = LatentDirichletAllocation(n_components=98, learning_offset=15, random_state=101)
    # Get count vectors for train data
    count_train_vectors = cv4.fit_transform(X_train).toarray()
    # Use lda to fit transform the count vectors for train data
    train_vectors = lda.fit_transform(count_train_vectors)
    # Fit the svc model
    model.fit(train_vectors, y_train)
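    # Refit the count vectorizer on the test data (transform() would instead reuse the training vocabulary)
    count_test_vectors = cv4.fit_transform(X_test).toarray()
    # Refit LDA on the test vectors (transform() would instead reuse the training topics)
    test_vectors = lda.fit_transform(count_test_vectors)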
    # Make test predictions
    test_preds = model.predict(test_vectors)
    return accuracy_score(y_test, test_preds)

In [44]:
lda_accuracy = get_lda_accuracy(X_train, X_test, y_train, y_test)
print(f'The accuracy for LDA is: {lda_accuracy}')

The accuracy for LDA is: 0.516


In [45]:
# I found that changing the number of components for NMF and LDA had a significant impact on the accuracy.
# But it was the inverse of what I found in previous sections: for NMF, accuracy decreased as the number of
# components increased; for LDA, accuracy increased with the number of components. For LDA, I tested different learning
# rates and decay and settled on the above configurations.

In [46]:
# It appears TF-IDF is the winner!