# Assignment 2: Predicting sentiment
In this assignment, you will be using the same sentiment analysis dataset as for Assignment 1, but you'll be looking to actually predict sentiment based on a variety of text-derived features.

This dataset comes from [Maas et al. (2011)](https://www.aclweb.org/anthology/P11-1015.pdf) and the full version is available [here](http://ai.stanford.edu/~amaas/data/sentiment/).

In [1]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re
from numpy import log, mean

required = {'spacy', 'scikit-learn', 'pandas', 'transformers==2.4.1'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import spacy
import pandas as pd
import numpy as np
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

## Read in data
I've saved a subset of the data in the data directory on the repository.  It is available as a pickled dictionary.


In [31]:
# you will need to change this to wherever the file is stored
data_location = '../data/assignment_1_reviews.pkl'
with open(data_location, 'rb') as f:
    all_text = pickle.load(f)
# corpora size
print([(k, len(all_text[k])) for k in all_text])
neg, pos = all_text.values()
# for this assignment, let's combine all our data, but maintain the labels
all_text = neg+pos
# array makes for easier indexing
is_positive = np.array([False]*len(neg)+[True]*len(pos))
# check that they're equivalent
print(np.bincount(is_positive))

[('neg', 1233), ('pos', 1266)]
[1233 1266]


## Creating document feature vectors
In this section, process all of your text data in order to create the following document-level feature vectors:

- Word Counts (using `CountVectorizer`)
- TF-IDF vectors (using `TfidfVectorizer`)
- Non-Negative Matrix Factorization-based representations (using `NMF`)
- Latent Dirichlet Allocation-based representations (using `LatentDirichletAllocation`)

All of the design elements are up to you (e.g. tokenization, vocabulary limits, number of components).  It may make sense to try out a few different designs.  In the next section we'll do some evaluation of our different strategies.

In [3]:
from spacy.lang.en import English
en = English()

def simple_tokenizer(doc, model=en):
    # tokenize a single document with spaCy, keeping lowercased
    # alphabetic tokens that do not look like URLs
    parsed = model(doc)
    return [t.lower_ for t in parsed if t.is_alpha and not t.like_url]

def simple_vectorizer(data, vec_model):
    vecs = vec_model.fit_transform(data)
    return(vecs)

In [104]:
# initialize vectorizers
cv = CountVectorizer(tokenizer=simple_tokenizer)
tfidf = TfidfVectorizer(tokenizer=simple_tokenizer)
nmf = NMF(n_components=10)
lda = LatentDirichletAllocation(n_components=10)

In [105]:
# tfidf for nmf
tfidf_counts = tfidf.fit_transform(all_text).toarray()
nmf_vecs = nmf.fit_transform(tfidf_counts)
# count for lda
counts = cv.fit_transform(all_text).toarray()
lda_vecs = lda.fit_transform(counts)

## Exploratory analysis on vectors
It's important to do some initial exploration of the features you've engineered.  Remember the goal is to get some information out of text, so you want to ensure your features are informative.  In this case, informative would mean it gives some information about sentiment.

Perform the following analysis and any additional checks that might be useful for creating a set of informative features:
- Top words for positive versus negative (Counts and TF-IDF)
- Topic model performance measures (NMF=Reconstruction error, LDA=Evidence Lower BOund (ELBO))
- Average cosine similarity between negative review vectors and positive review vectors (for all vectors you've created)

Tip: You can use the `is_positive` vector to subset your vectors.  You will likely need to have them in dense array format (use the `.toarray()` method).

In [106]:
# get top x words
top_words = 10
# for pos/neg set
for vectorizer, vecs  in [(cv, counts), (tfidf, tfidf_counts)]:
    for s in [is_positive, ~is_positive]:    
        # sum counts
        s_sum = vecs[s].sum(axis=0)
        # sort arguments
        s_sorted = np.argsort(s_sum)
        # print top words
        print([vectorizer.get_feature_names()[x] for x in s_sorted[-top_words:]])

['that', 'i', 'it', 'in', 'is', 'to', 'of', 'a', 'and', 'the']
['this', 'in', 'i', 'it', 'is', 'of', 'to', 'and', 'a', 'the']
['this', 'in', 'it', 'i', 'is', 'to', 'of', 'a', 'and', 'the']
['in', 'this', 'it', 'is', 'i', 'of', 'to', 'and', 'a', 'the']


In [107]:
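# topic model performance
print('Reconstruction err:', nmf.reconstruction_err_)
# note: scikit-learn documents `bound_` as the final perplexity score on the training set;
# `lda.score(counts)` returns the approximate log-likelihood bound itself
print('ELBO:', lda.bound_)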

Reconstruction err: 45.64506849016089
ELBO: 1169.3050783512356
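
The two topic model measures can be read off a fitted model in a few ways. The sketch below is a minimal illustration on a made-up four-document corpus (the `docs` list and the component counts are assumptions, not the assignment data). It prints the NMF reconstruction error alongside the LDA approximate log-likelihood from `score` and the perplexity from `perplexity`, so the two LDA quantities are not confused.

```python
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy corpus standing in for `all_text` (hypothetical data, for illustration only)
docs = [
    "great film with a great cast",
    "terrible plot and terrible acting",
    "a great story and wonderful acting",
    "boring film with a terrible script",
]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
nmf = NMF(n_components=2, random_state=0, max_iter=500).fit(tfidf)

# approximate log-likelihood bound for the count data (higher is better)
elbo = lda.score(counts)
# perplexity derived from that bound (lower is better)
perplexity = lda.perplexity(counts)
# reconstruction error of the fitted NMF factorization (lower is better)
recon_err = nmf.reconstruction_err_

print('LDA approximate log-likelihood:', elbo)
print('LDA perplexity:', perplexity)
print('NMF reconstruction error:', recon_err)
```

These numbers are only comparable between models fit on the same matrix, so they are most useful for comparing different values of `n_components` on one vocabulary.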


In [108]:
# average cosine similarity
count_sims = cosine_similarity(counts)
tfidf_sims = cosine_similarity(tfidf_counts)
nmf_sims = cosine_similarity(nmf_vecs)
lda_sims = cosine_similarity(lda_vecs)
# compare positive to negative average distance
for s_matrix in [count_sims, tfidf_sims, nmf_sims, lda_sims]:
    print('neg-to-neg:', s_matrix[~is_positive][:, ~is_positive].mean(axis=1).mean(),
          'neg-to-pos:', s_matrix[~is_positive][:, is_positive].mean(axis=1).mean(),
          'pos-to-pos:', s_matrix[is_positive][:, is_positive].mean(axis=1).mean())

neg-to-neg: 0.5334077646265442 neg-to-pos: 0.5314326255417757 pos-to-pos: 0.5369682327485035
neg-to-neg: 0.1297415905734085 neg-to-pos: 0.1236987648556773 pos-to-pos: 0.12412758408737891
neg-to-neg: 0.5553866886819624 neg-to-pos: 0.5378613920269675 pos-to-pos: 0.5474888074545456
neg-to-neg: 0.5436592806791214 neg-to-pos: 0.4692693171162177 pos-to-pos: 0.4215521175388209


How do the above results look? Ideally, your features should carry information that helps a model tell negative reviews from positive ones.  That means lower inter-class similarity and different words showing up as most frequent/relevant for each class.  Experiment with your design choices in the steps above.  Your goal should be a set of vectors with lower inter-class similarity than intra-class similarity (e.g. positive reviews should be more similar to other positive reviews than to negative reviews).
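
One possible design change, sketched here on made-up reviews (the `docs` and `labels` values are hypothetical stand-ins for `all_text` and `is_positive`), is to drop stop words and rank words by how much more often they appear in one class than the other, rather than by raw frequency. This is only one option among many and is not the required approach.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# toy reviews and labels (hypothetical data, for illustration only)
docs = [
    "a great film with a great cast",
    "the plot was terrible and the acting was terrible",
    "a great story and the acting was wonderful",
    "the film was boring and the script was terrible",
]
labels = np.array([True, False, True, False])

# drop English stop words and record presence/absence per document;
# on a real corpus, min_df and max_df could trim the vocabulary further
cv = CountVectorizer(stop_words='english', binary=True)
X = cv.fit_transform(docs).toarray()
# vocabulary ordered by column index
vocab = np.array(sorted(cv.vocabulary_, key=cv.vocabulary_.get))

# share of documents in each class that contain each word
pos_rate = X[labels].mean(axis=0)
neg_rate = X[~labels].mean(axis=0)
diff = pos_rate - neg_rate

order = np.argsort(diff)
print('most negative-leaning:', vocab[order[:3]])
print('most positive-leaning:', vocab[order[-3:]])
```

Ranking by the difference in document share pushes words common to both classes toward zero, which is why generic terms no longer dominate the top of the list.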

## Predicting sentiment
As we did in week 2's notebook, we're now going to use these informative vectors to predict sentiment.  We'll be using `LinearSVC` in this exercise, but feel free to try out other models.

Start by creating a train/test split for the dataset (typically 70%/30%).  We'll use the same split for all feature vectors for comparability. 

Do the following steps for all the feature vectors you developed above:
- Train an SVM model on your feature vectors with the corresponding target values (positive/negative)
- Test the SVM model on the test set and output the accuracy

Tip: Sklearn provides a function for generating train/test splits (`sklearn.model_selection.train_test_split`).  Since we want to use the same reviews for every feature set, make sure you set a `random_state` (see the docs).
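
One way to guarantee the same reviews land in the test set for every feature matrix is to split the row indices once and reuse them. The sketch below uses random toy matrices (`features_a`, `features_b`, and `labels` are hypothetical stand-ins for the feature vectors and `is_positive`); passing the same `random_state` to each `train_test_split` call achieves the same thing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-ins for two feature matrices built from the same 20 reviews (random toy data)
rng = np.random.RandomState(0)
n_docs = 20
features_a = rng.rand(n_docs, 5)
features_b = rng.rand(n_docs, 3)
labels = np.array([False] * 10 + [True] * 10)

# split the row indices once; stratify keeps the class balance in both parts
train_idx, test_idx = train_test_split(
    np.arange(n_docs), test_size=0.3, random_state=42, stratify=labels)

# reuse the same indices for every feature matrix
for name, feats in [('a', features_a), ('b', features_b)]:
    X_train, X_test = feats[train_idx], feats[test_idx]
    y_train, y_test = labels[train_idx], labels[test_idx]
    print(name, X_train.shape, X_test.shape)
```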

In [109]:
test_size = 0.3
# seeded generator so the manual split is reproducible
rng = np.random.RandomState(42)
# mark roughly 30% of the reviews as test observations
test_idxs = rng.random_sample(size=len(is_positive)) <= test_size
def split_vecs(vecs, target=is_positive, test_idxs=test_idxs):
    # apply the same boolean mask to the features and the labels
    X_train = vecs[~test_idxs, :]
    X_test = vecs[test_idxs, :]
    y_train = target[~test_idxs]
    y_test = target[test_idxs]
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = split_vecs(tfidf_counts)

In [110]:
from sklearn.model_selection import train_test_split
random_state = 42
test_size = 0.3
for v in [counts, tfidf_counts, nmf_vecs, lda_vecs]:
    X_train, X_test, y_train, y_test = train_test_split(v, is_positive, 
                                                        test_size=test_size, 
                                                        random_state=random_state)
    svc = LinearSVC()
    svc.fit(X_train, y_train)
    print(accuracy_score(y_test, svc.predict(X_test)))

0.8066666666666666
0.8293333333333334
0.6493333333333333
0.5853333333333334


In [111]:
from sklearn.model_selection import train_test_split
random_state = 42
test_size = 0.3
for v in [np.concatenate([counts, lda_vecs], axis=1),
          np.concatenate([tfidf_counts, nmf_vecs], axis=1)]:
    X_train, X_test, y_train, y_test = train_test_split(v, is_positive, 
                                                        test_size=test_size, 
                                                        random_state=random_state)
    svc = LinearSVC()
    svc.fit(X_train, y_train)
    print(accuracy_score(y_test, svc.predict(X_test)))

0.812
0.8253333333333334


Depending on how you've designed your vectors, you may find that the topic models perform worse than the count vectors.  You may want to try a couple different configurations.  

One key reason may be that we haven't properly used our test observations to simulate "new observations": we've fit our vectorizers on the FULL corpus.  If our test observations are truly "unseen", our vectorizers should be fit only on the training corpus.

Try this out: Split the unprocessed reviews, fit the vectorizer and then the model on the training reviews, then transform the test observations and predict.  See how the accuracy changes.

Tip: You may want to explore sklearn's `Pipeline`, which is designed for exactly this purpose.

In [119]:
# split raw data
test_size = 0.3
test_idxs = np.random.random(len(all_text))<=test_size
test_text = np.array(all_text)[test_idxs]
test_is_positive = np.array(is_positive)[test_idxs]
train_text = np.array(all_text)[~test_idxs]
train_is_positive = np.array(is_positive)[~test_idxs]

In [158]:
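from sklearn.pipeline import Pipeline
# tfidf -> nmf -> svc, fit on the training text only
tfidf = TfidfVectorizer(tokenizer=simple_tokenizer)
nmf = NMF(n_components=10)
# use a fresh classifier rather than reusing the `svc` fit in an earlier cell
p = Pipeline([('tfidf', tfidf),
              ('nmf', nmf),
              ('svc', LinearSVC())])
p.fit(train_text, train_is_positive)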

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...
                 NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0,
                     max_iter=200, n_components=10, random_state=None,
                     shuffle=False, solver='cd', tol=0.0001, verbose=0)),
                ('svc',
                 LinearSVC(C=1.0, class_weight=None, 

In [159]:
# performance train
print('Train accuracy:', p.score(train_text, train_is_positive))
# performance test
print('Test accuracy:', p.score(test_text, test_is_positive))

Train accuracy: 0.6398876404494382
Test accuracy: 0.6203059805285118
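
The same pattern could be extended to compare every feature set on one split of the raw text. The sketch below is a minimal version on made-up reviews (the `texts` and `labels` values, the default tokenizer, and `n_components=2` are all assumptions chosen to keep the example small). Each pipeline fits its vectorizer and topic model on the training text only.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# toy reviews and labels (hypothetical data, for illustration only)
texts = [
    "a great film with a great cast",
    "wonderful story and great acting",
    "great fun from start to finish",
    "a wonderful and moving film",
    "terrible plot and terrible acting",
    "a boring film with a terrible script",
    "boring story and awful acting",
    "awful from start to finish",
]
labels = [True, True, True, True, False, False, False, False]

# split the raw text first so nothing is fit on the test reviews
train_text, test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

pipelines = {
    'counts': Pipeline([('vec', CountVectorizer()),
                        ('svc', LinearSVC())]),
    'tfidf': Pipeline([('vec', TfidfVectorizer()),
                       ('svc', LinearSVC())]),
    'tfidf+nmf': Pipeline([('vec', TfidfVectorizer()),
                           ('nmf', NMF(n_components=2, random_state=0, max_iter=500)),
                           ('svc', LinearSVC())]),
    'counts+lda': Pipeline([('vec', CountVectorizer()),
                            ('lda', LatentDirichletAllocation(n_components=2, random_state=0)),
                            ('svc', LinearSVC())]),
}

# fit each pipeline on the training text and score it on the held-out text
scores = {}
for name, pipe in pipelines.items():
    pipe.fit(train_text, y_train)
    scores[name] = pipe.score(test_text, y_test)
    print(name, scores[name])
```

With only eight toy reviews the scores themselves are not meaningful; the point is that `fit` never sees the test text, so the test accuracy reflects truly unseen observations.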
