# Assignment 2: Predicting sentiment
In this assignment, you will be using the same sentiment analysis dataset as for Assignment 1, but you'll be looking to actually predict sentiment based on a variety of text-derived features.

This dataset comes from [Maas et al. (2011)](https://www.aclweb.org/anthology/P11-1015.pdf) and the full version is available [here](http://ai.stanford.edu/~amaas/data/sentiment/).

In [1]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re
from numpy import log, mean

required = {'spacy', 'scikit-learn', 'pandas', 'transformers==2.4.1'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import spacy
import pandas as pd
import numpy as np
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

## Read in data
I've saved a subset of the data in the data directory on the repository.  It is available as a pickled dictionary.


In [31]:
# you will need to change this to wherever the file is stored
data_location = '../data/assignment_1_reviews.pkl'
with open(data_location, 'rb') as f:
    all_text = pickle.load(f)
# corpora size
print([(k, len(all_text[k])) for k in all_text])
# index by key rather than relying on the dictionary's ordering
neg, pos = all_text['neg'], all_text['pos']
# for this assignment, let's combine all our data, but maintain the labels
all_text = neg+pos
# array makes for easier indexing
is_positive = np.array([False]*len(neg)+[True]*len(pos))
# check that they're equivalent
print(np.bincount(is_positive))

[('neg', 1233), ('pos', 1266)]
[1233 1266]


## Creating document feature vectors
In this section, process all of your text data in order to create the following document-level feature vectors:

- Word Counts (using `CountVectorizer`)
- TF-IDF vectors (using `TfidfVectorizer`)
- Non-Negative Matrix Factorization-based representations (using `NMF`)
- Latent Dirichlet Allocation-based representations (using `LatentDirichletAllocation`)

All of the design elements are up to you (e.g. tokenization, vocabulary limits, number of components).  It may make sense to try out a few different designs.  In the next section we'll do some evaluation of our different strategies.
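As a starting point, the four representations could be built roughly as follows. This is a minimal sketch on a tiny made-up corpus: `toy_reviews` is a hypothetical stand-in for `all_text`, and the stop-word list, `n_components=2`, `max_iter=500`, and `random_state=0` are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; in the assignment this would be `all_text`.
toy_reviews = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "a terrible film with a boring plot",
    "boring story and terrible acting",
]

# Word counts: one row per document, one column per vocabulary term.
count_vectorizer = CountVectorizer(stop_words="english")
count_vectors = count_vectorizer.fit_transform(toy_reviews)

# TF-IDF: counts reweighted so that terms found in many documents count less.
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf_vectors = tfidf_vectorizer.fit_transform(toy_reviews)

# NMF is fit on the TF-IDF matrix here (a common choice, assumed for this sketch).
nmf = NMF(n_components=2, random_state=0, max_iter=500)
nmf_vectors = nmf.fit_transform(tfidf_vectors)

# LDA models word counts, so it is fit on the raw count matrix.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda_vectors = lda.fit_transform(count_vectors)

for name, vectors in [
    ("counts", count_vectors),
    ("tfidf", tfidf_vectors),
    ("nmf", nmf_vectors),
    ("lda", lda_vectors),
]:
    print(name, vectors.shape)
```

The count and TF-IDF matrices are sparse, while the NMF and LDA outputs are dense arrays with one column per component.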

## Exploratory analysis on vectors
It's important to do some initial exploration of the features you've engineered.  Remember the goal is to get some information out of text, so you want to ensure your features are informative.  In this case, an informative feature is one that tells you something about sentiment.

Perform the following analysis and any additional checks that might be useful for creating a set of informative features:
- Top words for positive versus negative (Counts and TF-IDF)
- Topic model performance measures (NMF: reconstruction error; LDA: Evidence Lower Bound (ELBO))
- Average cosine similarity between negative review vectors and positive review vectors (for all vectors you've created)
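For the topic model measures, one possible sketch is below. It assumes that scikit-learn's `reconstruction_err_` attribute (on a fitted `NMF`) and `score` method (on a fitted `LatentDirichletAllocation`, an approximate log-likelihood based on the variational bound) are acceptable proxies for the two measures named above. The toy corpus and the candidate component counts are hypothetical.

```python
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; in the assignment this would be `all_text`.
toy_reviews = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "a terrible film with a boring plot",
    "boring story and terrible acting",
]
counts = CountVectorizer().fit_transform(toy_reviews)
tfidf = TfidfVectorizer().fit_transform(toy_reviews)

topic_model_measures = {}
for k in (2, 3):  # candidate numbers of components (illustrative)
    nmf = NMF(n_components=k, random_state=0, max_iter=500).fit(tfidf)
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    topic_model_measures[k] = {
        # Distance between the input matrix and its low-rank reconstruction; lower is better.
        "nmf_reconstruction_error": nmf.reconstruction_err_,
        # Approximate log-likelihood of the data under the model; higher is better.
        "lda_score": lda.score(counts),
    }

for k, measures in topic_model_measures.items():
    print(k, measures)
```

Both numbers here are computed on the data the models were fit to, so they mainly help compare configurations rather than measure generalization.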

Tip: You can use the `is_positive` vector to subset your vectors.  You will likely need to have them in dense array format (use the `.toarray()` method).
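The subsetting in the tip might look like the sketch below, which ranks words by total count within each class. The corpus, the `toy_is_positive` mask, and the `top_words` helper are all hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for `all_text` and `is_positive`.
toy_reviews = [
    "boring boring plot",
    "terrible boring acting",
    "great great cast",
    "wonderful great story",
]
toy_is_positive = np.array([False, False, True, True])

vectorizer = CountVectorizer()
dense_counts = vectorizer.fit_transform(toy_reviews).toarray()
# Vocabulary terms ordered by their column index in the matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def top_words(matrix, vocab, n=3):
    """Return the n terms with the largest column totals (hypothetical helper)."""
    totals = matrix.sum(axis=0)
    order = np.argsort(totals)[::-1][:n]
    return [(vocab[i], int(totals[i])) for i in order]


# A boolean mask selects rows; `~` flips it to select the negative reviews.
top_negative = top_words(dense_counts[~toy_is_positive], vocab)
top_positive = top_words(dense_counts[toy_is_positive], vocab)
print("negative:", top_negative)
print("positive:", top_positive)
```

The same helper works on a dense TF-IDF matrix, in which case the totals are summed weights rather than counts and the `int(...)` conversion should be dropped.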

How do the above results look? Ideally, your features should carry information that helps a model tell negative reviews from positive ones.  That means lower inter-class similarity and different words showing up as most frequent/relevant in each class.  Experiment with your design choices in the steps above.  Your goal should be a set of vectors with lower inter-class similarity than intra-class similarity (e.g. positive reviews should be more similar to other positive reviews than to negative reviews).
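One way to compare inter-class and intra-class similarity is sketched below on a hypothetical toy corpus. The helper `mean_offdiagonal_similarity` is an illustrative name, and excluding each review's similarity with itself is an assumption about how the intra-class average should be computed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for `all_text` and `is_positive`.
toy_reviews = [
    "boring boring plot film",
    "terrible boring acting film",
    "great great cast film",
    "wonderful great story film",
]
toy_is_positive = np.array([False, False, True, True])

dense_tfidf = TfidfVectorizer().fit_transform(toy_reviews).toarray()
neg_vectors = dense_tfidf[~toy_is_positive]
pos_vectors = dense_tfidf[toy_is_positive]


def mean_offdiagonal_similarity(vectors):
    """Average pairwise cosine similarity, ignoring each row's match with itself."""
    sims = cosine_similarity(vectors)
    n = sims.shape[0]
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


# Inter-class: every negative review against every positive review.
inter_class = cosine_similarity(neg_vectors, pos_vectors).mean()
intra_negative = mean_offdiagonal_similarity(neg_vectors)
intra_positive = mean_offdiagonal_similarity(pos_vectors)

print(f"inter-class:    {inter_class:.3f}")
print(f"intra-negative: {intra_negative:.3f}")
print(f"intra-positive: {intra_positive:.3f}")
```

On the full corpus the pairwise matrices are a few million entries each, which is manageable, but the same function can be applied to a random sample of rows if memory becomes an issue.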

## Predicting sentiment
As we did in week 2's notebook, we're now going to use these informative vectors to predict sentiment.  We'll be using `LinearSVC` in this exercise, but feel free to try out other models.

Start by creating a train/test split for the dataset (typically 70%/30%).  We'll use the same split for all feature vectors for comparability. 

Do the following steps for all the feature vectors you developed above:
- Train an SVM model on your feature vectors with the corresponding target values (positive/negative)
- Test the SVM model on the test set and output the accuracy

Tip: Sklearn has a function for generating train/test splits (`sklearn.model_selection.train_test_split`).  Since we want to use the same reviews for every feature set, make sure you set a `random_state` (see the docs).
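The steps above can be sketched as follows. This is a minimal example on a hypothetical toy corpus. Splitting row indices once and reusing them is one way (an assumption, not the only option) to guarantee the same reviews land in the test set for every feature matrix. The `stratify` argument and `random_state=42` are likewise illustrative choices.

```python
from itertools import combinations

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for `all_text` and `is_positive`.
positive_words = ["great", "wonderful", "loved", "excellent"]
negative_words = ["boring", "terrible", "hated", "awful"]
toy_reviews = [f"{a} {b} film" for a, b in combinations(negative_words, 2)]
toy_reviews += [f"{a} {b} film" for a, b in combinations(positive_words, 2)]
toy_is_positive = np.array([False] * 6 + [True] * 6)

# Split the row indices once so every feature matrix uses the same reviews.
train_idx, test_idx = train_test_split(
    np.arange(len(toy_reviews)),
    test_size=0.3,
    random_state=42,
    stratify=toy_is_positive,
)

# As in the first pass of the assignment, the vectorizers see the full corpus here.
feature_sets = {
    "counts": CountVectorizer().fit_transform(toy_reviews),
    "tfidf": TfidfVectorizer().fit_transform(toy_reviews),
}

accuracies = {}
for name, vectors in feature_sets.items():
    model = LinearSVC(random_state=0)
    model.fit(vectors[train_idx], toy_is_positive[train_idx])
    predictions = model.predict(vectors[test_idx])
    accuracies[name] = accuracy_score(toy_is_positive[test_idx], predictions)
    print(f"{name}: accuracy = {accuracies[name]:.2f}")
```

The NMF and LDA matrices can be added to `feature_sets` in the same way, since they have one row per review in the same order.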

Depending on how you've designed your vectors, you may find that the topic models perform worse than the count vectors.  You may want to try a couple of different configurations.

One thing that may be affecting these results: if the goal is for our test observations to simulate "new observations", we haven't done that properly.  We've fit our vectorizers on the FULL corpus.  If our test observations are truly "unseen", our vectorizers should be fit only on the training corpus.

Try this out: split the unprocessed reviews, fit the vectorizer and then the model on the training reviews, then transform the test observations and predict.  See how the accuracy changes.

Tip: You may want to explore sklearn's `Pipeline`, which is designed for exactly this purpose.
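A minimal sketch of that workflow with `sklearn.pipeline.Pipeline` is below. The toy corpus, the step names `"tfidf"` and `"svm"`, and the choice of `TfidfVectorizer` as the only preprocessing step are assumptions for illustration. A topic model such as `NMF` could be added as an intermediate step in the same way.

```python
from itertools import combinations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for `all_text` and `is_positive`.
positive_words = ["great", "wonderful", "loved", "excellent"]
negative_words = ["boring", "terrible", "hated", "awful"]
toy_reviews = [f"{a} {b} film" for a, b in combinations(negative_words, 2)]
toy_reviews += [f"{a} {b} film" for a, b in combinations(positive_words, 2)]
toy_is_positive = np.array([False] * 6 + [True] * 6)

# Split the raw text first, so the test reviews stay unseen by every step.
train_text, test_text, y_train, y_test = train_test_split(
    toy_reviews,
    toy_is_positive,
    test_size=0.3,
    random_state=42,
    stratify=toy_is_positive,
)

pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("svm", LinearSVC(random_state=0)),
    ]
)

# fit() learns the vocabulary and IDF weights from the training reviews only.
pipeline.fit(train_text, y_train)

# predict() transforms the test reviews with the already-fitted vectorizer.
accuracy = accuracy_score(y_test, pipeline.predict(test_text))
print(f"accuracy = {accuracy:.2f}")
```

Because the vectorizer is fit inside the pipeline, words that appear only in the test reviews are ignored at prediction time, which is what would happen with genuinely new reviews.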