# Assignment 1: Tokenization and Word counts for sentiment analysis
In this assignment, you will be applying the techniques learned in week 1 of the course to perform and analyze sentiment on a dataset of movie reviews from IMDB.

This dataset comes from [Mass et. al. (2011)](https://www.aclweb.org/anthology/P11-1015.pdf) and the full version is available [here](http://ai.stanford.edu/~amaas/data/sentiment/).

In [1]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re
from numpy import log, mean

required = {'spacy', 'scikit-learn', 'pandas', 'transformers'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import spacy
import pandas as pd
import pickle
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
## NOTE: Below is just for reference for how I generated the data
## if you run this, it will not work!

from glob import glob
import numpy as np
pct_sample = 0.1
all_text = {}
for p in ['neg', 'pos']:
    all_text[p] = []
    for f in glob('/Users/batorsky/Downloads/aclImdb/test/%s/*.txt' % p):
        if np.random.rand()<=pct_sample:
            all_text[p].append(open(f, encoding='utf-8').read())
with open('../data/assignment_1_reviews.pkl', 'wb') as f:
    pickle.dump(all_text, f)

FileNotFoundError: [Errno 2] No such file or directory: '../data/assignment_1_reviews.pkl'

## Read in data

I've already processed the full dataset for you and saved it as a data file: `assignment_1_reviews.pkl`.  You don't need to generate it.

In [4]:
# you will need to change this to where ever the file is stored
# on colab, you can likely just put this as 'assignment_1_reviews.pkl'
data_location = './data/assignment_1_reviews.pkl'
with open(data_location, 'rb') as f:
    all_text = pickle.load(f)
# corpora size
print([(k, len(all_text[k])) for k in all_text])
# for simplicity, let's split these into separate sets
neg, pos = all_text.values()

[('neg', 1233), ('pos', 1266)]


## Tokenization
Use what you've developed in the week 1 notebook to tokenize each of the corpora.

In [5]:
from spacy.lang.en import English
en = English()

def simple_tokenizer(doc, model=en):
    # a simple tokenizer for individual documents (different from above)
    tokenized_docs = []
    parsed = model(doc)
    return([t.lower_ for t in parsed if (t.is_alpha)&(not t.like_url)])

In [6]:
token_neg = [simple_tokenizer(x) for x in neg]
token_pos = [simple_tokenizer(x) for x in pos]

## Word counts
Create a count of the number of words in each review.  Use scikit-learn's CountVectorizer.  Refer to the documentation as it has a few parameters you might want to think about.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
cv = CountVectorizer(tokenizer=simple_tokenizer)
# should probably fit on the combined
cv.fit(neg+pos)
count_neg = cv.transform(neg).toarray()
count_pos = cv.transform(pos).toarray()

In [9]:
# can use pandas DF here
neg_df = pd.DataFrame(count_neg, columns=cv.get_feature_names())
split_pos = len(neg_df)
pos_df = pd.DataFrame(count_pos, columns=cv.get_feature_names())
# combine for more ease
all_df = pos_df.append(neg_df)

## Most frequent words
What are the top 10 most frequent words in the positive reviews? The negative reviews?

In [10]:
def get_top(data, n=10, split_pos=len(neg_df)):
    top_df = pd.concat([data.iloc[:split_pos].sum().T.nlargest(n),
               data.iloc[split_pos:].sum().T.nlargest(n)],
              axis=1)
    top_df.columns = [True, False]
    return(top_df)
get_top(all_df)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  after removing the cwd from sys.path.


Unnamed: 0,True,False
a,7897.0,7784.0
and,8677.0,7206.0
i,3834.0,4357.0
in,4971.0,4325.0
is,5643.0,5102.0
it,4445.0,4462.0
of,7758.0,6586.0
that,3425.0,
the,16801.0,15746.0
this,,3918.0


It seems like there's a lot of pretty irrelevant words in the top here.  It's hard to really say anything about this.  Can you think of a way to get to more informative terms (i.e. ones that might give you some insight as to what words are positive versus negative?)

Hint: Think about which tokens might be less informative.  Is there a way we learned to remove those?

In [11]:
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS

In [12]:
minus_stop = all_df.iloc[:, ~all_df.columns.isin(ENGLISH_STOP_WORDS)]

In [13]:
get_top(all_df.iloc[:, ~all_df.columns.isin(ENGLISH_STOP_WORDS)], n=90).reset_index().values

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  after removing the cwd from sys.path.


array([['acting', 224.0, 381.0],
       ['action', 152.0, 157.0],
       ['actors', 213.0, 272.0],
       ['actually', 160.0, 256.0],
       ['awful', nan, 147.0],
       ['bad', 191.0, 671.0],
       ['beautiful', 152.0, nan],
       ['best', 435.0, 192.0],
       ['better', 217.0, 302.0],
       ['big', 175.0, 166.0],
       ['bit', 163.0, nan],
       ['boring', nan, 151.0],
       ['br', 395.0, 421.0],
       ['ca', 166.0, 194.0],
       ['cast', 213.0, 179.0],
       ['character', 366.0, 372.0],
       ['characters', 366.0, 369.0],
       ['come', nan, 155.0],
       ['comedy', 158.0, 158.0],
       ['course', 150.0, nan],
       ['did', 413.0, 647.0],
       ['different', 142.0, nan],
       ['director', 239.0, 213.0],
       ['does', 490.0, 539.0],
       ['dvd', 182.0, nan],
       ['end', 243.0, 273.0],
       ['especially', 162.0, nan],
       ['excellent', 180.0, nan],
       ['fact', 160.0, 184.0],
       ['family', 187.0, nan],
       ['far', nan, 173.0],
       ['feel', 1

Check how often the top words from negative appear in the positive reviews and vice versa.  Do these seem like good candidates for determining whether a review is positive or negative? If not, maybe expand to the top 10, or more.  The idea here is to get a list of terms that are pretty distinct between the two sets.

One possible way to test is to use [log-likelihood ratio](https://wordhoard.northwestern.edu/userman/analysis-comparewords.html) as we discussed in class. In class we looked at texts with/without mentions of "hot dog".  What is our comparison text in this case?

In [14]:
def log_likelihood(analysis, reference, word):
    # count of word in source
    a = analysis[word].sum()
    # count of word in reference
    b = reference[word].sum()
    # count of all words in source
    c = analysis.sum().sum()
    # count of all words in reference
    d = reference.sum().sum()
    print('counts analysis:', a)
    print('counts reference:', b)
    e1 = c*(a+b)/(c+d)
    e2 = d*(a+b)/(c+d)
    g = 2*((a*log(a/e1)) + (b*log(b/e2)))
    print('G2: ', g)

In [15]:
# the above gives us some candidates
# function to do likelihood ratio test
words_to_try = ['good', 'character', 'story', 'acting', 'bad', 'great']
for w in words_to_try:
    print(w)
    log_likelihood(neg_df, pos_df, w)

good
counts analysis: 632
counts reference: 811
G2:  11.483395558996321
character
counts analysis: 364
counts reference: 374
G2:  0.34130117788180137
story
counts analysis: 485
counts reference: 625
G2:  9.25192431265053
acting
counts analysis: 369
counts reference: 236
G2:  39.54794928205044
bad
counts analysis: 668
counts reference: 194
G2:  309.98478292831265
great
counts analysis: 230
counts reference: 654
G2:  183.3398162287254


## Dictionary-based sentiment analysis 
Construct a list of the keywords you've found are good determinants if a review is positive or negative.  Use this list to "score" a review based on the number of times that word appears in the review.

(Optional) A quick and fancy way of doing this is to use CountVectorizer's vocabulary parameter.  Think how you might be able to do that.

In [16]:
pos_vocab = ['good', 'great', 'best', 'love', 'story']
neg_vocab = ['bad', 'worst', 'acting', 'poor']
sentiment_cv = CountVectorizer(tokenizer=simple_tokenizer, vocabulary=pos_vocab+neg_vocab)
sentiment_cv.fit(neg+pos)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function simple_tokenizer at 0x15e78d560>,
                vocabulary=['good', 'great', 'best', 'love', 'story', 'bad',
                            'worst', 'acting', 'poor'])

How did you do? How often do the negative reviews have a higher negative score than a positive score?

In [17]:
# average score
neg_sentiment_df = pd.DataFrame(sentiment_cv.transform(neg).toarray(),
                      columns=sentiment_cv.get_feature_names())
pos_sentiment_df = pd.DataFrame(sentiment_cv.transform(pos).toarray(),
                      columns=sentiment_cv.get_feature_names())
print('% of negative reviews with higher neg score:', 
      mean(neg_sentiment_df[neg_vocab].sum(axis=1)>neg_sentiment_df[pos_vocab].sum(axis=1)))
print('% of positive reviews with higher pos score:', 
      mean(pos_sentiment_df[pos_vocab].sum(axis=1)>pos_sentiment_df[neg_vocab].sum(axis=1)))

% of negative reviews with higher neg score: 0.28629359286293593
% of positive reviews with higher pos score: 0.7480252764612955


## Model-based sentiment analysis
Above we did some tinkering with our scoring and found it works to some extent, but it's likely not going to work the same on another dataset.  That is, it's not particularly generalizable.  However, modern sentiment analysis has moved away from dictionary-based scoring towards having sentiment be a "classification" problem.  

For this last section, take a look at the transformers [Pipelines](https://github.com/huggingface/transformers#quick-tour-of-pipelines) functionality.  You'll see that with a few lines of code you can bring in an advanced sentiment analysis model.  Run this against the positive/negative corpus and see how it works compared to your work above.

In [18]:
from transformers import pipeline
nlp = pipeline('sentiment-analysis')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…




In [None]:
# this might take a bit, these models aren't light-weight
neg_parsed = [nlp(d) for d in neg]
pos_parsed = [nlp(d) for d in pos]

In [None]:
# using % labelled negative vs positive
print('% neg reviews labelled negative:', 
      mean([doc[0]['label']=='NEGATIVE' for doc in neg_parsed]))
print('% pos reviews labelled positive:', 
      mean([doc[0]['label']=='POSITIVE' for doc in pos_parsed]))