# Assignment 1: Tokenization and Word counts for sentiment analysis
In this assignment, you will apply the techniques learned in week 1 of the course to perform sentiment analysis on a dataset of IMDB movie reviews.

This dataset comes from [Maas et al. (2011)](https://www.aclweb.org/anthology/P11-1015.pdf) and the full version is available [here](http://ai.stanford.edu/~amaas/data/sentiment/).

In [1]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re
from numpy import log, mean

required = {'spacy', 'scikit-learn', 'pandas', 'transformers==2.4.1'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import spacy
import pandas as pd
import pickle
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
from spacy.lang.en import English
en = English()

## Read in data

I've already processed the full dataset for you and saved it as a data file: `assignment_1_reviews.pkl`.  You don't need to generate it.

In [4]:
# you will need to change this to wherever the file is stored
# on colab, you can likely just put this as 'assignment_1_reviews.pkl'
data_location = './data/assignment_1_reviews.pkl'
with open(data_location, 'rb') as f:
    all_text = pickle.load(f)
# corpora size
print([(k, len(all_text[k])) for k in all_text])
# for simplicity, let's split these into separate sets
neg, pos = all_text.values()

[('neg', 1233), ('pos', 1266)]


## Tokenization
Use what you've developed in the week 1 notebook to tokenize each of the corpora.

In [5]:
# Simple tokenizer based on the example from the week_1_intro notebook
def simple_tokenizer(doc, model=en):
    parsed = model(doc)
    # Return a list of lowercased tokens that are alphabetic and not URL-like
    return [t.lower_ for t in parsed if t.is_alpha and not t.like_url]

In [6]:
# Tokenize the negative and positive reviews
neg_tokenized = [simple_tokenizer(doc) for doc in neg]
pos_tokenized = [simple_tokenizer(doc) for doc in pos]

## Word counts
Create a count of the number of words in each review.  Use scikit-learn's CountVectorizer.  Refer to the documentation as it has a few parameters you might want to think about.

In [7]:
# Create a simple count vectorizer with a simple tokenizer
simple_vectorizer = CountVectorizer(tokenizer=simple_tokenizer)

In [8]:
# Create fit-transformed vectors of the negative reviews
neg_vectors = simple_vectorizer.fit_transform(neg).toarray()
neg_dict = dict(zip(simple_vectorizer.get_feature_names(), neg_vectors.sum(axis=0)))
print(neg_dict)



In [9]:
# Create fit-transformed vectors of the positive reviews
pos_vectors = simple_vectorizer.fit_transform(pos).toarray()
pos_dict = dict(zip(simple_vectorizer.get_feature_names(), pos_vectors.sum(axis=0)))
print(pos_dict)



## Most frequent words
What are the top 10 most frequent words in the positive reviews? The negative reviews?

In [10]:
def get_corpus_dict(corpus, cv=simple_vectorizer):
    '''Creates a dictionary of the words and their counts in the corpus using a count vectorizer'''
    v = cv.fit_transform(corpus).toarray()
    corpus_dict = dict(zip(cv.get_feature_names(), v.sum(axis=0)))
    return corpus_dict

In [11]:
def get_most_frequent_words(corpus, cv=simple_vectorizer, num_words=10):
    '''Gets the most frequent words in a corpus, using a count vectorizer on the generated corpus dict'''
    corpus_dict = get_corpus_dict(corpus, cv)
    return sorted(corpus_dict, key=corpus_dict.get, reverse=True)[:num_words]
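As a cross-check that does not refit a vectorizer, the same kind of ranking can be sketched with `collections.Counter` over already-tokenized reviews (such as `neg_tokenized` above). The toy token lists and the helper name `top_words_from_tokens` are illustrative assumptions, not part of the assignment.

```python
from collections import Counter

def top_words_from_tokens(tokenized_docs, num_words=10):
    """Return the num_words most frequent tokens across a list of token lists."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    # most_common returns (word, count) pairs sorted by descending count
    return [word for word, _ in counts.most_common(num_words)]

# Hypothetical toy data standing in for tokenized reviews
toy_docs = [
    ["the", "movie", "was", "bad"],
    ["the", "plot", "was", "bad", "bad"],
    ["the", "acting", "was", "fine"],
]
top_three = top_words_from_tokens(toy_docs, num_words=3)
print(top_three)
```

Counting token lists directly avoids building a dense document-term matrix when only corpus-level totals are needed.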

In [12]:
# Top 10 words in negative reviews
neg_words_top = get_most_frequent_words(neg)
print(neg_words_top)

['the', 'a', 'and', 'to', 'of', 'is', 'it', 'i', 'in', 'this']


In [13]:
# Top 10 words in positive reviews
pos_words_top = get_most_frequent_words(pos)
print(pos_words_top)

['the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i', 'that']


It seems like there are a lot of fairly uninformative words at the top here, which makes it hard to say anything about sentiment. Can you think of a way to get to more informative terms (i.e., ones that might give you some insight into which words are positive versus negative)?

Hint: Think about which tokens might be less informative.  Is there a way we learned to remove those?

In [14]:
# Tried using a trained model to improve lemmatization, among other things
# The performance trade-off was not worth the modest gains
# nlp = spacy.load("en_core_web_sm")

In [15]:
def advanced_tokenizer(doc, model=en):
    '''Advanced tokenizer based on the example from the week_1_intro notebook.
    Filters out non-alphabetic, URL-like, and stop-word tokens, then lemmatizes each remaining token.'''
    parsed = model(doc)
    # Return a list of lemmas for tokens that are alphabetic, not URL-like, and not stop words
    return [t.lemma_ for t in parsed if t.is_alpha and not t.like_url and not t.is_stop]

In [16]:
# Reinitialize count vectorizer with advanced parameters
# Added stop words to tokenizer rather than vectorizer so that stop words list is consistent with model used
advanced_vectorizer = CountVectorizer(tokenizer=advanced_tokenizer, min_df=0.1, max_df=0.9)
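When `min_df` and `max_df` are given as floats, `CountVectorizer` treats them as proportions of documents, so the settings above keep terms that occur in at least 10% and at most 90% of the reviews. A dependency-free sketch of that filtering idea follows. The toy documents and the helper `filter_by_doc_frequency` are hypothetical, and this only approximates the behaviour rather than reproducing scikit-learn's implementation.

```python
from collections import Counter

def filter_by_doc_frequency(tokenized_docs, min_df=0.1, max_df=0.9):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() so that each term is counted at most once per document
        doc_freq.update(set(tokens))
    return {term for term, df in doc_freq.items()
            if min_df <= df / n_docs <= max_df}

# Hypothetical toy documents
toy_docs = [
    ["film", "great", "cast"],
    ["film", "bad", "script"],
    ["film", "great", "music"],
    ["film", "worst", "script"],
]
kept = filter_by_doc_frequency(toy_docs, min_df=0.5, max_df=0.9)
print(sorted(kept))
```

In this toy case "film" is dropped because it appears in every document, and the words that appear only once are dropped for being too rare.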

In [17]:
# Increased the number of top words to 50 to get a better sense of the options
neg_words_top_advanced = get_most_frequent_words(neg, cv=advanced_vectorizer, num_words=50)
print(neg_words_top_advanced)

['movie', 'film', 'like', 'bad', 'good', 'time', 'story', 'people', 'br', 'movies', 'acting', 'watch', 'plot', 'characters', 'character', 'way', 'know', 'think', 'films', 'seen', 'better', 'scenes', 'watching', 'scene', 'thing', 'actors', 'end', 'little', 'actually', 'man', 'life', 'great', 'going', 'funny', 'worst', 'love', 'director', 'look', 'want', 'real', 'thought', 'script', 'best', 'work', 'find', 'minutes', 'long', 'pretty', 'things', 'guy']


In [18]:
pos_words_top_advanced = get_most_frequent_words(pos, cv=advanced_vectorizer, num_words=50)
print(pos_words_top_advanced)

['film', 'movie', 'like', 'good', 'great', 'story', 'time', 'best', 'love', 'br', 'people', 'way', 'think', 'life', 'films', 'character', 'characters', 'watch', 'movies', 'seen', 'know', 'little', 'man', 'scenes', 'scene', 'real', 'years', 'end', 'makes', 'plot', 'music', 'director', 'acting', 'young', 'world', 'lot', 'better', 'actors', 'cast', 'find', 'new', 'work', 'funny', 'look', 'old', 'bad', 'thought', 'family', 'right', 'played']


Check how often the top words from negative appear in the positive reviews and vice versa.  Do these seem like good candidates for determining whether a review is positive or negative? If not, maybe expand to the top 10, or more.  The idea here is to get a list of terms that are pretty distinct between the two sets.

One possible way to test is to use [log-likelihood ratio](https://wordhoard.northwestern.edu/userman/analysis-comparewords.html) as we discussed in class. In class we looked at texts with/without mentions of "hot dog".  What is our comparison text in this case?
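For reference, the statistic described on the linked page can be written as below, where $a$ and $b$ are the counts of the word in the analysis and reference corpora, and $c$ and $d$ are the total word counts of those two corpora. This is a restatement under the usual convention that a term with a zero count contributes zero to the sum.

```latex
E_1 = \frac{c\,(a+b)}{c+d}, \qquad
E_2 = \frac{d\,(a+b)}{c+d}, \qquad
G^2 = 2\left( a \ln\frac{a}{E_1} + b \ln\frac{b}{E_2} \right)
```

Here $E_1$ and $E_2$ are the counts one would expect in each corpus if the word were equally frequent, relative to corpus size, in both.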

In [19]:
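from numpy import log, mean
# Code from course notebook, adjusted to work with word-count dicts rather than dataframes and to return g rather than print it
def log_likelihood(analysis, reference, word):
    # count of word in the analysis corpus
    a = analysis.get(word, 0)
    # count of word in the reference corpus (tiny default avoids log(0) when the word is absent)
    b = reference.get(word, 0.00000000000000001)
    # number of distinct words in the analysis corpus
    # NOTE: the standard statistic uses the total token count, sum(analysis.values())
    c = len(analysis)
    # number of distinct words in the reference corpus (same caveat as above)
    d = len(reference)
    # expected counts of the word if it were equally frequent in both corpora
    e1 = c*(a+b)/(c+d)
    e2 = d*(a+b)/(c+d)
    g = 2*((a*log(a/e1)) + (b*log(b/e2)))
    return g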
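The function above sets `c` and `d` with `len()`, which for a word-count dict is the number of distinct words rather than the total number of tokens. A variant that uses token totals and treats a zero count as contributing zero might look like the sketch below. The name `log_likelihood_total` and the toy counts are hypothetical, and the rankings printed later in this notebook were not produced with it.

```python
from math import log

def log_likelihood_total(analysis, reference, word):
    """G2 for `word`, using total token counts as the corpus sizes."""
    a = analysis.get(word, 0)
    b = reference.get(word, 0)
    c = sum(analysis.values())   # total tokens in the analysis corpus
    d = sum(reference.values())  # total tokens in the reference corpus
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    # A zero count contributes nothing (x*log(x) tends to 0 as x -> 0)
    term_a = a * log(a / e1) if a > 0 else 0.0
    term_b = b * log(b / e2) if b > 0 else 0.0
    return 2 * (term_a + term_b)

# Hypothetical toy counts
analysis_counts = {"great": 30, "bad": 5, "film": 65}
reference_counts = {"great": 10, "bad": 25, "film": 65}
for w in ("great", "bad", "film", "unseen"):
    print(w, round(log_likelihood_total(analysis_counts, reference_counts, w), 3))
```

Guarding the zero case explicitly avoids the `nan` that `0 * log(0)` produces with NumPy, without needing a tiny placeholder count.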

In [20]:
# Get corpus count dicts for positive and negative reviews
pos_corpus_dict = get_corpus_dict(pos, advanced_vectorizer)
neg_corpus_dict = get_corpus_dict(neg, advanced_vectorizer)

In [21]:
def get_log_likelihood_list(analysis_dict, reference_dict, words):
    '''Get sorted list of log likelihood values for each word in the dictionary of word counts in the analysis corpus.'''
    log_dict = {}
    for word in words:
        g = log_likelihood(analysis_dict, reference_dict, word)
        log_dict[word] = g
    return sorted(log_dict, key=log_dict.get, reverse=True)

In [22]:
pos_log_list = get_log_likelihood_list(pos_corpus_dict, neg_corpus_dict, pos_words_top_advanced)
print(pos_log_list)

['music', 'young', 'world', 'family', 'bad', 'played', 'great', 'best', 'love', 'life', 'film', 'movie', 'acting', 'good', 'plot', 'makes', 'story', 'years', 'like', 'better', 'new', 'real', 'films', 'man', 'cast', 'lot', 'think', 'actors', 'know', 'people', 'little', 'right', 'find', 'time', 'director', 'way', 'old', 'movies', 'work', 'watch', 'funny', 'seen', 'character', 'scene', 'end', 'characters', 'look', 'thought', 'br', 'scenes']


In [23]:
neg_log_list = get_log_likelihood_list(neg_corpus_dict, pos_corpus_dict, neg_words_top_advanced)
print(neg_log_list)

['worst', 'want', 'bad', 'script', 'minutes', 'pretty', 'guy', 'great', 'best', 'love', 'life', 'film', 'movie', 'thing', 'acting', 'good', 'plot', 'story', 'actually', 'watching', 'like', 'better', 'real', 'films', 'man', 'going', 'think', 'actors', 'know', 'people', 'little', 'find', 'time', 'director', 'way', 'movies', 'work', 'watch', 'funny', 'seen', 'character', 'scene', 'end', 'things', 'characters', 'look', 'long', 'thought', 'br', 'scenes']
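Because the statistic is never negative, it ranks words by how differently they are used in the two corpora, not by which corpus favours them. That is likely why 'bad' ranks near the top of the positive list above. One possible refinement, sketched here with hypothetical toy counts and hypothetical helper names, is to keep only the words whose relative frequency is higher in the analysis corpus before sorting.

```python
from math import log

def g2(a, b, c, d):
    """Unsigned log-likelihood for counts a, b in corpora of sizes c, d."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    term_a = a * log(a / e1) if a > 0 else 0.0
    term_b = b * log(b / e2) if b > 0 else 0.0
    return 2 * (term_a + term_b)

def distinctive_words(analysis, reference, words):
    """Words over-represented in `analysis`, sorted by descending G2."""
    c, d = sum(analysis.values()), sum(reference.values())
    scored = []
    for word in words:
        a, b = analysis.get(word, 0), reference.get(word, 0)
        # Keep the word only if it is relatively more frequent in the analysis corpus
        if a / c > b / d:
            scored.append((g2(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)]

# Hypothetical toy counts
pos_counts = {"great": 30, "bad": 5, "film": 65}
neg_counts = {"great": 10, "bad": 25, "film": 65}
candidates = ["great", "bad", "film"]
pos_distinctive = distinctive_words(pos_counts, neg_counts, candidates)
neg_distinctive = distinctive_words(neg_counts, pos_counts, candidates)
print(pos_distinctive, neg_distinctive)
```

With this filter a word can appear in at most one of the two lists, which makes the lists easier to use as keyword candidates.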


## Dictionary-based sentiment analysis 
Construct a list of the keywords you've found to be good indicators of whether a review is positive or negative.  Use this list to "score" a review based on the number of times each keyword appears in the review.

(Optional) A quick and fancy way of doing this is to use CountVectorizer's vocabulary parameter.  Think how you might be able to do that.

In [24]:
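def get_keyword_scores(reviews, source_keywords, foreign_keywords):
    '''Counts the tokenized reviews in which source keywords outnumber foreign keywords.
    Ties are counted toward the foreign score.'''
    results = {'source_score': 0, 'foreign_score': 0}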
    for review in reviews:
        source_score_dict = {}
        foreign_score_dict = {}
        for source_word in source_keywords:
            source_score_dict[source_word] = review.count(source_word)
        for foreign_word in foreign_keywords:
            foreign_score_dict[foreign_word] = review.count(foreign_word)
        source_score = sum(source_score_dict.values())
        foreign_score = sum(foreign_score_dict.values())
        if source_score > foreign_score:
            results['source_score'] += 1
        else:
            results['foreign_score'] += 1
    return results
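The cells that follow refer to `pos_keywords` and `neg_keywords`, which are not defined anywhere above. One simple way to build them, shown here as an assumption rather than the intended solution, is to keep the words that appear in one corpus's top list but not in the other's. The short lists below are hand-picked excerpts of the top-50 outputs printed earlier, used only for illustration, and `unique_to` is a hypothetical helper.

```python
def unique_to(first, second):
    """Words in `first` that do not appear in `second`, preserving order."""
    excluded = set(second)
    return [word for word in first if word not in excluded]

# Excerpts from the top-word lists printed earlier (not the full lists)
neg_top_excerpt = ["movie", "film", "bad", "good", "worst", "script", "minutes"]
pos_top_excerpt = ["film", "movie", "good", "bad", "music", "family", "young"]

neg_keywords = unique_to(neg_top_excerpt, pos_top_excerpt)
pos_keywords = unique_to(pos_top_excerpt, neg_top_excerpt)
print(neg_keywords, pos_keywords)
```

In the notebook itself, the same helper could be applied to `neg_words_top_advanced` and `pos_words_top_advanced`, or to the log-likelihood rankings.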

In [25]:
pos_scores_all = get_keyword_scores(pos_tokenized, pos_keywords, neg_keywords)
pos_score = pos_scores_all['source_score']
neg_score = pos_scores_all['foreign_score']
print(f"In {pos_score} positive reviews, there were more positive keywords and in {neg_score} there were more negative keywords.")

NameError: name 'pos_keywords' is not defined

In [None]:
neg_scores_all = get_keyword_scores(neg_tokenized, neg_keywords, pos_keywords)
neg_score = neg_scores_all['source_score']
pos_score = neg_scores_all['foreign_score']
print(f"In {neg_score} negative reviews, there were more negative keywords and in {pos_score} there were more positive keywords.")

How did you do? How often do the negative reviews have a higher negative score than a positive score?

In [None]:
# Conclusion: it's a pretty close matchup! But overall, I think this is a poor approach.
# Just looking at words really cannot give the analyst an overall feel for the sentiment.

## Model-based sentiment analysis
Above we did some tinkering with our scoring and found it works to some extent, but it's likely not going to work the same on another dataset.  That is, it's not particularly generalizable.  However, modern sentiment analysis has moved away from dictionary-based scoring toward treating sentiment as a "classification" problem.

For this last section, take a look at the transformers [Pipelines](https://github.com/huggingface/transformers#quick-tour-of-pipelines) functionality.  You'll see that with a few lines of code you can bring in an advanced sentiment analysis model.  Run this against the positive/negative corpus and see how it works compared to your work above.

In [26]:
from transformers import pipeline
nlp = pipeline('sentiment-analysis')


In [27]:
def get_keyword_scores_advanced(reviews):
    '''Get the results of sentiment analysis on the reviews using the transformers pipeline.'''
    results = {'POSITIVE': 0, 'NEGATIVE': 0}
    for review in reviews:
        label = nlp(review)[0]['label']
        results[label] += 1
    return results

In [28]:
pos_scores = get_keyword_scores_advanced(pos)
# This takes a while to run, so here are the results: {'POSITIVE': 1088, 'NEGATIVE': 178}
# This is very different than the dictionary analysis above!
print(pos_scores)

{'POSITIVE': 1088, 'NEGATIVE': 178}


In [29]:
neg_scores = get_keyword_scores_advanced(neg)
# This takes a while to run, so here are the results: {'POSITIVE': 101, 'NEGATIVE': 1132}
# This is very different than the dictionary analysis above!
print(neg_scores)

{'POSITIVE': 101, 'NEGATIVE': 1132}
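To put the two result dictionaries on a single scale, overall accuracy and per-class recall can be computed directly from the counts. This is a small sketch that hard-codes the numbers printed above; the variable names other than `pos_scores` and `neg_scores` are illustrative.

```python
# Counts reported by the pipeline runs above
pos_scores = {'POSITIVE': 1088, 'NEGATIVE': 178}
neg_scores = {'POSITIVE': 101, 'NEGATIVE': 1132}

# A prediction is correct when the label matches the corpus the review came from
correct = pos_scores['POSITIVE'] + neg_scores['NEGATIVE']
total = sum(pos_scores.values()) + sum(neg_scores.values())

accuracy = correct / total
pos_recall = pos_scores['POSITIVE'] / sum(pos_scores.values())
neg_recall = neg_scores['NEGATIVE'] / sum(neg_scores.values())
print(f"accuracy={accuracy:.3f}, positive recall={pos_recall:.3f}, negative recall={neg_recall:.3f}")
```

The same calculation applied to the output of `get_keyword_scores` would give a like-for-like comparison with the dictionary-based approach.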
