# Assignment 1: Tokenization and Word counts for sentiment analysis
In this assignment, you will apply the techniques learned in week 1 of the course to analyze sentiment in a dataset of movie reviews from IMDB.

This dataset comes from [Maas et al. (2011)](https://www.aclweb.org/anthology/P11-1015.pdf) and the full version is available [here](http://ai.stanford.edu/~amaas/data/sentiment/).

In [24]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re

required = {'spacy', 'scikit-learn', 'pandas'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import spacy
import pandas as pd
import pickle
from sklearn.feature_extraction.text import CountVectorizer

## Read in data
I've saved a subset of the data in the data directory of the repository. It is available as a pickled dictionary with the keys `'neg'` and `'pos'`, each holding a list of review texts.


In [20]:
from glob import glob
import numpy as np
pct_sample = 0.1
all_text = {}
for p in ['neg', 'pos']:
    all_text[p] = []
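    for f in glob('/Users/batorsky/Downloads/aclImdb/test/%s/*.txt' % p):
        # keep roughly pct_sample of the files in each class
        if np.random.rand()<=pct_sample:
            # use a context manager so each file handle is closed
            with open(f) as review_file:
                all_text[p].append(review_file.read())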
with open('../data/assignment_1_reviews.pkl', 'wb') as f:
    pickle.dump(all_text, f)

In [42]:
# you will need to change this to wherever the file is stored
data_location = '../data/assignment_1_reviews.pkl'
with open(data_location, 'rb') as f:
    all_text = pickle.load(f)
# corpora size
print([(k, len(all_text[k])) for k in all_text])
# for simplicity, let's split these into separate sets
# (index by key rather than relying on the order of the dictionary's values)
neg, pos = all_text['neg'], all_text['pos']

[('neg', 1298), ('pos', 1172)]


## Tokenization
Use what you've developed in the week 1 notebook to tokenize each of the corpora.
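If you do not have the week 1 tokenizer to hand, here is a minimal stand-in sketch using only the standard library. The regular expression, the `tokenize` helper, and the sample review are assumptions made for illustration, not the course's tokenizer. The sample includes `<br />` markup, which this sketch strips before tokenizing.

```python
import re

# Hypothetical pattern: runs of letters, optionally with one internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def tokenize(text):
    """Lowercase the text, replace HTML line breaks with spaces, return word tokens."""
    text = re.sub(r"<br\s*/?>", " ", text.lower())
    return TOKEN_RE.findall(text)

# Made-up review text, used only to show the call pattern.
sample_reviews = ["A great film.<br /><br />Don't miss it!"]
tokenized = [tokenize(review) for review in sample_reviews]
print(tokenized[0])
```

To tokenize the real corpora, the same list comprehension can be run over `neg` and `pos`.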

## Word counts
Create a count of the words in each review using scikit-learn's `CountVectorizer`. Refer to its documentation, as it has a few parameters you might want to think about.

## Most frequent words
What are the top 10 most frequent words in the positive reviews? The negative reviews?

It seems like there are a lot of fairly irrelevant words at the top here, which makes it hard to say anything meaningful. Can you think of a way to get to more informative terms, i.e. ones that might give you some insight into which words are positive and which are negative?

Check how often the top words from the negative reviews appear in the positive reviews, and vice versa. Do these seem like good candidates for determining whether a review is positive or negative? If not, try expanding beyond the top 10. The idea here is to get a list of terms that are clearly distinct between the two sets.

Extra: Look into [this resource](https://wordhoard.northwestern.edu/userman/analysis-comparewords.html) for a way to test the count difference.
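One common way to test a count difference between two corpora is the log-likelihood ratio (G²). The sketch below assumes that is the statistic the linked resource describes. The function name and the toy counts are hypothetical. For each word it compares the observed counts with the counts expected if the word were equally frequent in both corpora.

```python
import math
from collections import Counter

def log_likelihood(count_a, count_b, total_a, total_b):
    """G^2 for one word, given its count in two corpora and each corpus's size."""
    pooled_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * pooled_rate
    expected_b = total_b * pooled_rate
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Made-up word counts, used only to show the call pattern.
neg_counts = Counter({"bad": 30, "movie": 100, "great": 5})
pos_counts = Counter({"bad": 5, "movie": 100, "great": 30})
total_neg = sum(neg_counts.values())
total_pos = sum(pos_counts.values())

scores = {
    word: log_likelihood(neg_counts[word], pos_counts[word], total_neg, total_pos)
    for word in set(neg_counts) | set(pos_counts)
}
for word, score in sorted(scores.items(), key=lambda pair: -pair[1]):
    print(word, round(score, 2))
```

A word with the same relative frequency in both corpora scores 0. Values above about 3.84 correspond to p < 0.05 for a chi-squared distribution with one degree of freedom.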

## Dictionary-based sentiment analysis
Construct a list of the keywords you've found to be good indicators of whether a review is positive or negative. Use this list to "score" each review based on the number of times those words appear in it.

(Optional) A quick and fancy way of doing this is to use CountVectorizer's vocabulary parameter.  Think how you might be able to do that.
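One possible reading of this hint, sketched under assumptions: passing a fixed word list as `vocabulary` restricts the counts to those words, so each row sum becomes a keyword score. The term lists, the toy reviews, and the `keyword_score` helper are hypothetical placeholders for whatever you found in the previous section.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical keyword lists; replace them with the terms you identified.
negative_terms = ["awful", "boring", "worst"]
positive_terms = ["excellent", "wonderful", "loved"]

# Made-up reviews, used only to show the call pattern.
toy_reviews = [
    "The worst, most boring film. Awful.",
    "Wonderful cast, excellent script. Loved it.",
]

def keyword_score(reviews, terms):
    """Count how many times any of the given terms occurs in each review."""
    vectorizer = CountVectorizer(vocabulary=terms)  # only these terms are counted
    counts = vectorizer.fit_transform(reviews)
    return np.asarray(counts.sum(axis=1)).ravel()

neg_scores = keyword_score(toy_reviews, negative_terms)
pos_scores = keyword_score(toy_reviews, positive_terms)
print(neg_scores, pos_scores)
```

Because `CountVectorizer` lowercases text by default, the keyword lists should be lowercase as well.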

How did you do? How often do the negative reviews have a higher negative score than a positive score? Is there another metric you could use to assess how well this scoring does?
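As one example of such a metric, the sketch below computes accuracy: the share of reviews whose higher score matches the known label. The scored triples and the `predict` helper are made up for illustration. Ties are reported separately here, which is a design choice rather than a requirement.

```python
# Hypothetical (true label, negative score, positive score) triples.
scored = [
    ("neg", 3, 0),
    ("neg", 1, 2),
    ("pos", 0, 4),
    ("pos", 1, 1),  # tie: neither score is higher
]

def predict(neg_score, pos_score):
    """Return the label with the higher score, or 'tie' when the scores are equal."""
    if neg_score > pos_score:
        return "neg"
    if pos_score > neg_score:
        return "pos"
    return "tie"

predictions = [predict(neg_score, pos_score) for _, neg_score, pos_score in scored]
correct = sum(
    prediction == label
    for (label, _, _), prediction in zip(scored, predictions)
)
accuracy = correct / len(scored)
ties = predictions.count("tie")
print(f"accuracy={accuracy:.2f}, ties={ties}")
```

Counting ties as errors gives a conservative estimate. Reporting them separately shows how often the keyword lists simply fail to fire.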

## Model-based sentiment analysis
Now use spaCy's en_core_web_md model to score the sentiment of the reviews.  You can access the document-level score with the `.polarity` attribute.  How does that compare to your score? How does that compare to the dataset's score?