# Sentiment analysis codealong using spaCy and movie reviews

Sentiment analysis is one of the more popular topics in NLP. It is concerned with assigning some kind of valence to written text, such as positivity, negativity, or subjectivity. In this lesson we will look at three measures: positivity, negativity, and objectivity.

First we will load a dataset of pre-coded positivity and negativity scores for words. Each word is also labeled with its part of speech.

Then we will load snippets of Rotten Tomatoes reviews and explore the sentiment of the writing.

---

### Load packages and sentiment data

In [1]:
import pandas as pd
import numpy as np

In [2]:
sen = pd.read_csv('/Users/austinwhaley/Desktop/DSI-SF-4-austinmwhaley/datasets/sentiment_words/sentiment_words_simple.csv')

In [3]:
sen.head()

Unnamed: 0,pos,word,pos_score,neg_score
0,adj,.22-caliber,0.0,0.0
1,adj,.22-calibre,0.0,0.0
2,adj,.22_caliber,0.0,0.0
3,adj,.22_calibre,0.0,0.0
4,adj,.38-caliber,0.0,0.0


---

### Create a sentiment dataset that does not take into account part of speech tags

This is what we will use first, before we know each word's part of speech. Later, when we use spaCy, we will be able to determine the part of speech of each word and look up the matching scores.
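To see what averaging across parts of speech does, here is a small sketch on made-up data. The words and their scores are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical scores: 'cool' appears once as an adjective and once as a noun.
toy = pd.DataFrame({
    'pos': ['adj', 'noun', 'noun'],
    'word': ['cool', 'cool', 'film'],
    'pos_score': [0.5, 0.25, 0.0],
    'neg_score': [0.0, 0.25, 0.0],
})

# Average the scores per word, ignoring the part-of-speech column.
toy_agg = toy[['word', 'pos_score', 'neg_score']].groupby('word').mean().reset_index()
print(toy_agg)
```

The trade-off is that a word whose sentiment differs by part of speech ends up with a single blended score.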

In [4]:
sen_agg = sen[['word', 'pos_score', 'neg_score']].groupby('word').mean().reset_index()
sen_agg.head(20)

Unnamed: 0,word,pos_score,neg_score
0,'hood,0.0,0.375
1,'s_gravenhage,0.0,0.0
2,'tween,0.0,0.0
3,'tween_decks,0.0,0.0
4,.22,0.125,0.0
5,.22-caliber,0.0,0.0
6,.22-calibre,0.0,0.0
7,.22_caliber,0.0,0.0
8,.22_calibre,0.0,0.0
9,.38-caliber,0.0,0.0


---

### Create a dictionary version of the sentiment data for both the part-of-speech and aggregate versions

The dictionary format of the data will be much easier to index into in our functions later. Without it, each lookup would have to search the DataFrame, and it would be much harder to make those functions run quickly.
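As a rough illustration of why the nested dictionary helps, the sketch below builds one from two rows and looks a word up by part of speech. The `rows` list, `toy_dict`, and the `lookup` helper are hypothetical names introduced for this example (the scores match values shown elsewhere in this notebook); the real dictionary is built from `sen` in the next cell.

```python
# Rows in the same (pos, word, pos_score, neg_score) layout as `sen`.
rows = [
    ('adj', '.22-caliber', 0.0, 0.0),
    ('noun', 'russia', 0.0, 0.0),
]

# One inner dictionary per part of speech, keyed by word.
toy_dict = {'ADJ': {}, 'NOUN': {}, 'VERB': {}, 'ADV': {}}
for pos, word, pos_score, neg_score in rows:
    toy_dict[pos.upper()][word] = {'pos_score': pos_score, 'neg_score': neg_score}


def lookup(sentiment, pos, word):
    """Return the score dict for (pos, word), or None if the pair is missing."""
    return sentiment.get(pos, {}).get(word)


print(lookup(toy_dict, 'NOUN', 'russia'))
print(lookup(toy_dict, 'VERB', 'russia'))
```

Each lookup is a pair of hash-table accesses, so its cost does not grow with the number of rows in the table, whereas filtering a DataFrame for a word scans the column every time.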

In [5]:
sen_dict = {
    'ADJ':{},
    'NOUN':{},
    'VERB':{},
    'ADV':{}
}

for i, row in enumerate(sen.itertuples()):  # itertuples yields one tuple per row; row[0] is the index
    if (i % 10000) == 0:
        print(i)  # progress indicator
    # row[1] is the part of speech, row[2] the word, row[3] and row[4] the scores
    sen_dict[row[1].upper()][row[2]] = {'pos_score': row[3], 'neg_score': row[4]}

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000


In [33]:
sen_dict['NOUN']['russia']

{'neg_score': 0.0, 'pos_score': 0.0}

In [12]:
agg_dict={}

for i, row in enumerate(sen_agg.itertuples()):  # itertuples yields one tuple per row; row[0] is the index
    if (i % 10000) == 0:
        print(i)  # progress indicator
    # row[1] is the word, row[2] and row[3] the averaged scores
    agg_dict[row[1]] = {'pos_score': row[2], 'neg_score': row[3]}

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000


---

### Load the rotten tomatoes dataset

This dataset has:
    
    critic: critic's name
    fresh: fresh vs. rotten rating
    imdb: code for imdb
    publication: where the review was published
    quote: the review snippet
    review_date: date of review
    rtid: rottentomatoes id
    title: name of movie

In [13]:
rt = pd.read_csv('/Users/austinwhaley/Desktop/DSI-SF-4-austinmwhaley/datasets/rottentomatoes_critics/rt_critics.csv')

In [14]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,fresh,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story
2,David Ansen,fresh,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story
3,Leonard Klady,fresh,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story
4,Jonathan Rosenbaum,fresh,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story


---

### Restrict data to reviews with valid ratings and reviews over 10 words long

Then clean up the reviews by adding a column with the text lowercased and punctuation removed (spaces and hyphens are kept).

In [15]:
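rt.fresh.unique()  # inspect the rating values; only 'fresh' and 'rotten' are kept below
print(rt.shape)
rt = rt[rt.fresh.isin(['fresh', 'rotten'])].copy()  # copy so later column assignments act on this frame
print(rt.shape)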

(14072, 8)
(14049, 8)


In [16]:
rt['quote_len'] = rt.quote.map(lambda x: len(x.split()))

In [18]:
rt.sort_values('quote_len').head(20)

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len
12151,Peter Brunette,rotten,160916.0,Film.com,Sloppy.,2000-01-01,13131.0,The Story of Us,1
9676,Michael O'Sullivan,rotten,120484.0,Washington Post,Formulaic!,2000-01-01,10169.0,The Waterboy,1
9926,Lisa Schwarzbaum,fresh,120595.0,Entertainment Weekly,Brilliant!,2000-01-01,9994.0,Babe: Pig in the City,1
99,Rita Kempley,fresh,112346.0,Washington Post,Frothy.,2000-01-01,10129.0,The American President,1
1314,Derek Adams,fresh,110057.0,Time Out,Unforgettable.,2006-06-24,12741.0,Hoop Dreams,1
11429,Todd McCarthy,fresh,167404.0,Variety,Interesting.,2000-01-01,10054.0,The Sixth Sense,1
11788,Robert Horton,rotten,120716.0,Film.com,Sluggish!,2000-01-01,11619.0,Jakob the Liar,1
1942,Geoff Andrew,rotten,110989.0,Time Out,Unendurable.,2006-06-24,10932.0,Ri¢hie Ri¢h,1
10104,Lisa Schwarzbaum,fresh,128853.0,Entertainment Weekly,Seductive!,2000-01-01,10050.0,You've Got Mail,1
10710,Lisa Schwarzbaum,fresh,126886.0,Entertainment Weekly,Cool!,2000-01-01,16574.0,Election,1


In [19]:
print(rt.shape)
rt = rt[rt.quote_len > 10].copy()  # keep only quotes longer than 10 words
print(rt.shape)

(14049, 9)
(11215, 9)


In [20]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len
0,Derek Adams,fresh,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story,24
2,David Ansen,fresh,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story,13
3,Leonard Klady,fresh,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story,17
4,Jonathan Rosenbaum,fresh,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story,14
5,Michael Booth,fresh,114709.0,Denver Post,"As Lion King did before it, Toy Story revived ...",2007-05-03,9559.0,Toy story,40


In [21]:
import string

In [23]:
keep = string.ascii_lowercase + ' -'

In [24]:
rt['quote_parsed'] = rt.quote.map(lambda x: ''.join([ch for ch in x.lower() if ch in keep]))
rt[['quote', 'quote_parsed']].head(12)

Unnamed: 0,quote,quote_parsed
0,"So ingenious in concept, design and execution ...",so ingenious in concept design and execution t...
2,A winning animated feature that has something ...,a winning animated feature that has something ...
3,The film sports a provocative and appealing st...,the film sports a provocative and appealing st...
4,"An entertaining computer-generated, hyperreali...",an entertaining computer-generated hyperrealis...
5,"As Lion King did before it, Toy Story revived ...",as lion king did before it toy story revived t...
6,The film will probably be more fully appreciat...,the film will probably be more fully appreciat...
7,Children will enjoy a new take on the irresist...,children will enjoy a new take on the irresist...
8,Although its computer-generated imagery is imp...,although its computer-generated imagery is imp...
9,How perfect that two of the most popular funny...,how perfect that two of the most popular funny...
11,"Disney's witty, wondrously imaginative, all-co...",disneys witty wondrously imaginative all-compu...
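The same cleaning step can be written as a small named function, which makes it easier to try on individual strings. This is only a sketch; `clean_quote` and `KEEP` are names introduced here and are not part of the original notebook.

```python
import string

# Characters to keep: lowercase letters, spaces, and hyphens.
KEEP = set(string.ascii_lowercase + ' -')


def clean_quote(text):
    """Lowercase the text and drop every character that is not in KEEP."""
    return ''.join(ch for ch in text.lower() if ch in KEEP)


print(clean_quote("Disney's witty, wondrously imaginative film"))
```

Because apostrophes are dropped, contractions such as "doesn't" become "doesnt" and may not match any entry in the sentiment dictionary.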


---

### Write a function to assign positive, negative, and objective ratings based on the words in a review

We'll use the dictionary we constructed above (without the part of speech tags).

Objectivity is calculated as:

    1. - (positive_score + negative_score)
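For example, using the aggregate scores shown earlier for `'hood` (positive 0.0, negative 0.375), the objectivity comes out at 0.625. The helper name `objectivity` below is introduced only for this illustration.

```python
def objectivity(pos_score, neg_score):
    """Objectivity is whatever remains after the positive and negative scores."""
    return 1. - (pos_score + neg_score)


# Scores taken from the aggregate table above.
print(objectivity(0.0, 0.375))   # 'hood
print(objectivity(0.125, 0.0))   # .22
```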

In [25]:
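def scorer(pq):
    """Return per-word positive, negative, and objective scores for a parsed quote."""
    words = pq.split()
    pos, neg, obj = [], [], []
    for word in words:
        try:
            pos.append(agg_dict[word]['pos_score'])
            neg.append(agg_dict[word]['neg_score'])
            obj_rating = 1. - (pos[-1] + neg[-1])
            obj.append(obj_rating)
        except KeyError:
            # Words missing from the dictionary are treated as fully objective.
            pos.append(0.)
            neg.append(0.)
            obj.append(1.)
    return pos, neg, obj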

In [26]:
# Build a list (not a lazy map object) so the results can be indexed and reused below.
agg_scores = [scorer(q) for q in rt.quote_parsed.values]

In [27]:
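# Caution: [100] on a Series looks up the row labelled 100. After filtering, that is not
# the row at position 100, so this quote does not correspond to agg_scores[100] below
# (rt['quote_parsed'].values[100] would give the positional match).
rt['quote_parsed'][100]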

'capraesque has inevitably been affixed to the american president but that phrase with its implications of facile hokum doesnt do the film justice'

In [28]:
agg_scores[100]

([0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.025000000000000001,
  0.0625,
  0.017857142857150003,
  0.0,
  0.16369047619050001,
  0.0,
  0.0,
  0.0,
  0.017857142857099998,
  0.0,
  0.25,
  0.0,
  0.0,
  0.1875,
  0.0,
  0.0,
  0.0,
  0.0,
  0.70833333333299997,
  0.035714285714300006,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.046180555555550007,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.041666666666666664,
  0.0,
  0.0,
  0.017857142857099998,
  0.0,
  0.0,
  0.0,
  0.0,
  0.017857142857099998,
  0.15625,
  0.0],
 [0.0,
  0.0,
  0.20833333333299997,
  0.0,
  0.0,
  0.0,
  0.25,
  0.0029761904761900003,
  0.0,
  0.050595238095200001,
  0.0,
  0.0,
  0.0,
  0.035714285714300006,
  0.0,
  0.5,
  0.25,
  0.0,
  0.28125,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.017857142857150003,
  0.0,
  0.03125,
  0.0625,
  0.0,
  0.0,
  0.098958333333325002,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.19444444444433331,
  0.875,
  0.0,
  0.035714285714300006,
  0.0,
  0.0,
  0.0625,
  0.0,
  0.035714285714300006,
  0.01

---

### Calculate the sum and average ratings for positive, negative, and objective for each review

In [30]:
rt['pos_avg'] = [np.mean(x[0]) for x in agg_scores]
rt['neg_avg'] = [np.mean(x[1]) for x in agg_scores]
rt['obj_avg'] = [np.mean(x[2]) for x in agg_scores]

rt['pos_sum'] = [np.sum(x[0]) for x in agg_scores]
rt['neg_sum'] = [np.sum(x[1]) for x in agg_scores]
rt['obj_sum'] = [np.sum(x[2]) for x in agg_scores]

In [32]:
rt.sort_values('pos_avg', ascending=False, inplace=True)
for i in range(10):
    print(rt.quote_parsed.values[i])
    print('-------------------------\n')

appropriately operatic chens visually spectacular epic is sumptuous in every respect intelligent enthralling rhapsodic
-------------------------

hilarious sexy clever playful and as initially teasing as it is ultimately satisfying
-------------------------

a very good film with some dazzling moments and one truly outstanding performance
-------------------------

remains a beautiful deftly directed and superbly acted version of a witty and poignant drama
-------------------------

all the excellent creative components do not add up to a whole
-------------------------

from russia with love is a preposterous skillful slab of hardhitting sexy hokum
-------------------------

the karate kid exhibits warmth and friendly predictable humor its greatest assets
-------------------------

sophisticated well not really but fast smart shrewdly directed and capably performed
-------------------------

part homage part spoof the deft balancing act is a clever engaging adaption
------------------

---

### Evaluate predictive ability using the sentiment scores

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [35]:
X = rt[['pos_avg', 'neg_avg', 'obj_avg', 'quote_len']]
y = rt.fresh.map(lambda x: 1 if x == 'fresh' else 0).values

In [36]:
y.mean()

0.61506910387873381

In [38]:
from sklearn.preprocessing import StandardScaler

Xn = StandardScaler().fit_transform(X)

scores = cross_val_score(LogisticRegression(), Xn, y, cv=10)
print(scores)
print(np.mean(scores))

[ 0.61853832  0.62566845  0.63458111  0.63101604  0.63368984  0.63368984
  0.63101604  0.65120428  0.60714286  0.51875   ]
0.618529678253
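A useful reference point for these accuracies is the majority-class baseline, that is, the accuracy from always predicting the more common label. The sketch below computes it for a list of 0/1 labels; `majority_baseline` is a hypothetical helper name and the toy labels are made up.

```python
def majority_baseline(labels):
    """Accuracy obtained by always predicting the more common of two 0/1 labels."""
    share_positive = float(sum(labels)) / len(labels)
    return max(share_positive, 1. - share_positive)


# Toy labels: three fresh (1) and two rotten (0).
print(majority_baseline([1, 1, 1, 0, 0]))
```

Here `y.mean()` is about 0.615, so always predicting "fresh" would already score about 0.615, and the cross-validated mean of about 0.619 is only a small improvement on that.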


---

### Import spacy

The spaCy package is a widely used library for parsing text. We are going to use it to find the part of speech tags for the review words.

Once we have parsed the tags with spaCy, we can assign sentiment scores at a more granular level, using the correct part of speech version of the word.

In [52]:
import spacy
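en_nlp = spacy.load('en')  # spaCy 3 and later no longer accept the 'en' shortcut; load a named model such as 'en_core_web_sm' instead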

---

### Parse the quotes using spaCy's multithreaded parser

---

### Create columns for part of speech proportions

For each of the part of speech tags, create a column in the dataset that records the proportion of words in the quote that have that part of speech tag. We can try using these as predictors.
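One way to compute these proportions is sketched below. It assumes each quote has already been reduced to a list of coarse part-of-speech tags (with spaCy these would typically come from each token's `pos_` attribute). The function name, the tag list, and the example tags are all assumptions made for illustration.

```python
from collections import Counter

# Tags to track; chosen here to match the keys of the sentiment dictionary.
TAGS = ['ADJ', 'NOUN', 'VERB', 'ADV']


def pos_proportions(tags, tagset=TAGS):
    """Return {tag: share of tokens carrying that tag} for one quote."""
    counts = Counter(tags)
    total = float(len(tags))
    if total == 0:
        return {tag: 0. for tag in tagset}
    return {tag: counts[tag] / total for tag in tagset}


# Hypothetical tags for a four-token quote.
print(pos_proportions(['DET', 'ADJ', 'NOUN', 'NOUN']))
```

Applying this to every parsed quote and expanding the resulting dictionaries into columns gives one proportion column per tag.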

---

### Evaluate a model with the new part of speech predictors

---

### Print out the most likely fresh and most likely rotten reviews

Using the predicted probabilities from our model, we can see which reviews are most likely to be fresh or rotten. We can easily validate that our model is doing something that makes sense by looking at these (one of the benefits of doing NLP work!)
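The ranking step might look like the sketch below, which fits a logistic regression on a tiny made-up dataset and orders the rows by predicted probability of the positive class. The data and variable names are placeholders; in the notebook the features would be the scaled predictors and the row positions would map back to the review quotes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up single-feature data: larger values are labelled 1 ("fresh").
X_toy = np.array([[0.], [1.], [2.], [3.], [4.], [5.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)

# Column 1 of predict_proba holds the probability of class 1.
proba_fresh = model.predict_proba(X_toy)[:, 1]

# Row positions sorted from most to least likely fresh.
order = np.argsort(proba_fresh)[::-1]
print(order[:3])   # most likely fresh
print(order[-3:])  # most likely rotten
```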

---

### Assign sentiment scores using the correct part of speech tag

We need to write another function that will take into account the part of speech tags using the parsed quotes we created earlier and the original sentiment data dictionary.
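A possible shape for that function is sketched below. It takes a list of `(word, tag)` pairs rather than spaCy objects so that it can be run on its own; `pos_scorer`, the toy dictionary, and its scores are hypothetical. Words or tags missing from the dictionary are treated as fully objective, mirroring the earlier `scorer`.

```python
def pos_scorer(tagged_words, sentiment):
    """Score (word, tag) pairs against a {tag: {word: scores}} dictionary."""
    pos, neg, obj = [], [], []
    for word, tag in tagged_words:
        # Fall back to a neutral entry when the tag or the word is unknown.
        entry = sentiment.get(tag, {}).get(word, {'pos_score': 0., 'neg_score': 0.})
        pos.append(entry['pos_score'])
        neg.append(entry['neg_score'])
        obj.append(1. - (entry['pos_score'] + entry['neg_score']))
    return pos, neg, obj


# Hypothetical dictionary in the same layout as sen_dict; the scores are made up.
toy_sen_dict = {'ADJ': {'clever': {'pos_score': 0.5, 'neg_score': 0.0}},
                'NOUN': {}, 'VERB': {}, 'ADV': {}}

print(pos_scorer([('a', 'DET'), ('clever', 'ADJ'), ('film', 'NOUN')], toy_sen_dict))
```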

---

### Evaluate the new predictors with different models.

Does regularization help? Decision trees?

In [77]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
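
One way to check whether regularization helps is a grid search over the penalty type and strength of an `SGDClassifier`. The sketch below uses synthetic data from `make_classification` as a stand-in for the sentiment predictors and imports `GridSearchCV` from `sklearn.model_selection`; the grid values are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real X and y would come from the reviews.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

# Candidate penalty types and regularization strengths (arbitrary values).
param_grid = {'penalty': ['l1', 'l2'], 'alpha': [0.0001, 0.01, 1.0]}

search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Comparing `best_score_` across the grid against an unregularized or weakly regularized setting gives a first indication of whether the penalty makes a difference on these features.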