# Sentiment analysis codealong using spaCy and movie reviews

Sentiment analysis is one of the more popular topics in NLP. It is concerned with assigning some kind of valence to written text, such as positivity, negativity, or objectivity, among many others. In this lesson we will look at just those three.

First we will load a dataset of pre-coded positivity and negativity scores for words. The words are also split by their part of speech.

Then we will load snippets of Rotten Tomatoes reviews and explore the sentiment of the writing.

---

### Load packages and sentiment data

In [1]:
import pandas as pd
import numpy as np

In [2]:
sen = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/sentiment_words/sentiment_words_simple.csv')

In [3]:
sen.head()

Unnamed: 0,pos,word,pos_score,neg_score
0,adj,.22-caliber,0.0,0.0
1,adj,.22-calibre,0.0,0.0
2,adj,.22_caliber,0.0,0.0
3,adj,.22_calibre,0.0,0.0
4,adj,.38-caliber,0.0,0.0


In [4]:
sen.pos = sen.pos.map(lambda x: x.upper())

---

### Create a sentiment dataset that does not take into account part of speech tags

We will use this version first, before we know each word's part of speech. Later, once spaCy has tagged each word, we can pair every word with the scores for its part of speech.

In [5]:
sen_agg = sen[['word','pos_score','neg_score']].groupby('word').agg(np.mean).reset_index()

In [6]:
sen_agg.head()

Unnamed: 0,word,pos_score,neg_score
0,'hood,0.0,0.375
1,'s_gravenhage,0.0,0.0
2,'tween,0.0,0.0
3,'tween_decks,0.0,0.0
4,.22,0.125,0.0
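The `groupby` above averages each word's scores over all of its part-of-speech entries. As a rough illustration of that step without pandas, here is a minimal sketch; the rows and their scores are invented for the example and are not taken from the real lexicon.

```python
from collections import defaultdict

# Hypothetical rows of (pos, word, pos_score, neg_score); values are made up.
rows = [
    ('ADJ', 'fine', 0.5, 0.0),
    ('NOUN', 'fine', 0.0, 0.25),
    ('VERB', 'fine', 0.25, 0.25),
    ('NOUN', 'film', 0.0, 0.0),
]

# Collect every score seen for a word, regardless of part of speech.
collected = defaultdict(lambda: {'pos_score': [], 'neg_score': []})
for pos, word, pos_score, neg_score in rows:
    collected[word]['pos_score'].append(pos_score)
    collected[word]['neg_score'].append(neg_score)

# Average each list, mirroring groupby('word').agg(np.mean).
averaged = {
    word: {key: sum(vals) / float(len(vals)) for key, vals in scores.items()}
    for word, scores in collected.items()
}

print(averaged['fine'])
print(averaged['film'])
```

A word that appears under several parts of speech ends up with a single blended score, which is the trade-off of ignoring the tags.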


---

### Create a dictionary version of the sentiment data for both the part-of-speech and aggregate versions

The dictionary format of the data will be much easier to index into in our functions later. Without it, every word lookup would have to filter the DataFrame, which makes it much harder to get those functions to run quickly.
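To make the target shape concrete, here is a small sketch of the two dictionary layouts and how a lookup works in each. The `toy_` names, words, and scores are assumptions made up for illustration; only the nesting mirrors what the cells below build.

```python
# Toy stand-ins for the two lookup tables; the words and scores here are
# invented for illustration and are not taken from the real lexicon.
toy_sen_dict = {
    'ADJ': {'bright': {'pos_score': 0.5, 'neg_score': 0.0}},
    'NOUN': {'mess': {'pos_score': 0.0, 'neg_score': 0.25}},
    'VERB': {},
    'ADV': {},
}
toy_sen_agg_dict = {
    'bright': {'pos_score': 0.5, 'neg_score': 0.0},
    'mess': {'pos_score': 0.0, 'neg_score': 0.25},
}

# Part-of-speech lookup: tag first, then word.
adj_bright = toy_sen_dict['ADJ']['bright']['pos_score']

# Aggregate lookup: word only.
agg_mess = toy_sen_agg_dict['mess']['neg_score']

# Words missing from the lexicon can be handled with a default entry.
missing = toy_sen_agg_dict.get('zzz', {'pos_score': 0.0, 'neg_score': 0.0})

print(adj_bright)
print(agg_mess)
print(missing)
```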

In [7]:
sen_dict = {'ADJ':{},'NOUN':{},'VERB':{},'ADV':{}}

for i, row in enumerate(sen.itertuples()):
    if (i % 10000) == 0:
        print i
    sen_dict[row[1]][row[2]] = {'pos_score':row[3], 'neg_score':row[4]}


0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000


In [8]:
sen_agg_dict = {}
for i, row in enumerate(sen_agg.itertuples()):
    if (i % 10000) == 0:
        print i
    sen_agg_dict[row[1]] = {'pos_score':row[2], 'neg_score':row[3]}

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000


---

### Load the Rotten Tomatoes dataset

This dataset has the following columns:

    critic: critic's name
    fresh: fresh vs. rotten rating
    imdb: IMDb code for the movie
    publication: where the review was published
    quote: the review snippet
    review_date: date of review
    rtid: Rotten Tomatoes id
    title: name of movie

In [9]:
rt = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/rottentomatoes_critics/rt_critics.csv')

In [10]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,fresh,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story
2,David Ansen,fresh,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story
3,Leonard Klady,fresh,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story
4,Jonathan Rosenbaum,fresh,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story


---

### Restrict data to reviews with valid ratings and more than 10 words

Clean up the reviews, adding a column with the text lowercased and punctuation removed. Apostrophes are kept, and hyphens are replaced with spaces.

In [11]:
rt = rt[rt.fresh.isin(['fresh','rotten'])]
rt.fresh = rt.fresh.map(lambda x: 1 if x == 'fresh' else 0)

In [12]:
rt['quote_len'] = rt.quote.map(lambda x: len(x.split()))
rt = rt[rt.quote_len > 10]
rt.shape

(11215, 9)

In [13]:
import string
rt['qt'] = rt.quote.map(lambda x: unicode(''.join([y for y in list(x.lower()) if y in string.ascii_lowercase+" -'"])))
rt.qt = rt.qt.map(lambda x: x.replace('-',' '))
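The cleaning step is easier to follow on a single string. The sketch below applies the same rule (keep lowercase letters, spaces, hyphens, and apostrophes, then turn hyphens into spaces) in a form that also runs on Python 3. The helper name `clean_quote` and the second example sentence are assumptions made for illustration.

```python
import string

# Characters to keep: lowercase letters, space, hyphen, and apostrophe.
KEEP = set(string.ascii_lowercase + " -'")

def clean_quote(text):
    """Lowercase, drop characters outside KEEP, then turn hyphens into spaces."""
    kept = ''.join(ch for ch in text.lower() if ch in KEEP)
    return kept.replace('-', ' ')

print(clean_quote("The year's most inventive comedy."))
# the year's most inventive comedy
print(clean_quote("An entertaining computer-generated feature!"))
# an entertaining computer generated feature
```

Because hyphenated words are split, the cleaned text can contain more tokens than `quote_len`, which was counted on the original quote.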

---

### Write a function to assign positive, negative, and objective scores based on the words in a review

We'll use the dictionary we constructed above (without the part of speech tags). Words that are not in the dictionary are treated as fully objective.

Objectivity is calculated as:

    1. - (positive_score + negative_score)

In [14]:
def agg_scorer(x):
    x = x.split()
    pos_scores, neg_scores, obj_scores = [], [], []
    for token in x:
        try:
            pos_scores.append(sen_agg_dict[token]['pos_score'])
            neg_scores.append(sen_agg_dict[token]['neg_score'])
            obj_scores.append((1. - (pos_scores[-1] + neg_scores[-1])))
        except KeyError:  # token not in the lexicon: score it as fully objective
            pos_scores.append(0.)
            neg_scores.append(0.)
            obj_scores.append(1.)
    return [pos_scores, neg_scores, obj_scores]
    

In [15]:
agg_scores = map(agg_scorer, rt.qt)
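To see what the scorer returns for one review, here is a self-contained sketch that takes the lexicon as an argument instead of reading the global `sen_agg_dict`. The function name `toy_agg_scorer`, the words, and the scores are invented for illustration.

```python
# Toy lexicon; the words and scores are invented for illustration.
toy_lexicon = {
    'charming': {'pos_score': 0.5, 'neg_score': 0.0},
    'mess': {'pos_score': 0.0, 'neg_score': 0.25},
}

def toy_agg_scorer(text, lexicon):
    """Return [pos_scores, neg_scores, obj_scores], one entry per token."""
    pos_scores, neg_scores, obj_scores = [], [], []
    for token in text.split():
        # Unknown words count as fully objective: pos 0, neg 0, obj 1.
        entry = lexicon.get(token, {'pos_score': 0.0, 'neg_score': 0.0})
        pos_scores.append(entry['pos_score'])
        neg_scores.append(entry['neg_score'])
        obj_scores.append(1.0 - (entry['pos_score'] + entry['neg_score']))
    return [pos_scores, neg_scores, obj_scores]

pos, neg, obj = toy_agg_scorer('a charming mess', toy_lexicon)
print(pos)  # [0.0, 0.5, 0.0]
print(neg)  # [0.0, 0.0, 0.25]
print(obj)  # [1.0, 0.5, 0.75]
```

Each of the three lists has one entry per token, so averaging or summing them per review gives review-level features.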

---

### Calculate the sum and average ratings for positive, negative, and objective for each review

In [16]:
rt['pos_avg'] = [np.mean(x[0]) for x in agg_scores]
rt['neg_avg'] = [np.mean(x[1]) for x in agg_scores]
rt['obj_avg'] = [np.mean(x[2]) for x in agg_scores]

rt['pos_sum'] = [np.sum(x[0]) for x in agg_scores]
rt['neg_sum'] = [np.sum(x[1]) for x in agg_scores]
rt['obj_sum'] = [np.sum(x[2]) for x in agg_scores]

In [17]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len,qt,pos_avg,neg_avg,obj_avg,pos_sum,neg_sum,obj_sum
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story,24,so ingenious in concept design and execution t...,0.046599,0.027885,0.925517,1.164969,0.697115,23.137916
2,David Ansen,1,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story,13,a winning animated feature that has something ...,0.062271,0.021978,0.915751,0.809524,0.285714,11.904762
3,Leonard Klady,1,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story,17,the film sports a provocative and appealing st...,0.057831,0.024271,0.917897,0.983135,0.412608,15.604257
4,Jonathan Rosenbaum,1,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story,14,an entertaining computer generated hyperrealis...,0.067496,0.039307,0.893197,0.94494,0.550298,12.504762
5,Michael Booth,1,114709.0,Denver Post,"As Lion King did before it, Toy Story revived ...",2007-05-03,9559.0,Toy story,40,as lion king did before it toy story revived t...,0.028408,0.021935,0.949657,1.136316,0.877397,37.986287


---

### Evaluate predictive ability using the sentiment scores

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score

In [19]:
X = rt[['pos_avg','neg_avg','obj_avg','quote_len']]
y = rt.fresh.values

lr_scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print np.mean(lr_scores), rt.fresh.mean()

0.624431210669 0.615069103879
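The second number printed above is the share of fresh reviews, which is also the accuracy of always guessing "fresh". A minimal sketch of that baseline comparison follows; the helper name `majority_baseline` and the toy labels are assumptions for illustration.

```python
def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common 0/1 label."""
    share_of_ones = sum(labels) / float(len(labels))
    return max(share_of_ones, 1.0 - share_of_ones)

# Hypothetical labels: 1 = fresh, 0 = rotten.
toy_labels = [1, 1, 1, 0, 0]
print(majority_baseline(toy_labels))

# Gap between the cross-validated accuracy and the baseline printed above.
gap = 0.624431210669 - 0.615069103879
print(round(gap, 4))
```

With the numbers from this run, the model beats the baseline by less than one percentage point, so the aggregate sentiment features carry only a little signal on their own.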


---

### Import spaCy

The spaCy package is one of the most widely used libraries for parsing text. We are going to use it to find the part of speech tags for the review words.

Once spaCy has tagged the words, we can assign sentiment scores at a more granular level, using the correct part of speech version of each word.

In [20]:
import spacy
en_nlp = spacy.load('en')

In [21]:
tmp = en_nlp(rt.qt.values[0])

In [22]:
tmp

so ingenious in concept design and execution that you could watch it on a postage stamp sized screen and still be engulfed by its charm

In [23]:
tmp[3]

concept

In [24]:
for token in tmp:
    print token.pos_

ADV
ADJ
ADP
NOUN
NOUN
CONJ
NOUN
ADJ
PRON
VERB
VERB
PRON
ADP
DET
NOUN
NOUN
VERB
NOUN
CONJ
ADV
VERB
VERB
ADP
ADJ
NOUN


In [25]:
'ADV' == tmp[0].pos_

True

In [26]:
tmp[0].pos_ in ['ADV','NOUN']

True

In [27]:
tok = tmp[0]

In [28]:
rt.shape

(11215, 16)

---

### Parse the quotes using spaCy's multithreaded parser

In [29]:
parsed_quotes = []
for i, parsed in enumerate(en_nlp.pipe(rt.qt.values, batch_size=50, n_threads=4)):
    assert parsed.is_parsed
    if (i % 1000) == 0:
        print i
    parsed_quotes.append(parsed)        

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000


---

### Create columns for part of speech proportions

For each of the part of speech tags, create a column in the dataset that records the proportion of words in the quote that have that part of speech tag. We can try using these as predictors.

In [30]:
unique_pos = []
for parsed in parsed_quotes:
    unique_pos.extend([t.pos_ for t in parsed])
unique_pos = np.unique(unique_pos)
print unique_pos

[u'ADJ' u'ADP' u'ADV' u'CONJ' u'DET' u'INTJ' u'NOUN' u'NUM' u'PART' u'PRON'
 u'PROPN' u'PUNCT' u'SPACE' u'SYM' u'VERB' u'X']


In [31]:
for pos in unique_pos:
    rt[pos+'_prop'] = 0.
       

In [32]:
rt = rt.reset_index(drop=True)
for i, parsed in enumerate(parsed_quotes):
    if (i % 1000) == 0:
        print i
    parsed_len = len(parsed)
    for pos in unique_pos:
        count = len([x for x in parsed if x.pos_ == pos])
        rt.ix[i, pos+'_prop'] = float(count)/parsed_len
    
    

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000


In [33]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len,qt,...,NOUN_prop,NUM_prop,PART_prop,PRON_prop,PROPN_prop,PUNCT_prop,SPACE_prop,SYM_prop,VERB_prop,X_prop
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story,24,so ingenious in concept design and execution t...,...,0.28,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.2,0.0
1,David Ansen,1,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story,13,a winning animated feature that has something ...,...,0.384615,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.153846,0.0
2,Leonard Klady,1,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story,17,the film sports a provocative and appealing st...,...,0.277778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0
3,Jonathan Rosenbaum,1,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story,14,an entertaining computer generated hyperrealis...,...,0.4375,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.1875,0.0
4,Michael Booth,1,114709.0,Denver Post,"As Lion King did before it, Toy Story revived ...",2007-05-03,9559.0,Toy story,40,as lion king did before it toy story revived t...,...,0.325581,0.0,0.023256,0.046512,0.0,0.0,0.0,0.0,0.139535,0.0
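The first row can be checked by hand from the tags printed earlier for the first quote. This sketch tallies those 25 tags with `collections.Counter` rather than spaCy, so it runs on its own.

```python
from collections import Counter

# Tags printed earlier for the first quote, in token order.
tags = ['ADV', 'ADJ', 'ADP', 'NOUN', 'NOUN', 'CONJ', 'NOUN', 'ADJ', 'PRON',
        'VERB', 'VERB', 'PRON', 'ADP', 'DET', 'NOUN', 'NOUN', 'VERB', 'NOUN',
        'CONJ', 'ADV', 'VERB', 'VERB', 'ADP', 'ADJ', 'NOUN']

counts = Counter(tags)
total = float(len(tags))

# Proportion of tokens carrying each tag; tags that never occur are absent.
proportions = {tag: count / total for tag, count in counts.items()}

print(proportions['NOUN'])
print(proportions['VERB'])
print(proportions['PRON'])
```

The results (7, 5, and 2 tokens out of 25) agree with the `NOUN_prop`, `VERB_prop`, and `PRON_prop` values of 0.28, 0.2, and 0.08 in the first row.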


---

### Evaluate a model with the new part of speech predictors

In [34]:
X = rt[['pos_avg','neg_avg','obj_avg','quote_len']+[x for x in rt.columns if x.endswith('_prop')]]
y = rt.fresh.values

lr_scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print np.mean(lr_scores), rt.fresh.mean()

0.637539340142 0.615069103879


In [35]:
X.columns

Index([   u'pos_avg',    u'neg_avg',    u'obj_avg',  u'quote_len',
         u'ADJ_prop',   u'ADP_prop',   u'ADV_prop',  u'CONJ_prop',
         u'DET_prop',  u'INTJ_prop',  u'NOUN_prop',   u'NUM_prop',
        u'PART_prop',  u'PRON_prop', u'PROPN_prop', u'PUNCT_prop',
       u'SPACE_prop',   u'SYM_prop',  u'VERB_prop',     u'X_prop'],
      dtype='object')

In [36]:
lr = LogisticRegression().fit(X, y)
for var, coef in zip(X.columns, lr.coef_[0]):
    print var, coef

pos_avg 8.76697959874
neg_avg -7.33385858761
obj_avg -0.634641104609
quote_len 0.0116093439938
ADJ_prop 0.961052406301
ADP_prop 0.0732935970155
ADV_prop -2.05180196625
CONJ_prop 1.60400681916
DET_prop -0.723469079008
INTJ_prop -0.105599029459
NOUN_prop 0.905379519771
NUM_prop 1.02922089112
PART_prop -1.52186110803
PRON_prop 1.55185184176
PROPN_prop 1.80559660429
PUNCT_prop -0.468802105115
SPACE_prop -0.167968297252
SYM_prop 0.0290852861332
VERB_prop -2.26374980084
X_prop 0.14224432693


---

### Print out the most likely fresh and most likely rotten reviews

Using the predicted probabilities from our model, we can see which reviews are most likely to be fresh or rotten. We can easily validate that our model is doing something that makes sense by looking at these (one of the benefits of doing NLP work!)
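For a binary logistic regression, the two probability columns are `1 - p` and `p`, where `p` is the logistic function applied to the intercept plus the weighted sum of the features. The sketch below shows that calculation for one review; the function names, the two-feature model, and its numbers are hypothetical and are not the fitted values from this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def toy_predict_proba(features, coefs, intercept):
    """Return (rotten_prob, fresh_prob) for one review."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    fresh = sigmoid(z)
    return 1.0 - fresh, fresh

# Hypothetical two-feature model (pos_avg, neg_avg); numbers are invented.
rotten, fresh = toy_predict_proba([0.05, 0.02], [8.0, -7.0], 0.25)
print(round(rotten, 3))
print(round(fresh, 3))
```

A positive coefficient pushes the linear score up and the fresh probability toward 1; a negative one does the opposite.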

In [37]:
pp = pd.DataFrame(lr.predict_proba(X), columns=['rotten_prob','fresh_prob'])
pp['ind'] = range(pp.shape[0])
pp['quote'] = rt.quote.values

In [38]:
pp.head()

Unnamed: 0,rotten_prob,fresh_prob,ind,quote
0,0.336702,0.663298,0,"So ingenious in concept, design and execution ..."
1,0.309864,0.690136,1,A winning animated feature that has something ...
2,0.263624,0.736376,2,The film sports a provocative and appealing st...
3,0.401442,0.598558,3,"An entertaining computer-generated, hyperreali..."
4,0.280006,0.719994,4,"As Lion King did before it, Toy Story revived ..."


In [39]:
pp.sort_values('fresh_prob', ascending=False, inplace=True)

In [40]:
for quote in pp.quote[0:10]:
    print quote
    print '============================================================\n'

An ingenious script, excellent special effects and photography, and superior acting, make it an endearing winner.

The Karate Kid exhibits warmth and friendly, predictable humor, its greatest assets.

Like chamber music, Metropolitan is sprightly, intimate and all too self-aware.

Appropriately operatic, Chen's visually spectacular epic is sumptuous in every respect. Intelligent, enthralling, rhapsodic.

From Russia with Love is a preposterous, skillful slab of hardhitting, sexy hokum.

An unashamed study of selfish, sadistic criminality, and all the better for it.

A very good film with some dazzling moments and one truly outstanding performance!

Some nifty celestial surfing and a good finale compensate for a dead midsection.

The Whole Nine Yards a bright crowd-pleaser with a clever script.

A flawed but powerful Shine-esque tale of love, music, and madness!



In [41]:
pp.sort_values('rotten_prob', ascending=False, inplace=True)

In [42]:
for quote in pp.quote[0:10]:
    print quote
    print '============================================================\n'

Bringing Out the Dead is stunning to look at; unfortunately, it's not terribly satisfying to watch.

Don't bother to hang around for the outtakes. They're not funny either.

If inspiration is lacking, talent is not. Count Lynch down but never out.

Unfortunately, and perhaps not unexpectedly, it doesn't live up to the hype.

I've rarely laughed so much at a movie I generally disliked.

Not only about PCs; it appears to have been dictated by one.

Things might be bad, the movie suggests, but they're not so bad you can't laugh.

You can never expect to see a film more handsomely played.

It would be difficult to imagine material more wrong for Spade than Lost & Found.

Not even Eddie can save this ill-conceived mess of a movie.



---

### Assign sentiment scores using the correct part of speech tag

We need to write another function that will take into account the part of speech tags using the parsed quotes we created earlier and the original sentiment data dictionary.
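Before the real function, here is a toy version of the idea that works on plain `(word, tag)` pairs instead of spaCy tokens. The lexicon, its scores, and the name `toy_pos_scorer` are assumptions made up for illustration.

```python
# Toy part-of-speech lexicon; words and scores are invented for illustration.
toy_pos_lexicon = {
    'NOUN': {'sports': {'pos_score': 0.0, 'neg_score': 0.0}},
    'VERB': {'sports': {'pos_score': 0.25, 'neg_score': 0.0}},
    'ADJ': {},
    'ADV': {},
}

def toy_pos_scorer(tagged_tokens, lexicon):
    """Score (word, tag) pairs, skipping tags the lexicon does not cover."""
    pos_scores, neg_scores, obj_scores = [], [], []
    for word, tag in tagged_tokens:
        if tag not in lexicon:
            continue  # e.g. DET or ADP tokens are not scored at all
        # Words missing under this tag count as fully objective.
        entry = lexicon[tag].get(word, {'pos_score': 0.0, 'neg_score': 0.0})
        pos_scores.append(entry['pos_score'])
        neg_scores.append(entry['neg_score'])
        obj_scores.append(1.0 - (entry['pos_score'] + entry['neg_score']))
    return [pos_scores, neg_scores, obj_scores]

tagged = [('the', 'DET'), ('film', 'NOUN'), ('sports', 'VERB'), ('a', 'DET')]
pos, neg, obj = toy_pos_scorer(tagged, toy_pos_lexicon)
print(pos)  # [0.0, 0.25]
print(obj)  # [1.0, 0.75]
```

Two things differ from the aggregate scorer: the same word can score differently depending on its tag, and only nouns, verbs, adverbs, and adjectives are scored, so the lists are shorter than the token count.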

In [43]:
def scorer(parsed):
    pos_scores, neg_scores, obj_scores = [], [], []
    for token in [t for t in parsed if t.pos_ in ['NOUN','VERB','ADV','ADJ']]:
        try:
            pos_scores.append(sen_dict[token.pos_][str(token)]['pos_score'])
            neg_scores.append(sen_dict[token.pos_][str(token)]['neg_score'])
            obj_scores.append((1. - (pos_scores[-1] + neg_scores[-1])))
        except KeyError:  # word not listed under this part of speech: score it as fully objective
            pos_scores.append(0.)
            neg_scores.append(0.)
            obj_scores.append(1.)
    return [pos_scores, neg_scores, obj_scores]
    

In [44]:
scores = []
for i, parsed in enumerate(parsed_quotes):
    if (i % 1000) == 0:
        print i
    scores.append(scorer(parsed))

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000


In [45]:
rt['pos_part_avg'] = [np.mean(x[0]) for x in scores]
rt['neg_part_avg'] = [np.mean(x[1]) for x in scores]
rt['obj_part_avg'] = [np.mean(x[2]) for x in scores]

In [46]:
rt['pos_part_sum'] = [np.sum(x[0]) for x in scores]
rt['neg_part_sum'] = [np.sum(x[1]) for x in scores]
rt['obj_part_sum'] = [np.sum(x[2]) for x in scores]

In [47]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title,quote_len,qt,...,SPACE_prop,SYM_prop,VERB_prop,X_prop,pos_part_avg,neg_part_avg,obj_part_avg,pos_part_sum,neg_part_sum,obj_part_sum
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story,24,so ingenious in concept design and execution t...,...,0.0,0.0,0.2,0.0,0.06551,0.021877,0.912613,1.113668,0.371909,15.514423
1,David Ansen,1,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story,13,a winning animated feature that has something ...,...,0.0,0.0,0.153846,0.0,0.020833,0.0,0.979167,0.1875,0.0,8.8125
2,Leonard Klady,1,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story,17,the film sports a provocative and appealing st...,...,0.0,0.0,0.055556,0.0,0.085227,0.030475,0.884298,0.9375,0.335227,9.727273
3,Jonathan Rosenbaum,1,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story,14,an entertaining computer generated hyperrealis...,...,0.0625,0.0,0.1875,0.0,0.015152,0.014015,0.970833,0.166667,0.154167,10.679167
4,Michael Booth,1,114709.0,Denver Post,"As Lion King did before it, Toy Story revived ...",2007-05-03,9559.0,Toy story,40,as lion king did before it toy story revived t...,...,0.0,0.0,0.139535,0.0,0.03787,0.013122,0.949009,1.060352,0.367403,26.572245


---

### Evaluate the new predictors with different models.

Does regularization help? What about decision trees?

In [48]:
X = rt[['pos_part_avg','neg_part_avg','obj_part_avg','quote_len']+[x for x in rt.columns if x.endswith('_prop')]]
y = rt.fresh.values

lr_scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print np.mean(lr_scores), rt.fresh.mean()

0.639679252428 0.615069103879


In [55]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

In [52]:
X = rt[['pos_part_avg','neg_part_avg','obj_part_avg',
        'pos_part_sum','neg_part_sum','obj_part_sum',
        'quote_len']+[x for x in rt.columns if x.endswith('_prop')]]

ss = StandardScaler()
Xn = ss.fit_transform(X)

In [59]:
sgd_params = {
    'loss':['log'],
    'penalty':['elasticnet'],
    'alpha':np.logspace(-4,1,50),
    'l1_ratio':np.linspace(0.01,1.0,25)
}

sgd_gs = GridSearchCV(SGDClassifier(), sgd_params, cv=5, verbose=1)
sgd_gs.fit(Xn, y)

Fitting 5 folds for each of 1250 candidates, totalling 6250 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    1.1s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    4.3s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:    9.9s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:   17.4s
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:   27.7s
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:   39.6s
[Parallel(n_jobs=1)]: Done 2449 tasks       | elapsed:   51.4s
[Parallel(n_jobs=1)]: Done 3199 tasks       | elapsed:  1.1min
[Parallel(n_jobs=1)]: Done 4049 tasks       | elapsed:  1.4min
[Parallel(n_jobs=1)]: Done 4999 tasks       | elapsed:  1.7min
[Parallel(n_jobs=1)]: Done 6049 tasks       | elapsed:  2.1min
[Parallel(n_jobs=1)]: Done 6250 out of 6250 | elapsed:  2.2min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['elasticnet'], 'loss': ['log'], 'l1_ratio': array([ 0.01   ,  0.05125,  0.0925 ,  0.13375,  0.175  ,  0.21625,
        0.2575 ,  0.29875,  0.34   ,  0.38125,  0.4225 ,  0.46375,
        0.505  ,  0.54625,  0.5875 ,  0.62875,  0.67   ,  0.71125,
        0.7525 ,  0.79375,  0.8...    3.08884e+00,   3.90694e+00,   4.94171e+00,   6.25055e+00,
         7.90604e+00,   1.00000e+01])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)
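The "1250 candidates, totalling 6250 fits" line follows directly from the grid shape: every combination of parameter values is one candidate, and each candidate is fit once per fold. A small sketch of that arithmetic, with `count_fits` as a hypothetical helper name:

```python
from itertools import product

def count_fits(param_grid, n_folds):
    """Return (number of candidates, total fits) for an exhaustive grid search."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * n_folds

# Same grid shape as sgd_params: 50 alphas and 25 l1_ratio values.
# range() stands in for the actual values, since only the counts matter here.
sgd_shape = {
    'loss': ['log'],
    'penalty': ['elasticnet'],
    'alpha': range(50),
    'l1_ratio': range(25),
}
sgd_counts = count_fits(sgd_shape, n_folds=5)
print(sgd_counts)  # (1250, 6250)
```

The same arithmetic gives 9 x 3 x 8 = 216 candidates and 2160 fits for the decision tree grid further down.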

In [60]:
print sgd_gs.best_params_

{'penalty': 'elasticnet', 'alpha': 0.0054286754393238594, 'loss': 'log', 'l1_ratio': 0.25750000000000001}


In [70]:
print sgd_gs.best_score_

0.644315648685


In [62]:
#lr = LogisticRegression().fit(X, y)
sgd_best = sgd_gs.best_estimator_
for var, coef in zip(X.columns, sgd_best.coef_[0]):
    print var, coef

pos_part_avg 0.41936535519
neg_part_avg -0.230345654405
obj_part_avg 0.0
pos_part_sum 0.144675979279
neg_part_sum 0.0
obj_part_sum 0.0
quote_len 0.141753935347
ADJ_prop 0.00439368864367
ADP_prop 0.0
ADV_prop -0.097246719695
CONJ_prop 0.0921578255495
DET_prop -0.051296444684
INTJ_prop 0.0
NOUN_prop 0.0464299458926
NUM_prop 0.0
PART_prop -0.128019386619
PRON_prop 0.0894064877574
PROPN_prop 0.0618345363725
PUNCT_prop 0.0
SPACE_prop 0.0
SYM_prop 0.0
VERB_prop -0.10001171179
X_prop 0.0
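The elastic-net penalty has pushed a number of coefficients to exactly zero. Since the model was fit on standardized features, the remaining magnitudes are roughly comparable. The sketch below separates kept from dropped features, using the values printed above rounded to four decimals.

```python
# Coefficients copied from the output above (rounded to four decimals).
coefs = {
    'pos_part_avg': 0.4194, 'neg_part_avg': -0.2303, 'obj_part_avg': 0.0,
    'pos_part_sum': 0.1447, 'neg_part_sum': 0.0, 'obj_part_sum': 0.0,
    'quote_len': 0.1418, 'ADJ_prop': 0.0044, 'ADP_prop': 0.0,
    'ADV_prop': -0.0972, 'CONJ_prop': 0.0922, 'DET_prop': -0.0513,
    'INTJ_prop': 0.0, 'NOUN_prop': 0.0464, 'NUM_prop': 0.0,
    'PART_prop': -0.1280, 'PRON_prop': 0.0894, 'PROPN_prop': 0.0618,
    'PUNCT_prop': 0.0, 'SPACE_prop': 0.0, 'SYM_prop': 0.0,
    'VERB_prop': -0.1000, 'X_prop': 0.0,
}

kept = sorted(name for name, value in coefs.items() if value != 0.0)
dropped = sorted(name for name, value in coefs.items() if value == 0.0)
kept_props = [name for name in kept if name.endswith('_prop')]

print(len(kept))
print(len(dropped))
print(kept_props)
```

Thirteen features survive and ten are dropped. The nine surviving `_prop` columns are the same ones used as predictors for the decision tree below.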


In [74]:
from sklearn.tree import DecisionTreeClassifier

In [75]:
X = rt[['pos_part_avg','neg_part_avg','obj_part_avg',
        'pos_part_sum','neg_part_sum','obj_part_sum',
        'quote_len']+['ADJ_prop','ADV_prop','CONJ_prop','DET_prop','NOUN_prop',
                      'PART_prop','PRON_prop','PROPN_prop','VERB_prop']]

In [77]:
dtc_params = {
    'max_depth':[1,2,3,4,5,6,7,8,None],
    'max_features':['sqrt','log2',None],
    'min_samples_split':[2,4,8,16,32,64,124,256]
}

dtc_gs = GridSearchCV(DecisionTreeClassifier(), dtc_params, cv=10, verbose=1)
dtc_gs.fit(X, y)

Fitting 10 folds for each of 216 candidates, totalling 2160 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    0.6s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    2.6s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:    7.0s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:   15.6s
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:   29.6s
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:   50.6s
[Parallel(n_jobs=1)]: Done 2160 out of 2160 | elapsed:  1.2min finished


GridSearchCV(cv=10, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': ['sqrt', 'log2', None], 'min_samples_split': [2, 4, 8, 16, 32, 64, 124, 256], 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, None]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [78]:
print dtc_gs.best_params_
print dtc_gs.best_score_
dtc_best = dtc_gs.best_estimator_

{'max_features': None, 'min_samples_split': 2, 'max_depth': 3}
0.629781542577


In [79]:
for var, fi in zip(X.columns, dtc_best.feature_importances_):
    print var, fi

pos_part_avg 0.538993823507
neg_part_avg 0.0312775260933
obj_part_avg 0.0
pos_part_sum 0.189424331195
neg_part_sum 0.108052239792
obj_part_sum 0.0
quote_len 0.0
ADJ_prop 0.0
ADV_prop 0.132252079413
CONJ_prop 0.0
DET_prop 0.0
NOUN_prop 0.0
PART_prop 0.0
PRON_prop 0.0
PROPN_prop 0.0
VERB_prop 0.0


In [80]:
pp = pd.DataFrame(dtc_best.predict_proba(X), columns=['rotten_prob','fresh_prob'])
pp['ind'] = range(pp.shape[0])
pp['quote'] = rt.quote.values

In [82]:
pp.sort_values('rotten_prob', ascending=False, inplace=True)

In [83]:
for quote in pp.quote[0:10]:
    print quote
    print '============================================================\n'

Limiting the gore, but not the carnage, in pursuit of a PG-13 rating and more youngsters, pic remains a cluttered, nasty exercise that seems principally intent on selling action figures.

Sarah Thorpe's screenplay is a compendium of by-the-book cliches; Kaufman's direction leaves the material stranded in a limbo between po-faced and trashy; Judd's approximation of drunkenness is worrying to behold.

This family comedy finds unearned laughs in old women and dog flatulence.

Adequate in every way and oddly subversive in spots, but that's about it.

A series of emotionally wrenching moments that made My Life as a Dog a transatlantic hit when it arrived in 1985.

Movies like Hard Eight remind me of what original, compelling characters the movies can sometimes give us.

There's something almost hypnotic about the way Hard Eight develops -- even in its slowest, most tedious moments, it keeps our attention.

Noirish thrillers live or die by their plot twists and dialogue... Unfortunately, the