# Movie Review Classification

Here we show how a computer can tell apart good and bad movie reviews. To do so, we create a "classifier": a program that learns how to categorize data from a "labeled" data set.

First we get a lot of movie reviews and label them as good or bad. We will use these real reviews to train the classifier.

In [1]:
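# SETUP MAGIC
# Each row of the tab-separated file: parts[1] is an ID, parts[2] the review text,
# parts[3] a rating from 0 to 4.
all_X = []
all_y = []
cur = 0
with open("train.tsv") as f:
    next(f)  # Skip the header line
    for line in f:
        parts = line.split('\t')
        # Keep a row only while the running counter is below its ID
        if cur < int(parts[1]):
            all_X.append(parts[2])
            all_y.append(int(parts[3]))
            cur += 1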

# Split up 0-1: negative, 2 (neutral): skip, 3-4: positive
all_X2 = []
all_y2 = []
all_bad = []
all_good = []
for text, rating in zip(all_X, all_y):
    if rating == 2:
        continue

    new_y = -1 if rating < 2 else 1

    all_y2.append(new_y)
    all_X2.append(text)

    if new_y == -1:
        all_bad.append(text)
    else:
        all_good.append(text)

Here are examples of one negative and one positive movie review from the sample we just loaded:

In [87]:
# A bad example
print("Negative:")
print(all_bad[8])
# A good one
print("Positive:")
print(all_good[100])

Negative:
As inept as big-screen remakes of The Avengers and The Wild Wild West .
Positive:
Robert Harmon 's less-is-more approach delivers real bump-in - the-night chills -- his greatest triumph is keeping the creepy crawlies hidden in the film 's thick shadows .


## 'Training' the computer
Now we have to transform the movie review sentences into a format that the computer can understand and process efficiently.

In [109]:
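from sklearn.feature_extraction.text import TfidfVectorizer
# Weight 1- to 3-word terms by TF-IDF. Drop English stop words, terms that appear
# in fewer than 5 reviews, and terms that appear in more than 90% of reviews.
vectorizer = TfidfVectorizer(min_df=5, stop_words='english', max_df=0.9, lowercase=True, ngram_range=(1,3))
vX = vectorizer.fit_transform(all_X2)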

print("In human:", all_X2[763])
print("In computer:", vX[763])


In human: Demme finally succeeds in diminishing his stature from Oscar-winning master to lowly studio hack .
In computer:   (0, 2840)	0.3680401183831172
  (0, 1771)	0.357838759868937
  (0, 970)	0.35556629724363537
  (0, 1565)	0.3626961808947826
  (0, 1136)	0.4295872245084288
  (0, 2490)	0.3739789902976097
  (0, 2467)	0.39257676087603205
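
Each `(row, column)  weight` pair above is one term of the review and its TF-IDF weight. As a rough illustration of where such weights come from, here is a minimal pure-Python sketch. It follows the smoothed formula documented for the `TfidfVectorizer` defaults (raw counts, `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization), but it skips the stop words, n-grams, and `min_df`/`max_df` filters used above. The helper `toy_tfidf` and the three toy reviews are made up for this example.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger values
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize so long and short reviews are comparable
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_reviews = ["a dull movie", "a great movie", "dull dull plot"]
toy_rows = toy_tfidf(toy_reviews)
for review, row in zip(toy_reviews, toy_rows):
    print(review, "->", {term: round(w, 2) for term, w in sorted(row.items())})
```

In the second toy review, "great" occurs in only one review while "movie" occurs in two, so "great" receives the larger weight.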


Just like a teacher gives you problems you've never seen before on a test, we set aside some of the movie reviews (20% here) to test the classifier after we train it.

In [110]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(vX, all_y2, test_size=0.2, random_state=123)
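
The idea behind the split can be sketched in a few lines of plain Python: shuffle the example indices with a fixed seed, then hold out a fraction of them. This is only an illustration of the concept, not how `train_test_split` is implemented; `holdout_split` and the toy data are hypothetical names introduced here.

```python
import random

def holdout_split(items, labels, test_size=0.2, seed=123):
    """Shuffle indices reproducibly, then hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Toy stand-ins for the reviews and their -1/1 labels
toy_items = ["review %d" % i for i in range(10)]
toy_labels = [1 if i % 2 else -1 for i in range(10)]
train_X, test_X, train_y, test_y = holdout_split(toy_items, toy_labels)
print(len(train_X), "for training,", len(test_X), "held out")
```

Fixing the seed (like `random_state=123` above) means the same reviews are held out on every run, so results stay comparable.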


Now it is time to train the classifier. X_train holds the movie reviews in "computer language" and y_train is the matching list of "good" and "bad" labels. We show the classifier both (with the fit function) and it learns the patterns in the data.

In [111]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(max_iter=200)
clf.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=200, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)
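
The printout shows `loss='hinge'` and `penalty='l2'`, which means a linear classifier trained by stochastic gradient descent. Below is a heavily simplified sketch of that idea in plain Python. It assumes a fixed learning rate and no shuffling, whereas the real classifier uses its own `'optimal'` learning-rate schedule and shuffles the data; `train_hinge_sgd`, `predict`, and the toy features are hypothetical.

```python
def train_hinge_sgd(X, y, epochs=50, lr=0.1, alpha=0.0001):
    """Fit weights w and bias b of a linear classifier (hinge loss, L2 penalty)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 penalty: shrink every weight slightly on each step
            w = [wj * (1 - lr * alpha) for wj in w]
            if margin < 1:
                # Wrong, or right but not confident enough: nudge toward the label
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(x, w, b):
    """Return 1 (good) or -1 (bad) depending on the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Hypothetical features per review: [count of "great", count of "dull"]
X_toy = [[1, 0], [2, 0], [0, 1], [0, 2]]
y_toy = [1, 1, -1, -1]
w_toy, b_toy = train_hinge_sgd(X_toy, y_toy)
toy_predictions = [predict(x, w_toy, b_toy) for x in X_toy]
print(toy_predictions)
```

After training, the weight for "great" ends up positive and the weight for "dull" negative, which is the same kind of pattern the real classifier learns over thousands of terms.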

## How well did we do?
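Let's see how well the classifier did: we show it movie reviews and it predicts the label each one should have. The score is the percentage it got correct. We report the score twice: once on the reviews used for training, and once on the held-out test reviews the classifier has never seen. The second number is the fairer measure of how it would do on new reviews.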

In [112]:
# How well we are doing:
print("Accuracy on examples used in training", clf.score(X_train, y_train) * 100, "%")
print("Accuracy on new examples", clf.score(X_test, y_test) * 100, "%")

Accuracy on examples used in training 90.64996368917939 %
Accuracy on new examples 76.6158315177923 %
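
For a classifier, `score` reports accuracy: the fraction of examples whose predicted label matches the true label. A minimal sketch of that calculation, with a hypothetical helper `accuracy_percent` and made-up labels:

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return 100.0 * correct / len(y_true)

toy_truth = [1, -1, 1, 1]
toy_guess = [1, -1, -1, 1]
toy_accuracy = accuracy_percent(toy_truth, toy_guess)
print(toy_accuracy)  # 75.0
```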


It may look like the computer did not do very well, but many sentences are confusing. Let's see how it does on specific examples:

In [113]:
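def explainer(sent, tv, clf):
    """Print the classifier's verdict for `sent` and the weight of each known term."""
    # Map column indices back to the terms they stand for
    id_map = {v: k for k, v in tv.vocabulary_.items()}
    sent_vec = tv.transform([sent])  # vectorize once and reuse
    vec = sent_vec.nonzero()[1]
    score = clf.decision_function(sent_vec)[0]

    # The thresholds are asymmetric: "bad" needs a stronger score than "good"
    if score < -1.:
        print("This movie is bad")
    elif score > 0.5:
        print("I think its good")
    else:
        print("I am not sure")

    print("Let me tell you why:")
    print("")
    for word in vec:
        # Learned weight of the term, scaled by how rare the term is (its IDF)
        print(id_map[word], "is", round(clf.coef_[0, word] * tv.idf_[word], 1))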
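
The score used by `explainer` comes from a linear model: each term's feature value is multiplied by its learned weight, and the products are added to an intercept. The sketch below shows that sum and reuses the two thresholds from `explainer`. The weights, feature values, and helper names (`linear_score`, `verdict`) are hypothetical and not taken from the trained classifier.

```python
def linear_score(features, weights, intercept=0.0):
    """Weighted sum of feature values; terms without a weight contribute nothing."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items()) + intercept

def verdict(score):
    """Same thresholds as explainer: below -1 is bad, above 0.5 is good."""
    if score < -1.0:
        return "bad"
    if score > 0.5:
        return "good"
    return "not sure"

# Hypothetical learned weights and feature values for one short review
toy_weights = {"dull": -2.0, "masterpiece": 2.5, "movie": -0.25}
toy_features = {"dull": 0.5, "movie": 0.5}

toy_score = linear_score(toy_features, toy_weights)
print(toy_score, verdict(toy_score))
```

Here the contributions are -1.0 from "dull" and -0.125 from "movie", so the total falls just below the "bad" threshold.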

In [None]:
explainer(all_good[10], vectorizer, clf)


## The magic computer machine

In [125]:
# Let's look at some specific examples
bad = "I have never seen a movie so dull and uninspired"
print(bad)
explainer(bad, vectorizer, clf)

print()
print()
good = "What a masterpiece, everyone must see it"
print(good)
explainer(good, vectorizer, clf)

I have never seen a movie so dull and uninspired
This movie is bad
Let me tell you why:

uninspired is -8.9
seen is 0.0
movie is -2.6
dull is -13.5


What a masterpiece, everyone must see it
I think its good
Let me tell you why:

masterpiece is 11.0


In [126]:
# User sentence
user_sentence = "I hate this actor"
print(user_sentence)
explainer(user_sentence, vectorizer, clf)

I hate this actor
I am not sure
Let me tell you why:

hate is -6.9
actor is 5.6


The algorithm does poorly on a mixed sentence like the next one, because it scores each term on its own and does not consider how the words are ordered or connected:

In [122]:
mixed = "Although the plot was poor, the execution was good"
print(mixed)
explainer(mixed, vectorizer, clf)

Although the plot was poor, the execution was good
This movie is bad
Let me tell you why:

poor is -7.5
plot is -7.2
good is 4.2
execution is -8.2
None
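
To see why ignoring word order hurts, consider a pure word-count representation. This is a simplification: the vectorizer above also allows 2- and 3-word terms, although only single words show up in the explanation for this sentence. Under word counts alone, two sentences that praise and criticize opposite things look identical. The helper `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

def bag_of_words(sentence):
    """Count lowercase words, discarding punctuation and word order."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

sentence_a = "Although the plot was poor, the execution was good"
sentence_b = "Although the execution was poor, the plot was good"

same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)
print(same_bag)  # True
```

A classifier that only sees these counts has no way to tell which of the two things was "poor" and which was "good".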


## Underlying mechanism
What words would you use to tell apart good and bad movie reviews? (Boring, exhilarating...)
Well, we can see which words the computer identified as "good-movie" and "bad-movie" words:

In [None]:
from pprint import pprint

# Terms ordered by column index, so terms[i] lines up with weight i
terms = [term for term, _ in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1])]
# clf.coef_ has one row for a two-class problem; pair each weight with its term
term_weights = sorted(zip(clf.coef_[0], terms))

# Top 10 most positive terms:
pprint(term_weights[-10:][::-1])

In [None]:
pprint(term_weights[:10])

In [None]:
%matplotlib notebook

In [None]:
import matplotlib.pyplot as plt
import numpy as np
nterms = 25
# Pick 25 random terms and plot their learned weights
randpoints = np.random.choice(len(terms), nterms)
randweights = clf.coef_[0][randpoints]
randterms = np.array(terms)[randpoints]
plt.plot(randweights, np.arange(0, nterms), '+')
plt.yticks([])
plt.title('Term weights')
plt.xlabel('Positivity weight')
for i in range(nterms):
    plt.text(randweights[i] + 0.05, i - 0.15, randterms[i], size='smaller')
plt.show()

## Takeaways

- Using basic machine learning techniques, we correctly predicted the sentiment of about 3 out of 4 previously unseen movie reviews (76.6% accuracy on the test set).
- The algorithm was able to identify clearly "good" (hilarious, masterpiece) and "bad" (worst, stupid) words, and use them to guess the rating.
- The algorithm was confused by sentences that contained both good and bad words.
    - This happened because the algorithm scored each term on its own, not by how the terms were connected by words like 'although' and 'but'.