## Sentiment Analysis for IMDB Movie Reviews using SKLearn

The [dataset](http://ai.stanford.edu/~amaas/data/sentiment/) was compiled by Andrew Maas and had **50,000** movie reviews from IMDB. It is split into *25k* for training and *25k* for testing. The movie ratings on IMDB can range from 1 to 10. Movies with (less than or equal to) ≤ 4 stars are labeled as negative while movies with (greater than or equal to) ≥ 7 stars are labeled as positive. Reviews with 5 or 6 starts were left out of the dataset.

I will use a cleaned up version of the dataset with just the ratings and reviews

In [19]:
# create Train and Test 
reviews_train = [line.strip() for line in open('movie_data/full_train.txt', 'r')]
reviews_test = [line.strip() for line in open('movie_data/full_test.txt', 'r')]

In [20]:
# clean the data
import re

REPLACE_NO_SPACE = re.compile("[.;:!\'?,\"()\[\]]")
REPLACE_SPACE = re.compile("(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub('', line.lower()) for line in reviews]
    reviews = [REPLACE_SPACE.sub(' ', line) for line in reviews]
    return reviews

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)

In [21]:
print(reviews_train_clean[6])

yes its an art to successfully make a slow paced thriller the story unfolds in nice volumes while you dont even notice it happening fine performance by robin williams the sexuality angles in the film can seem unnecessary and can probably affect how much you enjoy the film however the core plot is very engaging the movie doesnt rush onto you and still grips you enough to keep you wondering the direction is good use of lights to achieve desired affects of suspense and unexpectedness is good very nice 1 time watch if you are looking to lay back and hear a thrilling short story


In [22]:
# count vectorization 
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)
X = cv.transform(reviews_train_clean)
X_final_test = cv.transform(reviews_test_clean)

In [27]:
# building the model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# first 12.5k are positive and next 12.5k are negative
labels = [1 if i < 12500 else 0 for i in range(25000)]

X_train, X_test, y_train, y_test = train_test_split(X, labels, train_size=0.75)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c, solver='liblinear')
    lr.fit(X_train, y_train)
    print('Accuracy for C = {}: {}%'.format(c, accuracy_score(y_test, lr.predict(X_test)) * 100))


Accuracy for C = 0.01: 87.29599999999999%
Accuracy for C = 0.05: 88.336%
Accuracy for C = 0.25: 88.27199999999999%
Accuracy for C = 0.5: 88.0%
Accuracy for C = 1: 87.792%


In [28]:
# Testing the model
final_lr = LogisticRegression(C=0.05)
final_lr.fit(X, labels)
print('Final Accuracy for {}%'.format(accuracy_score(labels, lr.predict(X_final_test)) * 100))

Final Accuracy for 86.616%


In [38]:
# Sanity Check
feature_to_coef = {word: coef for word, coef in zip(cv.get_feature_names(), final_lr.coef_[0])}

print('Best Positive Scores')
for best_pos in sorted(feature_to_coef.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(best_pos)

print('\nBest Negative Scores')
for best_neg in sorted(feature_to_coef.items(), key=lambda x: x[1])[:5]:
    print(best_neg)                                             

Best Positive Scores
('excellent', 0.9292549111870664)
('perfect', 0.7907005791290077)
('great', 0.6745323547742257)
('amazing', 0.6127039928254254)
('superb', 0.6019368002203161)

Best Negative Scores
('worst', -1.3645958977380297)
('waste', -1.166424205957553)
('awful', -1.032418942642618)
('poorly', -0.8752018765326353)
('boring', -0.8563543412064868)
