## Sentiment Analysis for IMDB Movie Reviews using SKLearn

The [dataset](http://ai.stanford.edu/~amaas/data/sentiment/) was compiled by Andrew Maas and had **50,000** movie reviews from IMDB. It is split into *25k* for training and *25k* for testing. The movie ratings on IMDB can range from 1 to 10. Movies with (less than or equal to) ≤ 4 stars are labeled as negative while movies with (greater than or equal to) ≥ 7 stars are labeled as positive. Reviews with 5 or 6 starts were left out of the dataset.

I will use a cleaned up version of the dataset with just the ratings and reviews

In [47]:
# versions 
import sklearn, nltk
print(sklearn.__version__)
print(nltk.__version__)

0.21.3
3.4.5


In [48]:
# create Train and Test 
reviews_train = [line.strip() for line in open('movie_data/full_train.txt', 'r')]
reviews_test = [line.strip() for line in open('movie_data/full_test.txt', 'r')]
print(reviews_train[0])

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!


In [49]:
# clean the data
import re

REPLACE_NO_SPACE = re.compile("[.;:!\'?,\"()\[\]]")
REPLACE_SPACE = re.compile("(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub('', line.lower()) for line in reviews]
    reviews = [REPLACE_SPACE.sub(' ', line) for line in reviews]
    return reviews

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)

In [50]:
print(reviews_train_clean[0])

bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt


In [51]:
# count vectorization 
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)
X = cv.transform(reviews_train_clean)
X_final_test = cv.transform(reviews_test_clean)

In [52]:
# building the model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# first 12.5k are positive and next 12.5k are negative
labels = [1 if i < 12500 else 0 for i in range(25000)]

X_train, X_test, y_train, y_test = train_test_split(X, labels, train_size=0.75)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c, solver='liblinear')
    lr.fit(X_train, y_train)
    print('Accuracy for C = {}: {}%'.format(c, accuracy_score(y_test, lr.predict(X_test)) * 100))


Accuracy for C = 0.01: 87.29599999999999%
Accuracy for C = 0.05: 87.968%
Accuracy for C = 0.25: 87.712%
Accuracy for C = 0.5: 87.376%
Accuracy for C = 1: 87.136%


In [53]:
# Testing the model
final_lr = LogisticRegression(C=0.05, solver='liblinear')
final_lr.fit(X, labels)
print('Final Accuracy of {}%'.format(accuracy_score(labels, lr.predict(X_final_test)) * 100))

Final Accuracy of 86.64%


In [54]:
# Sanity Check
feature_to_coef = {word: coef for word, coef in zip(cv.get_feature_names(), final_lr.coef_[0])}

print('Best Positive Scores')
for best_pos in sorted(feature_to_coef.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(best_pos)

print('\nBest Negative Scores')
for best_neg in sorted(feature_to_coef.items(), key=lambda x: x[1])[:5]:
    print(best_neg)                                             

Best Positive Scores
('excellent', 0.9292549111870664)
('perfect', 0.7907005791290077)
('great', 0.6745323547742257)
('amazing', 0.6127039928254254)
('superb', 0.6019368002203161)

Best Negative Scores
('worst', -1.3645958977380297)
('waste', -1.166424205957553)
('awful', -1.032418942642618)
('poorly', -0.8752018765326353)
('boring', -0.8563543412064868)


In [55]:
# improving the model by removing stopwords
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')

def remove_stopwords(corpus):
    ret = []
    for review in corpus:
        ret.append(' '.join([word for word in review.split() if word not in english_stopwords]))
    return ret    
        
no_stopwords = remove_stopwords = remove_stopwords(reviews_train_clean)
print(no_stopwords[0])

bromwell high cartoon comedy ran time programs school life teachers 35 years teaching profession lead believe bromwell highs satire much closer reality teachers scramble survive financially insightful students see right pathetic teachers pomp pettiness whole situation remind schools knew students saw episode student repeatedly tried burn school immediately recalled high classic line inspector im sack one teachers student welcome bromwell high expect many adults age think bromwell high far fetched pity isnt


In [56]:
# stemming 
def stem_corpus(corpus):
    from nltk.stem.porter import PorterStemmer
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word)for word in review.split()]) for review in corpus]

stemmed = stem_corpus(reviews_train_clean)
print(stemmed[0])

bromwel high is a cartoon comedi it ran at the same time as some other program about school life such as teacher my 35 year in the teach profess lead me to believ that bromwel high satir is much closer to realiti than is teacher the scrambl to surviv financi the insight student who can see right through their pathet teacher pomp the petti of the whole situat all remind me of the school i knew and their student when i saw the episod in which a student repeatedli tri to burn down the school i immedi recal at high a classic line inspector im here to sack one of your teacher student welcom to bromwel high i expect that mani adult of my age think that bromwel high is far fetch what a piti that it isnt


In [57]:
# lemmatization
def lemmatize_corpus(corpus):
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word)for word in review.split()]) for review in corpus]

lemmatized = lemmatize_corpus(reviews_train_clean)  
print(lemmatized[0])

bromwell high is a cartoon comedy it ran at the same time a some other program about school life such a teacher my 35 year in the teaching profession lead me to believe that bromwell high satire is much closer to reality than is teacher the scramble to survive financially the insightful student who can see right through their pathetic teacher pomp the pettiness of the whole situation all remind me of the school i knew and their student when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector im here to sack one of your teacher student welcome to bromwell high i expect that many adult of my age think that bromwell high is far fetched what a pity that it isnt


In [58]:
# n-grams
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1,2))

# todo ngram with different training sets that were developed above 
ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_final_test = ngram_vectorizer.transform(reviews_test_clean)

X_train, X_test, y_train, y_test = train_test_split(X, labels, train_size=0.75)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c, solver='liblinear')
    lr.fit(X_train, y_train)
    print('Accuracy for C = {}: {}%'.format(c, accuracy_score(y_test, lr.predict(X_test)) * 100))

Accuracy for C = 0.01: 88.4%
Accuracy for C = 0.05: 89.28%
Accuracy for C = 0.25: 89.744%
Accuracy for C = 0.5: 89.68%
Accuracy for C = 1: 89.75999999999999%


In [59]:
ngrams_lr = LogisticRegression(C=1, solver='liblinear')
ngrams_lr.fit(X, labels)
print('Final Accuracy of {}%'.format(accuracy_score(labels, ngrams_lr.predict(X_final_test)) * 100))

Final Accuracy of 89.74%


In [60]:
# word counts 
wc_vectorizer = CountVectorizer(binary=False)
wc_vectorizer.fit(reviews_train_clean)
X = wc_vectorizer.transform(reviews_train_clean)
X_final_test = wc_vectorizer.transform(reviews_test_clean)

X_train, X_test, y_train, y_test = train_test_split(X, labels, train_size=0.75)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c, solver='liblinear')
    lr.fit(X_train, y_train)
    print('Accuracy for C = {}: {}%'.format(c, accuracy_score(y_test, lr.predict(X_test)) * 100))

Accuracy for C = 0.01: 86.992%
Accuracy for C = 0.05: 87.83999999999999%
Accuracy for C = 0.25: 87.728%
Accuracy for C = 0.5: 87.856%
Accuracy for C = 1: 87.664%


In [61]:
wc_lr = LogisticRegression(C=0.05, solver='liblinear')
wc_lr.fit(X, labels)
print('Final Accuracy of {}%'.format(accuracy_score(labels, wc_lr.predict(X_final_test)) * 100))

Final Accuracy of 88.22%


In [64]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(reviews_train_clean)
X = tfidf_vectorizer.transform(reviews_train_clean)
X_final_test = tfidf_vectorizer.transform(reviews_test_clean)

X_train, X_test, y_train, y_test = train_test_split(X, labels, train_size=0.75)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c, solver='liblinear')
    lr.fit(X_train, y_train)
    print('Accuracy for C = {}: {}%'.format(c, accuracy_score(y_test, lr.predict(X_test)) * 100))

Accuracy for C = 0.01: 79.408%
Accuracy for C = 0.05: 83.12%
Accuracy for C = 0.25: 86.60799999999999%
Accuracy for C = 0.5: 87.616%
Accuracy for C = 1: 88.256%


In [65]:
tfidf_lr = LogisticRegression(C=1, solver='liblinear')
tfidf_lr.fit(X, labels)
print('Final Accuracy of {}%'.format(accuracy_score(labels, tfidf_lr.predict(X_final_test)) * 100))

Final Accuracy of 88.24%


In [70]:
import warnings
warnings.filterwarnings("ignore")

# Svm
from sklearn.svm import LinearSVC

ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1,2))
ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_final_test = ngram_vectorizer.transform(reviews_test_clean)

X_train, X_test, y_train, y_test = train_test_split(X, labels, train_size=0.75)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    svm = LinearSVC(C=c)
    svm.fit(X_train, y_train)
    print('Accuracy for C = {}: {}%'.format(c, accuracy_score(y_test, svm.predict(X_test)) * 100))


Accuracy for C = 0.01: 89.92%
Accuracy for C = 0.05: 89.60000000000001%
Accuracy for C = 0.25: 89.55199999999999%
Accuracy for C = 0.5: 89.50399999999999%
Accuracy for C = 1: 89.50399999999999%


In [71]:
svm_final = LinearSVC(C=0.01)
svm_final.fit(X, labels)
print('Final Accuracy of {}%'.format(accuracy_score(labels, svm_final.predict(X_final_test)) * 100))

Final Accuracy of 89.708%


In [72]:
import warnings
warnings.filterwarnings("ignore")

# final Model
stop_words = ['a', 'at', 'in', 'of', 'the']
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1,3), stop_words=stop_words)
ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_final_test = ngram_vectorizer.transform(reviews_test_clean)

X_train, X_test, y_train, y_test = train_test_split(X, labels, train_size=0.75)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    svm = LinearSVC(C=c)
    svm.fit(X_train, y_train)
    print('Accuracy for C = {}: {}%'.format(c, accuracy_score(y_test, svm.predict(X_test)) * 100))

Accuracy for C = 0.01: 88.4%
Accuracy for C = 0.05: 88.4%
Accuracy for C = 0.25: 88.368%
Accuracy for C = 0.5: 88.4%
Accuracy for C = 1: 88.384%


In [73]:
final = LinearSVC(C=0.01)
final.fit(X, labels)
print('Final Accuracy of {}%'.format(accuracy_score(labels, final.predict(X_final_test)) * 100))

Final Accuracy of 90.024%
