# IMDB review sentiment prediction
Create a classification model to predict the sentiment labels of movie reviews.

## objectives
* read all text data
* text preprocessing
* text vectorization
* train classification models
* evaluate models
* analyze the most important features

In [22]:
import os
import numpy as np
import pandas as pd

## read all text data

In [17]:
def read_data(data_folder):
    text_lst = []
    for file in os.listdir(data_folder):
        if '.txt' in file:
            with open(os.path.join(data_folder, file), 'r', encoding='utf-8') as f:
                review = f.read()
                text_lst.append(review)
    return text_lst
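
The four folder paths in the next cell differ only in the split (`train`/`test`) and the label (`neg`/`pos`). A minimal sketch of deriving them from one base directory, which leaves less room for copy-paste slips. The helpers `read_reviews` and `read_split` are hypothetical names, not part of the notebook, and the temporary directory only stands in for the real dataset layout. Unlike `'.txt' in file`, the glob matches the `.txt` suffix only.

```python
import tempfile
from pathlib import Path

def read_reviews(folder):
    """Return the text of every .txt file in `folder`, in sorted filename order."""
    return [p.read_text(encoding='utf-8') for p in sorted(Path(folder).glob('*.txt'))]

def read_split(base_dir, split):
    """Hypothetical helper: load the (neg, pos) reviews of one split, 'train' or 'test'."""
    base = Path(base_dir)
    return read_reviews(base / split / 'neg'), read_reviews(base / split / 'pos')

# Tiny stand-in for the dataset layout so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    for label, text in [('neg', 'dull'), ('pos', 'lovely')]:
        folder = Path(tmp) / 'test' / label
        folder.mkdir(parents=True)
        (folder / '0_1.txt').write_text(text, encoding='utf-8')
    demo_neg, demo_pos = read_split(tmp, 'test')

print(demo_neg, demo_pos)
```

With the real data, `read_split(base_dir, 'train')` and `read_split(base_dir, 'test')` would replace the four separate `read_data` calls.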

In [27]:
train_neg = read_data(r'D:\Google Drive receive\Datasets\IMDB movie review data\train\neg')
train_pos = read_data(r'D:\Google Drive receive\Datasets\IMDB movie review data\train\pos')
test_neg = read_data(r'D:\Google Drive receive\Datasets\IMDB movie review data\test\neg')
test_pos = read_data(r'D:\Google Drive receive\Datasets\IMDB movie review data\test\pos')

In [28]:
train_data = list(zip(train_neg, np.zeros(len(train_neg)))) + list(zip(train_pos, np.ones(len(train_pos))))
test_data = list(zip(test_neg, np.zeros(len(test_neg)))) + list(zip(test_pos, np.ones(len(test_pos))))

In [30]:
train_df = pd.DataFrame(train_data, columns=['review', 'label'])
test_df = pd.DataFrame(test_data, columns=['review', 'label'])

In [33]:
# shuffle data frames
train_df = train_df.sample(frac=1).reset_index(drop=True)
test_df = test_df.sample(frac=1).reset_index(drop=True)
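
`sample(frac=1)` draws a different permutation on every run, so row order (and anything that depends on it) changes between runs. A minimal sketch of a reproducible shuffle, assuming a fixed `random_state`; the seed value, the toy frame, and the `shuffle_frame` helper are illustrative and not from the notebook.

```python
import pandas as pd

# Toy frame standing in for train_df / test_df.
toy_df = pd.DataFrame({'review': ['a', 'b', 'c', 'd'],
                       'label': [0.0, 0.0, 1.0, 1.0]})

def shuffle_frame(df, seed=42):
    """Return a shuffled copy of df with a fresh 0..n-1 index."""
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)

shuffled_once = shuffle_frame(toy_df)
shuffled_again = shuffle_frame(toy_df)

# Same seed, same order; the set of rows is unchanged.
print(shuffled_once.equals(shuffled_again))
```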

In [37]:
train_df.tail()

|  | review | label |
|---|---|---|
| 24995 | I remember seeing this movie a long time ago o... | 0.0 |
| 24996 | In my knowledge, Largo winch was a famous Belg... | 0.0 |
| 24997 | Police Story is one of Jackie Chan's classic f... | 1.0 |
| 24998 | The name (Frau) of the main character is the G... | 0.0 |
| 24999 | One of the best musicals ever made, this is an... | 1.0 |


## text preprocessing

In [52]:
import re
from nltk.corpus import stopwords
import spacy

In [56]:
nlp = spacy.load('en_core_web_sm')

In [63]:
for word in nlp('My name is William'):
    print(word.lemma_)

-PRON-
name
be
William


In [73]:
stopwords_lst = stopwords.words('english')

def text_preprocess(text):
    """
    :param text: a string
    :return: a string
    """
    text = text.lower() # lowercasing text
    text = re.sub(r'[/(){}\[\]|@,;]', ' ', text) # replace by white space
    text = re.sub(r'[^0-9a-z #+_]', '', text) # remove other special characters
    text = " ".join([str(word.lemma_) if word.lemma_ != '-PRON-' else str(word) 
                     for word in nlp(text)]) # lemmatize tokens
    text = " ".join([word for word in text.split() if word not in stopwords_lst]
                   ) # remove stopwords
    return text
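
The raw reviews contain `<br />` line breaks (visible in the sample output below). Passed through the two `re.sub` calls above, each tag is reduced to a stray `br` token that can fuse with the preceding word. A minimal sketch of one way to handle this, assuming a hypothetical `clean_markup` helper that is not part of the notebook and that covers only the regex steps (no lemmatization or stopword removal):

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

def clean_markup(text):
    """Strip <br> tags first, then apply the notebook's two regex steps."""
    text = BR_TAG.sub(' ', text)                   # drop HTML line breaks
    text = text.lower()                            # lowercase
    text = re.sub(r'[/(){}\[\]|@,;]', ' ', text)   # separators become spaces
    text = re.sub(r'[^0-9a-z #+_]', '', text)      # remove other special characters
    return re.sub(r'\s+', ' ', text).strip()       # collapse repeated whitespace

sample = "Great plot.<br /><br />I've seen it (twice)!"
print(clean_markup(sample))
```

Removing the tags before the character filters matters: once `<`, `/` and `>` are stripped, the leftover `br` can no longer be told apart from ordinary text.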

In [74]:
# test preprocess function
print(train_df.iloc[0,0])
print()
print(text_preprocess(train_df.iloc[0,0]))

I saw this ages ago when I was younger and could never remember the title, until one day I was scrolling through John Candy's film credits on IMDb and noticed an entry named "Once Upon a Crime...". Something rang a bell and I clicked on it, and after reading the plot summary it brought back a lot of memories.<br /><br />I've found it has aged pretty well despite the fact that it is not by any means a "great" comedy. It is, however, rather enjoyable and is a good riff on a Hitchcock formula of mistaken identity and worldwide thrills.<br /><br />The movie has a large cast of characters, amongst them an American couple who find a woman's dog while vacationing in Europe and decide to return it to her for a reward - only to find her dead body upon arrival. From there the plot gets crazier and sillier and they go on the run after the police think they are the killers.<br /><br />Kind of a mix between "It's a Mad Mad Mad Mad World" and a lighter Hitchcock feature, this was directed by Eugene 

## text vectorization

In [77]:
# features are the raw review strings; text_preprocess is not applied to them here
X_train = train_df['review']
y_train = train_df['label']
X_test = test_df['review']
y_test = test_df['label']

In [76]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [78]:
# bag of words vectorization
bow_cv = CountVectorizer(max_df=0.8, min_df=5, ngram_range=(1,2))
X_train_bow = bow_cv.fit_transform(X_train)
X_test_bow = bow_cv.transform(X_test)
bow_cv_vocab = bow_cv.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0

In [80]:
# tfidf vectorization
tfidf = TfidfVectorizer(max_df=0.8, min_df=5, ngram_range=(1,2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
tfidf_vocab = tfidf.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0
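
To make the difference between the two vectorizers concrete, here is a small sketch on a three-document toy corpus (the documents and the `toy_*` names are illustrative, not from the notebook). It keeps `ngram_range=(1, 2)` but leaves `min_df` and `max_df` at their defaults, since the notebook's thresholds would discard every term in a corpus this small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ['a great movie', 'a boring movie', 'great acting great plot']

# Unigrams and bigrams, as in the notebook.
toy_bow = CountVectorizer(ngram_range=(1, 2))
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2))

X_toy_bow = toy_bow.fit_transform(toy_docs)
X_toy_tfidf = toy_tfidf.fit_transform(toy_docs)

# Column index of the unigram 'great' in the bag-of-words matrix.
great_col = toy_bow.vocabulary_['great']
# L2 norm of each TF-IDF row (TfidfVectorizer normalizes rows by default).
row_norms = np.linalg.norm(X_toy_tfidf.toarray(), axis=1)

print(sorted(toy_bow.vocabulary_))   # unigrams plus bigrams such as 'great movie'
print(X_toy_bow[2, great_col])       # raw count of 'great' in the third document
print(row_norms)
```

The bag-of-words matrix stores raw counts, while the TF-IDF matrix rescales them and normalizes each row. The default tokenizer also drops one-character tokens such as `a`.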

## model training

In [89]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

In [99]:
# train a logistic regression model
logis_ml = LogisticRegression()
logis_ml.fit(X_train_bow, y_train)
y_pred_bow = logis_ml.predict(X_test_bow)
print(f"score of bow: {logis_ml.score(X_test_bow, y_test)}")
logis_ml.fit(X_train_tfidf, y_train)
y_pred_tfidf = logis_ml.predict(X_test_tfidf)
print(f"score of tfidf: {logis_ml.score(X_test_tfidf, y_test)}")



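score of bow: 1.0
score of tfidf: 0.95192

Note: these scores come from a run in which `test_df` was built from `train_df`, so they measure accuracy on training reviews, not on held-out test data.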
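
`score` reports accuracy only. A confusion matrix also shows which class the errors fall in. The sketch below uses made-up labels (`toy_true`, `toy_pred`) purely for illustration; in the notebook the inputs would be `y_test` and `y_pred_bow` or `y_pred_tfidf`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only, not results from the IMDB data.
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]

acc = accuracy_score(toy_true, toy_pred)
cm = confusion_matrix(toy_true, toy_pred)  # rows: true class, columns: predicted class

print(acc)
print(cm)
```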


## analysis of the most important features

In [132]:
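# refit on the bag-of-words features (logis_ml was last fit on TF-IDF) so that coef_ aligns with bow_cv_vocab
model = logis_ml.fit(X_train_bow, y_train)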

In [134]:
top10_pos_features = [bow_cv_vocab[index] for index in model.coef_[0].argsort()[::-1][:10]]
top10_neg_features = [bow_cv_vocab[index] for index in model.coef_[0].argsort()[:10]]

In [135]:
print(f"Top10 positive features: {top10_pos_features}")
print(f"Top10 negative features: {top10_neg_features}")

Top10 positive features: ['excellent', 'perfect', 'superb', 'wonderful', 'enjoyable', 'amazing', 'well worth', 'incredible', 'brilliant', 'today']
Top10 negative features: ['worst', 'awful', 'boring', 'waste', 'disappointment', 'poorly', 'disappointing', 'poor', 'lacks', 'not worth']
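
The lists above show feature names only. A minimal sketch of keeping each coefficient next to its feature, so the relative strength of the terms is visible. The `top_features` helper and the toy coefficients are hypothetical; in the notebook the inputs would be `model.coef_[0]` and `bow_cv_vocab`.

```python
import numpy as np

def top_features(coef, vocab, k=10):
    """Return the k most positive and k most negative (feature, weight) pairs."""
    order = np.argsort(coef)  # indices from most negative to most positive
    neg = [(vocab[i], float(coef[i])) for i in order[:k]]
    pos = [(vocab[i], float(coef[i])) for i in order[::-1][:k]]
    return pos, neg

# Toy weights for illustration only, not values learned from the IMDB data.
toy_coef = np.array([0.9, -1.2, 0.1, 1.5, -0.4])
toy_vocab = ['excellent', 'worst', 'movie', 'superb', 'boring']

toy_pos, toy_neg = top_features(toy_coef, toy_vocab, k=2)
print(toy_pos)
print(toy_neg)
```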
