# IMDB reviews sentiment analysis

This project demonstrates sentiment analysis, a natural language processing (NLP) task, on IMDB reviews. The dataset contains 748 observations (reviews), each labeled with its sentiment. We first preprocess the reviews to clean the text, then feed the cleaned data to two machine learning algorithms: Logistic Regression and SVM. The best model is then saved so that it can later be tested on live data.

## Importing relevant libraries

In [2]:
import numpy as np
import pandas as pd

import preprocess_kgptalkie as pp # Self created package which is used for preprocessing.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.feature_extraction.text import TfidfVectorizer # Converts raw text into numerical TF-IDF features.

from sklearn.model_selection import GridSearchCV # Searches hyperparameter combinations for the best cross-validated score.
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [3]:
df = pd.read_csv('imdb_reviews.txt', sep = '\t', header = None)
df.columns = ['reviews', 'sentiment']

In [4]:
df.head()

Unnamed: 0,reviews,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [51]:
df.shape

(748, 2)

## Data Cleaning and Preprocessing

In [6]:
df['reviews'] = df['reviews'].apply(pp.cont_exp) # Expand contractions, e.g. "I'm" -> "I am"
df['reviews'] = df['reviews'].apply(pp.remove_accented_chars) # Strip accents, e.g. "é" -> "e"
df['reviews'] = df['reviews'].apply(pp.remove_emails) # Remove e-mail addresses
df['reviews'] = df['reviews'].apply(pp.remove_urls) # Remove URLs
df['reviews'] = df['reviews'].apply(pp.remove_html_tags) # Remove HTML tags
df['reviews'] = df['reviews'].apply(pp.remove_special_chars) # Remove punctuation and other special characters

In [9]:
df['reviews'] = df['reviews'].apply(lambda x: str(x).lower())
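To make the cleaning steps above concrete, here is a minimal standard-library sketch of roughly what they do. The helper names (`expand_contractions`, `strip_accents`, `clean_review`), the regular expressions, and the small contraction table are assumptions for illustration only; they are not the actual implementation inside `preprocess_kgptalkie`.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny contraction table (the real package may cover far more).
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot", "it's": "it is"}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    # Replace each known contraction with its (lowercase) expanded form.
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def strip_accents(text):
    # Decompose accented characters, then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def clean_review(text):
    text = expand_contractions(text)
    text = strip_accents(text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)           # e-mail addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"[^\w\s]", "", text)                 # punctuation and special characters
    text = re.sub(r"\s+", " ", text).strip()            # collapse repeated whitespace
    return text.lower()


print(clean_review("A very, very, very slow-moving, aimless movie"))
print(clean_review("I'm <b>not</b> a fan of the café scene!"))
```

Because punctuation is deleted rather than replaced with a space, hyphenated words such as "slow-moving" are merged into a single token ("slowmoving"), which matches the cleaned output shown in the next cell.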

In [10]:
df.head()

Unnamed: 0,reviews,sentiment
0,a very very very slowmoving aimless movie abou...,0
1,not sure who was more lost the flat characters...,0
2,attempting artiness with black white and cleve...,0
3,very little music or anything to speak of,0
4,the best scene in the movie was when gerardo i...,1


## Building and training the model

In [11]:
X = df['reviews']
y = df['sentiment']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)

In [13]:
X_train.shape, X_test.shape

((598,), (150,))
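The `stratify = y` argument keeps the class proportions the same in the training and test sets. The following sketch illustrates this on toy data; the names `X_toy` and `y_toy` and the 60/40 class balance are made up for the example and do not come from the IMDB dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 60 negative (0) and 40 positive (1).
X_toy = [f"review {i}" for i in range(100)]
y_toy = [0] * 60 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Both splits keep the original 60/40 class ratio.
print(Counter(y_tr))
print(Counter(y_te))
```

Without `stratify`, a small test set can end up with a noticeably different class balance from the training set, which makes the reported metrics harder to interpret.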

In [30]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(solver = 'liblinear'))
])

In [31]:
hyperparameters = {
    'tfidf__max_df': (0.5, 1.0),
    'tfidf__use_idf' : (True, False),
    'tfidf__ngram_range': ((1,1), (1,2)),
    'tfidf__analyzer': ('word', 'char', 'char_wb'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (1, 2)
}
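It helps to know how much work this grid implies before launching the search. The sketch below counts the candidate combinations using only the standard library. The fold count of 5 is an assumption based on scikit-learn's default when `cv = None` (version 0.22 and later).

```python
from math import prod

hyperparameters = {
    'tfidf__max_df': (0.5, 1.0),
    'tfidf__use_idf': (True, False),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'tfidf__analyzer': ('word', 'char', 'char_wb'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (1, 2),
}

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in hyperparameters.values())

# Assumption: 5-fold cross-validation (the scikit-learn default for cv=None).
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Each candidate is fitted once per fold, so adding one more value to any parameter multiplies the total cost accordingly.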

In [32]:
clf = GridSearchCV(pipe, hyperparameters, n_jobs = -1, cv = None)

In [33]:
clf.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('clf',
                                        LogisticRegression(solver='liblinear'))]),
             n_jobs=-1,
             param_grid={'clf__C': (1, 2), 'clf__penalty': ('l1', 'l2'),
                         'tfidf__analyzer': ('word', 'char', 'char_wb'),
                         'tfidf__max_df': (0.5, 1.0),
                         'tfidf__ngram_range': ((1, 1), (1, 2)),
                         'tfidf__use_idf': (True, False)})

In [34]:
clf.best_estimator_

Pipeline(steps=[('tfidf', TfidfVectorizer()),
                ('clf', LogisticRegression(C=2, solver='liblinear'))])

In [35]:
clf.best_params_

{'clf__C': 2,
 'clf__penalty': 'l2',
 'tfidf__analyzer': 'word',
 'tfidf__max_df': 1.0,
 'tfidf__ngram_range': (1, 1),
 'tfidf__use_idf': True}

In [36]:
clf.best_score_

0.7558543417366946

In [37]:
y_pred = clf.predict(X_test)

In [38]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.70      0.68      0.69        73
           1       0.71      0.73      0.72        77

    accuracy                           0.71       150
   macro avg       0.71      0.71      0.71       150
weighted avg       0.71      0.71      0.71       150
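The notebook imports `accuracy_score` and `confusion_matrix` but does not use them. As a rough illustration of how they complement the classification report, here is a sketch on a handful of made-up labels; `y_true` and `y_hat` are hypothetical and are not the notebook's `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, only to illustrate the API.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("accuracy:", accuracy_score(y_true, y_hat))
print("recall for class 1:", tp / (tp + fn))
```

In the notebook itself, the same calls would presumably be made as `confusion_matrix(y_test, y_pred)` and `accuracy_score(y_test, y_pred)`.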



In [39]:
from sklearn.svm import LinearSVC

In [40]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC())
])

In [41]:
hyperparameters = {
    'tfidf__max_df': (0.5, 1.0),
    'tfidf__use_idf' : (True, False),
    'tfidf__ngram_range': ((1,1), (1,2)),
    'tfidf__analyzer': ('word', 'char', 'char_wb'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (1, 2)
}

In [43]:
clf = GridSearchCV(pipe, hyperparameters, n_jobs = -1, cv = 5)

In [44]:
clf.fit(X_train, y_train)

        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
 0.75743697 0.72904762 0.762493   0.73570028 0.76078431 0.72901961
 0.75745098 0.72731092 0.50326331 0.51492997 0.73906162 0.73232493
 0.58191877 0.5869888  0.71735294 0.70228291 0.50326331 0.51492997
 0.74579832 0.73736695 0.59029412 0.58696078 0.71732493 0.69560224
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
 0.74910364 0.72233894 0.76078431 0.73736695 0.7507423  0.72067227
 0.76078431 0.73065826 0.50158263 0.51659664 0.72739496 0.73732493
 0.58361345 0.58029412 0.72740896 0.70061625 0.50158263 0.51659664
 0.73910364 0.7290056  0.58696078 0.58529412 0.71740896 0.6888

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('clf', LinearSVC())]),
             n_jobs=-1,
             param_grid={'clf__C': (1, 2), 'clf__penalty': ('l1', 'l2'),
                         'tfidf__analyzer': ('word', 'char', 'char_wb'),
                         'tfidf__max_df': (0.5, 1.0),
                         'tfidf__ngram_range': ((1, 1), (1, 2)),
                         'tfidf__use_idf': (True, False)})
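The blocks of `nan` scores above are consistent with every candidate that uses `clf__penalty = 'l1'` failing to fit: `LinearSVC` does not support the L1 penalty with its default squared hinge loss when the dual formulation is used. The sketch below reproduces the failure on a tiny made-up dataset (`X_small` and `y_small` are hypothetical) and shows that `dual=False` avoids it. Whether the notebook's scikit-learn version defaulted to `dual=True` is an assumption inferred from the output.

```python
from sklearn.svm import LinearSVC

# Hypothetical, linearly separable toy data.
X_small = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
y_small = [0, 1, 0, 1]

# L1 penalty + squared hinge loss + dual=True is an unsupported combination.
try:
    LinearSVC(penalty="l1", dual=True).fit(X_small, y_small)
    l1_dual_error = None
except ValueError as exc:
    l1_dual_error = str(exc)

print(l1_dual_error)

# Solving the primal problem instead (dual=False) supports the L1 penalty.
svc_l1 = LinearSVC(penalty="l1", dual=False).fit(X_small, y_small)
print(svc_l1.predict(X_small))
```

One possible fix, under the same assumption, is to construct the estimator as `LinearSVC(dual=False)` in the pipeline so that both penalties in the grid can be evaluated.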

In [45]:
clf.best_params_

{'clf__C': 1,
 'clf__penalty': 'l2',
 'tfidf__analyzer': 'word',
 'tfidf__max_df': 0.5,
 'tfidf__ngram_range': (1, 2),
 'tfidf__use_idf': True}

In [46]:
clf.best_score_

0.7624929971988796

In [49]:
import pickle as pkl

In [50]:
# Use a context manager so the file handle is closed after writing.
with open('model.pkl', 'wb') as f:
    pkl.dump(clf, f)
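To test the saved model on live data, it can be loaded back and asked to predict on new reviews. The following is a minimal round-trip sketch under stated assumptions: the four training sentences, the temporary file path, and the `live_reviews` list are made up, and a small pipeline stands in for the fitted `GridSearchCV` object from the notebook.

```python
import os
import pickle as pkl
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical miniature training set standing in for the IMDB reviews.
texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible film hated it",
    "awful boring terrible plot",
]
labels = [1, 1, 0, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])
model.fit(texts, labels)

# Save the fitted pipeline, then load it back from disk.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pkl.dump(model, f)
with open(path, "rb") as f:
    loaded = pkl.load(f)

# New text goes through the same vectorizer that was fitted during training.
live_reviews = ["what a great movie", "boring and terrible"]
predictions = loaded.predict(live_reviews)
print(predictions)
```

Live reviews should be cleaned with the same preprocessing steps as the training data before calling `predict`. Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.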