https://kgptalkie.com/sentiment-analysis-using-scikit-learn/ 



In this project we will use the IMDB movie reviews dataset stored in this GitHub repository: https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k

In [None]:
import pandas as pd
import numpy as np

**git clone** is a Git command that copies an existing repository, including its files and history, into a new local directory.

In [None]:
!git clone https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k.git

Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (10/10), done.


The whole repository is now available in this Colab session, in a directory named **'IMDB-Movie-Reviews-Large-Dataset-50k'**.

## Reading the Excel training file into a pandas DataFrame

In [None]:
df = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx')

In [None]:
df.head(10)

,Reviews,Sentiment
0,"When I first tuned in on this morning news, I ...",neg
1,"Mere thoughts of ""Going Overboard"" (aka ""Babes...",neg
2,Why does this movie fall WELL below standards?...,neg
3,Wow and I thought that any Steven Segal movie ...,neg
4,"The story is seen before, but that does'n matt...",neg
5,"Like so many media experiments, this amateuris...",neg
6,This game has the(dis)honor of being the first...,neg
7,I think this still is the best routine. There ...,pos
8,"As far as parody films go, there are few that ...",pos
9,Big Bad Ralph is also on the not so squeazy tr...,neg


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Reviews    25000 non-null  object
 1   Sentiment  25000 non-null  object
dtypes: object(2)
memory usage: 390.8+ KB


In [None]:
df['Sentiment'].value_counts()

neg    12500
pos    12500
Name: Sentiment, dtype: int64

## **TF-IDF**


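TF-IDF (term frequency-inverse document frequency) scores how distinctive a word is for a document by weighing the number of times the word appears in that document against the number of documents that contain the word.

Because words that are rare across the corpus receive more weight than words that appear almost everywhere, TF-IDF preserves some information about which words carry the meaning of a text.

**E.g. 'She is beautiful'**: here 'beautiful' will typically have more importance than 'she' or 'is', because it occurs in far fewer documents.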
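
To make the weighting concrete, here is a minimal pure-Python sketch on a three-sentence toy corpus (the corpus is invented for illustration). It uses the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and leaves out the final L2 normalisation for brevity.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = ["she is beautiful", "she is here", "he is tall"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df_counts = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + df_counts[word])) + 1

def tfidf(tokens):
    # Raw term count times IDF, without length normalisation.
    tf = Counter(tokens)
    return {word: count * idf(word) for word, count in tf.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

In the first sentence, 'beautiful' (found in one document) receives the largest weight, 'she' (two documents) comes next, and 'is' (every document) receives the smallest.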

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

## **Text Preprocessing**

In [None]:
!pip install spacy==2.2.3
!python -m spacy download en_core_web_sm
!pip install beautifulsoup4==4.9.1
!pip install textblob==0.15.3

Collecting spacy==2.2.3
  Downloading https://files.pythonhosted.org/packages/91/76/1f30264c433f9c3c84171fa03f4b6bb5f3303df7781d21554d25045873f4/spacy-2.2.3-cp37-cp37m-manylinux1_x86_64.whl (10.4MB)
Collecting thinc<7.4.0,>=7.3.0
  Downloading https://files.pythonhosted.org/packages/32/53/d11d2faa6921e55c37ad2cd56b0866a9e6df647fb547cfb69a50059d759c/thinc-7.3.1-cp37-cp37m-manylinux1_x86_64.whl (2.2MB)
Installing collected packages: thinc, spacy
  Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
Successfully installed spacy-2.2.3 thinc-7.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
Collecting beautifulsoup4==4.9.

In [None]:
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git

Collecting git+https://github.com/laxmimerit/preprocess_kgptalkie.git
  Cloning https://github.com/laxmimerit/preprocess_kgptalkie.git to /tmp/pip-req-build-x0mzpsad
  Running command git clone -q https://github.com/laxmimerit/preprocess_kgptalkie.git /tmp/pip-req-build-x0mzpsad
Building wheels for collected packages: preprocess-kgptalkie
  Building wheel for preprocess-kgptalkie (setup.py) ... done
  Created wheel for preprocess-kgptalkie: filename=preprocess_kgptalkie-0.1.3-cp37-none-any.whl size=11743 sha256=b4a027e63a587b959cc19913b282e23a6dbb3f14288e0ed9dae4ed4829b2a34f
  Stored in directory: /tmp/pip-ephem-wheel-cache-m_dpcktk/wheels/a8/18/22/90afa4bd43247fb9a75b710a4a3fcd94966c022ce9e3c7d0a6
Successfully built preprocess-kgptalkie
Installing collected packages: preprocess-kgptalkie
Successfully installed preprocess-kgptalkie-0.1.3


Next we define a `get_clean` function that takes one review from the 'Reviews' column and applies the following steps:

In [None]:
"""
Step 1: Lowercase the text, delete backslashes, and replace underscores with spaces.
Step 2: Expand contractions, then remove e-mail addresses and URLs.
Step 3: Remove HTML tags and accented characters.
Step 4: Remove special characters.
Step 5: Collapse any character repeated three or more times in a row into a single character, so stretched words become readable again.
E.g. x = 'lllooooovvveeee youuuu'
x = re.sub("(.)\\1{2,}", "\\1", x)
print(x)
-------
love you
"""

In [None]:
import preprocess_kgptalkie as ps
import re

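def get_clean(x):
    # Lowercase, drop backslashes, and turn underscores into spaces
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.cont_exp(x)               # expand contractions
    x = ps.remove_emails(x)
    x = ps.remove_urls(x)
    x = ps.remove_html_tags(x)
    x = ps.remove_accented_chars(x)
    x = ps.remove_special_chars(x)
    # Raw strings keep \1 as a regex backreference; without the r prefix,
    # "\1" is the control character chr(1) and repeated letters are not collapsed.
    x = re.sub(r"(.)\1{2,}", r"\1", x)    # collapse 3+ repeated characters
    return x

df['Reviews'] = df['Reviews'].apply(get_clean)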
df.head()

,Reviews,Sentiment
0,when i first tuned in on this morning news i t...,neg
1,mere thoughts of going overboard aka babes aho...,neg
2,why does this movie fall well below standards ...,neg
3,wow and i thought that any steven segal movie ...,neg
4,the story is seen before but that doesand matt...,neg


In [None]:
# Example 
x = 'lllooooovvveeee youuuu'
x = re.sub("(.)\\1{2,}", "\\1", x)
print(x)

love you


In [None]:
df.iloc[3].Reviews

'wow and i thought that any steven segal movie was bad every time i thought that the movie could not get worse it proved me wrong the story was good but the actors could not carry it off also they made a lot of mistakes on how proper archiological digs are done for instance you do not handle artifacts untill they are catologed and accounted for the biggest crime in casting was the archiologist girl she is a weak actress and i felt that her acting really made the movie less realistic then it already was the whole concept of the knights templar being underground all these years seemed pretty stupid to me i like the idea of how they disappeared and stuff so that almost seemed depressing i thought that the characters wernt explained well enough you did not find out much background and that made it harder to relate to them'
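
If `preprocess_kgptalkie` cannot be installed, most of these cleaning steps can be approximated with the standard library alone. The sketch below is a rough stand-in, not the library's implementation: `simple_clean` is a hypothetical helper, its regular expressions are simplified assumptions, and it skips contraction expansion and accent handling.

```python
import html
import re

def simple_clean(text):
    """Rough stdlib-only approximation of get_clean (hypothetical helper)."""
    text = str(text).lower().replace("\\", "").replace("_", " ")
    text = re.sub(r"\S+@\S+\.\S+", " ", text)            # e-mail addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = html.unescape(text)                           # entities such as &amp;
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # punctuation and symbols
    text = re.sub(r"(.)\1{2,}", r"\1", text)             # 3+ repeats -> one character
    return re.sub(r"\s+", " ", text).strip()             # tidy up whitespace

sample = "Sooooo <br />GOOD!!! Mail me_at bob@example.com or see http://example.com"
print(simple_clean(sample))
```

Note that the pattern `(.)\1{2,}` only fires on three or more repeats, so ordinary double letters such as the "oo" in "good" are left alone.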

### **TfidfVectorizer**

In [None]:
tfidf = TfidfVectorizer(max_features=5000)

In [None]:
X = df['Reviews']
y = df['Sentiment']

X = tfidf.fit_transform(X)
X

<25000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 2845128 stored elements in Compressed Sparse Row format>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
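
One caveat about the cells above: `tfidf.fit_transform` runs on all 25,000 reviews before `train_test_split`, so the vocabulary and IDF weights are learned partly from reviews that later end up in the test set. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that fitting only ever sees the training portion. The sketch below uses a tiny invented corpus rather than the IMDB data, and all variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented corpus; in the notebook this would be df['Reviews'] and df['Sentiment'].
texts = [
    "a wonderful film with great acting", "great story and a wonderful cast",
    "i loved this wonderful movie", "great fun and great acting",
    "a terrible film with awful acting", "awful story and a terrible cast",
    "i hated this terrible movie", "awful boring and badly acted",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# Split the raw text first, keeping the class balance in both parts.
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the SVM.
model = make_pipeline(TfidfVectorizer(max_features=5000), LinearSVC())
model.fit(text_train, label_train)

# Raw strings go straight in; the fitted vectorizer is applied automatically.
test_predictions = model.predict(text_test)
print(test_predictions)
print(model.predict(["a wonderful movie with great acting"]))
```

A side benefit of this design is that new reviews can be passed to `model.predict` as plain strings, without a separate `transform` call.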

### **Linear SVC**

In [None]:
clf = LinearSVC()
clf.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.88      0.87      0.87      2480
         pos       0.87      0.88      0.88      2520

    accuracy                           0.87      5000
   macro avg       0.87      0.87      0.87      5000
weighted avg       0.87      0.87      0.87      5000
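
The columns of this report follow directly from counts of correct and incorrect predictions. A minimal sketch with a handful of made-up labels (not the notebook's predictions) shows how precision, recall, and F1 are derived for one class; `prf` is a hypothetical helper written for this illustration.

```python
# Made-up labels for illustration; not taken from the IMDB run.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_hat = ["pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]

def prf(y_true, y_hat, positive):
    """Precision, recall, and F1 for the class named `positive`."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    precision = tp / (tp + fp)   # of everything predicted `positive`, how much was right
    recall = tp / (tp + fn)      # of everything truly `positive`, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

precision, recall, f1 = prf(y_true, y_hat, "pos")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The `support` column is simply the number of true examples of each class in the test set.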



### **Random Forest Classifier**

In [None]:
clf = RandomForestClassifier(criterion='entropy', random_state=223)

In [None]:
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=223,
                       verbose=0, warm_start=False)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.82      0.83      0.83      2480
         pos       0.83      0.82      0.83      2520

    accuracy                           0.83      5000
   macro avg       0.83      0.83      0.83      5000
weighted avg       0.83      0.83      0.83      5000



### **Test**

In [None]:
x = 'this movie is really good. thanks a lot for making it'

x = get_clean(x)
vec = tfidf.transform([x])

In [None]:
vec.shape

(1, 5000)

In [None]:
clf.predict(vec)

array(['pos'], dtype=object)
