# This notebook builds a sentiment analysis model for a movie reviews dataset.

### Import necessary libraries

In [3]:
import pandas as pd # for reading csv file
import re # for preprocessing text
import string # for preprocessing text
from sklearn.feature_extraction.text import CountVectorizer # to create Bag of words
from sklearn.model_selection import train_test_split  # for splitting data
from sklearn.naive_bayes import GaussianNB # to build classifier model
from sklearn.preprocessing import LabelEncoder # to convert classes to number 
from sklearn.metrics import accuracy_score # to calculate accuracy
import nltk # for processing texts
from nltk.corpus import stopwords # list of stop words
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# download data
!wget https://raw.githubusercontent.com/ZarahShibli/sentiment_analysis/master/data/IMDB_Dataset.csv

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
--2020-03-05 11:14:24--  https://raw.githubusercontent.com/ZarahShibli/sentiment_analysis/master/data/IMDB_Dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66212309 (63M) [text/plain]
Saving to: ‘IMDB_Dataset.csv’


2020-03-05 11:14:27 (231 MB/s) - ‘IMDB_Dataset.csv’ saved [66212309/66212309]



## Model Architecture

![Model steps](https://user-images.githubusercontent.com/42017072/75925841-7890d480-5e7a-11ea-8311-576a5ec34e20.png)


## Read data
The [IMDB dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) contains 50K movie reviews, each labeled as positive or negative, for binary sentiment classification.

In [8]:
data = pd.read_csv(r"IMDB_Dataset.csv")
data.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [9]:
# number of samples in dataset
data.shape

(50000, 2)

In [10]:
# count of each type 
data.sentiment.value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [11]:
# count of missing values in dataset 
data.isna().sum()

review       0
sentiment    0
dtype: int64

## Clean text

In [12]:
data['review'][20]

"After the success of Die Hard and it's sequels it's no surprise really that in the 1990s, a glut of 'Die Hard on a .....' movies cashed in on the wrong guy, wrong place, wrong time concept. That is what they did with Cliffhanger, Die Hard on a mountain just in time to rescue Sly 'Stop or My Mom Will Shoot' Stallone's career.<br /><br />Cliffhanger is one big nit-pickers dream, especially to those who are expert at mountain climbing, base-jumping, aviation, facial expressions, acting skills. All in all it's full of excuses to dismiss the film as one overblown pile of junk. Stallone even managed to get out-acted by a horse! However, if you an forget all the nonsense, it's actually a very lovable and undeniably entertaining romp that delivers as plenty of thrills, and unintentionally, plenty of laughs.<br /><br />You've got to love John Lithgows sneery evilness, his tick every box band of baddies, and best of all, the permanently harassed and hapless 'turncoat' agent, Rex Linn as Travers

In [0]:
def clean_text(text):
  '''
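  DESCRIPTION:
  Clean a review: lowercase it and keep only letters separated by single spaces
  INPUT: 
  text: string
  OUTPUT: 
  text: the cleaned string
  ''' 
  text = text.lower() # convert letters to lower case
  text = re.sub("[^a-zA-Z]", " ", text) # replace every non-letter (digits, punctuation, HTML brackets) with a space
  text = re.sub(r'\d+', '', text) # remove numbers (none remain after the previous step)
  text = re.sub(r'http\S+', '', text) # remove tokens starting with "http" (URLs were already split into words above)
  text = text.translate(str.maketrans('','', string.punctuation)) # remove punctuation (none remains after the non-letter step)
  text = re.sub(' +', ' ',text) # collapse repeated spaces
  text = text.strip() # remove leading and trailing whitespace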

  #text = ' '.join([word for word in text.split() if word not in stopwords.words("english")]) # remove stop words
  #lemma = nltk.WordNetLemmatizer() # define lemmatizer
  #text = ' '.join([lemma.lemmatize(word) for word in text.split()]) 
  return text

In [0]:
# apply the cleaning function to every review
data['review'] = data['review'].apply(clean_text)

In [15]:
data['review'][20]

'after the success of die hard and it s sequels it s no surprise really that in the s a glut of die hard on a movies cashed in on the wrong guy wrong place wrong time concept that is what they did with cliffhanger die hard on a mountain just in time to rescue sly stop or my mom will shoot stallone s career br br cliffhanger is one big nit pickers dream especially to those who are expert at mountain climbing base jumping aviation facial expressions acting skills all in all it s full of excuses to dismiss the film as one overblown pile of junk stallone even managed to get out acted by a horse however if you an forget all the nonsense it s actually a very lovable and undeniably entertaining romp that delivers as plenty of thrills and unintentionally plenty of laughs br br you ve got to love john lithgows sneery evilness his tick every box band of baddies and best of all the permanently harassed and hapless turncoat agent rex linn as travers br br he may of been henry in portrait of a seri
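
The cleaned review above still contains `br` tokens left over from `<br />` tags, and in `clean_text` the link pattern runs only after the non-letter substitution, when any URL has already been split into separate words. Below is a minimal sketch of an alternative ordering, assuming it is acceptable to drop HTML tags and URLs entirely. `clean_text_v2` is a hypothetical name, and every output in this notebook was produced with the original `clean_text`, not with this variant.

```python
import re

def clean_text_v2(text):
    """Hypothetical variant of clean_text: strip tags and URLs before non-letters."""
    text = text.lower()                      # convert letters to lower case
    text = re.sub(r'<[^>]+>', ' ', text)     # drop HTML tags such as <br />
    text = re.sub(r'http\S+', ' ', text)     # drop URLs while they are still intact
    text = re.sub(r'[^a-z]', ' ', text)      # replace remaining non-letters with a space
    text = re.sub(r' +', ' ', text)          # collapse repeated spaces
    return text.strip()                      # remove leading and trailing whitespace

print(clean_text_v2("Great film!<br /><br />See http://example.com/review for more."))
# prints: great film see for more
```

Switching to a variant like this would change the vocabulary (for example, `br` would no longer appear as a feature), so the model would need to be retrained and re-evaluated.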

## Create Bag of words

In [16]:
max_features = 1500
count_vector = CountVectorizer(max_features = max_features)  
X = count_vector.fit_transform(data['review']).toarray() 
X

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 2, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])
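
Each row of `X` is one review and each column is one vocabulary word; the cell value is how often that word occurs in that review. The idea can be sketched by hand on a toy corpus. This is only an illustration with hypothetical reviews and helper names (`docs`, `vocab`, `to_counts`), not what `CountVectorizer` does internally: with its default settings the real class also drops single-character tokens, and with `max_features` it keeps only the most frequent terms.

```python
from collections import Counter

docs = ["good movie good cast", "bad movie"]  # hypothetical toy reviews

# Vocabulary: every distinct token, sorted alphabetically, mapped to a column index
vocab = {word: i for i, word in enumerate(sorted({w for d in docs for w in d.split()}))}

def to_counts(doc):
    """Return one row of the count matrix for a whitespace-tokenized document."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

matrix = [to_counts(d) for d in docs]
print(vocab)   # {'bad': 0, 'cast': 1, 'good': 2, 'movie': 3}
print(matrix)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```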

In [17]:
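# note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
print("most using {} words: {} ".format(max_features, count_vector.get_feature_names()))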

most using 1500 words: ['ability', 'able', 'about', 'above', 'absolutely', 'accent', 'across', 'act', 'acted', 'acting', 'action', 'actor', 'actors', 'actress', 'actresses', 'acts', 'actual', 'actually', 'adaptation', 'add', 'added', 'adds', 'admit', 'adult', 'adventure', 'affair', 'after', 'again', 'against', 'age', 'agent', 'ago', 'agree', 'ahead', 'air', 'alien', 'alive', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'amazing', 'america', 'american', 'americans', 'among', 'amount', 'amusing', 'an', 'and', 'animated', 'animation', 'annoying', 'another', 'answer', 'anti', 'any', 'anyone', 'anything', 'anyway', 'apart', 'apparently', 'appeal', 'appear', 'appearance', 'appeared', 'appears', 'appreciate', 'approach', 'are', 'aren', 'army', 'around', 'art', 'as', 'aside', 'ask', 'aspect', 'aspects', 'at', 'atmosphere', 'attack', 'attempt', 'attempts', 'attention', 'audience', 'audiences', 'available', 'average', 'avoid', 'award', 'away', 'awesome', 'awf

In [18]:
print(count_vector.vocabulary_)

{'one': 922, 'of': 910, 'the': 1298, 'other': 934, 'has': 591, 'mentioned': 828, 'that': 1297, 'after': 26, 'watching': 1425, 'just': 700, 'episode': 408, 'you': 1493, 'll': 768, 'be': 112, 'they': 1308, 'are': 74, 'right': 1081, 'as': 79, 'this': 1315, 'is': 679, 'exactly': 425, 'what': 1439, 'happened': 584, 'with': 1458, 'me': 817, 'br': 162, 'first': 496, 'thing': 1309, 'about': 2, 'was': 1419, 'its': 685, 'and': 54, 'scenes': 1107, 'violence': 1406, 'which': 1444, 'set': 1136, 'in': 659, 'from': 525, 'word': 1466, 'go': 559, 'not': 898, 'show': 1154, 'for': 508, 'or': 930, 'no': 891, 'to': 1330, 'drugs': 370, 'sex': 1140, 'classic': 232, 'use': 1383, 'it': 683, 'called': 182, 'given': 555, 'state': 1227, 'mainly': 793, 'on': 920, 'city': 230, 'an': 53, 'prison': 1017, 'where': 1442, 'all': 37, 'have': 593, 'face': 442, 'so': 1182, 'high': 611, 'home': 624, 'many': 804, 'more': 852, 'death': 313, 'never': 885, 'far': 460, 'away': 96, 'would': 1477, 'say': 1101, 'main': 792, 'appeal

In [19]:
d = pd.DataFrame(X,columns=count_vector.get_feature_names())
d

Unnamed: 0,ability,able,about,above,absolutely,accent,across,act,acted,acting,action,actor,actors,actress,actresses,acts,actual,actually,adaptation,add,added,adds,admit,adult,adventure,affair,after,again,against,age,agent,ago,agree,ahead,air,alien,alive,all,almost,alone,...,without,woman,women,won,wonder,wonderful,word,words,work,worked,working,works,world,worse,worst,worth,worthy,would,wouldn,wow,write,writer,writers,writing,written,wrong,wrote,yeah,year,years,yes,yet,york,you,young,younger,your,yourself,zombie,zombies
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,1,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,2,0
4,0,0,2,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,3,0,0
49996,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
49997,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
49998,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [20]:
# convert classes to number
encoder = LabelEncoder()
y = encoder.fit_transform(data['sentiment'])
y

array([1, 1, 1, ..., 0, 0, 0])
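
`LabelEncoder` sorts the distinct class names and assigns each one its position in that sorted order, which is why `negative` becomes 0 and `positive` becomes 1 here. A minimal pure-Python sketch of the same mapping, using a hypothetical sample of labels:

```python
labels = ["positive", "negative", "positive"]  # hypothetical sample of the sentiment column

classes = sorted(set(labels))                              # ['negative', 'positive']
to_index = {label: i for i, label in enumerate(classes)}   # {'negative': 0, 'positive': 1}
encoded = [to_index[label] for label in labels]            # [1, 0, 1]
decoded = [classes[i] for i in encoded]                    # back to the original strings

print(classes, encoded, decoded)
```

In the notebook itself, the fitted encoder exposes the same sorted list as `encoder.classes_`.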

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size =0.2, random_state=42)


## Build classifier model

### Training model

In [22]:
# Define Gaussian Naive Bayes
model = GaussianNB()

# train model
model.fit(X_train, y_train) 

GaussianNB(priors=None, var_smoothing=1e-09)
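
`GaussianNB` models every feature as a continuous, normally distributed value and expects a dense array, which is why the count matrix was converted with `.toarray()` earlier. For word counts, `MultinomialNB` is a common alternative that works directly on the sparse matrix produced by `CountVectorizer`. The sketch below uses a hypothetical toy corpus and is not the model trained in this notebook; it makes no claim about how the accuracy on the IMDB data would change.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, not taken from the IMDB data
train_texts = ["good great film", "great fun movie", "bad awful film", "awful boring movie"]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline keeps the count matrix sparse, so no .toarray() call is needed
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(["great good", "bad awful"])
print(predictions.tolist())
```

Wrapping the vectorizer and the classifier in one pipeline also means new text can be passed to `predict` as raw strings, without a separate `transform` step.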

### Evaluate model

In [23]:
# Predicting the Test set results 
y_pred = model.predict(X_test) 
y_pred

array([0, 1, 0, ..., 1, 0, 0])

In [24]:

print('Test model accuracy: ',accuracy_score(y_test, y_pred))

Test model accuracy:  0.799
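
Accuracy alone does not show which kind of mistake the model makes. Counting true and false positives and negatives gives precision and recall as well. The sketch below uses hypothetical labels rather than the notebook's `y_test` and `y_pred`, and `confusion_counts` is an illustrative helper; `sklearn.metrics` provides `confusion_matrix` and `classification_report` for the same purpose.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

# Hypothetical labels, not the notebook's y_test / y_pred
y_true_demo = [1, 1, 1, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / len(y_true_demo)   # share of all predictions that are correct
precision = tp / (tp + fp)                # share of predicted positives that are correct
recall = tp / (tp + fn)                   # share of actual positives that were found

print(tp, fp, fn, tn)  # 2 1 1 2
print(accuracy, precision, recall)
```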


### Test with new review

In [25]:
# input statement
test_review = ['This is a bad movie'] 

# convert the text to a count vector using the fitted vocabulary
test_vector = count_vector.transform(test_review)
test_vector = test_vector.toarray()

# predict the class and decode it back to its label
text_predict_class = encoder.inverse_transform(model.predict(test_vector))
print(test_review[0], 'is: ',text_predict_class[0])

This is a bad movie is:  negative
