## Learning about data

In [1]:
import numpy as np 
import pandas as pd
data = pd.read_csv("C:/Users/91983/IMDB Dataset.csv/IMDB Dataset.csv")
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


__observation(obv)__ 
1) There are 50,000 entries in dataset \
2) There are two features __review__ and __sentiment__ \
3) Both features are __Object__ so they are basically __"text"__ \
4) Both features have 50K data in there respective column, so no __null__ value(i.e. there are no empty rows in features)

In [3]:
data.isnull().sum()

review       0
sentiment    0
dtype: int64

In [4]:
data.sentiment.value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [5]:
data.review.value_counts().head(2)

Loved today's show!!! It was a variety and not solely cooking (which would have been great too). Very stimulating and captivating, always keeping the viewer peeking around the corner to see what was coming up next. She is as down to earth and as personable as you get, like one of us which made the show all the more enjoyable. Special guests, who are friends as well made for a nice surprise too. Loved the 'first' theme and that the audience was invited to play along too. I must admit I was shocked to see her come in under her time limits on a few things, but she did it and by golly I'll be writing those recipes down. Saving time in the kitchen means more time with family. Those who haven't tuned in yet, find out what channel and the time, I assure you that you won't be disappointed.                                                                                                                                                                                                                

In [6]:
data.duplicated().value_counts()

False    49582
True       418
dtype: int64

## Data Cleaning

In [7]:
data = data.sample(30000)

In [8]:
data.shape, data.info(), data.sentiment.value_counts()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 39272 to 29263
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     30000 non-null  object
 1   sentiment  30000 non-null  object
dtypes: object(2)
memory usage: 703.1+ KB


((30000, 2),
 None,
 positive    15045
 negative    14955
 Name: sentiment, dtype: int64)

In [9]:
data.drop_duplicates(inplace=True)

In [10]:
data.duplicated().value_counts()

False    29843
dtype: int64

In [13]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from bs4 import BeautifulSoup
def clean_review(review, stemmer = PorterStemmer(), stop_words = set(stopwords.words("english"))):
    soup = BeautifulSoup(review, "html.parser")
    no_html_review = soup.get_text().lower()
    clean_text = []
    for word in review.split():
        if word not in stop_words and word.isalpha():
            clean_text.append(stemmer.stem(word))
    return " ".join(clean_text)

In [14]:
data.review = data.review.apply(clean_review)

In [15]:
data.review.iloc[3537]

'thi film everi child see grow get distort often pass idea gener gener i grew two differ place although mile i went school friend everi color creed religion first year then i move hillbilli countri unusu even one kid my graduat class high school i say you call honki whitey polit correct peev anyway back film give tri see happen peopl get distort view ignor lack understand cultur thi excel film everyon see especi'

In [16]:
data

Unnamed: 0,review,sentiment
39272,wast time mani much better movi start ok plot ...,negative
41228,as film i obvious read excruci review sarcast ...,positive
46270,unfortun bore obvious way see butcher bootleg ...,negative
27324,along rocket still repeat bbc tv earli mid if ...,positive
13406,there much bad say movi littl plot enough hole...,negative
...,...,...
11345,thi movi next segment pokemon movi suppli ever...,positive
1948,the small california town diablo plagu mysteri...,negative
23591,i saw film i franc i must say confus it stori ...,negative
4057,i heck good time view splendidli surpris erudi...,positive


## Vectorizer

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000)

In [18]:
X = cv.fit_transform(data.review).toarray()

In [19]:
X.shape

(29843, 5000)

In [20]:
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
data.sentiment = lb.fit_transform(data.sentiment)

In [21]:
y = data.iloc[:,-1].values

In [22]:
y.shape

(29843,)

## Model Building

In [23]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=0.2, random_state=42, stratify=data.sentiment)

In [24]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
clf1 = GaussianNB()
clf2 = MultinomialNB()
clf3 = BernoulliNB()
clf1.fit(train_X, train_Y)
clf2.fit(train_X, train_Y)
clf3.fit(train_X, train_Y)

BernoulliNB()

In [25]:
predict1 = clf1.predict(test_X)
predict2 = clf2.predict(test_X)
predict3 = clf3.predict(test_X)

In [26]:
from sklearn.metrics import accuracy_score
print("Gaussin NaiveBayes:", accuracy_score(predict1, test_Y))
print("Multinomial NaiveBayes:", accuracy_score(predict2, test_Y))
print("Benouli NaiveBayes:", accuracy_score(predict3, test_Y))

Gaussin NaiveBayes: 0.6952588373261853
Multinomial NaiveBayes: 0.823756073044061
Benouli NaiveBayes: 0.8210755570447311


## Deployment

In [27]:
features_dict = {}
for i in range(len(cv.get_feature_names())):
    features_dict[cv.get_feature_names()[i]] = i



In [28]:
import pickle

In [29]:
pickle.dump(data, open("dataframe.pkl", "wb"))

In [30]:
pickle.dump(features_dict, open("features_dict.pkl", "wb"))