In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

In [2]:
data = pd.read_csv("IMDB Dataset.csv")

In [3]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
data.review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

## Text Cleaning

1. Sample 10,000 rows
2. Remove HTML tags
3. Remove special characters
4. Convert everything to lowercase
5. Remove stop words
6. Apply stemming
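
Steps 2 to 5 can be previewed on a single string before touching the DataFrame. The sketch below uses only the standard library: the tag-stripping regex and the tiny `TOY_STOP_WORDS` set are illustrative stand-ins (assumptions, not what the notebook uses), since the cells that follow rely on BeautifulSoup and NLTK instead.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"a", "the", "this", "was", "is", "and"}

def toy_clean(text):
    text = re.sub(r"<[^>]+>", " ", text)       # step 2: crude HTML tag removal (regex stand-in)
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)  # step 3: special characters become spaces
    text = text.lower()                        # step 4: lowercase
    words = [w for w in text.split() if w not in TOY_STOP_WORDS]  # step 5: drop stop words
    return " ".join(words)

cleaned = toy_clean("A wonderful little production.<br /><br />The filming was GREAT!")
print(cleaned)  # wonderful little production filming great
```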

In [5]:
data.shape

(50000, 2)

In [6]:
df = data.sample(10000 , random_state = 42)

In [7]:
df.reset_index(inplace= True , drop = True)

In [8]:
df

Unnamed: 0,review,sentiment
0,I really liked this Summerslam due to the look...,positive
1,Not many television shows appeal to quite as m...,positive
2,The film quickly gets to a major chase scene w...,negative
3,Jane Austen would definitely approve of this o...,positive
4,Expectations were somewhat high for me when I ...,negative
...,...,...
9995,Although Casper van Dien and Michael Rooker ar...,negative
9996,I liked this movie. I wasn't really sure what ...,positive
9997,Yes non-Singaporean's can't see what's the big...,positive
9998,"As far as films go, this is likable enough. En...",negative


In [9]:
## Replacing the sentiment column

In [10]:
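# Assign the result back instead of calling replace(..., inplace=True) on a column
# selection, which newer pandas versions warn about and may apply to a temporary copy.
df['sentiment'] = df['sentiment'].map({
    'negative' : 0 ,
    'positive' : 1
})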

In [11]:
df

Unnamed: 0,review,sentiment
0,I really liked this Summerslam due to the look...,1
1,Not many television shows appeal to quite as m...,1
2,The film quickly gets to a major chase scene w...,0
3,Jane Austen would definitely approve of this o...,1
4,Expectations were somewhat high for me when I ...,0
...,...,...
9995,Although Casper van Dien and Michael Rooker ar...,0
9996,I liked this movie. I wasn't really sure what ...,1
9997,Yes non-Singaporean's can't see what's the big...,1
9998,"As far as films go, this is likable enough. En...",0


In [12]:
## 2. Removal of HTML Tags

from bs4 import BeautifulSoup

In [13]:
def remove_html_tags(text):
    soup = BeautifulSoup(text , 'html.parser')
    return soup.get_text()
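
One detail worth checking: by default `get_text()` joins the text fragments with no separator, so the words on either side of a `<br />` can fuse when there is no punctuation between them. A minimal sketch of a variant that passes `separator=' '`; the name `remove_html_tags_spaced` is hypothetical and is not the function used in the rest of this notebook:

```python
from bs4 import BeautifulSoup

def remove_html_tags_spaced(text):
    # Hypothetical variant: join text fragments with a space so that
    # "good<br /><br />movie" does not collapse into "goodmovie".
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text(separator=' ')

sample = "good<br /><br />movie"
fused = BeautifulSoup(sample, 'html.parser').get_text()  # default: no separator
spaced = remove_html_tags_spaced(sample)

print(repr(fused))   # 'goodmovie'
print(repr(spaced))  # 'good movie'
```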

In [14]:
df['review'] = df['review'].apply(remove_html_tags)

In [15]:
df.review[0]

"I really liked this Summerslam due to the look of the arena, the curtains and just the look overall was interesting to me for some reason. Anyways, this could have been one of the best Summerslam's ever if the WWF didn't have Lex Luger in the main event against Yokozuna, now for it's time it was ok to have a huge fat man vs a strong man but I'm glad times have changed. It was a terrible main event just like every match Luger is in is terrible. Other matches on the card were Razor Ramon vs Ted Dibiase, Steiner Brothers vs Heavenly Bodies, Shawn Michaels vs Curt Hening, this was the event where Shawn named his big monster of a body guard Diesel, IRS vs 1-2-3 Kid, Bret Hart first takes on Doink then takes on Jerry Lawler and stuff with the Harts and Lawler was always very interesting, then Ludvig Borga destroyed Marty Jannetty, Undertaker took on Giant Gonzalez in another terrible match, The Smoking Gunns and Tatanka took on Bam Bam Bigelow and the Headshrinkers, and Yokozuna defended th

In [16]:
## Remove Special Characters and convert to lowercase

import re

In [17]:
def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9]' , ' ' , text) # replace every non-alphanumeric character with a space
    text = text.lower() # convert the text to lowercase
    return text

In [18]:
df['review'] = df['review'].apply(clean_text)

In [19]:
df.review[0]

'i really liked this summerslam due to the look of the arena  the curtains and just the look overall was interesting to me for some reason  anyways  this could have been one of the best summerslam s ever if the wwf didn t have lex luger in the main event against yokozuna  now for it s time it was ok to have a huge fat man vs a strong man but i m glad times have changed  it was a terrible main event just like every match luger is in is terrible  other matches on the card were razor ramon vs ted dibiase  steiner brothers vs heavenly bodies  shawn michaels vs curt hening  this was the event where shawn named his big monster of a body guard diesel  irs vs 1 2 3 kid  bret hart first takes on doink then takes on jerry lawler and stuff with the harts and lawler was always very interesting  then ludvig borga destroyed marty jannetty  undertaker took on giant gonzalez in another terrible match  the smoking gunns and tatanka took on bam bam bigelow and the headshrinkers  and yokozuna defended th

## Removing stop words

Stop words are very common words, such as "the", "and", "is", and "in", that are usually filtered out during text processing. They occur in almost every document and carry little meaning on their own, so removing them lets the model focus on the more informative words.
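
A caveat for sentiment analysis: NLTK's English list also contains negations such as "not", and dropping them can flip the meaning of a phrase. The sketch below uses a small hypothetical stop-word set and a hypothetical `keep` argument to show the effect. It illustrates the trade-off only and does not change the notebook's pipeline.

```python
# Toy stop-word set (an assumption for illustration). NLTK's English list is much
# longer and, like this one, contains negations such as "not".
toy_stop_words = {"the", "was", "not", "at", "all"}

def drop_stop_words(text, stop_words, keep=()):
    # Remove stop words, except any listed in `keep`.
    kept = set(keep)
    return " ".join(w for w in text.split() if w not in stop_words or w in kept)

without_negation = drop_stop_words("the movie was not good at all", toy_stop_words)
with_negation = drop_stop_words("the movie was not good at all", toy_stop_words, keep={"not"})

print(without_negation)  # movie good
print(with_negation)     # movie not good
```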

In [20]:
!pip install nltk



In [21]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ankit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [22]:
from nltk.corpus import stopwords

In [23]:
stop_words = set(stopwords.words('english')) # a set gives constant-time membership tests in remove_stopwords

In [24]:
def remove_stopwords(text):
    words = text.split()
    new_words = [word for word in words if word not in stop_words]
    return " ".join(new_words)

In [25]:
df['review'] = df['review'].apply(remove_stopwords)

In [26]:
df

Unnamed: 0,review,sentiment
0,really liked summerslam due look arena curtain...,1
1,many television shows appeal quite many differ...,1
2,film quickly gets major chase scene ever incre...,0
3,jane austen would definitely approve one gwyne...,1
4,expectations somewhat high went see movie thou...,0
...,...,...
9995,although casper van dien michael rooker genera...,0
9996,liked movie really sure started watching enjoy...,1
9997,yes non singaporean see big deal film referenc...,1
9998,far films go likable enough entertaining chara...,0


## Stemming 

Stemming reduces a word to a root form by stripping suffixes, so that different inflections of the same word (for example "watching" and "watch") map to a single token. The resulting stem is not always a dictionary word.

For example:

- "running" -> "run"
- "jumps" -> "jump"
- "really" -> "realli"

In [27]:
from nltk.stem.porter import PorterStemmer

In [28]:
ps = PorterStemmer()
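
Before applying the stemmer to every review, it can help to try it on a few individual words. A quick sketch, assuming NLTK is installed. `ps_demo` is a separate instance used only for this check.

```python
from nltk.stem.porter import PorterStemmer

ps_demo = PorterStemmer()
words = ["running", "jumps", "really", "movie", "quickly"]

# Map each word to its Porter stem; note that several stems are not real words.
stems = {w: ps_demo.stem(w) for w in words}
print(stems)
```

Stems such as "realli" and "movi" look odd, but they only need to be consistent, since the vectorizer treats each distinct stem as one feature.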

In [29]:
def stemming(text):
    words = text.split()
    stem_words = [ps.stem(word) for word in words]
    return " ".join(stem_words)

In [30]:
df['review'] = df['review'].apply(stemming)

In [31]:
df

Unnamed: 0,review,sentiment
0,realli like summerslam due look arena curtain ...,1
1,mani televis show appeal quit mani differ kind...,1
2,film quickli get major chase scene ever increa...,0
3,jane austen would definit approv one gwyneth p...,1
4,expect somewhat high went see movi thought ste...,0
...,...,...
9995,although casper van dien michael rooker gener ...,0
9996,like movi realli sure start watch enjoy noneth...,1
9997,ye non singaporean see big deal film refer fil...,1
9998,far film go likabl enough entertain charact go...,0


In [32]:
## Dividing into x and y variables

x = df.iloc[: , 0].values
y = df.iloc[: , 1].values

In [33]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 5000)

In [34]:
x_ = cv.fit_transform(x).toarray()

In [35]:
x_.shape

(10000, 5000)

In [36]:
## Doing train_test split

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [37]:
x_train , x_test , y_train , y_test = train_test_split(x_ , y , test_size = 0.2 , random_state = 42)
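
Note that the vectorizer in this notebook is fitted on all 10,000 reviews before the split, so its vocabulary is partly chosen from the test rows. A stricter setup splits the raw text first and fits the vectorizer on the training portion only. The sketch below shows one way to do that with a scikit-learn pipeline on a made-up toy corpus. It is an alternative arrangement, not the code that produced the results in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy, already-cleaned texts and labels (assumptions for illustration only).
toy_texts = ["great film", "love great act", "bad plot", "terribl bad film"] * 5
toy_labels = [1, 1, 0, 0] * 5

# Split the raw text first ...
t_train, t_test, l_train, l_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# ... then let the pipeline fit the vectorizer on the training texts only.
toy_model = make_pipeline(CountVectorizer(max_features=5000), BernoulliNB())
toy_model.fit(t_train, l_train)
toy_preds = toy_model.predict(t_test)
print(toy_preds)
```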

In [38]:
## Applying different Naive Bayes algorithms

from sklearn.naive_bayes import GaussianNB , BernoulliNB , MultinomialNB

In [39]:
gb = GaussianNB()
bb = BernoulliNB()
mb = MultinomialNB()

In [40]:
gb.fit(x_train , y_train)
bb.fit(x_train , y_train)
mb.fit(x_train , y_train)

In [41]:
print("The accuracy using Gaussian Naive Bayes is" , accuracy_score(y_test , gb.predict(x_test)))
print("The accuracy using Bernoulli Naive Bayes is" , accuracy_score(y_test , bb.predict(x_test)))
print("The accuracy using Multinomial Naive Bayes is" , accuracy_score(y_test , mb.predict(x_test)))

The accuracy using Gaussian Naive Bayes is 0.687
The accuracy using Bernoulli Naive Bayes is 0.8475
The accuracy using Multinomial Naive Bayes is 0.838
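
The three variants make different assumptions about the features. `GaussianNB` models each column as normally distributed, which is a poor fit for sparse word counts and is one plausible reason for its lower score here. `MultinomialNB` models the counts themselves, and `BernoulliNB` only looks at whether a word is present. The sketch below, on made-up counts, checks that last point: with its default `binarize=0.0`, `BernoulliNB` gives the same probabilities whether it is fed raw counts or 0/1 indicators.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up word counts for four documents and three words.
counts = np.array([[3, 0, 1], [0, 2, 0], [1, 0, 4], [0, 1, 0]])
labels = np.array([1, 0, 1, 0])

on_counts = BernoulliNB().fit(counts, labels)
on_binary = BernoulliNB().fit((counts > 0).astype(int), labels)

# Both models see only presence/absence, so their probabilities agree.
same = np.allclose(on_counts.predict_proba(counts), on_binary.predict_proba(counts))
print(same)  # True
```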


In [62]:
## Bad review

review_1 = "I have a migraine watching this cringe movie. Literally every aspect of this movie is cringe. And in terms of originality, this is Rajneeti meets Kabir Singh with an overdose of papa, papa. Give it a rest, you can love your father without being abusive to the rest of the world. And the great thing is that there is still no closure for the audiences who've been wound up for 3.5 hours in the hope of at least a detente. The 'visionary' directory Sandeep Reddy Vanga goes to town bringing every debunked cliche to life - the cliched 'alpha' shtick, the slap-happy father who favors the son-in-law over his own son, the subjugated wives and mother, the 1 'alpha' killing 100 of people (not an exaggeration) armed with a couple of axes, creating 'body doubles' as if that's a thing. The funniest part is that the climax fight with Bobby Deol literally has its own narration ongoing in the form of a song in the background - as if singing ‘Ye dekho kaisa thappad maara~’ with every slap landed. No closure is offered to any of the sub plots ongoing with the protagonists - the relation with the wife, the kids or even the father. To compensate for this, Sandeep Reddy Vanga overcompensates with gratuitous violence to satisfy the enraged crowd watching this trash, never having been involved in as much as a scuffle, much less an actual fight. The plot is as simple as it gets, and it's clear Sandeep Reddy Vanga didn't have a story to tell, as much as a couple of ideas for scenes he wanted to execute based on his diminutive understanding of Marathi music, fetishization of Sikh Youth and culture and more importantly, his bloodlust to satisfy. Such is the bloodlust, that even the post-credits scene is drenched in more blood than most animes, and that's saying a lot."

In [63]:
text = clean_text(review_1)

In [64]:
text = remove_stopwords(text)

In [65]:
text_ = stemming(text)

In [66]:
text_ = cv.transform([text_]).toarray()

In [69]:
bb.predict(text_) # 0 -> negative, i.e. the model classifies this as a bad review

array([0], dtype=int64)

In [70]:
## Good Review

review_2 = "A must watch Movie, from direction to acting, superb job done. Very strong emotions involved. This movie will be Mega Hit and with this Ranbir has emerged as the roaring Lion in the industry and many records will be broken .Do not miss the opportunity to watch in the cinemas to enjoy the Animal experience.. I always thought no one can come close to how Yash oozes power and charisma in KGF but Ranbir outshines all superstars in this role by showcasing different looks from youth to old age with such perfection, once again showing what a phenomenal actor he is. While there are super stars who can make a movie work just from looks, dialogues and style, Ranbir is an acting powerhouse and the scenes he is in, you just cannot take your eyes off him. This is one of the few movies you can watch more than once as there is so much to absorb in almost every scene, you end up missing some aspects! My eyes fixated on Ranbir - totally smitten with the way he emotes just with his eyes, perfection in overall look at different stages of life, his natural subtle dialogue delivery and my ears on music, will need one or two times more to process other aspects!"

In [71]:
text = clean_text(review_2)

In [72]:
text = remove_stopwords(text)

In [73]:
text_ = stemming(text)

In [74]:
text_ = cv.transform([text_]).toarray()

In [75]:
bb.predict(text_) # 1 -> positive, i.e. the model classifies this as a good review

array([1], dtype=int64)
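
The two predictions above repeat the same sequence of cells, and a new review must go through exactly the same steps as the training data. One way to make that harder to get wrong is to wrap the steps in a single function. A minimal sketch follows. `make_predictor` is a hypothetical helper, and the toy vectorizer and model are fitted on two made-up documents so that the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

def make_predictor(steps, vectorizer, model):
    # Build a function that applies each cleaning step in order,
    # vectorizes the result, and returns the predicted label as an int.
    def predict(review):
        for step in steps:
            review = step(review)
        return int(model.predict(vectorizer.transform([review]))[0])
    return predict

# Toy stand-ins for the fitted vectorizer and classifier (illustration only).
demo_cv = CountVectorizer()
demo_X = demo_cv.fit_transform(["great superb film", "bad cring film"])
demo_model = BernoulliNB().fit(demo_X, [1, 0])

demo_predict = make_predictor([str.lower], demo_cv, demo_model)
print(demo_predict("A SUPERB film"))  # 1
print(demo_predict("bad cring"))      # 0
```

In this notebook the corresponding call would presumably be `make_predictor([remove_html_tags, clean_text, remove_stopwords, stemming], cv, bb)`.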