<a href="https://colab.research.google.com/github/Yash919/TCS-Project-Sentimental-Analysis/blob/main/sentimental_analysis_on_movie_review_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Importing Packages
import numpy as np #linear algebra
import pandas as pd #data processing. CSV file I/O (e.g. pd.read_csv)

In [None]:
df = pd.read_csv('/content/IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
# One review
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

# Text Cleaning

1. Sample 10,000 rows
2. Remove HTML tags
3. Convert everything to lower case
4. Remove special characters
5. Remove stop words
6. Stem the words

In [None]:
df = df.sample(10000) # Take a random sample of 10,000 rows to keep the first experiments fast.

In [None]:
df.shape # Returns (rows, columns). In this case we have 10k rows and 2 columns.

(10000, 2)

In [None]:
df.info() # Shows the non-null count per column, so missing values stand out. df.isnull().sum() gives the same check directly.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 34277 to 551
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     10000 non-null  object
 1   sentiment  10000 non-null  object
dtypes: object(2)
memory usage: 234.4+ KB


In [None]:
# Replacing positive & negative with 1 and 0 because the algorithms only work with numerical data.
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0}) # map returns a new Series, so assign it back.
df.head()

Unnamed: 0,review,sentiment
34277,There can be no worse criticism for a movie th...,0
13183,Hidden Frontier has been talked about and repo...,1
42383,I just finished watching this film and found i...,1
9490,"Alright, so maybe the impersonations of Jay Le...",1
25750,"The film was shot at Movie Flats, just off rou...",1


In [None]:
import re
clean = re.compile('<.*?>') # Removing HTML Tags.
re.sub(clean, '', df.iloc[2].review)

'I just finished watching this film and found it very enjoyable. It is a quiet, little film that doesn\'t overwhelm you with special effects or "big" performances. It simply takes you into the lives of the people living in a small hamlet in the backwoods of North Carolina. Henry Thomas gives a good performance as Raymond Toker, a young loner who finds a baby abandoned in the woods. Toker\'s search for the baby\'s parents takes him on a journey that will have a profound impact on his life. David Srathairn plays Truman Lester, a slimy conman with an ulterior motive. And David plays the bad guy to perfection. There is much more to this film than first meets the eye. Filmed on location in North Carolina and with a wonderful sound track of traditional music, it is worth watching.'

# Step 1 of Data Cleaning
Removing HTML Tags.

In [None]:
# Defining a function to remove the HTML tags from every row.

def clean_html(text):
  clean = re.compile('<.*?>') # Lazy match from '<' up to the first '>'.
  return re.sub(clean, '', text)

In [None]:
df['review'] = df['review'].apply(clean_html)

# Now all the rows have been cleaned in the review column!
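
The closing `>` in the tag pattern matters. The sketch below uses a made-up sample string to compare the lazy pattern with and without it. Without the `>`, `<.*?` matches only the `<` character itself, which would leave fragments such as `br` behind in the text.

```python
import re

# Made-up sample string in the style of the reviews.
sample = "A wonderful little production.<br /><br />The filming is great."

with_close = re.compile(r'<.*?>')    # lazy match from '<' up to the first '>'
without_close = re.compile(r'<.*?')  # lazy match of zero characters after '<'

cleaned = with_close.sub('', sample)
partial = without_close.sub('', sample)

print(cleaned)  # the tags are removed entirely
print(partial)  # only the '<' characters are removed
```

Substituting a space instead of an empty string would keep the words on either side of a tag from being glued together.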

# Step 2 of Data Cleaning
Converting into lowercase.

In [None]:
# Defining a function to convert all the rows to lowercase.

def convert_lower(text):
  return text.lower()

In [None]:
df['review'] = df['review'].apply(convert_lower)

# Now all the rows have been lowercased in the review column!

# Step 3 of Data Cleaning
Removing special characters.

In [None]:
# Defining a function to remove the special characters in all the rows.

def remove_special(text):
  x = ''

  for i in text:
    if i.isalnum(): # Keep letters and digits as they are.
      x = x + i
    else: # Anything else is a special character.
      x = x + ' ' # Replace it with a space so neighbouring words stay separated.
  return x # The cleaned review text.



In [None]:
df['review'] = df['review'].apply(remove_special)
# Now all the rows have been stripped of special characters in the review column.
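
The character-by-character test can also be written as a single regular-expression substitution. The sketch below is an illustration, not the notebook's own code: the helper names and the sample string are made up, and the two versions are only claimed to agree on plain ASCII text like this sample.

```python
import re

def remove_special_loop(text):
    # Keep alphanumeric characters, replace everything else with a space.
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

def remove_special_regex(text):
    # [\W_] matches any character that is not a letter or digit (underscore included).
    return re.sub(r'[\W_]', ' ', text)

sample = "it's a 10/10 film -- really!"  # made-up sample
loop_result = remove_special_loop(sample)
regex_result = remove_special_regex(sample)

# Runs of spaces are harmless here because the later split() ignores them.
collapsed = ' '.join(loop_result.split())
print(collapsed)
```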

In [None]:
df

Unnamed: 0,review,sentiment
34277,there can be no worse criticism for a movie th...,0
13183,hidden frontier has been talked about and repo...,1
42383,i just finished watching this film and found i...,1
9490,alright so maybe the impersonations of jay le...,1
25750,the film was shot at movie flats just off rou...,1
...,...,...
16205,i always look forward to this movie when its o...,1
23899,the polar express was an awful movie what ...,0
19676,very poor quality and the acting is equally as...,0
7990,the movie takes place in a little swedish town...,1


# Step 4 of Data Cleaning
Removing stop words.

In [None]:
# Removing the stop words.
import nltk # NLTK, the Natural Language Toolkit: a natural language processing library for Python.
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
stopwords.words('english') # Returns the list of English stop words (179 in the NLTK version used here).

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
# Defining a function to remove the stop words in all the rows.

stop_words = set(stopwords.words('english')) # Build the set once; membership tests on a set are fast.

def remove_stopwords(text): # text is one review string
  # Split the review into words and keep only those that are not stop words.
  return [word for word in text.split() if word not in stop_words]

df['review'] = df['review'].apply(remove_stopwords)

# Now all the rows have been stripped of stop words in the review column.
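
Looking each word up in a set, rather than calling `stopwords.words('english')` again for every word, keeps this step fast on 10,000 reviews. The sketch below is self-contained and uses a tiny, hypothetical stop-word set in place of NLTK's list. In the notebook the set would come from `set(stopwords.words('english'))`.

```python
# Hypothetical, tiny stop-word set for illustration only.
STOP_WORDS = {"i", "this", "was", "a", "the", "and", "it", "is"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    # Split on whitespace and keep the words that are not stop words.
    return [word for word in text.split() if word not in stop_words]

tokens = remove_stopwords("this movie was a wonderful surprise")
print(tokens)  # ['movie', 'wonderful', 'surprise']
```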

In [None]:
df

Unnamed: 0,review,sentiment
34277,"[worse, criticism, movie, word, boring, br, br...",0
13183,"[hidden, frontier, talked, reported, several, ...",1
42383,"[finished, watching, film, found, enjoyable, q...",1
9490,"[alright, maybe, impersonations, jay, leno, da...",1
25750,"[film, shot, movie, flats, route, 395, near, l...",1
...,...,...
16205,"[always, look, forward, movie, tv, get, dvd, g...",1
23899,"[polar, express, awful, movie, makes, movie, w...",0
19676,"[poor, quality, acting, equally, bad, movie, p...",0
7990,"[movie, takes, place, little, swedish, town, e...",1


# Step 5 of Data Cleaning
Stemming

In [None]:
# Performing Stemming

from nltk.stem.porter import PorterStemmer # Reduces a word to its stem (e.g. "loving" -> "love")
ps = PorterStemmer() # Making an object of Porter Stemmer

In [None]:
# Defining a function to perform stemming in all the rows.

def stem_words(text): # text is the list of words of one review
  # Stem each word and return a new list, so no state is shared between reviews.
  return [ps.stem(word) for word in text]

# This call must sit outside the function body; otherwise it never runs.
df['review'] = df['review'].apply(stem_words)

# Now all the rows have been stemmed in the review column.

In [None]:
df

Unnamed: 0,review,sentiment
34277,"[worse, criticism, movie, word, boring, br, br...",0
13183,"[hidden, frontier, talked, reported, several, ...",1
42383,"[finished, watching, film, found, enjoyable, q...",1
9490,"[alright, maybe, impersonations, jay, leno, da...",1
25750,"[film, shot, movie, flats, route, 395, near, l...",1
...,...,...
16205,"[always, look, forward, movie, tv, get, dvd, g...",1
23899,"[polar, express, awful, movie, makes, movie, w...",0
19676,"[poor, quality, acting, equally, bad, movie, p...",0
7990,"[movie, takes, place, little, swedish, town, e...",1


In [None]:
# Defining a function to join the list of words of each review back into one string.

def join_back(list_input):
  return " ".join(list_input)

In [None]:
df['review'] = df['review'].apply(join_back)

In [None]:
df['review']

34277    worse criticism movie word boring br br bad mo...
13183    hidden frontier talked reported several news a...
42383    finished watching film found enjoyable quiet l...
9490     alright maybe impersonations jay leno david le...
25750    film shot movie flats route 395 near lone pine...
                               ...                        
16205    always look forward movie tv get dvd guess ran...
23899    polar express awful movie makes movie worst hy...
19676    poor quality acting equally bad movie prime ex...
7990     movie takes place little swedish town everybod...
551      expecting lot mr amitabh bachan role sarkar di...
Name: review, Length: 10000, dtype: object

In [None]:
X=df.iloc[:,0:1].values

In [None]:
X.shape

(10000, 1)

# Converting into tabular data
We will use a bag-of-words format: each row is one review, each column is one word from the vocabulary of the reviews, and each cell holds how often that word occurs in that review.
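
The layout can be sketched in pure Python. This is a simplified stand-in for what a count vectorizer produces, not scikit-learn's implementation. The three reviews and the `max_features` value are made up, and tokenization here is a plain whitespace split.

```python
from collections import Counter

# Made-up, already cleaned reviews.
reviews = ["good movie good cast", "bad movie", "good plot bad ending"]
max_features = 3  # keep only the 3 most frequent words

# Count every word across all reviews and keep the most frequent ones.
totals = Counter(word for review in reviews for word in review.split())
vocabulary = sorted(word for word, _ in totals.most_common(max_features))

# One row per review, one column per vocabulary word, each cell a count.
matrix = []
for review in reviews:
    counts = Counter(review.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['bad', 'good', 'movie']
for row in matrix:
    print(row)
```

Words outside the kept vocabulary (`cast`, `plot`, `ending`) simply do not get a column, which is what limiting the number of features does.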

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 500) # Keep only the 500 most frequent words as columns.

In [None]:
X=cv.fit_transform(df['review']).toarray()

In [None]:
X.shape

(10000, 500)

In [None]:
y=df.iloc[:,-1].values

In [None]:
y

array([0, 1, 1, ..., 0, 1, 0])

In [None]:
y.shape

(10000,)

# Splitting data into 2 parts

In [None]:
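# X holds the features, y holds the labels.
# Training set: used to fit the models.
# Test set: held out; its true labels are known, so the predictions can be checked against them.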

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
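
Conceptually, the split shuffles the rows and sets aside a share of them for testing. The sketch below shows that idea in pure Python. The helper `simple_train_test_split` and its `seed` parameter are hypothetical, and it is not how scikit-learn implements the split.

```python
import random

def simple_train_test_split(rows, labels, test_size=0.2, seed=0):
    # Shuffle the row indices reproducibly, then take the first share as the test set.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

rows = [[i] for i in range(10)]      # made-up feature rows
labels = [i % 2 for i in range(10)]  # made-up labels
X_tr, X_te, y_tr, y_te = simple_train_test_split(rows, labels)
print(len(X_tr), len(X_te))  # 8 2
```

Because the shuffle is random, each run of the notebook gives a different split and slightly different accuracies. Passing `random_state` to `train_test_split` makes the split repeatable, and `stratify=y` keeps the class balance the same in both parts.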

In [None]:
X_train.shape

(8000, 500)

In [None]:
X_test.shape

(2000, 500)

In [None]:
y_train.shape

(8000,)

In [None]:
y_test.shape

(2000,)

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Creating 3 different models

In [None]:
# Creating 3 different models.
clf1 = GaussianNB()
clf2 = MultinomialNB()
clf3 = BernoulliNB()

# Training the models

In [None]:
# Training the models
clf1.fit(X_train,y_train)
clf2.fit(X_train,y_train)
clf3.fit(X_train,y_train)

BernoulliNB()

# Prediction

In [None]:
# Predicting
y_pred1=clf1.predict(X_test)
y_pred2=clf2.predict(X_test)
y_pred3=clf3.predict(X_test)

In [None]:
y_test.shape

(2000,)

In [None]:
y_pred1.shape

(2000,)

# Calculating Accuracy

In [None]:
# Calculating Accuracy

from sklearn.metrics import accuracy_score

In [None]:
print("Gaussian",accuracy_score(y_test,y_pred1))
print("Multinomial",accuracy_score(y_test,y_pred2))
print("Bernoulli",accuracy_score(y_test,y_pred3))

Gaussian 0.7825
Multinomial 0.828
Bernoulli 0.8245
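
Accuracy is the share of test reviews whose predicted label equals the true label. With 2,000 test reviews, a score of 0.828 corresponds to 1,656 correct predictions. The sketch below shows the calculation in pure Python. The `accuracy` helper and the labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction matches the true label.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1, 0]  # made-up true labels
y_hat = [1, 0, 0, 1, 1]   # made-up predictions
acc = accuracy(y_true, y_hat)
print(acc)  # 0.6

# Converting a reported score back to a count of correct predictions.
correct = round(0.828 * 2000)
print(correct)  # 1656
```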


In [None]:
#import pickle
#pickle dump