# Problem Statement

Task: Give the candidate a dataset of political speeches or text excerpts along with corresponding labels (e.g., positive/negative sentiment). 

Ask them to:

1. Build a machine learning model (e.g., NLP-based) to classify the sentiment of text data.
2. Train and evaluate the model using appropriate metrics.
3. Discuss the choice of algorithms and feature engineering techniques.

In [1]:
import pandas as pd
import nltk

nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]   

True

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
import numpy as np
!pip install spacy 



In [4]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

In [5]:
df = pd.read_csv('C:/Users/Admin/Desktop/ML Deployment/3. Natural Language Processing/Projects/IMDB Kaggle/IMDBSentimentalAnalysis-main/IMDB_Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [6]:
df.shape

(50000, 2)

In [7]:
df["sentiment"].unique()

array(['positive', 'negative'], dtype=object)

In [8]:
df["sentiment"].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [10]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

**1. Convert the "review" Column to Lowercase**

In [11]:
df.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive


In [12]:
df["review"] = df["review"].apply(lambda x: str(x).lower())
df.head(3)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive


**2. Contractions to Expansions**

In [13]:
contractions = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "i had / i would",
"I'd've": "i would have",
"I'll": "i shall / i will",
"I'll've": "i shall have / i will have",
"I'm": "i am",
"I've": "i have",
"i'd": "i had / i would",
"i'd've": "i would have",
"i'll": "i shall / i will",
"i'll've": "i shall have / i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [14]:
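# Expand contractions using the lookup table above.
# Note: str.replace matches substrings, so "he's" also matches inside "che's",
# and ambiguous entries insert every alternative (e.g. "there has / there is").
# The capitalized keys ("I'd", "I'm", ...) never match because the text is already lowercase.
def count_to_exp(x):
    if isinstance(x, str):
        for key, value in contractions.items():
            x = x.replace(key, value)
    return x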
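
The substring matching noted above is why later outputs contain text such as "che has / he is life". A minimal sketch of a whole-word alternative follows. `SAMPLE_CONTRACTIONS` and `expand_contractions` are hypothetical names, the table is a trimmed-down stand-in with one expansion per contraction, and this is not the function the rest of the notebook uses.

```python
import re

# Hypothetical, trimmed-down table with a single expansion per contraction.
SAMPLE_CONTRACTIONS = {
    "he's": "he is",
    "there's": "there is",
    "can't": "cannot",
    "don't": "do not",
}

# Longest keys first so longer contractions win over their prefixes.
# The lookarounds reject matches that sit inside a longer word.
_PATTERN = re.compile(
    r"(?<![\w'])(?:"
    + "|".join(re.escape(k) for k in sorted(SAMPLE_CONTRACTIONS, key=len, reverse=True))
    + r")(?![\w'])"
)

def expand_contractions(text):
    """Expand whole-word contractions only; non-strings pass through unchanged."""
    if not isinstance(text, str):
        return text
    return _PATTERN.sub(lambda m: SAMPLE_CONTRACTIONS[m.group(0)], text)

example = expand_contractions("che's life, he's here and i can't stay")
```

Because the pattern is compiled once and applied in a single pass, this variant also avoids looping over every key for every review.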

In [15]:
%%timeit
df["review"] = df["review"].apply(lambda x: count_to_exp(x))

13.1 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there has / there is a family where ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive

**3. Count and Remove Emails**

In [17]:
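# Only "hotmail.com" is searched here: a second positional argument to
# str.contains is interpreted as `case`, not as another pattern.
email_count = df[df["review"].str.contains("hotmail.com")]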
print(email_count)

                                                  review sentiment
8176   i could never remember the name of this show. ...  positive
14570  i posted on imdb on this series recently, givi...  positive
16961  i was lucky enough to get a dvd copy of this m...  positive
26880  does anybody know why this movie is called the...  positive
38645  i have been looking for this mini-series for a...  positive
44408  i just wanted to say i liked this movie a lot,...  positive
45292  i have been looking for the name of this film ...  positive


In [18]:
len(email_count)

7
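
To search for several domains at once, the alternatives have to go into a single regular expression. The sketch below uses only the standard library. `DOMAINS` and `mentions_domain` are hypothetical names and the sample strings are made up for illustration.

```python
import re

# Hypothetical domain list; extend as needed.
DOMAINS = ["hotmail.com", "gmail.com"]

# re.escape turns "." into a literal dot; "|" joins the alternatives.
DOMAIN_PATTERN = re.compile("|".join(re.escape(d) for d in DOMAINS))

def mentions_domain(text):
    """Return True if the text contains any of the listed domains."""
    return DOMAIN_PATTERN.search(text) is not None

samples = [
    "please email me. user@hotmail.com",
    "reach me at user@gmail.com",
    "i bought a hotmailxcom shirt",  # would match the unescaped pattern "hotmail.com"
]
flags = [mentions_domain(s) for s in samples]
```

With pandas, the same pattern string can be passed as the single pattern argument, for example `df["review"].str.contains(DOMAIN_PATTERN.pattern)`.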

In [19]:
df["review"][8176]

'i could never remember the name of this show. i use to watch it when i was 8. i remember staying up late when i was not suppose to just so i could watch this show. it was the best show to me. from what i remember of it, it is still great. this showed starred lucas black making him the first boy i ever had a crush on. i am from the country, therefore boys with an accent have no appeal to me, but for him i would definitely make an exception. which after seeing crazy in alabama, friday night lights, and tokyo drift you should see why. he is a great actor and has been since he was a kid. i miss this show and wish it would come back out. if anyone ever sees where they are selling the season please email me. kywildflower16@hotmail.com'

In [20]:
df["review"][14570]

'i posted on imdb on this series recently, giving a snail mail address at the commercial arm of the bbd where one would write to appeal release. i wrote to that address, mentioning sam waterson and his popularity prominently. i just received the following reply: <br /><br />from: emilyfussell@hotmail.com subject: oppenheimer date: may 14, 2006 1:44:00 pm mdt to: kk2840@earthlink.net <br /><br />dear kate, <br /><br />i work for the bbfc, the british equivalent to the mpaa, and we classify dvds and videos as well as films in this country. anyway, i am currently in the process of giving a certificate to the 1980 miniseries \'oppenheimer.\' while researching the work on the imdb, i noticed your post and thought you might like to know that the work is about to be released (hence the need for a certificate). <br /><br />i do not know which company is distributing it, but keep your eyes peeled! <br /><br />kind regards, <br /><br />emily +++++++++++++++++ <br /><br />hooray! <br /><br />i al

In [21]:
import re

In [22]:
x = 'if anyone ever sees where they are selling the season please email me. kywildflower16@hotmail.com'

In [23]:
re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+._-]+\b)', x)

['kywildflower16@hotmail.com']

In [24]:
df["email"] = df["review"].apply(lambda x: re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+._-]+\b)', x))

In [25]:
df["email_count"] = df["email"].apply(lambda x: len(x))

In [26]:
df[df["email_count"]>0].head(20)

Unnamed: 0,review,sentiment,email,email_count
1281,i like many others saw this as a child and i l...,positive,[tcampo23@aol.com],1
3568,i have noticed that people have asked if anyon...,positive,[creator67@pipinternet.net],1
5068,brilliant adaptation of the largely interior m...,positive,[invinoveritas1@aol.com],1
8176,i could never remember the name of this show. ...,positive,[kywildflower16@hotmail.com],1
8474,i am like the rest of the fans who love this c...,positive,[stone_stew@yahoo.co.uk],1
9510,i cant believe how many excellent actors can b...,positive,[lkhubble2@talkamerica.net],1
9978,i really cannot say too much more about the pl...,positive,[tawnyteel@yahoo.com],1
12166,robert jordan is a television star. robert jor...,positive,[iamaseal2@yahoo.com],1
13471,this movie was everything but boring. it deals...,positive,[ottenbreit2@netzero.net],1
13741,i love this movie. and disney channel is ridic...,positive,[cristin6891@aim.com],1


In [27]:
df[df["email_count"]>0].tail(20)

Unnamed: 0,review,sentiment,email,email_count
26880,does anybody know why this movie is called the...,positive,[killer2511@hotmail.com],1
28373,every once in a while you stumble across a mov...,positive,[contact@fightrunner.co.uk],1
29355,robert jordan is a television star. robert jor...,positive,[iamaseal2@yahoo.com],1
29880,granny is the best movie ever ganny is the bes...,positive,[iloverot@aol.com],1
33151,i saw this movie once as a kid on the late-lat...,positive,[cartwrightbride@yahoo.com],1
34518,"i desperately need this on a tape, not a dvd, ...",positive,[l.swanberg@yahoo.com],1
35989,i know this sounds odd coming from someone bor...,positive,[darkangel_1627@yahoo.com],1
37153,i have never seen one of these scifi originals...,negative,[deusexmachina529@aol.com],1
37363,this movie was great and i would like to buy i...,positive,[movie.deniselacey2000@yahoo.com],1
38645,i have been looking for this mini-series for a...,positive,[yurets777@hotmail.com],1


In [28]:
df["review"] = df["review"].apply(lambda x: re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+._-]+\b)', "" ,x))

In [29]:
df.head()

Unnamed: 0,review,sentiment,email,email_count
0,one of the other reviewers has mentioned that ...,positive,[],0
1,a wonderful little production. <br /><br />the...,positive,[],0
2,i thought this was a wonderful way to spend ti...,positive,[],0
3,basically there has / there is a family where ...,negative,[],0
4,"petter mattei's ""love in the time of money"" is...",positive,[],0


In [30]:
df["review"][26880]

"does anybody know why this movie is called the couch trip? i was just watching it and am still not sure why this title was picked the movie was very funny and its probably my favorite dan aykroyd performance it even beats out his ghostbusters performance i had never heard of the movie before i seen it in a sears store i read the back and thought it sounded good so i bought and when i finally got a chance to watch it, i thought it was better than what i had originally expected. this movie rates as good as animal house and national lampoon's vacation in my mind i wish comedies that have come out lately were written as well as this one was nothing sad happens in it and the bad stuff that does happen are also funny parts if anyone else feels this way and would like to read a comedy script for a movie that does not have a sad situation in it email me at "

In [31]:
df.drop("email", axis=1, inplace=True)

In [32]:
df.drop("email_count", axis=1, inplace=True)

In [33]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there has / there is a family where ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


**4. Count and Remove URLs**

In [34]:
x = "hi, thanks for watching https://youtube.com/Bloomberg"

In [35]:
re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)

[('https', 'youtube.com', '/Bloomberg')]
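
`re.findall` returns one tuple per match here because the pattern contains capturing groups. If the full URL is wanted as a single string, `re.finditer` with `group(0)` is one option. A small sketch with the same pattern follows; `URL_PATTERN` and `find_urls` are hypothetical names.

```python
import re

# Same URL pattern as above, compiled once so it can be reused.
URL_PATTERN = re.compile(
    r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

def find_urls(text):
    """Return each matched URL as one string instead of a tuple of groups."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]

x = "hi, thanks for watching https://youtube.com/Bloomberg"
urls = find_urls(x)
```

Counting still works the same way, since `len(find_urls(text))` equals the number of matches `re.findall` reports.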

In [36]:
df["url_flags"] = df["review"].apply(lambda x: len(re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))

In [37]:
df[df["url_flags"]>0]

Unnamed: 0,review,sentiment,url_flags
907,following directly from where the story left o...,positive,1
1088,this quasi j-horror film followed a young woma...,negative,1
1972,the basic plot of 'marigold' boasts of a roman...,negative,1
2132,"i, too, found ""oppenheimer"" to be a brilliant ...",positive,1
3038,"i really love this movie , i saw it for the fi...",positive,1
...,...,...,...
47334,do not watch this serbian documentary and serb...,negative,1
48079,i think that most people would agree with me i...,negative,1
48887,trite and unoriginal. it has / it is like some...,negative,1
49596,"this is absolutely the best 80s cartoon ever, ...",positive,1


In [38]:
df["review"][907]

"following directly from where the story left off in part one, the second half which sets about telling the inevitable downfall and much more grim side of the man's legacy is exactly as such. in direct contrast to the first feature, part two represents a shift from che the pride and glory of a revolutionised country, to che\x97struggling liberator of a country to which he has no previous ties. the change of setting is not just aesthetic; from the autumn and spring greys of the woodlands comes a change of tone and heart to the feature, replacing the optimism of the predecessor with a cynical, battered and bruised reality aligned to an all new struggle. yet, as che would go on to say himself\x97such a struggle is best told exactly as that\x97a struggle. while part one certainly helped document that initial surge to power that the revolutionary guerrilla acquired through just that, part two takes a much more refined, callous and bleak segment of che has / he is life and ambition, and give

In [39]:
print(x)

hi, thanks for watching https://youtube.com/Bloomberg


In [40]:
re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x)

'hi, thanks for watching '

In [41]:
df["review"] = df["review"].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))

In [42]:
#df["review"] = df["review"].apply(lambda x: len(re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '',  x)))

In [43]:
df.sample(5)

Unnamed: 0,review,sentiment,url_flags
31950,i first saw martin's day when i was just 10 ye...,positive,0
37227,i can think of no movie that better captures t...,positive,0
446,"down at the movie gallery, i saw a flick i jus...",negative,0
40041,do not see this movie. bad acting and stupid g...,negative,0
26714,"i am a big mark for the music of neil young, a...",negative,0


In [44]:
df["review"][907]

"following directly from where the story left off in part one, the second half which sets about telling the inevitable downfall and much more grim side of the man's legacy is exactly as such. in direct contrast to the first feature, part two represents a shift from che the pride and glory of a revolutionised country, to che\x97struggling liberator of a country to which he has no previous ties. the change of setting is not just aesthetic; from the autumn and spring greys of the woodlands comes a change of tone and heart to the feature, replacing the optimism of the predecessor with a cynical, battered and bruised reality aligned to an all new struggle. yet, as che would go on to say himself\x97such a struggle is best told exactly as that\x97a struggle. while part one certainly helped document that initial surge to power that the revolutionary guerrilla acquired through just that, part two takes a much more refined, callous and bleak segment of che has / he is life and ambition, and give

**5. Remove Special Characters and Punctuation**

In [45]:
x = "legacy is exactly as such. in direct contrast to the first feature, part two represents a shift from che the pride and glory of a revolutionised country, to che\x97struggling liberator of a country to which he has no previous ties."

In [46]:
re.sub(r'[^\w ]+', "",x)

'legacy is exactly as such in direct contrast to the first feature part two represents a shift from che the pride and glory of a revolutionised country to chestruggling liberator of a country to which he has no previous ties'

In [47]:
df["review"] = df["review"].apply(lambda x: re.sub(r'[^\w ]+', "",x))
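
Deleting punctuation outright glues neighbouring words together whenever they were separated only by a symbol, which is how "chestruggling" appears in the output above. A possible variant, sketched below with the hypothetical name `strip_punctuation`, substitutes a space instead and then collapses the extra whitespace. The trade-off is that possessives split as well ("mattei's" becomes "mattei s"), so which behaviour is preferable depends on the downstream tokenizer.

```python
import re

def strip_punctuation(text):
    """Replace runs of non-word characters with one space, then tidy whitespace."""
    spaced = re.sub(r"[^\w ]+", " ", text)
    return " ".join(spaced.split())

x = "a revolutionised country, to che\x97struggling liberator"
cleaned = strip_punctuation(x)
```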

In [48]:
df.sample(5)

Unnamed: 0,review,sentiment,url_flags
31355,i liked this show from the first episode i saw...,positive,0
24884,i have seen this movie and even though i kind ...,positive,0
22558,the sun was not shining it was too wet to play...,negative,0
2526,well this was my first imax experience so i wa...,negative,0
41948,this is one of those movies where the acting s...,negative,0


In [49]:
df.head()

Unnamed: 0,review,sentiment,url_flags
0,one of the other reviewers has mentioned that ...,positive,0
1,a wonderful little production br br the filmin...,positive,0
2,i thought this was a wonderful way to spend ti...,positive,0
3,basically there has there is a family where a...,negative,0
4,petter matteis love in the time of money is a ...,positive,0


In [50]:
df["review"][1]

'a wonderful little production br br the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece br br the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life br br the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done'

**6. Remove Multiple Spaces** 

In [51]:
x = "hi    hello how are  you"

In [52]:
' '.join(x.split())

'hi hello how are you'

In [53]:
df["review"] = df["review"].apply(lambda x: ' '.join(x.split()))

In [54]:
df.head()

Unnamed: 0,review,sentiment,url_flags
0,one of the other reviewers has mentioned that ...,positive,0
1,a wonderful little production br br the filmin...,positive,0
2,i thought this was a wonderful way to spend ti...,positive,0
3,basically there has there is a family where a ...,negative,0
4,petter matteis love in the time of money is a ...,positive,0


In [55]:
df.drop("url_flags", axis=1, inplace=True)
df.head(2)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive


**7. Remove HTML Tags**

In [56]:
!pip install beautifulsoup4



In [57]:
from bs4 import BeautifulSoup

In [58]:
x = '<html><h1> thanks for watching </h1></html>'

In [59]:
BeautifulSoup(x, "lxml").get_text().strip()

'thanks for watching'

In [60]:
df["review"] = df["review"].apply(lambda x: BeautifulSoup(x, "lxml").get_text().strip())
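
Note that step 5 already removed "<", ">" and "/", so by this point every `<br />` has become a plain `br` token and there are no tags left for the parser to strip, which is why "br br" is still visible in the outputs. One way to avoid this is to remove tags before punctuation. The sketch below shows that ordering. It swaps in the standard library's `HTMLParser` for BeautifulSoup purely to stay dependency-free, and `strip_html` and `clean_review` are hypothetical names.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes and emits a space wherever a tag appeared."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    """Return the text content with tags replaced by spaces."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

def clean_review(text):
    """Order matters: tags first, then punctuation, then whitespace."""
    text = strip_html(text)
    text = re.sub(r"[^\w ]+", " ", text)
    return " ".join(text.split())

raw = "a wonderful little production. <br /><br />the filming technique is very unassuming"
cleaned = clean_review(raw)
```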

In [61]:
df.head(5)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there has there is a family where a ...,negative
4,petter matteis love in the time of money is a ...,positive


**8. Remove Accented Characters**

In [62]:
x = "rappé, naïve, soufflé"

In [63]:
import unicodedata

In [64]:
def remove_accented_chars(x):
    x = unicodedata.normalize("NFKD", x).encode("ascii","ignore").decode("utf-8", "ignore")
    return x

In [65]:
remove_accented_chars(x)

'rappe, naive, souffle'
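
Why this works: NFKD normalization splits an accented letter into its base letter plus a separate combining mark, and the ASCII encode step with `"ignore"` then drops the mark. The short sketch below makes the intermediate state visible. The example words are illustrative and written with escapes so the code points are unambiguous.

```python
import unicodedata

word = "souffl\u00e9"  # precomposed e-acute, 7 code points
decomposed = unicodedata.normalize("NFKD", word)

# After decomposition the accent is a separate combining character.
has_combining_mark = any(unicodedata.combining(ch) for ch in decomposed)

# Dropping non-ASCII code points discards the mark and keeps the base letter.
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")

# Characters without an ASCII base letter are removed entirely.
lossy = unicodedata.normalize("NFKD", "stra\u00dfe").encode("ascii", "ignore").decode("ascii")
```

The last line is a reminder that the approach is lossy for letters that have no decomposition, which may matter for non-English names in the reviews.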

In [66]:
df["review"] = df["review"].apply(lambda x: remove_accented_chars(x))

In [67]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production br br the filmin...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there has there is a family where a ...,negative
4,petter matteis love in the time of money is a ...,positive


**9. Remove Stop Words**

In [68]:
# Stop words Removal
import nltk
from nltk.corpus import stopwords

In [69]:
nltk_stopwords = set(stopwords.words('english'))
print(nltk_stopwords)

{'d', 'mightn', 'being', 'while', 'or', 'couldn', "wasn't", "doesn't", 'because', 'doesn', 'just', "it's", "you'll", 'mustn', 'shan', 'its', 'same', 'more', 'down', 'yourselves', 'shouldn', 'be', 'then', 'wasn', 'm', 'again', 'yours', "won't", "weren't", 'do', 've', 'so', 'from', 'some', 'an', "couldn't", 'did', 'will', 'all', 'than', 'have', 'into', 'theirs', 'that', 'out', 'haven', 're', 'those', "mightn't", 'can', 'them', "don't", 'other', 'she', "you've", 'any', 'o', 'the', 'on', 'does', 'own', 'isn', 'and', "haven't", "hadn't", 'there', 'himself', "shan't", 'these', 'weren', 'under', 'having', 'doing', 'was', 'such', 'what', 'as', 'our', 'wouldn', 'he', "needn't", 'i', 'their', 'at', 'my', 'after', "shouldn't", 'before', 'aren', 'both', 'no', 'in', 'needn', "you'd", 'is', 'your', 'only', 'll', 'ourselves', 't', 'itself', 'should', 'about', 'you', 'don', 'are', 'between', "that'll", "aren't", 'against', 'by', "she's", 'with', "isn't", 'we', 'over', 'him', 'not', 'has', 'y', 'hadn',

In [70]:
len(nltk_stopwords)

179

In [71]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
sklearn_stopwords = set(ENGLISH_STOP_WORDS)
print(sklearn_stopwords)

{'hereupon', 'beforehand', 'take', 'top', 'wherever', 'being', 'behind', 'onto', 'next', 'via', 'otherwise', 'becomes', 'couldnt', 'less', 'hasnt', 'elsewhere', 'together', 'rather', 'twenty', 'its', 'mine', 'within', 'sincere', 'something', 'others', 'six', 'yourselves', 'be', 'us', 'yours', 'last', 'sometimes', 'thru', 'thus', 'do', 'side', 'so', 'either', 'an', 'all', 'have', 'already', 'although', 'least', 'that', 'out', 're', 'can', 'mill', 'move', 'nowhere', 'everyone', 'other', 'four', 'any', 'thick', 'much', 'however', 'never', 'formerly', 'perhaps', 'former', 'always', 'eleven', 'beside', 'since', 'mostly', 'there', 'himself', 'whether', 'found', 'ltd', 'under', 'what', 'somewhere', 'per', 'our', 'i', 'interest', 'amoungst', 'at', 'anyhow', 'system', 'sixty', 'detail', 'after', 'enough', 'both', 'across', 'no', 'somehow', 'indeed', 'only', 'amongst', 'your', 'wherein', 'please', 'about', 'you', 'herein', 'against', 'by', 'with', 'several', 'whereas', 'anywhere', 'him', 'many',

In [72]:
len(sklearn_stopwords)

318

In [73]:
# Find the common stopwords from NLTK & sklearn
common_stopwords = nltk_stopwords.intersection(sklearn_stopwords)
print(common_stopwords)

{'being', 'while', 'or', 'because', 'its', 'same', 'more', 'down', 'yourselves', 'be', 'then', 'again', 'yours', 'do', 'so', 'from', 'some', 'an', 'will', 'all', 'than', 'have', 'into', 'that', 'out', 're', 'those', 'can', 'them', 'other', 'she', 'any', 'the', 'on', 'own', 'and', 'there', 'himself', 'these', 'under', 'such', 'was', 'what', 'as', 'our', 'he', 'i', 'their', 'at', 'my', 'after', 'before', 'both', 'no', 'in', 'is', 'your', 'only', 'ourselves', 'should', 'itself', 'about', 'you', 'are', 'between', 'against', 'by', 'with', 'we', 'over', 'him', 'not', 'has', 'who', 'through', 'yourself', 'this', 'been', 'whom', 'it', 'of', 'where', 'me', 'ours', 'most', 'once', 'during', 'hers', 'how', 'too', 'off', 'few', 'they', 'if', 'had', 'themselves', 'each', 'but', 'nor', 'her', 'very', 'up', 'were', 'until', 'further', 'when', 'here', 'myself', 'his', 'which', 'below', 'for', 'above', 'a', 'why', 'am', 'now', 'to', 'herself'}


In [74]:
len(common_stopwords)

119

In [75]:
# Combining the stopwords from sklearn & NLTK
combined_stopwords = nltk_stopwords.union(sklearn_stopwords)
print(combined_stopwords)

{'hereupon', 'beforehand', 'take', 'd', 'mightn', 'top', 'wherever', 'being', 'behind', 'onto', 'couldn', 'next', 'via', 'otherwise', 'becomes', 'couldnt', "it's", 'less', "you'll", 'mustn', 'hasnt', 'shan', 'elsewhere', 'together', 'rather', 'twenty', 'its', 'mine', 'within', 'sincere', 'something', 'others', 'yourselves', 'six', 'be', 'wasn', 'us', 'yours', "weren't", 'last', 'sometimes', 'thru', 'do', 've', 'so', 'thus', 'side', 'either', 'an', 'all', 'have', 'already', 'although', 'least', 'that', 'out', 're', 'can', 'mill', 'move', 'nowhere', "don't", 'everyone', 'other', 'four', 'any', 'o', 'thick', 'much', 'however', 'never', 'formerly', 'perhaps', 'former', 'always', 'eleven', 'beside', 'since', "hadn't", 'there', 'himself', 'mostly', 'whether', "shan't", 'found', 'ltd', 'under', 'having', 'what', 'somewhere', 'per', 'our', 'wouldn', "needn't", 'i', 'interest', 'amoungst', 'at', 'anyhow', 'system', 'sixty', 'detail', 'after', 'enough', 'aren', 'both', 'across', 'no', 'needn', '

In [76]:
len(combined_stopwords)

378

In [77]:
#import spacy
#loading the english language small model of spacy
#en = spacy.load('en_core_web_sm')
#stopwords = en.Defaults.stop_words

#print(len(stopwords))
#print(stopwords)

In [78]:
#import  spacy
#from spacy.lang.en.stop_words import STOP_WORDS as stopwords

In [79]:
#import spacy
#from string import punctuation
#from spacy.lang.en import stop_words


#nlp = spacy.load('en_core_web_sm')

#stop_words = stop_words.STOP_WORDS
#print(stop_words)
# as an alternative solution
# stop_words = nlp.Defaults.stop_words

In [80]:
x = "I am playing cricket and i can run"

In [81]:
' '.join([t for t in x.split() if t not in combined_stopwords ])

'I playing cricket run'
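
Two things are worth noting in this result. The capital "I" survives because the membership test is case-sensitive, and the stop-word lists include negations such as "not", "no", "nor" and "never", which often carry sentiment. A sketch of a filter that lowercases before the lookup and keeps negations follows. `SAMPLE_STOPWORDS`, `NEGATIONS` and `remove_stopwords` are hypothetical names, and the small set only stands in for `combined_stopwords`.

```python
# Hypothetical, tiny stop-word set standing in for combined_stopwords.
SAMPLE_STOPWORDS = {"i", "am", "and", "can", "the", "is", "not", "no", "this"}

# Words that flip polarity; kept even though stop-word lists contain them.
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS, keep=NEGATIONS):
    """Drop stop words case-insensitively, but keep negation words."""
    kept = []
    for token in text.split():
        lowered = token.lower()
        if lowered in keep or lowered not in stopwords:
            kept.append(token)
    return " ".join(kept)

a = remove_stopwords("I am playing cricket and i can run")
b = remove_stopwords("this is not good")
```

Passing `stopwords=combined_stopwords` would apply the same logic to the full list. Whether keeping negations actually helps the classifier is an assumption that should be checked against validation metrics.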

In [82]:
df["review_without_stopwords"] = df["review"].apply(lambda x: ' '.join([t for t in x.split() if t not in combined_stopwords ]))

In [83]:
df.head()

Unnamed: 0,review,sentiment,review_without_stopwords
0,one of the other reviewers has mentioned that ...,positive,reviewers mentioned watching 1 oz episode shal...
1,a wonderful little production br br the filmin...,positive,wonderful little production br br filming tech...
2,i thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...
3,basically there has there is a family where a ...,negative,basically family little boy jake thinks zombie...
4,petter matteis love in the time of money is a ...,positive,petter matteis love time money visually stunni...


In [84]:
df["review_without_stopwords"][0]

'reviewers mentioned watching 1 oz episode shall hooked right exactly happened mebr br thing struck oz brutality unflinching scenes violence set right word trust faint hearted timid pulls punches regards drugs sex violence hardcore classic use wordbr br called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements far awaybr br say main appeal fact goes shows dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz mess episode saw struck nasty surreal say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards shall sold nickel inmates shall kill order away mannered middle class inmates turned prison bitches lack street skills prison experience watching oz 

**10. Text Normalization - Lemmatization**

In [85]:
# Text Normalization: Stemming or Lemmatization (prefer)
#from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
#lemmatizer = WordNetLemmatizer()


In [86]:
#stemmer = WordNetLemmatizer()
#def stemming(data):
    #text = [stemmer.stem(word) for word in data]
    #return ''.join(text)

In [87]:
#data = "I was doing homework"

In [88]:
#df["review_without_stopword_done_Lemmi"] = df['review_without_stopwords'].apply(lambda x: stemming(x))

In [89]:
# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()


def lemmatize_text(text):
    words = text.split()  # Split the text into words
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)


df['review_without_stopwords'] = df['review_without_stopwords'].apply(lemmatize_text)


print(df)

                                                  review sentiment  \
0      one of the other reviewers has mentioned that ...  positive   
1      a wonderful little production br br the filmin...  positive   
2      i thought this was a wonderful way to spend ti...  positive   
3      basically there has there is a family where a ...  negative   
4      petter matteis love in the time of money is a ...  positive   
...                                                  ...       ...   
49995  i thought this movie did a down right good job...  positive   
49996  bad plot bad dialogue bad acting idiotic direc...  negative   
49997  i am a catholic taught in parochial elementary...  negative   
49998  i am going to have to disagree with the previo...  negative   
49999  no one expects the star trek movies to be high...  negative   

                                review_without_stopwords  
0      reviewer mentioned watching 1 oz episode shall...  
1      wonderful little production br br 

In [90]:
df.head()

Unnamed: 0,review,sentiment,review_without_stopwords
0,one of the other reviewers has mentioned that ...,positive,reviewer mentioned watching 1 oz episode shall...
1,a wonderful little production br br the filmin...,positive,wonderful little production br br filming tech...
2,i thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...
3,basically there has there is a family where a ...,negative,basically family little boy jake think zombie ...
4,petter matteis love in the time of money is a ...,positive,petter matteis love time money visually stunni...


In [91]:
df["review_without_stopwords"][0]

'reviewer mentioned watching 1 oz episode shall hooked right exactly happened mebr br thing struck oz brutality unflinching scene violence set right word trust faint hearted timid pull punch regard drug sex violence hardcore classic use wordbr br called oz nickname given oswald maximum security state penitentary focus mainly emerald city experimental section prison cell glass front face inwards privacy high agenda em city home manyaryans muslim gangsta latino christian italian irish moreso scuffle death stare dodgy dealing shady agreement far awaybr br say main appeal fact go show dare forget pretty picture painted mainstream audience forget charm forget romanceoz mess episode saw struck nasty surreal say ready watched developed taste oz got accustomed high level graphic violence violence injustice crooked guard shall sold nickel inmate shall kill order away mannered middle class inmate turned prison bitch lack street skill prison experience watching oz comfortable uncomfortable viewin

In [92]:
df["review_without_stopwords"][1]

'wonderful little production br br filming technique unassuming oldtimebbc fashion give comforting discomforting sense realism entire piece br br actor extremely chosen michael sheen got polari voice pat truly seamless editing guided reference williams diary entry worth watching terrificly written performed piece masterful production great master comedy life br br realism really come home little thing fantasy guard use traditional dream technique remains solid disappears play knowledge sens particularly scene concerning orton halliwell set particularly flat halliwells mural decorating surface terribly'

In [93]:
df["review_without_stopwords"][2]

'thought wonderful way spend time hot summer weekend sitting air conditioned theater watching lighthearted comedy plot simplistic dialogue witty character likable bread suspected serial killer disappointed realize match point 2 risk addiction thought proof woody allen fully control style grown lovebr br laughed woodys comedy year dare say decade impressed scarlet johanson managed tone sexy image jumped right average spirited young womanbr br crown jewel career wittier devil wear prada interesting superman great comedy friend'

In [94]:
df["review_without_stopwords"][3]

'basically family little boy jake think zombie closet parent fighting timebr br movie slower soap opera suddenly jake decides rambo kill zombiebr br ok going make film decide thriller drama drama movie watchable parent divorcing arguing like real life jake closet totally ruin film expected boogeyman similar movie instead watched drama meaningless thriller spotsbr br 3 10 playing parent descent dialog shot jake ignore'

In [95]:
df["review_without_stopwords"][4]

'petter matteis love time money visually stunning film watch mr mattei offer vivid portrait human relation movie telling money power success people different situation encounter br br variation arthur schnitzlers play theme director transfer action present time new york different character meet connect connected way person know previous point contact stylishly film sophisticated luxurious look taken people live world live habitatbr br thing get soul picture different stage loneliness inhabits big city exactly best place human relation fulfillment discerns case people encounterbr br acting good mr matteis direction steve buscemi rosario dawson carol kane michael imperioli adrian grenier rest talented cast make character come alivebr br wish mr mattei good luck await anxiously work'

In [96]:
df["review_without_stopwords"][5]

'probably alltime favorite movie story selflessness sacrifice dedication noble cause preachy boring get old despite seen 15 time 25 year paul lukas performance brings tear eye bette davis truly sympathetic role delight kid grandma say like dressedup midget child make fun watch mother slow awakening happening world roof believable startling dozen thumb movie'

In [97]:
df["review_without_stopwords"][100]

'short film inspired soontobe length feature spatula madness hilarious piece contends similar cartoon yielding multiple writer short film star edward spatula fired job join fight evil spoon premise allows funny content near beginning barely present remainder feature film 15minute running time absorbed oddball comedy small musical number unfortunately lie plot set really time surely follows plot better highbudget hollywood film film worth watching time expect deep story'

**Mostly numeric digits remain**

**11. Removing Numeric Digits from Text**

In [98]:
import re  # needed for re.sub below; harmless if already imported earlier


# Define a function to remove numeric digits from a text
def remove_digits(text):
    return re.sub(r'\d+', '', text)


df['review_without_stopwords'] = df['review_without_stopwords'].apply(remove_digits)


print(df)

                                                  review sentiment  \
0      one of the other reviewers has mentioned that ...  positive   
1      a wonderful little production br br the filmin...  positive   
2      i thought this was a wonderful way to spend ti...  positive   
3      basically there has there is a family where a ...  negative   
4      petter matteis love in the time of money is a ...  positive   
...                                                  ...       ...   
49995  i thought this movie did a down right good job...  positive   
49996  bad plot bad dialogue bad acting idiotic direc...  negative   
49997  i am a catholic taught in parochial elementary...  negative   
49998  i am going to have to disagree with the previo...  negative   
49999  no one expects the star trek movies to be high...  negative   

                                review_without_stopwords  
0      reviewer mentioned watching  oz episode shall ...  
1      wonderful little production br br 

In [99]:
df.head(3)

Unnamed: 0,review,sentiment,review_without_stopwords
0,one of the other reviewers has mentioned that ...,positive,reviewer mentioned watching oz episode shall ...
1,a wonderful little production br br the filmin...,positive,wonderful little production br br filming tech...
2,i thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...


In [100]:
df["review_without_stopwords"][5]

'probably alltime favorite movie story selflessness sacrifice dedication noble cause preachy boring get old despite seen  time  year paul lukas performance brings tear eye bette davis truly sympathetic role delight kid grandma say like dressedup midget child make fun watch mother slow awakening happening world roof believable startling dozen thumb movie'
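Removing digits with `re.sub(r'\d+', '', text)` leaves the surrounding spaces behind, which is why the output above contains double spaces ("seen  time  year"). `CountVectorizer` ignores extra whitespace, so this does not affect the features, but the cleaned text can be tidied if it will be inspected or saved. A minimal sketch, assuming a hypothetical helper name `remove_digits_and_squeeze` that is not part of the notebook:

```python
import re


def remove_digits_and_squeeze(text):
    """Drop digit runs, then collapse repeated whitespace (hypothetical helper)."""
    no_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", no_digits).strip()


# Sample fragment taken from review 5 before digit removal
sample = "despite seen 15 time 25 year paul lukas performance"
cleaned = remove_digits_and_squeeze(sample)
print(cleaned)
```

Note that digits glued to a word are stripped as well, so a token such as "15minute" becomes "minute".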

In [101]:
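# Encode the labels as integers: positive -> 1, negative -> 0.
# Assigning the mapped column back avoids relying on inplace=True on a column accessor.
df["sentiment"] = df["sentiment"].map({"positive": 1, "negative": 0})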

In [102]:
df.head()

Unnamed: 0,review,sentiment,review_without_stopwords
0,one of the other reviewers has mentioned that ...,1,reviewer mentioned watching oz episode shall ...
1,a wonderful little production br br the filmin...,1,wonderful little production br br filming tech...
2,i thought this was a wonderful way to spend ti...,1,thought wonderful way spend time hot summer we...
3,basically there has there is a family where a ...,0,basically family little boy jake think zombie ...
4,petter matteis love in the time of money is a ...,1,petter matteis love time money visually stunni...


In [103]:
df.to_csv('C:/Users/Admin/Desktop/ML Deployment/3. Natural Language Processing/Projects/IMDB Kaggle/IMDBSentimentalAnalysis-main/IMDB_Clean.csv')

In [104]:
#X = df['review_without_stopwords']
#Y = df['sentiment']

In [105]:
import string
# Note: this rebinds the name `stopwords`, which In [4] bound to spaCy's STOP_WORDS set
from nltk.corpus import stopwords

In [106]:
from sklearn.feature_extraction.text import CountVectorizer  # builds the document-term matrix (DTM)

In [107]:
bag_words = CountVectorizer().fit(df['review_without_stopwords'])

In [108]:
print(len(bag_words.vocabulary_))

164156


In [109]:
print(bag_words)

CountVectorizer()


In [110]:
message_bagwords = bag_words.transform(df['review_without_stopwords'])
print(message_bagwords)

  (0, 982)	1
  (0, 2699)	1
  (0, 2881)	1
  (0, 6882)	1
  (0, 9095)	1
  (0, 9689)	1
  (0, 9695)	1
  (0, 14640)	1
  (0, 17397)	3
  (0, 18775)	1
  (0, 20541)	1
  (0, 22823)	1
  (0, 23856)	1
  (0, 25125)	1
  (0, 25742)	2
  (0, 25976)	1
  (0, 25991)	1
  (0, 27842)	1
  (0, 32193)	1
  (0, 33867)	1
  (0, 33914)	1
  (0, 34542)	1
  (0, 34591)	1
  (0, 36836)	1
  (0, 39547)	1
  :	:
  (49999, 78046)	1
  (49999, 78524)	1
  (49999, 81233)	1
  (49999, 88861)	1
  (49999, 94492)	10
  (49999, 95327)	1
  (49999, 97443)	1
  (49999, 110194)	1
  (49999, 119479)	1
  (49999, 119497)	1
  (49999, 124973)	1
  (49999, 125072)	1
  (49999, 125451)	1
  (49999, 129306)	1
  (49999, 136001)	1
  (49999, 136935)	1
  (49999, 148873)	1
  (49999, 149494)	1
  (49999, 152228)	1
  (49999, 157420)	1
  (49999, 157466)	1
  (49999, 157687)	1
  (49999, 161456)	1
  (49999, 161484)	2
  (49999, 163024)	1


In [111]:
message_bagwords.shape

(50000, 164156)
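The matrix has one row per review and one column per distinct token, which gives 164,156 columns here. Many of those tokens are likely to be rare, including fused words such as "mebr" or "manyaryans" visible in the cleaned text. The sketch below uses a hypothetical three-document corpus to show how `CountVectorizer` builds the vocabulary and how its `min_df` parameter drops tokens that appear in too few documents. The value `min_df=2` is only an illustration, not a setting the notebook uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the IMDB data
toy_docs = [
    "great movie great cast",
    "bad movie",
    "great plot bad ending",
]

# Default settings keep every token seen at least once
full = CountVectorizer().fit(toy_docs)
# min_df=2 keeps only tokens that occur in at least two documents
pruned = CountVectorizer(min_df=2).fit(toy_docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)
dtm = pruned.transform(toy_docs)  # sparse document-term matrix of raw counts

print(full_vocab)
print(pruned_vocab)
print(dtm.toarray())
```

Columns follow the alphabetical order of the retained vocabulary, and each cell holds the raw count of that token in that document.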

In [112]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(message_bagwords)

In [113]:
message_tfidf = tfidf_transformer.transform(message_bagwords)

In [114]:
print(message_tfidf.shape)
print(message_tfidf)

(50000, 164156)
  (0, 161088)	0.10899258881897521
  (0, 161085)	0.04551081346500207
  (0, 157466)	0.06985569436237599
  (0, 157442)	0.04258394190737944
  (0, 155760)	0.21336689675796464
  (0, 155502)	0.1340438176396374
  (0, 153809)	0.04543741423328163
  (0, 152180)	0.1006421792120878
  (0, 151472)	0.07563810150652528
  (0, 149984)	0.05294742044469458
  (0, 149578)	0.06575901497006334
  (0, 147846)	0.05700955683265527
  (0, 146540)	0.0960111351371662
  (0, 145091)	0.029345817072313356
  (0, 142900)	0.05995977491169347
  (0, 141016)	0.07048864305394385
  (0, 138989)	0.15080833998125545
  (0, 138693)	0.05520869304505251
  (0, 137187)	0.05452636133867852
  (0, 136976)	0.07963505511443564
  (0, 134129)	0.0742559592449254
  (0, 132241)	0.06270600265583329
  (0, 130595)	0.04302341715280118
  (0, 129306)	0.12141256211123214
  (0, 129192)	0.09104131981988478
  :	:
  (49999, 88861)	0.20363179310432325
  (49999, 81233)	0.09706283949525045
  (49999, 78524)	0.11529309779705853
  (49999, 78046)	0.1
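The two steps used above (`CountVectorizer` followed by `TfidfTransformer`) can also be written as a single `TfidfVectorizer`. The sketch below checks this on a hypothetical toy corpus with default parameters. It also shows that, with the default `norm='l2'`, every non-empty row has unit length, which is why the weights printed above all fall between 0 and 1.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, not taken from the IMDB data
toy_docs = [
    "great movie great cast",
    "bad movie",
    "great plot bad ending",
]

# Two-step route, as in the notebook
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
one_step = TfidfVectorizer().fit_transform(toy_docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(one_step.toarray(), axis=1)
print(same, row_norms)
```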

In [115]:
X = message_tfidf
y = df["sentiment"]

In [116]:
print(X)

  (0, 161088)	0.10899258881897521
  (0, 161085)	0.04551081346500207
  (0, 157466)	0.06985569436237599
  (0, 157442)	0.04258394190737944
  (0, 155760)	0.21336689675796464
  (0, 155502)	0.1340438176396374
  (0, 153809)	0.04543741423328163
  (0, 152180)	0.1006421792120878
  (0, 151472)	0.07563810150652528
  (0, 149984)	0.05294742044469458
  (0, 149578)	0.06575901497006334
  (0, 147846)	0.05700955683265527
  (0, 146540)	0.0960111351371662
  (0, 145091)	0.029345817072313356
  (0, 142900)	0.05995977491169347
  (0, 141016)	0.07048864305394385
  (0, 138989)	0.15080833998125545
  (0, 138693)	0.05520869304505251
  (0, 137187)	0.05452636133867852
  (0, 136976)	0.07963505511443564
  (0, 134129)	0.0742559592449254
  (0, 132241)	0.06270600265583329
  (0, 130595)	0.04302341715280118
  (0, 129306)	0.12141256211123214
  (0, 129192)	0.09104131981988478
  :	:
  (49999, 88861)	0.20363179310432325
  (49999, 81233)	0.09706283949525045
  (49999, 78524)	0.11529309779705853
  (49999, 78046)	0.18708101996276194

In [117]:
print(y)

0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64


In [118]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(35000, 164156)
(35000,)
(15000, 164156)
(15000,)
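One caveat with the cells above: the vocabulary and IDF weights were fit on all 50,000 reviews before the split, so statistics from the test reviews influence the training features. A common way to avoid this is to split the raw text first and fit the vectorizer on the training portion only, for example through a scikit-learn pipeline. The following is a minimal sketch on hypothetical toy data. The names `texts`, `labels`, and `model` are illustrative, and `stratify` is an optional addition that keeps the class balance equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned reviews and 0/1 labels
texts = [
    "wonderful film loved every scene",
    "great acting great story",
    "brilliant funny touching",
    "best movie this year",
    "terrible plot awful acting",
    "boring slow bad ending",
    "worst film waste time",
    "dull script poor direction",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test rows stay unseen by the vectorizer
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# At prediction time the fitted vocabulary is reused; unseen words are ignored
test_predictions = model.predict(test_texts)
print(test_predictions)
```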


In [119]:
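from sklearn.naive_bayes import MultinomialNB
# Multinomial Naive Bayes trained on the TF-IDF features; despite its name,
# scam_detect predicts sentiment (1 = positive, 0 = negative)
scam_detect = MultinomialNB().fit(x_train, y_train)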

In [120]:
predicted = scam_detect.predict(x_test)

In [121]:
predicted

array([1, 1, 0, ..., 0, 1, 1], dtype=int64)

In [122]:
print("Train Score: ", scam_detect.score(x_train, y_train))  # train score
print("Test Score: ", scam_detect.score(x_test, y_test)) #test score

Train Score:  0.9183142857142857
Test Score:  0.8631333333333333


In [123]:
from sklearn import metrics
print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))

              precision    recall  f1-score   support

           0       0.85      0.88      0.86      7411
           1       0.88      0.85      0.86      7589

    accuracy                           0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000

[[6506  905]
 [1148 6441]]
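The figures in the classification report follow directly from the confusion matrix, where rows are the true labels and columns are the predicted labels. The sketch below recomputes accuracy plus precision, recall, and F1 for the positive class from the four counts printed above. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above
tn, fp = 6506, 905    # true label 0: predicted 0, predicted 1
fn, tp = 1148, 6441   # true label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of the reviews predicted positive, how many are positive
recall_pos = tp / (tp + fn)      # of the truly positive reviews, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.4f} precision={precision_pos:.4f} "
      f"recall={recall_pos:.4f} f1={f1_pos:.4f}")
```

Rounded to two decimals, these values agree with the row for class 1 in the report, and the accuracy equals the test score printed earlier.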


In [124]:
from sklearn import svm  # required: svm.SVC is used on the next line
spam_detect_svm = svm.SVC().fit(x_train, y_train)

In [126]:
predicted = spam_detect_svm.predict(x_test)

In [127]:
predicted

array([0, 1, 0, ..., 0, 1, 1], dtype=int64)

In [131]:
# Naive Bayes scores again, for side-by-side comparison with the SVM scores in the next cell
print("Train Score: ", scam_detect.score(x_train, y_train))  # train score
print("Test Score: ", scam_detect.score(x_test, y_test)) #test score

Train Score:  0.9183142857142857
Test Score:  0.8631333333333333


In [130]:
print("Train Score: ", spam_detect_svm.score(x_train, y_train))  # train score
print("Test Score: ", spam_detect_svm.score(x_test, y_test)) #test score

Train Score:  0.9913714285714286
Test Score:  0.8952666666666667
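The SVM improves the test score from about 0.863 to about 0.895, but the gap between its train score (0.991) and test score suggests more overfitting than with Naive Bayes. `svm.SVC()` uses an RBF kernel by default, and its training time grows at least quadratically with the number of samples, so it can be slow on 35,000 rows. For sparse TF-IDF features, a linear SVM is a commonly tried alternative. The sketch below shows the API on hypothetical toy data. `LinearSVC` is a different model from the one trained above, so its scores on the IMDB data would have to be measured rather than assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the cleaned reviews and 0/1 labels
toy_texts = [
    "wonderful film loved it",
    "great acting great story",
    "terrible plot awful acting",
    "boring and bad",
]
toy_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
x_toy = vectorizer.fit_transform(toy_texts)

# C controls regularization strength; 1.0 is the library default
linear_svm = LinearSVC(C=1.0).fit(x_toy, toy_labels)

toy_train_score = linear_svm.score(x_toy, toy_labels)
new_prediction = linear_svm.predict(vectorizer.transform(["great story"]))
print(toy_train_score, new_prediction)
```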
