In [3]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 


In [4]:
df=pd.read_csv("IMDB Dataset.csv")


In [5]:
# Preview the first rows of the raw data
# (tag removal and lowercasing are done in later cells)
df.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [6]:
df.shape
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [7]:
df.duplicated().sum()

418

In [8]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [9]:
# Encode the target column numerically: positive -> 1, negative -> 0
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})  # run once only; re-running maps every value to NaN

In [10]:
df['sentiment']

0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64

In [11]:
# Text preprocessing, step 1: strip HTML tags such as <br />
import re
tag_pattern = re.compile(r'<.*?>')
def remove_tags(text):
    # Tags are deleted outright, so the words on either side of a tag get joined
    return tag_pattern.sub('', text)
df['review'] = df['review'].apply(remove_tags)

In [12]:
df['review']

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object
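
Deleting tags outright joins the words on either side of a `<br />`, which shows up later in the cleaned text as "methe first". A minimal sketch of one alternative, assuming it is acceptable to substitute a space for each tag and then collapse repeated whitespace. The name `remove_tags_spaced` is hypothetical and not part of the notebook:

```python
import re

TAG_RE = re.compile(r'<.*?>')   # same tag pattern as the cell above
WS_RE = re.compile(r'\s+')      # runs of whitespace

def remove_tags_spaced(text):
    # Replace each tag with a space so neighbouring words stay separate,
    # then collapse runs of whitespace into a single space
    return WS_RE.sub(' ', TAG_RE.sub(' ', text)).strip()

sample = "happened with me.<br /><br />The first thing"
cleaned = remove_tags_spaced(sample)
print(cleaned)
```

Swapping this in would change the downstream outputs shown in this notebook, so the printed results here still reflect the original `remove_tags`.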

In [13]:
# Convert the reviews to lowercase
df['review'] = df['review'].str.lower()

In [14]:
# Remove URLs from the reviews
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
def remove_url(text):
    return url_pattern.sub('', text)
df['review'] = df['review'].apply(remove_url)

In [15]:
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [16]:
# Remove punctuation from the reviews
import string
punc = string.punctuation
print(punc)
def remove_punc(text):
    # Delete each punctuation character; apostrophes go too, so "you'll" becomes "youll"
    for ch in punc:
        text = text.replace(ch, '')
    return text
df['review'] = df['review'].apply(remove_punc)



!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
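
The loop above makes one pass over the text per punctuation character. A sketch of a single-pass variant using `str.translate`, assuming the same `string.punctuation` set is wanted. `remove_punc_fast` and `PUNC_TABLE` are illustrative names, not part of the notebook:

```python
import string

# Translation table that maps every punctuation character to None (deletion)
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def remove_punc_fast(text):
    # One pass over the text, dropping every character listed in the table
    return text.translate(PUNC_TABLE)

print(remove_punc_fast("you'll be hooked. they are right!"))
```

On 50,000 reviews this is usually noticeably quicker than repeated `str.replace` calls, though the result should be identical.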


In [17]:
df['review'][0]


'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

In [71]:
# Load a slang dictionary (abbreviation -> expansion) from slang.txt
slangs = dict()
sep = re.compile(r'[:=-]')

with open('slang.txt', 'r') as file:
    for line in file:
        line = line.strip()
        # Split on the first separator only; a bare re.split raises ValueError
        # on lines whose expansion itself contains ':', '=' or '-'
        parts = sep.split(line, maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts
        slangs[key.strip()] = value.strip()
        print(key, value)
print(slangs)

AFAIK As Far As I Know
AFK Away From Keyboard
ASAP As Soon As Possible
ATK At The Keyboard
ATM At The Moment
A3 Anytime, Anywhere, Anyplace
BAK Back At Keyboard
BBL Be Back Later
BBS Be Back Soon
BFN Bye For Now
B4N Bye For Now
BRB Be Right Back
BRT Be Right There
BTW By The Way
B4 Before
B4N Bye For Now
CU See You
CUL8R See You Later
CYA See You
FAQ Frequently Asked Questions
FC Fingers Crossed
FWIW For What It's Worth
FYI For Your Information
GAL Get A Life
GG Good Game
GN Good Night
GMTA Great Minds Think Alike
GR8 Great!
G9 Genius
IC I See
ICQ I Seek you (also a chat program)
ILU I Love You
IMHO In My Honest/Humble Opinion
IMO In My Opinion
IOW In Other Words
IRL In Real Life
KISS Keep It Simple, Stupid
LDR Long Distance Relationship
LMAO Laugh My A.. Off
LOL Laughing Out Loud
LTNS Long Time No See
L8R Later
MTE My Thoughts Exactly
M8 Mate
NRN No Reply Necessary
OIC Oh I See
PITA Pain In The A..
PRT Party
PRW Parents Are Watching
ROFL Rolling On The Floor Laughing
ROFLOL Rolling On

In [19]:
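def replace_slangs(text, slangs_dict):
    # Expand each slang abbreviation that appears as a whole word.
    # Note: the match is case-sensitive and the keys are uppercase, while the
    # reviews were lowercased earlier, so few if any expansions fire here.
    for slang, meaning in slangs_dict.items():
        text = re.sub(r'\b' + re.escape(slang) + r'\b', meaning, text)
    return text
# Expand the slang terms in df['review']
df['review'] = df['review'].apply(lambda x: replace_slangs(x, slangs))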

In [20]:
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically theres a family where a little boy j...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    im going to have to disagree with the previous...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object
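
Because the slang keys are uppercase and the reviews are already lowercase, a case-sensitive whole-word match rarely fires. A hedged sketch of a case-insensitive expansion with one precompiled pattern; the two-entry dictionary is a stand-in for `slang.txt`, and `build_slang_pattern` and `expand_slang` are hypothetical helper names:

```python
import re

def build_slang_pattern(slang_dict):
    # One alternation over all keys, matched as whole words, ignoring case
    keys = sorted(slang_dict, key=len, reverse=True)
    body = '|'.join(re.escape(k) for k in keys)
    return re.compile(r'\b(?:' + body + r')\b', re.IGNORECASE)

def expand_slang(text, slang_dict, pattern):
    # Look up each match by its lowercased form
    lookup = {k.lower(): v for k, v in slang_dict.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

demo_slangs = {"BTW": "By The Way", "GR8": "Great"}  # stand-in entries
demo_pattern = build_slang_pattern(demo_slangs)
expanded = expand_slang("btw this movie was gr8", demo_slangs, demo_pattern)
print(expanded)
```

Ignoring case has a cost: keys such as `KISS` or `FAQ` would then also rewrite ordinary lowercase words, so the dictionary may need pruning before this is applied to reviews.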

In [21]:
df.duplicated().sum()  # not displayed: only a cell's last expression is echoed
# Remove the duplicated rows from the dataset
df.drop_duplicates(inplace=True)

In [22]:
df.duplicated().sum()  # confirm that no duplicates remain

0

In [23]:
# Spelling correction with TextBlob
from textblob import TextBlob
text = 'thi ia beatiful animl '
tb = TextBlob(text)

print(tb.correct().string)
def spe_corr(text):
    # Return the text with TextBlob's best-guess spelling corrections applied
    return TextBlob(text).correct().string

the in beautiful animal 


In [24]:
# Spelling correction over the whole dataset is skipped for now because it is very slow
# df['review'] = df['review'].apply(spe_corr)


In [25]:
# Remove stopwords from the reviews
import nltk
from nltk.corpus import stopwords

# Download the NLTK stopwords corpus
nltk.download('stopwords')

# Build the stopword set once rather than on every call
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Tokenize on whitespace and drop the stopwords
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Example usage
# text = "This is a simple sentence with stop words."
# filtered_text = remove_stopwords(text)
# print("Filtered Text:", filtered_text)

# Apply to the dataset
df['review'] = df['review'].apply(remove_stopwords)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aashi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
df['review']

0        one reviewers mentioned watching 1 oz episode ...
1        wonderful little production filming technique ...
2        thought wonderful way spend time hot summer we...
3        basically theres family little boy jake thinks...
4        petter matteis love time money visually stunni...
                               ...                        
49995    thought movie right good job wasnt creative or...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    catholic taught parochial elementary schools n...
49998    im going disagree previous comment side maltin...
49999    one expects star trek movies high art fans exp...
Name: review, Length: 49580, dtype: object

In [27]:
# Vectorize the reviews with TF-IDF, keeping the 10,000 most frequent terms
from sklearn.feature_extraction.text import TfidfVectorizer
tfv=TfidfVectorizer(max_features=10000)

In [28]:
# Note: .toarray() densifies the sparse matrix (49580 x 10000 float64 values, about 4 GB)
x=tfv.fit_transform(df['review']).toarray()

In [29]:
y=df['sentiment']

In [30]:
# Count the non-zero TF-IDF entries in the first review's vector
count = 0

for i in x[0]:
    if i != 0:
        count += 1
    print(i)
# print(count)  # a small non-zero count confirms the vector is sparse, as expected


0.0
0.0
0.0
...
0.0
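
Printing every entry of a 10,000-wide vector produces mostly zeros. A sketch of a more direct sparsity check that keeps the TF-IDF output as a SciPy sparse matrix and reads the stored-entry count from `.nnz`. The three toy documents and the `toy_` names are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for df['review']
toy_docs = [
    "good movie good acting",
    "bad plot bad acting",
    "wonderful little production",
]

toy_tfv = TfidfVectorizer()
toy_x = toy_tfv.fit_transform(toy_docs)  # sparse matrix, never densified

nonzero_first = toy_x[0].nnz   # stored (non-zero) entries in the first row
vocab_size = toy_x.shape[1]    # number of distinct terms
print(nonzero_first, "of", vocab_size, "features are non-zero")
```

The same `.nnz` check should work on the notebook's matrix if the `.toarray()` call is dropped, which would also avoid holding the dense array in memory.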


In [31]:
# Split the data into training and test sets (80/20)
from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=34)

In [32]:
# Train a logistic regression model
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(x_train,y_train)
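
In the cells above the vectorizer is fitted on every review before the split, so its IDF weights have already seen the test reviews. One way to avoid that, sketched here on an invented toy corpus, is to split the raw text first and fit a scikit-learn pipeline on the training part only. The texts, labels and variable names below are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy data standing in for df['review'] and df['sentiment']
texts = ["great film", "wonderful acting", "loved it",
         "terrible plot", "awful acting", "hated it"] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

# Split the raw text, so the vectorizer never sees the test reviews
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=34, stratify=labels
)

# The vectorizer and the classifier are fitted together on the training split
pipe = make_pipeline(TfidfVectorizer(max_features=10000),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

acc = pipe.score(X_te, y_te)
print("held-out accuracy:", acc)
```

A fitted pipeline also accepts raw strings directly, for example `pipe.predict(["some new review"])`, so no separate transform step is needed at prediction time.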

In [33]:


# To predict on new text, first transform it with the fitted vectorizer, then pass it to the model

In [34]:
x1=x[0]
x2=x[1]
x4=x[3]
x5=x[4]

In [35]:
# Try a few predictions (only the last expression's result is displayed)
lr.predict([x1])
lr.predict([x2])
lr.predict([x4])
lr.predict([x5])

array([1], dtype=int64)

In [36]:
lr.score(x_test,y_test)

0.8938079870915692

In [37]:
# Try the model on arbitrary text
s='my name is aashish and i am very angry and i am not staisfied by your work' 

temp=tfv.transform([s]).toarray()

In [38]:
print(lr.predict(temp))
# Next, try some full reviews; wrap the transform-and-predict steps
# in a helper so they are not repeated each time
def predict_(text):
    temp=tfv.transform([text]).toarray()
    return lr.predict(temp)

[0]


In [39]:
predict_('''I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.''')
print(predict_('''"Okay, last night, August 18th, 2004, I had the distinct displeasure of meeting Mr. Van Bebble at a showing of the film The Manson Family at the Three Penny in Chicago as part of the Chicago Underground Film Festival. Here's what I have to say about it. First of all, the film is an obvious rip off of every Kenneth Anger, Roman Polanski, Oliver Stone and Terry Gilliam movie I've ever seen. Second of all, in a short Q & A session after the show Mr. Van Bebble immediately stated that he never made any contact with the actual Manson Family members or Charlie himself, calling them liars and saying he wanted nothing to do with them, that the film was based on his (Van Bebble's) take on the trial having seen it all from his living room on TV and in the news (and I'm assuming from the Autobiography and the book Helter Skelter which were directly mimicked through the narrative). So I had second dibs on questions, I asked if he was trying to present the outsider, Mtv, sex drugs and rock 'n roll version and not necessarily the true story. This question obviously pissed off the by now sloshed director who started shouting ""f*** you, shut the f*** up, this is the truth! All those other movies are bullsh**!""<br /><br />Well anyway, I didn't even think about how ridiculous this was until the next day when I read the tagline for the film, ""You've heard the laws side of the story...now hear the story as it is told by the Manson Family."" Excuse me, if this guy has never even spoken to the family and considers them to be liars that he doesn't want to have anything to do with, how in God's name can he tell the story for them!? This is the most ridiculous statement I have ever heard! 
The film was obviously catered to the sex drugs and rock 'n roll audience that it had no trouble in attracting to the small, dimly lit theatre, and was even more obviously spawned by the sex drugs and rock 'n roll mind of a man who couldn't even watch his own film without getting up every ten minutes to go get more beer or to shout some sort of Rocky Horroresque call line to the actors on screen. This film accomplishes little more than warping the public's image of actual events (which helped shape the state of America and much of the world today) into some sort of Slasher/Comic Book/Porno/Rape fantasy dreamed up by an obviously shallow individual.<br /><br />The film was definitely very impressive to look at. The soundtrack was refreshing as it contained actual samples of Charlie's work with the Family off of his Lie album. The editing was nice and choppy to simulate the nauseating uncertainty of most modern music videos. All in all this film would have made a much better addition to the catalogues at Mtv than to the Underground Film Festival or for that matter the minds of any intellectual observers. I felt like I was at a midnight Rocky Horror viewing the way the audience was dressed and behaving (probably the best part of the experience). The cast was very good with the exception of Charlie who resembled some sort of stoned Dungeons and Dragons enthusiast more than the actual role he was portraying. The descriptions the film gave of him as full of energy, throwing ten things at you and being very physical about it all the while did not match at all the slow, lethargic, and chubby representation that was actually presented.<br /><br />All in all the film basically explains itself as Sadie (or maybe it was Linda) declares at the end, ""You can write a bunch of bullsh** books or make a bunch of bullsh** movies...etc. etc."" Case in point. 
Even the disclaimer ""Based on a True Story"" is a dead giveaway, signalling that somewhere beneath this psychedelic garbage heap lay the foundation of an actual story with content that will make and has made a difference in the world. All you have to do is a little bit of alchemy to separate the truth from the the crap, or actually, maybe you could just avoid it all together and go read a book instead.<br /><br />All I can say is this, when the film ended I got a free beer so I'm glad I went, but not so glad I spent fifteen dollars on my ticket to be told to shut the f*** up for asking the director a question. Peace."'''))

[0]


In [40]:
# Next step: a helper that applies the same preprocessing to raw input text before prediction
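
A minimal sketch of such a helper, chaining the earlier cleaning steps in the same order (tags, lowercase, URLs, punctuation, stopwords). It uses only the standard library: the URL pattern is a simplified stand-in for the notebook's regex, the stopword set is passed in by the caller, and `preprocess` is a hypothetical name:

```python
import re
import string

TAG_RE = re.compile(r'<.*?>')
URL_RE = re.compile(r'https?://\S+')  # simplified URL pattern (assumption)
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def preprocess(text, stop_words=frozenset()):
    text = TAG_RE.sub('', text)        # strip HTML tags
    text = text.lower()                # lowercase
    text = URL_RE.sub('', text)        # drop URLs
    text = text.translate(PUNC_TABLE)  # drop punctuation
    # Tokenize on whitespace and filter out the supplied stopwords
    words = [w for w in text.split() if w not in stop_words]
    return ' '.join(words)

raw = "A <b>GREAT</b> film! See http://example.com/x now."
print(preprocess(raw, stop_words={"a"}))
```

With the notebook's objects this could be used as `lr.predict(tfv.transform([preprocess(text, stop_words)]))`, assuming `stop_words` holds the NLTK English stopword set.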


In [41]:
# Train a second model using the random forest algorithm
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42,n_jobs=-1,max_depth=None,max_features='log2')
rf_model.fit(x_train, y_train)
y_pred = rf_model.predict(x_test)


In [42]:
print(rf_model.score(x_test,y_test))

0.8492335619201291


In [43]:
def rf_predict_(text):
    temp=tfv.transform([text]).toarray()
    return rf_model.predict(temp)
rf_model.predict([x1])
rf_model.predict([x2])
rf_model.predict([x4])
# rf_model.predict([x5])

array([0], dtype=int64)

In [44]:
rf_predict_('''I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.''')
print(rf_predict_('''"Okay, last night, August 18th, 2004, I had the distinct displeasure of meeting Mr. Van Bebble at a showing of the film The Manson Family at the Three Penny in Chicago as part of the Chicago Underground Film Festival. Here's what I have to say about it. First of all, the film is an obvious rip off of every Kenneth Anger, Roman Polanski, Oliver Stone and Terry Gilliam movie I've ever seen. Second of all, in a short Q & A session after the show Mr. Van Bebble immediately stated that he never made any contact with the actual Manson Family members or Charlie himself, calling them liars and saying he wanted nothing to do with them, that the film was based on his (Van Bebble's) take on the trial having seen it all from his living room on TV and in the news (and I'm assuming from the Autobiography and the book Helter Skelter which were directly mimicked through the narrative). So I had second dibs on questions, I asked if he was trying to present the outsider, Mtv, sex drugs and rock 'n roll version and not necessarily the true story. This question obviously pissed off the by now sloshed director who started shouting ""f*** you, shut the f*** up, this is the truth! All those other movies are bullsh**!""<br /><br />Well anyway, I didn't even think about how ridiculous this was until the next day when I read the tagline for the film, ""You've heard the laws side of the story...now hear the story as it is told by the Manson Family."" Excuse me, if this guy has never even spoken to the family and considers them to be liars that he doesn't want to have anything to do with, how in God's name can he tell the story for them!? This is the most ridiculous statement I have ever heard! 
The film was obviously catered to the sex drugs and rock 'n roll audience that it had no trouble in attracting to the small, dimly lit theatre, and was even more obviously spawned by the sex drugs and rock 'n roll mind of a man who couldn't even watch his own film without getting up every ten minutes to go get more beer or to shout some sort of Rocky Horroresque call line to the actors on screen. This film accomplishes little more than warping the public's image of actual events (which helped shape the state of America and much of the world today) into some sort of Slasher/Comic Book/Porno/Rape fantasy dreamed up by an obviously shallow individual.<br /><br />The film was definitely very impressive to look at. The soundtrack was refreshing as it contained actual samples of Charlie's work with the Family off of his Lie album. The editing was nice and choppy to simulate the nauseating uncertainty of most modern music videos. All in all this film would have made a much better addition to the catalogues at Mtv than to the Underground Film Festival or for that matter the minds of any intellectual observers. I felt like I was at a midnight Rocky Horror viewing the way the audience was dressed and behaving (probably the best part of the experience). The cast was very good with the exception of Charlie who resembled some sort of stoned Dungeons and Dragons enthusiast more than the actual role he was portraying. The descriptions the film gave of him as full of energy, throwing ten things at you and being very physical about it all the while did not match at all the slow, lethargic, and chubby representation that was actually presented.<br /><br />All in all the film basically explains itself as Sadie (or maybe it was Linda) declares at the end, ""You can write a bunch of bullsh** books or make a bunch of bullsh** movies...etc. etc."" Case in point. 
Even the disclaimer ""Based on a True Story"" is a dead giveaway, signalling that somewhere beneath this psychedelic garbage heap lay the foundation of an actual story with content that will make and has made a difference in the world. All you have to do is a little bit of alchemy to separate the truth from the the crap, or actually, maybe you could just avoid it all together and go read a book instead.<br /><br />All I can say is this, when the film ended I got a free beer so I'm glad I went, but not so glad I spent fifteen dollars on my ticket to be told to shut the f*** up for asking the director a question. Peace."'''))

[1]


In [45]:
# Apply the naive Bayes algorithm to the same data and check the model's accuracy
from sklearn.naive_bayes import MultinomialNB 
nb=MultinomialNB()
nb.fit(x_train,y_train)

In [46]:
def nb_predict_(text):
    temp=tfv.transform([text]).toarray()
    return nb.predict(temp)
nb.predict([x1])
nb.predict([x2]) #this is positive 
nb.predict([x4]) #this is negative
nb_predict_('''I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.''')
print(nb_predict_('''"Okay, last night, August 18th, 2004, I had the distinct displeasure of meeting Mr. Van Bebble at a showing of the film The Manson Family at the Three Penny in Chicago as part of the Chicago Underground Film Festival. Here's what I have to say about it. First of all, the film is an obvious rip off of every Kenneth Anger, Roman Polanski, Oliver Stone and Terry Gilliam movie I've ever seen. Second of all, in a short Q & A session after the show Mr. Van Bebble immediately stated that he never made any contact with the actual Manson Family members or Charlie himself, calling them liars and saying he wanted nothing to do with them, that the film was based on his (Van Bebble's) take on the trial having seen it all from his living room on TV and in the news (and I'm assuming from the Autobiography and the book Helter Skelter which were directly mimicked through the narrative). So I had second dibs on questions, I asked if he was trying to present the outsider, Mtv, sex drugs and rock 'n roll version and not necessarily the true story. This question obviously pissed off the by now sloshed director who started shouting ""f*** you, shut the f*** up, this is the truth! All those other movies are bullsh**!""<br /><br />Well anyway, I didn't even think about how ridiculous this was until the next day when I read the tagline for the film, ""You've heard the laws side of the story...now hear the story as it is told by the Manson Family."" Excuse me, if this guy has never even spoken to the family and considers them to be liars that he doesn't want to have anything to do with, how in God's name can he tell the story for them!? This is the most ridiculous statement I have ever heard! 
The film was obviously catered to the sex drugs and rock 'n roll audience that it had no trouble in attracting to the small, dimly lit theatre, and was even more obviously spawned by the sex drugs and rock 'n roll mind of a man who couldn't even watch his own film without getting up every ten minutes to go get more beer or to shout some sort of Rocky Horroresque call line to the actors on screen. This film accomplishes little more than warping the public's image of actual events (which helped shape the state of America and much of the world today) into some sort of Slasher/Comic Book/Porno/Rape fantasy dreamed up by an obviously shallow individual.<br /><br />The film was definitely very impressive to look at. The soundtrack was refreshing as it contained actual samples of Charlie's work with the Family off of his Lie album. The editing was nice and choppy to simulate the nauseating uncertainty of most modern music videos. All in all this film would have made a much better addition to the catalogues at Mtv than to the Underground Film Festival or for that matter the minds of any intellectual observers. I felt like I was at a midnight Rocky Horror viewing the way the audience was dressed and behaving (probably the best part of the experience). The cast was very good with the exception of Charlie who resembled some sort of stoned Dungeons and Dragons enthusiast more than the actual role he was portraying. The descriptions the film gave of him as full of energy, throwing ten things at you and being very physical about it all the while did not match at all the slow, lethargic, and chubby representation that was actually presented.<br /><br />All in all the film basically explains itself as Sadie (or maybe it was Linda) declares at the end, ""You can write a bunch of bullsh** books or make a bunch of bullsh** movies...etc. etc."" Case in point. 
Even the disclaimer ""Based on a True Story"" is a dead giveaway, signalling that somewhere beneath this psychedelic garbage heap lay the foundation of an actual story with content that will make and has made a difference in the world. All you have to do is a little bit of alchemy to separate the truth from the the crap, or actually, maybe you could just avoid it all together and go read a book instead.<br /><br />All I can say is this, when the film ended I got a free beer so I'm glad I went, but not so glad I spent fifteen dollars on my ticket to be told to shut the f*** up for asking the director a question. Peace."'''))

[1]


In [47]:
print(nb.score(x_test,y_test))

0.8615369100443727


In [48]:
# Apply the XGBoost algorithm


In [49]:
from xgboost import XGBClassifier
xgb=XGBClassifier()
xgb.fit(x_train,y_train)

In [50]:
xgb.score(x_test,y_test)

0.8561920129084308

In [51]:
def xgb_predict_(text):
    temp=tfv.transform([text]).toarray()
    return xgb.predict(temp)
print(xgb_predict_('''"Okay, last night, August 18th, 2004, I had the distinct displeasure of meeting Mr. Van Bebble at a showing of the film The Manson Family at the Three Penny in Chicago as part of the Chicago Underground Film Festival. Here's what I have to say about it. First of all, the film is an obvious rip off of every Kenneth Anger, Roman Polanski, Oliver Stone and Terry Gilliam movie I've ever seen. Second of all, in a short Q & A session after the show Mr. Van Bebble immediately stated that he never made any contact with the actual Manson Family members or Charlie himself, calling them liars and saying he wanted nothing to do with them, that the film was based on his (Van Bebble's) take on the trial having seen it all from his living room on TV and in the news (and I'm assuming from the Autobiography and the book Helter Skelter which were directly mimicked through the narrative). So I had second dibs on questions, I asked if he was trying to present the outsider, Mtv, sex drugs and rock 'n roll version and not necessarily the true story. This question obviously pissed off the by now sloshed director who started shouting ""f*** you, shut the f*** up, this is the truth! All those other movies are bullsh**!""<br /><br />Well anyway, I didn't even think about how ridiculous this was until the next day when I read the tagline for the film, ""You've heard the laws side of the story...now hear the story as it is told by the Manson Family."" Excuse me, if this guy has never even spoken to the family and considers them to be liars that he doesn't want to have anything to do with, how in God's name can he tell the story for them!? This is the most ridiculous statement I have ever heard! 
The film was obviously catered to the sex drugs and rock 'n roll audience that it had no trouble in attracting to the small, dimly lit theatre, and was even more obviously spawned by the sex drugs and rock 'n roll mind of a man who couldn't even watch his own film without getting up every ten minutes to go get more beer or to shout some sort of Rocky Horroresque call line to the actors on screen. This film accomplishes little more than warping the public's image of actual events (which helped shape the state of America and much of the world today) into some sort of Slasher/Comic Book/Porno/Rape fantasy dreamed up by an obviously shallow individual.<br /><br />The film was definitely very impressive to look at. The soundtrack was refreshing as it contained actual samples of Charlie's work with the Family off of his Lie album. The editing was nice and choppy to simulate the nauseating uncertainty of most modern music videos. All in all this film would have made a much better addition to the catalogues at Mtv than to the Underground Film Festival or for that matter the minds of any intellectual observers. I felt like I was at a midnight Rocky Horror viewing the way the audience was dressed and behaving (probably the best part of the experience). The cast was very good with the exception of Charlie who resembled some sort of stoned Dungeons and Dragons enthusiast more than the actual role he was portraying. The descriptions the film gave of him as full of energy, throwing ten things at you and being very physical about it all the while did not match at all the slow, lethargic, and chubby representation that was actually presented.<br /><br />All in all the film basically explains itself as Sadie (or maybe it was Linda) declares at the end, ""You can write a bunch of bullsh** books or make a bunch of bullsh** movies...etc. etc."" Case in point. 
Even the disclaimer ""Based on a True Story"" is a dead giveaway, signalling that somewhere beneath this psychedelic garbage heap lay the foundation of an actual story with content that will make and has made a difference in the world. All you have to do is a little bit of alchemy to separate the truth from the the crap, or actually, maybe you could just avoid it all together and go read a book instead.<br /><br />All I can say is this, when the film ended I got a free beer so I'm glad I went, but not so glad I spent fifteen dollars on my ticket to be told to shut the f*** up for asking the director a question. Peace."'''))

[1]


In [52]:
# from sklearn.svm import SVC
# model = SVC(kernel='linear', C=1.0, probability=True)
# model.fit(x_train, y_train)
# y_pred = model.predict(x_test)


In [53]:
# model.score(x_test,y_test)

In [54]:
# Logistic regression gives the best accuracy so far, so let's run cross-validation
# and hyperparameter tuning on it for the best result

In [None]:
# from sklearn.model_selection import GridSearchCV, cross_val_score
# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import accuracy_score, classification_report

# # Initialize Logistic Regression model
# logreg = LogisticRegression(random_state=42)

# # Define hyperparameter grid for tuning
# param_grid = {
#     'C': [0.001, 0.01, 0.1, 1, 10, 100],    # Regularization strength
#     'penalty': ['l1', 'l2'],               # Regularization type
#     'solver': ['liblinear'],               # Solver compatible with l1 and l2
#     'class_weight': [None, 'balanced']     # Handle class imbalance
# }

# # Perform GridSearch with Cross-Validation
# grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
# grid_search.fit(x_train, y_train)

# # Print best hyperparameters
# print("Best Hyperparameters:", grid_search.best_params_)

# # Get the best estimator from GridSearch
# best_logreg = grid_search.best_estimator_

# # Evaluate model on training set using cross-validation
# cv_scores = cross_val_score(best_logreg, x_train, y_train, cv=5, scoring='accuracy')
# print(f"Cross-Validation Accuracy: {cv_scores.mean():.2f} ± {cv_scores.std():.2f}")

# # Test the model on the test set
# y_pred = best_logreg.predict(x_test)
# test_accuracy = accuracy_score(y_test, y_pred)

# print(f"Test Accuracy: {test_accuracy:.2f}")
# print("\nClassification Report:\n", classification_report(y_test, y_pred))


In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score

# Define the model
logreg = LogisticRegression(random_state=42)

# Define a simplified parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['liblinear']  # Use liblinear for efficiency with small datasets
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=2, verbose=2)

# Perform grid search on training data
grid_search.fit(x_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Perform cross-validation on the best model
cv_scores = cross_val_score(best_model, x_train, y_train, cv=5, scoring='accuracy')

# Evaluate on the test set
y_pred = best_model.predict(x_test)
test_accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters from Grid Search:", grid_search.best_params_)
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Accuracy:", cv_scores.mean())
print("Test Set Accuracy:", test_accuracy)


Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best Parameters from Grid Search: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
Cross-Validation Scores: [0.89222236 0.89020547 0.88327241 0.89033153 0.89220877]
Mean CV Accuracy: 0.8896481081249107
Test Set Accuracy: 0.8932029043969343


In [61]:
print(cv_scores)

[0.89222236 0.89020547 0.88327241 0.89033153 0.89220877]
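
The fold scores above are easier to compare against other models once they are summarized as a mean and a spread. A minimal sketch, using only the standard library and the scores copied from the output above (the name `fold_scores` is illustrative and not part of the notebook):

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross-validation output above
fold_scores = [0.89222236, 0.89020547, 0.88327241, 0.89033153, 0.89220877]

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)  # sample standard deviation across the five folds

print(f"CV accuracy: {cv_mean:.4f} +/- {cv_std:.4f}")
```

A small spread across folds, together with a test accuracy close to the cross-validation mean, suggests the model is not strongly overfitting the training split.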


In [70]:
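import re
import joblib

# Build a slang -> meaning dictionary from slang.txt.
# Each line holds a slang term and its meaning, separated by ':', '=' or '-'.
sep = re.compile(r'[:=-]')
slangs = dict()

with open('slang.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first separator only, so a meaning that contains ':' or '-' stays intact
        key, value = sep.split(line, maxsplit=1)
        slangs[key.strip()] = value.strip()
        print(key, value)

# Dump the whole dictionary so it can be loaded directly whenever it is needed,
# for example when replacing slang in new text
joblib.dump(slangs, 'slang_dictionary.pkl')
slangs = joblib.load('slang_dictionary.pkl')
print(slangs)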

AFAIK As Far As I Know
AFK Away From Keyboard
ASAP As Soon As Possible
ATK At The Keyboard
ATM At The Moment
A3 Anytime, Anywhere, Anyplace
BAK Back At Keyboard
BBL Be Back Later
BBS Be Back Soon
BFN Bye For Now
B4N Bye For Now
BRB Be Right Back
BRT Be Right There
BTW By The Way
B4 Before
B4N Bye For Now
CU See You
CUL8R See You Later
CYA See You
FAQ Frequently Asked Questions
FC Fingers Crossed
FWIW For What It's Worth
FYI For Your Information
GAL Get A Life
GG Good Game
GN Good Night
GMTA Great Minds Think Alike
GR8 Great!
G9 Genius
IC I See
ICQ I Seek you (also a chat program)
ILU I Love You
IMHO In My Honest/Humble Opinion
IMO In My Opinion
IOW In Other Words
IRL In Real Life
KISS Keep It Simple, Stupid
LDR Long Distance Relationship
LMAO Laugh My A.. Off
LOL Laughing Out Loud
LTNS Long Time No See
L8R Later
MTE My Thoughts Exactly
M8 Mate
NRN No Reply Necessary
OIC Oh I See
PITA Pain In The A..
PRT Party
PRW Parents Are Watching
ROFL Rolling On The Floor Laughing
ROFLOL Rolling On
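
The cell above mentions a `replace_slangs(text, slangs_dict)` helper but never defines it. The following is one possible sketch, not the notebook's own implementation: it compiles all slang keys into a single pattern with word boundaries so that, for example, `ATM` is not replaced inside a longer word. The small `sample_slangs` dictionary is a stand-in for the full dictionary loaded from `slang_dictionary.pkl`.

```python
import re

def replace_slangs(text, slangs_dict):
    """Replace whole-word slang terms in text with their meanings (hypothetical helper)."""
    if not slangs_dict:
        return text
    # Try longer keys first so that ROFLOL is preferred over ROFL
    keys = sorted(slangs_dict, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b')
    # Use a function as the replacement so meanings are inserted literally
    return pattern.sub(lambda m: slangs_dict[m.group(1)], text)

# Stand-in for the dictionary built from slang.txt
sample_slangs = {'BTW': 'By The Way', 'IMO': 'In My Opinion', 'GR8': 'Great!'}

result = replace_slangs('BTW the film was GR8 IMO', sample_slangs)
print(result)
```

Matching is case-sensitive here, so the keys only match text that has not been lowercased. Some meanings, such as `Great!`, also contain punctuation, which matters if slang replacement runs after punctuation removal.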

In [89]:
# let's also store the logistic regression model and the TF-IDF vectorizer
joblib.dump(tfv,'tfidf_vectorizer.pkl')
joblib.dump(lr,'log_reg_model.pkl')

['log_reg_model.pkl']

In [91]:
# Apply the same preprocessing to new input text that was applied to the training dataset
import string
import re
import joblib
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

def preprocessing(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove URLs
    url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    text = url_pattern.sub('', text)

    # Remove punctuation
    for ch in string.punctuation:
        text = text.replace(ch, '')

    # Replace slang terms with their meanings, using the saved slang dictionary
    slangs = joblib.load('slang_dictionary.pkl')
    for slang, meaning in slangs.items():
        text = re.sub(r'\b' + re.escape(slang) + r'\b', meaning, text)

    # Correct spelling
    text = TextBlob(text).correct().string

    # Remove English stop words
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in text.split() if word.lower() not in stop_words]
    text = " ".join(filtered_words)

    # Transform the cleaned text with the saved TF-IDF vectorizer
    tfv = joblib.load('tfidf_vectorizer.pkl')
    return tfv.transform([text]).toarray()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aashi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
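
The tag-removal step in `preprocessing` depends on the non-greedy quantifier in `<.*?>`. A small standard-library sketch of the difference, on a made-up sample string, followed by punctuation stripping with `str.translate` as an alternative to replacing characters one at a time:

```python
import re
import string

sample = 'Good <b>acting</b>, weak plot.'

# Non-greedy: each match stops at the first '>', so only the tags are removed
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: the match runs to the last '>', which also swallows the text between tags
greedy = re.sub(r'<.*>', '', sample)

# Strip all ASCII punctuation in a single pass
no_punct = non_greedy.translate(str.maketrans('', '', string.punctuation))

print(non_greedy)  # Good acting, weak plot.
print(greedy)      # Good , weak plot.
print(no_punct)    # Good acting weak plot
```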


In [92]:
# Function that takes preprocessed input and returns the predicted sentiment label
def predict_(text):
    # Load the saved logistic regression model and predict
    lr = joblib.load('log_reg_model.pkl')
    pred = lr.predict(text)

    # predict returns one label per input row; a single row is passed in here
    if pred[0] == 1:
        return 'POSITIVE SENTIMENTS'
    else:
        return 'NEGATIVE SENTIMENTS'

In [93]:
# let's run the prediction on a sample review
text='''Okay, last night, August 18th, 2004, I had the distinct displeasure of meeting Mr. Van Bebble at a showing of the film The Manson Family at the Three Penny in Chicago as part of the Chicago Underground Film Festival. Here's what I have to say about it. First of all, the film is an obvious rip off of every Kenneth Anger, Roman Polanski, Oliver Stone and Terry Gilliam movie I've ever seen. Second of all, in a short Q & A session after the show Mr. Van Bebble immediately stated that he never made any contact with the actual Manson Family members or Charlie himself, calling them liars and saying he wanted nothing to do with them, that the film was based on his (Van Bebble's) take on the trial having seen it all from his living room on TV and in the news (and I'm assuming from the Autobiography and the book Helter Skelter which were directly mimicked through the narrative). So I had second dibs on questions, I asked if he was trying to present the outsider, Mtv, sex drugs and rock 'n roll version and not necessarily the true story. This question obviously pissed off the by now sloshed director who started shouting ""f*** you, shut the f*** up, this is the truth! All those other movies are bullsh**!""<br /><br />Well anyway, I didn't even think about how ridiculous this was until the next day when I read the tagline for the film, ""You've heard the laws side of the story...now hear the story as it is told by the Manson Family."" Excuse me, if this guy has never even spoken to the family and considers them to be liars that he doesn't want to have anything to do with, how in God's name can he tell the story for them!? This is the most ridiculous statement I have ever heard! 
The film was obviously catered to the sex drugs and rock 'n roll audience that it had no trouble in attracting to the small, dimly lit theatre, and was even more obviously spawned by the sex drugs and rock 'n roll mind of a man who couldn't even watch his own film without getting up every ten minutes to go get more beer or to shout some sort of Rocky Horroresque call line to the actors on screen. This film accomplishes little more than warping the public's image of actual events (which helped shape the state of America and much of the world today) into some sort of Slasher/Comic Book/Porno/Rape fantasy dreamed up by an obviously shallow individual.<br /><br />The film was definitely very impressive to look at. The soundtrack was refreshing as it contained actual samples of Charlie's work with the Family off of his Lie album. The editing was nice and choppy to simulate the nauseating uncertainty of most modern music videos. All in all this film would have made a much better addition to the catalogues at Mtv than to the Underground Film Festival or for that matter the minds of any intellectual observers. I felt like I was at a midnight Rocky Horror viewing the way the audience was dressed and behaving (probably the best part of the experience). The cast was very good with the exception of Charlie who resembled some sort of stoned Dungeons and Dragons enthusiast more than the actual role he was portraying. The descriptions the film gave of him as full of energy, throwing ten things at you and being very physical about it all the while did not match at all the slow, lethargic, and chubby representation that was actually presented.<br /><br />All in all the film basically explains itself as Sadie (or maybe it was Linda) declares at the end, ""You can write a bunch of bullsh** books or make a bunch of bullsh** movies...etc. etc."" Case in point. 
Even the disclaimer ""Based on a True Story"" is a dead giveaway, signalling that somewhere beneath this psychedelic garbage heap lay the foundation of an actual story with content that will make and has made a difference in the world. All you have to do is a little bit of alchemy to separate the truth from the the crap, or actually, maybe you could just avoid it all together and go read a book instead.<br /><br />All I can say is this, when the film ended I got a free beer so I'm glad I went, but not so glad I spent fifteen dollars on my ticket to be told to shut the f*** up for asking the director a question. Peace.'''
predict_(preprocessing(text))

'NEGATIVE SENTIMENTS'