![sentiment_analysis.jpg](attachment:sentiment_analysis.jpg)

## Importing Necessary Libraries

In [94]:
import numpy as np
import pandas as pd
import re 
import string
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim
from nltk import sent_tokenize
from gensim.utils import simple_preprocess
from tqdm import tqdm
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## IMDB 50K Reviews

This dataset is for binary sentiment classification and contains substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.
For more information about the dataset, see:
http://ai.stanford.edu/~amaas/data/sentiment/

The dataset is available on Kaggle:
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

### Load dataset

In [2]:
df=pd.read_csv('IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
df.shape

(50000, 2)

In [4]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [5]:
review=df['review'][0]
review

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

## Text Preprocessing
- Remove HTML tags
- Remove punctuation
- Remove stopwords
- Tokenization
- Stemming

**Remove HTML Tags**

In [6]:
import re
regex = re.compile(r'<[^>]+>')
def remove_html(text):
    # delete anything that looks like an HTML tag, e.g. <br />
    return regex.sub('',text)

In [7]:
remove_html(review)

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

In [8]:
df['review']=df['review'].apply(remove_html)

In [9]:
df['review'][2]

'I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.This was the most I\'d laughed at one of Woody\'s comedies in years (dare I say a decade?). While I\'ve never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.'
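
The output above shows a side effect of deleting tags outright: the words on either side of a `<br />` get glued together ("loveThis", "womanThis"). A minimal sketch of one alternative, assuming it is acceptable to replace each tag with a space and then collapse repeated whitespace. `remove_html_keep_spacing` is a hypothetical name and is not used elsewhere in this notebook; switching to it would change the outputs shown in later cells.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')
SPACE_RE = re.compile(r'\s+')

def remove_html_keep_spacing(text):
    # Swap each tag for a space so the words on either side stay separate
    no_tags = TAG_RE.sub(' ', text)
    # Collapse the resulting runs of whitespace and trim the ends
    return SPACE_RE.sub(' ', no_tags).strip()

sample = "grown to love.<br /><br />This was the most I'd laughed"
print(remove_html_keep_spacing(sample))
# grown to love. This was the most I'd laughed
```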

**Remove Punctuation**

In [10]:
import string
exclude=string.punctuation

In [11]:
def remove_punct(text):
    result = text.translate(str.maketrans('','',string.punctuation))
    return result

In [12]:
remove_punct('Hii !!!???@#$%^')

'Hii '

In [13]:
remove_punct(review)

'One of the other reviewers has mentioned that after watching just 1 Oz episode youll be hooked They are right as this is exactly what happened with mebr br The first thing that struck me about Oz was its brutality and unflinching scenes of violence which set in right from the word GO Trust me this is not a show for the faint hearted or timid This show pulls no punches with regards to drugs sex or violence Its is hardcore in the classic use of the wordbr br It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary It focuses mainly on Emerald City an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda Em City is home to manyAryans Muslims gangstas Latinos Christians Italians Irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awaybr br I would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare Fo

In [14]:
df['review']=df['review'].apply(remove_punct)

In [15]:
df['review'][2]

'I thought this was a wonderful way to spend time on a too hot summer weekend sitting in the air conditioned theater and watching a lighthearted comedy The plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer While some may be disappointed when they realize this is not Match Point 2 Risk Addiction I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to loveThis was the most Id laughed at one of Woodys comedies in years dare I say a decade While Ive never been impressed with Scarlet Johanson in this she managed to tone down her sexy image and jumped right into a average but spirited young womanThis may not be the crown jewel of his career but it was wittier than Devil Wears Prada and more interesting than Superman a great comedy to go see with friends'

**Remove Stopwords**

In [16]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [17]:
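stop_words = set(stopwords.words('english'))  # build the set once; membership tests are fast

def remove_stopwords(text):
    # each stopword is replaced by an empty string, so removed words
    # leave extra spaces behind in the joined result
    return " ".join('' if word in stop_words else word for word in text.split())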

In [18]:
remove_stopwords('this is a man name your notn')

'   man name  notn'
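
Two details of this step are worth a closer look: removed words leave gaps of extra spaces, and the check is case-sensitive, so capitalised stopwords such as "This" or "They" survive because lowercasing happens later in the notebook. The sketch below is one possible variant, assuming it is fine to lowercase first and to drop stopwords entirely. `drop_stopwords` and `DEMO_STOPWORDS` are hypothetical names; the small hard-coded set only stands in for `stopwords.words('english')` so the example runs without the NLTK corpus.

```python
# Stand-in for set(stopwords.words('english')); hard-coded so the sketch
# runs without downloading the NLTK corpus (an assumption for illustration).
DEMO_STOPWORDS = {'this', 'is', 'a', 'your', 'the', 'of'}

def drop_stopwords(text, stop_words=DEMO_STOPWORDS):
    # Lowercase first so 'This' and 'this' are treated alike,
    # then keep only the words that are not stopwords.
    kept = [w for w in text.lower().split() if w not in stop_words]
    return " ".join(kept)

print(drop_stopwords('This is a man name your notn'))  # man name notn
```

Because the vectorizers used later split on word boundaries, the extra spaces in the original output are harmless for modelling; the case-sensitivity is the part that affects which words remain.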

In [19]:
df['review']=df['review'].apply(remove_stopwords)

In [20]:
df['review'][0]

'One    reviewers  mentioned   watching  1 Oz episode youll  hooked They  right    exactly  happened  meThe first thing  struck   Oz   brutality  unflinching scenes  violence  set  right   word GO Trust      show   faint hearted  timid This show pulls  punches  regards  drugs sex  violence Its  hardcore   classic use   wordIt  called OZ     nickname given   Oswald Maximum Security State Penitentary It focuses mainly  Emerald City  experimental section   prison    cells  glass fronts  face inwards  privacy   high   agenda Em City  home  manyAryans Muslims gangstas Latinos Christians Italians Irish  moreso scuffles death stares dodgy dealings  shady agreements  never far awayI would say  main appeal   show  due   fact   goes   shows wouldnt dare Forget pretty pictures painted  mainstream audiences forget charm forget romanceOZ doesnt mess around The first episode I ever saw struck    nasty   surreal I couldnt say I  ready     I watched  I developed  taste  Oz  got accustomed   high level

In [27]:
df.drop_duplicates(inplace=True)

In [28]:
df['review'] = df['review'].apply(lambda x:x.lower())

In [29]:
df['review']

0        one    reviewers  mentioned   watching  1 oz e...
1        a wonderful little production the filming tech...
2        i thought    wonderful way  spend time    hot ...
3        basically theres  family   little boy jake thi...
4        petter matteis love   time  money   visually s...
                               ...                        
49995    i thought  movie    right good job it wasnt  c...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i   catholic taught  parochial elementary scho...
49998    im going    disagree   previous comment  side ...
49999    no one expects  star trek movies   high art   ...
Name: review, Length: 49579, dtype: object

In [35]:
df.to_csv('preprocess_feature.csv')

In [40]:
X = df.iloc[:,0:1]
y = df['sentiment']

In [41]:
encoder = LabelEncoder()
y=encoder.fit_transform(y)

In [42]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=21)

In [43]:
X_train.shape

(39663, 1)

**Applying Bag of Words**

In [46]:
cv = CountVectorizer(max_features=5000)
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()
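
To make the bag-of-words step concrete, here is a simplified pure-Python sketch of the idea behind `CountVectorizer(max_features=...)`: count terms over the corpus, keep the most frequent ones as the vocabulary, and turn each document into a row of counts. It is an illustration only; `bag_of_words` is a hypothetical helper, and it splits on whitespace rather than using scikit-learn's tokenizer.

```python
from collections import Counter

def bag_of_words(docs, max_features=None):
    # Tokenize on whitespace (a simplification of CountVectorizer's tokenizer)
    tokenized = [doc.lower().split() for doc in docs]
    # Total count of each term across the whole corpus
    totals = Counter(tok for doc in tokenized for tok in doc)
    # Keep the most frequent terms, then order the vocabulary alphabetically
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # One column per vocabulary term; terms outside the vocabulary are ignored
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = ["good movie good plot", "bad movie plot", "good acting plot"]
vocab, rows = bag_of_words(docs, max_features=3)
print(vocab)  # ['good', 'movie', 'plot']
print(rows)   # [[2, 1, 1], [0, 1, 1], [1, 0, 1]]
```

The words "bad" and "acting" fall outside the three most frequent terms, so they contribute nothing to the rows, which is the same trade-off `max_features=5000` makes on the full review corpus.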

**Applying an ML Model**

In [47]:
gnb = GaussianNB()
gnb.fit(X_train_bow,y_train)
y_pred = gnb.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.7572609923356192

In [49]:
rf = RandomForestClassifier()
rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8458047599838645

**Applying TF-IDF Vectorization**

In [52]:
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review'])

rf=RandomForestClassifier()
rf.fit(X_train_tfidf,y_train)
y_pred = rf.predict(X_test_tfidf)
accuracy_score(y_test,y_pred)

0.8514521984671238
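
For intuition about what TF-IDF changes relative to raw counts, the sketch below computes one document's weights by hand. It assumes scikit-learn's default settings for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is scaled to unit length. `tfidf_row` and the toy numbers are hypothetical.

```python
import math

def tfidf_row(doc_counts, doc_freq, n_docs):
    # doc_counts: term -> raw count in this document
    # doc_freq:   term -> number of documents that contain the term
    weights = {}
    for term, tf in doc_counts.items():
        # Smoothed inverse document frequency
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        weights[term] = tf * idf
    # L2-normalise so the row has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# 'movie' appears in all 3 documents, 'good' in only 2 of them
row = tfidf_row({'good': 2, 'movie': 1}, {'good': 2, 'movie': 3}, n_docs=3)
print(row)
```

A term that appears in every document, like "movie" here, gets the minimum idf of 1, so rarer terms end up with relatively more weight.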

## Using Word2Vec

### Deep Learning Approach
![Screenshot%202023-07-20%20165934.png](attachment:Screenshot%202023-07-20%20165934.png)

In [54]:
story = []
for doc in df['review']:
    raw_sent = sent_tokenize(doc)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))

In [55]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

In [56]:
model.build_vocab(story)

In [57]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(29099048, 30997660)

In [58]:
len(model.wv.index_to_key)

79969

In [59]:
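def document_vector(doc):
    # remove out-of-vocabulary words; key_to_index is a dict, so each lookup is fast
    words = [word for word in doc.split() if word in model.wv.key_to_index]
    if not words:
        # no known words: fall back to a zero vector of the right size
        return np.zeros(model.wv.vector_size)
    return np.mean(model.wv[words], axis=0)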

In [60]:
document_vector(df['review'].values[0])

array([-0.03575416, -0.16156894,  0.18116808,  0.81365085,  0.02675943,
       -0.54907143,  0.52953035,  0.5136398 ,  0.4996401 ,  0.22770913,
        0.18540347, -0.1581813 , -0.3888462 , -0.27320293, -0.24686196,
       -0.00931274,  0.10685576, -0.2141525 ,  0.19816083, -0.18753412,
       -0.13316365,  0.57358515,  0.4970962 ,  0.13766351,  0.23808773,
        0.04001063, -0.37604728,  0.44990337, -0.36153832,  0.14473778,
       -0.04580299, -0.2894155 ,  0.27293074, -0.07333256, -0.19250634,
        0.3214669 ,  0.13879211, -0.5132933 , -0.00366583, -0.38034138,
        0.14036074, -0.28475273, -0.1203545 ,  0.44702274,  0.11330063,
        0.02314755,  0.28783333, -0.4194556 , -0.09730901,  0.21719599,
        0.14374481,  0.03143694,  0.28238255, -0.17620102, -0.19865902,
       -0.22769754, -0.04409184,  0.28341624, -0.40156433,  0.60502434,
        0.3174995 , -0.01358412,  0.12666664,  0.09917603, -0.24127735,
       -0.02371673, -0.37038463,  0.74050134,  0.0709252 ,  0.31
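
The document vector is simply the element-wise mean of the vectors of the words it contains. A small sketch with made-up 3-dimensional embeddings, assuming unknown words are skipped and an empty document maps to a zero vector; `toy_vectors` and `average_vector` are hypothetical and only stand in for `model.wv`.

```python
import numpy as np

# Toy 3-dimensional embeddings standing in for model.wv (made-up values)
toy_vectors = {
    'good':  np.array([1.0, 0.0, 2.0]),
    'movie': np.array([0.0, 2.0, 0.0]),
    'plot':  np.array([2.0, 1.0, 1.0]),
}

def average_vector(doc, vectors, dim=3):
    # Keep only the words that have an embedding
    known = [vectors[w] for w in doc.split() if w in vectors]
    if not known:
        # No known words: return a zero vector instead of failing
        return np.zeros(dim)
    # Element-wise mean across the word vectors
    return np.mean(known, axis=0)

print(average_vector('good movie unknownword', toy_vectors))  # [0.5 1.  1. ]
```

Averaging discards word order, so "not good" and "good not" map to the same vector; that is one plausible reason this representation does not beat the TF-IDF features here.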

In [62]:
X = []
for doc in tqdm(df['review'].values):
    X.append(document_vector(doc))

100%|████████████████████████████████████████████████████████████████████████████| 49579/49579 [59:57<00:00, 13.78it/s]


In [63]:
X = np.array(X)

In [64]:
X[0]

array([-0.03575416, -0.16156894,  0.18116808,  0.81365085,  0.02675943,
       -0.54907143,  0.52953035,  0.5136398 ,  0.4996401 ,  0.22770913,
        0.18540347, -0.1581813 , -0.3888462 , -0.27320293, -0.24686196,
       -0.00931274,  0.10685576, -0.2141525 ,  0.19816083, -0.18753412,
       -0.13316365,  0.57358515,  0.4970962 ,  0.13766351,  0.23808773,
        0.04001063, -0.37604728,  0.44990337, -0.36153832,  0.14473778,
       -0.04580299, -0.2894155 ,  0.27293074, -0.07333256, -0.19250634,
        0.3214669 ,  0.13879211, -0.5132933 , -0.00366583, -0.38034138,
        0.14036074, -0.28475273, -0.1203545 ,  0.44702274,  0.11330063,
        0.02314755,  0.28783333, -0.4194556 , -0.09730901,  0.21719599,
        0.14374481,  0.03143694,  0.28238255, -0.17620102, -0.19865902,
       -0.22769754, -0.04409184,  0.28341624, -0.40156433,  0.60502434,
        0.3174995 , -0.01358412,  0.12666664,  0.09917603, -0.24127735,
       -0.02371673, -0.37038463,  0.74050134,  0.0709252 ,  0.31

In [27]:
encoder = LabelEncoder()
y = encoder.fit_transform(df['sentiment']) 

In [66]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [68]:
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.8365268253327954

## Improving the Accuracy

In [5]:
df = pd.read_csv('preprocess_feature.csv',usecols=['review','sentiment'])
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz e...,positive
1,a wonderful little production the filming tech...,positive
2,i thought wonderful way spend time hot ...,positive
3,basically theres family little boy jake thi...,negative
4,petter matteis love time money visually s...,positive


In [6]:
df.isna().sum()

review       0
sentiment    0
dtype: int64

In [7]:
df.duplicated().sum()

1

In [31]:
df['sentiment'].value_counts()

positive    24883
negative    24695
Name: sentiment, dtype: int64

In [8]:
df.drop_duplicates(inplace=True)

In [9]:
df.duplicated().sum()

0

In [10]:
df['review'][0]

'one    reviewers  mentioned   watching  1 oz episode youll  hooked they  right    exactly  happened  methe first thing  struck   oz   brutality  unflinching scenes  violence  set  right   word go trust      show   faint hearted  timid this show pulls  punches  regards  drugs sex  violence its  hardcore   classic use   wordit  called oz     nickname given   oswald maximum security state penitentary it focuses mainly  emerald city  experimental section   prison    cells  glass fronts  face inwards  privacy   high   agenda em city  home  manyaryans muslims gangstas latinos christians italians irish  moreso scuffles death stares dodgy dealings  shady agreements  never far awayi would say  main appeal   show  due   fact   goes   shows wouldnt dare forget pretty pictures painted  mainstream audiences forget charm forget romanceoz doesnt mess around the first episode i ever saw struck    nasty   surreal i couldnt say i  ready     i watched  i developed  taste  oz  got accustomed   high level

In [11]:
X=df.iloc[:,0:1]
y=df.iloc[:,-1]
y

0        positive
1        positive
2        positive
3        negative
4        positive
           ...   
49574    positive
49575    negative
49576    negative
49577    negative
49578    negative
Name: sentiment, Length: 49578, dtype: object

In [13]:
le = LabelEncoder()
y=le.fit_transform(y)    

In [15]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=21)

In [16]:
X_train

Unnamed: 0,review
36852,oh dear lord how earth part film ever app...
31431,this movie potential far but fails ...
13551,most films i really like arthouse fare sel...
21761,this amazing movie the characters seemed r...
5912,tintin i first struck masterpiece docume...
...,...
16432,after negative reviews heard movie dou...
8964,raw force like ultrasleazy perverted versio...
5944,are serious i mean wow wow i think i saw fl...
5327,this total swill if take the devils rejects ...


In [None]:
cv = CountVectorizer(max_features=3000)
X_train_bow=cv.fit_transform(X_train['review']).toarray()
X_test_bow=cv.transform(X_test['review']).toarray()

In [19]:
tfidf = TfidfVectorizer(max_features=3000)
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review'])

In [15]:
clf2=MultinomialNB()
clf3=BernoulliNB()

## Naive Bayes - Using Bag of Words
### Multinomial

In [16]:
clf2.fit(X_train_bow,y_train)
y_pred = clf2.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8441912061315047

### Bernoulli

In [17]:
clf3.fit(X_train_bow,y_train)
y_pred = clf3.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8534691407825736

## Logistic Regression - Using Bag of Words

In [34]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression() 
lr.fit(X_train_bow,y_train)  

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [35]:
y_pred = lr.predict(X_test_bow)
accuracy_score(y_test,y_pred)    

0.8735377168212989

## Logistic Regression - Using TF-IDF (Best Accuracy: 88%)

In [67]:
tfidf = TfidfVectorizer(max_features=3000)
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review']).toarray() 

In [55]:
lr = LogisticRegression()
lr.fit(X_train_tfidf,y_train)
y_pred = lr.predict(X_test_tfidf)
accuracy_score(y_test,y_pred)    

0.881403791851553

## Save Model

In [56]:
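import pickle

# save the trained classifier and the fitted vectorizer; both are needed at prediction time
with open('model.pkl','wb') as f:
    pickle.dump(lr,f)
with open('vectorized.pkl','wb') as f:
    pickle.dump(tfidf,f)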
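
Saving the vectorizer and the classifier as two files works, but the two must always be loaded and used together. One option, sketched here on a tiny made-up corpus, is to wrap both in a scikit-learn pipeline so a single object is pickled. `texts`, `labels`, and `pipe` are hypothetical; this is an alternative layout, not what the notebook or `app.py` currently uses.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for X_train['review'] / y_train
texts = ['great film loved it', 'wonderful acting great plot',
         'terrible film hated it', 'awful acting boring plot']
labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together as one object
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

# Round-trip through pickle in memory (writing to a file works the same way)
restored = pickle.loads(pickle.dumps(pipe))

# The restored pipeline accepts raw strings directly
print(restored.predict(['great plot', 'boring film']))
```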

### Load Model

In [25]:
loaded_model = pickle.load(open('model.pkl','rb'))

array([1])

### Make Prediction

In [51]:
review = 'i am happy to watch this movie'
text = [review]
preprocess_text= tfidf.transform(text).toarray()
result = loaded_model.predict(preprocess_text) 
# predict returns an array with one label per input text
if result[0]==0:
    print("Negative Review")
elif result[0]==1:
    print("Positive Review")

Positive Review
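
The prediction above passes the raw string straight to the vectorizer, while the training text went through tag removal, punctuation removal, stopword removal, and lowercasing. A minimal sketch of bundling those steps into one function so the same cleaning can be applied at training and prediction time. `clean_review` is a hypothetical helper, and the stopword set is passed in as a parameter so the example runs without the NLTK corpus.

```python
import re
import string

TAG_RE = re.compile(r'<[^>]+>')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_review(text, stop_words=frozenset()):
    # Same order as the notebook: tags, punctuation, stopwords, lowercase
    text = TAG_RE.sub('', text)
    text = text.translate(PUNCT_TABLE)
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words).lower()

print(clean_review("I'm <b>happy</b> to watch this movie!", {'to', 'this'}))
# im happy watch movie
```

With the notebook's objects, the call would look like `tfidf.transform([clean_review(review, stop_words)])`, where `stop_words` is assumed to be the NLTK English list.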


 ![image27_frqkzv.png](attachment:image27_frqkzv.png)

In [93]:
%%writefile app.py
import streamlit as st 
import string
from nltk.corpus import stopwords
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression 
import pickle 
import pyttsx3

def speak(text):
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


ps = PorterStemmer()

model = pickle.load(open('model.pkl','rb')) 
vector = pickle.load(open('vectorized.pkl','rb'))  

stop_words = set(stopwords.words('english'))  # build once instead of once per token

def transform_text(text):
    # lowercase and tokenize
    tokens = nltk.word_tokenize(text.lower())
    # keep alphanumeric tokens only (this also drops punctuation)
    tokens = [t for t in tokens if t.isalnum()]
    # drop stopwords
    tokens = [t for t in tokens if t not in stop_words]
    # stem each remaining token
    return " ".join(ps.stem(t) for t in tokens)

    
st.title('IMDB Reviews Sentiment Analysis')

review = st.text_area("Please enter your review")

if st.button('Sentiment'):
    transform_review = transform_text(review)
    preprocess_review = vector.transform([transform_review]).toarray()
    output= model.predict(preprocess_review)
    if output == 0:
        st.image('Screenshot 2023-07-20 160555.png')
        st.error("Negative Review")
        speak("Negative Review")
    elif output ==1:
        st.image('Screenshot 2023-07-20 160545.png')
        st.success("Positive Review")
        speak("Positive Review")








In [2]:
!streamlit run app.py

^C
