# **IMDb Sentiment Analysis**

**This project builds a sentiment analysis model using Logistic Regression. TF-IDF feature extraction converts the IMDb reviews into numerical features, which the model uses to classify each review as Positive or Negative.**

## **Overview of the Dataset**
The dataset consists of 50,000 entries with the following 2 columns:

*   review: the text of an IMDb review written by a user
*   sentiment: the label of the corresponding review, either positive or negative

In [None]:
!pip install kaggle



In [None]:
#Importing packages

import os
import json
from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split



### **Dataset through Kaggle**

In [None]:
# Loading the kaggle json file (the context manager closes the file afterwards)
with open("kaggle.json") as f:
  kaggle_dictionary = json.load(f)

In [None]:
kaggle_dictionary.keys()

# setup kaggle credentials as environment variables
os.environ["KAGGLE_USERNAME"] = kaggle_dictionary["username"]
os.environ["KAGGLE_KEY"] = kaggle_dictionary["key"]
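
The credential step above can be made a little more defensive. The sketch below is a hypothetical helper (`export_kaggle_credentials` is not part of the Kaggle package) that reads the JSON file with a context manager and fails early if a field is missing. It assumes the file contains the `username` and `key` fields used above, and the demo writes dummy values into a plain dictionary rather than the real environment.

```python
import json
import os
import tempfile


def export_kaggle_credentials(path, environ=os.environ):
    """Read a kaggle.json-style file and copy its fields into `environ`.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    with open(path) as f:  # the context manager closes the file
        creds = json.load(f)

    missing = {"username", "key"} - creds.keys()
    if missing:
        raise KeyError(f"missing fields in {path}: {sorted(missing)}")

    environ["KAGGLE_USERNAME"] = creds["username"]
    environ["KAGGLE_KEY"] = creds["key"]
    return creds


# Demo with a throwaway file and dummy values (not real credentials)
fake_env = {}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "kaggle.json")
    with open(demo_path, "w") as f:
        json.dump({"username": "demo_user", "key": "demo_key"}, f)
    export_kaggle_credentials(demo_path, environ=fake_env)

print(sorted(fake_env))
```

In the notebook itself the helper would be called as `export_kaggle_credentials("kaggle.json")`, which writes to `os.environ` by default.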

In [None]:
# Downloading the dataset Zipfile from kaggle
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
# unzip the dataset file
with ZipFile("imdb-dataset-of-50k-movie-reviews.zip", "r") as zip_ref:
  zip_ref.extractall()
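
Before extracting, it can help to check what the archive contains and whether it is intact. The sketch below builds a small in-memory archive as a stand-in for the real download (its contents are made up) and uses `ZipFile.namelist()` and `ZipFile.testzip()` to inspect it.

```python
import io
from zipfile import ZipFile

# Build a small in-memory archive so the example runs without the real download
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("IMDB Dataset.csv", "review,sentiment\nGreat film,positive\n")

buffer.seek(0)
with ZipFile(buffer, "r") as zip_ref:
    members = zip_ref.namelist()    # file names stored in the archive
    bad_member = zip_ref.testzip()  # None when every member passes its CRC check
    with zip_ref.open("IMDB Dataset.csv") as f:
        header = f.readline().decode("utf-8").strip()

print(members, bad_member, header)
```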

In [None]:
!ls

'IMDB Dataset.csv'   imdb-dataset-of-50k-movie-reviews.zip   kaggle.json   sample_data


### **Loading and exploring the dataset**

In [None]:
# Loading the dataset
dataset = pd.read_csv("/content/IMDB Dataset.csv")

In [None]:
dataset.shape

(50000, 2)

In [None]:
dataset.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
dataset.tail()

Unnamed: 0,review,sentiment
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
49999,No one expects the Star Trek movies to be high...,negative


In [None]:
dataset['sentiment'].value_counts()

sentiment,count
positive,25000
negative,25000


In [None]:
# Replacing entries of 'sentiment' column with 0 and 1
dataset.replace({"sentiment": {"positive": 1, "negative": 0}}, inplace=True)
dataset



Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
...,...,...
49995,I thought this movie did a down right good job...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,0
49998,I'm going to have to disagree with the previou...,0
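
An explicit `Series.map` is another way to encode the labels. Values that are missing from the mapping become `NaN`, which makes an unexpected label easy to detect before training. The sketch below uses a three-row stand-in for the real dataset; the names `demo` and `label_map` are illustrative.

```python
import pandas as pd

# Tiny stand-in for the real dataset
demo = pd.DataFrame({
    "review": ["Loved it", "Terrible", "A classic"],
    "sentiment": ["positive", "negative", "positive"],
})

label_map = {"positive": 1, "negative": 0}
encoded = demo["sentiment"].map(label_map)

# Any label outside label_map would have become NaN
if encoded.isna().any():
    raise ValueError("unexpected sentiment label found")

demo["sentiment"] = encoded.astype(int)
print(demo["sentiment"].tolist())
```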


## **Model: NLTK and Logistic Regression**

### **Data Preprocessing**

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Separating the features (X1) from the labels (Y1)
X1 = dataset.drop('sentiment', axis = 1)
Y1 = dataset['sentiment']
print(X1)
print(Y1)


                                                  review
0      One of the other reviewers has mentioned that ...
1      A wonderful little production. <br /><br />The...
2      I thought this was a wonderful way to spend ti...
3      Basically there's a family where a little boy ...
4      Petter Mattei's "Love in the Time of Money" is...
...                                                  ...
49995  I thought this movie did a down right good job...
49996  Bad plot, bad dialogue, bad acting, idiotic di...
49997  I am a Catholic taught in parochial elementary...
49998  I'm going to have to disagree with the previou...
49999  No one expects the Star Trek movies to be high...

[50000 rows x 1 columns]
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64


**Stemming: the process of reducing words to their root form**

In [None]:
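# Defining stemming function

port_stem = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
  # Keep letters only, lowercase, and split into words
  stemmed_content = re.sub('[^a-zA-Z]',' ', content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  # Drop stopwords and stem the remaining words
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content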

In [None]:
X1['review'] = X1['review'].apply(stemming)
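
The raw reviews contain HTML line breaks such as `<br />`, and a letters-only regex turns each of them into the token `br`. A minimal sketch of one way to avoid this, assuming it is acceptable to drop tags before cleaning: the helper `clean_review` and the small `DEMO_STOPWORDS` set are illustrative only (the notebook uses NLTK's full English list), and stemming is left out to keep the example dependency-free.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")           # matches tags such as <br />
NON_LETTER_RE = re.compile(r"[^a-zA-Z]")  # anything that is not a letter

# Illustrative stopword subset; not the full NLTK list
DEMO_STOPWORDS = {"a", "the", "is", "was", "and", "it"}


def clean_review(text, stop_words=DEMO_STOPWORDS):
    """Strip HTML tags, keep letters only, lowercase, and drop stopwords."""
    text = TAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text).lower()
    return " ".join(w for w in text.split() if w not in stop_words)


cleaned = clean_review("A wonderful little production. <br /><br />The filming is great!")
print(cleaned)
```

The same tag-stripping substitution could be placed as the first line of `stemming` so that the stemmer never sees `br`.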

In [None]:
# Printing the stemmed reviews
print(X1)

                                                  review
0      one review mention watch oz episod hook right ...
1      wonder littl product br br film techniqu unass...
2      thought wonder way spend time hot summer weeke...
3      basic famili littl boy jake think zombi closet...
4      petter mattei love time money visual stun film...
...                                                  ...
49995  thought movi right good job creativ origin fir...
49996  bad plot bad dialogu bad act idiot direct anno...
49997  cathol taught parochi elementari school nun ta...
49998  go disagre previou comment side maltin one sec...
49999  one expect star trek movi high art fan expect ...

[50000 rows x 1 columns]


In [None]:
# Converting the text data to numerical data (TF-IDF features)
vectorizer = TfidfVectorizer()
X1 = X1['review'].values
# fit_transform learns the vocabulary and builds the matrix in one step
X1 = vectorizer.fit_transform(X1)
print(X1)
X1.shape

  (0, 355)	0.0945307749787177
  (0, 857)	0.08153810162738666
  (0, 929)	0.09227023919378995
  (0, 2586)	0.05708517879925333
  (0, 2999)	0.03818353181364933
  (0, 3127)	0.10320262052497035
  (0, 3553)	0.04255483554478773
  (0, 3823)	0.0815683718506981
  (0, 5101)	0.040267821094538886
  (0, 6138)	0.08227384900031838
  (0, 7342)	0.11265197865094709
  (0, 8041)	0.06226438020486721
  (0, 8829)	0.04056803469965867
  (0, 9858)	0.07244869783773818
  (0, 10245)	0.05447490632368209
  (0, 10870)	0.06652078724021002
  (0, 11172)	0.10461202651137415
  (0, 11285)	0.05409942904974466
  (0, 11287)	0.04416638453902259
  (0, 11970)	0.06893618258963648
  (0, 13512)	0.07818730678308874
  (0, 14313)	0.06617103273803054
  (0, 14331)	0.07938361934941361
  (0, 14568)	0.049752079220140265
  (0, 14594)	0.04540088135554167
  :	:
  (49999, 37747)	0.204654562298768
  (49999, 40304)	0.38957568083470595
  (49999, 40441)	0.17923256931985487
  (49999, 41430)	0.08714274206897112
  (49999, 42006)	0.16669593680403066

(50000, 68997)
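
For intuition about the numbers printed above: with its default settings, `TfidfVectorizer` multiplies each term count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales every row to unit L2 length. The pure-Python sketch below reproduces that formula on a toy corpus. The function name `tfidf` and the whitespace tokenisation are simplifications for illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


toy_docs = ["good movi good plot", "bad movi bad act"]
toy_rows = tfidf(toy_docs)
for row in toy_rows:
    print({t: round(w, 3) for t, w in row.items()})
```

A term that occurs in every document (`movi` here) receives the lowest weight in its row, which is why very common words contribute little to the classifier.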

### **Training and Testing the Model**

In [None]:
# Splitting the dataset into training and test data (stratified on the labels Y1)
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, test_size=0.2, stratify=Y1, random_state=2)
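
Stratifying a split keeps the share of each class the same in the training and test sets. The sketch below is a simplified pure-Python illustration of that idea, not scikit-learn's algorithm; `stratified_split` is a hypothetical helper and the labels are made up.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Balanced toy labels, mirroring the 50/50 class balance of the dataset
toy_labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```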

In [None]:
# Loading and training the model
model = LogisticRegression()
model.fit(X1_train, Y1_train)

**Accuracy evaluation**

In [None]:
# Accuracy score on the training data
X1_train_prediction = model.predict(X1_train)
print(X1_train_prediction)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(Y1_train, X1_train_prediction)
print('Accuracy score of the training data :', training_data_accuracy)

[0 1 1 ... 1 0 0]
Accuracy score of the training data : 0.925675


In [None]:
# Accuracy score on the test data
X1_test_prediction = model.predict(X1_test)
# accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(Y1_test, X1_test_prediction)
print('Accuracy score of the test data :', test_data_accuracy)


Accuracy score of the test data : 0.8886
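
Accuracy is the fraction of correct predictions, and on its own it does not show which kind of error the model makes. The sketch below computes accuracy, precision and recall by hand from the four confusion-matrix counts. The labels are made up for illustration and are not results from the model above.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are right
recall = tp / (tp + fn)             # share of actual positives that were found
print(tp, tn, fp, fn, accuracy, precision, recall)
```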


**Defining a predictive system**

In [None]:
X_new1 = X1_test[3]
prediction1 = model.predict(X_new1)
print(prediction1)

if (prediction1[0]==0):
  print('The review is Negative')

else:
  print('The review is Positive')

[0]
The review is Negative
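
The cell above predicts on a row that has already been vectorized. To score a new raw review, the text has to pass through the same fitted vectorizer before it reaches the model. One way to keep these steps together is a scikit-learn `Pipeline`. The sketch below is trained on a handful of made-up reviews with no stemming, so it only illustrates the mechanics; a side effect of this design is that the vectorizer is fitted on the training texts only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training reviews; 1 = positive, 0 = negative
train_texts = [
    "wonderful film great acting",
    "great story wonderful cast",
    "loved it great fun",
    "terrible plot awful acting",
    "awful film boring story",
    "boring and terrible waste",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes the text and then classifies it
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
sentiment_pipeline.fit(train_texts, train_labels)

# Raw strings go straight into predict(); the pipeline applies TF-IDF first
new_reviews = ["great wonderful film", "awful boring plot"]
predictions = sentiment_pipeline.predict(new_reviews)
for text, label in zip(new_reviews, predictions):
    print(text, "->", "Positive" if label == 1 else "Negative")
```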
