## IMDB Review Sentiment  uisng LSTM

In the rapidly evolving landscape of the entertainment industry, platforms like IMDb play a crucial role in shaping audience preferences. Understanding the **sentiment behind user-generated reviews** is paramount for filmmakers, studios, and streaming platforms seeking to gauge audience reactions. This report explores the implementation of **Long Short-Term Memory (LSTM) techniques for sentiment analysis** on IMDb reviews, shedding light on the process, challenges, and outcomes of this endeavor.

IMDb, as a **go-to platform for movie information**, hosts a vast repository of user reviews. Analyzing the sentiment within these reviews provides valuable insights into audience reactions, guiding decision-making processes for content creators and distributors.

In [1]:
# Import Libraries 

import numpy as np
import pandas as pd

**Data Collection:**<br>
A diverse dataset of IMDb reviews was collected to ensure representation across sentiments. This dataset serves as the foundation for training and testing the LSTM model.

In [2]:
df = pd.read_csv('IMDB Dataset.csv')

In [3]:
df.shape

(50000, 2)

In [4]:
df['review'][10]

'Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines.<br /><br />At first it was very odd and pretty funny but as the movie progressed I didn\'t find the jokes or oddness funny anymore.<br /><br />Its a low budget film (thats never a problem in itself), there were some pretty interesting characters, but eventually I just lost interest.<br /><br />I imagine this film would appeal to a stoner who is currently partaking.<br /><br />For something similar but better try "Brother from another planet"'

In [5]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [6]:
# Check unique values in column sentiment

df['sentiment'].unique()

array(['positive', 'negative'], dtype=object)

In [7]:
# How many of each reviews 

df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [8]:
# Lets use Natural Language Processing for string Preprocessing 

In [9]:
import nltk

In [10]:
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

**Data Preprocessing:**<br>
Cleaning the data involved removing irrelevant information, handling missing values, and transforming textual data into a format suitable for analysis. Tokenization and vectorization processes were employed to convert words into numerical representations.

In [11]:
# 1. Removing stopwords 

In [12]:
from nltk.corpus import stopwords

In [13]:
english_stopwords  = set(stopwords.words('english'))

In [14]:
# Decide I/P and O/P

In [15]:
X = df['review']          # Input
Y = df['sentiment']       # Output

In [16]:
X[0]     

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [17]:
# Pre - Processing of Input Data 

In [18]:
# Step 1 : Remove HTML tags 

In [19]:
X = X.replace({'<.*?/>' : ''}, regex = True)

In [20]:
X[110]

'Apparently, the people that wrote the back of the box did not bother to watch this so-called "movie." They described "blindingly choreographed intrigue and violence." I saw no "intrigue." I instead saw a miserable attempt at dialogue in a supposed kung fu movie. I saw no "violence." At least, I saw nothing which could cause me to suspend my disbelief as to what could possibly hurt a man with "impervious" skin--but here I am perhaps revealing too much of the "plot." Furthermore, as a viewer of many and sundry films (some of which include the occasional kung fu movie), I can authoritatively say that this piece of celluloid is unwatchable. Whatever you may choose to do, I will always remain Correct, Jonathan Tanner  P.S. I was not blinded by the choreography.'

In [21]:
# Step 2 : remove all non Alphabates

X.replace({ '[^A-Za-z]', ' ' }, regex = True)

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

In [22]:
X[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

-- REGEX Regular Expressions Patterns 

Matacharacters : 

- ^ : starts with    
- [] : Set of characters 
- (.) : any character
- ($) : ends with
- (\) : special sequence
- (*) : zero or more occurances 
- (+) : one or more occurances
- (?) : zero or one occurances
- {} : exactly mentioned occurances
- | : or 
    
Sets :

1. [abc]: it returns match where a,b,c is present.
2. [A-Z]or [a-z] : it returns match where A-Z, a-z is present.
3. [^abc] : it returns match except a,b,c.
4. [567]
5. [0-9]
6. [A-ZAa-z]

In [23]:
X[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

**Model Architecture:**<br>
The LSTM model was chosen for its ability to capture long-term dependencies in sequential data. The architecture included an embedding layer, multiple LSTM layers, and a final output layer for sentiment classification

In [24]:
# 1. Removing Stopwords 

In [25]:
X_data = X.apply(lambda  review : [ i      for i in review.split()         if i not in english_stopwords ] )

In [26]:
X_data[0]

['One',
 'reviewers',
 'mentioned',
 'watching',
 '1',
 'Oz',
 'episode',
 'hooked.',
 'They',
 'right,',
 'exactly',
 'happened',
 'me.The',
 'first',
 'thing',
 'struck',
 'Oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence,',
 'set',
 'right',
 'word',
 'GO.',
 'Trust',
 'me,',
 'show',
 'faint',
 'hearted',
 'timid.',
 'This',
 'show',
 'pulls',
 'punches',
 'regards',
 'drugs,',
 'sex',
 'violence.',
 'Its',
 'hardcore,',
 'classic',
 'use',
 'word.It',
 'called',
 'OZ',
 'nickname',
 'given',
 'Oswald',
 'Maximum',
 'Security',
 'State',
 'Penitentary.',
 'It',
 'focuses',
 'mainly',
 'Emerald',
 'City,',
 'experimental',
 'section',
 'prison',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards,',
 'privacy',
 'high',
 'agenda.',
 'Em',
 'City',
 'home',
 'many..Aryans,',
 'Muslims,',
 'gangstas,',
 'Latinos,',
 'Christians,',
 'Italians,',
 'Irish',
 'more....so',
 'scuffles,',
 'death',
 'stares,',
 'dodgy',
 'dealings',
 'shady',
 'agreements',
 'never',
 'far',
 'away.I'

In [27]:
# Covert strings into lower cases

X_data = X_data.apply(lambda review : [i.lower()            for i  in  review])

In [28]:
print(X_data[0])

['one', 'reviewers', 'mentioned', 'watching', '1', 'oz', 'episode', 'hooked.', 'they', 'right,', 'exactly', 'happened', 'me.the', 'first', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence,', 'set', 'right', 'word', 'go.', 'trust', 'me,', 'show', 'faint', 'hearted', 'timid.', 'this', 'show', 'pulls', 'punches', 'regards', 'drugs,', 'sex', 'violence.', 'its', 'hardcore,', 'classic', 'use', 'word.it', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary.', 'it', 'focuses', 'mainly', 'emerald', 'city,', 'experimental', 'section', 'prison', 'cells', 'glass', 'fronts', 'face', 'inwards,', 'privacy', 'high', 'agenda.', 'em', 'city', 'home', 'many..aryans,', 'muslims,', 'gangstas,', 'latinos,', 'christians,', 'italians,', 'irish', 'more....so', 'scuffles,', 'death', 'stares,', 'dodgy', 'dealings', 'shady', 'agreements', 'never', 'far', 'away.i', 'would', 'say', 'main', 'appeal', 'show', 'due', 'fact', 'goes', 'shows', 'dare.', 'for

In [29]:
# Output : Y

In [30]:
# Replace 'positive' : 1, 'negative' : 0

Y = Y.replace('positive', 1) 
Y = Y.replace('negative', 0) 

In [31]:
Y.value_counts()

sentiment
1    25000
0    25000
Name: count, dtype: int64

##### Training and Evaluation:

**Model Training:**<br>
The dataset was split into training and testing sets. The LSTM model was trained on the training set, and hyperparameters were tuned to optimize performance. Validation data was used to monitor and prevent overfitting.

**Evaluation Metrics:**<br>
The model's performance was evaluated using metrics such as accuracy, precision, recall, and F1 score. The confusion matrix provided insights into the model's ability to classify positive and negative sentiments.



In [32]:
# Train Test Split 

In [33]:
from sklearn.model_selection import train_test_split

In [34]:
X_train, X_test, Y_train, Y_test = train_test_split(X_data , Y, test_size= 0.25, random_state= 123)

In [35]:
# WAP function which returns average length of reviews.


def avg_length() : 
    review_length = []
    for i in range(X_train.shape[0]) :
        review_length.append(len(X_data[i]))
        
    return np.ceil((np.mean(review_length)))

In [36]:
avg_length()

133.0

In [37]:
X_train.shape[0]

37500

In [38]:
# Convert strings into numbers 
# Tokenizers

In [39]:
import tensorflow

In [40]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [41]:
token = Tokenizer(lower= False)

In [42]:
token.fit_on_texts(X_train)

In [43]:
train_x =  token.texts_to_sequences(X_train)
test_x  = token.texts_to_sequences(X_test)

In [44]:
print(X_train[0])

['one', 'reviewers', 'mentioned', 'watching', '1', 'oz', 'episode', 'hooked.', 'they', 'right,', 'exactly', 'happened', 'me.the', 'first', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence,', 'set', 'right', 'word', 'go.', 'trust', 'me,', 'show', 'faint', 'hearted', 'timid.', 'this', 'show', 'pulls', 'punches', 'regards', 'drugs,', 'sex', 'violence.', 'its', 'hardcore,', 'classic', 'use', 'word.it', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary.', 'it', 'focuses', 'mainly', 'emerald', 'city,', 'experimental', 'section', 'prison', 'cells', 'glass', 'fronts', 'face', 'inwards,', 'privacy', 'high', 'agenda.', 'em', 'city', 'home', 'many..aryans,', 'muslims,', 'gangstas,', 'latinos,', 'christians,', 'italians,', 'irish', 'more....so', 'scuffles,', 'death', 'stares,', 'dodgy', 'dealings', 'shady', 'agreements', 'never', 'far', 'away.i', 'would', 'say', 'main', 'appeal', 'show', 'due', 'fact', 'goes', 'shows', 'dare.', 'for

In [45]:
train_x

[[131,
  2650,
  18468,
  125376,
  9388,
  1,
  936,
  116,
  1529,
  1,
  1098,
  20,
  19737,
  1,
  86,
  3376,
  678,
  362,
  256,
  745,
  1449,
  6440,
  780,
  1117,
  34291,
  49,
  55,
  238,
  2638,
  156,
  2152,
  57217,
  16327,
  1,
  18,
  280,
  379,
  1074,
  14327,
  16,
  503,
  17312,
  684,
  22204,
  76,
  16,
  6,
  86227,
  22,
  125377,
  36958,
  2639,
  68015,
  31,
  927,
  125378,
  16,
  115,
  26,
  8341,
  3478,
  1406,
  3056,
  22,
  10840,
  2639,
  1406,
  125379,
  511,
  1034,
  2906,
  1,
  106,
  93,
  161,
  40281,
  125380,
  417,
  88,
  431,
  518,
  133,
  23,
  65,
  264,
  278,
  1032,
  44384,
  581,
  4445,
  1,
  27,
  82,
  6323,
  1473],
 [149,
  72,
  329,
  191,
  669,
  7,
  3,
  179,
  547,
  6090,
  125381,
  13968,
  1548,
  700,
  966,
  661,
  37,
  45,
  3,
  2,
  2956,
  53,
  9,
  26,
  323,
  369,
  40,
  1,
  138,
  2213,
  2272,
  16,
  3309,
  3,
  57218,
  31,
  301,
  1,
  1356,
  13640,
  839,
  16818,
  29984,
  3

In [46]:
# Adjust all sequnences with average  lenth of review (133)

# We will remove all numbers of the sequence with index > average length (133)
# We will use "padding"  where length < 133.

In [47]:
len(train_x[0])

95

In [48]:
# Convert  train_x into above mentioned sequence.

In [49]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [50]:
train_X = pad_sequences(train_x, maxlen= 133,padding=  'post', truncating= 'post')
test_X = pad_sequences(test_x, maxlen= 133,padding=  'post', truncating= 'post')

In [51]:
# Building a Model 

In [55]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Embedding, Flatten

In [56]:
model = Sequential()

In [57]:
model.add(Embedding((len(token.word_index) + 1), 32 , input_length= 133 ))

model.add(LSTM(64, return_sequences= True ))
model.add(Dropout(0.2))

model.add(LSTM(32, return_sequences= True))
model.add(Dropout(0.2))
model.add(Flatten())

model.add(Dense(1, activation = 'sigmoid'))

In [58]:
# Compile the Model

model.compile(optimizer= 'adam', loss= 'binary_crossentropy', metrics = ['accuracy'])

In [59]:
# Train the Model

model.fit(train_X, Y_train, epochs= 5, validation_data= (test_X, Y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1cec53d9050>

**Accuracy:**<br>
The LSTM model demonstrated a commendable accuracy in sentiment classification, indicating its ability to discern user sentiments in IMDb reviews.

#### Predicting User Inputs

In [60]:


# Test reviews
test_reviews = [
    "The movie is about the Avengers assembling to reverse the damage caused by Thanos and restore the universe. Some say the movie is a meditation on time and leaves the message that people should cherish the moments they have with their loved ones. Others say the movie's theme is about planning to give it your all"
]

# Pass the list directly to texts_to_sequences
test_sequences = token.texts_to_sequences(test_reviews)

# Pad the sequences
max_length = 133  # Adjust based on your model's input size
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding='post', truncating='post')

# Predict sentiment
#pred = model.predict(test_padded)
#print(pred)
test_padded



array([[    3,   488,  2364,     2, 34740,   288,  9963,     2,  4702,
         2129,   870,    31, 11329,     2,  3471,    58,     2,     3,
          488,    49, 13472,   511,    19,    31,   743,     2,   811,
          218,    23,  5055, 15259,     2,   372,   120,  2220,   399,
         1439,   298,   753,    58,     2,  1151,   783,   488,  2364,
         4013,   288,    95,    11,  2207,   226,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
      

In [62]:
predictions = model.predict(test_padded)



In **conclusion**, the implementation of LSTM for IMDb review sentiment analysis proves to be a valuable tool for deciphering the ever-changing landscape of audience preferences. This report outlines the methodology, challenges, and outcomes of this endeavor, emphasizing its potential to revolutionize decision-making processes in the entertainment industry. As technology continues to advance, the synergy between sentiment analysis and deep learning promises a brighter, more informed future for the world of film and television.

In [63]:
# Print the raw prediction values
print("Raw Predictions:", predictions)

# Convert predictions to sentiment labels
sentiment_labels = ["Positive" if pred > 0.5 else "Negative" for pred in predictions]
print("Sentiment Labels:", sentiment_labels)


Raw Predictions: [[0.8752775]]
Sentiment Labels: ['Positive']
