# Movie Sentiment Analysis (v2): Review Polarity Classification

## Background

This project focuses on sentiment analysis of movie reviews, aimed at determining the underlying sentiment expressed within a body of text. By analyzing the content of movie reviews, we strive to classify each review as positive or negative automatically. 

The dataset can be found [here](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).




## **Plan**

### 1.1 Update

Previously, we used a Multilayer Perceptron (MLP), a basic model, to predict movie sentiment. This time, we use a Bidirectional LSTM model, aiming for a more robust classifier.

## **Analyze**

### 2.1 Import, Load and Examine 

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

from bs4 import BeautifulSoup
import string

import warnings
warnings.filterwarnings("ignore")

2024-11-11 06:39:05.103321: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-11-11 06:39:05.843196: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-11-11 06:39:06.165662: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1731307146.645314    1884 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1731307146.788862    1884 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-11 06:39:07.832737: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU ins

In [2]:
df = pd.read_csv("IMDB_Dataset.csv")
df

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. \<br /\>\<br /\>The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


### 2.2 Data Cleaning

- **HTML tags:** If the dataset contains HTML tags, they need to be removed.
- **Special characters and numbers:** These usually carry little sentiment signal, so they can be removed.
- **Punctuation:** Punctuation marks can be removed, depending on the approach.
- **Stop words:** Very common words (for example "the" or "is") are removed so the model focuses on the more informative terms.
- **Lowercase:** Convert all text to lowercase so the model treats words such as "Good" and "good" as the same token.
- **Missing data:** Rows with missing data should be deleted.
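
The cleaning steps listed above can be sketched on a single string. This is only an illustration, not the notebook's pipeline: `clean_review` is a hypothetical helper, the sample text is made up, and a simple regular expression stands in for BeautifulSoup so the snippet needs only the standard library.

```python
import re

def clean_review(text: str) -> str:
    """Hypothetical helper that chains the cleaning steps on one string."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags (regex stand-in for BeautifulSoup)
    text = re.sub(r"[^a-zA-Z]", " ", text)     # replace digits, punctuation and other non-letters
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text.lower()                        # normalize case

# Made-up sample text for illustration
sample = "A wonderful little production. <br /><br />The filming, 10/10!"
cleaned = clean_review(sample)
print(cleaned)
```

Replacing every non-letter with a space also removes punctuation, which is why a separate punctuation pass after this step has nothing left to strip.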

#### 2.2.1 Remove HTML Tags


In [4]:
# Remove HTML tags
def remove_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# Apply function to dataset
df["review"] = df["review"].apply(remove_html_tags)

#### 2.2.2 Remove Special Characters & Numbers

In [5]:
# Remove special characters and numbers
df["review"] = df["review"].str.replace("[^a-zA-Z]", " ", regex=True)

# Remove extra spaces (raw string avoids an invalid "\s" escape warning)
df["review"] = df["review"].str.replace(r"\s+", " ", regex=True).str.strip()

#### 2.2.3 Remove Punctuation

In [6]:
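# Remove punctuation
# After 2.2.2 only letters and spaces remain, so this step acts as a safeguard.
# re.escape keeps characters such as ], ^ and \ literal inside the character class.
import re
punctuation_pattern = f"[{re.escape(string.punctuation)}]"
df["review"] = df["review"].str.replace(punctuation_pattern, "", regex=True)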

#### 2.2.4 Remove Stop Words

In [7]:
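# Remove stop words
# Note: NLTK's stop-word list is lowercase and this step runs before lowercasing (2.2.5),
# so capitalized stop words such as "I" or "A" are kept.
stop_words = set(stopwords.words("english"))
df["review"] = df["review"].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))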

#### 2.2.5 Lowercase

In [8]:
# Lowercase
df["review"] = df["review"].str.lower()
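
The order of the stop-word and lowercase steps matters, because the stop-word list is lowercase. The sketch below compares the two orders on one sentence. It assumes a tiny hand-written stop-word set in place of NLTK's full English list, and `remove_stop_words` is a hypothetical helper.

```python
# Small illustrative stop-word set; the notebook uses NLTK's full English list.
stop_words = {"i", "a", "the", "was", "this"}

def remove_stop_words(text: str, lowercase_first: bool) -> str:
    """Drop stop words, optionally lowercasing before the comparison."""
    if lowercase_first:
        text = text.lower()
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "I thought this was a wonderful way"

# Order used in this notebook: filter first, lowercase afterwards
as_in_notebook = remove_stop_words(sample, lowercase_first=False).lower()

# Alternative order: lowercase first, then filter
lowercased_first = remove_stop_words(sample, lowercase_first=True)

print(as_in_notebook)
print(lowercased_first)
```

With the first order, a capitalized "I" survives the filter and ends up in the cleaned text as "i". Lowercasing first removes it.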

#### 2.2.6 Missing Value

In [9]:
df.isna().sum()

review       0
sentiment    0
dtype: int64

In [10]:
df

| | review | sentiment |
|---|---|---|
| 0 | one reviewers mentioned watching oz episode ho... | positive |
| 1 | a wonderful little production the filming tech... | positive |
| 2 | i thought wonderful way spend time hot summer ... | positive |
| 3 | basically family little boy jake thinks zombie... | negative |
| 4 | petter mattei love time money visually stunnin... | positive |
| ... | ... | ... |
| 49995 | i thought movie right good job it creative ori... | positive |
| 49996 | bad plot bad dialogue bad acting idiotic direc... | negative |
| 49997 | i catholic taught parochial elementary schools... | negative |
| 49998 | i going disagree previous comment side maltin ... | negative |


## **Construct**

Model: **Bidirectional LSTM**.

### 3.1 Tokenize and Integer-Encode

In [11]:
# Initialize tokenizer and fit on review text
tokenizer = Tokenizer(num_words=5000)  # Adjust `num_words` to desired vocabulary size
tokenizer.fit_on_texts(df["review"])

# Convert text to integer sequences
x_sequences = tokenizer.texts_to_sequences(df["review"])

max_length = 100  # Set based on typical review length
x_padded = pad_sequences(x_sequences, maxlen=max_length, padding="post", truncating="post")

# Converting categorical labels to numerical form
df["sentiment_numeric"] = df["sentiment"].map({"positive": 1, "negative": 0})

# Split dataset
x_train, x_test, y_train, y_test = train_test_split(x_padded, df["sentiment_numeric"], test_size=0.2, random_state=1)
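
To make the integer-encoding and padding steps concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: `build_word_index` and `encode_and_pad` are hypothetical helpers, the texts are made up, and tie-breaking between equally frequent words may differ from `Tokenizer`.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # Like Keras' num_words, keep only indices below num_words (num_words - 1 words)
    most_common = counts.most_common(num_words - 1)
    return {word: rank for rank, (word, _) in enumerate(most_common, start=1)}

def encode_and_pad(text, word_index, max_length):
    """Map words to integers, drop unknown words, then truncate and pad at the end."""
    ids = [word_index[word] for word in text.split() if word in word_index]
    ids = ids[:max_length]                       # corresponds to truncating="post"
    return ids + [0] * (max_length - len(ids))   # corresponds to padding="post"

# Made-up cleaned reviews for illustration
texts = ["bad plot bad acting", "wonderful little production"]
word_index = build_word_index(texts, num_words=5)

encoded = encode_and_pad("bad acting wonderful", word_index, max_length=5)
print(word_index)
print(encoded)
```

Words outside the vocabulary are dropped rather than mapped to a placeholder, which matches the tokenizer call above because no `oov_token` is set.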

### 3.2 Build Model

In [12]:
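# Hyperparameters
vocab_size = 10000   # Embedding input size; must exceed the largest token index (tokenizer uses num_words=5000)
embedding_dim = 50   # Embedding dimensions (matches glove.6B.50d)
max_length = 100     # Maximum sequence length; must match the padding length used in 3.1
lstm_units = 64      # Number of LSTM units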

# Load the GloVe embeddings (the GloVe text files are UTF-8 encoded)
embeddings_index = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefficients = np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = coefficients

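# Prepare embedding matrix
# Note: this matrix is not passed to the Embedding layer below, so the embeddings are trained from scratch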
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < vocab_size:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector


model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length, trainable=True),
    Bidirectional(LSTM(units=lstm_units, return_sequences=True)),
    Dropout(0.5), 
    LSTM(units=lstm_units),
    Dense(1, activation="sigmoid")
])

# Model summary
model.summary()

2024-11-11 06:39:41.440996: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
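
Since the summary table is not shown above, the trainable parameter count can be estimated by hand. The sketch below assumes the layer sizes from the cell above (`vocab_size = 10000`, `embedding_dim = 50`, `lstm_units = 64`) and the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Parameters of a standard LSTM layer: 4 gates x (input + recurrent + bias)."""
    return 4 * ((input_dim + units + 1) * units)

# Sizes assumed from the model definition above
vocab_size, embedding_dim, lstm_units = 10000, 50, 64

embedding = vocab_size * embedding_dim                       # one vector per token index
bidirectional = 2 * lstm_params(embedding_dim, lstm_units)   # forward + backward LSTM
second_lstm = lstm_params(2 * lstm_units, lstm_units)        # input is the concatenated 128-dim output
dense = lstm_units * 1 + 1                                   # weights + bias of the sigmoid unit

total = embedding + bidirectional + second_lstm + dense
print(embedding, bidirectional, second_lstm, dense, total)
```

Under these assumptions the embedding table accounts for most of the parameters, and the dropout layer adds none.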


### 3.3 Compile and Train the Model

In [13]:
# Compile the model
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train the model
# history = model.fit(x_train, y_train, epochs=10, batch_size=64, validation_data=(x_test, y_test), verbose=1)

# With Early stopping
early_stopping = EarlyStopping(monitor="val_accuracy", patience=4, restore_best_weights=True)
history = model.fit(x_train, y_train, epochs=10, batch_size=64, validation_data=(x_test, y_test), 
                    callbacks=[early_stopping], verbose=1)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 39s 58ms/step - accuracy: 0.5714 - loss: 0.6647 - val_accuracy: 0.8385 - val_loss: 0.4129
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 36s 57ms/step - accuracy: 0.8565 - loss: 0.3500 - val_accuracy: 0.8650 - val_loss: 0.3244
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 35s 57ms/step - accuracy: 0.8998 - loss: 0.2533 - val_accuracy: 0.8718 - val_loss: 0.3027
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 35s 56ms/step - accuracy: 0.9173 - loss: 0.2180 - val_accuracy: 0.8603 - val_loss: 0.3281
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 35s 57ms/step - accuracy: 0.9332 - loss: 0.1854 - val_accuracy: 0.8626 - val_loss: 0.3452
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 35s 56ms/step - accuracy: 0.9478 - loss: 0.1553 - val_accuracy: 0.8614 - val_loss: 0.3994
Epoch 7/10
625/625

Hyperparameter tuning requires extensive computational resources and time, so I skipped that step. Nonetheless, with the untuned parameters above, validation accuracy peaked at about 87% (0.8718 at epoch 3), which we consider satisfactory for our current purposes.
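
The early-stopping rule used during training can be traced by hand on the validation accuracies recorded above. This is a simplified re-implementation of the idea (stop once the monitored value has not improved for `patience` consecutive epochs), not the Keras source. `early_stopping_trace` is a hypothetical helper.

```python
def early_stopping_trace(val_scores, patience):
    """Return (best_epoch, stop_epoch); stop_epoch is None if training was not stopped."""
    best_score = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Improvement: remember this epoch and reset the counter
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None

# val_accuracy for epochs 1-6, copied from the training log above
recorded = [0.8385, 0.8650, 0.8718, 0.8603, 0.8626, 0.8614]
best_epoch, stopped_at = early_stopping_trace(recorded, patience=4)
print(best_epoch, stopped_at)
```

On the six recorded epochs the best value is at epoch 3 and the patience of 4 has not yet run out. With `restore_best_weights=True`, the weights from the best epoch are the ones kept if training stops early.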



## **Execute**

### 4.1 Make New Prediction

In [14]:
# Sample text to predict sentiment
sample_text = [
    "This movie was a great watch with brilliant performances and a gripping plot!",  # Positive
    "An absolute waste of time, the worst movie I've seen in a long while.",  # Negative
    "I found the movie to be mediocre, not terrible but not great either.",  # Neutral
    "The cinematography was stunning, but the storyline was lacking and unoriginal.",  # Neutral/Negative
    "The film was a masterpiece with a perfect blend of drama and action, a must-watch!",  # Positive
    "It was an okay movie; I neither liked it nor disliked it particularly.",  # Neutral
    "The plot twist at the end was predictable and uninspired.",  # Negative
    "A stellar cast, but the film fell flat due to poor writing.",  # Negative
    "I loved the special effects, but the characters were not very compelling.",  # Neutral/Negative
    "The movie was well-received by critics but I didn't find it very interesting.",  # Neutral
    "This film is overrated, I had high expectations but was sadly disappointed.",  # Negative
    "What an entertaining experience, I was on the edge of my seat the whole time!"  # Positive
]

# Predict sentiment for each sample text
for text in sample_text:
    # Tokenize and pad the text
    sequence = tokenizer.texts_to_sequences([text])
    padded_sequence = pad_sequences(sequence, maxlen=max_length, padding="post", truncating="post")
    
    # Get the prediction
    prediction = model.predict(padded_sequence)
    
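    # Interpret the prediction: the sigmoid output is the probability of the positive class,
    # so the value printed as "Confidence" below is that probability, not confidence in the predicted label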
    sentiment = "Positive" if prediction[0][0] > 0.5 else "Negative"
    
    print(f"Text: {text}")
    print(f"Predicted Sentiment: {sentiment} (Confidence: {prediction[0][0]:.2f})\n")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 298ms/step
Text: This movie was a great watch with brilliant performances and a gripping plot!
Predicted Sentiment: Positive (Confidence: 0.88)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
Text: An absolute waste of time, the worst movie I've seen in a long while.
Predicted Sentiment: Negative (Confidence: 0.10)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
Text: I found the movie to be mediocre, not terrible but not great either.
Predicted Sentiment: Negative (Confidence: 0.13)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
Text: The cinematography was stunning, but the storyline was lacking and unoriginal.
Predicted Sentiment: Negative (Confidence: 0.30)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step
Text: The film was a masterpiece with a perfect blend of drama and action, a must-watch!
Predicted Sentiment: Positive 
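
The number printed next to each label is the model's sigmoid output, which is the estimated probability of the positive class. For a "Negative" prediction, the confidence in that label is one minus this value. A minimal sketch of the conversion follows, applied to the four probabilities shown in the output above. `interpret` is a hypothetical helper and the 0.5 threshold is the one used in the prediction loop.

```python
def interpret(p_positive: float, threshold: float = 0.5):
    """Turn a positive-class probability into (label, confidence in that label)."""
    label = "Positive" if p_positive > threshold else "Negative"
    confidence = p_positive if label == "Positive" else 1.0 - p_positive
    return label, confidence

# Positive-class probabilities copied from the output above
for p in (0.88, 0.10, 0.13, 0.30):
    label, confidence = interpret(p)
    print(f"p(positive)={p:.2f} -> {label} (confidence in label: {confidence:.2f})")
```

Read this way, the "waste of time" review is classified as negative with high confidence, while the mixed cinematography review sits closer to the decision boundary.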

### 4.2 Conclusion

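The model reached around 95% accuracy on the training set and 87% on the test/validation set, which is the same result I obtained with the previous MLP version. The gap between the two, together with the validation loss rising after epoch 3, suggests the model starts to overfit. Interestingly, other notebooks on Kaggle that I checked report similar results on this dataset. This might be due to a dataset bias or to patterns that are hard for these models to capture. Overall, this model performs at an average level.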