<a href="https://colab.research.google.com/github/chenwh0/Natural-Language-Processing-work/blob/main/module3/NeuralModelsSentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Neural Architectures for Text Sentiment Analysis**

**Background:**  
This lab investigates how different deep learning architectures affect sentiment classification performance on movie reviews from the IMDB dataset. It compares Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models, analyzes the influence of embedding strategies, and interprets the results.

# *Sources/References*
* Data source: [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
* https://www.geeksforgeeks.org/nlp/next-word-prediction-with-deep-learning-in-nlp/
* https://www.geeksforgeeks.org/machine-learning/python-keras-keras-utils-to_categorical/
* https://www.geeksforgeeks.org/deep-learning/zero-padding-in-deep-learning-and-signal-processing/
* https://www.geeksforgeeks.org/machine-learning/categorical-data-encoding-techniques-in-machine-learning/
* https://www.geeksforgeeks.org/nlp/Glove-Word-Embedding-in-NLP/
* https://keras.io/api/layers/core_layers/embedding/
* https://www.geeksforgeeks.org/nlp/text-classification-using-cnn/
* https://www.geeksforgeeks.org/nlp/sentiment-analysis-using-lstm/


# *Installs & Imports*

In [None]:
# Traditional ML and data processing
!pip install scikit-learn pandas numpy matplotlib seaborn -q

In [None]:
import kagglehub
# Preprocessing libraries
import pandas
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords # To remove stopwords in NLTK
nltk_stopwords = set(stopwords.words("english")) # Load English stopwords once for efficiency

# Preprocess data libraries
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Data splitting library
from sklearn.model_selection import train_test_split

# CNN libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, LSTM, Dense, Dropout

# Convert sentiment labels to categorical (one-hot) format
from tensorflow.keras.utils import to_categorical

# Visualization libraries
import matplotlib.pyplot as plt
import numpy as np

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
print("Path to dataset files:", path)

Using Colab cache for faster access to the 'imdb-dataset-of-50k-movie-reviews' dataset.
Path to dataset files: /kaggle/input/imdb-dataset-of-50k-movie-reviews


In [None]:
!wget -nc https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!unzip -q glove.6B.zip

--2025-09-14 03:08:07--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2025-09-14 03:10:47 (5.15 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



# **1. Dataset Preparation**

**Data selection:** The dataset is balanced: 50% of the reviews are labeled positive and the other 50% are labeled negative.

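**Padding** is necessary in deep neural text classification because reviews vary in length, while a neural network expects every input in a batch to have the same shape. Padding fills short sequences with zeros (index 0 is reserved for this), and sequences longer than the maximum length are truncated, so every review becomes a fixed-length sequence without changing the tokens that remain. This allows for more effective processing further down the text classification pipeline.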
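
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences(..., padding='post')` does when combined with the Keras default `truncating='pre'`. The helper name `pad_post` and the toy sequences are hypothetical and only mimic the behavior for illustration; this is not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(padding='post', truncating='pre')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens (front truncation)
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the tail with zeros
    return padded


toy_sequences = [[13, 607, 2912], [5, 9, 2, 7, 4, 8]]
print(pad_post(toy_sequences, maxlen=4))
# → [[13, 607, 2912, 0], [2, 7, 4, 8]]
```

The short sequence gains a trailing zero, while the long one loses its first two tokens. Index 0 never collides with a real word because the Keras `Tokenizer` starts its word indices at 1.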


**Categorical labels** are necessary in deep neural text classification because neural networks operate on numbers, so the text labels (`negative`, `positive`) must first be encoded as integers (0 and 1). We can go a step further and use `to_categorical()` to transform them into one-hot vectors, which is the format that categorical cross-entropy loss functions expect.
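
As a rough illustration of the label encoding, the sketch below one-hot encodes integer labels and computes a cross-entropy value by hand. The helpers `one_hot` and `categorical_cross_entropy` are hypothetical names written for this example; they approximate what `to_categorical()` and a categorical cross-entropy loss do, and are not the Keras functions themselves.

```python
import math


def one_hot(labels, num_classes):
    """Illustrative equivalent of to_categorical for integer class labels."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)] for label in labels]


def categorical_cross_entropy(one_hot_label, predicted_probs):
    """Cross-entropy between a one-hot label and a predicted probability distribution."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, predicted_probs) if t > 0)


labels = [0, 1, 1]  # negative = 0, positive = 1
encoded = one_hot(labels, num_classes=2)
print(encoded)  # → [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# A confident correct prediction gives a small loss; a confident wrong one gives a large loss.
print(round(categorical_cross_entropy(encoded[1], [0.1, 0.9]), 4))  # → 0.1054
print(round(categorical_cross_entropy(encoded[1], [0.9, 0.1]), 4))  # → 2.3026
```

One-hot vectors pair with a two-unit softmax output and `categorical_crossentropy`. With a single sigmoid output unit and `binary_crossentropy`, the integer labels 0 and 1 can be passed directly.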

In [None]:
original_movies_dataframe = pandas.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
display(original_movies_dataframe.head())
print("\n\nClass Distribution in the Dataset:")
original_movies_dataframe["sentiment"].value_counts()/original_movies_dataframe.shape[0] # Class distribution in the dataset

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |




Class Distribution in the Dataset:


| sentiment | count |
|-----------|:-----:|
| positive  |  0.5  |
| negative  |  0.5  |


In [None]:
def preprocess_text(text: str) -> str:
    """Clean one review: strip HTML line breaks, lowercase, drop punctuation, single characters, and stopwords."""
    text = re.sub(r"<br\s*/?>", " ", text) # Replace HTML line breaks such as "<br />" with a space
    text = text.lower() # Lowercase all text.
    text = re.sub(r"[^\w\s]", "", text) # Remove punctuation
    text = re.sub(r"\b\w\b", " ", text) # Remove single-character tokens
    tokens = [token for token in text.split() if token not in nltk_stopwords] # split() also collapses extra whitespace; keep only the non-stopwords
    return " ".join(tokens)
preprocessed_dataframe = original_movies_dataframe.copy() # Make a copy of original dataframe
preprocessed_dataframe["review"] = preprocessed_dataframe["review"].map(preprocess_text) # Preprocess the copy's data

# Compare one review before and after preprocessing
print("Original:", original_movies_dataframe["review"][0])
print("Preprocessed:", preprocessed_dataframe["review"][0])

Original: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due 

In [None]:
# Split data (75% train, 25% test)
X = preprocessed_dataframe.review  # Feature: Movie review text
y = preprocessed_dataframe["sentiment"].map({"negative": 0, "positive": 1}).values  # Label: negative = 0. & positive = 1.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
print("X train shape =", X_train.shape, " y train shape =", y_train.shape)
print("X test shape =", X_test.shape, " y test shape =", y_test.shape)

# Tokenize texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
print("\n\nNumber of unique words in dictionary =", len(tokenizer.word_index))
print("Dictionary is =", tokenizer.word_index)

# Convert texts to padded sequences
padded_maxlen = 100
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)
X_train_padded_sequences = pad_sequences(X_train_sequences, maxlen=padded_maxlen, padding='post')  # Force length = 100: pad short sequences with trailing zeros; longer ones are truncated from the front (Keras default truncating='pre')
X_test_padded_sequences = pad_sequences(X_test_sequences, maxlen=padded_maxlen, padding='post')  # Same padding and truncation for the test set

print("\n\nTokenized sequence:", X_train_sequences[0])
print("Padded sequence:", X_train_padded_sequences[0])

# Convert sentiment labels to categorical format.
y_train_categorical = to_categorical(y_train, num_classes=2)
y_test_categorical = to_categorical(y_test, num_classes=2)

print("\n\nOriginal:", y_train[0])
print("Categorical:", y_train_categorical[0])

X train shape = (37500,)  y train shape = (37500,)
X test shape = (12500,)  y test shape = (12500,)


Number of unique words in dictionary = 154015


Tokenized sequence: [13, 607, 2912, 112, 3617, 148, 8566, 12863, 85, 6, 505, 1559, 779, 853, 23612, 3721, 388, 70785, 1, 186, 920, 699, 629, 108, 42, 8, 165, 751, 77, 39, 1548, 3936, 1, 20, 6997, 729, 127, 1554, 802, 629, 4426, 260, 421, 10, 652, 9, 12864, 69, 131, 3937, 10, 7368, 1, 938, 664, 8566, 2630, 2, 11, 346, 2630, 135, 131, 1304, 123, 261, 874, 63, 203, 4, 388, 1422, 3937, 10065, 4618, 179, 15281, 1198, 1704, 1198, 1083, 11361, 1, 8566, 2573, 325, 17774, 1872, 2114, 2044, 6395, 1845, 2147, 1657, 12863, 12863, 29, 318, 4016, 998, 367, 2105, 1026, 746, 35355, 1785, 1845, 21514, 1481, 318, 738, 14488, 14868, 435, 21515, 648, 36, 456, 456, 456, 1514, 803, 3722, 21, 3011, 38, 655, 751, 283, 17, 640, 4078, 169, 5903, 1942, 2646, 169, 1750, 136, 349, 367, 182, 527, 5126, 826, 1186, 531]
Padded sequence: [   69   131  3937    10  7368   

# **2. Embedding Layer Construction**

Embedding matrix shape: (154016, 100)
* 154016 is the vocabulary size: the 154015 unique words found by the tokenizer plus one row for the padding index 0.
* 100 is the dimension of each GloVe embedding vector.

One *benefit* of using pre-trained embeddings is that a developer doesn't need to train word embeddings from scratch. Many freely available embedding sets, such as GloVe, were trained on large datasets and contain a rich diversity of words.

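One *limitation* of using pre-trained embeddings is that they may be missing domain-specific terms or represent them poorly. In this lab, any word that is absent from the GloVe file keeps an all-zero row in the embedding matrix, so a developer may need to conduct transfer learning (fine-tuning) on a sufficient set of domain-specific data for the embeddings to represent those terms properly.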
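
One way to gauge this limitation is to check how much of the tokenizer's vocabulary actually has a pre-trained vector. The sketch below is only an illustration: `embedding_coverage` is a hypothetical helper, and the toy dictionaries stand in for `tokenizer.word_index` and the set of words found in the GloVe file.

```python
def embedding_coverage(word_index, pretrained_words):
    """Return (covered_fraction, missing_words) for a tokenizer vocabulary."""
    missing = [word for word in word_index if word not in pretrained_words]
    covered = 1 - len(missing) / len(word_index)
    return covered, missing


# Toy stand-ins (assumptions): in the lab these would be tokenizer.word_index
# and the words read from the GloVe file.
toy_word_index = {"movie": 1, "great": 2, "plot": 3, "cgi": 4, "mcu": 5}
toy_glove_words = {"movie", "great", "plot", "cgi"}

coverage, missing = embedding_coverage(toy_word_index, toy_glove_words)
print(f"coverage = {coverage:.0%}, missing = {missing}")
# → coverage = 80%, missing = ['mcu']
```

Every word in `missing` would be represented by the same all-zero vector, so the frozen embedding layer cannot distinguish between them.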


In [None]:
def embedding_for_vocab(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # +1 for padding token (index 0)
    embedding_matrix_vocab = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix_vocab

In [None]:
embedding_dim = 100 # match this with glove file
glove_path = './glove.6B.100d.txt'
embedding_matrix_vocab = embedding_for_vocab(glove_path, tokenizer.word_index, embedding_dim)
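embedding_layer = Embedding(input_dim=embedding_matrix_vocab.shape[0], # Vocabulary size (number of rows in the embedding matrix)
                            output_dim=embedding_dim, # Dimension of each embedding vector
                            weights=[embedding_matrix_vocab], # Initialize with the GloVe vectors
                            trainable=False # Keep the pre-trained vectors frozen during training
                            )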
print("Embedding matrix shape:", embedding_matrix_vocab.shape)
print(embedding_matrix_vocab.shape[0], "represents number of unique words in the given dataset's vocabulary.")
print(embedding_matrix_vocab.shape[1], "represents desired size of each GloVe embedding vector for word.")

Embedding matrix shape: (154016, 100)
154016 represents number of unique words in the given dataset's vocabulary.
100 represents desired size of each GloVe embedding vector for word.


# **3. Model Implementation: CNN vs. LSTM**


|                              |  CNN   |  LSTM  |
|------------------------------|:------:|:------:|
|*Epoch 1/3*                   |        |        |
|accuracy                      | 0.7213 | 0.7068 |
|loss                          | 0.5290 | 0.5656 |
|val_accuracy                  | 0.7677 | 0.8038 |
|val_loss                      | 0.4669 | 0.4335 |
|                              |        |        |
|*Epoch 2/3*                   |        |        |
|accuracy                      | 0.8462 | 0.8078 |
|loss                          | 0.3467 | 0.4373 |
|val_accuracy                  | 0.8368 | 0.8334 |
|val_loss                      | 0.3604 | 0.3798 |
|                              |        |        |
|*Epoch 3/3*                   |        |        |
|accuracy                      | 0.8891 | 0.8310 |
|loss                          | 0.2665 | 0.3914 |
|val_accuracy                  | 0.8309 | 0.8380 |
|val_loss                      | 0.3722 | 0.4042 |

In [None]:
max_epochs = 3
CNN_model = Sequential([ # Creates a linear stack of layers where each layer passes output to the next
    embedding_layer,
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid') # Outputs the probability of the positive class
])
CNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
CNN_model.fit(X_train_padded_sequences, y_train, batch_size=32, epochs=max_epochs, validation_split=0.25)

Epoch 1/3
879/879 ━━━━━━━━━━━━━━━━━━━━ 64s 66ms/step - accuracy: 0.7213 - loss: 0.5290 - val_accuracy: 0.7677 - val_loss: 0.4669
Epoch 2/3
879/879 ━━━━━━━━━━━━━━━━━━━━ 84s 68ms/step - accuracy: 0.8462 - loss: 0.3467 - val_accuracy: 0.8368 - val_loss: 0.3604
Epoch 3/3
879/879 ━━━━━━━━━━━━━━━━━━━━ 61s 69ms/step - accuracy: 0.8891 - loss: 0.2665 - val_accuracy: 0.8309 - val_loss: 0.3722


<keras.src.callbacks.history.History at 0x7cab16bbb020>

In [None]:
LSTM_model = Sequential([
    embedding_layer,
    LSTM(64),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

LSTM_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
LSTM_model.fit(X_train_padded_sequences, y_train, batch_size=32, epochs=max_epochs, validation_split=0.25)

Epoch 1/3
879/879 ━━━━━━━━━━━━━━━━━━━━ 55s 60ms/step - accuracy: 0.7068 - loss: 0.5656 - val_accuracy: 0.8038 - val_loss: 0.4335
Epoch 2/3
879/879 ━━━━━━━━━━━━━━━━━━━━ 84s 62ms/step - accuracy: 0.8078 - loss: 0.4373 - val_accuracy: 0.8334 - val_loss: 0.3798
Epoch 3/3
879/879 ━━━━━━━━━━━━━━━━━━━━ 78s 58ms/step - accuracy: 0.8310 - loss: 0.3914 - val_accuracy: 0.8380 - val_loss: 0.4042


<keras.src.callbacks.history.History at 0x7cab187c38c0>

# **4. Test Set Evaluation and Comparison**

|                              |  CNN   | LSTM |
|------------------------------|:------:|:----:|
|accuracy                      | 0.8379 |0.8412|
|loss                          | 0.3692 |0.3975|


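During training, the CNN reached higher accuracy than the LSTM. On the test set, however, the LSTM scored slightly higher (0.8412 vs. 0.8379). The Dropout layer in the LSTM may have helped it become more robust and generalize better to unseen data than the CNN model. The CNN model may have overfit by memorizing patterns specific to the training data, which would explain why its accuracy was higher in training than in testing. The test-set difference is small (about 0.3 percentage points), so it may also reflect run-to-run variation.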
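
The overfitting argument can be checked with a little arithmetic. The sketch below copies the final-epoch accuracies from the training logs and the test accuracies from the table in this section, then computes the gap between training and validation accuracy. The helper `generalization_gap` is a hypothetical name for this illustration. Training accuracy is averaged over the epoch while the weights are still changing, so the gap is only a rough indicator.

```python
# Final-epoch (3/3) accuracies copied from the training logs and test evaluation in this notebook.
results = {
    "CNN": {"train": 0.8891, "val": 0.8309, "test": 0.8379},
    "LSTM": {"train": 0.8310, "val": 0.8380, "test": 0.8412},
}


def generalization_gap(metrics):
    """Training accuracy minus validation accuracy; a large positive gap hints at overfitting."""
    return round(metrics["train"] - metrics["val"], 4)


for name, metrics in results.items():
    print(f"{name}: train-val gap = {generalization_gap(metrics):+.4f}")
# → CNN: train-val gap = +0.0582
# → LSTM: train-val gap = -0.0070

test_difference = round(results["LSTM"]["test"] - results["CNN"]["test"], 4)
print(f"LSTM - CNN test accuracy = {test_difference:+.4f}")
# → LSTM - CNN test accuracy = +0.0033
```

The CNN's training accuracy exceeds its validation accuracy by almost six points, while the LSTM shows no such gap, which is consistent with the overfitting explanation.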

*Note: Since the data is balanced (50% negative & 50% positive), accuracy is a sufficient evaluation metric.*
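
A small sketch of why class balance matters for accuracy: a classifier that always predicts the most common label scores only 50% on balanced data, but can look deceptively good on skewed data. The helper `majority_baseline_accuracy` and the imbalanced example are hypothetical and added purely for contrast.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


balanced = [0] * 50 + [1] * 50    # like this dataset: 50% negative, 50% positive
imbalanced = [0] * 90 + [1] * 10  # hypothetical skewed dataset for contrast

print(majority_baseline_accuracy(balanced))    # → 0.5
print(majority_baseline_accuracy(imbalanced))  # → 0.9
```

Against a 0.5 baseline, test accuracies around 0.84 indicate that both models learned real signal.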

In [None]:
# Evaluate CNN model
loss, accuracy = CNN_model.evaluate(X_test_padded_sequences, y_test)
print("CNN:")
print(f"Test Accuracy: {accuracy * 100:.2f}%")
print(f"Test Loss: {loss:.4f}")  # Loss is not a percentage, so report the raw value

# Evaluate LSTM model
loss, accuracy = LSTM_model.evaluate(X_test_padded_sequences, y_test)
print("\n\nLSTM:")
print(f"Test Accuracy: {accuracy * 100:.2f}%")
print(f"Test Loss: {loss:.4f}")

391/391 ━━━━━━━━━━━━━━━━━━━━ 8s 22ms/step - accuracy: 0.8319 - loss: 0.3775
CNN:
Test Accuracy: 83.79%
Test Loss: 0.3692
391/391 ━━━━━━━━━━━━━━━━━━━━ 6s 16ms/step - accuracy: 0.8337 - loss: 0.4108


LSTM:
Test Accuracy: 84.12%
Test Loss: 0.3975


# **Technical Reflection**
## CNN vs LSTM for text data

CNNs are good for text data such as short tweets or social media comments, where it is sufficient to capture local text patterns (n-grams).

LSTMs are good for longer texts where word order matters, for example distinguishing "bad" from "not bad". This is needed in longer reviews, where the language is sometimes less direct.
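
The role of word order can be illustrated without any neural network. In the sketch below, `sliding_ngrams` is a hypothetical helper that lists the token windows a width-n convolution kernel would slide over; the two toy reviews are made up for the example.

```python
def sliding_ngrams(tokens, n):
    """Return every window of n consecutive tokens, as a width-n convolution kernel would see them."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


review_a = "not bad at all".split()
review_b = "bad not at all".split()  # same words, different order

# A bag of words cannot tell the two reviews apart ...
print(sorted(review_a) == sorted(review_b))  # → True

# ... but bigram windows can, because they keep local word order.
print(sliding_ngrams(review_a, 2))
# → [('not', 'bad'), ('bad', 'at'), ('at', 'all')]
print(("not", "bad") in sliding_ngrams(review_b, 2))  # → False
```

A convolution kernel only sees words that fall inside the same window, so a negation that sits farther from its target than the kernel width can be missed by a single convolutional layer. An LSTM carries its state across the whole sequence, which is why it suits longer, less direct reviews.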

## Neural model improvements

Introduce a Dropout layer into the CNN model to help it generalize better and reduce overfitting. Introduce a bidirectional LSTM layer into the LSTM model so that context from both earlier and later words in a review can be considered in training.
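
A minimal sketch of both proposals, under stated assumptions: the builder function names, the `dropout_rate=0.5` value, and the toy `vocab_size` are hypothetical, and the embeddings here are randomly initialized rather than loaded from GloVe so that the block stays self-contained. These variants were not trained or evaluated in this lab.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Input, Embedding, Conv1D, GlobalMaxPooling1D,
    Dropout, Bidirectional, LSTM, Dense,
)


def build_cnn_with_dropout(vocab_size, embedding_dim=100, maxlen=100, dropout_rate=0.5):
    """CNN variant with a Dropout layer before the output (dropout_rate is an assumed value)."""
    return Sequential([
        Input(shape=(maxlen,)),
        Embedding(input_dim=vocab_size, output_dim=embedding_dim),  # random init here, not GloVe
        Conv1D(filters=128, kernel_size=5, activation="relu"),
        Conv1D(filters=128, kernel_size=5, activation="relu"),
        GlobalMaxPooling1D(),
        Dropout(dropout_rate),  # randomly zeroes features during training only
        Dense(1, activation="sigmoid"),
    ])


def build_bidirectional_lstm(vocab_size, embedding_dim=100, maxlen=100):
    """LSTM variant that reads each review both left-to-right and right-to-left."""
    return Sequential([
        Input(shape=(maxlen,)),
        Embedding(input_dim=vocab_size, output_dim=embedding_dim),  # random init here, not GloVe
        Bidirectional(LSTM(64)),  # concatenates the forward and backward final states
        Dropout(0.2),
        Dense(1, activation="sigmoid"),
    ])


cnn = build_cnn_with_dropout(vocab_size=1000)
bilstm = build_bidirectional_lstm(vocab_size=1000)
print(cnn.output_shape, bilstm.output_shape)
```

Both models can be compiled and fit exactly like the originals, with `binary_crossentropy` and the integer labels, which would make the comparison with the baseline runs straightforward.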

## Neural model real-world applications
In investing, a model that considers the daily fluctuations in a stock's price while still remembering the long-term trend of the stock's direction would be better than a model that relies only on the most recent fluctuations for its predictions. This is why an LSTM is a useful way to explore how long short-term memory can help predict a future stock price.