## Assignment: Neural Architectures for Text Sentiment Analysis (10 points)

**Background:**  
This assignment investigates how different deep learning architectures affect sentiment classification performance on movie reviews, using the IMDB dataset. You will implement and compare Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models, analyze the influence of embedding strategies, and interpret the results. The workflow follows the structure and code style shown in the attached "DeepNN" notebook.

### Instructions and Point Breakdown

**1. Dataset Preparation (2 points)**

- Download and extract the IMDB Large Movie Review Dataset.
- Write code to:
  - Load review texts and sentiment labels for both training and test sets.
  - Tokenize the texts and convert them to padded sequences.
  - Convert sentiment labels to categorical format.
- In 2–3 sentences, explain why padding and categorical labels are necessary in deep neural text classification.

**2. Embedding Layer Construction (2 points)**

- Download pre-trained GloVe word vectors (100d or 300d).
- Build an embedding matrix mapping your vocabulary to GloVe vectors.
- Initialize a Keras Embedding layer with this matrix (`trainable=False`).
- Print the shape of your embedding matrix and explain what each dimension represents.
- Briefly discuss one benefit and one limitation of using pre-trained embeddings.

**3. Model Implementation: CNN vs. LSTM (3 points)**

- Implement two neural models for sentiment classification:
  - **CNN:** At least two 1D convolutional layers with max pooling.
  - **LSTM:** A single LSTM layer with dropout.
- Train both models on the same train/validation split for three epochs.
- Record and report validation accuracy for each epoch for both models in a Markdown table.

**4. Test Set Evaluation and Comparison (2 points)**

- Evaluate both models on the IMDB test set.
- Report and compare their test accuracy in a Markdown table.
- In 2–3 sentences, interpret your findings: Which model performed better, and what might explain the difference?

**5. Technical Reflection (1 point)**

- In a Markdown cell, answer:
  - Which types of text data or tasks might benefit more from CNNs versus LSTMs, and vice versa?
  - Suggest one modification to further improve each model (e.g., regularization, deeper layers, bidirectional LSTM).
  - Name a real-world application where your preferred model would be most appropriate, and explain why.

### Submission Requirements

- Jupyter Notebook containing:
  - Clear, well-commented code for all sections
  - Output cells showing key results and tables
  - Reflection in Markdown cells
- Use Python libraries: `tensorflow`, `numpy`, `pandas`

#### Grading Rubric

| Section                      | Points |
|------------------------------|:------:|
| Dataset preparation          |   2    |
| Embedding layer              |   2    |
| Model implementation         |   3    |
| Evaluation & comparison      |   2    |
| Reflection quality           |   1    |
| **Total**                    | **10** |

# *Sources/References*
* Data source: [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
* https://www.geeksforgeeks.org/nlp/next-word-prediction-with-deep-learning-in-nlp/
* https://www.geeksforgeeks.org/machine-learning/python-keras-keras-utils-to_categorical/
* https://www.geeksforgeeks.org/deep-learning/zero-padding-in-deep-learning-and-signal-processing/
* https://www.geeksforgeeks.org/machine-learning/categorical-data-encoding-techniques-in-machine-learning/
* https://www.geeksforgeeks.org/nlp/Glove-Word-Embedding-in-NLP/
* https://keras.io/api/layers/core_layers/embedding/
* https://www.geeksforgeeks.org/nlp/text-classification-using-cnn/
* https://www.geeksforgeeks.org/nlp/sentiment-analysis-using-lstm/

# IF KAGGLE IS UNABLE TO RUN, PLEASE RUN NOTEBOOK FROM HERE:

https://colab.research.google.com/drive/1QArRZXckT7g1qvP2uCI2WO5KUCzzX_G5?usp=sharing

# **Installs & Imports**

In [17]:
# Traditional ML and data processing
!pip install scikit-learn pandas matplotlib tensorflow kagglehub -q


In [18]:
import kagglehub

# Preprocessing libraries
import pandas
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords # To remove stopwords in NLTK
nltk_stopwords = set(stopwords.words("english")) # Load English stopwords once for efficiency

# Numerical and preprocessing libraries
import numpy as np  # Needed for the embedding matrix (np.zeros, np.array)
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Data splitting library
from sklearn.model_selection import train_test_split

# Model libraries (CNN and LSTM)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, LSTM, Dense, Dropout

# Library to convert sentiment labels to categorical (one-hot) format
from tensorflow.keras.utils import to_categorical

[nltk_data] Downloading package stopwords to /home/mch84/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


ModuleNotFoundError: No module named 'tensorflow'

In [9]:
!wget -nc https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!unzip -q glove.6B.zip

--2025-09-14 04:14:05--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2025-09-14 04:16:46 (5.13 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



# **1. Dataset Preparation (2 points)**


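Padding is necessary in deep neural text classification because reviews vary in length, while the network expects every batch to be a tensor of one fixed shape. Padding shorter sequences with zeros (and truncating longer ones) gives all inputs an identical length without changing the tokens they contain, so the layers further down the pipeline can process them in batches.


Categorical labels are necessary because a neural network cannot train on the strings "negative" and "positive"; each class must first be mapped to a number. We can go a step further and use `to_categorical()` to turn the integer labels into one-hot vectors, which is the format `categorical_crossentropy` expects for a softmax output. (The models below use a single sigmoid unit with `binary_crossentropy`, so they are trained on the integer labels directly.)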
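
To make the two ideas concrete, here is a minimal NumPy sketch of what post-padding and one-hot encoding do to toy data. The helpers `pad_post` and `one_hot` are illustrative stand-ins written for this example, not the Keras functions themselves. One difference worth noting: this sketch drops tokens from the end of an over-long sequence, whereas `pad_sequences` by default drops them from the start (`truncating='pre'`).

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad (or cut) integer sequences so that all have length `maxlen`."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]  # keep only the first `maxlen` tokens
        padded[row, :len(trimmed)] = trimmed
    return padded

def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]

toy_sequences = [[4, 12, 7], [9], [3, 3, 8, 1, 5]]  # made-up token ids
toy_padded = pad_post(toy_sequences, maxlen=4)
toy_one_hot = one_hot([0, 1, 1], num_classes=2)

print(toy_padded)   # every row now has exactly 4 entries
print(toy_one_hot)  # one row per label, one column per class
```

After padding, the three reviews of lengths 3, 1, and 5 form a single `(3, 4)` array, which is the kind of fixed-shape input an embedding layer can consume.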

In [5]:
original_movies_dataframe = pandas.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv") # alternative below
# original_movies_dataframe = pandas.read_csv("./IMDB Dataset.csv") # Alternative (if need to self download)
display(original_movies_dataframe.head())
print("\n\nClass Distribution in the Dataset:")
original_movies_dataframe["sentiment"].value_counts()/original_movies_dataframe.shape[0] # Class distribution in the dataset
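
The following cells read from a `preprocessed_dataframe` with a `review` column, but the cell that builds it is not shown here. As a rough sketch only, assuming the cleaning consisted of lowercasing, stripping HTML tags and punctuation, and removing stopwords, it could look like the code below. The function name `preprocess_review` and the tiny stopword set are hypothetical; the notebook itself loads NLTK's English stopword list into `nltk_stopwords`.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch only).
EXAMPLE_STOPWORDS = {"a", "an", "the", "is", "was", "this", "it", "of", "and", "to"}

def preprocess_review(text, stopwords=EXAMPLE_STOPWORDS):
    """Lowercase a review, strip HTML tags and punctuation, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)     # remove HTML tags such as <br />
    text = re.sub(r"[^a-z0-9\s]", "", text)  # remove punctuation and symbols
    tokens = [token for token in text.split() if token not in stopwords]
    return " ".join(tokens)

cleaned_example = preprocess_review("This movie was GREAT.<br />Loved the ending!")
print(cleaned_example)  # movie great loved ending
```

Under these assumptions, the missing dataframe could be produced with something like `preprocessed_dataframe = original_movies_dataframe.assign(review=original_movies_dataframe["review"].map(preprocess_review))`. Note that NLTK's stopword list contains negations such as "not", so removing stopwords can change the sentiment of a phrase.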



In [6]:
# Split data (75% train, 25% test)
X = preprocessed_dataframe.review  # Feature: Movie review text
y = preprocessed_dataframe["sentiment"].map({"negative": 0, "positive": 1}).values  # Label: negative = 0. & positive = 1.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
print("X train shape =", X_train.shape, " y train shape =", y_train.shape)
print("X test shape =", X_test.shape, " y test shape =", y_test.shape)

# Tokenize texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
print("\n\nNumber of unique words in dictionary =", len(tokenizer.word_index))
print("Dictionary is =", tokenizer.word_index)

# Convert texts to padded sequences
padded_maxlen = 100
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)
X_train_padded_sequences = pad_sequences(X_train_sequences, maxlen=padded_maxlen, padding='post')  # Force every sequence to length 100: shorter ones are padded with 0 at the end, longer ones are truncated
X_test_padded_sequences = pad_sequences(X_test_sequences, maxlen=padded_maxlen, padding='post')  # Same length and padding rule as the training sequences

print("\n\nTokenized sequence:", X_train_sequences[0])
print("Padded sequence:", X_train_padded_sequences[0])

# Convert sentiment labels to categorical format.
y_train_categorical = to_categorical(y_train, num_classes=2)
y_test_categorical = to_categorical(y_test, num_classes=2)

print("\n\nOriginal:", y_train[0])
print("Categorical:", y_train_categorical[0])  # Same training sample as the original label above






# **2. Embedding Layer Construction (2 points)**

Embedding matrix shape: (154016, 100)
* 154016 is the vocabulary size: the number of unique words the tokenizer found in the training reviews, plus one row for the padding index 0.
* 100 is the dimensionality of each GloVe vector, i.e. the number of values used to represent one word.

One benefit of using pre-trained embeddings is that a developer doesn't need to train word embeddings from scratch. Many freely available embedding sets were trained on large corpora and cover a rich diversity of words.

One limitation of using pre-trained embeddings is that they may not cover domain-specific terms. Words that are missing from GloVe keep an all-zero row in the embedding matrix, and with `trainable=False` the vectors cannot adapt to how words are used in movie reviews. Fixing this requires fine-tuning the embeddings (transfer learning) on a sufficiently large set of domain-specific data.
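
One way to gauge this limitation is to measure how much of the vocabulary actually received a pre-trained vector. The sketch below assumes an embedding matrix built like the one in this notebook, where row 0 is the padding index and words missing from GloVe stay all-zero. The helper `embedding_coverage` and the toy matrix are illustrative, not part of the original notebook.

```python
import numpy as np

def embedding_coverage(embedding_matrix):
    """Fraction of vocabulary rows (row 0 = padding, excluded) that hold a non-zero vector."""
    vocab_rows = embedding_matrix[1:]
    found = np.any(vocab_rows != 0, axis=1)  # True where a pre-trained vector was copied in
    return float(found.mean())

# Toy matrix: 4 words with 3-dimensional vectors; word 2 was not found in the embedding file.
toy_matrix = np.zeros((5, 3))
toy_matrix[1] = [0.1, -0.2, 0.3]
toy_matrix[3] = [0.5, 0.5, 0.0]
toy_matrix[4] = [0.0, 0.0, 0.7]

toy_coverage = embedding_coverage(toy_matrix)
print(f"Coverage: {toy_coverage:.2%}")  # 3 of the 4 words have a vector
```

Calling `embedding_coverage(embedding_matrix_vocab)` on the real matrix would show what share of the 154,015 vocabulary words start with a GloVe vector and what share start from zeros.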


In [8]:
def embedding_for_vocab(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # +1 for padding token (index 0)
    embedding_matrix_vocab = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix_vocab



In [9]:
embedding_dim = 100 # match this with glove file
glove_path = './glove.6B.100d.txt'
embedding_matrix_vocab = embedding_for_vocab(glove_path, tokenizer.word_index, embedding_dim)
embedding_layer = Embedding(input_dim=embedding_matrix_vocab.shape[0], # Vocabulary size (rows of the embedding matrix)
                            output_dim=embedding_dim, # Length of each embedding vector
                            weights=[embedding_matrix_vocab], # Initialize with the GloVe vectors
                            trainable=False # Keep the pre-trained vectors frozen during training
                            )
print("Embedding matrix shape:", embedding_matrix_vocab.shape)
print(embedding_matrix_vocab.shape[0], "is the vocabulary size (unique words in the training set + 1 for the padding index).")
print(embedding_matrix_vocab.shape[1], "is the size of each GloVe embedding vector.")



# **3. Model Implementation: CNN vs. LSTM (3 points)**


|                              |  CNN   | LSTM |
|------------------------------|:------:|:----:|
|*Epoch 1/3*                   |        |      |
|accuracy                      | 0.7257 |0.7068|
|loss                          | 0.5224 |0.5656|
|val_accuracy                  | 0.8116 |0.8038|
|val_loss                      | 0.4140 |0.4335|
|                              |        |      |
|*Epoch 2/3*                   |        |      |
|accuracy                      | 0.8530 |0.8078|
|loss                          | 0.3344 |0.4373|
|val_accuracy                  | 0.8334 |0.8334|
|val_loss                      | 0.3705 |0.3798|
|                              |        |      |
|*Epoch 3/3*                   |        |      |
|accuracy                      | 0.8967 |0.8310|
|loss                          | 0.2578 |0.3914|
|val_accuracy                  | 0.8359 |0.8380|
|val_loss                      | 0.3874 |0.4042|

In [10]:
max_epochs = 3
CNN_model = Sequential([ # Creates a linear stack of layers where each layer passes output to the next
    embedding_layer,
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid') # Outputs the probability that the review is positive
])
CNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
CNN_model.fit(X_train_padded_sequences, y_train, batch_size=32, epochs=max_epochs, validation_split=0.25)
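
To see how the two convolutions reshape a padded review, the sketch below traces the sequence length through the CNN. It assumes the Keras `Conv1D` defaults used above (`padding='valid'`, `strides=1`); the helper `conv1d_output_length` is written for illustration only.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

sequence_length = 100  # padded_maxlen in this notebook
kernel_size = 5

lengths = [sequence_length]
for _ in range(2):  # two stacked Conv1D layers
    lengths.append(conv1d_output_length(lengths[-1], kernel_size))

# Each output position of the second layer depends on this many input tokens.
receptive_field = 1 + 2 * (kernel_size - 1)

print("Sequence length per stage:", lengths)
print("Receptive field after two layers:", receptive_field)
```

The sequence shrinks from 100 to 96 to 92 positions, each holding 128 filter activations. `GlobalMaxPooling1D` then keeps the maximum of each filter over the 92 positions, giving one 128-dimensional vector per review. Every feature is therefore built from a window of at most 9 consecutive tokens.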



In [12]:
LSTM_model = Sequential([
    embedding_layer,
    LSTM(64),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

LSTM_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
LSTM_model.fit(X_train_padded_sequences, y_train, batch_size=32, epochs=max_epochs, validation_split=0.25)
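
For a rough sense of model capacity, the trainable weights of the two architectures can be counted by hand. This is a sketch using the standard parameter formulas for `LSTM` and `Conv1D` layers with biases, with 100-dimensional frozen embeddings as input. The helper names are illustrative, and the final `Dense(1)` layers are left out.

```python
def lstm_parameter_count(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def conv1d_parameter_count(input_channels, filters, kernel_size):
    """One kernel of shape (kernel_size, input_channels) plus one bias per filter."""
    return kernel_size * input_channels * filters + filters

embedding_dim = 100

lstm_params = lstm_parameter_count(input_dim=embedding_dim, units=64)
cnn_params = (
    conv1d_parameter_count(input_channels=embedding_dim, filters=128, kernel_size=5)
    + conv1d_parameter_count(input_channels=128, filters=128, kernel_size=5)
)

print("LSTM(64) parameters:", lstm_params)
print("Two Conv1D(128, 5) layers:", cnn_params)
```

Under these formulas the two convolutional layers hold about 3.5 times as many trainable weights as the LSTM layer. `CNN_model.summary()` and `LSTM_model.summary()` report the actual counts.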



# **4. Test Set Evaluation and Comparison (2 points)**

|                              |  CNN   | LSTM |
|------------------------------|:------:|:----:|
|accuracy                      | 0.8379 |0.8412|
|loss                          | 0.3692 |0.3975|


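During training, the CNN reached higher accuracy than the LSTM (0.8967 vs. 0.8310 in epoch 3), but on the test set the LSTM was slightly ahead (0.8412 vs. 0.8379). The Dropout layer in the LSTM may have made it more robust and helped it generalize to unseen data better than the CNN. The CNN may have overfit by memorizing patterns specific to the training data, which would explain why its training accuracy is so much higher than its validation and test accuracy. The test gap is only about 0.3 percentage points, so the difference should be interpreted with caution.

*Note: Since the data is balanced (50% negative & 50% positive), accuracy is a sufficient evaluation metric.*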
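
As a quick sanity check on how much weight the test gap can carry, the sketch below compares it with the binomial standard error of an accuracy estimate. It assumes a test set of 12,500 reviews (25% of the 50,000 reviews) and treats the two models as independent, which is a simplification; a paired test on the per-review predictions would be more precise.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

n_test = 12_500  # assumption: 25% of 50,000 reviews
cnn_accuracy, lstm_accuracy = 0.8379, 0.8412  # values from the table above

gap = lstm_accuracy - cnn_accuracy
standard_error = accuracy_standard_error(lstm_accuracy, n_test)

print(f"Accuracy gap:   {gap:.4f}")
print(f"Standard error: {standard_error:.4f}")
```

Under these assumptions the gap is about the size of one standard error, so a different random seed or split could plausibly reverse the ranking.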

In [18]:
# Evaluate CNN model
loss, accuracy = CNN_model.evaluate(X_test_padded_sequences, y_test)
print("CNN:")
print(f"Test Accuracy: {accuracy * 100:.2f}%")
print(f"Test Loss: {loss:.4f}")  # Cross-entropy loss is not a percentage

# Evaluate LSTM model
loss, accuracy = LSTM_model.evaluate(X_test_padded_sequences, y_test)
print("\n\nLSTM:")
print(f"Test Accuracy: {accuracy * 100:.2f}%")
print(f"Test Loss: {loss:.4f}")  # Cross-entropy loss is not a percentage



# **5. Technical Reflection (1 point)**

  - CNNs suit short texts such as tweets or social media comments, where capturing local patterns (n-grams) is sufficient. LSTMs suit longer texts where word order and context matter, e.g. "bad" vs. "not bad"; this is needed in longer reviews, where the language is often less direct.
  - CNN: add a Dropout layer so the model generalizes better and overfits less. LSTM: replace the LSTM layer with a bidirectional LSTM so that context from both earlier and later words is considered during training.
  - In investing, a model that tracks daily fluctuations in a stock's price while still remembering the long-term trend of its direction would be better than a model that relies only on the most recent fluctuations. This is why I prefer the LSTM: it would be insightful to see how long short-term memory helps predict a future stock price.
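
The n-gram point in the first bullet can be illustrated without any neural network. A convolution kernel of width n looks at n consecutive tokens at a time, much like the sliding window below. The helper `ngrams` and the sample sentence are made up for this illustration.

```python
def ngrams(tokens, n):
    """All windows of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the plot was not bad at all".split()

unigrams = ngrams(tokens, 1)  # a width-1 window sees "bad" in isolation
bigrams = ngrams(tokens, 2)   # a width-2 window sees "not bad" as one unit

print(unigrams)
print(bigrams)
```

A window of width 1 only ever sees `('bad',)`, while a window of width 2 contains `('not', 'bad')`, so a kernel at least that wide can learn that the pair is positive. Negations or context separated by more tokens than the receptive field covers are where a recurrent model such as an LSTM may help more.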