# Advanced Neural Network Models for Sentiment Analysis

This notebook extends the sentiment analysis work to recurrent neural network architectures. It trains and evaluates three models (Simple RNN, LSTM, and GRU) on a preprocessed dataset of tweets, classifying each tweet by sentiment. Each model is then scored on a held-out test set to determine which architecture best captures sentiment in short text.


In [1]:
import os
import sys
import json
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Importing local modules from src folder
src_dir = os.path.join(os.getcwd(), '..', 'src')
if src_dir not in sys.path:
    sys.path.append(src_dir)

from model_utils import build_simple_RNN_model, build_LSTM_model, build_GRU_model



In [2]:
# Data loading
csv_file_path = '../data/processed/preprocessed_tweets.csv'
df = pd.read_csv(csv_file_path)

## Data Preparation

The dataset is split into training (80%) and test (20%) sets. A fixed random seed makes the split reproducible.


In [3]:
X = df['tweet']  # Features: tweet texts
y = df['sentiment']  # Labels: sentiments

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
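
To make the roles of `test_size` and `random_state` concrete, here is a minimal pure-Python sketch of a seeded shuffle split. It is an illustration only, not scikit-learn's implementation; `shuffle_split` and the toy `tweets` list are hypothetical names introduced for this example.

```python
import random

def shuffle_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of `items` with a fixed seed, then hold out the last `test_size` fraction."""
    rng = random.Random(seed)        # a fixed seed makes the shuffle repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for the tweet column (hypothetical data)
tweets = [f"tweet {i}" for i in range(10)]
train, test = shuffle_split(tweets)
print(len(train), len(test))  # 8 2
```

Calling `shuffle_split(tweets)` again returns the same partition, which is the property `random_state=42` provides in the cell above.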

## Text Tokenization

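Each tweet is converted into a sequence of integer word indices, a numeric format the networks can consume. The tokenizer is fitted on the training set only, so no vocabulary information leaks from the test set, and `num_words=10000` restricts the sequences to roughly the 10,000 most frequent words.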


In [4]:
# Tokenization of text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
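
The idea behind the tokenization step can be sketched in plain Python. This is a simplified illustration, not the Keras `Tokenizer` itself, which also strips punctuation and handles other details. `fit_word_index` and `encode_texts` are hypothetical helper names, and the tie-breaking rule is an assumption made for determinism.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Ties are broken alphabetically here (an assumption of this sketch)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode_texts(texts, word_index, num_words=None):
    """Map each text to a list of word ids, dropping unknown words and ids >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

train_texts = ["good day", "good movie", "bad movie"]
word_index = fit_word_index(train_texts)           # fitted on training texts only
train_seq = encode_texts(train_texts, word_index)
test_seq = encode_texts(["good terrible movie"], word_index)
print(word_index)  # {'good': 1, 'movie': 2, 'bad': 3, 'day': 4}
print(test_seq)    # [[1, 2]]
```

The word "terrible" never appeared in the training texts, so it is dropped from the test sequence. The same happens in the cell above for test-set words missing from the fitted vocabulary, because no out-of-vocabulary token is configured.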

## Sequence Padding

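All sequences are padded to the length of the longest training sequence so the models receive fixed-size input. Pre-padding places the zeros at the start of each sequence, so the actual words sit at the end, closest to the final recurrent state. Test sequences longer than `max_length` are truncated from the front, which is the `pad_sequences` default.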


In [5]:
# Padding sequences to be of equal length
max_length = max(len(seq) for seq in X_train_seq)  # stays small because tweets are character-limited

X_train_padded = pad_sequences(X_train_seq, maxlen=max_length, padding='pre')
X_test_padded = pad_sequences(X_test_seq, maxlen=max_length, padding='pre')
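
The effect of `padding='pre'`, together with the default front truncation, can be shown with a small pure-Python sketch. `pad_pre` is a hypothetical helper that mimics this behaviour for lists of integers and assumes a positive `maxlen`; it is not the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # truncate from the front, keeping the most recent tokens
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[5, 3], [7, 1, 4, 9, 2]], maxlen=4))
# [[0, 0, 5, 3], [1, 4, 9, 2]]
```

In both cases the final positions hold real tokens, so the last time steps a recurrent layer processes are never padding.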

## Saving Tokenizer

The tokenizer configuration is stored as JSON so the same word index can be reused for reproducibility and later in model deployment.


In [6]:
# Saving tokenizer to json
json_file_path = '../data/processed/tokenizer.json'
tokenizer_json = tokenizer.to_json()
with open(json_file_path, "w", encoding="utf-8") as f:
    f.write(tokenizer_json)

## Building Models

The Simple RNN, LSTM, and GRU models are built with tuned hyperparameters. Note that the LSTM uses a learning rate of 0.005, ten times that of the other two models.


In [7]:
# Building models
rnn_model = build_simple_RNN_model(input_length=max_length, learning_rate=0.0005)
lstm_model = build_LSTM_model(input_length=max_length, lstm_units=64, learning_rate=0.005)
gru_model = build_GRU_model(input_length=max_length, gru_units=64, learning_rate=0.0005)

## Model Training

Each model is trained for five epochs on the padded tweet sequences, with 20% of the training data held out for validation. The per-epoch logs show whether training is converging or starting to overfit.


In [8]:
# Train the RNN model
history_rnn = rnn_model.fit(X_train_padded, y_train, epochs=5, validation_split=0.2, batch_size=32)

# Train the LSTM model
history_lstm = lstm_model.fit(X_train_padded, y_train, validation_split=0.2, epochs=5, batch_size=32)

# Train the GRU model
history_gru = gru_model.fit(X_train_padded, y_train, epochs=5, validation_split=0.2, batch_size=32)

Epoch 1/5
31994/31994 ━━━━━━━━━━━━━━━━━━━━ 179s 6ms/step - accuracy: 0.7784 - loss: 0.4614 - val_accuracy: 0.8123 - val_loss: 0.4086
Epoch 2/5
31994/31994 ━━━━━━━━━━━━━━━━━━━━ 175s 5ms/step - accuracy: 0.8210 - loss: 0.3954 - val_accuracy: 0.8176 - val_loss: 0.4041
Epoch 3/5
31994/31994 ━━━━━━━━━━━━━━━━━━━━ 172s 5ms/step - accuracy: 0.8334 - loss: 0.3721 - val_accuracy: 0.8149 - val_loss: 0.4049
Epoch 4/5
31994/31994 ━━━━━━━━━━━━━━━━━━━━ 174s 5ms/step - accuracy: 0.8434 - loss: 0.3542 - val_accuracy: 0.8149 - val_loss: 0.4059
Epoch 5/5
31994/31994 ━━━━━━━━━━━━━━━━━━━━ 173s 5ms/step - accuracy: 0.8507 - loss: 0.3393 - val_accuracy: 0.8144 - val_loss: 0.4103
Epoch 1/5
31994/31994 ━━━━━━━━━━━━━━━━━━━━ 364s 11ms/step - accuracy: 0.7822 - loss: 0.4757 - val_accuracy: 0.8083 - val_loss: 0.43

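One way to read the learning curves is to find the epoch with the lowest validation loss and compare it with the training loss at the end of the run. The sketch below does this for the first five-epoch run in the log above (the Simple RNN, which is trained first), with the values copied by hand. In a live session the same lists would normally come from `history_rnn.history`.

```python
# Per-epoch losses copied from the first training run in the log above
train_loss = [0.4614, 0.3954, 0.3721, 0.3542, 0.3393]
val_loss = [0.4086, 0.4041, 0.4049, 0.4059, 0.4103]

# Epoch (1-based) at which the validation loss is lowest
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between validation and training loss after the final epoch
final_gap = val_loss[-1] - train_loss[-1]

print(f"best epoch by val_loss: {best_epoch}")
print(f"final val/train loss gap: {final_gap:.4f}")
```

Validation loss bottoms out at epoch 2 while training loss keeps falling, which suggests the model starts to overfit from epoch 3 onward. An early-stopping callback monitoring `val_loss` would be one possible response, though the notebook does not use one.
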
## Model Evaluation

Each model is evaluated on the test set to compare accuracy and identify the most effective architecture for this sentiment analysis task.


In [9]:
# Evaluate the RNN model
rnn_loss, rnn_acc = rnn_model.evaluate(X_test_padded, y_test)
print(f'RNN Model Accuracy: {rnn_acc}')

# Evaluate the LSTM model
lstm_loss, lstm_acc = lstm_model.evaluate(X_test_padded, y_test)
print(f'LSTM Model Accuracy: {lstm_acc}')

# Evaluate the GRU model
gru_loss, gru_acc = gru_model.evaluate(X_test_padded, y_test)
print(f'GRU Model Accuracy: {gru_acc}')

9998/9998 ━━━━━━━━━━━━━━━━━━━━ 15s 1ms/step - accuracy: 0.8138 - loss: 0.4133
RNN Model Accuracy: 0.8141466379165649
9998/9998 ━━━━━━━━━━━━━━━━━━━━ 27s 3ms/step - accuracy: 0.8111 - loss: 0.4295
LSTM Model Accuracy: 0.8105990290641785
9998/9998 ━━━━━━━━━━━━━━━━━━━━ 27s 3ms/step - accuracy: 0.8252 - loss: 0.3948
GRU Model Accuracy: 0.8254676461219788
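
To compare the three results side by side, the accuracies printed above can be ranked with a few lines of Python. The values are copied by hand from the output; in a live session they would be `rnn_acc`, `lstm_acc`, and `gru_acc`.

```python
# Test accuracies copied from the evaluation output above
test_accuracy = {
    "Simple RNN": 0.8141466379165649,
    "LSTM": 0.8105990290641785,
    "GRU": 0.8254676461219788,
}

# Sort models from highest to lowest test accuracy
ranked = sorted(test_accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<10} {acc:.4f}")

best_name, best_acc = ranked[0]
margin = best_acc - ranked[1][1]  # lead over the runner-up
print(f"best: {best_name} (+{margin:.4f} over {ranked[1][0]})")
```

The GRU ranks first, about 1.1 percentage points ahead of the Simple RNN, with the LSTM last.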


## Saving Models

The trained models are saved to disk so they can be reloaded for prediction without retraining.


In [10]:
# Save the models
rnn_model.save('../models/simple_rnn.keras')
lstm_model.save('../models/lstm.keras')
gru_model.save('../models/gru.keras')

## Key Findings

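- GRU achieved the highest test accuracy (0.825), ahead of Simple RNN (0.814) and LSTM (0.811).
- The spread between the best and worst model is about 1.5 percentage points, so all three perform comparably. Simple RNN was the fastest to evaluate (15 s versus 27 s for LSTM and GRU).
- The LSTM was trained with a higher learning rate (0.005 versus 0.0005), so the comparison is not strictly like-for-like and its result may partly reflect that setting.
- On short texts such as tweets, the LSTM's capacity for long-term dependencies did not translate into higher accuracy here, which underlines the value of comparing architectures empirically on the target data.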
