<a href="https://colab.research.google.com/github/ekinfergan/Thesis_Jupyter_Final/blob/main/src/models/rnns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning - LSTM & GRU

In [1]:
!git clone https://github.com/ekinfergan/Thesis_Jupyter_Final.git

Cloning into 'Thesis_Jupyter_Final'...
remote: Enumerating objects: 1059, done.
remote: Counting objects: 100% (479/479), done.
remote: Compressing objects: 100% (307/307), done.
remote: Total 1059 (delta 229), reused 404 (delta 168), pack-reused 580
Receiving objects: 100% (1059/1059), 192.14 MiB | 28.48 MiB/s, done.
Resolving deltas: 100% (547/547), done.
Updating files: 100% (92/92), done.


In [2]:
%cd Thesis_Jupyter_Final
!git pull
%cd ..

/content/Thesis_Jupyter_Final
Already up to date.
/content


In [3]:
!pip install git+https://github.com/scikit-optimize/scikit-optimize.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/scikit-optimize/scikit-optimize.git
  Cloning https://github.com/scikit-optimize/scikit-optimize.git to /tmp/pip-req-build-n75sf6n0
  Running command git clone --filter=blob:none --quiet https://github.com/scikit-optimize/scikit-optimize.git /tmp/pip-req-build-n75sf6n0
  Resolved https://github.com/scikit-optimize/scikit-optimize.git to commit a2369ddbc332d16d8ff173b12404b03fea472492
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pyaml>=16.9 (from scikit-optimize==0.9.0)
  Using cached pyaml-23.5.9-py3-none-any.whl (17 kB)
Building wheels for collected packages: scikit-optimize
  Building wheel for scikit-optimize (pyproject.toml) ... done
  Created wheel for scikit-optimize: filename=scikit_optimize-0.9.0-py2.

In [1]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import pickle
from numpy import asarray

from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score, auc, roc_curve, RocCurveDisplay, confusion_matrix, classification_report
from sklearn.multiclass import OneVsRestClassifier
from itertools import cycle

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Input, Embedding, concatenate, Dense, LSTM, GRU
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers.legacy import Adam, SGD, RMSprop, Adagrad

import skopt
from skopt import gbrt_minimize, gp_minimize
from skopt.utils import use_named_args
from skopt.space import Real, Categorical, Integer
from tensorflow.keras import backend as K

import functools

2023-07-21 20:39:49.836797: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


In [2]:
# Set paths
script_dir = os.path.dirname(os.path.abspath('rnns.ipynb'))
data_path = os.path.join(script_dir, 'Thesis_Jupyter_Final/src/')
os.getcwd()
print(data_path)

input_folder_path = os.path.join(data_path, 'input')
processed_folder_path = os.path.join(data_path, 'input/processed/normal')
results_folder_path =  os.path.join(data_path, "results")

# Create the folder if it doesn't exist
os.makedirs(results_folder_path, exist_ok=True)

/home2/s3985113/Thesis_Jupyter_Final/src/


In [3]:
# Set other constants and variables
senti_labels_dict = {1: 'Negative', 2: 'Neutral', 3: 'Positive'}
senti_labels_names = list(senti_labels_dict.values())
senti_labels_nums = list(senti_labels_dict.keys())
NUM_of_CLASSES = 3

VOCAB_SIZE = 11395
MAX_SEQ_LEN = 449
EMBEDDING_DIM = 100
NUM_OUTPUT_CLASSES = 3

In [4]:
train = pd.read_csv(os.path.join(processed_folder_path, "train.csv"))
val = pd.read_csv(os.path.join(processed_folder_path, "val.csv"))
test = pd.read_csv(os.path.join(processed_folder_path, "test.csv"))

x_train = train['x']
y_train = train['y']
x_val = val['x']
y_val = val['y']
x_test = test['x']
y_test = test['y']

# Load encoded sequences
with open(os.path.join(processed_folder_path, "x_train_encoded.pkl"), "rb") as f:
    x_train_encoded = pickle.load(f)
with open(os.path.join(processed_folder_path, "x_val_encoded.pkl"), "rb") as f:
    x_val_encoded = pickle.load(f)
with open(os.path.join(processed_folder_path, "x_test_encoded.pkl"), "rb") as f:
    x_test_encoded = pickle.load(f)
print(f"x_train_encoded:\n{x_train_encoded[:5]}\n")

# Load embedding vectors
with open(os.path.join(processed_folder_path, "embedding_matrix.pkl"), "rb") as f:
    w2v_embedding_vectors = pickle.load(f)
print(f"embedding vectors: {w2v_embedding_vectors[10][:5]}...\n")

# Load encoded labels
with open(os.path.join(processed_folder_path, "y_train_encoded.pkl"), "rb") as f:
    y_train_encoded = pickle.load(f)
with open(os.path.join(processed_folder_path, "y_val_encoded.pkl"), "rb") as f:
    y_val_encoded = pickle.load(f)
with open(os.path.join(processed_folder_path, "y_test_encoded.pkl"), "rb") as f:
    y_test_encoded = pickle.load(f)
print(f"y_train_encoded:\n{y_train_encoded[:5]}\n")

x_train_encoded:
[[  96  549  929 ...    0    0    0]
 [ 453  240 1125 ...    0    0    0]
 [1260   67  312 ...    0    0    0]
 [ 127 1352 6694 ...    0    0    0]
 [ 529   10   69 ...    0    0    0]]

embedding vectors: [-0.57674998 -0.42304999  0.27188    -0.31986001  0.18842   ]...

y_train_encoded:
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]



## Evaluation Functions

In [13]:
def one_hot_encode(y):
    """
    One-hot encode the labels.

    Args:
        y (array): Labels to be encoded.

    Returns:
        array: One-hot encoded labels.
    """

    y_encoded = np.zeros((len(y), NUM_of_CLASSES))
    for i, label in enumerate(y):
        y_encoded[i, label - 1] = 1

    return y_encoded


def calculate_classification_report(y, y_pred):
    """
    Calculate the classification report.

    Args:
        y (array): True labels.
        y_pred (array): Predicted labels.

    Returns:
        str: Classification report.
    """

    report = classification_report(y, y_pred, labels=senti_labels_nums, target_names=senti_labels_names)
    return report


def plot_confusion_matrix(model_name, y_true, y_pred, res_path):
    """
    Plot the confusion matrix.

    Args:
        model_name (str): Name of the model.
        y_true (array): True labels.
        y_pred (array): Predicted labels.
        res_path (str): Path to save the plot.

    Returns:
        None
    """

    cnf_mat = confusion_matrix(y_true, y_pred)
    mat_disp = ConfusionMatrixDisplay(confusion_matrix=cnf_mat, display_labels=senti_labels_names)
    mat_disp = mat_disp.plot(cmap='Blues', xticks_rotation='vertical')
    plt.title('Confusion Matrix')
    plt.tight_layout()
    plt.savefig(os.path.join(res_path, f"{model_name}_confusion_matrix.png"))
    plt.close()


def get_results(y_pred, y, x, score, model, model_name, params, res_path, only_metrics=True):
    """
    Generate and print the results of a provided model.

    Args:
        y_pred (array): Predicted labels.
        y (array): True labels.
        x: Data.
        score: Model evaluation score (e.g., loss and accuracy).
        model: The trained model.
        model_name (str): The name of the model.
        params: Optimizer parameter.
        res_path (str): Path to store the results.
        only_metrics (bool, optional): If True, only write the metrics; if False, also save the confusion matrix plot. Defaults to True.

    Returns:
        None
    """

    if not os.path.exists(res_path):
        os.makedirs(res_path)

    # Convert one-hot / probability rows back to 1-based class labels
    y_classes = np.argmax(y, axis=1) + 1
    y_pred_classes = np.argmax(y_pred, axis=1) + 1

    print(y_pred.shape)
    print(y_classes.shape)
    print(y_pred_classes.shape)

    print(f"Accuracy: {score[1]:.2%}")
    print(f"Loss: {score[0]:.2f}")

    with open(os.path.join(res_path, f"{model_name}_results.txt"), "w") as f:
        f.write(f"*{model_name}\n")
        f.write(f"Optimizer Params: {params}\n\n")

        f.write(f"Accuracy: {score[1]:.2%}\n")  # score[1] is a fraction in [0, 1]
        f.write(f"Loss: {score[0]:.2f}\n")

        report = calculate_classification_report(y_classes, y_pred_classes)
        if report is not None:
            f.write("Classification Report:\n")
            f.write(report)
        else:
            print("Failed to generate classification report")
        f.write("\n")

        if not only_metrics:
            plot_confusion_matrix(model_name, y_classes, y_pred_classes, res_path)

In [6]:
def plot_development(history, model_name, res_path):
    """
    Plot the training and validation accuracy and loss curves.

    Args:
        history (dict): The `history` dict of a Keras History object, with 'accuracy', 'val_accuracy', 'loss' and 'val_loss' lists.
        model_name (str): The name of the model being trained.
        res_path (str): The path to save the development plot.

    Returns:
        None
    """
    
    acc =  history['accuracy']
    val_acc = history['val_accuracy']

    loss = history['loss']
    val_loss = history['val_loss']

    epochs = range(len(acc))

    plt.figure()
    plt.plot(epochs, acc, 'b', label='Training Accuracy')
    plt.plot(epochs, val_acc, 'r', label='Validation Accuracy')
    plt.title(f"{model_name} Training and Validation Accuracy")
    plt.legend()
    plt.savefig(os.path.join(res_path, f"{model_name}_accuracy_plot.png"))
    plt.close()

    plt.figure()
    plt.plot(epochs, loss, 'b', label='Training Loss')
    plt.plot(epochs, val_loss, 'r', label='Validation Loss')
    plt.title(f"{model_name} Training and Validation Loss")
    plt.legend()
    plt.savefig(os.path.join(res_path, f"{model_name}_loss_plot.png"))
    plt.close()
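
Because `plot_development` receives the plain history dictionary, the epoch with the lowest validation loss can be read from it directly. This is a small sketch under that assumption. The function name `best_epoch` and the numbers in `history` are invented for illustration, and indices are zero-based to match `range(len(acc))` above.

```python
def best_epoch(history, monitor="val_loss"):
    """Return the zero-based index of the epoch with the lowest monitored value."""
    values = history[monitor]
    return min(range(len(values)), key=values.__getitem__)


# Invented history values with the same keys Keras records for
# metrics=['accuracy'] plus validation data.
history = {
    "loss": [0.30, 0.22, 0.19],
    "accuracy": [0.55, 0.70, 0.78],
    "val_loss": [0.21, 0.15, 0.17],
    "val_accuracy": [0.72, 0.84, 0.82],
}

epoch = best_epoch(history)
print(f"Lowest val_loss at epoch index {epoch}: {history['val_loss'][epoch]}")
```
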

## Training

The Bayesian optimization code, which uses Gradient Boosted Regression Trees as the surrogate model, was inspired by the Medium article "Parameter Hyperparameter Tuning with Bayesian Optimization" by crawftv: [Gradient Boost Model](https://medium.com/@crawftv/parameter-hyperparameter-tuning-with-bayesian-optimization-7acf42d348e1). The model architecture is based on the Kaggle notebook "Twitter Sentiment Analysis with Word2Vec LSTM": [Word2Vec - LSTM](https://www.kaggle.com/code/caiyutiansg/twitter-sentiment-analysis-with-word2vec-lstm).

In [14]:
K.clear_session()

batch_size= 16
epochs=1


num_units = Categorical([32, 64, 128], name='num_units')
learning_rate = Categorical([0.1, 1e-2], name='learning_rate')

search_space = [
            learning_rate
            ]

# Set initial points for the search of optimal parameter
default_params = [
                  0.1
                  ]

def define_model(learning_rate, layer_type):
    """
    Define and compile the model with the given hyperparameters.

    Args:
        learning_rate (float): Learning rate for the optimizer.
        layer_type (class): Type of recurrent layer (LSTM or GRU).

    Returns:
        model (Model): Compiled Keras model.
    """

    print(layer_type)
    # Set the input layer
    inputs = Input(shape=(MAX_SEQ_LEN,), name="input glove embeddings")

    # Set an embedding layer for the input
    embeddings = Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_SEQ_LEN, weights=[w2v_embedding_vectors], trainable=False, name="embeddings")(inputs)

    # Pass embeddings through their own LSTM or GRU layers
    layers = layer_type(64, dropout=0.2, recurrent_dropout=0.2, return_sequences=True, name="layer_1")(embeddings)
    layers = layer_type(32, dropout=0.2, recurrent_dropout=0.2, return_sequences=False, name="layer_2")(layers)

    # Output layer: softmax over the sentiment classes
    outputs = Dense(NUM_OUTPUT_CLASSES, activation='softmax', name="output")(layers)

    # Create the model
    model = Model(inputs=inputs, outputs=outputs)

    # Compile the model (RMSprop's default learning rate would be 0.001)
    rmsprop = RMSprop(learning_rate=learning_rate)
    model.compile(optimizer=rmsprop, loss='mean_squared_error', metrics=['accuracy'])
    model.summary()  # summary() prints the table itself and returns None

    return model

In [15]:
def get_objective_function(layer_type):
    """
    Define the objective function for Bayesian Optimization.

    Args:
        layer_type (class): Type of recurrent layer (LSTM or GRU).

    Returns:
        objective_function (function): Objective function for Bayesian Optimization.
    """
    
    @use_named_args(dimensions=search_space)
    def objective_function(learning_rate):
        model = define_model(learning_rate=learning_rate,
                            layer_type=layer_type
                            )

        print("Optimization, starting training...")
        early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
        history = model.fit(x_train_encoded,
                            y_train_encoded,
                            validation_data=(x_val_encoded, y_val_encoded),
                            epochs=epochs,
                            batch_size=batch_size,
                            callbacks=[early_stopping],
                            verbose=2
                            )
        print("Training Complete")
        # Read the validation accuracy and loss of the last epoch
        accuracy = history.history['val_accuracy'][-1]
        loss = history.history['val_loss'][-1]

        # Print the classification accuracy
        print(f"Accuracy: {accuracy:.2%}")
        print(f"Loss: {loss:.2}\n")

        del model

        print('Model deleted. Clearing session...')

        # Session clearing
        K.clear_session()
        tf.compat.v1.reset_default_graph()

        # The optimizer minimizes its objective, so return the negative accuracy
        return -accuracy

    return objective_function
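
The objective returns `-accuracy` because the optimizer searches for the smallest objective value, so the highest accuracy must map to the lowest score. The sketch below isolates only that sign convention. It swaps `gbrt_minimize` for a plain `min` over the two candidate learning rates, and `fake_validation_accuracy` with its numbers is a hypothetical stand-in for training a model.

```python
def fake_validation_accuracy(learning_rate):
    """Hypothetical stand-in for model training; the accuracies are invented."""
    return {0.1: 0.62, 0.01: 0.81}[learning_rate]


def objective(learning_rate):
    # A minimizer looks for the smallest value, so negate the accuracy.
    return -fake_validation_accuracy(learning_rate)


# Same two candidates as the Categorical learning_rate dimension above.
candidates = [0.1, 0.01]

# Exhaustive search as a stand-in for the surrogate-based optimizer.
best_lr = min(candidates, key=objective)
best_accuracy = -objective(best_lr)  # undo the negation for reporting

print(f"best learning rate: {best_lr}, accuracy: {best_accuracy:.2%}")
```

With only two categorical candidates, an exhaustive comparison like this would already cover the search space; a surrogate model pays off mainly when the space is larger or continuous.
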

In [16]:
def perform_bayesian_opt(objective_function):
    """
    Perform Bayesian Optimization to find the best hyperparameters.

    Args:
        objective_function: The objective function to optimize.

    Returns:
        dict: A dictionary containing the best hyperparameters found.

    """
    gbrt_result = gbrt_minimize(func=objective_function,
                                dimensions=search_space,
                                n_calls=12,
                                n_jobs=-1,
                                x0=default_params)

    gbrt_best_params = {param.name: value for param, value in zip(gbrt_result.space, gbrt_result.x)}
    print("Best Hyperparameters:", gbrt_best_params)
    print()

    return gbrt_best_params


def fit_model(model, model_name, x_train, y_train, x_val, y_val, res_path):
    """
    Fit a model to the training data.

    Args:
        model (object): The model to fit.
        model_name (str): The name of the model.
        x_train (array): The training data.
        y_train (array): The training labels.
        x_val (array): The validation data.
        y_val (array): The validation labels.
        res_path (str): The path to save the model and results.

    Returns:
        object: The fitted model.
        object: The training history.

    """

    model_file = f"model_{model_name}.h5"
    history_file = f"history_{model_name}.pkl"
    model_path = os.path.join(res_path, model_file)
    history_path = os.path.join(res_path, history_file) 

    # Check if the model exists
    if os.path.exists(model_path):
        # If the model exists, then load it
        model = load_model(model_path)
        print(f"Model {model_name} loaded successfully...")
        # Load the training history
        with open(history_path, 'rb') as file:
            history_saved = pickle.load(file)
    else:
        print("Fitting best model...")
        early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
        history = model.fit(x_train,
                            y_train,
                            validation_data=(x_val, y_val),
                            epochs=2,
                            batch_size=batch_size,
                            callbacks=[early_stopping],
                            verbose=2)

        model.save(model_path)
        print(f"Model {model_name} saved at {model_path}")

        # Save the training history
        history_saved = history.history
        with open(history_path, 'wb') as file:
            pickle.dump(history_saved, file)

    return model, history_saved


def init_results(model, model_name, x_train, y_train, x_val, y_val, x_test, y_test, res_path):
    """
    Initialize and save the results for the best model.

    Args:
        model (object): The best model.
        model_name (str): The name of the model.
        x_train (array): The training data.
        y_train (array): The training labels.
        x_val (array): The validation data.
        y_val (array): The validation labels.
        x_test (array): The test data.
        y_test (array): The test labels.
        res_path (str): The path to save the results.

    Returns:
        None

    """

    print("Saving best results...")

    y_pred = model.predict(x_train)
    score = model.evaluate(x_train, y_train)
    get_results(y_pred, y_train, x_train, score, model, f"Train-{model_name}", model.optimizer.get_config(), res_path)

    y_pred = model.predict(x_val)
    score = model.evaluate(x_val, y_val)
    get_results(y_pred, y_val, x_val, score, model, f"Val-{model_name}", model.optimizer.get_config(), res_path)

    y_pred = model.predict(x_test)
    score = model.evaluate(x_test, y_test)
    get_results(y_pred, y_test, x_test, score, model, f"Test-{model_name}", model.optimizer.get_config(), res_path, only_metrics=False)


def setup_dl(model_name):
    """
    Setup the Deep Learning settings for the given model name.

    Args:
        model_name (str): The name of the model.

    Returns:
        None

    """
    '''
    if model_name == "LSTM":
        objective_function = get_objective_function(LSTM)
    else:
        objective_function = get_objective_function(GRU)

    best_params = perform_bayesian_opt(objective_function)
    '''
    
    subfolder_path = f"{model_name}_results"
    res_path = os.path.join(results_folder_path, subfolder_path)

    # Fit best model
    model = define_model(0.01,  #best_params['learning_rate'], 
                         LSTM if model_name == "LSTM" else GRU)

    best_model, history = fit_model(model, model_name, x_train_encoded, y_train_encoded, x_val_encoded, y_val_encoded, res_path)
    plot_development(history, model_name, res_path)

    # Get results
    init_results(best_model, model_name, x_train_encoded, y_train_encoded, x_val_encoded, y_val_encoded, x_test_encoded, y_test_encoded, res_path)

In [17]:
setup_dl("LSTM")

<class 'keras.layers.rnn.lstm.LSTM'>


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input glove embeddings (Inp  [(None, 449)]            0         
 utLayer)                                                        
                                                                 
 embeddings (Embedding)      (None, 449, 100)          1139500   
                                                                 
 layer_1 (LSTM)              (None, 449, 64)           42240     
                                                                 
 layer_2 (LSTM)              (None, 32)                12416     
                                                                 
 output (Dense)              (None, 3)                 99        
                                                                 
Total params: 1,194,255
Trainable params: 54,755
Non-trainable params: 1,139,500
______________________________________________

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


(11529, 3)
(11529,)
(11529,)
Accuracy: 86.74%
Loss: 0.15


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


(11899, 3)
(11899,)
(11899,)
Accuracy: 84.04%
Loss: 0.15


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
setup_dl("GRU")

<class 'keras.layers.rnn.gru.GRU'>


2023-06-26 01:53:41.093017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38306 MB memory:  -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:17:00.0, compute capability: 8.0


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input glove embeddings (Inp  [(None, 449)]            0         
 utLayer)                                                        
                                                                 
 embeddings (Embedding)      (None, 449, 100)          1139500   
                                                                 
 layer_1 (GRU)               (None, 449, 64)           31872     
                                                                 
 layer_2 (GRU)               (None, 32)                9408      
                                                                 
 output (Dense)              (None, 3)                 99        
                                                                 
Total params: 1,180,879
Trainable params: 41,379
Non-trainable params: 1,139,500
______________________________________________

2023-06-26 01:53:47.328547: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:630] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.


2563/2563 - 4372s - loss: 0.2119 - accuracy: 0.4844 - val_loss: 0.1164 - val_accuracy: 0.8674 - 4372s/epoch - 2s/step
Epoch 2/2
2563/2563 - 5763s - loss: 0.2052 - accuracy: 0.4971 - val_loss: 0.1088 - val_accuracy: 0.8447 - 5763s/epoch - 2s/step
Model GRU saved at /home2/s3985113/Thesis_Jupyter_Final/src/results/GRU_results/model_GRU.h5
Saving best results...
(41000, 3)
(41000,)
(41000,)
Accuracy: 49.69%
Loss: 0.22
(11529, 3)
(11529,)
(11529,)
Accuracy: 84.49%
Loss: 0.11
(11899, 3)
(11899,)
(11899,)
Accuracy: 82.31%
Loss: 0.12
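
The parameter counts in the two model summaries can be checked by hand. This sketch uses the standard per-layer formulas, assuming the Keras GRU default of two bias vectors per gate (`reset_after=True`), which is consistent with the 31,872 reported for `layer_1 (GRU)`. The helper function names are made up for illustration.

```python
VOCAB_SIZE, EMBEDDING_DIM, NUM_OUTPUT_CLASSES = 11395, 100, 3


def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and one bias.
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    # 3 gates, each with an input kernel, a recurrent kernel and two biases
    # (assumes reset_after=True).
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    return input_dim * units + units


frozen = embedding_params(VOCAB_SIZE, EMBEDDING_DIM)  # trainable=False
head = dense_params(32, NUM_OUTPUT_CLASSES)

lstm_trainable = lstm_params(EMBEDDING_DIM, 64) + lstm_params(64, 32) + head
gru_trainable = gru_params(EMBEDDING_DIM, 64) + gru_params(64, 32) + head

print("LSTM:", lstm_trainable, "trainable,", lstm_trainable + frozen, "total")
print("GRU: ", gru_trainable, "trainable,", gru_trainable + frozen, "total")
```

The results match the summaries: 54,755 trainable and 1,194,255 total parameters for the LSTM model, 41,379 trainable and 1,180,879 total for the GRU model.
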
