# **CPC353 Lab 6**
* In this session, you will need the `gensim` library. Install it with `pip install gensim`.

## **Long Short-Term Memory (LSTM)**

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) designed to model and predict sequential data. It retains important information over long spans of a sequence and mitigates the vanishing gradient problem that limits standard RNNs. Put simply, an LSTM processes data points one after another while maintaining a "memory" of previous inputs, which makes it well suited to text, audio, and time-series data.

Text is sequential: words appear in order, and earlier words can affect the meaning of later ones. Unlike a feedforward neural network, which treats text as a fixed set of features and ignores word order, an LSTM processes words one by one and maintains an internal memory that captures context across the sequence. This allows it to model word order and handle variable-length texts naturally.
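
To make the "internal memory" idea concrete, the following is a minimal NumPy sketch of the standard LSTM update for one time step, applied word by word to a sequence. It is an illustration only, not the Keras implementation: the weights are random, the helper names (`sigmoid`, `lstm_step`) are made up for this example, and the order in which the four gates are stacked inside `W`, `U` and `b` is an arbitrary choice for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (hypothetical helper for illustration).

    W: (4 * units, n_features), U: (4 * units, units), b: (4 * units,)
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:units])                  # input gate: how much new information to write
    f = sigmoid(z[units:2 * units])          # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])      # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])      # output gate: how much memory to expose
    c_t = f * c_prev + i * g                 # updated cell state (the "memory")
    h_t = o * np.tanh(c_t)                   # updated hidden state (the output)
    return h_t, c_t

rng = np.random.default_rng(0)
n_features, units, steps = 50, 2, 5          # assumed sizes: 50-d word vectors, 2 LSTM units

# Random weights stand in for trained parameters
W = rng.normal(scale=0.1, size=(4 * units, n_features))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

# Random vectors stand in for the word embeddings of a 5-token sentence
sequence_vectors = rng.normal(size=(steps, n_features))

h = np.zeros(units)
c = np.zeros(units)
for x_t in sequence_vectors:                 # process the tokens one after another
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The same `h` and `c` are carried from one token to the next, which is how earlier words can influence the representation of later ones.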

## **Step 1: Library Import & GloVe**

GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm that generates dense vector representations of words, also known as word embeddings. Its primary objective is to capture semantic relationships between words by analyzing their co-occurrence patterns in a large text corpus. The pre-trained GloVe vectors used in this lab were trained on a corpus of about 6 billion tokens (Wikipedia 2014 and Gigaword 5) and cover a vocabulary of 400,000 entries, including punctuation such as commas, braces and semicolons. They can be downloaded and used immediately in many natural language processing (NLP) applications. The pre-trained embeddings are available in several dimensions (50d, 100d, 200d or 300d), so users can select the size that best fits their needs in terms of computational resources and task specificity.
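
One common way to check whether embeddings capture semantic relationships is cosine similarity: vectors of related words should point in similar directions. The sketch below shows the computation with NumPy. The 4-dimensional vectors are hypothetical values made up for illustration, not real GloVe entries, and `cosine_similarity` is a helper defined here for the example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, -1 = opposite)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, NOT real GloVe values
toy_vectors = {
    "movie":  np.array([0.9, 0.1, 0.3, 0.0]),
    "film":   np.array([0.8, 0.2, 0.4, 0.1]),
    "banana": np.array([-0.2, 0.9, -0.5, 0.3]),
}

sim_close = cosine_similarity(toy_vectors["movie"], toy_vectors["film"])
sim_far = cosine_similarity(toy_vectors["movie"], toy_vectors["banana"])

print(f"movie vs film:   {sim_close:.3f}")
print(f"movie vs banana: {sim_far:.3f}")
```

With the real model loaded in the next cell, `model_glove.similarity("movie", "film")` computes the same quantity on the actual 50-dimensional vectors.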

In [1]:
from nltk import TweetTokenizer
from sklearn.metrics import accuracy_score
from keras.models import Model
from keras.layers import Input, Dense, LSTM, Flatten
from keras.preprocessing import sequence
from keras.utils import to_categorical
import numpy as np
import gensim.downloader as api

# Load a pretrained GloVe word embedding model (50-dimensional vectors) using gensim
model_glove = api.load("glove-wiki-gigaword-50")



In [2]:
# Example
# Single quotes inside the braces keep these f-strings valid on Python versions before 3.12
print(f"\"movie\" - Shape: {model_glove['movie'].shape}, Value:\n{model_glove['movie']}\n")
print(f"\"hello\" - Shape: {model_glove['hello'].shape}, Value:\n{model_glove['hello']}\n")

"movie" - Shape: (50,), Value:
[ 0.30824   0.17223  -0.23339   0.023105  0.28522   0.23076  -0.41048
 -1.0035   -0.2072    1.4327   -0.80684   0.68954  -0.43648   1.1069
  1.6107   -0.31966   0.47744   0.79395  -0.84374   0.064509  0.90251
  0.78609   0.29699   0.76057   0.433    -1.5032   -1.6423    0.30256
  0.30771  -0.87057   2.4782   -0.025852  0.5013   -0.38593  -0.15633
  0.45522   0.04901  -0.42599  -0.86402  -1.3076   -0.29576   1.209
 -0.3127   -0.72462  -0.80801   0.082667  0.26738  -0.98177  -0.32147
  0.99823 ]

"hello" - Shape: (50,), Value:
[-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827 

In [6]:
print(model_glove)

KeyedVectors<vector_size=50, 400000 keys>


## **Step 2: Tokenization & Sequence Padding**

In this step, raw text documents are transformed into fixed-length numerical sequences that can be fed into a neural network such as an LSTM. First, each document is lowercased and tokenized. Tokenization breaks a sentence into smaller units (tokens); here it is done with NLTK's TweetTokenizer, which keeps emoticons such as :) as single tokens. Next, each token is mapped to a word embedding (a 50-dimensional vector) using the pretrained GloVe model, producing a sequence of vectors for each document. Since different documents contain different numbers of words, these sequences naturally have variable lengths.

However, neural networks require inputs of uniform size, so sequence padding is used to standardize all documents to the same length. A maximum sentence length (e.g. sent_length = 15) is chosen: shorter documents are padded with zero vectors, and longer documents are truncated to 15 tokens. By default, pad_sequences inserts the zeros at the start of a short sequence (pre-padding) and discards the earliest elements of a long sequence, keeping only its last 15 tokens (pre-truncating). After padding, every document has the same shape (sent_length, embedding_dim), making it suitable for batch training in an LSTM.
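
The default behaviour described above (pre-padding and pre-truncating) can be sketched in plain NumPy as follows. This is an illustrative re-implementation under the assumption that every document is a 2-D array of shape (n_tokens, embedding_dim); `pad_pre` is a hypothetical helper, not part of Keras, and it keeps the values as float32.

```python
import numpy as np

def pad_pre(seq, maxlen):
    """Pre-pad with zero vectors or pre-truncate so the result has exactly maxlen rows."""
    seq = np.asarray(seq, dtype="float32")
    n_tokens, dim = seq.shape
    if n_tokens >= maxlen:
        return seq[-maxlen:]                      # keep only the last maxlen tokens
    padded = np.zeros((maxlen, dim), dtype="float32")
    padded[maxlen - n_tokens:] = seq              # zeros first, real tokens at the end
    return padded

sent_length, embedding_dim = 15, 50

# Stand-in documents: 3 tokens (shorter than 15) and 20 tokens (longer than 15)
short_doc = np.ones((3, embedding_dim))
long_doc = np.arange(20 * embedding_dim).reshape(20, embedding_dim)

short_padded = pad_pre(short_doc, sent_length)
long_truncated = pad_pre(long_doc, sent_length)

print(short_padded.shape, long_truncated.shape)   # both (15, 50)
print(short_padded[:, 0])                         # 12 zeros followed by 3 ones
```

Placing the zeros at the start means the real tokens are the last inputs the LSTM sees before it produces its final state.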

In [3]:
docs = ['I love the movies!',
        'The actors are great.',
        'Beautiful actress :)',
        "i do not like the music",
        'nice story',
        'actors are great, but overall is not nice.',
        'love it!',
        'great...',
        'enjoy it very much.',
        'Wonderful experience.',
        'really boring',
        ':(',
        'Bad acting',
        "I do not like the actors",
        "Fall asleep throughout the movie!",
        "too much dialogs and not much actions"]

cat = [1, 1, 1, 
       0, 1, 0,
       1, 1, 1, 
       1, 0, 0,
       0, 0, 0, 
       0]

sent_length = 15
n_features = 50
n_output = 2
batch_size = 4

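# Tokenize each document and look up the GloVe vector of every token
tokenizer = TweetTokenizer()
docs_embedding = list()
for d in docs:
    # Lowercase first: the GloVe vocabulary used here is lowercase
    tokens = tokenizer.tokenize(d.lower())
    # Indexing with a list returns an array of shape (n_tokens, 50);
    # a token that is not in the vocabulary raises a KeyError
    embedding = model_glove[tokens]
    docs_embedding.append(embedding)
    print(f"Embedding: {embedding.shape}, Tokens: {tokens}")

# Pad the embedding sequences so that they all have the same sentence length
# Note: dtype = "int32" truncates the float GloVe values to integers (see the output of the next cell);
# dtype = "float32" would keep the original values
X = sequence.pad_sequences(docs_embedding, maxlen = sent_length, dtype = "int32")
X = np.array(X)
print(f"\nX Shape: {X.shape}\n")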

Embedding: (5, 50), Tokens: ['i', 'love', 'the', 'movies', '!']
Embedding: (5, 50), Tokens: ['the', 'actors', 'are', 'great', '.']
Embedding: (3, 50), Tokens: ['beautiful', 'actress', ':)']
Embedding: (6, 50), Tokens: ['i', 'do', 'not', 'like', 'the', 'music']
Embedding: (2, 50), Tokens: ['nice', 'story']
Embedding: (10, 50), Tokens: ['actors', 'are', 'great', ',', 'but', 'overall', 'is', 'not', 'nice', '.']
Embedding: (3, 50), Tokens: ['love', 'it', '!']
Embedding: (2, 50), Tokens: ['great', '...']
Embedding: (5, 50), Tokens: ['enjoy', 'it', 'very', 'much', '.']
Embedding: (3, 50), Tokens: ['wonderful', 'experience', '.']
Embedding: (2, 50), Tokens: ['really', 'boring']
Embedding: (1, 50), Tokens: [':(']
Embedding: (2, 50), Tokens: ['bad', 'acting']
Embedding: (6, 50), Tokens: ['i', 'do', 'not', 'like', 'the', 'actors']
Embedding: (6, 50), Tokens: ['fall', 'asleep', 'throughout', 'the', 'movie', '!']
Embedding: (7, 50), Tokens: ['too', 'much', 'dialogs', 'and', 'not', 'much', 'actions

In [4]:
# Example
print(f"Embedding index 0:\n{np.array(docs_embedding[0]).astype(int)}\n")
print(f"X index 0:\n{X[0]}\n")

Embedding index 0:
[[ 0  0  0  0  0  0  0  0  0  0  0  0 -1  0  1  0  0  0  0  0  0  0  0  0
   1 -2 -1  0  1 -1  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  1  0  0
   1 -1 -1  0  1  0  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1
   0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0 -1  0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  0  0  0  0 -1  0  1  0  0  0  0  1  0  0  0  0  0  1  0  0  1
   0  0 -1  0  0 -1  2  0  0  0  0  0  0  0 -1  0  0  1  0  0  0  0  0  0
   0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0 -1 -1  0  0 -1  2  0 -1  1 -1  0  0 -1  0  0  0  0  0  0  0  0  0 -1
   0  0]]

X index 0:
[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  0  0  0  0  0 

## **Step 3: LSTM Model Training & Prediction**

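In Step 3, the LSTM model is constructed, trained, and used for prediction. The model feeds each padded sequence through an LSTM layer with 2 units, flattens the per-time-step outputs, and passes them to a dense softmax layer with n_output = 2 classes. The predicted labels are then compared to the true labels to evaluate model performance.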
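
The label handling around the model can be sketched without Keras: class labels are one-hot encoded for training, and at prediction time the class with the highest softmax probability is taken as the predicted label. The example below uses NumPy only; the probabilities are hypothetical values chosen for illustration, not outputs of the model in this lab.

```python
import numpy as np

labels = np.array([1, 0, 1, 0])                  # true class labels (toy example)

# One-hot encoding for 2 classes: label 0 -> [1, 0], label 1 -> [0, 1]
one_hot = np.eye(2)[labels]

# Hypothetical softmax outputs: one row per document, one column per class
probs = np.array([[0.30, 0.70],
                  [0.60, 0.40],
                  [0.55, 0.45],
                  [0.80, 0.20]])

predicted = np.argmax(probs, axis=1)             # index of the most probable class
accuracy = float(np.mean(predicted == labels))   # fraction of correct predictions

print(f"One-hot labels:\n{one_hot}")
print(f"Predicted labels: {predicted}")
print(f"Accuracy: {accuracy}")
```

In this toy case the third document is misclassified, so 3 of 4 predictions match the true labels.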

In [5]:
# Define the LSTM layer for computation
inputs = Input(shape = (sent_length, n_features))
lstm = LSTM(2, return_sequences = True, return_state = True)
outputs_seq, state_h, state_c = lstm(inputs)
flat = Flatten()(outputs_seq)
outputs = Dense(n_output, activation = 'softmax')(flat)

# Wrap the LSTM layer into a Keras model, connecting the input and output layers to form a single end-to-end model
model = Model(inputs = inputs, outputs = outputs)
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.summary()

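# Train the keras model
# Note: validation_data reuses the training set, so val_accuracy and val_loss
# measure fit to the training data, not generalization to unseen text
model.fit(X, to_categorical(cat), validation_data = (X, to_categorical(cat)),
          epochs = 20, verbose = 1, batch_size = batch_size)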

# Evaluate the keras model
prob = model.predict(X)
test = np.argmax(prob, axis = 1)

print(f"\nPrediction Probability:\n{prob}")
print(f"\nPredicted Class Label: {test}")
print(f"\nAccuracy Score: {accuracy_score(cat, test)}")

Epoch 1/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 98ms/step - accuracy: 0.4375 - loss: 0.6919 - val_accuracy: 0.4375 - val_loss: 0.6881
Epoch 2/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.4375 - loss: 0.6876 - val_accuracy: 0.4375 - val_loss: 0.6856
Epoch 3/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.4375 - loss: 0.6849 - val_accuracy: 0.4375 - val_loss: 0.6831
Epoch 4/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.4375 - loss: 0.6823 - val_accuracy: 0.4375 - val_loss: 0.6804
Epoch 5/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.5000 - loss: 0.6798 - val_accuracy: 0.5625 - val_loss: 0.6778
Epoch 6/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.5625 - loss: 0.6773 - val_accuracy: 0.5625 - val_loss: 0.6752
Epoch 7/20
4/4 ━━━━━━━━━━━━━━━━━━