# Word Embedding and Positional Embedding using Recurrent Neural Network (RNN) and Transformers Architecture

## Summary

A sequence-model approach was applied to the IMDB sentiment example using a Recurrent Neural Network (RNN). An embedding layer trained from scratch was compared against a pretrained word embedding layer, using different training sample sizes and text lengths of 150 and 300 words. Test accuracies for each configuration are listed in Table 1 below:

**Interpretation of Table 1:** The embedding layer with masking outperformed the pretrained word embedding for every training sample size (100, 200, 300, 400), whether the reviews were cut off after 150 or 300 words. Usually the opposite is expected, i.e., the pretrained word embedding should perform better than the embedding layer with masking.

Possible reasons are as follows:

1. **Small Training Data:** With only 100 to 400 training samples, the layers on top of the frozen pretrained embeddings might not see enough data to adapt them effectively to the specific task of sentiment analysis on movie reviews.
2. **Domain Specificity:** Pretrained word embeddings like GloVe or Word2Vec are trained on massive general-purpose corpora that are not specific to movie reviews. An embedding layer trained directly on the IMDB data might capture the sentiment nuances and vocabulary of movie reviews better.
3. **Masking Impact:** Padding sequences to a fixed length (150 or 300 words) might introduce noise, especially if many reviews are shorter. Masking lets the model skip these padding tokens and focus on the actual content. Note that the pretrained embedding layer in this notebook also sets `mask_zero=True`, so this point mainly explains the gap to the unmasked embedding layer.
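
The masking point above can be illustrated without Keras. The sketch below is only an analogy, not the notebook's LSTM: it averages hypothetical per-token scores over a padded sequence with and without a mask, assuming (as with `mask_zero=True`) that token id 0 marks padding. The helper names `pad` and `mean_score` and all token ids are made up for illustration.

```python
def pad(ids, length, pad_id=0):
    """Truncate or right-pad a list of token ids to a fixed length."""
    return (ids + [pad_id] * length)[:length]

def mean_score(ids, scores, use_mask, pad_id=0):
    """Average per-token scores; with use_mask=True, padding positions are skipped."""
    kept = [s for t, s in zip(ids, scores) if not (use_mask and t == pad_id)]
    return sum(kept) / len(kept)

# A short "review" of 3 tokens padded to length 10 (token ids are made up).
ids = pad([12, 7, 45], length=10)
# Hypothetical per-position scores: real tokens carry signal (1.0),
# padding positions carry none (0.0).
scores = [1.0 if t != 0 else 0.0 for t in ids]

unmasked = mean_score(ids, scores, use_mask=False)  # padding dilutes the signal
masked = mean_score(ids, scores, use_mask=True)     # only real tokens count
print(unmasked, masked)
```

The shorter the review relative to the fixed text length, the more the unmasked average is dominated by padding, which is the intuition behind the third point.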


**Table 1: Comparison between Word Embedding and Positional Embedding using RNN**

<table>
<thead>
  <tr>
    <th>S.no</th>
    <th>Model Name</th>
    <th>Test Accuracy<br>100 Training samples,<br>150 Text length</th>
    <th>Test Accuracy<br>200 Training samples,<br>150 Text length</th>
    <th>Test Accuracy<br>400 Training samples,<br>150 Text length</th>
    <th>Test Accuracy<br>100 Training samples,<br>300 Text length</th>
    <th>Test Accuracy<br>300 Training samples,<br>300 Text length</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>1</td>
    <td>Basic sequential</td>
    <td>55.4</td>
    <td>49.5</td>
    <td>57.7</td>
    <td>52.0</td>
    <td>60.8</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Embedded layer</td>
    <td>50.7</td>
    <td>50.7</td>
    <td>49.5</td>
    <td>60.6</td>
    <td>56.8</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Embedded layer with masking</td>
    <td>61.1</td>
    <td>59.9</td>
    <td>61.0</td>
    <td>62.9</td>
    <td>61.8</td>
  </tr>
  <tr>
    <td>4</td>
    <td>Pretrained word embedding</td>
    <td>54.4</td>
    <td>55.3</td>
    <td>50.0</td>
    <td>53.3</td>
    <td>53.1</td>
  </tr>
</tbody>
</table>
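
To make the comparison in Table 1 easier to read, the following sketch averages each model's test accuracy over the five configurations. The numbers are copied from the table. The standard-error line is an added assumption-based estimate: it covers only sampling noise of the 25,000-review test set near 50% accuracy, not the run-to-run variation that comes from training on so few samples.

```python
from statistics import mean
import math

# Test accuracies (%) copied from Table 1, in column order:
# (100, 150), (200, 150), (400, 150), (100, 300), (300, 300)
table1 = {
    "Basic sequential": [55.4, 49.5, 57.7, 52.0, 60.8],
    "Embedded layer": [50.7, 50.7, 49.5, 60.6, 56.8],
    "Embedded layer with masking": [61.1, 59.9, 61.0, 62.9, 61.8],
    "Pretrained word embedding": [54.4, 55.3, 50.0, 53.3, 53.1],
}

# (mean, min, max) per model across the five configurations.
summary = {
    name: (round(mean(accs), 2), min(accs), max(accs))
    for name, accs in table1.items()
}
for name, (avg, lo, hi) in summary.items():
    print(f"{name:<30} mean={avg:5.2f}  min={lo:4.1f}  max={hi:4.1f}")

best = max(summary, key=lambda name: summary[name][0])

# Standard error of an accuracy estimate near 50% on 25,000 test reviews,
# in percentage points. This covers test-set sampling noise only.
test_size = 25_000
std_err_pp = 100 * math.sqrt(0.5 * 0.5 / test_size)
print(f"best on average: {best}; test-set standard error ~ {std_err_pp:.2f} pp")
```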



In addition, the same IMDB dataset was run through a Transformer architecture for both word embedding and positional embedding. Because the configuration with 100 training samples and a text length of 300 performed well with the RNN, the Transformer architecture was implemented on the same sample size. The test accuracy is higher for the Transformer than for the RNN; however, the positional embedding model still scores lower than the base Transformer encoder.

Assuming this is due to the small training sample size (100), the size was increased to 1000 training samples, but the trend between the positional embedding model and the base Transformer encoder stayed the same. Details of the observations are given below:
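
As background on what positional embedding adds, here is a minimal sketch of the common "token vector plus position vector" idea. It is not the notebook's Transformer code (which is not shown in this section): the tables are random stand-ins for learned weights, and names such as `token_table`, `position_table`, and `embed` are hypothetical. It only demonstrates that without position vectors a reordered sequence yields the same set of vectors, and it does not explain why the positional variant scored lower in the results reported here.

```python
import random

rng = random.Random(0)
embed_dim = 4
vocab_size, max_len = 10, 6

def random_table(rows, cols):
    """Stand-in for a learned embedding table (random values, not trained)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(vocab_size, embed_dim)   # one vector per token id
position_table = random_table(max_len, embed_dim)   # one vector per position

def embed(ids, use_positions):
    """Look up token vectors and optionally add the vector for each position."""
    out = []
    for pos, tok in enumerate(ids):
        vec = list(token_table[tok])
        if use_positions:
            vec = [v + p for v, p in zip(vec, position_table[pos])]
        out.append(vec)
    return out

a, b = [3, 5, 7], [7, 5, 3]  # same tokens, reversed order

# Without positions, both sequences contain exactly the same set of vectors.
same_without = sorted(embed(a, False)) == sorted(embed(b, False))
# With positions, each vector also depends on where the token appears.
same_with = sorted(embed(a, True)) == sorted(embed(b, True))
print(same_without, same_with)
```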


**Table 2: Comparison between Word Embedding and Positional Embedding using RNN and Transformers**

<table>
<thead>
  <tr>
    <th>S.no</th>
    <th>Model Name</th>
    <th>Test Accuracy<br>100 Training samples,<br>300 Text length</th>
    <th>Test Accuracy<br>1000 Training samples,<br>300 Text length</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>1</td>
    <td>Basic sequential</td>
    <td>52.0</td>
    <td>74.0</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Embedded layer</td>
    <td>60.6</td>
    <td>71.8</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Embedded layer with masking</td>
    <td>62.9</td>
    <td>77.6</td>
  </tr>
  <tr>
    <td>4</td>
    <td>Pretrained word embedding</td>
    <td>53.3</td>
    <td>66.5</td>
  </tr>
  <tr>
    <td>5</td>
    <td>Transformer Encoder based</td>
    <td>68.2</td>
    <td>80.5</td>
  </tr>
  <tr>
    <td>6</td>
    <td>Positional Embedding</td>
    <td>61.9</td>
    <td>76.9</td>
  </tr>
</tbody>
</table>
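
The trend in Table 2 can be quantified with a few lines of arithmetic. The sketch below uses only the numbers from the table and computes, for each model, the gain from 100 to 1000 training samples, plus the gap between the base Transformer encoder and the positional embedding model at both sizes.

```python
# Test accuracies (%) copied from Table 2: (100 samples, 1000 samples), text length 300.
table2 = {
    "Basic sequential": (52.0, 74.0),
    "Embedded layer": (60.6, 71.8),
    "Embedded layer with masking": (62.9, 77.6),
    "Pretrained word embedding": (53.3, 66.5),
    "Transformer Encoder based": (68.2, 80.5),
    "Positional Embedding": (61.9, 76.9),
}

# Gain in percentage points when going from 100 to 1000 training samples.
gains = {name: round(big - small, 1) for name, (small, big) in table2.items()}
for name, gain in gains.items():
    print(f"{name:<30} +{gain:.1f} pp")

# Gap between the base Transformer encoder and the positional variant.
base = table2["Transformer Encoder based"]
pos = table2["Positional Embedding"]
gap_100 = round(base[0] - pos[0], 1)
gap_1000 = round(base[1] - pos[1], 1)
print(f"gap at 100 samples: {gap_100} pp, at 1000 samples: {gap_1000} pp")
```

The gap shrinks from 6.3 to 3.6 percentage points with more data but does not close. Since each cell is a single run, these differences should be read as indicative rather than conclusive.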


Note: This file contains the run with 100 training samples and a text length of 150. The runs for the other training sample sizes are available in [this GitHub folder.](https://github.com/cpendyal/ChaitanyaPendyalaRepository/tree/main/Transformers/IMDB%20Dataset)

### Processing words as a sequence: The sequence model approach

#### A first practical example

**Downloading the data**

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  5081k      0  0:00:16  0:00:16 --:--:-- 7485k


**Preparing the data**

In [None]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
excess_dir = base_dir / "excess"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    os.makedirs(excess_dir / category)
    # Hold out 5,000 shuffled files per class for validation.
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = 5000
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

    # Move 50 of the remaining files per class into "excess". Despite its
    # name, this folder is the small training set (2 x 50 = 100 samples).
    files = os.listdir(train_dir / category)
    random.Random(1338).shuffle(files)
    num_ex_samples = 50
    ex_files = files[-num_ex_samples:]
    for fname in ex_files:
        shutil.move(train_dir / category / fname,
                    excess_dir / category / fname)

# Train on the 100-sample subset rather than on the full train/ directory.
train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/excess", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Found 100 files belonging to 2 classes.
Found 10000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
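
The cell above hard-codes 50 files per class. Since the summary compares several training sizes, a parameterized helper may be convenient. The following is a sketch under assumptions, not the code used for the reported runs: `move_random_subset` is a hypothetical helper, it sorts the file list before shuffling so that the seed fully determines the selection, and the demo runs on dummy files in a temporary directory instead of the real IMDB data.

```python
import os
import pathlib
import random
import shutil
import tempfile

def move_random_subset(src_dir, dst_dir, n, seed):
    """Move n randomly chosen files from src_dir to dst_dir and return their names."""
    src_dir, dst_dir = pathlib.Path(src_dir), pathlib.Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(os.listdir(src_dir))  # sort first so the seed fixes the choice
    random.Random(seed).shuffle(files)
    chosen = files[-n:] if n > 0 else []
    for fname in chosen:
        shutil.move(str(src_dir / fname), str(dst_dir / fname))
    return chosen

# Demo on a throwaway directory with dummy files (not the real IMDB data).
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = pathlib.Path(tmp_name)
    for category in ("neg", "pos"):
        (tmp / "train" / category).mkdir(parents=True)
        for i in range(20):
            (tmp / "train" / category / f"{i}.txt").write_text("dummy review")
    samples_per_class = 5  # e.g. 50 per class gives 100 training samples
    for category in ("neg", "pos"):
        move_random_subset(tmp / "train" / category,
                           tmp / "subset" / category,
                           samples_per_class, seed=1338)
    moved = {c: len(os.listdir(tmp / "subset" / c)) for c in ("neg", "pos")}
    left = {c: len(os.listdir(tmp / "train" / c)) for c in ("neg", "pos")}
print(moved, left)
```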


In [None]:

!ls -al /content/aclImdb/excess/neg/ | wc
!ls -al /content/aclImdb/excess/pos/ | wc
!ls -al /content/aclImdb/val/pos/ | wc
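# Cleanup only: run these after all experiments are finished, because the
# datasets above read their files lazily from /content/aclImdb/.
# !rm -rf /content/aclImdb1/
# !rm -rf /content/aclImdb/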

     53     470    2702
     53     470    2722
  10003   90020  545011


**Preparing integer sequence datasets**

In [None]:
from tensorflow.keras import layers

max_length = 150
max_tokens = 10000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
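
To show what the integer sequences look like, here is a rough pure-Python approximation of `TextVectorization` with `output_mode="int"`. It is a sketch, not the layer's implementation: it assumes the default standardization (lowercasing and stripping punctuation), reserves index 0 for padding and index 1 for the out-of-vocabulary token `[UNK]`, and may order equally frequent words differently from Keras. The function names are made up for illustration.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (roughly the layer's default)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is the OOV token, then words by frequency."""
    counts = Counter(w for t in texts for w in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + words

def vectorize(text, vocab, max_length):
    """Map words to indices, truncate to max_length, and right-pad with 0."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(w, 1) for w in standardize(text).split()][:max_length]
    return ids + [0] * (max_length - len(ids))

corpus = ["A great, great movie!", "A terrible movie."]
vocab = build_vocab(corpus, max_tokens=10)
short = vectorize("Great movie", vocab, max_length=5)       # padded with zeros
unseen = vectorize("great acting", vocab, max_length=5)     # "acting" maps to 1
truncated = vectorize("a great great terrible movie a movie", vocab, max_length=5)
print(vocab, short, unseen, truncated)
```

The trailing zeros in the short example are exactly the padding positions that `mask_zero=True` tells downstream layers to skip.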

**A sequence model built on one-hot encoded vector sequences**

In [None]:
import tensorflow as tf
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = tf.one_hot(inputs, depth=max_tokens)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 tf.one_hot (TFOpLambda)     (None, None, 10000)       0         
                                                                 
 bidirectional (Bidirection  (None, 64)                2568448   
 al)                                                             
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 2568513 (9.80 MB)
Trainable params: 2568513 (9.80 MB)
Non-trainable params: 0 (0.00 Byte)
_______________________
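
The parameter counts in the model summaries of this notebook can be checked by hand. The sketch below uses the standard LSTM weight count (four gates, each with input weights, recurrent weights, and a bias) and doubles it for the bidirectional wrapper. The function names are illustrative. The input dimensions are those used in this notebook: 10,000 for one-hot vectors, 256 for the learned embedding, and 100 for GloVe.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM: 4 gates, each with input, recurrent, and bias weights."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    """A Bidirectional wrapper holds two independent LSTMs (forward and backward)."""
    return 2 * lstm_params(input_dim, units)

one_hot = bidirectional_lstm_params(input_dim=10_000, units=32)  # one-hot input
learned = bidirectional_lstm_params(input_dim=256, units=32)     # 256-d Embedding
glove = bidirectional_lstm_params(input_dim=100, units=32)       # 100-d GloVe
embedding_table = 10_000 * 256                                   # Embedding(10000, 256)
print(one_hot, learned, glove, embedding_table)
```

With one-hot input, almost all of the weights sit inside the LSTM itself. With an `Embedding` layer, the bulk moves into the lookup table and the recurrent layer becomes much smaller.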

**Training a first basic sequence model**

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.554


#### Understanding word embeddings

#### Learning word embeddings with the Embedding layer

**Instantiating an `Embedding` layer**

In [None]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

**Model that uses an `Embedding` layer trained from scratch**

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_2 (Embedding)     (None, None, 256)         2560000   
                                                                 
 bidirectional_2 (Bidirecti  (None, 64)                73984     
 onal)                                                           
                                                                 
 dropout_2 (Dropout)         (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2634049 (10.05 MB)
Trainable params: 2634049 (10.05 MB)
Non-trainable params: 0 (0.00 Byte)
___________________

#### Understanding padding and masking

**Using an `Embedding` layer with masking enabled**

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(
    input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 256)         2560000   
                                                                 
 bidirectional_3 (Bidirecti  (None, 64)                73984     
 onal)                                                           
                                                                 
 dropout_3 (Dropout)         (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2634049 (10.05 MB)
Trainable params: 2634049 (10.05 MB)
Non-trainable params: 0 (0.00 Byte)
___________________

#### Using pretrained word embeddings

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2024-05-03 03:35:02--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-05-03 03:35:03--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-05-03 03:35:04--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



**Parsing the GloVe word-embeddings file**

In [None]:
import numpy as np
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")

Found 400000 word vectors.


**Preparing the GloVe word-embeddings matrix**

In [None]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))

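# Rows for words that are missing from GloVe (and the padding row 0) stay all-zero.
embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        # Nested under the bounds check so a stale vector is never reused.
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector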
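
When the pretrained model underperforms, one useful diagnostic is how many vocabulary entries actually received a GloVe vector. The sketch below shows that check on toy data. `build_embedding_matrix`, `toy_vocabulary`, and `toy_glove` are hypothetical stand-ins for the notebook's `text_vectorization.get_vocabulary()` and `embeddings_index`, and plain lists replace the NumPy matrix to keep the example dependency-free.

```python
def build_embedding_matrix(vocabulary, embeddings_index, max_tokens, dim):
    """Return (matrix, hits): one row per vocabulary index, zeros where no vector exists."""
    matrix = [[0.0] * dim for _ in range(max_tokens)]
    hits = 0
    for i, word in enumerate(vocabulary[:max_tokens]):
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = list(vector)
            hits += 1
    return matrix, hits

# Toy stand-ins for the real vocabulary and the parsed GloVe file.
toy_vocabulary = ["", "[UNK]", "movie", "great", "zzzunseen"]
toy_glove = {"movie": [0.1, 0.2], "great": [0.3, 0.4], "film": [0.5, 0.6]}

matrix, hits = build_embedding_matrix(toy_vocabulary, toy_glove, max_tokens=5, dim=2)
coverage = hits / len(toy_vocabulary)
print(f"{hits} of {len(toy_vocabulary)} vocabulary entries found ({coverage:.0%})")
```

Applied to the real vocabulary, a low coverage value would suggest that many review-specific tokens are represented by all-zero rows in the frozen layer.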

In [None]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)

**Model that uses a pretrained Embedding layer**

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("glove_embeddings_sequence_model.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_4 (Embedding)     (None, None, 100)         1000000   
                                                                 
 bidirectional_4 (Bidirecti  (None, 64)                34048     
 onal)                                                           
                                                                 
 dropout_4 (Dropout)         (None, 64)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1034113 (3.94 MB)
Trainable params: 34113 (133.25 KB)
Non-trainable params: 1000000 (3.81 MB)
_________________