Assignment on Text and Sequence

-By Vaishnavi Haripuri
811285838

# Modifying Parameters and Data Preprocessing

1. Cut off reviews after 150 words.
2. Restrict training samples to 100.
3. Validate on 10,000 samples.
4. Consider only the top 10,000 words.

In [None]:
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

# Load the IMDB dataset
max_features = 10000  # Consider only the top 10,000 words
maxlen = 150          # Cutoff after 150 words
training_samples = 100  # Restrict training samples to 100
validation_samples = 10000  # Validate on 10,000 samples

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Limit training data to 100 samples
x_train = x_train[:training_samples]
y_train = y_train[:training_samples]

# Pad the sequences to ensure all reviews have the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

# Use the first 10,000 padded test reviews as the validation set
x_val = x_test[:validation_samples]
y_val = y_test[:validation_samples]
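
Note that `pad_sequences` pads and truncates at the front by default (`padding='pre'`, `truncating='pre'`), so a review longer than `maxlen` keeps its last 150 tokens rather than its first 150. The plain-Python sketch below mimics that default on toy data. `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad and truncate at the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # keep the last maxlen tokens
        padding = [value] * (maxlen - len(seq))   # padding values go in front
        padded.append(padding + seq)
    return padded

toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9, 10]]
toy_padded = pad_pre(toy_reviews, maxlen=5)
print(toy_padded)
```

To keep the first 150 words of each review instead, `pad_sequences` accepts `truncating='post'`.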


# Defining Model Architecture with Embedding Layer

In [None]:
# Model with trainable embedding layer
model_embedding = Sequential([
    Embedding(max_features, 50, input_length=maxlen),  # Embedding layer
    Flatten(),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model_embedding.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])
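
As a rough check on model size, the parameter count of the architecture above can be worked out by hand. The arithmetic below assumes the layer sizes shown in the cell (vocabulary 10,000, embedding dimension 50, sequence length 150, 32 hidden units). The variable names are illustrative only.

```python
vocab_size, emb_dim, seq_len, hidden_units = 10000, 50, 150, 32

embedding_params = vocab_size * emb_dim                       # one vector per word
flattened_size = seq_len * emb_dim                            # width after Flatten
dense_params = flattened_size * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                          # single sigmoid unit

total_params = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
```

With only 100 training reviews, that is several thousand parameters per example, which is consistent with training accuracy reaching 1.0000 while validation accuracy stays near 0.53.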







In [None]:

history_embedding = model_embedding.fit(
    x_train, y_train,
    epochs=15,
    batch_size=32,
    validation_data=(x_val, y_val),
)



Epoch 1/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 213ms/step - accuracy: 1.0000 - loss: 0.0135 - val_accuracy: 0.5262 - val_loss: 0.7827
Epoch 2/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 225ms/step - accuracy: 1.0000 - loss: 0.0086 - val_accuracy: 0.5256 - val_loss: 0.7875
Epoch 3/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 199ms/step - accuracy: 1.0000 - loss: 0.0069 - val_accuracy: 0.5250 - val_loss: 0.7901
Epoch 4/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 203ms/step - accuracy: 1.0000 - loss: 0.0047 - val_accuracy: 0.5248 - val_loss: 0.7927
Epoch 5/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 224ms/step - accuracy: 1.0000 - loss: 0.0064 - val_accuracy: 0.5252 - val_loss: 0.7958
Epoch 6/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 224ms/step - accuracy: 1.0000 - loss: 0.0066 - val_accuracy: 0.5254 - val_loss: 0.8000
Epoch 7/15
4/4 ━━━━━━━━━━━━ [output truncated]

In [None]:
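import numpy as np

# Assumption: embedding_dim and embeddings_index are not defined in the cells shown,
# so they are built here from a local GloVe file. Adjust the path and dimension to your copy.
embedding_dim = 100
glove_path = 'glove.6B.100d.txt'
embeddings_index = {}
with open(glove_path, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

word_index = imdb.get_word_index()

# Shift word indices by 3 to account for the padding, start, and unknown tokens
word_index = {word: index + 3 for word, index in word_index.items()}

embedding_matrix = np.zeros((max_features, embedding_dim))

# Copy the pretrained vector of every in-vocabulary word; other rows stay zero
for word, i in word_index.items():
    if i < max_features:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector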
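
The lookup logic above can be tried on toy data. The sketch below uses a made-up three-word index and two made-up 4-dimensional vectors (every name prefixed `toy_` is hypothetical). It shows that rows for words missing from the pretrained vocabulary, and the rows reserved for the special tokens, stay at zero.

```python
import numpy as np

toy_max_features, toy_dim = 8, 4
toy_word_index = {"the": 1, "movie": 2, "zzz": 3}   # raw 1-based frequency ranks
toy_embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "movie": np.array([0.5, 0.6, 0.7, 0.8], dtype="float32"),
}  # "zzz" deliberately has no pretrained vector

# Shift by 3 so the indices line up with the encoded reviews.
toy_shifted = {word: rank + 3 for word, rank in toy_word_index.items()}

toy_matrix = np.zeros((toy_max_features, toy_dim))
for word, i in toy_shifted.items():
    vector = toy_embeddings_index.get(word)
    if i < toy_max_features and vector is not None:
        toy_matrix[i] = vector

# Count rows that received a pretrained vector.
covered = int(np.count_nonzero(toy_matrix.any(axis=1)))
print(covered, "of", toy_max_features, "rows have a pretrained vector")
```

Running the same count on the real `embedding_matrix` gives a quick view of how much of the 10,000-word vocabulary the pretrained vectors actually cover.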


In [None]:
Embedding(max_features, embedding_dim, input_length=maxlen, weights=[embedding_matrix], trainable=True)


<Embedding name=embedding_9, built=True>

# Defining Model with Pretrained Word Embeddings (GloVe)

In [None]:
model_pretrained = Sequential([
    Embedding(max_features, embedding_dim, input_length=maxlen, weights=[embedding_matrix], trainable=True),
    Flatten(),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

optimizer = Adam(learning_rate=0.0001)  # Try a lower learning rate
model_pretrained.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])



In [None]:
history_pretrained = model_pretrained.fit(
    x_train, y_train,
    epochs=15,
    batch_size=32,
    validation_data=(x_val, y_val),
)


Epoch 1/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 224ms/step - accuracy: 0.7458 - loss: 0.5829 - val_accuracy: 0.5133 - val_loss: 0.7148
Epoch 2/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 225ms/step - accuracy: 0.8090 - loss: 0.5444 - val_accuracy: 0.5157 - val_loss: 0.7130
Epoch 3/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 199ms/step - accuracy: 0.8139 - loss: 0.5098 - val_accuracy: 0.5193 - val_loss: 0.7090
Epoch 4/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 231ms/step - accuracy: 0.8209 - loss: 0.4934 - val_accuracy: 0.5206 - val_loss: 0.7094
Epoch 5/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 223ms/step - accuracy: 0.8511 - loss: 0.4751 - val_accuracy: 0.5224 - val_loss: 0.7128
Epoch 6/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 224ms/step - accuracy: 0.8299 - loss: 0.4709 - val_accuracy: 0.5244 - val_loss: 0.7171
Epoch 7/15
4/4 ━━━━━━━━━━━━ [output truncated]

# Changing the Number of Training Samples

In [None]:
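# Experimenting with different training sample sizes
training_sample_sizes = [100, 500, 1000, 2000]

# x_train was cut to 100 reviews earlier, so reload and pad the full training
# set; slicing the truncated array could never yield more than 100 samples.
(x_train_full, y_train_full), _ = imdb.load_data(num_words=max_features)
x_train_full = pad_sequences(x_train_full, maxlen=maxlen)

results = {}
for sample_size in training_sample_sizes:
    # Limit training data to sample_size
    x_train = x_train_full[:sample_size]
    y_train = y_train_full[:sample_size]

    # Train both models with the new sample size.
    # Note: the models are not re-created here, so each run continues from the
    # weights learned in earlier cells and earlier iterations.
    history_embedding = model_embedding.fit(
        x_train, y_train,
        epochs=15,
        batch_size=32,
        validation_data=(x_val, y_val),
    )

    history_pretrained = model_pretrained.fit(
        x_train, y_train,
        epochs=15,
        batch_size=32,
        validation_data=(x_val, y_val),
    )

    # Record the final-epoch validation accuracy of both models for comparison
    results[sample_size] = {
        'embedding_acc': history_embedding.history['val_accuracy'][-1],
        'pretrained_acc': history_pretrained.history['val_accuracy'][-1]
    }

# Print results to compare the performance
print(results)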
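
One detail to watch when varying the sample size: each subset has to be sliced from the full training set. Slicing an array that has already been cut to 100 items can never return more than 100. A minimal sketch with a stand-in list (the names `full_data` and `truncated` are illustrative):

```python
full_data = list(range(25000))   # stand-in for the 25,000 IMDB training reviews
truncated = full_data[:100]      # an earlier cut to 100 samples

sizes = [100, 500, 1000, 2000]
from_truncated = [len(truncated[:n]) for n in sizes]
from_full = [len(full_data[:n]) for n in sizes]

print(from_truncated)  # [100, 100, 100, 100]
print(from_full)       # [100, 500, 1000, 2000]
```

A quick sanity check is the step count in the Keras log: with `batch_size=32`, 100 samples give 4 steps per epoch, while 2000 samples should give 63.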


Epoch 1/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 391ms/step - accuracy: 1.0000 - loss: 4.4569e-04 - val_accuracy: 0.5278 - val_loss: 0.9449
Epoch 2/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 342ms/step - accuracy: 1.0000 - loss: 3.6424e-04 - val_accuracy: 0.5272 - val_loss: 0.9484
Epoch 3/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 336ms/step - accuracy: 1.0000 - loss: 4.6609e-04 - val_accuracy: 0.5275 - val_loss: 0.9518
Epoch 4/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 337ms/step - accuracy: 1.0000 - loss: 4.7175e-04 - val_accuracy: 0.5277 - val_loss: 0.9537
Epoch 5/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 235ms/step - accuracy: 1.0000 - loss: 0.0011 - val_accuracy: 0.5269 - val_loss: 0.9560
Epoch 6/15
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 224ms/step - accuracy: 1.0000 - loss: 4.7259e-04 - val_accuracy: 0.5277 - val_loss: 0.9578
Epoch 7/15
4/4 [output truncated]

Best observed performance of the model with the trainable embedding layer:

In epochs 13–15 of the last training session, validation accuracy was about 54.4%, the highest among the epochs observed.
Validation loss reached a minimum of 0.8607, suggesting the embedding layer performed best at this point.
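
Picking the epoch with the lowest validation loss can be automated. The sketch below applies a simple patience rule, in the spirit of the `EarlyStopping` callback imported at the top of the notebook, to the six validation losses visible in the pretrained model's training log. `best_epoch_with_patience` is a hypothetical helper, not Keras's implementation.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training "stops" once the loss has failed to improve for `patience`
    consecutive epochs; stop_epoch is None if that never happens.
    """
    best_loss, best_epoch, waited = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Validation losses for epochs 1-6 of the pretrained model's first run.
pretrained_val_loss = [0.7148, 0.7130, 0.7090, 0.7094, 0.7128, 0.7171]
best_epoch, stop_epoch = best_epoch_with_patience(pretrained_val_loss)
print(best_epoch, stop_epoch)
```

In Keras, a comparable effect would come from passing `callbacks=[EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)]` to `fit`.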

# Comparing Performance

| Training samples | Embedding model accuracy | Pretrained model accuracy |
|---|---|---|
| 100 | 0.5258 | 0.5346 |
| 500 | 0.5269 | 0.5376 |
| 1000 | 0.5270 | 0.5405 |
| 2000 | 0.5277 | 0.5405 |

# Conclusion

At every training sample size, the model with pretrained GloVe embeddings reaches a marginally higher validation accuracy than the model whose embedding layer is learned from scratch. The gap is about one percentage point and both models stay only slightly above the 50% chance level, so the results suggest, rather than establish, that pretrained embeddings are advantageous when training data is scarce.
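
The size of the gap can be read directly off the reported accuracies. A short sketch using only the figures listed above (the dictionary names are illustrative):

```python
embedding_acc = {100: 0.5258, 500: 0.5269, 1000: 0.5270, 2000: 0.5277}
pretrained_acc = {100: 0.5346, 500: 0.5376, 1000: 0.5405, 2000: 0.5405}

# Difference in validation accuracy at each training size.
gaps = {n: round(pretrained_acc[n] - embedding_acc[n], 4) for n in embedding_acc}
for n, gap in gaps.items():
    print(f"{n:>5} samples: pretrained - embedding = {gap:+.4f}")
```

The difference stays between roughly 0.9 and 1.4 percentage points across all four sizes.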

Performance trend of the pretrained model across training sizes:

100 samples: Pretrained model accuracy = 0.5346

500 samples: Pretrained model accuracy = 0.5376

1000 samples: Pretrained model accuracy = 0.5405

2000 samples: Pretrained model accuracy = 0.5405


The pretrained model's accuracy rises slightly from 100 to 1000 samples (0.5346 to 0.5405) and then plateaus: going from 1000 to 2000 samples brings no further improvement.


Best choice: 1000 samples appears to be a good compromise. It reaches the highest accuracy observed (0.5405), and doubling the training set to 2000 samples brings no measurable improvement.
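
Whether differences of this size are meaningful depends on the noise in the accuracy estimate itself. As a rough guide, and assuming the 10,000 validation reviews behave like independent draws, the binomial standard error of an accuracy figure can be computed as follows (the helper name is illustrative):

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

se = accuracy_standard_error(0.5405, 10000)
print(f"standard error ~ {se:.4f}")
```

A standard error close to 0.005 means that steps of about 0.003 between neighbouring sample sizes fall within one standard error, so the trend should be read with caution.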