<a href="https://colab.research.google.com/github/andr3w1699/HumanLanguageTechnologyProject/blob/main/SentimentClassificationWithRecurrent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q gdown

In [2]:
# Replace FILE_ID with your actual file ID
file_id = '0Bz8a_Dbh9QhbZVhsUnRWRDhETzA'
output_name = 'amazon_review_full_csv.tar.gz'

!gdown --id {file_id} -O {output_name}

Downloading...
From (original): https://drive.google.com/uc?id=0Bz8a_Dbh9QhbZVhsUnRWRDhETzA
From (redirected): https://drive.google.com/uc?id=0Bz8a_Dbh9QhbZVhsUnRWRDhETzA&confirm=t&uuid=b0558250-08ae-4c98-9b28-a1433d90bfc5
To: /content/amazon_review_full_csv.tar.gz
100% 644M/644M [00:09<00:00, 64.5MB/s]


In [3]:
import tarfile

with tarfile.open(output_name, "r:gz") as tar:
    tar.extractall("Dataset")

In [4]:
!ls -R Dataset

Dataset:
amazon_review_full_csv

Dataset/amazon_review_full_csv:
readme.txt  test.csv  train.csv


In [5]:
import pandas as pd

# Set options to show full text and all rows
pd.set_option('display.max_colwidth', None)

df_train = pd.read_csv(
    './Dataset/amazon_review_full_csv/train.csv',
    header=None,
    names=['label', 'title', 'text'],
    quotechar='"',
    doublequote=True,
    escapechar='\\',
    engine='python',
    encoding='utf-8',
    on_bad_lines='skip'  # Skip rows with parsing errors
)

df_train.head()

Unnamed: 0,label,title,text
0,3,more like funchuck,"Gave this to my dad for a gag gift after directing ""Nunsense,"" he got a reall kick out of it!"
1,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature."
2,5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."
3,4,Chrono Cross OST,"The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wonderful creation filled with rich orchestra and synthesized sounds. While ambiance is one of the music's major factors, yet at times it's very uplifting and vigorous. Some of my favourite tracks include; ""Scars Left by Time, The Girl who Stole the Stars, and Another World""."
4,5,Too good to be true,Probably the greatest soundtrack in history! Usually it's better to have played the game first but this is so enjoyable anyway! I worked so hard getting this soundtrack and after spending [money] to get it it was really worth every penny!! Get this OST! it's amazing! The first few tracks will have you dancing around with delight (especially Scars Left by Time)!! BUY IT NOW!!


In [6]:
# Number of rows
print("Number of rows:", len(df_train))

# Check for null values
if df_train.isnull().values.any():
    print("There are null elements in the DataFrame.")
else:
    print("There are no null elements in the DataFrame.")

Number of rows: 2999746
There are null elements in the DataFrame.


In [7]:
df_test = pd.read_csv(
    './Dataset/amazon_review_full_csv/train.csv',
    header=None,
    names=['label', 'title', 'text'],
    quotechar='"',
    doublequote=True,
    escapechar='\\',
    engine='python',
    encoding='utf-8',
    on_bad_lines='skip'  # Skip rows with parsing errors
)

df_test.head()

Unnamed: 0,label,title,text
0,3,more like funchuck,"Gave this to my dad for a gag gift after directing ""Nunsense,"" he got a reall kick out of it!"
1,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature."
2,5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."
3,4,Chrono Cross OST,"The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wonderful creation filled with rich orchestra and synthesized sounds. While ambiance is one of the music's major factors, yet at times it's very uplifting and vigorous. Some of my favourite tracks include; ""Scars Left by Time, The Girl who Stole the Stars, and Another World""."
4,5,Too good to be true,Probably the greatest soundtrack in history! Usually it's better to have played the game first but this is so enjoyable anyway! I worked so hard getting this soundtrack and after spending [money] to get it it was really worth every penny!! Get this OST! it's amazing! The first few tracks will have you dancing around with delight (especially Scars Left by Time)!! BUY IT NOW!!


In [8]:
# Keep only positive (4,5) and negative (1,2) ratings
df_train_binary = df_train[df_train['label'] != 3].copy()

# Map ratings to binary sentiment
df_train_binary['sentiment'] = df_train_binary['label'].apply(lambda x: 1 if x > 3 else 0)

In [9]:
df_train_binary['review'] = df_train_binary['title'].fillna('') + ' ' + df_train_binary['text'].fillna('')
df_train_sampled = df_train_binary.sample(n=150000, random_state=42)
X = df_train_sampled['review'].values
y = df_train_sampled['sentiment'].values

In [10]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Parameters
max_words = 30000  # Size of vocabulary
max_len = 200      # Max review length

# Tokenizer
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(X)

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_padded, y, test_size=0.2, random_state=42)

In [12]:
from tensorflow.keras.layers import Bidirectional,LSTM, Input, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

model_BiLSTM = Sequential([
    Input(shape=(max_len,)),  # Define the input shape
    Embedding(input_dim=max_words, output_dim=128),
    Bidirectional(LSTM(64)),  # BiLSTM instead of LSTM
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
model_BiLSTM.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_BiLSTM.summary()

In [13]:
from tensorflow.keras.layers import LSTM, Input, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping


# Define EarlyStopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)

# Train the model with EarlyStopping
history = model_BiLSTM.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=5,
    callbacks=[early_stop]  # 👈 Early stopping in action
)

Epoch 1/5
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m17s[0m 54ms/step - accuracy: 0.7590 - loss: 0.4794 - val_accuracy: 0.8977 - val_loss: 0.2569
Epoch 2/5
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 53ms/step - accuracy: 0.9174 - loss: 0.2156 - val_accuracy: 0.9028 - val_loss: 0.2491
Epoch 3/5
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 53ms/step - accuracy: 0.9414 - loss: 0.1639 - val_accuracy: 0.9020 - val_loss: 0.2654
Epoch 4/5
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 51ms/step - accuracy: 0.9556 - loss: 0.1293 - val_accuracy: 0.9040 - val_loss: 0.2541


In [14]:
# Keep only positive (4,5) and negative (1,2) ratings
df_test_binary = df_test[df_test['label'] != 3].copy()

# Map ratings to binary sentiment
df_test_binary['sentiment'] = df_test_binary['label'].apply(lambda x: 1 if x > 3 else 0)


# Preprocess test set
df_test_binary['review'] = df_test_binary['title'].fillna('') + ' ' + df_test_binary['text'].fillna('')

df_test_sampled = df_test_binary.sample(n=150000, random_state=42)

X_test_seq = tokenizer.texts_to_sequences(df_test_sampled['review'].values)
X_test_padded = pad_sequences(X_test_seq, maxlen=max_len, padding='post', truncating='post')
y_test = df_test_sampled['sentiment'].values

# Evaluate
loss, acc = model_BiLSTM.evaluate(X_test_padded, y_test)
print(f"Test accuracy: {acc:.2f}")



[1m4688/4688[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m37s[0m 8ms/step - accuracy: 0.9349 - loss: 0.1802
Test accuracy: 0.93


In [15]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dropout, Dense, Input
from tensorflow.keras.callbacks import EarlyStopping

model_BiLSTM_CNN = Sequential([
    Input(shape=(max_len,)),
    Embedding(input_dim=max_words, output_dim=128),

    Bidirectional(LSTM(64, return_sequences=True)),  # Keep sequences for CNN
    Conv1D(filters=64, kernel_size=3, activation='relu'),
    MaxPooling1D(pool_size=2),

    GlobalMaxPooling1D(),  # 👈 This flattens (batch, time, features) → (batch, features)
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

# Compile the model
model_BiLSTM_CNN.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Early stopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)

# Train the model
history = model_BiLSTM_CNN.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=10,
    callbacks=[early_stop]
)

# Show model summary
model_BiLSTM_CNN.summary()

Epoch 1/10
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 69ms/step - accuracy: 0.7364 - loss: 0.5139 - val_accuracy: 0.8975 - val_loss: 0.2543
Epoch 2/10
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 67ms/step - accuracy: 0.9178 - loss: 0.2154 - val_accuracy: 0.9030 - val_loss: 0.2418
Epoch 3/10
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 67ms/step - accuracy: 0.9439 - loss: 0.1531 - val_accuracy: 0.9041 - val_loss: 0.2438
Epoch 4/10
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 66ms/step - accuracy: 0.9597 - loss: 0.1166 - val_accuracy: 0.9013 - val_loss: 0.2939


In [16]:
# Evaluate
loss, acc = model_BiLSTM_CNN.evaluate(X_test_padded, y_test)
print(f"Test accuracy: {acc:.2f}")


[1m4688/4688[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m38s[0m 8ms/step - accuracy: 0.9370 - loss: 0.1689
Test accuracy: 0.94


In [17]:
from tensorflow.keras.layers import GRU, Input, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

model_GRU = Sequential([
    Input(shape=(max_len,)),  # Same input shape
    Embedding(input_dim=max_words, output_dim=128),
    Bidirectional(GRU(64)),                  # 👈 GRU instead of LSTM
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model_GRU.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_GRU.summary()

In [18]:
# Early stopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)

# Train the model
history = model_GRU.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=20,
    callbacks=[early_stop]
)

# Show model summary
model_GRU.summary()

Epoch 1/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 50ms/step - accuracy: 0.7634 - loss: 0.4841 - val_accuracy: 0.8820 - val_loss: 0.2901
Epoch 2/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 50ms/step - accuracy: 0.9100 - loss: 0.2338 - val_accuracy: 0.9005 - val_loss: 0.2548
Epoch 3/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 48ms/step - accuracy: 0.9367 - loss: 0.1745 - val_accuracy: 0.9051 - val_loss: 0.2449
Epoch 4/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 50ms/step - accuracy: 0.9557 - loss: 0.1310 - val_accuracy: 0.9034 - val_loss: 0.2618
Epoch 5/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 48ms/step - accuracy: 0.9672 - loss: 0.1017 - val_accuracy: 0.8994 - val_loss: 0.2832


In [19]:
# Evaluate
loss, acc = model_GRU.evaluate(X_test_padded, y_test)
print(f"Test accuracy: {acc:.2f}")

[1m4688/4688[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m35s[0m 7ms/step - accuracy: 0.9468 - loss: 0.1546
Test accuracy: 0.95


In [20]:
from tensorflow.keras.layers import SimpleRNN, Input, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping

model_Simple_RNN = Sequential([
    Input(shape=(max_len,)),               # input_length = max_len
    Embedding(input_dim=max_words,         # vocabulary size
              output_dim=128),             # embedding dimension
    Bidirectional(SimpleRNN(128)),                        # simple RNN with 128 units
    Dropout(0.5),
    Dense(1, activation='sigmoid')         # binary output
])

model_Simple_RNN.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model_Simple_RNN.summary()

# Early stopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)

# Train
history = model_Simple_RNN.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=20,
    callbacks=[early_stop]
)

Epoch 1/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 69ms/step - accuracy: 0.7083 - loss: 0.5263 - val_accuracy: 0.8827 - val_loss: 0.2841
Epoch 2/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 59ms/step - accuracy: 0.9048 - loss: 0.2432 - val_accuracy: 0.8754 - val_loss: 0.2977
Epoch 3/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 58ms/step - accuracy: 0.9279 - loss: 0.1942 - val_accuracy: 0.8806 - val_loss: 0.3243


In [21]:
# Evaluate
loss, acc = model_Simple_RNN.evaluate(X_test_padded, y_test)
print(f"Test accuracy: {acc:.2f}")

[1m4688/4688[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m48s[0m 10ms/step - accuracy: 0.9014 - loss: 0.2456
Test accuracy: 0.90


In [22]:
!pip install --upgrade gensim



In [26]:
!wget -c http://nlp.stanford.edu/data/glove.6B.zip

--2025-04-24 10:14:41--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-04-24 10:14:41--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-04-24 10:14:41--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [27]:
!unzip -q glove.6B.zip glove.6B.100d.txt

In [28]:
import numpy as np

embedding_index = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

In [29]:
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))

for word, idx in tokenizer.word_index.items():
    if idx < max_words:
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[idx] = embedding_vector
        else:
            # Random init for words not found in GloVe
            embedding_matrix[idx] = np.random.normal(size=(embedding_dim,))

In [33]:
from tensorflow.keras.layers import Embedding

model_BiLSTM = Sequential([
    Input(shape=(max_len,)),
    Embedding(
        input_dim=max_words,
        output_dim=embedding_dim,
        weights=[embedding_matrix],
        input_length=max_len,
        trainable=True  # freeze or True to fine-tune
    ),
    Bidirectional(LSTM(64)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model_BiLSTM.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
model_BiLSTM.summary()



In [34]:
# Assuming you’ve already defined:
# - model_BiLSTM
# - X_train, y_train, X_val, y_val
# - early_stop = EarlyStopping(...)

history = model_BiLSTM.fit(
    X_train,
    y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=20,              # you can start with 10–20 epochs
    callbacks=[early_stop],
    verbose=1               # show progress bar
)

Epoch 1/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 53ms/step - accuracy: 0.7531 - loss: 0.4869 - val_accuracy: 0.8824 - val_loss: 0.2881
Epoch 2/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 54ms/step - accuracy: 0.8978 - loss: 0.2600 - val_accuracy: 0.8913 - val_loss: 0.2617
Epoch 3/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 52ms/step - accuracy: 0.9152 - loss: 0.2183 - val_accuracy: 0.8998 - val_loss: 0.2463
Epoch 4/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 54ms/step - accuracy: 0.9331 - loss: 0.1820 - val_accuracy: 0.9081 - val_loss: 0.2388
Epoch 5/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 53ms/step - accuracy: 0.9462 - loss: 0.1527 - val_accuracy: 0.9105 - val_loss: 0.2372
Epoch 6/20
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 53ms/step - accuracy: 0.9529 - loss: 0.1353 - val_accuracy: 0.9083 - val_loss: 0.2458
Epoch 7/20
[1m2

In [35]:
# Evaluate
loss, acc = model_BiLSTM.evaluate(X_test_padded, y_test)
print(f"Test accuracy: {acc:.2f}")

[1m4688/4688[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m37s[0m 8ms/step - accuracy: 0.9494 - loss: 0.1459
Test accuracy: 0.95


In [36]:
!wget -c https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
!unzip -q wiki-news-300d-1M.vec.zip

--2025-04-24 10:29:05--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.226.210.111, 13.226.210.15, 13.226.210.25, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.226.210.111|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip’


2025-04-24 10:29:16 (58.8 MB/s) - ‘wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]



In [37]:
import numpy as np

embedding_index = {}
with open('wiki-news-300d-1M.vec', encoding='utf8', errors='ignore') as f:
    next(f)  # skip header line: “1000000 300”
    for line in f:
        values = line.rstrip().split(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

In [38]:
embedding_dim = 300
embedding_matrix = np.zeros((max_words, embedding_dim))

for word, idx in tokenizer.word_index.items():
    if idx < max_words:
        vector = embedding_index.get(word)
        if vector is not None:
            embedding_matrix[idx] = vector
        else:
            # you can use random init or zeros
            embedding_matrix[idx] = np.random.normal(size=(embedding_dim,))

In [41]:
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense, Input
from tensorflow.keras.models import Sequential

model_FT = Sequential([
    Input(shape=(max_len,)),
    Embedding(
        input_dim=max_words,
        output_dim=embedding_dim,
        weights=[embedding_matrix],
        input_length=max_len,
        trainable=True  # or True to fine-tune
    ),
    Bidirectional(LSTM(64)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model_FT.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
model_FT.summary()



In [42]:
history = model_FT.fit(
    X_train,
    y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=10,            # start small; EarlyStopping will help
    callbacks=[early_stop],
    verbose=1
)

Epoch 1/10
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m25s[0m 92ms/step - accuracy: 0.7884 - loss: 0.4474 - val_accuracy: 0.8930 - val_loss: 0.2628
Epoch 2/10
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m41s[0m 92ms/step - accuracy: 0.9166 - loss: 0.2190 - val_accuracy: 0.9051 - val_loss: 0.2365
Epoch 3/10
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m41s[0m 91ms/step - accuracy: 0.9403 - loss: 0.1633 - val_accuracy: 0.9090 - val_loss: 0.2407
Epoch 4/10
[1m235/235[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m41s[0m 90ms/step - accuracy: 0.9567 - loss: 0.1224 - val_accuracy: 0.9085 - val_loss: 0.2465


In [43]:
# Evaluate
loss, acc = model_FT.evaluate(X_test_padded, y_test)
print(f"Test accuracy: {acc:.2f}")

[1m4688/4688[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m45s[0m 10ms/step - accuracy: 0.9349 - loss: 0.1759
Test accuracy: 0.94
