<a href="https://colab.research.google.com/github/andr3w1699/HumanLanguageTechnologyProject/blob/main/SentimentClassificationWithRecurrent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q gdown

In [2]:
# Replace FILE_ID with your actual file ID
file_id = '0Bz8a_Dbh9QhbZVhsUnRWRDhETzA'
output_name = 'amazon_review_full_csv.tar.gz'

!gdown --id {file_id} -O {output_name}

Downloading...
From (original): https://drive.google.com/uc?id=0Bz8a_Dbh9QhbZVhsUnRWRDhETzA
From (redirected): https://drive.google.com/uc?id=0Bz8a_Dbh9QhbZVhsUnRWRDhETzA&confirm=t&uuid=611c8d1b-7230-498f-afc9-136fb123c75c
To: /content/amazon_review_full_csv.tar.gz
100% 644M/644M [00:14<00:00, 43.9MB/s]


In [3]:
import tarfile

with tarfile.open(output_name, "r:gz") as tar:
    tar.extractall("Dataset")

In [4]:
!ls -R Dataset

Dataset:
amazon_review_full_csv

Dataset/amazon_review_full_csv:
readme.txt  test.csv  train.csv


In [5]:
import pandas as pd

# Set options to show full text and all rows
pd.set_option('display.max_colwidth', None)

df_train = pd.read_csv(
    './Dataset/amazon_review_full_csv/train.csv',
    header=None,
    names=['label', 'title', 'text'],
    quotechar='"',
    doublequote=True,
    escapechar='\\',
    engine='python',
    encoding='utf-8',
    on_bad_lines='skip'  # Skip rows with parsing errors
)

df_train.head()

Unnamed: 0,label,title,text
0,3,more like funchuck,"Gave this to my dad for a gag gift after directing ""Nunsense,"" he got a reall kick out of it!"
1,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature."
2,5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."
3,4,Chrono Cross OST,"The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wonderful creation filled with rich orchestra and synthesized sounds. While ambiance is one of the music's major factors, yet at times it's very uplifting and vigorous. Some of my favourite tracks include; ""Scars Left by Time, The Girl who Stole the Stars, and Another World""."
4,5,Too good to be true,Probably the greatest soundtrack in history! Usually it's better to have played the game first but this is so enjoyable anyway! I worked so hard getting this soundtrack and after spending [money] to get it it was really worth every penny!! Get this OST! it's amazing! The first few tracks will have you dancing around with delight (especially Scars Left by Time)!! BUY IT NOW!!


In [6]:
# Number of rows
print("Number of rows:", len(df_train))

# Check for null values
if df_train.isnull().values.any():
    print("There are null elements in the DataFrame.")
else:
    print("There are no null elements in the DataFrame.")

Number of rows: 2999746
There are null elements in the DataFrame.


In [7]:
df_test = pd.read_csv(
    './Dataset/amazon_review_full_csv/train.csv',
    header=None,
    names=['label', 'title', 'text'],
    quotechar='"',
    doublequote=True,
    escapechar='\\',
    engine='python',
    encoding='utf-8',
    on_bad_lines='skip'  # Skip rows with parsing errors
)

df_test.head()

Unnamed: 0,label,title,text
0,3,more like funchuck,"Gave this to my dad for a gag gift after directing ""Nunsense,"" he got a reall kick out of it!"
1,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature."
2,5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."
3,4,Chrono Cross OST,"The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wonderful creation filled with rich orchestra and synthesized sounds. While ambiance is one of the music's major factors, yet at times it's very uplifting and vigorous. Some of my favourite tracks include; ""Scars Left by Time, The Girl who Stole the Stars, and Another World""."
4,5,Too good to be true,Probably the greatest soundtrack in history! Usually it's better to have played the game first but this is so enjoyable anyway! I worked so hard getting this soundtrack and after spending [money] to get it it was really worth every penny!! Get this OST! it's amazing! The first few tracks will have you dancing around with delight (especially Scars Left by Time)!! BUY IT NOW!!


In [8]:
# Keep only positive (4,5) and negative (1,2) ratings
df_train_binary = df_train[df_train['label'] != 3].copy()

# Map ratings to binary sentiment
df_train_binary['sentiment'] = df_train_binary['label'].apply(lambda x: 1 if x > 3 else 0)

In [9]:
df_train_binary['review'] = df_train_binary['title'].fillna('') + ' ' + df_train_binary['text'].fillna('')
df_train_sampled = df_train_binary.sample(n=100000, random_state=42)
X = df_train_sampled['review'].values
y = df_train_sampled['sentiment'].values

In [10]:
!pip install tensorflow



In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Parameters
max_words = 30000  # Size of vocabulary
max_len = 200      # Max review length

# Tokenizer
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(X)

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(X)
X_padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_padded, y, test_size=0.2, random_state=42)

In [13]:
from tensorflow.keras.layers import Bidirectional,LSTM, Input, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

model = Sequential([
    Input(shape=(max_len,)),  # Define the input shape
    Embedding(input_dim=max_words, output_dim=128),
    Bidirectional(LSTM(64)),  # BiLSTM instead of LSTM
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [14]:
from tensorflow.keras.layers import LSTM, Input, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping


# Define EarlyStopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)

# Train the model with EarlyStopping
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=10,
    callbacks=[early_stop]  # 👈 Early stopping in action
)

Epoch 1/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 59ms/step - accuracy: 0.7498 - loss: 0.5059 - val_accuracy: 0.8935 - val_loss: 0.2671
Epoch 2/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 56ms/step - accuracy: 0.9170 - loss: 0.2234 - val_accuracy: 0.8956 - val_loss: 0.2562
Epoch 3/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 55ms/step - accuracy: 0.9390 - loss: 0.1697 - val_accuracy: 0.8925 - val_loss: 0.2756
Epoch 4/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 52ms/step - accuracy: 0.9545 - loss: 0.1301 - val_accuracy: 0.8895 - val_loss: 0.3053


In [15]:
# Keep only positive (4,5) and negative (1,2) ratings
df_test_binary = df_test[df_test['label'] != 3].copy()

# Map ratings to binary sentiment
df_test_binary['sentiment'] = df_test_binary['label'].apply(lambda x: 1 if x > 3 else 0)


# Preprocess test set
df_test_binary['review'] = df_test_binary['title'].fillna('') + ' ' + df_test_binary['text'].fillna('')

df_test_sampled = df_test_binary.sample(n=100000, random_state=42)

X_test_seq = tokenizer.texts_to_sequences(df_test_sampled['review'].values)
X_test_padded = pad_sequences(X_test_seq, maxlen=max_len, padding='post', truncating='post')
y_test = df_test_sampled['sentiment'].values

# Evaluate
loss, acc = model.evaluate(X_test_padded, y_test)
print(f"Test accuracy: {acc:.2f}")



[1m3125/3125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m24s[0m 8ms/step - accuracy: 0.9301 - loss: 0.1945
Test accuracy: 0.93


In [16]:
from tensorflow.keras.layers import LSTM, Input, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

model = Sequential([
    Input(shape=(max_len,)),  # Define the input shape
    Embedding(input_dim=max_words, output_dim=128),
    LSTM(64),                 # 👈 Simple LSTM instead of Bidirectional
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [17]:
from tensorflow.keras.layers import LSTM, Input, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping


# Define EarlyStopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)

# Train the model with EarlyStopping
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=10,
    callbacks=[early_stop]  # 👈 Early stopping in action
)

Epoch 1/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 32ms/step - accuracy: 0.4993 - loss: 0.6937 - val_accuracy: 0.4988 - val_loss: 0.6932
Epoch 2/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 30ms/step - accuracy: 0.5059 - loss: 0.6933 - val_accuracy: 0.5014 - val_loss: 0.6932
Epoch 3/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 28ms/step - accuracy: 0.5043 - loss: 0.6932 - val_accuracy: 0.4988 - val_loss: 0.6932


In [18]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dropout, Dense, Input
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(input_dim=max_words, output_dim=128),

    Bidirectional(LSTM(64, return_sequences=True)),  # Keep sequences for CNN
    Conv1D(filters=64, kernel_size=3, activation='relu'),
    MaxPooling1D(pool_size=2),

    GlobalMaxPooling1D(),  # 👈 This flattens (batch, time, features) → (batch, features)
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Early stopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)

# Train the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=10,
    callbacks=[early_stop]
)

# Show model summary
model.summary()

Epoch 1/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 73ms/step - accuracy: 0.7159 - loss: 0.5351 - val_accuracy: 0.8934 - val_loss: 0.2630
Epoch 2/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 70ms/step - accuracy: 0.9157 - loss: 0.2220 - val_accuracy: 0.8936 - val_loss: 0.2614
Epoch 3/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 66ms/step - accuracy: 0.9440 - loss: 0.1543 - val_accuracy: 0.8956 - val_loss: 0.2736
Epoch 4/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 67ms/step - accuracy: 0.9608 - loss: 0.1136 - val_accuracy: 0.8894 - val_loss: 0.3033


In [19]:
# Evaluate
loss, acc = model.evaluate(X_test_padded, y_test)
print(f"Test accuracy: {acc:.2f}")


[1m3125/3125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m25s[0m 8ms/step - accuracy: 0.9325 - loss: 0.1789
Test accuracy: 0.93


In [20]:
from tensorflow.keras.layers import GRU, Input, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

model = Sequential([
    Input(shape=(max_len,)),  # Same input shape
    Embedding(input_dim=max_words, output_dim=128),
    GRU(64),                  # 👈 GRU instead of LSTM
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [21]:
# Early stopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)

# Train the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=10,
    callbacks=[early_stop]
)

# Show model summary
model.summary()

Epoch 1/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 31ms/step - accuracy: 0.4944 - loss: 0.6937 - val_accuracy: 0.5015 - val_loss: 0.6933
Epoch 2/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 29ms/step - accuracy: 0.4992 - loss: 0.6935 - val_accuracy: 0.5014 - val_loss: 0.6931
Epoch 3/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 26ms/step - accuracy: 0.5035 - loss: 0.6932 - val_accuracy: 0.5016 - val_loss: 0.6931
Epoch 4/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 29ms/step - accuracy: 0.5013 - loss: 0.6933 - val_accuracy: 0.5015 - val_loss: 0.6933


In [24]:
!pip install gdown



In [26]:
import gdown # Import gdown
# Google Drive URL for the Word2Vec model
url = 'https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM'
output = 'GoogleNews-vectors-negative300.bin.gz'

# Download the file
gdown.download(url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
From (redirected): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&confirm=t&uuid=710877c7-b659-4084-b048-68dae6475a39
To: /content/GoogleNews-vectors-negative300.bin.gz
100%|██████████| 1.65G/1.65G [00:28<00:00, 57.8MB/s]


'GoogleNews-vectors-negative300.bin.gz'

In [27]:
import gzip
import shutil

with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f_in:
    with open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [28]:
from gensim.models import KeyedVectors

# Load the Word2Vec model
w2v_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Get vector for a word (e.g., "computer")
vector = w2v_model['computer']
print(vector)

[ 1.07421875e-01 -2.01171875e-01  1.23046875e-01  2.11914062e-01
 -9.13085938e-02  2.16796875e-01 -1.31835938e-01  8.30078125e-02
  2.02148438e-01  4.78515625e-02  3.66210938e-02 -2.45361328e-02
  2.39257812e-02 -1.60156250e-01 -2.61230469e-02  9.71679688e-02
 -6.34765625e-02  1.84570312e-01  1.70898438e-01 -1.63085938e-01
 -1.09375000e-01  1.49414062e-01 -4.65393066e-04  9.61914062e-02
  1.68945312e-01  2.60925293e-03  8.93554688e-02  6.49414062e-02
  3.56445312e-02 -6.93359375e-02 -1.46484375e-01 -1.21093750e-01
 -2.27539062e-01  2.45361328e-02 -1.24511719e-01 -3.18359375e-01
 -2.20703125e-01  1.30859375e-01  3.66210938e-02 -3.63769531e-02
 -1.13281250e-01  1.95312500e-01  9.76562500e-02  1.26953125e-01
  6.59179688e-02  6.93359375e-02  1.02539062e-02  1.75781250e-01
 -1.68945312e-01  1.21307373e-03 -2.98828125e-01 -1.15234375e-01
  5.66406250e-02 -1.77734375e-01 -2.08984375e-01  1.76757812e-01
  2.38037109e-02 -2.57812500e-01 -4.46777344e-02  1.88476562e-01
  5.51757812e-02  5.02929

In [29]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

# Set up tokenizer
max_words = 10000  # Max number of words in the vocabulary
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df_train_sampled['review'])  # Replace with your column name

# Prepare embedding matrix
embedding_dim = 300  # Google News vectors are 300-dimensional
embedding_matrix = np.zeros((max_words, embedding_dim))

for word, i in tokenizer.word_index.items():
    if i < max_words:
        try:
            embedding_matrix[i] = w2v_model[word]  # Fetch embedding
        except KeyError:
            embedding_matrix[i] = np.random.normal(scale=0.6, size=(embedding_dim,))  # Random embedding

In [30]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Input
from tensorflow.keras.callbacks import EarlyStopping

# Define the LSTM model
model = Sequential([
    Input(shape=(max_len,)),  # Input shape is the padded sequence length
    Embedding(input_dim=max_words,
              output_dim=embedding_dim,
              weights=[embedding_matrix],  # Use Word2Vec embeddings
              input_length=max_len,
              trainable=False),  # Freeze embeddings
    LSTM(64),  # LSTM layer
    Dropout(0.5),
    Dense(1, activation='sigmoid')  # Binary classification
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()



In [31]:
# Set up early stopping
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=512,
    epochs=10,
    callbacks=[early_stop]
)


Epoch 1/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 38ms/step - accuracy: 0.5032 - loss: 0.6932 - val_accuracy: 0.5038 - val_loss: 0.6931
Epoch 2/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 36ms/step - accuracy: 0.5001 - loss: 0.6932 - val_accuracy: 0.4994 - val_loss: 0.6931
Epoch 3/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 35ms/step - accuracy: 0.5042 - loss: 0.6930 - val_accuracy: 0.5046 - val_loss: 0.6930
Epoch 4/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 36ms/step - accuracy: 0.5047 - loss: 0.6930 - val_accuracy: 0.5051 - val_loss: 0.6930
Epoch 5/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 36ms/step - accuracy: 0.5002 - loss: 0.6929 - val_accuracy: 0.4972 - val_loss: 0.6932
Epoch 6/10
[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 37ms/step - accuracy: 0.5053 - loss: 0.6929 - val_accuracy: 0.4950 - val_loss: 0.6931
Epoch 7/10
[1m157/