# Amazon reviews for cell phones and accessories dataset

### 3. **Word Embedding models**

a. Learn word2vec models using gensim on this dataset with the following settings:
(a) Size=100, 200, 300, (b) Window=3,7, (c) Min_count=2, 5. Use skipgram.
i. This will give 12 word2vec models. For each of these models, for each
review take average word embeddings and train a logistic regression.
Report accuracy on test set.

b. Use the already available google word2vec model. For each review take average word embeddings and train a logistic regression. Report accuracy on test set.

c. Use the already available glove models: 50D, 100D and 200D. For each review
take average word embeddings and train a logistic regression. Report accuracy on test set for each of the three sized embeddings.

### 4. **Neural Network models**

a. RNNs: Train a single directional RNN with L layers. Vary the number of layers
(as 1,2,3,4) and also size of layers (20, 50, 100, 200). Report accuracy on test set.

b. LSTMs: Train a single directional LSTM with L layers. Vary the number of
layers (as 1,2,3,4) and also size of layers (20, 50, 100, 200). Report accuracy on test set.

c. BiLSTM: Train a single directional RNN with L layers. Vary the number of
layers (as 1,2,3,4) and also size of layers (20, 50, 100, 200). Report accuracy on test set.

## Connecting to drive

In [1]:
from google.colab import drive

drive.mount("/content/Drive")

Mounted at /content/Drive


In [2]:
import os

# Mount your Google Drive to access files stored there
drive.mount('/content/drive')

# Replace 'your_file_name.csv' with the actual name of your file.
file_name = 'Cell_Phones_and_Accessories_5.json'

# Set the root directory to your Google Drive
root_dir = '/content/drive/My Drive/'

# Function to recursively search for the file in all directories and subdirectories
def find_file(directory):
    for item in os.listdir(directory):
        item_path = os.path.join(directory, item)
        if os.path.isfile(item_path) and item == file_name:
            return directory
        elif os.path.isdir(item_path):
            result = find_file(item_path)
            if result:
                return result
    return None

# Call the function to find the file directory
file_directory = find_file(root_dir)

# Print the file directory
if file_directory:
    print("File directory:", file_directory)
else:
    print("File not found in Google Drive.")


Mounted at /content/drive
File directory: /content/drive/My Drive/midterm (1)/dataset


In [3]:
os.chdir('/content/drive/My Drive/midterm (1)/dataset')

## Importing required libraries

In [4]:
import os
import pandas as pd

In [5]:
import numpy as np
import nltk
import json
import re
import multiprocessing as mp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [6]:
import warnings
# Ignore warnings
warnings.filterwarnings("ignore")

## Reading the dataset

In [7]:
# Step 1: Read the JSON file and convert it to DataFrame
df = pd.read_json('Cell_Phones_and_Accessories_5.json', lines=True)

# Now, 'df' contains the data in DataFrame format, and you can work with it as needed.


In [8]:
df.head(2)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194439 entries, 0 to 194438
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   reviewerID      194439 non-null  object
 1   asin            194439 non-null  object
 2   reviewerName    190920 non-null  object
 3   helpful         194439 non-null  object
 4   reviewText      194439 non-null  object
 5   overall         194439 non-null  int64 
 6   summary         194439 non-null  object
 7   unixReviewTime  194439 non-null  int64 
 8   reviewTime      194439 non-null  object
dtypes: int64(2), object(7)
memory usage: 13.4+ MB


In [10]:
# Extracting the 'reviewText' and 'overall' columns
df = df[['reviewText', 'overall']]

## Sampling

Sampling to make the data manageable

In [11]:
# The percentage of data to sample
sample_percentage = 0.025

# Perform simple random sampling
sample_data = df.sample(frac=sample_percentage, random_state=42)

## Preprocessing

#### Cleaning the text column ('reviewText') by:
1. Removing stop words
2. Convert text to lowercase
3. Removing punctuations and numbers
4. Tokenizing
5. Stemming and
6. Lemmatization

In [12]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
import re

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [13]:
# Define stopwords
stop_words = set(stopwords.words('english'))

# Define stemmer
stemmer = PorterStemmer()

# Define lemmatizer
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuations and numbers
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Tokenize and remove stop words
    tokenized_text = [w for w in word_tokenize(text) if w not in stop_words]
    text = ' '.join(tokenized_text)

    # Perform stemming and lemmatization
    stemmed_lemmatized_text = [stemmer.stem(lemmatizer.lemmatize(w)) for w in word_tokenize(text)]
    text = ' '.join(stemmed_lemmatized_text)

    return text

In [14]:
sample_data['reviewText'] = sample_data['reviewText'].apply(preprocess_text)
print(sample_data)

                                               reviewText  overall
156187  ibolt xprodock activ car dock holder mount sam...        5
102252  pouch everyth look otter box commut case aroun...        5
23146   first case iphon previous one free one give al...        3
86461   order case case htc inspir case last year abso...        5
62407   bought gift big hit love choic color made devi...        5
...                                                   ...      ...
145652  work well samsung note ii charg charger rest p...        2
165116  use brand extern batteri pack et np k charger ...        5
65593   perfect case highli recommend samsung galaxi i...        5
42927   got phone upgrad week swap need soon got home ...        5
9887    lucki enough abl telecommut home full time spe...        5

[4861 rows x 2 columns]


## Splitting the dataset

In [15]:
# Split the data
train_size = int(0.7 * len(sample_data))
val_size = int(0.1 * len(sample_data))

train_data = sample_data[:train_size]
val_data = sample_data[train_size : train_size+val_size]
test_data = sample_data[train_size+val_size:]

# Extract the 'reviewText' and 'overall' fields
X_train, y_train = train_data['reviewText'], train_data['overall']
X_val, y_val = val_data['reviewText'], val_data['overall']
X_test, y_test = test_data['reviewText'], test_data['overall']

In [16]:
(X_train.shape), (y_train.shape)

((3402,), (3402,))

In [17]:
(X_val.shape), (y_val.shape)

((486,), (486,))

In [18]:
(X_test.shape), (y_test.shape)

((973,), (973,))

## Word Embeddings

3. **Average of Word Embeddings:**
   - **Description:** This approach computes the average word embeddings of each review and uses them as features to train classifiers like logistic regression.
   - **Advantages:** Simple, captures context, useful for short texts.
   - **Disadvantages:** Ignores word order, loses some semantic information.


a. **word2vec models using gensim**

In [19]:
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [20]:
# Function to extract word embeddings
def extract_word_embeddings(reviews, w2v_model):
    embeddings = []
    for tokens in reviews:
        embeddings_sum = np.zeros(w2v_model.vector_size)
        word_count = 0
        for token in tokens:
            if token in w2v_model.wv:
                embeddings_sum += w2v_model.wv[token]
                word_count += 1
        if word_count > 0:
            embeddings.append(embeddings_sum / word_count)
        else:
            embeddings.append(np.zeros(w2v_model.vector_size))
    return np.array(embeddings)

In [21]:
# Train word2vec models and perform sentiment analysis with multiple ML models
sizes = [100, 200, 300]
windows = [3, 7]
min_counts = [2, 5]
ml_models = {
    "Logistic Regression": LogisticRegression()
}

results = {}

**word2vec models using gensim
(a) Size=100, 200, 300, (b) Window=3,7, (c) Min_count=2, 5. Use skipgram.**

In [22]:
for size in sizes:
    for window in windows:
        for min_count in min_counts:
            model_name = f"word2vec_size_{size}_window_{window}_min_count_{min_count}"
            print(f"Training {model_name}...")

            # Tokenize the cleaned text
            tokenized_reviews = train_data['reviewText'].apply(word_tokenize)

            # Train Word2Vec model
            w2v_model = Word2Vec(tokenized_reviews, vector_size=size, window=window, min_count=min_count, sg=1)

            # Extract word embeddings
            X_train = extract_word_embeddings(tokenized_reviews, w2v_model)
            y_train = train_data['overall'].values

            # Evaluate with multiple ML models
            for model_name, model in ml_models.items():
                print(f"Training and evaluating {model_name}...")

                model.fit(X_train, y_train)

                # Extract word embeddings for test set
                tokenized_test_reviews = test_data['reviewText'].apply(word_tokenize)
                X_test = extract_word_embeddings(tokenized_test_reviews, w2v_model)
                y_test = test_data['overall'].values

                # Predict and calculate accuracy
                y_pred = model.predict(X_test)
                accuracy = accuracy_score(y_test, y_pred)

                results[(model_name, model_name)] = accuracy

                print(f"{model_name} Accuracy: {accuracy:.4f}")

Training word2vec_size_100_window_3_min_count_2...
Training and evaluating Logistic Regression...
Logistic Regression Accuracy: 0.5992
Training word2vec_size_100_window_3_min_count_5...
Training and evaluating Logistic Regression...
Logistic Regression Accuracy: 0.6002
Training word2vec_size_100_window_7_min_count_2...
Training and evaluating Logistic Regression...
Logistic Regression Accuracy: 0.6023
Training word2vec_size_100_window_7_min_count_5...
Training and evaluating Logistic Regression...
Logistic Regression Accuracy: 0.6002
Training word2vec_size_200_window_3_min_count_2...
Training and evaluating Logistic Regression...
Logistic Regression Accuracy: 0.5992
Training word2vec_size_200_window_3_min_count_5...
Training and evaluating Logistic Regression...
Logistic Regression Accuracy: 0.5971
Training word2vec_size_200_window_7_min_count_2...
Training and evaluating Logistic Regression...
Logistic Regression Accuracy: 0.6053
Training word2vec_size_200_window_7_min_count_5...
Trai

The best performing model is word2vec_size_200_window_7_min_count_2

Accuracy: 0.6053

word2vec_size_200_window_7_min_count_5

Accuracy: 0.6053

b. **google word2vec model**

In [24]:
from gensim.models import KeyedVectors
import gensim.downloader as api

In [25]:
# Download and load the Google News Word2Vec model
# This will download the model if not already available in your Colab environment
w2v_model = api.load('word2vec-google-news-300')



In [26]:
w2v_model_google= w2v_model

In [27]:
# Function to extract average word embeddings for each review
def extract_average_embeddings(reviews, w2v_model):
    embeddings = []
    for review in reviews:
        tokens = word_tokenize(review)
        embeddings_sum = np.zeros(w2v_model.vector_size)
        word_count = 0
        for token in tokens:
            if token in w2v_model:
                embeddings_sum += w2v_model[token]
                word_count += 1
        if word_count > 0:
            embeddings.append(embeddings_sum / word_count)
        else:
            embeddings.append(np.zeros(w2v_model.vector_size))
    return np.array(embeddings)

# Extract average embeddings for training data
X_train = extract_average_embeddings(train_data['reviewText'], w2v_model)
y_train = train_data['overall'].values

# Extract average embeddings for test data
X_test = extract_average_embeddings(test_data['reviewText'], w2v_model)
y_test = test_data['overall'].values

# Train logistic regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Predict and calculate accuracy on the test set
y_pred = lr_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Logistic Regression Accuracy: {accuracy:.4f}")


Logistic Regression Accuracy: 0.6043


c. **Glove**

In [28]:
#Load pre-trained GloVe models with different dimensions
glove_50d = api.load("glove-wiki-gigaword-50")
glove_100d = api.load("glove-wiki-gigaword-100")
glove_200d = api.load("glove-wiki-gigaword-200")



In [29]:
# Function to extract average word embeddings for each review
def extract_average_embeddings(reviews, glove_model):
    embeddings = []
    for review in reviews:
        tokens = word_tokenize(review)
        embeddings_sum = np.zeros(glove_model.vector_size)
        word_count = 0
        for token in tokens:
            if token in glove_model:
                embeddings_sum += glove_model[token]
                word_count += 1
        if word_count > 0:
            embeddings.append(embeddings_sum / word_count)
        else:
            embeddings.append(np.zeros(glove_model.vector_size))
    return np.array(embeddings)

glove_50d model

In [30]:
# Extract average embeddings for training data
X_train = extract_average_embeddings(train_data['reviewText'], glove_50d)
y_train = train_data['overall'].values

# Extract average embeddings for test data
X_test = extract_average_embeddings(test_data['reviewText'], glove_50d)
y_test = test_data['overall'].values

# Train logistic regression
lr_glove_50d_model = LogisticRegression()
lr_glove_50d_model.fit(X_train, y_train)

# Predict and calculate accuracy on the test set
y_pred = lr_glove_50d_model.predict(X_test)
accuracy_glove_50d = accuracy_score(y_test, y_pred)

print(f"Logistic Regression Accuracy: {accuracy_glove_50d:.4f}")

Logistic Regression Accuracy: 0.5848


glove_100d model

In [31]:
# Extract average embeddings for training data
X_train = extract_average_embeddings(train_data['reviewText'], glove_100d)
y_train = train_data['overall'].values

# Extract average embeddings for test data
X_test = extract_average_embeddings(test_data['reviewText'], glove_100d)
y_test = test_data['overall'].values

# Train logistic regression
lr_glove_100d_model = LogisticRegression()
lr_glove_100d_model.fit(X_train, y_train)

# Predict and calculate accuracy on the test set
y_pred = lr_glove_100d_model.predict(X_test)
accuracy_glove_100d = accuracy_score(y_test, y_pred)

print(f"Logistic Regression Accuracy: {accuracy_glove_100d:.4f}")

Logistic Regression Accuracy: 0.5848


glove_200d model

In [32]:
# Extract average embeddings for training data
X_train = extract_average_embeddings(train_data['reviewText'], glove_200d)
y_train = train_data['overall'].values

# Extract average embeddings for test data
X_test = extract_average_embeddings(test_data['reviewText'], glove_200d)
y_test = test_data['overall'].values

# Train logistic regression
lr_glove_200d_model = LogisticRegression()
lr_glove_200d_model.fit(X_train, y_train)

# Predict and calculate accuracy on the test set
y_pred = lr_glove_200d_model.predict(X_test)
accuracy_glove_200d = accuracy_score(y_test, y_pred)

print(f"Logistic Regression Accuracy: {accuracy_glove_200d:.4f}")

Logistic Regression Accuracy: 0.5838


In [33]:
# make it in a format to get the below
print(f"Accuracy (50D): {accuracy_glove_50d:.4f}")
print(f"Accuracy (100D): {accuracy_glove_100d:.4f}")
print(f"Accuracy (200D): {accuracy_glove_200d:.4f}")

Accuracy (50D): 0.5848
Accuracy (100D): 0.5848
Accuracy (200D): 0.5838


Almost all the models peform equally well, with golve 50d and 100d having the highest acccuracy.

**Of all the word embeddings models the best performing model is word2vec model using genism with size=200, window=7, min_count= 2,5 and Accuracy: 0.6053**

## RNN


4. **Recurrent Neural Networks (RNNs):**
   - **Description:** RNNs process sequences by maintaining hidden states that capture context. Used for various tasks including sequence generation and classification.
   - **Advantages:** Captures sequential dependencies, flexible for various sequence lengths.
   - **Disadvantages:** Struggles with long-range dependencies, vanishing/exploding gradient problem.

Steps for RNN models:

1. Preprocess
2. Tokenize the text
3. Padding
4. One Hot encodding
5. Function for RNN
6. RNN model

In [34]:
import tensorflow as tf

tokenize the data

In [35]:
# Tokenize the text
train_tokens = train_data['reviewText'].apply(word_tokenize)
test_tokens = test_data['reviewText'].apply(word_tokenize)

# Convert tokens to sequences of indices based on a vocabulary
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(train_tokens)
X_train = tokenizer.texts_to_sequences(train_tokens)
X_test = tokenizer.texts_to_sequences(test_tokens)

padding

In [36]:
# Pad sequences to a fixed length
max_sequence_length = 100  # Adjust as needed
X_train_padded = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_sequence_length, padding='post')
X_test_padded = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_sequence_length, padding='post')

one hot encoding

In [37]:
from sklearn.preprocessing import LabelEncoder

# Create a label encoder
label_encoder = LabelEncoder()

# Fit the encoder on the 'overall' column of your training data
label_encoder.fit(train_data['overall'])

# Transform the labels to integer-encoded labels
y_train_encoded = label_encoder.transform(train_data['overall'])
y_test_encoded = label_encoder.transform(test_data['overall'])

# Convert encoded labels to one-hot encoded vectors
num_classes = len(label_encoder.classes_)
y_train = tf.keras.utils.to_categorical(y_train_encoded, num_classes=num_classes)
y_test = tf.keras.utils.to_categorical(y_test_encoded, num_classes=num_classes)

In [38]:
# Define a function to create and train an RNN model for multiclass classification
def train_rnn_multiclass(num_layers, layer_size):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=300, input_length=max_sequence_length))
    for _ in range(num_layers):
        model.add(tf.keras.layers.SimpleRNN(units=layer_size, return_sequences=True))
    model.add(tf.keras.layers.GlobalMaxPooling1D())
    model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))  # Use softmax for multiclass
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train_padded, y_train, epochs=5, batch_size=16, validation_split=0.1)
    return model

In [40]:
# Train LSTMs with different layer configurations for multiclass classification
layer_sizes = [20, 50, 100, 200]
results_rnn_multiclass = {}

for num_layers in [1, 2, 3, 4]:
    for layer_size in layer_sizes:
        print(f"Training RNN with {num_layers} layers and size {layer_size} for multiclass")
        model = train_rnn_multiclass(num_layers, layer_size)

        # Evaluate the model on the test set
        y_pred = model.predict(X_test_padded)
        y_pred_class = np.argmax(y_pred, axis=1)  # Convert probabilities to predicted class
        accuracy = accuracy_score(np.argmax(y_test, axis=1), y_pred_class)
        results_rnn_multiclass[(num_layers, layer_size)] = accuracy
        print(f"RNN Multiclass Accuracy: {accuracy:.4f}")

Training RNN with 1 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
RNN Multiclass Accuracy: 0.5221
Training RNN with 1 layers and size 50 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
RNN Multiclass Accuracy: 0.5848
Training RNN with 1 layers and size 100 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
RNN Multiclass Accuracy: 0.5899
Training RNN with 1 layers and size 200 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
RNN Multiclass Accuracy: 0.5910
Training RNN with 2 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
RNN Multiclass Accuracy: 0.4851
Training RNN with 2 layers and size 50 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
RNN Multiclass Accuracy: 0.5642
Training RNN with 2 layers and size 100 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
RNN Multiclass Accuracy: 0.5632
Training RNN with 2 layers and size 200 for m

In [41]:
# Print results
print("\nResults for Multiclass RNN:")
for layer_config, accuracy in results_rnn_multiclass.items():
    num_layers, layer_size = layer_config
    print(f"RNN with {num_layers} layers and size {layer_size}: Accuracy - {accuracy:.4f}")


Results for Multiclass RNN:
RNN with 1 layers and size 20: Accuracy - 0.5221
RNN with 1 layers and size 50: Accuracy - 0.5848
RNN with 1 layers and size 100: Accuracy - 0.5899
RNN with 1 layers and size 200: Accuracy - 0.5910
RNN with 2 layers and size 20: Accuracy - 0.4851
RNN with 2 layers and size 50: Accuracy - 0.5642
RNN with 2 layers and size 100: Accuracy - 0.5632
RNN with 2 layers and size 200: Accuracy - 0.5910
RNN with 3 layers and size 20: Accuracy - 0.4964
RNN with 3 layers and size 50: Accuracy - 0.5242
RNN with 3 layers and size 100: Accuracy - 0.5725
RNN with 3 layers and size 200: Accuracy - 0.5498
RNN with 4 layers and size 20: Accuracy - 0.5457
RNN with 4 layers and size 50: Accuracy - 0.4943
RNN with 4 layers and size 100: Accuracy - 0.5663
RNN with 4 layers and size 200: Accuracy - 0.5910


**The best performing model was RNN with 1 layers and size 200: Accuracy - 0.5910, RNN with 2 layers and size 200: Accuracy - 0.5910,
RNN with 4 layers and size 200: Accuracy - 0.5910**

## LSTM

4.b. **Long Short-Term Memory (LSTM):**
   - **Description:** LSTM is an advanced RNN variant designed to alleviate vanishing gradient problem, making it better at learning long-range dependencies.
   - **Advantages:** Handles long sequences, captures context well, mitigates vanishing gradient.
   - **Disadvantages:** Complex architecture, training can be slower.

In [42]:
# Define a function to create and train an LSTM model for multiclass classification
def train_lstm_multiclass(num_layers, layer_size):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=300, input_length=max_sequence_length))
    for _ in range(num_layers):
        model.add(tf.keras.layers.LSTM(units=layer_size, return_sequences=True))
    model.add(tf.keras.layers.GlobalMaxPooling1D())
    model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))  # Use softmax for multiclass
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train_padded, y_train, epochs=5, batch_size=32, validation_split=0.1)
    return model

In [43]:
# Train LSTMs with different layer configurations for multiclass classification
layer_sizes = [20, 100]
results_lstm_multiclass = {}

for num_layers in [1, 2, 3, 4]:
    for layer_size in layer_sizes:
        print(f"Training LSTM with {num_layers} layers and size {layer_size} for multiclass")
        model = train_lstm_multiclass(num_layers, layer_size)

        # Evaluate the model on the test set
        y_pred = model.predict(X_test_padded)
        y_pred_class = np.argmax(y_pred, axis=1)  # Convert probabilities to predicted class
        accuracy = accuracy_score(np.argmax(y_test, axis=1), y_pred_class)
        results_lstm_multiclass[(num_layers, layer_size)] = accuracy
        print(f"LSTM Multiclass Accuracy: {accuracy:.4f}")

Training LSTM with 1 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
LSTM Multiclass Accuracy: 0.5982
Training LSTM with 1 layers and size 100 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
LSTM Multiclass Accuracy: 0.5324
Training LSTM with 2 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
LSTM Multiclass Accuracy: 0.5807
Training LSTM with 2 layers and size 100 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
LSTM Multiclass Accuracy: 0.5324
Training LSTM with 3 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
LSTM Multiclass Accuracy: 0.4882
Training LSTM with 3 layers and size 100 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
LSTM Multiclass Accuracy: 0.4872
Training LSTM with 4 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
LSTM Multiclass Accuracy: 0.5581
Training LSTM with 4 layers and

In [44]:
# Print results
print("\nResults for Multiclass LSTM:")
for layer_config, accuracy in results_lstm_multiclass.items():
    num_layers, layer_size = layer_config
    print(f"LSTM with {num_layers} layers and size {layer_size}: Accuracy - {accuracy:.4f}")


Results for Multiclass LSTM:
LSTM with 1 layers and size 20: Accuracy - 0.5982
LSTM with 1 layers and size 100: Accuracy - 0.5324
LSTM with 2 layers and size 20: Accuracy - 0.5807
LSTM with 2 layers and size 100: Accuracy - 0.5324
LSTM with 3 layers and size 20: Accuracy - 0.4882
LSTM with 3 layers and size 100: Accuracy - 0.4872
LSTM with 4 layers and size 20: Accuracy - 0.5581
LSTM with 4 layers and size 100: Accuracy - 0.4522


**The best performing model was LSTM with 1 layers and size 20: Accuracy - 0.5982**

## BiLSTM

4.c. **Bidirectional LSTM (BiLSTM):**
   - **Description:** Extends LSTM by processing sequences in both forward and backward directions, capturing context from both past and future.
   - **Advantages:** Enhanced context capture, useful for tasks needing full sequence understanding.
   - **Disadvantages:** Doubles computational complexity, might lead to overfitting.

In [45]:
# Define a function to create and train a Bidirectional LSTM model for multiclass classification
def train_bilstm_multiclass(num_layers, layer_size):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=300, input_length=max_sequence_length))
    for _ in range(num_layers):
        model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=layer_size, return_sequences=True)))
    model.add(tf.keras.layers.GlobalMaxPooling1D())
    model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))  # Use softmax for multiclass
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train_padded, y_train, epochs=5, batch_size=32, validation_split=0.1)
    return model

In [46]:
# Train LSTMs with different layer configurations for multiclass classification
layer_sizes = [20, 100]
results_bilstm_multiclass = {}

for num_layers in [1, 2, 3, 4]:
    for layer_size in layer_sizes:
        print(f"Training BiLSTM with {num_layers} layers and size {layer_size} for multiclass")
        model = train_bilstm_multiclass(num_layers, layer_size)

        # Evaluate the model on the test set
        y_pred = model.predict(X_test_padded)
        y_pred_class = np.argmax(y_pred, axis=1)  # Convert probabilities to predicted class
        accuracy = accuracy_score(np.argmax(y_test, axis=1), y_pred_class)
        results_bilstm_multiclass[(num_layers, layer_size)] = accuracy
        print(f"BiLSTM Multiclass Accuracy: {accuracy:.4f}")

Training BiLSTM with 1 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
BiLSTM Multiclass Accuracy: 0.5550
Training BiLSTM with 1 layers and size 100 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
BiLSTM Multiclass Accuracy: 0.5211
Training BiLSTM with 2 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
BiLSTM Multiclass Accuracy: 0.5776
Training BiLSTM with 2 layers and size 100 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
BiLSTM Multiclass Accuracy: 0.5766
Training BiLSTM with 3 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
BiLSTM Multiclass Accuracy: 0.5190
Training BiLSTM with 3 layers and size 100 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
BiLSTM Multiclass Accuracy: 0.5416
Training BiLSTM with 4 layers and size 20 for multiclass
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
BiLSTM Multiclass Accuracy: 0.5200
Tra

In [47]:
# Print results
print("\nResults for Multiclass BiLSTM:")
for layer_config, accuracy in results_bilstm_multiclass.items():
    num_layers, layer_size = layer_config
    print(f"BiLSTM with {num_layers} layers and size {layer_size}: Accuracy - {accuracy:.4f}")


Results for Multiclass BiLSTM:
BiLSTM with 1 layers and size 20: Accuracy - 0.5550
BiLSTM with 1 layers and size 100: Accuracy - 0.5211
BiLSTM with 2 layers and size 20: Accuracy - 0.5776
BiLSTM with 2 layers and size 100: Accuracy - 0.5766
BiLSTM with 3 layers and size 20: Accuracy - 0.5190
BiLSTM with 3 layers and size 100: Accuracy - 0.5416
BiLSTM with 4 layers and size 20: Accuracy - 0.5200
BiLSTM with 4 layers and size 100: Accuracy - 0.5498


The best performing model was BiLSTM with 2 layers and size 20: Accuracy - 0.5776





**Of all the RNN Models the best performing model was LSTM with 1 layers and size 20: Accuracy - 0.5982**

## CONCLUSION

**The best perfoming model turned out to be the  word embeddings word2vec model using genism with size=200, window=7, min_count= 2,5 and Accuracy: 0.6053**

I took 2.5% of the entire dataset to run all the models. Potentially with better hyperparameter tuning running more epochs and taking larger chunck of data the model performance could be improved a lot.

Working with unstructured textual data is tricky and the experiments provided insights into the performance trade-offs, the impact of model complexity, and the significance of leveraging pre-trained embeddings and models.
