<a href="https://colab.research.google.com/github/frxldi-xyz/TensorFlow-LTSM-Multiclass-Amazon-Review/blob/main/ltsm_multiclass_review_amazon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Ferdi Rahmad Rizaldi

13/02/2024

In [None]:
import pandas as pd

import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l1
from tensorflow.keras.regularizers import l2
from tensorflow.keras.regularizers import L1L2

from nltk.tokenize import RegexpTokenizer

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

import re

# **Amazon Review LSTM Multiclass Classification**
Data source: [Kaggle](https://www.kaggle.com/datasets/danielihenacho/amazon-reviews-dataset)

Data summary:
- 3 sentiment classes based on the review (positive, negative, and neutral)
- 17,000+ samples

In [None]:
df = pd.read_csv('cleaned_reviews.csv')
df = df.drop(columns=['cleaned_review_length', 'review_score'])
df.head(10)

Unnamed: 0,sentiments,cleaned_review
0,positive,i wish would have gotten one earlier love it a...
1,neutral,i ve learned this lesson again open the packag...
2,neutral,it is so slow and lags find better option
3,neutral,roller ball stopped working within months of m...
4,neutral,i like the color and size but it few days out ...
5,positive,overall love this mouse the size weight clicki...
6,neutral,it stopped working
7,positive,my son uses school issued chromebook for schoo...
8,negative,loved this cute little mouse but it broke afte...
9,negative,should ve spent the money to get quality produ...


# **Data Transform**
The sentiments column is originally stored as strings. Because the model needs numeric targets, the labels are converted to numbers.

Process:
1. Create dummy (one-hot) variables from the data, so each sentiment class gets its own column holding 1 for true and 0 for false
2. Concatenate them onto the DataFrame as new columns
3. Drop the old sentiments column

In [None]:
category = pd.get_dummies(df.sentiments)
df_new = pd.concat([df, category], axis=1)
df_new = df_new.drop(columns='sentiments')

df_new

Unnamed: 0,cleaned_review,negative,neutral,positive
0,i wish would have gotten one earlier love it a...,0,0,1
1,i ve learned this lesson again open the packag...,0,1,0
2,it is so slow and lags find better option,0,1,0
3,roller ball stopped working within months of m...,0,1,0
4,i like the color and size but it few days out ...,0,1,0
5,overall love this mouse the size weight clicki...,0,0,1
6,it stopped working,0,1,0
7,my son uses school issued chromebook for schoo...,0,0,1
8,loved this cute little mouse but it broke afte...,1,0,0
9,should ve spent the money to get quality produ...,1,0,0
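
To make the encoding concrete, here is a minimal pure-Python sketch of what one-hot encoding does and how a row can be decoded back into its label. The helper names `one_hot` and `decode` are illustrative assumptions, not part of pandas; the class order simply mirrors the dummy columns in the table above.

```python
# Class order matches the dummy columns above (alphabetical).
CLASSES = ['negative', 'neutral', 'positive']

def one_hot(label, classes=CLASSES):
    """Return a list with 1 at the label's position and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]

def decode(row, classes=CLASSES):
    """Map a one-hot (or probability) row back to the class with the largest value."""
    best = max(range(len(row)), key=row.__getitem__)
    return classes[best]

encoded = [one_hot(s) for s in ['positive', 'neutral', 'negative']]
print(encoded)                   # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(decode([0.1, 0.2, 0.7]))   # positive
```

The same `decode` idea is what turns a softmax output row into a predicted sentiment later on.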


# **Data Cleaning**

The cleaned_review column needs to contain strings, but some values were originally loaded as float, which caused problems later on. Every value is therefore cast to a string, lowercased, and re-joined from its word tokens.

In [None]:
tokenizer = RegexpTokenizer(r'\w+')

df_new['cleaned_review'] = df_new['cleaned_review'].apply(lambda x: ' '.join(tokenizer.tokenize(str(x).lower())))

review = df_new['cleaned_review'].values
sentiments = df_new[['negative', 'neutral', 'positive']].values

# **Tokenizer**

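The tokenizer maps each word to an integer index and turns every review into a sequence of indices. The sequences are then padded to a common length so they can be fed to the model.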

In [None]:
# Tokenizer ✔
tokenizer = Tokenizer(num_words=5000, oov_token='x')
tokenizer.fit_on_texts(review)

x = tokenizer.texts_to_sequences(review)

pad_x = pad_sequences(x)
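
As a rough illustration of the step above, the sketch below re-implements the idea in plain Python: words are ranked by frequency, index 1 is reserved for the out-of-vocabulary token, and shorter sequences are left-padded with 0. This is a simplified stand-in, not the Keras implementation; the function names `fit_word_index` and `pad_pre` and the sample sentences are assumptions made for the example.

```python
from collections import Counter

def fit_word_index(texts, oov_token='x'):
    """Rank words by frequency; index 1 is reserved for the OOV token, 0 for padding."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        if word not in word_index:
            word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, num_words=5000, oov_token='x'):
    """Replace each word by its index; unknown or too-rare words get the OOV index."""
    oov = word_index[oov_token]
    return [[word_index[w] if word_index.get(w, num_words) < num_words else oov
             for w in text.split()] for text in texts]

def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    maxlen = max(len(s) for s in sequences)
    return [[value] * (maxlen - len(s)) + s for s in sequences]

train = ['it stopped working', 'love it']      # hypothetical reviews
word_index = fit_word_index(train)
seqs = texts_to_sequences(train + ['it broke'], word_index)
padded = pad_pre(seqs)
print(padded)  # [[2, 3, 4], [0, 5, 2], [0, 2, 1]]
```

The leading zeros here are the same kind of padding visible at the start of each row in the training arrays printed further down.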

# **Split Train and Test**

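A common ML practice is to split the data into training, validation, and test sets. The right proportions depend on the data. Here 20% is held out as the test set, and 20% of the remaining data is held out as the validation set, which gives roughly 64% train, 16% validation, and 20% test.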

In [None]:
# Hold out 20% as the test set, then 20% of the remainder as the validation set ✔
review_train, review_test, sentiments_train, sentiments_test = train_test_split(pad_x, sentiments, test_size=0.2, random_state = 3)
review_train, review_val, sentiments_train, sentiments_val = train_test_split(review_train, sentiments_train, test_size=0.2, random_state = 3)

print(review_train, sentiments_train)

[[  0   0   0 ... 108 355 398]
 [  0   0   0 ...  47   4 943]
 [  0   0   0 ... 823 862  10]
 ...
 [  0   0   0 ...  18   9 633]
 [  0   0   0 ...   6   2  71]
 [  0   0   0 ...  44  86   3]] [[0 0 1]
 [0 0 1]
 [0 0 1]
 ...
 [0 0 1]
 [0 1 0]
 [0 0 1]]
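
The effect of the two consecutive `train_test_split` calls on the set sizes can be checked with simple arithmetic. The sketch below assumes the held-out part is rounded up at each step; `split_sizes` is a hypothetical helper, and 1000 is a made-up sample count rather than the real dataset size.

```python
import math

def split_sizes(n_samples, test_size=0.2, val_size=0.2):
    """Sizes after holding out a test set, then a validation set from the remainder."""
    n_test = math.ceil(test_size * n_samples)
    n_rest = n_samples - n_test
    n_val = math.ceil(val_size * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)  # 1000 is a hypothetical size
print(n_train, n_val, n_test)  # 640 160 200
```

So the model actually trains on about 64% of the data, not 80%.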


# **Model Build**

Build the model with Sequential: an Embedding layer with 128 output dimensions, an LSTM layer, and several Dense layers. To limit overfitting, each hidden Dense layer uses kernel regularization and is followed by Dropout and then BatchNormalization.

In [None]:
# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=128), #Embed ✔

    tf.keras.layers.LSTM(128), #LSTM layer ✔

    tf.keras.layers.Dense(128, kernel_regularizer=L1L2(l1=0.01, l2=0.01), activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.BatchNormalization(),

    tf.keras.layers.Dense(128, kernel_regularizer=L1L2(l1=0, l2=0.01), activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.BatchNormalization(),

    tf.keras.layers.Dense(128, kernel_regularizer=L1L2(l1=0, l2=0.01), activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.BatchNormalization(),

    tf.keras.layers.Dense(128, kernel_regularizer=L1L2(l1=0, l2=0.01), activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.BatchNormalization(),

    tf.keras.layers.Dense(128, kernel_regularizer=L1L2(l1=0, l2=0.01), activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.BatchNormalization(),

    tf.keras.layers.Dense(3, activation='softmax')
])

print(model)

<keras.engine.sequential.Sequential object at 0x7ad61d312c20>
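
Printing the model object only shows its memory address, so it helps to estimate the model size by hand. The sketch below counts parameters per layer with the standard formulas, assuming default Keras layer settings (biases enabled, BatchNormalization with scale and center). The helper function names are illustrative only.

```python
def embedding_params(vocab, dim):
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, units):
    return inputs * units + units

def batchnorm_params(features):
    # gamma and beta (trainable) plus moving mean and variance (non-trainable)
    return 4 * features

total = (
    embedding_params(5000, 128)
    + lstm_params(128, 128)
    + 5 * (dense_params(128, 128) + batchnorm_params(128))
    + dense_params(128, 3)
)
print(total)  # 857091
```

Most of the parameters sit in the Embedding layer (640,000), which is worth keeping in mind when tuning the vocabulary size or embedding dimension.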


# **Model Compile**

Since the labels are one-hot encoded categories, the model is compiled with categorical cross-entropy as the loss, the Adam optimizer, and plain accuracy as the metric.

In [None]:
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])
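
For intuition about the loss, here is a small pure-Python sketch of categorical cross-entropy for a single sample. It is a simplified version of the formula, not the Keras implementation; the clipping value `eps` and the example probabilities are assumptions chosen for illustration.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted probability row."""
    clipped = [min(max(p, eps), 1 - eps) for p in y_pred]  # avoid log(0)
    return -sum(t * math.log(p) for t, p in zip(y_true, clipped))

# True class is 'positive' (last column).
loss_good = categorical_crossentropy([0, 0, 1], [0.1, 0.2, 0.7])
loss_bad = categorical_crossentropy([0, 0, 1], [0.7, 0.2, 0.1])
print(round(loss_good, 4), round(loss_bad, 4))  # 0.3567 2.3026
```

Only the probability assigned to the true class matters, so a confident wrong prediction is penalized much more heavily than a confident right one.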

# **Stop Callback**

A callback that stops training once a target is reached. The minimum acceptable val_accuracy was 0.75 and the target was 0.9 (90%), so training stops as soon as val_accuracy exceeds 0.9.

In [None]:
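class stopCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # logs may be None or lack 'val_accuracy', so read it defensively
        val_accuracy = (logs or {}).get('val_accuracy')
        if val_accuracy is not None and val_accuracy > 0.9: # Minimum 75% accuracy, target 90% ✔
            self.model.stop_training = True

callbacks = stopCallback()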
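
The stopping rule can be tried out without training anything. The sketch below replays a made-up list of validation accuracies and reports the epoch at which the threshold check would fire; `first_stop_epoch` and the history values are hypothetical and only mimic the comparison used in the callback.

```python
def first_stop_epoch(val_accuracies, target=0.9):
    """Return the 1-based epoch where training would stop, or None if never."""
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc is not None and acc > target:
            return epoch
    return None

history = [0.71, 0.80, 0.86, 0.90, 0.91, 0.93]  # hypothetical values
print(first_stop_epoch(history))        # 5
print(first_stop_epoch([0.71, 0.80]))   # None
```

Because the comparison is strict, an epoch that lands exactly on 0.90 does not stop training. If the target is never exceeded, the function returns None, which corresponds to training running for every requested epoch.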

# **Model Fit**

Train the model on the training set and validate it on the validation set built earlier. Make sure to pass stopCallback, since training should stop once the target is reached and we don't want to waste time on all 1000 epochs hahaha :)

In [None]:
num_epochs = 1000
fit = model.fit(review_train,
                sentiments_train,
                epochs=num_epochs,
                validation_data=(review_val, sentiments_val),
                callbacks = [callbacks]
)

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000

# **Visualize**

Visualize accuracy and loss to inspect the training results

In [None]:
# metrics=['accuracy'] is logged under the keys 'accuracy' and 'val_accuracy'
plt.plot(fit.history['accuracy'], label='Train')
plt.plot(fit.history['val_accuracy'], label='Val')
plt.legend()
plt.show()

In [None]:
plt.plot(fit.history['loss'], label='Train')
plt.plot(fit.history['val_loss'], label='Val')
plt.legend()
plt.show()