# **Deep learning model**

*The goal of this notebook is to implement a deep learning model to improve our classification results.*

Importing useful libraries

In [204]:
# import the necessary libraries
import numpy as np
import pandas as pd
import plotly.express as px

import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from imblearn.over_sampling import SMOTE

import seaborn as sns
import os

from tensorflow import keras

Helper function:

In [145]:
def report(y_true, y_pred, class_labels):
    """Print the classification report and plot the confusion matrix."""
    print(classification_report(y_true, y_pred, labels=class_labels, zero_division=0))

    # Display options for the Plotly heatmap
    confusion_matrix_kwargs = dict(
        text_auto=True,
        title="Confusion Matrix",
        width=1000,
        height=800,
        labels=dict(x="Predicted", y="True Label"),
        x=class_labels,
        y=class_labels,
        color_continuous_scale='Blues'
    )

    # Rows are true labels, columns are predicted labels
    c_m = confusion_matrix(y_true, y_pred, labels=class_labels)
    fig = px.imshow(c_m, **confusion_matrix_kwargs)
    fig.show()

In [146]:
# Load the preprocessed dataset 
df = pd.read_csv('data/APPLE_iPhone_SE_preprocessed.csv')
df.dropna(subset=['Reviews'], inplace=True)

Let's find the longest review in our dataset (its length is used to set the padding length in the next steps).

In [147]:
# Number of words in each review
review_lengths = [len(review.split()) for review in df['Reviews']]

# Position of the longest review
index_of_longest_review = review_lengths.index(max(review_lengths))

# Get the longest review (positional lookup, because dropna leaves gaps in the index labels)
longest_review = df['Reviews'].iloc[index_of_longest_review]

# Number of words
num_words_in_longest_review = len(longest_review.split())

print("the longest review contains {} words.".format(num_words_in_longest_review))

the longest review contains 58 words.
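
The same computation can be done on the word counts directly, which also makes it easy to look at a percentile instead of the maximum when choosing a padding length. This is a minimal sketch, assuming a toy list `reviews` that stands in for `df['Reviews']` (the texts are made up for illustration):

```python
import numpy as np

# Hypothetical toy reviews standing in for df['Reviews'].
reviews = [
    "good phone",
    "battery drains fast but camera is great",
    "ok",
]

# Word count of each review.
lengths = np.array([len(review.split()) for review in reviews])

# Position and length of the longest review.
longest_pos = int(lengths.argmax())
max_len = int(lengths.max())

# A high percentile is a common alternative to the maximum when a few
# reviews are much longer than the rest.
p95 = float(np.percentile(lengths, 95))

print(f"longest review: position {longest_pos}, {max_len} words")
print(f"95th percentile of lengths: {p95:.1f} words")
```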


# RNN implementation on the preprocessed dataset without class redistribution and without oversampling

In this part, we implement a simple sequential deep learning model: an RNN with the following layers:

- 1 embedding layer
- 2 SimpleRNN layers with ReLU activation functions
- 1 output layer with a softmax activation function

The model uses the Adam optimizer and a sparse_categorical_crossentropy loss function, and is trained for 40 epochs.

In [148]:
# Tokenize the text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["Reviews"])

# Calculate max_words
max_words = len(tokenizer.word_index) + 1  # Vocabulary size

# Convert text data to sequences
sequences = tokenizer.texts_to_sequences(df["Reviews"])

# Padding sequences 
max_sequence_length = 58
X = pad_sequences(sequences, maxlen=max_sequence_length, padding='post', truncating='post')

y = df["Ratings"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
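
With `padding='post'` and `truncating='post'`, every sequence is cut or zero-filled at its end so that all rows have the same length. The pure-Python function below is an illustrative re-implementation of that behaviour (the name `pad_post` is hypothetical, and this is not the Keras code itself):

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate each sequence at the end, then pad it at the end with `value`."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                # truncating='post': drop the tail
        kept += [value] * (maxlen - len(kept))   # padding='post': fill the tail
        padded.append(kept)
    return padded


# Toy token-id sequences (made up for illustration).
toy_sequences = [[5, 12, 7], [3], [9, 4, 8, 1, 2]]
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)
```

Index 0 is reserved for padding, which is why the vocabulary size is `len(tokenizer.word_index) + 1`.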

Let's create our model:

In [149]:
# Define the RNN model
model = Sequential()

# Embedding layer to convert text data to dense vectors
model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_sequence_length, name='embedding'))

# Two SimpleRNN layers with ReLU activation functions
model.add(SimpleRNN(64, activation='relu', return_sequences=True, name='rnn1'))
model.add(SimpleRNN(32, activation='relu', name='rnn2'))

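# Dense layer (output layer) with a softmax activation function for multi-class classification.
# 6 units: the sparse loss expects labels in [0, 6) and the ratings run from 1 to 5 (index 0 is unused).
model.add(Dense(6, activation='softmax', name='output'))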

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001),  loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Display model summary
model.summary()

Model: "sequential_42"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 58, 128)           777472    
                                                                 
 rnn1 (SimpleRNN)            (None, 58, 64)            12352     
                                                                 
 rnn2 (SimpleRNN)            (None, 32)                3104      
                                                                 
 output (Dense)              (None, 6)                 198       
                                                                 
Total params: 793126 (3.03 MB)
Trainable params: 793126 (3.03 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
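
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for these layers (an embedding has `vocab * dim` weights, a SimpleRNN has `(inputs + units + 1) * units`, a Dense layer has `(inputs + 1) * units`). The vocabulary size is not printed in the notebook and is inferred here from 777472 / 128.

```python
# Vocabulary size inferred from the embedding row of the summary (assumption).
embed_dim = 128
vocab_size = 777472 // embed_dim  # 6074

# Embedding: one vector of size embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# SimpleRNN: input weights + recurrent weights + bias, per unit.
rnn1_params = (embed_dim + 64 + 1) * 64
rnn2_params = (64 + 32 + 1) * 32

# Dense output: weights + bias, per unit.
output_params = (32 + 1) * 6

total_params = embedding_params + rnn1_params + rnn2_params + output_params
print(embedding_params, rnn1_params, rnn2_params, output_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the model size is driven by the vocabulary rather than by the recurrent layers.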


Let's train the model:

In [152]:
model.fit(X_train, y_train, epochs=40, batch_size=64, validation_split=0.2)

# Evaluate the model for multi-class classification
y_proba = model.predict(X_test)
# The index of the largest probability is the predicted rating (labels are 1 to 5, index 0 is unused)
y_pred = np.argmax(y_proba, axis=-1)


class_labels = [1, 2, 3, 4, 5]
report(y_test, y_pred, class_labels)

Epoch 1/40
...
Epoch 40/40
              precision    recall  f1-score   support

           1       0.32      0.58      0.41        88
           2       0.00      0.00      0.00        44
           3       0.07      0.16      0.10        80
           4       0.25      0.24      0.24       363
           5       0.81      0.74      0.78      1367

    accuracy                           0.60      1942
   macro avg       0.29      0.35      0.31      1942
weighted avg       0.64      0.60      0.62      1942
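
Each row of `y_proba` holds six softmax outputs, and the index of the largest one is the predicted rating (index 0 never occurs as a label). Rounding the probabilities before taking the argmax is not needed and can change the result when two values round to the same number. A small illustration with a made-up probability row:

```python
import numpy as np

# Hypothetical softmax output for one review (indices 0..5, index 0 unused).
proba = np.array([[0.0, 0.002, 0.002, 0.0, 0.497, 0.499]])

# Argmax on the raw probabilities picks the true maximum (rating 5).
pred_raw = np.argmax(proba, axis=-1)

# After rounding to two decimals, ratings 4 and 5 tie at 0.5 and
# argmax returns the first of them (rating 4).
pred_rounded = np.argmax(proba.round(2), axis=-1)

print("raw:", pred_raw, "rounded:", pred_rounded)
```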



## Result Analysis

Overall, the model's performance is similar to that of the baseline model: it classifies the dominant class (5) fairly well (F1 score of 0.78) and the other classes poorly (class 2 is never predicted correctly). However, the F1 scores show a slight improvement over the baseline Gradient Boosting model for some of the minority classes: 1, 3 and 4.
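
The gap between the macro and the weighted averages shows how much class 5 dominates the score. Both can be recomputed from the per-class F1 scores and supports printed in the report above (small differences come from the two-decimal rounding of the report):

```python
# Per-class F1 scores and supports copied from the classification report.
f1 = {1: 0.41, 2: 0.00, 3: 0.10, 4: 0.24, 5: 0.78}
support = {1: 88, 2: 44, 3: 80, 4: 363, 5: 1367}

n_total = sum(support.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / n_total

# Share of the test set that belongs to the dominant class.
majority_share = support[5] / n_total

print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
print(f"share of class 5: {majority_share:.2f}")
```

Class 5 makes up about 70% of the test reviews, so the 0.60 accuracy is below what always predicting 5 would give.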

# RNN implementation on the preprocessed dataset with class redistribution, oversampling and early stopping

In this part, we reuse the previous model with some changes:
- class redistribution ('bad', 'good', 'very good')
- oversampling (using SMOTE)
- early stopping

In [205]:
# Load the preprocessed dataset 
df = pd.read_csv('data/APPLE_iPhone_SE_preprocessed.csv')
df.dropna(subset=['Reviews'], inplace=True)

In [206]:
# Map the ratings to three sentiment classes: 1-2 -> 1 ('bad'), 3 -> 2 ('good'), 4-5 -> 3 ('very good')
rating_mapping = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
df['Sentiment'] = df['Ratings'].map(rating_mapping)
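
The effect of this mapping on the class balance can be estimated from the per-rating test-set supports of the first classification report (88, 44, 80, 363 and 1367 reviews for ratings 1 to 5). A minimal sketch:

```python
from collections import Counter

rating_mapping = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
label_names = {1: 'bad', 2: 'good', 3: 'very good'}

# Test-set supports per rating, copied from the first classification report.
rating_support = {1: 88, 2: 44, 3: 80, 4: 363, 5: 1367}

# Add up the supports of the ratings merged into each sentiment class.
sentiment_support = Counter()
for rating, count in rating_support.items():
    sentiment_support[label_names[rating_mapping[rating]]] += count

total = sum(sentiment_support.values())
for name, count in sentiment_support.items():
    print(f"{name:>9}: {count:4d} ({count / total:.1%})")
```

After merging, 'very good' accounts for roughly 89% of the test reviews, compared with about 70% for rating 5 alone.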


In [207]:
# Tokenize the text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["Reviews"])

# Calculate max_words
max_words = len(tokenizer.word_index) + 1  # Vocabulary size

# Convert text data to sequences
sequences = tokenizer.texts_to_sequences(df["Reviews"])

# Padding sequences 
max_sequence_length = 58
X = pad_sequences(sequences, maxlen=max_sequence_length, padding='post', truncating='post')

y = df['Sentiment']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SMOTE instance
smote = SMOTE(random_state=42)

# Apply SMOTE oversampling to the training set only (the test set keeps its original distribution)
X_train, y_train = smote.fit_resample(X_train, y_train)
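
SMOTE creates a synthetic sample by picking a minority sample, one of its nearest minority neighbours, and a random point on the segment between them. The sketch below reproduces only that interpolation step with NumPy on two made-up padded sequences (it is not the imblearn implementation, and the neighbour is chosen by hand). It suggests that, on token-id inputs, the synthetic row contains ids that belong to neither review.

```python
import numpy as np

# Two hypothetical padded token-id sequences from the same minority class.
x_i = np.array([12, 40, 7, 0, 0])
x_nn = np.array([30, 8, 7, 95, 0])

# Interpolation factor: fixed here, drawn uniformly at random in [0, 1] by SMOTE.
lam = 0.5

# Synthetic point on the segment between the two samples.
synthetic = x_i + lam * (x_nn - x_i)

# Cast back to integers for illustration, since an embedding layer expects ids.
synthetic_ids = synthetic.astype(x_i.dtype)
print(synthetic_ids)
```

Ids such as 21 or 24 are simply whatever words happen to sit at those positions in the vocabulary, which may limit how much the oversampled rows help a sequence model.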

Let's define the model with a few changes:

In [208]:
# Define the RNN model
model_2 = Sequential()

# Embedding layer to convert text data to dense vectors
model_2.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_sequence_length, name='embedding'))

# Two SimpleRNN layers with ReLU activation functions
model_2.add(SimpleRNN(64, activation='relu', return_sequences=True, name='rnn1'))
model_2.add(SimpleRNN(32, activation='relu', name='rnn2'))

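# Dense layer (output layer) with a softmax activation function for multi-class classification.
# 4 units: the sparse loss expects labels in [0, 4) and the sentiment labels run from 1 to 3 (index 0 is unused).
model_2.add(Dense(4, activation='softmax', name='output'))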

# Compile the model
model_2.compile(optimizer=Adam(learning_rate=0.001),  loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Display model summary
model_2.summary()

Model: "sequential_50"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 58, 128)           777472    
                                                                 
 rnn1 (SimpleRNN)            (None, 58, 64)            12352     
                                                                 
 rnn2 (SimpleRNN)            (None, 32)                3104      
                                                                 
 output (Dense)              (None, 4)                 132       
                                                                 
Total params: 793060 (3.03 MB)
Trainable params: 793060 (3.03 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Let's train the new model:

In [209]:
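# EarlyStopping callback: stop when val_loss has not improved for 3 epochs and restore the best weights
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Note: validation_split holds out the last 20% of the rows without shuffling; after SMOTE these
# are mostly synthetic samples, because imblearn appends them after the original rows.
model_2.fit(X_train, y_train, epochs=20, batch_size=64, validation_split=0.2, callbacks=[early_stopping])

# Define a mapping dictionary
label_mapping = {1: 'bad', 2: 'good', 3: 'very good'}

# Evaluate the model for multi-class classification
y_proba = model_2.predict(X_test)
y_pred = np.argmax(y_proba, axis=-1)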

# Map integer labels to string labels for y_pred and y_test
y_pred = [label_mapping[label] for label in y_pred]
y_test = [label_mapping[label] for label in y_test]

class_labels = ['bad', 'good', 'very good']

report(y_test, y_pred, class_labels)



Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
              precision    recall  f1-score   support

         bad       0.19      0.80      0.30       132
        good       0.00      0.00      0.00        80
   very good       0.95      0.76      0.85      1730

    accuracy                           0.73      1942
   macro avg       0.38      0.52      0.38      1942
weighted avg       0.86      0.73      0.77      1942
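
The log stops at epoch 8 of 20, which is what `patience=3` produces if the best validation loss was reached at epoch 5 (the losses themselves are not shown above, so this is an inference). The patience rule can be sketched in plain Python; the function name and the loss values below are hypothetical:

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 1-based.

    Training stops once the loss has failed to improve
    for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch.
    return len(val_losses), best_epoch


# Made-up validation losses: best at epoch 5, then three epochs without improvement.
made_up_losses = [0.90, 0.75, 0.66, 0.61, 0.58, 0.60, 0.59, 0.62, 0.57, 0.55]
stopped, best = early_stopping_epoch(made_up_losses, patience=3)
print(f"stopped at epoch {stopped}, best epoch {best}")
```

With `restore_best_weights=True`, the weights from the best epoch are the ones kept for evaluation.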



## Result Analysis

The overall accuracy has improved to 73%, but this is largely explained by the large number of 'very good' samples (1730 of the 1942 test reviews). Merging the ratings removed the confusion between classes 4 and 5 but reinforced the imbalance: the 'good' class is never predicted correctly (F1 score of 0.00), and 'bad' reaches a recall of 0.80 with a precision of only 0.19. We therefore face the same limitation that we observed with the baseline model.
Thus, combining several techniques (class redistribution, SMOTE oversampling, early stopping) with a deep learning model more complex than the models used before did not allow us to sufficiently overcome the imbalance of the dataset and the limitations it causes.