# **Improve on Baseline Model**


*The goal of this notebook is to improve our baseline model based on the studies made in the previous notebook.* 


Importing necessary libraries

In [11]:
import pandas as pd
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import re
import numpy as np
import random

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import svm
from sklearn.model_selection import GridSearchCV

In [2]:
# Load the preprocessed dataset 
df = pd.read_csv('data/APPLE_iPhone_SE_preprocessed.csv')
df.dropna(subset=['Reviews'], inplace=True)

In this part, we reuse the Model class from the previous notebook, with some modifications that make it easier to apply the different optimisation techniques.

In [3]:
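class Model:
    """Thin wrapper around a scikit-learn Pipeline that holds a single classifier.

    X_train is expected to be already vectorized (e.g. a TF-IDF matrix).
    random_seed and test_size are stored but not used by the methods below.
    """
    def __init__(self, X_train, y_train, model_architecture, random_seed=42, test_size=0.2) -> None:
        self.X_train = X_train
        self.y_train = y_train
        self.model_instance = model_architecture
        self.random_seed = random_seed
        self.test_size = test_size

        # Single-step pipeline: the classifier receives the features as given
        self.pipeline = Pipeline([
            ('classifier', model_architecture)
        ])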

     
    def fit(self):
        # fit self.pipeline to the training data
        self.pipeline.fit(self.X_train, self.y_train)

    def predict(self,X_test):
        return self.pipeline.predict(X_test)

    
    def predict_proba(self,X_test):
        return self.pipeline.predict_proba(X_test)
   
    
    def report(self, y_true, y_pred, class_labels):
        print(classification_report(y_true, y_pred, labels=class_labels, zero_division=0))

        confusion_matrix_kwargs = dict(
            text_auto=True,
            title="Confusion Matrix",
            width=1000,
            height=800,
            labels=dict(x="Predicted", y="True Label"),
            x=class_labels,
            y=class_labels,
            color_continuous_scale='Blues'
        )
        
        c_m = confusion_matrix(y_true, y_pred, labels=class_labels) 
        fig = px.imshow(c_m, **confusion_matrix_kwargs)
        fig.show()

## Change the class distribution

For the rest of the study, we merge ratings 4 and 5 into a single class and do the same for ratings 1 and 2. We end up with only 3 classes, which simplifies the model and should improve the results. This choice reduces the granularity of the ratings, but it is justified by how similar a review rated 4 is to one rated 5, for example.

We are going to establish 3 new classes:

- 'bad': reviews rated 1 or 2
- 'good': reviews rated 3
- 'very good': reviews rated 4 or 5


In [4]:
# Create a mapping dictionary
rating_mapping = {1: 'bad', 2: 'bad', 3: 'good', 4: 'very good', 5: 'very good'}

# Apply the mapping to create a new column 'Sentiment'
df['Sentiment'] = df['Ratings'].map(rating_mapping)

# Check the result
df[['Ratings', 'Sentiment']].head(10)

   Ratings  Sentiment
0        5  very good
1        5  very good
2        5  very good
3        5  very good
4        5  very good
5        4  very good
6        4  very good
7        4  very good
8        4  very good
9        3       good


In [5]:
print(df["Sentiment"].value_counts())
fig = px.histogram(df, x="Sentiment",color="Sentiment",text_auto=True, title="Sentiment distribution")
fig.show()

very good    8499
bad           673
good          535
Name: Sentiment, dtype: int64


In [6]:
# Define X and y
X = df["Reviews"]  # Text: the reviews preprocessed
y = df['Sentiment']  # Outputs: Sentiments

#TF-IDF vectorizer
tfidf = TfidfVectorizer()

# TF-IDF vectorization to the text data 
X_tfidf = tfidf.fit_transform(X)

#Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
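The cell above fits the vectorizer on every review before splitting, so the vocabulary and IDF weights are computed with the test reviews included. A minimal sketch of an alternative ordering follows. It assumes a tiny made-up corpus (`reviews`, `labels`) in place of `df`, and it also adds `stratify`, which the notebook does not use, so that the class proportions are roughly preserved in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for df["Reviews"] and df["Sentiment"]
reviews = (
    [f"great phone love the camera {i}" for i in range(9)]
    + [f"average phone okay battery {i}" for i in range(3)]
    + [f"bad phone poor battery {i}" for i in range(3)]
)
labels = ["very good"] * 9 + ["good"] * 3 + ["bad"] * 3

# Split the raw text first, keeping class proportions similar in both parts
train_txt, test_txt, y_tr, y_te = train_test_split(
    reviews, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit TF-IDF on the training text only, then reuse it on the test text
vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_txt)
X_te = vectorizer.transform(test_txt)

print(X_tr.shape, X_te.shape)
```

Applying this ordering to the real data would likely change the scores reported in this notebook slightly, since the test reviews would no longer influence the features.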

In [7]:
#instantiate the Model class 
model_reduce = Model(X_train, y_train, GradientBoostingClassifier(), random_seed=42, test_size=0.2 )

# fit the model
model_reduce.fit()

# predict and generate classification report
y_pred = model_reduce.predict(X_test)
class_labels=['bad','good','very good']
model_reduce.report(y_test, y_pred,class_labels)

              precision    recall  f1-score   support

         bad       0.58      0.22      0.32       132
        good       0.44      0.05      0.09        80
   very good       0.91      0.99      0.95      1730

    accuracy                           0.90      1942
   macro avg       0.65      0.42      0.45      1942
weighted avg       0.87      0.90      0.87      1942



The overall accuracy has improved to 90%, but this is largely explained by the dominance of the 'very good' class (1730 of the 1942 test samples). By merging classes we eliminated the confusion between ratings 4 and 5, but we also reinforced the imbalance. The model still performs poorly on the other classes (recall of 0.22 for 'bad' and 0.05 for 'good') because they have few samples.
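To put the 90% figure in context, the short calculation below uses the test-set support values from the report above to compute the accuracy of a trivial classifier that always predicts the majority class. It is only an illustration; the variable names are ours.

```python
# Test-set support per class, copied from the classification report above
support = {"bad": 132, "good": 80, "very good": 1730}

total = sum(support.values())
majority = max(support, key=support.get)

# Accuracy obtained by predicting the majority class for every review
baseline_accuracy = support[majority] / total
print(f"Always predicting '{majority}': accuracy = {baseline_accuracy:.3f}")
```

The trivial baseline already reaches about 0.89, so accuracy alone says little here; the per-class recall and the macro-averaged F1 score are more informative.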

## Oversampling

SMOTE (Synthetic Minority Over-sampling Technique) is a method for addressing class imbalance. It generates synthetic samples for the minority classes: starting from an existing minority sample, it picks one of that sample's nearest neighbours of the same class and creates a new point on the segment between them. The goal is to improve the model's performance on the minority classes and to reduce the bias toward the majority class.
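The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration on dense toy data, not the imbalanced-learn implementation; the function `smote_like_samples` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per sample

    base = rng.integers(0, len(X_min), size=n_new)  # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    lam = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    # New point lies on the segment between the base sample and its neighbour
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Four toy minority samples at the corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_minority, n_new=6)
print(synthetic.shape)
```

Because every synthetic point lies between two existing minority samples, oversampling adds no information from outside the region the minority class already occupies.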

In [8]:
#SMOTE instance
smote = SMOTE(random_state=42)

#Apply SMOTE oversampling to the training set
X_train_oversampled, y_train_oversampled = smote.fit_resample(X_train, y_train)


#instantiate the Model class 
model_over = Model(X_train_oversampled, y_train_oversampled, GradientBoostingClassifier(), random_seed=42, test_size=0.2 )

# fit the model
model_over.fit()

# predict and generate classification report
y_pred = model_over.predict(X_test)
class_labels=['bad','good','very good']
model_over.report(y_test, y_pred,class_labels)

              precision    recall  f1-score   support

         bad       0.41      0.47      0.44       132
        good       0.10      0.31      0.16        80
   very good       0.96      0.85      0.90      1730

    accuracy                           0.81      1942
   macro avg       0.49      0.55      0.50      1942
weighted avg       0.88      0.81      0.84      1942



## Undersampling

RandomUnderSampler addresses class imbalance the opposite way: it randomly removes samples from the larger classes. With sampling_strategy='auto', every class except the smallest one is reduced to the size of the smallest class.
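As a rough illustration of random undersampling, the sketch below keeps, for each class, as many randomly chosen samples as the smallest class contains. The helper `random_undersample` and the toy labels are hypothetical and are not part of imbalanced-learn.

```python
from collections import Counter

import numpy as np

def random_undersample(y, seed=42):
    """Return sorted indices keeping as many samples per class as the smallest class has."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n_keep = min(Counter(y.tolist()).values())  # size of the smallest class
    kept = [
        rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
        for label in np.unique(y)
    ]
    return np.sort(np.concatenate(kept))

# Toy labels with the same kind of imbalance as the review data
y_toy = ["very good"] * 8 + ["bad"] * 3 + ["good"] * 2
idx = random_undersample(y_toy)
balanced_counts = Counter(np.asarray(y_toy)[idx].tolist())
print(balanced_counts)
```

The balanced set is much smaller than the original one, which is the main cost of this approach: most of the majority-class reviews are discarded.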

In [9]:
#RandomUnderSampler instance
undersampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)

# Apply random undersampling to the training set
X_train_undersampled, y_train_undersampled = undersampler.fit_resample(X_train, y_train)

#instantiate the Model class 
model_under = Model(X_train_undersampled, y_train_undersampled, GradientBoostingClassifier(), random_seed=42, test_size=0.2 )

# fit the model
model_under.fit()

# predict and generate classification report
y_pred = model_under.predict(X_test)
class_labels=['bad','good','very good']
model_under.report(y_test, y_pred,class_labels)

              precision    recall  f1-score   support

         bad       0.26      0.63      0.37       132
        good       0.08      0.49      0.14        80
   very good       0.98      0.65      0.78      1730

    accuracy                           0.64      1942
   macro avg       0.44      0.59      0.43      1942
weighted avg       0.90      0.64      0.73      1942



In both experiments the overall accuracy dropped (0.90 without resampling, 0.81 with oversampling, 0.64 with undersampling). However, recall on the minority classes improved markedly, which was the desired goal: for 'bad', recall rose from 0.22 to 0.47 with oversampling and to 0.63 with undersampling, and the F1 score rose from 0.32 to 0.44 with oversampling. The model is now less biased toward the majority class. We keep the oversampled model, which has the higher macro-averaged F1 score (0.50 versus 0.43 for the undersampled model).
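The comparison can be checked by averaging the per-class F1 scores from the three reports above. The snippet only copies those numbers; the dictionary keys are labels chosen for this illustration.

```python
# Per-class F1 scores copied from the three classification reports above
f1_scores = {
    "no resampling": {"bad": 0.32, "good": 0.09, "very good": 0.95},
    "SMOTE oversampling": {"bad": 0.44, "good": 0.16, "very good": 0.90},
    "random undersampling": {"bad": 0.37, "good": 0.14, "very good": 0.78},
}

# Macro F1 is the unweighted mean over classes, so each class counts equally
macro_f1 = {name: sum(s.values()) / len(s) for name, s in f1_scores.items()}
for name, value in macro_f1.items():
    print(f"{name}: macro F1 = {value:.2f}")

best = max(macro_f1, key=macro_f1.get)
print("Highest macro F1:", best)
```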

## Tuning hyperparameters

In [13]:
# parameter grid to search over
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200]
}

#Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(random_state=42)

#GridSearchCV
grid_search = GridSearchCV(gb_classifier, param_grid, cv=3, scoring='accuracy')

#Fit the grid search with data
grid_search.fit(X_train_oversampled, y_train_oversampled)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

#Get the best estimator 
best_model = grid_search.best_estimator_

# Use the best model for prediction
y_pred = best_model.predict(X_test)

#classification report
class_labels=['bad','good','very good']
print(classification_report(y_test, y_pred, labels=class_labels, zero_division=0))


Best Parameters: {'learning_rate': 0.2, 'n_estimators': 200}
              precision    recall  f1-score   support

         bad       0.44      0.54      0.49       132
        good       0.12      0.24      0.16        80
   very good       0.96      0.90      0.93      1730

    accuracy                           0.85      1942
   macro avg       0.51      0.56      0.52      1942
weighted avg       0.89      0.85      0.87      1942



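The grid search selected 'learning_rate': 0.2 and 'n_estimators': 200. Compared with the default oversampled model, accuracy rose from 0.81 to 0.85 and macro F1 from 0.50 to 0.52. The F1 score improved for 'bad' (0.44 to 0.49) and 'very good' (0.90 to 0.93) but stayed at 0.16 for 'good', whose recall fell from 0.31 to 0.24. Both selected values are the largest in the grid, so a wider grid might yield further gains.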