# Classification model

In this notebook, we apply classification instead of regression to see whether performance improves. We use only the OpenAI embeddings, since these performed best for regression.

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [7]:
df_sample = pd.read_csv('embeddings/sample_embed_reviews.csv')
df_sample['ada_embedding']

0       [0.006420237943530083, -0.0034434357658028603,...
1       [-0.013516747392714024, 0.013427401892840862, ...
2       [-0.005110813304781914, -0.0019843303598463535...
3       [-0.01380288228392601, -0.006777149625122547, ...
4       [0.0034719263203442097, -0.005105964373797178,...
                              ...                        
4995    [-0.01762218214571476, 0.0030267485417425632, ...
4996    [0.012827948667109013, 0.007280903868377209, 0...
4997    [-0.003466191468760371, -0.005143802147358656,...
4998    [-0.010686701163649559, -0.004436171147972345,...
4999    [-0.008222557604312897, 0.00519371684640646, 0...
Name: ada_embedding, Length: 5000, dtype: object

In [8]:
import ast

# Parse the string representations into actual lists (literal_eval is safer than eval)
df_sample['ada_embedding'] = df_sample['ada_embedding'].apply(ast.literal_eval)

# Convert lists to numpy arrays
df_sample['ada_embedding'] = df_sample['ada_embedding'].apply(np.array)

# Check the type of the first element
print(type(df_sample['ada_embedding'][0]))


<class 'numpy.ndarray'>


In [9]:
len(df_sample['ada_embedding'][0])

1536

In [10]:
# Convert embeddings from 'ada_embedding' column into a matrix
X = pd.DataFrame(df_sample['ada_embedding'].tolist())

# Ratings will be our target variable
y = df_sample['Rating out of 5']

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
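
The split above is purely random, so the share of each rating in the test set can drift from its share in the full sample. A minimal sketch of a stratified alternative follows. It assumes stand-in data: `X_demo` and `y_demo` are hypothetical and are not the review embeddings. It uses the `stratify` argument of `train_test_split`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 random 8-dimensional "embeddings"
# with deliberately imbalanced ratings (40 / 20 / 40).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 8)))
y_demo = pd.Series([1.0] * 40 + [3.0] * 20 + [5.0] * 40)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Per-class counts in the 20-row test split
test_counts = y_te.value_counts().sort_index()
print(test_counts)
```

With stratification the 20-row test split keeps the 40/20/40 ratio of the full sample. Whether this changes the results reported in this notebook has not been checked here.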


In [11]:
# Initialize and train the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=None, min_samples_split=2, min_samples_leaf=1)
rf_classifier.fit(X_train, y_train)

# Predict ratings on the test set
y_pred_rf = rf_classifier.predict(X_test)

# Calculate and print evaluation metrics
accuracy = accuracy_score(y_test, y_pred_rf)
classificationReport = classification_report(y_test, y_pred_rf)
confusionMatrix = confusion_matrix(y_test, y_pred_rf)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classificationReport)
print("Confusion Matrix:\n", confusionMatrix)

Accuracy: 0.64
Classification Report:
               precision    recall  f1-score   support

         1.0       0.62      0.77      0.69       194
         2.0       0.49      0.46      0.47       197
         3.0       0.59      0.57      0.58       228
         4.0       0.70      0.63      0.66       191
         5.0       0.84      0.81      0.82       190

    accuracy                           0.64      1000
   macro avg       0.65      0.65      0.65      1000
weighted avg       0.64      0.64      0.64      1000

Confusion Matrix:
 [[150  29  13   1   1]
 [ 63  90  40   4   0]
 [ 25  51 130  18   4]
 [  2  10  33 121  25]
 [  0   2   6  29 153]]


The model performs best on class 5.0 (F1 of 0.82) and reasonably well on class 1.0 (F1 of 0.69, driven by a recall of 0.77).
It struggles more with classes 2.0, 3.0, and 4.0 (F1 of 0.47, 0.58, and 0.66).
Most errors are confusions between adjacent classes, which is common in ordinal classification tasks: for example, 63 of the 197 reviews rated 2.0 are predicted as 1.0, and 51 of the 228 reviews rated 3.0 are predicted as 2.0. Hence, in the next part we group classes 1 and 2, and classes 4 and 5, to see how the model performs.
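
The amount of adjacent-class confusion can be read directly off the confusion matrix printed above. The sketch below copies that matrix by hand and computes how many predictions are exactly right and how many are off by one star. The variable names are illustrative only:

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true rating 1-5, columns: predicted rating 1-5).
cm = np.array([
    [150,  29,  13,   1,   1],
    [ 63,  90,  40,   4,   0],
    [ 25,  51, 130,  18,   4],
    [  2,  10,  33, 121,  25],
    [  0,   2,   6,  29, 153],
])

total = cm.sum()
exact = np.trace(cm)  # predictions on the main diagonal
# Predictions one star too high (offset=1) or one star too low (offset=-1)
adjacent = np.trace(cm, offset=1) + np.trace(cm, offset=-1)
errors = total - exact

exact_accuracy = exact / total
within_one_accuracy = (exact + adjacent) / total
adjacent_error_share = adjacent / errors

print(f"Exact accuracy:       {exact_accuracy:.3f}")
print(f"Within-one accuracy:  {within_one_accuracy:.3f}")
print(f"Share of errors that are off by one: {adjacent_error_share:.3f}")
```

For this matrix, 644 of 1000 predictions are exact and a further 288 are off by a single star, so about 81% of the errors fall in an adjacent class.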

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [None]:
import ast

# Load dataset
df_sample = pd.read_csv('embeddings/sample_embed_reviews.csv')

# Parse the string representations into actual lists (literal_eval is safer than eval)
df_sample['ada_embedding'] = df_sample['ada_embedding'].apply(ast.literal_eval)

# Convert lists to numpy arrays
df_sample['ada_embedding'] = df_sample['ada_embedding'].apply(np.array)

In [13]:

# Convert embeddings from 'ada_embedding' column into a matrix
X = pd.DataFrame(df_sample['ada_embedding'].tolist())

# Ratings will be our target variable
y = df_sample['Rating out of 5']

# Map the original ratings to the new class labels
y = y.map({1: 1, 2: 1, 3: 2, 4: 3, 5: 3})

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
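
One caveat with `Series.map` and a dictionary is that any value missing from the dictionary silently becomes `NaN`. A small check, sketched here on hypothetical ratings rather than on `df_sample`, can catch that before training:

```python
import pandas as pd

# Hypothetical ratings, including one unexpected value (0.0)
# to show what happens when a rating is not in the mapping.
ratings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 0.0])

# Same grouping as in the cell above: {1, 2} -> 1, {3} -> 2, {4, 5} -> 3
grouping = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
grouped = ratings.map(grouping)

# Count ratings that had no entry in the mapping
n_unmapped = int(grouped.isna().sum())
print(grouped.tolist())
print(f"Unmapped ratings: {n_unmapped}")
```

If the count is non-zero, the affected rows would need to be dropped or the mapping extended before calling `train_test_split`.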


In [14]:
# Initialize and train the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=None, min_samples_split=2, min_samples_leaf=1)
rf_classifier.fit(X_train, y_train)

# Predict ratings on the test set
y_pred_rf = rf_classifier.predict(X_test)

# Calculate and print evaluation metrics
accuracy = accuracy_score(y_test, y_pred_rf)
classificationReport = classification_report(y_test, y_pred_rf)
confusionMatrix = confusion_matrix(y_test, y_pred_rf)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classificationReport)
print("Confusion Matrix:\n", confusionMatrix)


Accuracy: 0.78
Classification Report:
               precision    recall  f1-score   support

           1       0.70      0.94      0.81       391
           2       0.78      0.26      0.39       228
           3       0.87      0.91      0.89       381

    accuracy                           0.78      1000
   macro avg       0.78      0.71      0.70      1000
weighted avg       0.78      0.78      0.74      1000

Confusion Matrix:
 [[368  10  13]
 [128  60  40]
 [ 26   7 348]]


The overall accuracy rises to 0.78, but this is driven mainly by classes 1 and 3 (recall of 0.94 and 0.91).
Class 2 has a recall of only 0.26: 128 of its 228 test reviews are predicted as class 1.
Its precision of 0.78 means that when the model does predict class 2 it is usually correct, but it predicts this class rarely, which produces many false negatives. This is consistent with the imbalance created by the grouping: classes 1 and 3 each pool two of the original ratings, so they have 391 and 381 test samples against 228 for class 2, and the training split presumably shows a similar ratio.
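
To make the gap between overall accuracy and per-class behaviour explicit, the per-class precision and recall can be recomputed from the confusion matrix above. The sketch below copies the matrix by hand and also reports the mean per-class recall (often called balanced accuracy), which is not distorted by class size:

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class 1-3, columns: predicted class 1-3).
cm = np.array([
    [368,  10,  13],
    [128,  60,  40],
    [ 26,   7, 348],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # row sums = true samples per class
precision = true_positives / cm.sum(axis=0)  # column sums = predictions per class

accuracy = true_positives.sum() / cm.sum()
balanced_accuracy = recall.mean()            # unweighted mean of per-class recall

print("Recall per class:   ", np.round(recall, 2))
print("Precision per class:", np.round(precision, 2))
print(f"Accuracy:          {accuracy:.2f}")
print(f"Balanced accuracy: {balanced_accuracy:.2f}")
```

The balanced accuracy comes out at 0.71 against an accuracy of 0.78, which matches the macro-average recall in the classification report and reflects the weak recall on class 2.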