<h1 style="color: green;">  Smart Cart </h1>

### Modeling. Part 4

#### INDEX

- [KNN Classifier](#6)
- [Hyperparameter Tuning using Grid Search Cross-Validation](#7)
- [Random Forest and SVC](#8)
- [Hyperparameter Tuning using Grid Search Cross-Validation](#9)

<a id='INDEX'></a>

In this notebook we continue building and analyzing ML models: a KNN classifier, Random Forest, and SVC, each with hyperparameter tuning where needed.

In [None]:
# Import required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression


# Get rid of the warning message
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('data/final_df.csv')
df

Unnamed: 0,aa,acai,ad,age,air,ale,alfresco,almond,almondmilk,altern,...,yellow,yoghurt,yogurt,yokid,zero,zucchini,add_to_cart_order,reordered,order_number,days_since_prior_order
0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,4.0,9.0
1,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,7.0,1.0,14.0,16.0
2,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,15.0,7.0
3,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,11.0,1.0,4.0,14.0
4,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,8.0,0.0,5.0,30.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46643,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,6.0,1.0,52.0,7.0
46644,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,15.0,0.0,7.0,30.0
46645,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,14.0,1.0,31.0,7.0
46646,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0,6.0,14.0


The dataset is loaded and ready for modeling.

In [None]:
# First, assign X and y. The target column is 'reordered', so we drop it from X and use it as y.

# Store data in X and y
X = df.drop(columns='reordered') #input
y = df['reordered'] #output

In [None]:
# Split our data into train and test sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

## KNN Classifier
<a id='6'></a>

K-Nearest Neighbors (KNN) classifier is a simple yet effective machine learning algorithm used for classification tasks. In KNN, the classification of a data point is determined by the majority class among its 'k' nearest neighbors.
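To make the majority-vote idea concrete, here is a minimal from-scratch sketch of a KNN prediction for a single point. The function name `knn_predict` and the toy arrays are hypothetical and only for illustration; the notebook itself uses scikit-learn's `KNeighborsClassifier`, which is the better choice for real data.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two small clusters (illustrative values, not from the dataset)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([0.95, 1.0]), k=3)
print(pred_a, pred_b)
```

Because the vote depends on raw distances, features with large numeric ranges can dominate the result, which is why the features are standardized before fitting KNN.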

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardizing the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled.shape, X_test_scaled.shape, y_train.shape, y_test.shape

((196923, 502), (49231, 502), (196923,), (49231,))

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# Training the KNN classifier
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)

# Predicting on the test set
y_pred = knn.predict(X_test_scaled)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the results so the report renders as a table instead of an escaped string
print(f"Accuracy: {accuracy:.4f}")
print(classification_rep)

Accuracy: 0.6265
              precision    recall  f1-score   support

         0.0       0.54      0.48      0.51     19760
         1.0       0.68      0.72      0.70     29471

    accuracy                           0.63     49231
   macro avg       0.61      0.60      0.60     49231
weighted avg       0.62      0.63      0.62     49231

- Overall accuracy: 62.65%. The model predicts the correct class (0.0 or 1.0) for about 62.65% of the test cases.
- Precision: when the model predicts class 1.0, it is correct 68% of the time; for class 0.0, it is correct 54% of the time.
- Recall: the model finds 72% of the actual class 1.0 cases but only 48% of the class 0.0 cases.

The model performs acceptably on class 1.0 but is weak on class 0.0. There is room for improvement, particularly in handling the class imbalance (19,760 vs. 29,471 test samples) and in predicting class 0.0.
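As a reminder of how the per-class figures in a classification report are derived, the following sketch computes precision and recall from a confusion matrix. The label lists are made-up toy values, not taken from the notebook's data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (illustrative only): 4 true negatives-class samples, 6 positive-class samples
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 1, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Class 1: of everything predicted as 1, how much was right (precision),
# and of everything that really is 1, how much was found (recall)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0: the same idea with the roles of the classes swapped
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
```

A low recall for class 0.0, as in the report above, means many actual class 0.0 cases are being predicted as class 1.0.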

Next, we search for better hyperparameters.

### Hyperparameter Tuning using Grid Search Cross-Validation
<a id='7'></a>

We tune the KNN hyperparameters with a grid search and 5-fold cross-validation (cv=5):

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
# Define X and y
X = df.drop('reordered', axis=1)
y = df['reordered']

# Split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [None]:
# Initialize KNeighborsClassifier
knn = KNeighborsClassifier()

# Define the parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_

# Best score
best_score = grid_search.best_score_

# Validation set evaluation
y_pred_val = grid_search.predict(X_val)
classification_rep_val = classification_report(y_val, y_pred_val)

# Test set evaluation
y_pred_test = grid_search.predict(X_test)
classification_rep_test = classification_report(y_test, y_pred_test)

# Output the best parameters, best score, and classification reports
print("Best parameters:", best_params)
print("Best CV accuracy:", best_score)
print("Validation set report:")
print(classification_rep_val)
print("Test set report:")
print(classification_rep_test)

Best parameters: {'metric': 'manhattan', 'n_neighbors': 9, 'weights': 'uniform'}
Best CV accuracy: 0.6298488387216741
Validation set report:
              precision    recall  f1-score   support

         0.0       0.54      0.48      0.51      4856
         1.0       0.68      0.73      0.70      7192

    accuracy                           0.63     12048
   macro avg       0.61      0.61      0.61     12048
weighted avg       0.62      0.63      0.62     12048

Test set report:
              precision    recall  f1-score   support

         0.0       0.57      0.49      0.53      4919
         1.0       0.68      0.74      0.71      7129

    accuracy                           0.64     12048
   macro avg       0.62      0.62      0.62     12048
weighted avg       0.63      0.64      0.63     12048

The grid search selected metric='manhattan', n_neighbors=9, and weights='uniform'.

Next, we fit a KNN model with these parameters and evaluate its performance:

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Initialize KNN with specified parameters
knn = KNeighborsClassifier(metric='manhattan', n_neighbors=9, weights='uniform')

# Fit the model on the training data
knn.fit(X_train, y_train)

# Predict and evaluate on the validation set
y_val_pred = knn.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)

# Predict and evaluate on the test set
y_test_pred = knn.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Output results for comparison
print("Validation Set Accuracy:", val_accuracy)
print("Test Set Accuracy:", test_accuracy)

Validation Set Accuracy: 0.7632114910494158
Test Set Accuracy: 0.7654876741693462


The model performs similarly on the validation set and on the unseen test set, which suggests it is not overfitting to the validation data.

Accuracy is 76.32% on the validation set and 76.55% on the test set.
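The grid search in this section was fit on unscaled features, while the baseline KNN used `StandardScaler`. One possible way to combine the two is to put the scaler and the classifier in a scikit-learn `Pipeline` and tune that, so the scaler is refit on the training folds only. The sketch below assumes synthetic data from `make_classification` as a stand-in for the real feature matrix; the names `X_demo`, `pipe`, and `search` are hypothetical, and the results will not match the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scaling happens inside the pipeline, so each CV fold is scaled independently
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Same grid as above; the "knn__" prefix routes each parameter to the KNN step
param_grid_demo = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_te, y_te))
```

With the default `refit=True`, `search` is retrained on the full training split with the best parameters, so it can be used directly for prediction without building a second model by hand.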

### Models comparison:

In [39]:
# Create an empty dataframe to store the metrics
metrics_df = pd.DataFrame(columns=['Hyperparameter', 'Training Accuracy', 'Test Accuracy', 'Notes'])

# Add the metrics for the baseline models
metrics_df.loc['Logistic regression'] = ['none', '66.46%', '66.39%', 'good performance']
metrics_df.loc['Logistic regression2'] = ['scaled', '66.574%', '66.514%', 'performance improved after scaling']
metrics_df.loc['Decision Tree'] = ['None', '66.499%', '66.095%', 'performance is relatively consistent']
metrics_df.loc['Decision Tree2'] = ['depth=11,min_s_leaf=15,min_s_split=6', '66.465%', '66.111%', 'slightly better performance for class 0']
metrics_df.loc['KNN Classifier'] = ['scaled', '62.65%', '62.65%', 'low performance']
metrics_df.loc['KNN Classifier2'] = ['manhattan, n_neighbors=9, w=uniform', val_accuracy, test_accuracy, 'high performance']

metrics_df

| Model | Hyperparameter | Training Accuracy | Test Accuracy | Notes |
|---|---|---|---|---|
| Logistic regression | none | 66.46% | 66.39% | good performance |
| Logistic regression2 | scaled | 66.574% | 66.514% | performance improved after scaling |
| Decision Tree | None | 66.499% | 66.095% | performance is relatively consistent |
| Decision Tree2 | depth=11,min_s_leaf=15,min_s_split=6 | 66.465% | 66.111% | slightly better performance for class 0 |
| KNN Classifier | scaled | 62.65% | 62.65% | low performance |
| KNN Classifier2 | manhattan, n_neighbors=9, w=uniform | 0.763211 | 0.776099 | high performance |


Let's continue our analysis with the Random Forest and SVC models:

## Random Forest and SVC:
<a id='8'></a>


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from tqdm import tqdm

In [None]:
# Initialize the models
rf = RandomForestClassifier(n_estimators=100)
svc = SVC()

# Fit the models and make predictions
results = {}
for model_name, model in tqdm([('RandomForest', rf), ('SVC', svc)], desc='Training Models'):
    # Fit the model
    model.fit(X_train, y_train)

    # Predict on training set
    y_train_pred = model.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_pred)

    # Predict on test set
    y_test_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Store results
    results[model_name] = {'Training Accuracy': train_accuracy, 'Test Accuracy': test_accuracy}

# Output the results
for model_name, accuracies in results.items():
    print(f"{model_name} Training Accuracy: {accuracies['Training Accuracy']:.4f}")
    print(f"{model_name} Test Accuracy: {accuracies['Test Accuracy']:.4f}")


Training Models: 100%|██████████| 2/2 [04:56<00:00, 148.46s/it]

RandomForest Training Accuracy: 0.9142
RandomForest Test Accuracy: 0.7967
SVC Training Accuracy: 0.8245
SVC Test Accuracy: 0.8243





The SVC model generalizes well: its training accuracy (82.45%) and test accuracy (82.43%) are nearly identical, so there is no sign of significant overfitting. Based on these results, it is the preferable choice for this dataset.

The Random Forest model reaches a high training accuracy (91.42%) but drops to 79.67% on the test set. It is overfitting and needs adjustment to generalize better.
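The overfitting judgment comes down to the gap between training and test accuracy. A small sketch of that arithmetic, using the accuracies printed in the output above (the dictionary name `results_reported` is hypothetical):

```python
# Accuracies copied from the printed results above
results_reported = {
    "RandomForest": {"train": 0.9142, "test": 0.7967},
    "SVC": {"train": 0.8245, "test": 0.8243},
}

# A large positive gap means the model fits the training data
# much better than it fits unseen data
gaps = {
    name: round(acc["train"] - acc["test"], 4)
    for name, acc in results_reported.items()
}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
```

The Random Forest gap is about 11.75 percentage points, while the SVC gap is about 0.02 percentage points.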

Next, we tune the Random Forest hyperparameters to reduce overfitting:

### Hyperparameter Tuning using Grid Search Cross-Validation
<a id='9'></a>

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]     # Minimum number of samples required to be at a leaf node
}

# Initialize the model
rf = RandomForestClassifier()

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters found
print("Best Parameters: ", grid_search.best_params_)


Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters:  {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 100}
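The "108 candidates, totalling 540 fits" line follows directly from the size of the grid: 3 × 4 × 3 × 3 = 108 parameter combinations, each fit on 5 folds. A short sketch that checks this with scikit-learn's `ParameterGrid` (the variable names are hypothetical):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
rf_param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# ParameterGrid enumerates every combination of the listed values
n_candidates = len(ParameterGrid(rf_param_grid))
n_folds = 5
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Counting the fits before launching a search is a cheap way to estimate runtime, since each added value in any list multiplies the total.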


In [16]:
# Split our data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

In [18]:
# Initialize the Random Forest model with the best parameters
rf_optimized = RandomForestClassifier(max_depth=10, min_samples_leaf=1,
                                      min_samples_split=10, n_estimators=100)

# Fit the model on the training data
rf_optimized.fit(X_train, y_train)

# Make predictions on the training data
y_train_pred = rf_optimized.predict(X_train)
rf2_train = accuracy_score(y_train, y_train_pred)

# Make predictions on the test data
y_test_pred = rf_optimized.predict(X_test)
rf2_test = accuracy_score(y_test, y_test_pred)

# Output the results
print("RandomForest (Optimized) Training Accuracy:", rf2_train)
print("RandomForest (Optimized) Test Accuracy:", rf2_test)


RandomForest (Optimized) Training Accuracy: 0.7820028405284455
RandomForest (Optimized) Test Accuracy: 0.7764201500535906


RandomForest (Optimized) Training Accuracy: 0.7822440174719297

RandomForest (Optimized) Test Accuracy: 0.7760986066452304

### Models comparison:

In [40]:
# Add the metrics for all models evaluated so far
metrics_df.loc['Logistic regression'] = ['none', '66.46%', '66.39%', 'good performance']
metrics_df.loc['Logistic regression2'] = ['scaled', '66.574%', '66.514%', 'performance improved after scaling']
metrics_df.loc['Decision Tree'] = ['none', '66.499%', '66.095%', 'performance is relatively consistent']
metrics_df.loc['Decision Tree2'] = ['mdepth=11, min_s_leaf=15, min_s_split=6', '66.465%', '66.111%', 'slightly better performance for class 0']
metrics_df.loc['KNN Classifier'] = ['scaled', '62.65%', '62.65%', 'low performance']
metrics_df.loc['KNN Classifier2'] = ['manhattan, n_neighbors=9, w=uniform', '76.32%', '77.61%', 'high performance']
metrics_df.loc['RandomForest'] = ['none', '91.42%', '79.67%', 'overfitting, needs adjustment']
metrics_df.loc['SVC'] = ['', '82.45%', '82.43%', 'the highest performance']
metrics_df.loc['RandomForest optimized'] = ['mdepth: 10,min_s_leaf: 1,min_s_split: 10,n_est: 100', '78.22%', '77.61%', 'overfitting reduced, good performance']

metrics_df

| Model | Hyperparameter | Training Accuracy | Test Accuracy | Notes |
|---|---|---|---|---|
| Logistic regression | none | 66.46% | 66.39% | good performance |
| Logistic regression2 | scaled | 66.574% | 66.514% | performance improved after scaling |
| Decision Tree | none | 66.499% | 66.095% | performance is relatively consistent |
| Decision Tree2 | mdepth=11, min_s_leaf=15, min_s_split=6 | 66.465% | 66.111% | slightly better performance for class 0 |
| KNN Classifier | scaled | 62.65% | 62.65% | low performance |
| KNN Classifier2 | manhattan, n_neighbors=9, w=uniform | 76.32% | 77.61% | high performance |
| RandomForest | none | 91.42% | 79.67% | overfitting, needs adjustment |
| SVC | | 82.45% | 82.43% | the highest performance |
| RandomForest optimized | mdepth: 10,min_s_leaf: 1,min_s_split: 10,n_est... | 78.22% | 77.61% | overfitting reduced, good performance |


SVC performs the best on the test set, indicating it might be the most effective model among those listed, especially if the unseen data behaves similarly to the test set.

The Support Vector Classifier (SVC) is a powerful and versatile machine learning model used primarily for classification tasks, though it can also be adapted for regression. It's part of the Support Vector Machine (SVM) family and is particularly known for its effectiveness in high-dimensional spaces and its versatility in handling various types of data.
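As a rough illustration of SVC on higher-dimensional data, the sketch below fits an RBF-kernel SVC inside a scaling pipeline. It assumes synthetic data from `make_classification` rather than the notebook's dataset, and the variable names are hypothetical, so the printed accuracies say nothing about the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic high-dimensional data (assumption: stands in for the real features)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=50, n_informative=10, random_state=0
)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Scale inside the pipeline, then fit an RBF-kernel SVC
# (RBF is scikit-learn's default kernel for SVC)
svc_demo = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svc_demo.fit(X_syn_train, y_syn_train)

svc_train_acc = svc_demo.score(X_syn_train, y_syn_train)
svc_test_acc = svc_demo.score(X_syn_test, y_syn_test)
print(f"Train accuracy: {svc_train_acc:.4f}")
print(f"Test accuracy:  {svc_test_acc:.4f}")
```

Comparing the two printed accuracies gives the same kind of generalization check used for the models in the comparison table.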

---

**Machine learning (ML) models have become increasingly valuable in predicting customer behavior in online grocery shopping, providing insights that can drive sales, enhance customer experience, and streamline operations.** 

**ML models in online grocery retail can lead to more effective business strategies, improved customer satisfaction, and increased revenue.**