# Tracking the performance of multiple SVM models using MLflow

## Introduction

Following the generation of our synthetic dataset for E-commerce Shipping Time Prediction, this notebook is dedicated to applying and analyzing Support Vector Machine (SVM) models to our dataset. Our focus will be on predicting the delivery time of packages, a regression problem, utilizing the SVM regression variant, Support Vector Regression (SVR).


- **Objective:** Apply SVM models to predict delivery times for an e-commerce shipping dataset and use MLflow to track the performance of each model.
- **Dataset Features:** Distance to destination and package weight.
- **Target Variable:** Delivery time in hours.
- **Analysis Focus:**
    - Examining the effect of different SVM kernels (linear, RBF, poly) on prediction accuracy.
    - Utilizing hyper-parameter tuning to enhance model performance.

In [1]:
import mlflow

## Importing modules

In [2]:
import pandas as pd
from sklearn.svm import SVR
from matplotlib import pyplot as plt
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

#np.random.seed(1)

I do not need the random seed anymore: with MLflow, the parameters, metrics, and fitted model from every run are logged, so I can retrieve the model or its results from any run of the code below at any time.

## Load data

As the data is synthetically generated, there is no need to further clean or process it, so we can load it and move on with our application.

In [3]:
df = pd.read_csv('./data/delivery_time.csv') # let's use the same data as we did in the logistic regression example
df.head(3)

| | distance | weight | delivery_time |
|---|---|---|---|
| 0 | 911.9 | 101 | 112 |
| 1 | 419.0 | 163 | 110 |
| 2 | 614.5 | 91 | 110 |
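Even with synthetic data, a quick sanity check is cheap. The sketch below rebuilds only the three rows shown above as an in-memory CSV (an assumption made so the example is self-contained; the notebook itself reads `./data/delivery_time.csv`) and confirms there are no missing values and that every column is numeric.

```python
import io

import pandas as pd

# Stand-in for the real file: just the three rows printed by df.head(3).
sample_csv = io.StringIO(
    "distance,weight,delivery_time\n"
    "911.9,101,112\n"
    "419.0,163,110\n"
    "614.5,91,110\n"
)
sample = pd.read_csv(sample_csv)

# Count missing values per column and check that all dtypes are numeric.
missing_per_column = sample.isna().sum()
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in sample.dtypes)

print(missing_per_column.to_dict())
print("all columns numeric:", all_numeric)
```

The same two checks could be run on `df` directly before splitting.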


## Train Test Split:
I am holding out 20% of the data for testing (an 80/20 train-test split) for the following reasons:
- Balanced Dataset Size: Using a test size of 0.2 provides a good balance between training and testing datasets, ensuring enough data for model training while still having a substantial amount to validate model performance.
- Sufficient Testing Data: With 1000 observations, a 0.2 split ensures 200 observations for testing, which is adequate to assess model accuracy and generalizability without significantly reducing the training set size.
- Avoid Overfitting: A larger training set (80%) helps in building a more accurate model while the testing set (20%) is sufficient to evaluate overfitting.

In [4]:
# Use sklearn to split df into a training set and a test set

X = df[['distance','weight']]
y = df['delivery_time']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
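As a rough check of the split arithmetic, the sketch below applies the same `train_test_split` call to a hypothetical 1000-row frame (the column values are made up; only the row count mirrors the dataset). It shows the 800/200 split and that a fixed `random_state` returns the same test rows on every call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same shape as the dataset: 1000 rows, 2 features.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "distance": rng.uniform(100, 1200, size=1000),
    "weight": rng.integers(10, 200, size=1000),
})
demo["delivery_time"] = 0.1 * demo["distance"] + 0.05 * demo["weight"]

X_demo = demo[["distance", "weight"]]
y_demo = demo["delivery_time"]

# Two identical calls: random_state=0 makes the shuffle deterministic.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

n_train, n_test = len(split_a[0]), len(split_a[1])
same_test_rows = split_a[1].index.equals(split_b[1].index)
print(n_train, n_test, same_test_rows)
```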


In [5]:
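import mlflow
import mlflow.sklearn  # make the scikit-learn flavor import explicit

mlflow.set_experiment("svm_linear")
mlflow.start_run()
# Automatically log parameters, CV results, and fitted models for sklearn estimators
mlflow.sklearn.autolog()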








## Choosing Performance metrics for the data

To evaluate and compare the performance of our tuned SVM models, we will consider several metrics:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- R-squared score (R2)

I will select hyper-parameters by MSE, since it measures overall accuracy on delivery times and penalizes large errors most heavily.
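To make the three metrics concrete, here is a small hand computation on three made-up delivery times (the numbers are illustrative, not taken from the dataset). It follows the standard definitions that `mean_absolute_error`, `mean_squared_error`, and `r2_score` implement.

```python
# Made-up true and predicted delivery times, in hours.
y_true = [100.0, 110.0, 120.0]
y_hat = [102.0, 108.0, 125.0]
n = len(y_true)

residuals = [t - p for t, p in zip(y_true, y_hat)]  # -2, 2, -5

mae_demo = sum(abs(r) for r in residuals) / n       # mean of |error|
mse_demo = sum(r * r for r in residuals) / n        # mean of error squared

mean_true = sum(y_true) / n
ss_res = sum(r * r for r in residuals)               # unexplained variation
ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation around the mean
r2_demo = 1 - ss_res / ss_tot

print(mae_demo, mse_demo, round(r2_demo, 3))
```

Because `GridSearchCV` always maximizes its score, the notebook passes `scoring='neg_mean_squared_error'`: maximizing the negated MSE is the same as minimizing MSE.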

In [6]:
performance = pd.DataFrame({"model": [], "MSE": [], "MAE": [], "R2": [], "Parameters": []})

# Modelling and Hyperparameter tuning

## SVM regression model using linear kernel

In [7]:
# defining parameter range 
param_grid = {'C': [ 0.5, 1, 5, 10],  
              'kernel': ['linear']}
  

grid = GridSearchCV(SVR(), param_grid, scoring='neg_mean_squared_error', refit = True, verbose = 3, n_jobs=-1) 
  
# fitting the model for grid search 
_ = grid.fit(X_train, y_train)



Fitting 5 folds for each of 4 candidates, totalling 20 fits


2024/03/09 11:31:13 INFO mlflow.sklearn.utils: Logging the 5 best runs, no runs will be omitted.


The aim is to use both GridSearchCV and MLflow, and to compare my current approach with using MLflow. After logging into the MLflow UI, I viewed the generated cv_results.csv file, which captures all 4 parameter combinations used in the cross-validation and scores each of them, with a detailed breakdown of every iteration GridSearchCV went through.
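The "Fitting 5 folds for each of N candidates" messages printed by the grid searches in this notebook can be reproduced by counting parameter combinations. The sketch below enumerates the three grids used here and multiplies by 5 folds, which is scikit-learn's default when `cv` is not passed.

```python
from itertools import product

# The three parameter grids used in this notebook.
grids = {
    "linear": {"C": [0.5, 1, 5, 10], "kernel": ["linear"]},
    "rbf": {
        "C": [0.1, 1, 10, 100],
        "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
        "kernel": ["rbf"],
    },
    "poly": {
        "C": [0.01, 0.1, 0.5, 1, 5, 10, 50, 100],
        "coef0": [0.01, 0.1, 0.5, 1, 5, 10, 50, 100],
        "kernel": ["poly"],
    },
}
n_folds = 5  # GridSearchCV default when cv is not given

fit_counts = {}
for name, grid_spec in grids.items():
    # Every combination of values is one candidate model.
    candidates = list(product(*grid_spec.values()))
    fit_counts[name] = (len(candidates), len(candidates) * n_folds)

print(fit_counts)
```

With `refit=True`, one further fit of the best candidate on the full training set follows the cross-validation fits.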

In [8]:
print (X_test)
#X_test =X_test.drop("delivery_time",axis = 1)

     distance  weight
993     418.8     143
859     940.3     150
298     484.0      70
553    1047.3     127
672     570.6     171
..        ...     ...
679     837.5      35
722     409.8     137
215     990.0      61
653     778.8     121
150     413.8      71

[200 rows x 2 columns]


In [9]:
# print best parameter after tuning 
print(grid.best_params_) 
  
# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_)


y_pred = grid.predict(X_test) 

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)


performance = pd.concat([performance, pd.DataFrame({"model": ["SVM Linear"], "MSE": [mse], "MAE": [mae], "R2": [r2], "Parameters": [grid.best_params_]})])

mlflow.sklearn.log_model(grid.best_estimator_, "best_model")

mlflow.end_run()

run = mlflow.last_active_run()

run_id = run.info.run_id







{'C': 1, 'kernel': 'linear'}
SVR(C=1, kernel='linear')


In [10]:
print(run_id)

78418611513b4e31879739d230a7238a



In [12]:

eval_data = X_test.copy()
eval_data['delivery_time'] = y_test

mlflow.evaluate(
    f"runs:/{run_id}/model",
    eval_data,
    targets="delivery_time",
    model_type="regressor"
)

2024/03/09 11:33:54 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/03/09 11:33:54 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/03/09 11:33:54 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


<mlflow.models.evaluation.base.EvaluationResult at 0x1e4f3c9cc40>

In [13]:
performance

| | model | MSE | MAE | R2 | Parameters |
|---|---|---|---|---|---|
| 0 | SVM Linear | 28.960946 | 4.523503 | 0.921741 | {'C': 1, 'kernel': 'linear'} |


## SVM regression model using RBF kernel

In [14]:
mlflow.set_experiment("svm_rbf")
mlflow.start_run()
mlflow.sklearn.autolog()

2024/03/09 11:33:54 INFO mlflow.tracking.fluent: Experiment with name 'svm_rbf' does not exist. Creating a new experiment.


In [15]:
# defining parameter range 
param_grid = {'C': [0.1, 1, 10, 100],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['rbf']}
  
grid = GridSearchCV(SVR(), param_grid, scoring='neg_mean_squared_error', refit = True, verbose = 3, n_jobs=-1) 
  
# fitting the model for grid search 
_ = grid.fit(X_train, y_train)



Fitting 5 folds for each of 20 candidates, totalling 100 fits


2024/03/09 11:34:04 INFO mlflow.sklearn.utils: Logging the 5 best runs, 15 runs will be omitted.


In [16]:
# print best parameter after tuning 
print(grid.best_params_) 
  
# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_)

y_pred = grid.predict(X_test) 

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)



performance = pd.concat([performance, pd.DataFrame({"model": ["SVM rbf"], "MSE": [mse], "MAE": [mae], "R2": [r2],"Parameters": [grid.best_params_]})])



{'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
SVR(C=10, gamma=0.0001)


In [17]:
mlflow.end_run()

run = mlflow.last_active_run()

run_id = run.info.run_id

In [18]:
mlflow.evaluate(
    f"runs:/{run_id}/model",
    eval_data,
    targets="delivery_time",
    model_type="regressor"
)

2024/03/09 11:34:04 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/03/09 11:34:04 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/03/09 11:34:04 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


<mlflow.models.evaluation.base.EvaluationResult at 0x1e4f0df3cd0>

## SVM regression model using polynomial kernel

In [19]:
mlflow.set_experiment("svm_poly")
mlflow.start_run()
mlflow.sklearn.autolog()

2024/03/09 11:34:05 INFO mlflow.tracking.fluent: Experiment with name 'svm_poly' does not exist. Creating a new experiment.


In [20]:
# defining parameter range 
param_grid = {'C': [0.01, 0.1, 0.5, 1, 5, 10, 50, 100],  
              'coef0': [0.01, 0.1, 0.5, 1, 5, 10, 50, 100],
              'kernel': ['poly']}
  
grid = GridSearchCV(SVR(), param_grid, scoring='neg_mean_squared_error', refit = True, verbose = 3, n_jobs=-1) 
  
# fitting the model for grid search 
_ = grid.fit(X_train, y_train)



Fitting 5 folds for each of 64 candidates, totalling 320 fits


2024/03/09 11:34:50 INFO mlflow.sklearn.utils: Logging the 5 best runs, 59 runs will be omitted.


In [21]:
# print best parameter after tuning 
print(grid.best_params_) 
  
# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_)

y_pred = grid.predict(X_test) 

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)



performance = pd.concat([performance, pd.DataFrame({"model": ["SVM Poly"], "MSE": [mse], "MAE": [mae], "R2": [r2], "Parameters": [grid.best_params_]})])



{'C': 5, 'coef0': 5, 'kernel': 'poly'}
SVR(C=5, coef0=5, kernel='poly')


In [22]:
mlflow.end_run()

run = mlflow.last_active_run()

run_id = run.info.run_id

In [23]:
mlflow.evaluate(
    f"runs:/{run_id}/model",
    eval_data,
    targets="delivery_time",
    model_type="regressor"
)

2024/03/09 11:34:51 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/03/09 11:34:51 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/03/09 11:34:51 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


<mlflow.models.evaluation.base.EvaluationResult at 0x1e4f0e657c0>

# Performance of each model

In [24]:
performance.sort_values(by="MSE", ascending=True)

| | model | MSE | MAE | R2 | Parameters |
|---|---|---|---|---|---|
| 0 | SVM Linear | 28.960946 | 4.523503 | 0.921741 | {'C': 1, 'kernel': 'linear'} |
| 0 | SVM Poly | 29.654246 | 4.563022 | 0.919867 | {'C': 5, 'coef0': 5, 'kernel': 'poly'} |
| 0 | SVM rbf | 35.787535 | 5.045667 | 0.903294 | {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'} |
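MSE is in squared hours, which is hard to read directly. The short sketch below takes the three test MSE values from the table above, converts them to RMSE (in hours), and expresses each model's MSE as a percentage above the best one.

```python
from math import sqrt

# Test-set MSE values copied from the performance table.
test_mse = {"SVM Linear": 28.960946, "SVM Poly": 29.654246, "SVM rbf": 35.787535}

best_name = min(test_mse, key=test_mse.get)

summary = {}
for name, mse_value in test_mse.items():
    rmse_hours = sqrt(mse_value)  # back to the units of delivery time
    pct_above_best = 100 * (mse_value - test_mse[best_name]) / test_mse[best_name]
    summary[name] = (round(rmse_hours, 2), round(pct_above_best, 1))

print(best_name)
print(summary)
```

Read this way, the linear and polynomial models are within about 2.4% of each other in MSE, while the RBF model is roughly 23.6% worse than the linear one.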


# Summary and Conclusion

- **Model Performance:** With hyper-parameters selected by cross-validated Mean Squared Error (MSE), the linear and polynomial kernels performed best on the test set (MSE 28.96 and 29.65), ahead of the RBF kernel (MSE 35.79).

### Pros and Cons of Each Model Choice:
- **Linear Kernel:** Chosen for its simplicity and ease of interpretation. It works well when the target is approximately linear in the features.
- **Polynomial Kernel:** Selected for its ability to handle non-linear relationships, offering flexibility in modeling complex patterns.
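- **RBF Kernel:** Though it is a powerful kernel capable of modeling complex relationships, it did not perform as expected in our case. Likely causes are overfitting, the unscaled features (RBF kernels are sensitive to feature scale, and no scaling was applied here), or the specific characteristics of our data.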
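One thing this notebook does not test is feature scaling: `StandardScaler` is imported but never applied, and kernel methods such as RBF are sensitive to feature scale. The sketch below shows one way to put scaling inside the grid search using a pipeline. The data is synthetic and the grid values are arbitrary assumptions, so it says nothing about how the ranking above would change; it only illustrates the mechanics.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: delivery time roughly linear in distance and weight, plus noise.
rng = np.random.default_rng(0)
distance = rng.uniform(100, 1200, size=300)
weight = rng.uniform(10, 200, size=300)
X_demo = np.column_stack([distance, weight])
y_demo = 0.1 * distance + 0.05 * weight + rng.normal(0, 5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler is fitted on the training folds only, inside each CV split.
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid_demo = {"svr__C": [1, 10, 100], "svr__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid_demo, scoring="neg_mean_squared_error", cv=5)
search.fit(X_tr, y_tr)

scaled_r2 = r2_score(y_te, search.predict(X_te))
print(search.best_params_, round(scaled_r2, 3))
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`; `make_pipeline` names each step after its lowercased class, hence `svr__C`.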

### Pros and Cons of Each Metric Choice:
- **MSE (Mean Squared Error):** Focuses on penalizing larger errors more heavily, making it suitable for our regression problem where accuracy in predicting delivery times is critical. However, it can be sensitive to outliers.
- **MAE (Mean Absolute Error):** Provides a straightforward measure of error magnitude without heavily penalizing larger errors, offering a more robust metric against outliers compared to MSE. Its downside is that it can understate the impact of occasional large errors.
- **R2 (R-Squared):** Indicates the proportion of variance in the dependent variable that is predictable from the independent variables. While it gives a good indication of fit quality, it doesn't specify the error magnitude.
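The different outlier sensitivity of MSE and MAE is easy to see on a toy example. The error values below are made up for illustration: four errors of 2 hours, then the same set with one error replaced by 20 hours.

```python
def mean_abs(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def mean_sq(errors):
    """Mean squared error of a list of residuals."""
    return sum(e * e for e in errors) / len(errors)


typical = [2.0, -2.0, 2.0, -2.0]        # all predictions off by 2 hours
with_outlier = [2.0, -2.0, 2.0, -20.0]  # one prediction off by 20 hours

mae_ratio = mean_abs(with_outlier) / mean_abs(typical)
mse_ratio = mean_sq(with_outlier) / mean_sq(typical)

print(mean_abs(typical), mean_abs(with_outlier), mae_ratio)
print(mean_sq(typical), mean_sq(with_outlier), mse_ratio)
```

A single large miss raises MAE by a factor of 3.25 but MSE by a factor of 25.75, which is why tuning on MSE pushes the model to avoid large misses in particular.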

### Our Result:
Upon comparing the linear and polynomial models, both have their advantages. The linear model's simplicity and interpretability make it highly valuable for straightforward problems or when explaining the model to stakeholders is necessary. The polynomial model's flexibility is advantageous for capturing more complex relationships in the data, although at the risk of overfitting.

### Linear vs. Polynomial - Pros, Cons, and Output Comparison:
- **Linear Kernel:** Its main advantage lies in simplicity and lower risk of overfitting, making it highly efficient for datasets where the relationship between the variables is approximately linear.
- **Polynomial Kernel:** Offers the ability to capture complex relationships but requires careful tuning of parameters to avoid overfitting.

### Why the Linear Model Is the Better Choice for the Final Output:
The linear model achieved the lowest test MSE of the three kernels, and its simplicity, efficiency, and ease of interpretation make it the preferred choice, especially in a business context where decisions need to be explained to non-technical stakeholders. It strikes a balance between accuracy and model complexity, ensuring that the model is both practical and reliable.

### What These Advantages Mean in the Context of an E-commerce Platform:
In the context of e-commerce shipping time prediction, choosing the right model impacts not only the accuracy of predictions but also the operational efficiency and customer satisfaction. A linear model, with its balance of simplicity and effectiveness, supports timely and reliable delivery predictions. This reliability is crucial for planning, resource allocation, and enhancing the overall customer experience by setting realistic expectations for delivery times.
