# Linear SVC
* In this notebook we train a model using `LinearSVC`, experiment with various hyperparameters using grid search, and try to arrive at the best `Linear SVC` version.

## Install Libraries

In [2]:
# %pip install scikit-learn

## Import Libraries

In [4]:
import os
import sys
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score,recall_score,precision_score,precision_recall_curve
import seaborn as sns


# Build an absolute path from this notebook's parent directory
module_path = os.path.abspath(os.path.join('..'))

# Add to sys.path if not already present
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.utils import preprocessing
from src.utils import common
from src.utils.training import refit_strategy

## Initialize Directories

In [5]:
data_root_dir = Path("..", "data/")
models_root_dir = Path("..", "models/")

## Read Data

In [6]:
X_train = pd.read_csv(Path(data_root_dir,"X_train.csv"))
y_train = pd.read_csv(Path(data_root_dir,"y_train.csv"))

## Training Default Model

In [8]:
# import sklearn


# sklearn.metrics.get_scorer_names() 

In [9]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.pipeline import Pipeline


default_linear_svc_model = LinearSVC(max_iter=1000)

model_pipeline = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", default_linear_svc_model)
])

scoring = ["recall", "precision", "f1"]

default_linear_svc_scores = cross_validate(
    estimator=model_pipeline, 
    X=X_train, 
    y=y_train.values.ravel(), 
    cv=3, scoring=scoring,
    n_jobs=-1, verbose=2)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 20 concurrent workers.


[CV] END .................................................... total time=   1.9s
[CV] END .................................................... total time=   1.9s
[CV] END .................................................... total time=   1.9s


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    3.2s finished


In [10]:
default_linear_svc_scores

{'fit_time': array([1.19797039, 1.11491108, 1.08650899]),
 'score_time': array([0.72383809, 0.82775974, 0.8064034 ]),
 'test_recall': array([0.89508724, 0.88383838, 0.88360882]),
 'test_precision': array([0.85504386, 0.8519584 , 0.86223118]),
 'test_f1': array([0.87460745, 0.86760563, 0.87278912])}

In [11]:
cv_scores = default_linear_svc_scores


In [12]:
mean_recall,mean_precision,mean_f1 = common.calculate_mean_from_cv(default_linear_svc_scores)

Mean Recall: 0.8875114784205693, Mean Precision: 0.8875114784205693,Mean F1: 0.8716673989116179


In [13]:
# Comment this call out when re-running the notebook to avoid overwriting the metrics file.
common.update_models_metrics("Linear SVC", "v0", mean_recall,mean_precision,mean_f1)

Unnamed: 0,model,version,recall,precision,f1,file
0,Logistic Regression,v0,0.885369,0.885369,0.870934,
1,Logistic Regression,v1,0.885675,0.85711,0.87114,logistic_regression_v1.joblib
2,Logistic Regression,v2,0.886746,0.857321,0.871766,logistic_regression_v2.joblib
3,Linear SVC,v0,0.887511,0.887511,0.871667,


Observations:
* With an average recall of `0.8875` and an average precision of about `0.8564` (the mean of the per-fold `test_precision` values above; the printed "Mean Precision" repeats the recall value), we already have a better model than the baseline estimator. Although the recall is lower than that of the most-frequent version of the baseline, the average precision and F1 score make this model more promising.
* This model does slightly better than `Logistic Regression`, but the improvement is not significant.
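
As a cross-check on the averages discussed above, the per-fold scores returned by `cross_validate` can be averaged directly. This is only a minimal sketch: `mean_cv_metrics` is a hypothetical helper (it is not the project's `common.calculate_mean_from_cv`, whose implementation is not shown in this notebook), and the fold values are copied from the `In [10]` output.

```python
from statistics import fmean

# Per-fold test scores copied from the cross_validate output above.
cv_scores = {
    "test_recall": [0.89508724, 0.88383838, 0.88360882],
    "test_precision": [0.85504386, 0.8519584, 0.86223118],
    "test_f1": [0.87460745, 0.86760563, 0.87278912],
}


def mean_cv_metrics(scores):
    """Hypothetical helper: average every `test_*` entry on its own."""
    prefix = "test_"
    return {
        key[len(prefix):]: fmean(values)
        for key, values in scores.items()
        if key.startswith(prefix)
    }


means = mean_cv_metrics(cv_scores)
for metric, value in means.items():
    print(f"Mean {metric}: {value:.4f}")
```

Averaging each metric from its own array gives a mean precision of roughly `0.856`, which is worth comparing against the "Mean Precision" value printed by the helper.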


## GridSearch CV v1

In [None]:
## checking params
# preprocessing.pipeline.get_params()

In [16]:
from sklearn.model_selection import GridSearchCV


model_pipeline = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", LinearSVC(max_iter=10000, random_state=42))
])

scoring = ["recall", "precision", "f1"]

param_grid = {
    "prediction__penalty": ["l1", "l2"],
    "prediction__C": [0.1, 1, 10]
}

grid_search = GridSearchCV(model_pipeline, param_grid, scoring=scoring, cv=3,n_jobs=-1,refit=refit_strategy)
grid_search.fit(X_train, y_train.values.ravel())
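
When `scoring` lists several metrics, `GridSearchCV` has to be told which candidate to refit. One option scikit-learn supports is passing a callable as `refit`: it receives `cv_results_` and returns the integer index of the chosen candidate. The project's `refit_strategy` lives in `src.utils.training` and is not shown in this notebook, so the function below is only a hypothetical illustration of the shape such a callable could take. The name `pick_by_recall`, the `min_precision` threshold, and the toy scores are all assumptions.

```python
def pick_by_recall(cv_results, min_precision=0.85):
    """Hypothetical refit callable: return the index of the chosen candidate.

    Keeps candidates whose mean precision reaches `min_precision`, then picks
    the one with the highest mean recall (ties broken by mean F1).
    """
    recall = list(cv_results["mean_test_recall"])
    precision = list(cv_results["mean_test_precision"])
    f1 = list(cv_results["mean_test_f1"])

    eligible = [i for i, p in enumerate(precision) if p >= min_precision]
    if not eligible:
        # Fall back to all candidates if none meets the precision floor.
        eligible = list(range(len(recall)))
    return max(eligible, key=lambda i: (recall[i], f1[i]))


# Toy stand-in for grid_search.cv_results_ (made-up numbers, three candidates).
toy_results = {
    "mean_test_recall": [0.90, 0.88, 0.89],
    "mean_test_precision": [0.80, 0.86, 0.87],
    "mean_test_f1": [0.85, 0.87, 0.88],
}

best_index = pick_by_recall(toy_results)
print(best_index)
```

A function of this shape would be passed as `refit=pick_by_recall`; the value it returns is what `GridSearchCV` then exposes as `best_index_`.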


In [17]:
print(f"Best Parameters: {grid_search.best_params_}")
# Note: best_index_ is the row index of the selected candidate in cv_results_, not a score.
print(f"Best Scores : {grid_search.best_index_}")
average_recall,average_precision,average_f1 = common.read_best_mean_grid_search_metrics(grid_search.cv_results_,grid_search.best_index_)

Best Parameters: {'prediction__C': 10, 'prediction__penalty': 'l1'}
Best Scores : 4
Mean Recall: 0.8874349556167737, Mean Precision: 0.856400224418305,Mean F1: 0.8716248628750242
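
The number printed next to "Best Scores" is a position in the list of candidates, not a metric. To our understanding, scikit-learn enumerates a grid by sorting the parameter names and taking the Cartesian product of their values, with the last name varying fastest. The sketch below reproduces that ordering with `itertools` alone; `enumerate_candidates` is a hypothetical helper, not a scikit-learn function.

```python
from itertools import product

# Same grid as in the GridSearchCV v1 cell.
param_grid = {
    "prediction__penalty": ["l1", "l2"],
    "prediction__C": [0.1, 1, 10],
}


def enumerate_candidates(grid):
    """Hypothetical helper: list candidates in sorted-key, product order."""
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


candidates = enumerate_candidates(param_grid)
print(len(candidates))
print(candidates[4])
```

Under that assumption, index `4` corresponds to `C=10` with the `l1` penalty, which agrees with the best parameters reported above.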


In [18]:
# Comment these calls out when re-running the notebook to avoid overwriting the saved model and metrics file.
_, file_name = common.save_model(
    "Linear SVC", "v1", grid_search.best_estimator_)
common.update_models_metrics("Linear SVC", "v1", average_recall,
                             average_precision, average_f1, file_name=file_name)
common.update_model_params(
    "LinearSVC", "v1", grid_search.best_params_)

[{'name': 'LogisticRegression',
  'version': 'v1',
  'params': {'prediction__C': 1,
   'prediction__penalty': 'l1',
   'prediction__solver': 'liblinear'}},
 {'name': 'LogisticRegression',
  'version': 'v2',
  'params': {'prediction__C': 1,
   'prediction__penalty': 'l2',
   'prediction__solver': 'saga',
   'preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding': 'ordinal',
   'preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding': 'ordinal',
   'preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding': 'ordinal',
   'preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding': 'ordinal',
   'preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding': 'onehot',
   'preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding': 'ordinal'}},
 {'name': 'LinearSVC',
  'version': 'v1',
  'params': {'prediction__C': 10, 'prediction__penalty': 'l1'}}]

## GridSearch CV v2

In [19]:
preprocessing.pipeline.get_params(deep=True)

{'force_int_remainder_cols': True,
 'n_jobs': None,
 'remainder': 'drop',
 'sparse_threshold': 0.3,
 'transformer_weights': None,
 'transformers': [('preprocess_gender',
   Pipeline(steps=[('default_cat_pipeline',
                    Pipeline(steps=[('fill_empty_strings',
                                     FunctionTransformer(feature_names_out='one-to-one',
                                                         func=<function fill_empty_strings_fn at 0x7f0fee6bc680>)),
                                    ('strip_spaces',
                                     FunctionTransformer(feature_names_out='one-to-one',
                                                         func=<function strip_spaces_fn at 0x7f0fee6bc900>)),
                                    ('to_lower_case',
                                     FunctionTransformer(feature_names_out='one-to-one',
                                                         func=<function to_lower_case_fn at 0x7f0fee04c7c0>)),
                

In [20]:

model_pipeline = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", LinearSVC(max_iter=10000, random_state=42))
])

scoring = ["recall", "precision", "f1"]

## experiment between onehot and ordinal encoding of various features.

param_grid = {
    "preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding":["onehot", "ordinal"],
    "preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding":["onehot", "ordinal"],
    "preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding":["onehot", "ordinal"],
    "preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding":["onehot", "ordinal"],
    "preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding":["onehot", "ordinal"],
    "preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding":["onehot", "ordinal"],
    "prediction__penalty": ["l1", "l2"],
    "prediction__C": [0.1, 1, 10]
}

grid_search = GridSearchCV(model_pipeline, param_grid, scoring=scoring, cv=3,n_jobs=-1,refit=refit_strategy)
grid_search.fit(X_train, y_train.values.ravel())
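
Adding six encoding switches makes this grid much larger than the v1 grid, so it helps to estimate the cost before running it. The sketch below only counts combinations; it uses the same parameter names as the cell above and does not fit anything.

```python
from math import prod

encodings = ["onehot", "ordinal"]

# Same grid as in the GridSearchCV v2 cell.
param_grid = {
    "preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding": encodings,
    "preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding": encodings,
    "preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding": encodings,
    "preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding": encodings,
    "preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding": encodings,
    "preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding": encodings,
    "prediction__penalty": ["l1", "l2"],
    "prediction__C": [0.1, 1, 10],
}

cv = 3  # number of folds used in the cell above

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

That is 384 candidates and 1152 cross-validation fits, compared with 6 candidates and 18 fits for the v1 grid.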


In [21]:
print(f"Best Parameters: {grid_search.best_params_}")
# Note: best_index_ is the row index of the selected candidate in cv_results_, not a score.
print(f"Best Scores : {grid_search.best_index_}")
average_recall,average_precision,average_f1 = common.read_best_mean_grid_search_metrics(grid_search.cv_results_,grid_search.best_index_)

Best Parameters: {'prediction__C': 0.1, 'prediction__penalty': 'l1', 'preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding': 'ordinal', 'preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding': 'ordinal', 'preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding': 'ordinal', 'preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding': 'ordinal', 'preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding': 'onehot', 'preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding': 'onehot'}
Best Scores : 60
Mean Recall: 0.887894092439547, Mean Precision: 0.8568450952713432,Mean F1: 0.8720757855317863


In [22]:
# Comment these calls out when re-running the notebook to avoid overwriting the saved model and metrics file.
_, file_name = common.save_model(
    "Linear SVC", "v2", grid_search.best_estimator_)
common.update_models_metrics("Linear SVC", "v2", average_recall,
                             average_precision, average_f1, file_name=file_name)
common.update_model_params(
    "LinearSVC", "v2", grid_search.best_params_)

[{'name': 'LogisticRegression',
  'version': 'v1',
  'params': {'prediction__C': 1,
   'prediction__penalty': 'l1',
   'prediction__solver': 'liblinear'}},
 {'name': 'LogisticRegression',
  'version': 'v2',
  'params': {'prediction__C': 1,
   'prediction__penalty': 'l2',
   'prediction__solver': 'saga',
   'preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding': 'ordinal',
   'preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding': 'ordinal',
   'preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding': 'ordinal',
   'preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding': 'ordinal',
   'preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding': 'onehot',
   'preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding': 'ordinal'}},
 {'name': 'LinearSVC',
  'version': 'v1',
  'params': {'prediction__C': 10, 'prediction__penalty': 'l1'}},
 {'name': 'LinearSVC',
  'versi

Observations:
* In general, there is no significant improvement between the two versions of Linear SVC.
* We need to try a few more linear and non-linear models to see if we can get better performance.