In [1]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    accuracy_score,
)
import pickle

# Support Vector Classifier (SVC)

This section covers the Support Vector Classifier (SVC), a versatile classification algorithm. SVC separates classes by finding the hyperplane that maximizes the margin between them while penalizing misclassified points.

## Key Characteristics of Support Vector Classifier (SVC):

- **Margin Maximization**: SVC aims to find a decision boundary (hyperplane) that maximizes the margin between data points of different classes.
- **Kernel Trick**: Kernel functions let SVC measure similarity in a high-dimensional feature space without explicitly transforming the data, which keeps complex decision boundaries computationally tractable.
- **Effective in Non-Linear Cases**: SVC can capture complex, non-linear relationships between features using different kernel functions.

Support Vector Classifier is a valuable tool for classification tasks in various domains, including image recognition, text classification, and bioinformatics.
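
To make the effect of the kernel concrete, here is a minimal sketch on toy data. It assumes scikit-learn's `make_circles` generator (two concentric rings, not the coupon dataset) and default `SVC` settings apart from the kernel; the variable names are illustrative only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data (an assumption, not the coupon dataset): two concentric rings
# that no straight line can separate.
X_toy, y_toy = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Same algorithm, two kernels: only the RBF kernel can bend the boundary.
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"rbf kernel accuracy:    {rbf_acc:.2f}")
```

On data like this the RBF kernel should clearly outperform the linear one, which is the non-linear case described in the list above.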

In this section, we will explore, optimize, and evaluate an SVC model for our specific dataset, preparing it for later comparison with other classification models.

For detailed implementation steps, please consult the notebook "Final_Project_Data_Gen."

---


Let's load our dataset from CSV files, with `X_train` and `y_train` as our training features and labels, and `X_test` and `y_test` for testing, using Pandas DataFrames to facilitate data manipulation and analysis.


In [2]:
X_train = pd.read_csv("train_X_In-Car-Rec.csv")
y_train = pd.read_csv("train_y_In-Car-Rec.csv")
X_test = pd.read_csv("test_X_In-Car-Rec.csv")
y_test = pd.read_csv("test_y_In-Car-Rec.csv")

Here is where we will establish our custom `scorer`. We have decided to use a custom F2 score to prioritize recall (minimizing the chance of not giving a coupon to someone who would use it) while considering the value of precision (minimizing the chance of giving a coupon to someone who will not use it).


In [3]:
# Build an F2 scorer: beta=2 weights recall more heavily than precision
scorer = make_scorer(fbeta_score, beta=2)
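
For reference, the F-beta score combines precision P and recall R as (1 + beta²)·P·R / (beta²·P + R). The sketch below checks that formula against `fbeta_score` on a small hand-made label vector; the labels are invented purely for illustration.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hand-made labels (illustrative only): 3 true positives, 1 false negative,
# 2 false positives.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0]

p = precision_score(y_true_demo, y_pred_demo)  # 3 / (3 + 2)
r = recall_score(y_true_demo, y_pred_demo)     # 3 / (3 + 1)

beta = 2
f2_manual = (1 + beta**2) * p * r / (beta**2 * p + r)
f2_sklearn = fbeta_score(y_true_demo, y_pred_demo, beta=beta)
f1_sklearn = fbeta_score(y_true_demo, y_pred_demo, beta=1)

print(f"precision={p:.3f}, recall={r:.3f}")
print(f"F2 (manual)={f2_manual:.4f}, F2 (sklearn)={f2_sklearn:.4f}, F1={f1_sklearn:.4f}")
```

Because recall is higher than precision in this toy case, F2 comes out above F1, which is the behavior we want from a recall-leaning metric.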

### Building a Pipeline with Parameter Grid for Hyperparameter Tuning

Now that the data is loaded, we can build a pipeline. A pipeline chains the preprocessing and modeling steps into a single estimator, making the workflow easier to manage. In this example, our pipeline will consist of the following steps:

- **Dimensionality Reduction**: Using `PCA` with 10 components.
- **Feature Scaling**: Using `StandardScaler`.
- **Model Training**: Using `SVC` (Support Vector Classifier) for classification.

Let's go ahead and build this pipeline.


Determining the grid-search parameters


In [4]:
param_grid_svc = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", "auto", 0.1, 1],
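    "svc__degree": [2, 3, 4],  # only used by kernel="poly"; ignored by "linear" and "rbf"
    "svc__coef0": [0.0, 0.1, 1.0],  # only used by "poly" and "sigmoid"; ignored here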
    "svc__shrinking": [True, False],
    "svc__class_weight": [None, "balanced"],
    "svc__max_iter": [10000, 50000],
}
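
It is worth checking how large this grid is before fitting. The sketch below counts the combinations with `ParameterGrid` and compares them with a leaner grid that drops parameters the chosen kernels ignore. The leaner grid (`param_grid_lean`) is a hypothetical alternative, not the grid used in this notebook, and it also leaves `shrinking` and `max_iter` at their defaults as a simplifying assumption.

```python
from sklearn.model_selection import ParameterGrid

# The grid from the cell above, repeated so this block is self-contained.
param_grid_full = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", "auto", 0.1, 1],
    "svc__degree": [2, 3, 4],
    "svc__coef0": [0.0, 0.1, 1.0],
    "svc__shrinking": [True, False],
    "svc__class_weight": [None, "balanced"],
    "svc__max_iter": [10000, 50000],
}

# Hypothetical leaner grid: one dict per kernel, listing only the
# parameters that kernel actually uses.
param_grid_lean = [
    {
        "svc__kernel": ["linear"],
        "svc__C": [0.1, 1, 10],
        "svc__class_weight": [None, "balanced"],
    },
    {
        "svc__kernel": ["rbf"],
        "svc__C": [0.1, 1, 10],
        "svc__gamma": ["scale", "auto", 0.1, 1],
        "svc__class_weight": [None, "balanced"],
    },
]

n_full = len(ParameterGrid(param_grid_full))
n_lean = len(ParameterGrid(param_grid_lean))
cv_folds = 5

print(f"full grid: {n_full} combinations, {n_full * cv_folds} fits")
print(f"lean grid: {n_lean} combinations, {n_lean * cv_folds} fits")
```

The full grid has 3 × 2 × 4 × 3 × 3 × 2 × 2 × 2 = 1,728 combinations, many of which fit identical models because `degree` and `coef0` have no effect on the linear and RBF kernels.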

We're constructing a pipeline for the Support Vector Classifier (SVC) with three steps. First, Principal Component Analysis (PCA) reduces the data to 10 components, which lowers computational cost and lets the classifier work with the directions of greatest variance. Then, StandardScaler standardizes the resulting components. Finally, the SVC acts as the classifier.


In [5]:
# Create the pipeline for SVC
pipeline = Pipeline(
    [("pca", PCA(n_components=10)), ("scaler", StandardScaler()), ("svc", SVC())]
)
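
One design choice to be aware of: PCA picks the directions of largest variance, so when it runs before scaling, features measured on larger scales can dominate the components. A common alternative is to standardize first and apply PCA second. The sketch below illustrates the difference on made-up data (random numbers with one inflated feature); whether the reordering changes results on the coupon data is untested here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data (an assumption): five independent features, one of them
# on a scale 1000x larger than the rest.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 0] *= 1000.0

# PCA on the raw data: the large-scale feature dominates.
pca_raw = PCA(n_components=2).fit(X_demo)

# Alternative ordering: scale first, then PCA.
scale_then_pca = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA(n_components=2))]
).fit(X_demo)

raw_ratio = pca_raw.explained_variance_ratio_[0]
scaled_ratio = scale_then_pca.named_steps["pca"].explained_variance_ratio_[0]

print(f"first component, PCA on raw data:      {raw_ratio:.4f} of variance")
print(f"first component, PCA after scaling:    {scaled_ratio:.4f} of variance")
```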

Let's establish a `GridSearchCV` object, `grid_search`, to perform a systematic search over the hyperparameter space defined in `param_grid_svc`. This grid search will optimize our Support Vector Classifier (SVC) pipeline using 5-fold cross-validation, evaluating its performance with the custom scoring function `scorer`, and making use of all available CPU cores (`n_jobs=-1`) for parallel processing.


In [6]:
# Create GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid_svc, cv=5, scoring=scorer, n_jobs=-1)

We initiate the `GridSearchCV` process by fitting it to our training data, `X_train` and `y_train`. This step systematically explores different combinations of hyperparameters for the Support Vector Classifier (SVC) within our pipeline, aiming to identify the optimal configuration that maximizes classification performance.


In [7]:
# Fit GridSearchCV
grid_search.fit(X_train, y_train.values.ravel())

Now let's extract the best-performing estimator, `best_svc`, from our grid search results. Additionally, we retrieve the optimal hyperparameters, stored in `best_params`, and the highest cross-validation score achieved, indicated by `best_score`. These values provide valuable insights into the ideal configuration of our Support Vector Classifier (SVC) model and its performance on the dataset.


In [8]:
# Get the best parameters and score
best_svc = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_
best_params, best_score

({'svc__C': 0.1,
  'svc__class_weight': None,
  'svc__coef0': 0.0,
  'svc__degree': 2,
  'svc__gamma': 1,
  'svc__kernel': 'rbf',
  'svc__max_iter': 10000,
  'svc__shrinking': True},
 0.8673946364835304)
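
Beyond the single best configuration, `cv_results_` holds the scores for every combination, which helps show how close the runners-up were. A minimal sketch, using a small synthetic dataset and a tiny grid as stand-ins (both are assumptions, not this notebook's data or grid):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data and a deliberately small grid.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=3
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination; sort so the best-ranked come first.
results = pd.DataFrame(demo_search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(3)
print(top)
```

If several configurations score within one standard deviation of each other, the simplest of them is usually a reasonable pick.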

### Final pipeline


We establish our final pipeline, `svc_pipeline`, which has three components: Principal Component Analysis (PCA) with 10 components for dimensionality reduction, feature scaling using StandardScaler, and a Support Vector Classifier (SVC) configured with the hyperparameters selected by the grid search. This pipeline is what we train and evaluate below.


In [6]:
# final pipeline
svc_pipeline = Pipeline(
    [
        ("pca", PCA(n_components=10)),
        ("scaler", StandardScaler()),
        (
            "svc",
            SVC(
                C=0.1,
                class_weight=None,
                coef0=0.0,
                degree=2,
                gamma=1,
                kernel="rbf",
                max_iter=10000,
                shrinking=True,
            ),
        ),
    ]
)
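
Typing the winning hyperparameters by hand works, but it is easy to mistype a value. Two less error-prone options are sketched below on synthetic stand-in data (an assumption, as are the `demo_*` names): reuse `best_estimator_`, which `GridSearchCV` has already refit on the full training set when `refit=True` (the default), or copy `best_params_` onto a fresh pipeline with `set_params`.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the coupon dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipeline = Pipeline(
    [("pca", PCA(n_components=4)), ("scaler", StandardScaler()), ("svc", SVC())]
)
demo_search = GridSearchCV(demo_pipeline, {"svc__C": [0.1, 1, 10]}, cv=3)
demo_search.fit(X_demo, y_demo)

# Option 1: the search already refit the best pipeline on all of X_demo.
from_search = demo_search.best_estimator_

# Option 2: copy the winning parameters onto a fresh, unfitted pipeline.
rebuilt = clone(demo_pipeline).set_params(**demo_search.best_params_)
rebuilt.fit(X_demo, y_demo)

same = bool((from_search.predict(X_demo) == rebuilt.predict(X_demo)).all())
print(f"identical predictions: {same}")
```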

Let's proceed to train our final Support Vector Classifier (SVC) pipeline, `svc_pipeline`, using the training dataset, `X_train`, and the corresponding target values, `y_train`, with the labels reshaped into a one-dimensional array using `ravel()` for compatibility.


In [7]:
# Train the final pipeline
svc_pipeline.fit(X_train, y_train.values.ravel())

We employ our trained Support Vector Classifier (SVC) pipeline, `svc_pipeline`, to generate predictions on the test dataset, `X_test`. The predicted labels are stored in `y_pred`, which we will use for evaluating the model's performance on unseen data.


In [8]:
# Predict on the test set
y_pred = svc_pipeline.predict(X_test)

We evaluate our Support Vector Classifier (SVC) pipeline on the test data:

- **Accuracy**: `svc_pipeline.score` returns the mean accuracy, which we store in `score`.
- **F2 Score**: We calculate the weighted F2 score using the `fbeta_score` function with `beta=2`, which emphasizes recall. With `average="weighted"`, the per-class F2 scores are averaged in proportion to each class's support, so the result reflects both classes rather than only the positive one.

The F2 score for our SVC model is then printed.


In [9]:
# Mean accuracy of the pipeline on the test data (kept for reference, not printed)
score = svc_pipeline.score(X_test, y_test)

# Calculate the weighted fbeta_score with beta = 2 on the test data
f2_score = fbeta_score(y_test, y_pred, average="weighted", beta=2)
print(f"F2Score for the SVC Model is: {f2_score}")

F2Score for the SVC Model is: 0.5010476501580502


With a weighted F2 score of approximately 0.5010 on the test set, our Support Vector Classifier (SVC) model performs poorly, well below the cross-validated score of about 0.867. Part of that gap may be a metric mismatch: the grid search used `make_scorer(fbeta_score, beta=2)` with the default binary averaging (F2 of the positive class only), while the test score uses `average="weighted"`, so the two numbers are not directly comparable. The F2 score alone also does not show whether recall or precision is the weak point; the confusion matrix and the per-class precision and recall would. This result prompts us to investigate potential areas for improvement. First, we need to check whether the dataset is imbalanced, as this can heavily influence the model's behavior. Second, the hyperparameter search could be broadened to other kernel functions, regularization strengths, and model complexities. Finally, feature engineering and decision-threshold adjustment should be considered to improve the model's ability to classify positive cases correctly.
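
A breakdown along these lines can be sketched with the metric functions imported at the top of the notebook. The example uses synthetic, mildly imbalanced data and a default `SVC` as stand-ins (assumptions, not our coupon data or tuned model), and reports both the binary and the weighted F2 so the two averaging modes can be compared side by side.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    confusion_matrix,
    fbeta_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with a roughly 70/30 class split.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)
y_hat = SVC().fit(X_tr, y_tr).predict(X_te)

# Confusion matrix entries for a binary problem, in sklearn's order.
tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
precision = precision_score(y_te, y_hat)
recall = recall_score(y_te, y_hat)

# Binary F2 looks only at the positive class; weighted F2 averages both classes.
f2_binary = fbeta_score(y_te, y_hat, beta=2)
f2_weighted = fbeta_score(y_te, y_hat, beta=2, average="weighted")

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F2 binary={f2_binary:.3f} F2 weighted={f2_weighted:.3f}")
```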


## Pickling Our Model


In [10]:
# Specify the filename where you want to save the model
filename = "SVC_Model.pkl"

# Export the model to the file using pickle.dump
with open(filename, "wb") as file:
    pickle.dump(svc_pipeline, file)
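
To reuse the saved model later, it can be loaded back with `pickle.load`. The round trip is sketched below with a small stand-in pipeline and a temporary file (both assumptions, so the example does not touch `SVC_Model.pkl`). Note that pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small stand-in model trained on synthetic data.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = Pipeline([("scaler", StandardScaler()), ("svc", SVC())]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical filename

    # Save the fitted pipeline, then read it back.
    with open(path, "wb") as file:
        pickle.dump(model, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(f"round trip preserved predictions: {same_predictions}")
```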