In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    fbeta_score,
    make_scorer,
    precision_score,
    recall_score,
)
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
import pickle

## KNN Classifier Model

In this section, we delve into the K-Nearest Neighbors (KNN) algorithm, a straightforward yet powerful classification method. The KNN algorithm classifies new examples based on the majority class of the 'k' nearest points from the training dataset. This simplicity makes KNN a versatile choice for a wide range of classification problems.

## Key Characteristics of KNN:

- **Non-parametric**: KNN makes no assumptions about the underlying data distribution, which is useful with real-world data that can be complex and messy.
- **Lazy Learning**: It doesn't explicitly learn a model. Instead, it classifies instances based on their similarity to other instances present in the training dataset.
- **Distance Metric**: The algorithm uses a distance metric, typically Euclidean, to identify the 'k' closest instances (neighbors).

By adjusting the number of neighbors ('k') and the distance metric, we can tune KNN to perform well across a variety of datasets and contexts.
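
To make the majority-vote idea concrete, here is a minimal from-scratch sketch of KNN prediction for a single query point. The function name `knn_predict` and the toy data are hypothetical and only for illustration; the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters
X_toy = np.array(
    [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
)
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
pred_near_five = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3)
print(pred_near_origin, pred_near_five)  # 0 1
```

Nothing is learned ahead of time here: all the work happens at prediction time, which is what "lazy learning" refers to.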

The purpose of this notebook is to create and tune a KNN Classifier model and then pickle that model for a future full-fledged comparison.

For a full breakdown of how our code and data were set up, refer to the notebook "Final_Project_Data_Gen."

---


Here we define our custom scorer. We use an F2 score to prioritize recall (minimizing the chance of withholding a coupon from someone who would use it) while still accounting for precision (minimizing the chance of giving a coupon to someone who will not use it). The score is averaged across classes, weighted by class support (`average="weighted"`).


In [3]:
def f2_func(y_true, y_pred):
    f2_score = fbeta_score(y_true, y_pred, beta=2, average="weighted")
    return f2_score


def my_f2_scorer():
    return make_scorer(f2_func)
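
For reference, the F-beta score combines precision P and recall R as (1 + β²)·P·R / (β²·P + R), so β = 2 weights recall more heavily. The sketch below checks this formula against `fbeta_score` on a small set of hypothetical labels. It uses the default binary averaging (positive class only) to keep the arithmetic easy to follow, whereas the scorer above uses `average="weighted"`.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels, for illustration only
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0])

p = precision_score(y_true_demo, y_pred_demo)  # 3 TP / (3 TP + 2 FP) = 0.6
r = recall_score(y_true_demo, y_pred_demo)     # 3 TP / (3 TP + 1 FN) = 0.75

beta = 2
f2_manual = (1 + beta**2) * p * r / (beta**2 * p + r)
f2_sklearn = fbeta_score(y_true_demo, y_pred_demo, beta=2)

print(round(f2_manual, 4), round(f2_sklearn, 4))  # 0.7143 0.7143
```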

Import the pre-split data files.


In [4]:
X_train = pd.read_csv("train_X_In-Car-Rec.csv")
y_train = pd.read_csv("train_y_In-Car-Rec.csv")
X_test = pd.read_csv("test_X_In-Car-Rec.csv")
y_test = pd.read_csv("test_y_In-Car-Rec.csv")

We calculate the square root of the number of training samples with `np.sqrt(X_train.shape[0])` and round it to the nearest whole number with `round()`. A common rule of thumb places a reasonable 'k' near this value, so it serves as a rough upper bound for the range of `n_neighbors` we search. It is a heuristic starting point, not a guarantee of the optimal 'k'.


In [5]:
round(np.sqrt(X_train.shape[0]))

101

We've set up a `param_grid` dictionary for hyperparameter tuning of our KNN classifier. The grid includes two hyperparameters:

- `n_neighbors`: We're considering values from 1 to 96 in steps of 5 (`range(1, 101, 5)`, which stops before 101) to find the optimal number of nearest neighbors.
- `metric`: We're testing both 'euclidean' and 'cosine' distance metrics to see which one provides better results for our model.

This parameter grid will be used to systematically explore different combinations of `n_neighbors` and `metric` in the search for the best-performing KNN classifier configuration.


In [5]:
param_grid = {"n_neighbors": list(range(1, 101, 5)), "metric": ["euclidean", "cosine"]}
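
As a quick sanity check on the size of this search, the sketch below enumerates the same grid with scikit-learn's `ParameterGrid`. The variable name `param_grid_demo` is only used here so the example is self-contained.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    "n_neighbors": list(range(1, 101, 5)),
    "metric": ["euclidean", "cosine"],
}

k_values = param_grid_demo["n_neighbors"]
n_candidates = len(ParameterGrid(param_grid_demo))

print(k_values[0], k_values[-1], len(k_values))  # 1 96 20
print(n_candidates)                              # 40 parameter combinations
print(n_candidates * 10)                         # 400 fits under 10-fold CV
```

With 10-fold cross-validation, each of the 40 combinations is fitted 10 times, plus one final refit of the best configuration.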

We are constructing a machine learning pipeline that encapsulates the K-Nearest Neighbors (KNN) classifier. Here's a brief outline of our approach:

- **Pipeline Creation**: We use the `Pipeline` class from Scikit-learn's `pipeline` module to create a sequence of transformations and a final estimator.
- **KNN Initialization**: Within the pipeline, we initialize the `KNeighborsClassifier` as the step named `"knn"`. This step is set to be the final estimator in our pipeline.

The pipeline is designed to ensure a streamlined process where any additional steps, such as preprocessing or dimensionality reduction, can be easily added in the future.


In [6]:
# Create the full pipeline
pipeline = Pipeline([("knn", KNeighborsClassifier())])
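
If the pipeline itself were passed to `GridSearchCV` (this notebook passes the bare classifier instead), the grid keys would need the step name as a prefix, for example `knn__n_neighbors`. The sketch below illustrates this on synthetic data from `make_classification`. The names `pipe_demo`, `pipe_grid`, and `search_demo`, and the reduced grid, are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the coupon dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

pipe_demo = Pipeline([("knn", KNeighborsClassifier())])

# Keys follow the "<step name>__<parameter>" convention
pipe_grid = {
    "knn__n_neighbors": [1, 6, 11],
    "knn__metric": ["euclidean", "cosine"],
}

search_demo = GridSearchCV(pipe_demo, pipe_grid, cv=3)
search_demo.fit(X_demo, y_demo)
print(search_demo.best_params_)
```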

We create a `GridSearchCV` object with a `KNeighborsClassifier` and the predefined hyperparameter grid to find the best combination through 10-fold cross-validation, using our custom F2 scorer. Note that the search is run on the bare classifier rather than the pipeline defined above; since the pipeline contains only the KNN step, the results are equivalent. We then fit the grid search to our training data, `X_train` and `y_train`, which selects the hyperparameters that maximize the F2 score.


In [7]:
# Create GridSearchCV object
grid_search = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=10, scoring=my_f2_scorer(), n_jobs=-1
)

# Fit the grid search (hyperparameter tuning) on the training data
grid_search.fit(X_train, y_train.values.ravel())

We save the best-performing model from the grid search as best_estimator. We also retrieve the optimal hyperparameters and the corresponding best cross-validation score, storing them in best_params and best_score, respectively. This allows us to inspect the most effective configuration and its performance.


In [8]:
# Store best estimator
best_estimator = grid_search.best_estimator_

# Get the best parameters and score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

best_params, best_score

({'metric': 'euclidean', 'n_neighbors': 36}, 0.6980771345788905)

We initialize a new `KNeighborsClassifier` instance and apply the optimal hyperparameters found by the grid search. By unpacking `best_params` with `**`, we set these parameters directly on the KNN classifier, configuring it with the settings that yielded the best cross-validation performance during tuning.


In [9]:
knn = KNeighborsClassifier()
knn.set_params(**best_params)

We train our KNN classifier with the best hyperparameters obtained from the grid search on the training dataset X_train and the target values y_train. By calling ravel() on y_train, we ensure that the target data is in the appropriate shape expected by the fit method, which is a one-dimensional array.


In [10]:
# Train the final classifier on the full training set
knn.fit(X_train, y_train.values.ravel())
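
The effect of `ravel()` can be seen on a small hypothetical target frame: a single-column DataFrame yields a two-dimensional array, and `ravel()` flattens it to the one-dimensional shape that `fit` expects. The column name `"Y"` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical single-column target, shaped like a y file read with pd.read_csv
y_frame = pd.DataFrame({"Y": [0, 1, 1, 0]})

shape_2d = y_frame.values.shape          # column vector
shape_1d = y_frame.values.ravel().shape  # flat array

print(shape_2d)  # (4, 1)
print(shape_1d)  # (4,)
```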

We use our trained KNN classifier to make predictions on the test dataset `X_test`. The predicted labels for each instance in the test set are stored in the variable `y_pred`.


In [11]:
# Predict on the test set
y_pred = knn.predict(X_test)
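
Since `GridSearchCV` refits the best configuration on the full training set by default (`refit=True`), a classifier rebuilt manually from `best_params_` should give the same predictions as `best_estimator_`. The following sketch checks this on synthetic data. All names and the reduced grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the coupon dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 6, 11]}, cv=3)
search.fit(X_tr, y_tr)

# Rebuild the classifier manually from the best parameters, as in the cells above
manual = KNeighborsClassifier().set_params(**search.best_params_)
manual.fit(X_tr, y_tr)

same_predictions = np.array_equal(
    search.best_estimator_.predict(X_te), manual.predict(X_te)
)
print(same_predictions)
```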

## Evaluate the model's performance

We evaluate the trained KNN classifier on the test data `X_test` by calculating the default accuracy score, which is stored in the variable `score`. Additionally, we compute the F2 score, which gives more weight to recall than precision, using the `fbeta_score` function with `beta=2` and `average="weighted"`. An F2 score is particularly useful when false negatives carry a higher cost than false positives. The calculated F2 score for our KNN model is then printed out.


In [12]:
# Default accuracy of the classifier on the test data
score = knn.score(X_test, y_test)

# Calculate the weighted F2 score on the test data
f2_score = fbeta_score(y_test, y_pred, average="weighted", beta=2)
print(f"F2Score for the KNN Model is: {f2_score}")

F2Score for the KNN Model is: 0.707296744500565


The weighted F2 score for our KNN model on the test set is approximately 0.707, close to the cross-validated score of about 0.698 from the grid search. Because the F2 score weighs recall more heavily than precision, this indicates a moderate balance between the two, with emphasis on limiting false negatives.


Let's display the confusion matrix and our accuracy, precision, and recall for the KNN model.


In [13]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Precision
precision = precision_score(y_test, y_pred, average="weighted")
print(f"\nPrecision (weighted): {precision:.4f}")

# Recall
recall = recall_score(y_test, y_pred, average="weighted")
print(f"Recall (weighted): {recall:.4f}")

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Confusion Matrix:
[[ 632  446]
 [ 291 1168]]

Precision (weighted): 0.7071
Recall (weighted): 0.7095
Accuracy: 0.7095


We have analyzed the performance of our model using a confusion matrix and other classification metrics. The confusion matrix shows that we have 632 true negatives and 1168 true positives, indicating instances correctly identified by our model. However, there are also 446 false positives and 291 false negatives, where our model's predictions were incorrect.

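The weighted precision of our model is 0.7071. Because `average="weighted"` averages the per-class precision values weighted by each class's support, this is not the precision of the positive class alone. From the confusion matrix, positive-class precision is 1168 / (1168 + 446) ≈ 0.724, so when our model predicts a positive outcome, it is correct about 72.4% of the time. Likewise, the weighted recall of 0.7095 necessarily equals the accuracy, while positive-class recall is 1168 / (1168 + 291) ≈ 0.801, meaning the model correctly identifies about 80.1% of all positive instances.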

Lastly, the accuracy of our model is 0.7095, meaning that it makes the correct prediction for approximately 70.95% of the cases. Considering these metrics together gives us insight into the areas where our model performs well and where there may be room for improvement, especially in reducing the number of false positives and false negatives.
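
The reported metrics can be reproduced directly from the confusion matrix. The sketch below recomputes per-class precision and recall, then the support-weighted averages and accuracy, using only NumPy. The variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[632, 446], [291, 1168]])

support = cm.sum(axis=1)                             # true count per class
precision_per_class = np.diag(cm) / cm.sum(axis=0)   # correct / predicted per class
recall_per_class = np.diag(cm) / support             # correct / actual per class
weights = support / support.sum()

beta = 2
f2_per_class = (1 + beta**2) * precision_per_class * recall_per_class / (
    beta**2 * precision_per_class + recall_per_class
)

weighted_precision = float(np.sum(weights * precision_per_class))
weighted_recall = float(np.sum(weights * recall_per_class))
weighted_f2 = float(np.sum(weights * f2_per_class))
accuracy = float(np.trace(cm) / cm.sum())

print("Per-class precision:", precision_per_class.round(4))
print("Per-class recall:   ", recall_per_class.round(4))
print(f"Weighted precision: {weighted_precision:.4f}")
print(f"Weighted recall:    {weighted_recall:.4f}")
print(f"Weighted F2:        {weighted_f2:.4f}")
print(f"Accuracy:           {accuracy:.4f}")
```

The weighted values agree with the outputs above (0.7071, 0.7095, and an F2 of about 0.7073), while the per-class figures show that the model does noticeably better on the positive class than on the negative class.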


### Pickling Our Model


In [8]:
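# Final pipeline, configured with the best hyperparameters from the grid search.
# Note: it is not fitted in this notebook, so it must be fit before it can predict.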

pipeline_knnClassifier = Pipeline(
    [
        (
            "knn",
            KNeighborsClassifier(metric="euclidean", n_neighbors=36),
        ),
    ]
)

In [10]:
# Specify the filename where you want to save the model
filename = "KNNClassifier_Model.pkl"

# Export the model to the file using pickle.dump
with open(filename, "wb") as file:
    pickle.dump(pipeline_knnClassifier, file)
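
For the later comparison, the pickled pipeline can be restored with `pickle.load`. The sketch below performs a round trip in a temporary directory on synthetic data. The file name `knn_demo.pkl` and the demo data are assumptions, so that the example does not touch the real model file. Because the pipeline is saved unfitted, it has to be fit after loading.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the coupon dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("knn", KNeighborsClassifier(metric="euclidean", n_neighbors=36))])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_demo.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The hyperparameters survive the round trip
restored_k = restored.get_params()["knn__n_neighbors"]
print(restored_k)  # 36

# The restored pipeline is unfitted, so fit it before predicting
restored.fit(X_demo, y_demo)
preds = restored.predict(X_demo[:5])
print(preds.shape)  # (5,)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.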