# Modeling - KNN
In this notebook, I build a K-Nearest Neighbors (KNN) model to predict customer churn for an internet service provider (ISP). I tune its hyperparameters with Optuna, a hyperparameter optimization framework that by default uses the Tree-structured Parzen Estimator (TPE) to search for the best-performing parameters.

## Table of Contents:
1. Data Loading
2. Modeling
    - Finding Best Hyperparameters
    - Building Model with tuned parameters

In [1]:
# Importing required libraries and modules
import optuna
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import QuantileTransformer
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

In [2]:
# Setting seaborn figure size
sns.set(rc={'figure.figsize':(10,8)})

# Setting the seed
np.random.seed(42)

## Data Loading

In [3]:
train_prepared = pd.read_csv('../data/processed/train-prepared.csv')

In [4]:
print('Shape=>', train_prepared.shape)
train_prepared.head()

Shape=> (62273, 11)


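|   | is_tv_subscriber | is_movie_package_subscriber | subscription_age | bill_avg | remaining_contract | is_contract | service_failure_count | download_avg | upload_avg | download_over_limit | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1.77 | 7 | 0.19 | 1 | 0 | 114.1 | 8.7 | 0 | 0 |
| 1 | 1 | 0 | 0.05 | 6 | 0.59 | 1 | 0 | 12.7 | 1.3 | 0 | 0 |
| 2 | 0 | 0 | 1.42 | 18 | 0.0 | 0 | 0 | 0.4 | 0.0 | 0 | 1 |
| 3 | 1 | 0 | 0.73 | 20 | 0.0 | 1 | 0 | 9.3 | 0.4 | 0 | 1 |
| 4 | 1 | 0 | 0.25 | 17 | 0.0 | 1 | 0 | 6.1 | 0.5 | 0 | 1 |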


In [5]:
# Getting an overview of the dataset
train_prepared.info(show_counts=True,verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62273 entries, 0 to 62272
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   is_tv_subscriber             62273 non-null  int64  
 1   is_movie_package_subscriber  62273 non-null  int64  
 2   subscription_age             62273 non-null  float64
 3   bill_avg                     62273 non-null  int64  
 4   remaining_contract           62273 non-null  float64
 5   is_contract                  62273 non-null  int64  
 6   service_failure_count        62273 non-null  int64  
 7   download_avg                 61948 non-null  float64
 8   upload_avg                   61948 non-null  float64
 9   download_over_limit          62273 non-null  int64  
 10  churn                        62273 non-null  int64  
dtypes: float64(4), int64(7)
memory usage: 5.2 MB
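
The overview above lists 61,948 non-null values out of 62,273 rows for `download_avg` and `upload_avg`, which means 325 values are missing in each of those two columns. The following is a minimal sketch, on made-up toy values rather than the real dataset, of how median imputation with scikit-learn's `SimpleImputer` fills such gaps. The `toy` frame and its numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with gaps in the same two columns
toy = pd.DataFrame({
    "download_avg": [114.1, 12.7, np.nan, 9.3, 6.1],
    "upload_avg": [8.7, np.nan, 0.0, 0.4, 0.5],
})

# Count the missing values per column
missing_counts = toy.isna().sum()

# Replace each NaN with the median of the observed values in its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(missing_counts.to_dict())
print(imputed)
```

The median is less sensitive to extreme values than the mean, which may matter for usage columns with a long right tail.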


## Modeling

In [6]:
# Separating predictors and target
X, y = train_prepared.loc[:, train_prepared.columns != 'churn'], train_prepared.loc[:, 'churn']

### Finding Best Hyperparameters

In [7]:
def create_knn_pipeline(quantile_transform: str, n_neighbors: int,
                        weights: str, metric: str) -> Pipeline:
    """ Returns a pipeline object created around KNN algorithm
    
    Takes data preparation and KNN modeling parameters as input,
        creates a Scikit-learn pipeline object and returns it
    
    Parameters
    ----------
    quantile_transform : str
        - "Yes": Quantile Transformation will be performed
        - "No": No Transformation
    
    n_neighbors : int
        n_neighbors argument of KNeighborsClassifier
    
    weights : str
        weights argument of KNeighborsClassifier
    
    metric : str
        metric argument of KNeighborsClassifier
        
    Returns
    -------
    pipeline : Pipeline
        The pipeline object from Scikit-Learn
    """
    model = KNeighborsClassifier(n_neighbors = n_neighbors,
                                 weights = weights,
                                 metric = metric)
    imputer = SimpleImputer(strategy='median')

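    if quantile_transform == "Yes":
        quantile_transformer = QuantileTransformer(n_quantiles=1000,
                                                   output_distribution='normal',
                                                   random_state=42)

        # Positional indices of the continuous and count features in X:
        # subscription_age, bill_avg, remaining_contract, service_failure_count,
        # download_avg, upload_avg and download_over_limit. Indices are used
        # because the imputer outputs a NumPy array without column names.
        transformer = ColumnTransformer(transformers=[('quantile_transformer',
                                                       quantile_transformer,
                                                       [2, 3, 4, 6, 7, 8, 9])],
                                        n_jobs=-1,
                                        remainder='passthrough')
        pipeline = Pipeline(steps = [('median_imputer', imputer),
                                     ('transformer', transformer),
                                     ('knn', model)])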
    else:
        pipeline = Pipeline(steps = [('median_imputer', imputer),
                                     ('knn', model)])
    
    return pipeline
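
KNN is distance-based, so a feature with a long tail can dominate the neighbor search. The sketch below illustrates what the optional quantile transformation in the pipeline does to such a feature. The log-normal sample is a hypothetical stand-in for a skewed column, not data from this notebook.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)

# Hypothetical heavy-tailed feature (assumption: stands in for a skewed column)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=(5000, 1))

# Same settings as in the pipeline: map the empirical quantiles to a normal shape
qt = QuantileTransformer(n_quantiles=1000,
                         output_distribution="normal",
                         random_state=42)
transformed = qt.fit_transform(skewed)

print("before: mean=%.2f, std=%.2f, max=%.2f"
      % (skewed.mean(), skewed.std(), skewed.max()))
print("after:  mean=%.2f, std=%.2f, max=%.2f"
      % (transformed.mean(), transformed.std(), transformed.max()))
```

After the transformation the feature is roughly standard normal, so it contributes to distances on a scale comparable to other transformed features.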

In [8]:
def objective(trial: optuna.trial.Trial) -> float:
    """ Returns ROC-AUC score for KNN algorithm
    
    Objective function for optimizing KNN algorithm
        using Optuna. Takes optuna Trial object as
        input, performs 10-fold cross-validation 5 times
        and returns mean ROC-AUC score for a set of
        hyperparameters of KNN modeling pipeline.
        
    Parameters
    ----------
    trial : optuna.trial.Trial
        A trial is a process of evaluating an objective function.
        This object is passed to an objective function and provides
        interfaces to get parameter suggestion, manage the trial’s
        state, and set/get user-defined attributes of the trial.
    
    Returns
    -------
    roc_auc_score : float
        Mean ROC-AUC Score of 10-fold cross-validation
        repeated 5 times for a KNN modeling pipeline
        with a set of hyperparameters.
    """
    # Data preparation parameters
    quantile_transform = trial.suggest_categorical("quantile_transform", ["Yes", "No"])
    
    # Modeling parameters
    n_neighbors = trial.suggest_int("n_neighbors", low=5, high=101, step=1)
    weights = trial.suggest_categorical("weights", ["uniform", "distance"])
    # Note: "minkowski" with the default p=2 is the same distance as "euclidean",
    # so these two choices describe the same model
    metric = trial.suggest_categorical("metric", ["minkowski",
                                                  "euclidean",
                                                  "manhattan"])
    
    # Building modeling pipeline
    pipeline = create_knn_pipeline(quantile_transform, n_neighbors, weights, metric)
    
    # Defining Cross-Validation
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    
    return np.mean(scores)
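
Each call to `objective` fits the pipeline once per cross-validation split. With 10 folds repeated 5 times that is 50 fits per trial, or 1,500 fits over the 30 trials used in this notebook. The sketch below checks the split count and shows that stratification keeps the class ratio in every held-out fold. The labels `y_toy` are hypothetical, not the churn target.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical imbalanced labels: 30 positives and 70 negatives
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for splitting

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)

# Total number of train/test splits, i.e. model fits per Optuna trial
n_fits = cv.get_n_splits(X_toy, y_toy)

# Share of positives in each held-out fold
positive_rates = [y_toy[test_idx].mean()
                  for _, test_idx in cv.split(X_toy, y_toy)]

print("fits per trial:", n_fits)
print("positive rate per fold: min=%.2f, max=%.2f"
      % (min(positive_rates), max(positive_rates)))
```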

In [9]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, show_progress_bar = True, n_trials = 30)

[I 2021-10-12 09:34:40,198] A new study created in memory with name: no-name-98d1c1af-ded3-443c-887f-df7ba71ed644

[I 2021-10-12 09:34:53,625] Trial 0 finished with value: 0.9297784416439074 and parameters: {'quantile_transform': 'No', 'n_neighbors': 7, 'weights': 'distance', 'metric': 'manhattan'}. Best is trial 0 with value: 0.9297784416439074.
[I 2021-10-12 09:35:02,053] Trial 1 finished with value: 0.9124641789496056 and parameters: {'quantile_transform': 'No', 'n_neighbors': 8, 'weights': 'distance', 'metric': 'euclidean'}. Best is trial 0 with value: 0.9297784416439074.
[I 2021-10-12 09:35:40,464] Trial 2 finished with value: 0.9706911166345559 and parameters: {'quantile_transform': 'Yes', 'n_neighbors': 61, 'weights': 'uniform', 'metric': 'euclidean'}. Best is trial 2 with value: 0.9706911166345559.
[I 2021-10-12 09:35:52,527] Trial 3 finished with value: 0.9087124289332473 and parameters: {'quantile_transform': 'No', 'n_neighbors': 40, 'weights': 'distance', 'metric': 'minkowski'}. Best is trial 2 with value: 0.9706911166345559.
...

In [10]:
print("Highest Score: ", study.best_value)
print("Best Parameters: ", study.best_params)
print("Best Trial: ", study.best_trial)

Highest Score:  0.9747468017695787
Best Parameters:  {'quantile_transform': 'Yes', 'n_neighbors': 93, 'weights': 'distance', 'metric': 'manhattan'}
Best Trial:  FrozenTrial(number=21, values=[0.9747468017695787], datetime_start=datetime.datetime(2021, 10, 12, 9, 47, 0, 570678), datetime_complete=datetime.datetime(2021, 10, 12, 9, 48, 0, 898156), params={'quantile_transform': 'Yes', 'n_neighbors': 93, 'weights': 'distance', 'metric': 'manhattan'}, distributions={'quantile_transform': CategoricalDistribution(choices=('Yes', 'No')), 'n_neighbors': IntUniformDistribution(high=101, low=5, step=1), 'weights': CategoricalDistribution(choices=('uniform', 'distance')), 'metric': CategoricalDistribution(choices=('minkowski', 'euclidean', 'manhattan'))}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=21, state=TrialState.COMPLETE, value=None)


### Building Model with tuned parameters

In [11]:
tuned_params = study.best_params

In [12]:
# Building modeling pipeline
pipeline = create_knn_pipeline(tuned_params["quantile_transform"],
                               tuned_params["n_neighbors"],
                               tuned_params["weights"],
                               tuned_params["metric"])

# Defining model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)

# Evaluating Model
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)

In [13]:
print('Mean AUC-ROC Score of KNN: %.4f \u00B1 %.4f' % (np.mean(scores), np.std(scores)))

Mean AUC-ROC Score of KNN: 0.9748 ± 0.0020
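
`cross_val_score` fits clones of the pipeline and discards them, so no fitted model remains after the evaluation above. The sketch below shows one way to fit a KNN classifier with the tuned values and score unseen rows. It runs on synthetic data from `make_classification` rather than the churn dataset and omits the imputation and quantile transformation steps for brevity, so its score says nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# KNN with the tuned values reported by the study
knn = KNeighborsClassifier(n_neighbors=93, weights="distance",
                           metric="manhattan")
knn.fit(X_train, y_train)

# Predicted probability of the positive class for each held-out row
churn_proba = knn.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, churn_proba)

print("Held-out ROC-AUC on synthetic data: %.4f" % test_auc)
```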
