# Classification 2

## Exercise 2: Classification using KNN

### Overview

In this exercise, we begin our classification journey by building a baseline model using k-nearest neighbors (KNN). KNN is a simple, easy-to-understand method that is also very easy to use. It is a nonparametric algorithm: it makes no strong assumptions about the underlying form of the data. Nonparametric algorithms are therefore free to 'learn' from the data without restriction. However, they do have disadvantages, such as:

- More data: Require a lot more training data to estimate the mapping function.
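- Slower: Often a lot slower, as they tend to have far more parameters to fit. KNN in particular does little work at training time and pays the cost at prediction time, when each query is compared against the stored training data.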
- Overfitting: More of a risk to overfit the training data and it is harder to explain why specific predictions are made.

[This](https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/) blog article explains this further.
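
To make the idea concrete, here is a minimal sketch of how KNN classifies a point: it stores the training data, finds the k closest training points to the query, and takes a majority vote over their labels. The function name `knn_predict` and the toy data are hypothetical and are not part of the exercise; scikit-learn's implementation is more elaborate than this.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per row, label 1 = churned
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # query near the first cluster
print(knn_predict(train_points, train_labels, (3.9, 4.1)))  # query near the second cluster
```

Note that there is no training step beyond storing the data, which is why the cost of KNN shows up when predicting.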

To do this exercise, you need to have completed Exercise 1 and to use the data saved from it. Complete the tasks in the text or in the code comments. You will also need to refer to the [scikit-learn documentation](https://scikit-learn.org/stable/documentation.html).

### Library Imports

In [None]:
# Basic Library Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

In [None]:
# TASK: Import the relevant scikit-learn functions and classes as required. You may have to keep modifying this cell as you discover more.


### Data Ingestion

In [None]:
# TASK: Read in the CSV file saved from Exercise 1
file_path = # Filename
# The 1st column of the CSV file should be the customer ID, which is loaded into the DataFrame's index
input_data = pd.read_csv(file_path, index_col=0)

In [None]:
# Validate that data is as expected
input_data.head()

In [None]:
# Size of data - TASK: Validate that it's (7032, 20)
input_data.shape

### Feature Engineering

In [None]:
# Outcome variable
# Instead of keeping the values, we will encode as 1s and 0s using the map function
output_var_name = 'ChurnLabel'
output_var = input_data[output_var_name]
output_var = output_var.map({'Yes': 1, 'No': 0})
# Note that this cell can be run only once. The outcome column is dropped at the end of the cell, so running it again raises a KeyError.
# Mapping already-encoded values would also produce NaN, as 1/0 are not keys in the mapping.

# Count the number of rows for each outcome value
print("Row count for each outcome")
print(output_var.value_counts())

# Remove the outcome variable from the main dataframe
input_data.drop(output_var_name, axis=1, inplace=True)

In [None]:
# Next, we want to define 3 lists for each of the data types found in our data i.e. Numerical, Categorical (more than 2 values), Binary (2 values only)

# Numerical features
num_features = [key for key in dict(input_data.dtypes) if dict(input_data.dtypes)[key] in ['int64', 'float64']]
print(num_features) # TASK: Confirm the columns based on Exercise 1

In [None]:
# TASK: Define the 4 categorical features as a list of strings. These are the non-numerical features that do not have Yes/No values
cat_features = # Categorical feature names

In [None]:
# TASK: Define the binary features. Complete the steps denoted in this cell.
# 1. Get the list of non-numerical features (both categorical and binary). Hint: Add 'not' to the code from num_features
bin_features = # Copy then modify the code from num_features

# 2. Remove the categorical feature names from this list
for col in cat_features:
    # Hint: There is a list method to remove an element
print(f"List of binary features: {bin_features}") # TASK: Confirm the resulting list

In [None]:
# Encoding the binary features. Similar to the outcome variable, we will need to convert the values of these features from Yes/No to 1/0.
# Note: As an alternative, this could have been done when building the pipeline.
# TASK: Complete the code 
for col in bin_features:
    input_data[col] = 

In [None]:
# Display values after encoding
input_data.head()

### Model Building

In [None]:
# Define the preprocessing pipeline. Reminder: the binary features have already been encoded and are thus only passed through
# TASK: Match the list of features to the correct encoding operation. 
# Remember to add the library imports for ColumnTransformer, StandardScaler, OneHotEncoder to the imports above
preprocess = ColumnTransformer(
    transformers=[
        ('standardscaler', StandardScaler(), ),
        ('onehotencoder', OneHotEncoder(), )
    ],
    remainder='passthrough'
)
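
The preprocessing step above applies two operations: standardization for numerical features and one-hot encoding for categorical ones. The sketch below shows what each operation does to a single column, in plain Python. The helper names `standardize` and `one_hot` and the sample values are hypothetical, and the sketch assumes the population standard deviation and alphabetically sorted categories, which is how `StandardScaler` and `OneHotEncoder` behave by default as far as this example is concerned.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Turn each category into a 0/1 indicator column, with categories sorted alphabetically."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical sample values, not taken from the exercise data
tenure = [2.0, 4.0, 6.0, 8.0]
contract = ["Monthly", "Yearly", "Monthly", "Two-year"]

scaled = standardize(tenure)
categories, encoded = one_hot(contract)

print(scaled)      # values centred on 0 with unit spread
print(categories)  # one indicator column per category
print(encoded)     # exactly one 1 per row
```

Standardization matters for KNN because the algorithm relies on distances: a feature measured on a large scale would otherwise dominate the neighbour search.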

In [None]:
# TASK: Complete the pipeline by adding the KNN algorithm, i.e. KNeighborsClassifier
# For now, use n_neighbors=5 as the parameter to KNeighborsClassifier. Once this script is working, feel free to try out other values.
model = make_pipeline(
    preprocess,
    # TASK
)

In [None]:
# Train/Test Split
# TASK: Split the data into 70:30 train/test. Use the random_state=42
x_train, x_test, y_train, y_test = # TASK

In [None]:
# Check the dimensions of the data. TASK: Confirm as (2110, 19)
x_test.shape
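
The expected shape of (2110, 19) can be checked with a little arithmetic. This sketch assumes that the test row count is the 30% share rounded up, and that the 19 columns are the original 20 minus the outcome variable removed earlier; the variable names are illustrative only.

```python
import math

n_rows = 7032          # row count stated in the Data Ingestion section
n_columns = 20         # column count stated in the Data Ingestion section
test_fraction = 0.3    # the 30% share of a 70:30 split

# Assumption: the test set size is the fractional share rounded up
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test
# One column (the outcome variable) was dropped from the feature data
n_features = n_columns - 1

print(n_test, n_train, n_features)
```

If your own `x_test.shape` differs, the most likely causes are a different split ratio or an outcome column that was not removed.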

In [None]:
# Train the pipeline. You can add a semicolon (';') at the end of the line to suppress the printed output
model.fit(x_train, y_train)

### Evaluation

For regression problems, we are familiar with common metrics such as Root Mean Square Error (RMSE) and the Coefficient of Determination (R<sup>2</sup> value).

With classification problems, we need a different set of metrics to evaluate the model. Here, we use metrics such as:

- Confusion Matrix
- Precision
- Recall
- F1 score
- ROC and AUC
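
As a rough illustration of how the first four metrics relate to one another, the sketch below derives them by hand from the four cells of a binary confusion matrix. The function name `confusion_counts` and the labels are made up for this example; in the exercise itself you will use the scikit-learn functions instead.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # correctly predicted positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # Type 1 errors
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # Type 2 errors
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # correctly predicted negatives
    return tp, fp, fn, tn


# Hypothetical labels: 4 actual positives and 6 actual negatives
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of those predicted positive, share that are positive
recall = tp / (tp + fn)                     # of the actual positives, share that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} Recall={recall:.3f} F1={f1:.3f}")
```
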

Read the following blog posts to get familiar with these terms:

- https://hackernoon.com/idiots-guide-to-precision-recall-and-confusion-matrix-b32d36463556
- https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

Once you are done, proceed to the next cells.


In [None]:
# Apply the model on the test data
pred_test = model.predict(x_test)

In [None]:
# TASK: Import the following metrics from scikit-learn in the library imports section above
# confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_curve, auc
# Note that the metric functions share the same parameter profile, i.e. the first parameter contains the actual values while the second contains the predicted values from the model.
# The exception is auc, which takes the x and y coordinates of a curve (here the false positive and true positive rates).

In [None]:
# TASK: Calculate the confusion matrix
cm = # TASK
print(cm)

In [None]:
# Confusion Matrices typically are displayed in a graphical manner. Run this cell to display the matrix using this code snippet found online.
labels = ['No', 'Yes']
ax = plt.subplot()
sn.heatmap(cm, annot=True, ax=ax, fmt="d")  # annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
# TASK: Calculate the 4 following metrics using the correct function
# OPTIONAL: Calculate these metrics by hand (using the formulas in the reference blog posts) to validate the values

# 1. Accuracy = Sum of correctly predicted outcomes divided by total number of samples
accuracy = # TASK
print("Accuracy: {:.5f}".format(accuracy))

In [None]:
# 2. Precision - Of those predicted positive, how many of them are actual positive.
precision = # TASK
print("Precision: {:.5f}".format(precision))

In [None]:
# 3. Recall - how many of the actual positives our model is predicting as positives
recall = # TASK
print("Recall: {:.5f}".format(recall))

In [None]:
# 4. F1 score
f1 = #TASK
print("F1 Score: {:.5f}".format(f1))

In [None]:
# Alternatively, we can calculate all these metrics in one call using the classification_report function
print(classification_report(y_test, pred_test, digits=5))

At this point, take a step back and try to understand these numbers. 

- What do these metrics mean in the context of the problem? 
- Which error (Type 1 or Type 2) is more important for this problem? And thus which metric is more important, Precision or Recall?
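
One way to think about these questions is to look at how the two error types trade off as the decision threshold moves. The sketch below is a plain-Python illustration on made-up scores, not the exercise data: lowering the threshold catches more actual positives (higher recall, fewer Type 2 errors) at the price of more false alarms (lower precision, more Type 1 errors). The helper names are hypothetical. The (false positive rate, recall) pairs it produces are also the points of a ROC curve, and the area under them is computed here with the trapezoid rule.

```python
def rates_at_threshold(actual, scores, threshold):
    """Return (precision, recall, false_positive_rate) when score >= threshold means positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    # Precision is reported as 0.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return precision, tp / (tp + fn), fp / (fp + tn)


def roc_points(actual, scores):
    """Return (false_positive_rate, recall) pairs, one per distinct score, starting at (0, 0)."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        _, recall, fpr = rates_at_threshold(actual, scores, threshold)
        points.append((fpr, recall))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical predicted probabilities of the positive class
actual = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.75, 0.5, 0.25):
    precision, recall, fpr = rates_at_threshold(actual, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f} fpr={fpr:.2f}")

print(f"AUC={trapezoid_area(roc_points(actual, scores)):.4f}")
```

A curve like this needs scores rather than hard 0/1 labels. With the pipeline in this exercise, such scores could likely be obtained from `model.predict_proba(x_test)[:, 1]`; treat that as a suggestion to check against the scikit-learn documentation rather than as part of the task.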

In [None]:
# TASK: Compute ROC and AUC. Note that roc_curve() returns 3 values. You will only need the first 2 as input to auc() i.e. use _ as the 3rd output
fpr, tpr, _ = # TASK
#TASK = auc(fpr, tpr)

In [None]:
# Run this cell to plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange',
         lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve')
plt.legend(loc="lower right")
plt.show()

Congratulations! You have built a basic classification model. Complete the lesson quiz and proceed to the next lesson.