### Introduction to Classification and K-Nearest Neighbors

**Objectives**

- Identify *Classification* problems in supervised learning
- Use `KNeighborsClassifier` to model classification problems using scikit-learn
- Use `StandardScaler` to prepare data for KNN models
- Use `Pipeline` to combine preprocessing and modeling steps
- Use `KNNImputer` to impute missing values


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.datasets import make_blobs
from sklearn import set_config
set_config(display='diagram')  # pass display by keyword; a positional argument would set assume_finite instead

### Classification

Unlike regression, classification problems involve predicting a categorical variable: for example, the breed of a dog, whether or not a customer purchases an item, or the presence of a disease.  Today, we will examine three examples: predicting whether or not a person survived the sinking of the Titanic, the species of a penguin, and whether or not a person defaults on their credit card.  For each of these problems, we will use the K-Nearest Neighbors (KNN) algorithm, which we introduce below.

#### Problem Motivation

In [None]:
X, y = make_blobs(centers = 2, cluster_std=2, random_state = 42)

In [None]:
data_1 = pd.DataFrame(X, columns = ['X1', 'X2'])
data_1['y'] = y

In [None]:
sns.scatterplot(data = data_1, x = 'X1', y = 'X2', hue = 'y')
plt.title('Sample Classification Data')
plt.grid();

In [None]:
sns.scatterplot(data = data_1, x = 'X1', y = 'X2', hue = 'y')
plt.title('Sample Classification Data')
plt.plot(0, 4, 'ro', markersize = 10, label = 'New Data')
plt.legend()
plt.grid();

#### The Intuition

KNN classifies a new data point based on its distance from known, labeled data.  There is no equation to be learned as we had with linear regression, so we call this a *non-parametric* model.  Essentially, we decide how many of the nearest points ($k$) get to vote, and the new point is assigned the majority class among those neighbors.  Below, we demonstrate this with a small sample of the `titanic` data.

In [None]:
titanic = sns.load_dataset('titanic')

In [None]:
titanic.head()

In [None]:
sample_train = titanic[['pclass', 'age', 'survived']].head().copy()  # copy so a column can be added later without a SettingWithCopyWarning
sample_train

In [None]:
new_data = titanic[['pclass', 'age']].iloc[30]

In [None]:
new_data

In [None]:
np.linalg.norm(sample_train.iloc[0, :2] - new_data)

In [None]:
distances = sample_train[['pclass', 'age']].apply(lambda x: np.linalg.norm(x - new_data), axis = 1)
distances

In [None]:
sample_train['distance'] = distances

In [None]:
sample_train

In [None]:
sample_train.sort_values('distance')

#### Question

If you determine the outcome based on the 1 nearest neighbor, what would you predict? 5 nearest neighbors?
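
One way to check an answer to a question like this is to code the vote directly. The sketch below uses made-up values (not the Titanic rows above), and `knn_predict` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np
from collections import Counter

# Made-up training points: columns are (pclass, age); labels are 0/1 outcomes.
X_train = np.array([[1, 30.0], [3, 20.0], [2, 45.0], [3, 28.0], [1, 50.0]])
y_train = np.array([1, 0, 1, 0, 0])
x_new = np.array([2, 31.0])  # hypothetical new observation


def knn_predict(X_train, y_train, x_new, k):
    """Return the majority label among the k closest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Row indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


pred_1 = knn_predict(X_train, y_train, x_new, k=1)
pred_5 = knn_predict(X_train, y_train, x_new, k=5)
print(pred_1, pred_5)
```

With a single neighbor the prediction follows the closest point only, while with all five neighbors it follows the overall majority, so the two answers can disagree.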

### Using `KNeighborsClassifier`

The `KNeighborsClassifier` works just like the earlier `LinearRegression` estimator.  You will instantiate, fit, predict, and score the model as before.  Additionally, we have a parameter `n_neighbors` that controls how many neighbors vote on each classification.  To begin, let us form our training and testing data using `pclass` and `age`, then fit a model with 5 neighbors.  Note that `age` contains missing values, which `KNeighborsClassifier` cannot handle, so those rows need to be dropped or imputed first.
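
As a rough sketch of the instantiate, fit, predict, score pattern, here is the workflow on synthetic two-class data from `make_blobs`. The data and variable names are stand-ins chosen for illustration; the Titanic version follows the same steps.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data stands in for the real features here
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=22)

knn = KNeighborsClassifier(n_neighbors=5)  # instantiate
knn.fit(X_train, y_train)                  # fit
preds = knn.predict(X_test)                # predict
test_acc = knn.score(X_test, y_test)       # score (accuracy on held-out data)
print(test_acc)
```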

In [None]:
# X and y


In [None]:
# train/test split
# random_state = 22


In [None]:
# instantiate


In [None]:
# fit


In [None]:
# score


#### `.score`

Here, we score the model using the total percent correct or **accuracy**.  Later, we will explore additional metrics for classification but for now this is an intuitive way to score a classifier.  

$$\text{accuracy} = \frac{\text{number correct}}{\text{number total}}$$
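
To make the formula concrete, accuracy can be computed by hand by comparing predictions with true labels. The labels below are made up for illustration.

```python
import numpy as np

# Made-up true labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

n_correct = int((y_true == y_pred).sum())  # number correct
accuracy = n_correct / len(y_true)         # number correct / number total
print(n_correct, accuracy)
```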



### Comparing to Baseline

Typically, you will use the majority class to serve as a baseline predictor.  That is, imagine a model that always predicts the most common class, no matter the input; its accuracy is the proportion of observations in that class.  For this example, it is easy to use `.value_counts(normalize = True)` on the target to find the baseline accuracy.
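
A small sketch of the baseline idea, using a made-up target column rather than the real data:

```python
import pandas as pd

# Made-up binary target: seven 0s and three 1s
y_example = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name='survived')

class_shares = y_example.value_counts(normalize=True)  # proportion of each class
majority_class = class_shares.idxmax()                 # the class a baseline always predicts
baseline_acc = class_shares.max()                      # accuracy of always predicting it
print(majority_class, baseline_acc)
```

A classifier is only useful if its test accuracy beats this number.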

In [None]:
#baseline


In [None]:
#which was better?


**PROBLEM**

Use `KNeighborsClassifier` to predict the `default` column using `balance` and `income`.  Create a train/test split and report the score on both train and test data.

In [None]:
default = pd.read_csv('data/Default.csv', index_col = 0)
default.head()

### Improving the Model

Now, we can try two things to improve our model.  The first is to change the data we are using and incorporate more features into the model.  This means encoding the categorical features so they can be fed into the model.  As before, we will use `make_column_transformer` to one-hot encode the categorical features while passing the other features through.
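
Here is a minimal sketch of that encoding step on a made-up four-row DataFrame. The column names mirror the Titanic data, but the values are invented, and `handle_unknown='ignore'` is a choice made here so unseen categories in test data do not raise an error.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows with two categorical and two numeric columns
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'embarked': ['S', 'C', 'S', 'Q'],
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.28, 7.92, 8.05],
})

transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['sex', 'embarked']),
    remainder='passthrough',  # leave the numeric columns untouched
)
encoded = transformer.fit_transform(toy)
print(encoded.shape)  # 2 sex columns + 3 embarked columns + 2 numeric columns
```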

In [None]:
titanic.head(2)

In [None]:
cat_cols = ['sex', 'embarked', 'class', 'adult_male', 'alone']
num_cols = ['pclass', 'age', 'fare']
#select columns
X = titanic.loc[:, cat_cols + num_cols]
y = titanic['survived']

In [None]:
#create OHE


In [None]:
#transformer


In [None]:
# train/test


In [None]:
# fit and transform train


In [None]:
# transform the test


In [None]:
# instantiate the KNN estimator


In [None]:
# fit on train


In [None]:
# score on test


### Another Important Transformation

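In addition to using the `OneHotEncoder` to encode the categorical features, existing numeric features need to be put on the same scale.  Because KNN is built on distances, a feature measured in large units (like `fare`) would otherwise dominate one measured in small units (like `pclass`).  To do this, we convert each feature to $z$-scores using that feature's mean $\mu$ and standard deviation $\sigma$: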

$$z = \frac{x_i - \mu}{\sigma}$$

You can accomplish this transformation using the `StandardScaler`.  One way to streamline this is to replace the `remainder = 'passthrough'` argument in `make_column_transformer` with a `StandardScaler`.
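
The sketch below, on made-up data, scales the leftover numeric columns by handing a `StandardScaler` to `remainder` and checks one column against the $z$-score formula. Setting `sparse_threshold=0` is a choice made here to force a dense array so the columns are easy to index.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; column names mirror the Titanic data
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.28, 7.92, 8.05],
})

# Manual z-scores; StandardScaler uses the population standard deviation (ddof=0)
age_z_manual = (toy['age'] - toy['age'].mean()) / toy['age'].std(ddof=0)

scaling_transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['sex']),
    remainder=StandardScaler(),  # scale the remaining (numeric) columns
    sparse_threshold=0,          # always return a dense array
)
scaled = scaling_transformer.fit_transform(toy)

# Output columns: sex_female, sex_male, age (scaled), fare (scaled)
print(np.allclose(scaled[:, 2], age_z_manual))
```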

In [None]:
# transformer for scaling


In [None]:
# fit and transform

# transform


In [None]:
# instantiate and fit


In [None]:
# score train and test


### Streamlining data preparation and modeling with `Pipeline`

The `Pipeline` object allows you to chain together different transformers and estimator objects from scikit-learn.  In our example, this involves first applying the column transformer built with `make_column_transformer` and then fitting the `KNeighborsClassifier`.  See the user guide [here](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators) for more examples.
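
A minimal sketch of such a pipeline on made-up data follows. The step names `'transform'` and `'knn'` and the choice of 3 neighbors are arbitrary labels and settings picked for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; column names mirror the Titanic data
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male', 'male', 'female', 'male', 'female'],
    'age': [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    'fare': [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 30.07],
    'survived': [0, 1, 1, 0, 0, 1, 0, 1],
})
X_toy = toy[['sex', 'age', 'fare']]
y_toy = toy['survived']

preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['sex']),
    remainder=StandardScaler(),
)
pipe = Pipeline([
    ('transform', preprocess),                     # encode and scale
    ('knn', KNeighborsClassifier(n_neighbors=3)),  # then classify
])

pipe.fit(X_toy, y_toy)  # fits the transformer and the model in one call
train_acc = pipe.score(X_toy, y_toy)
print(train_acc)
```

Because the pipeline refits the transformer only on whatever data is passed to `.fit`, calling `.score` on test data applies the training-set encoding and scaling automatically.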

In [None]:
# create a Pipeline


In [None]:
# fit the train data


In [None]:
# score the train and test


**PROBLEM**

Revisit the `default` problem and use a pipeline to transform the `student` column.  Score your model on train and test data.

#### Other Uses of KNN

The nearest neighbors idea can also be used to impute missing data.  Here, we use the nearest data points to fill in missing values.  Scikit-learn has a `KNNImputer` that fills in each missing value with the average of that feature across the `n_neighbors` nearest rows.
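
As a small illustration on a made-up array, the missing entry in the third row below is filled with the mean of the second column from its two nearest rows, where nearness is measured on the columns that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data with one missing value in the second column
X_missing = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_missing)

# Rows 0 and 1 are closest to row 2 on the first column,
# so the missing value becomes the mean of 10 and 20.
print(X_filled[2, 1])
```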

In [None]:
from sklearn.impute import KNNImputer

In [None]:
titanic = sns.load_dataset('titanic')
titanic.info()

In [None]:
# instantiate


In [None]:
# fit and transform


In [None]:
# encoder


In [None]:
# pipeline


In [None]:
# fit on train


In [None]:
# score on train and test


#### Selecting the right `k`

![](images/neighbors.png)

In [None]:
# loop over different neighbor options
# fitting estimators to each
# and tracking the train/test scores
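
One possible version of this loop, sketched on synthetic data from `make_blobs` rather than the Titanic data. The range of `k` values is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, overlapping two-class data
X_demo, y_demo = make_blobs(n_samples=300, centers=2, cluster_std=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=22)

k_values = range(1, 30, 2)  # odd k avoids tied votes with two classes
train_scores, test_scores = [], []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

# k with the highest test accuracy (first one wins on ties)
best_k = k_values[test_scores.index(max(test_scores))]
print(best_k)
```

With `k = 1` every training point is its own nearest neighbor, so training accuracy is perfect while test accuracy is usually lower. Choosing `k` by test score is convenient for a quick look, but cross-validation on the training data is the more careful option.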


#### `GridSearchCV`

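`GridSearchCV` automates the search over parameter values using cross-validation.  It needs two ingredients:

- A dictionary of parameters to try
- An estimator or pipeline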
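
A hedged sketch of how those two pieces fit together, again on synthetic data. The step names, the candidate values for `n_neighbors`, and `cv=5` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, overlapping two-class data
X_demo, y_demo = make_blobs(n_samples=300, centers=2, cluster_std=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=22)

# An estimator or pipeline
pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])

# A dictionary of parameters; keys use the pattern <step name>__<parameter name>
params = {'knn__n_neighbors': [1, 3, 5, 7, 9, 11]}

grid = GridSearchCV(pipe, param_grid=params, cv=5)
grid.fit(X_train, y_train)  # cross-validates every candidate on the training data

best_k = grid.best_params_['knn__n_neighbors']
test_acc = grid.score(X_test, y_test)  # scores the refit best model on held-out data
print(best_k, test_acc)
```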


#### Summary

While the KNN model is easy to understand and implement, there are many other classification algorithms that often perform better and have interpretable parameters.  Next class, we will examine one such example with `LogisticRegression`, and the following week we will examine tree models and ensembles.