<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# K-Nearest Neighbors with scikit-learn

_Authors: Alex Sherman (DC)_

<a id="overview-of-the-iris-dataset"></a>
## Loading the Iris Data Set
---

#### Read the iris data into a pandas DataFrame, including column names.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path

# not necessary with newest versions of Jupyter
%matplotlib inline

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

plt.style.use('fivethirtyeight')

In [None]:
data = Path('..', 'assets', 'data', 'iris.data') # Works better cross-platform than hard-coding path as a string
iris = pd.read_csv(data)

In [None]:
iris.head(30)

<a id="terminology"></a>

- **150 observations** (n=150): Each observation is one iris flower.
- **Four features** (p=4): sepal length, sepal width, petal length, and petal width.
- **Response**: One of three possible iris species (setosa, versicolor, or virginica)

![](../assets/images/petal_sepal.jpeg)

In the last two lessons, we built models to predict **numeric variables**, such as median housing prices. Predicting a continuous quantity in this way is called **regression**.

In the next few lessons, we build models to predict **categorical variables**, such as flower species. Predicting a discrete value in this way is called **classification**.

<a id="exercise-human-learning-with-iris-data"></a>
## Guided Practice: "Human Learning" With Iris Data

**Question:** Can we predict the species of an iris using petal and sepal measurements? Together, we will:

1. Read the iris data into a pandas DataFrame, including column names.
2. Gather some basic information about the data.
3. Use sorting, split-apply-combine, and/or visualization to look for differences between species.
4. Write down a set of rules that could be used to predict species based on iris measurements.

Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data and check the accuracy of your predictions.

#### Gather some basic information about the data.

In [None]:
# Get the number of rows and columns in the iris dataset.


In [None]:
# Check the data types


In [None]:
# Verify the basic stats look appropriate


In [None]:
# Test for imbalanced classes


In [None]:
# Verify we are not missing any data


#### Use sorting, split-apply-combine, and/or visualization to look for differences between species.

In [None]:
# Mean of all numeric columns, grouped by species.


In [None]:
# Box plot of petal_width, grouped by species.
# Using .boxplot() convenience method, which returns its Axes


In [None]:
# Box plot of all numeric columns, grouped by species.


In [None]:
# Map species to a numeric value so that plots can be colored by species.


In [None]:
iris.head()

In [None]:
# Scatterplot of petal_length vs. petal_width, colored by species


In [None]:
# Ack -- continuous colorbar is not appropriate.
# Better approach:


In [None]:
# Scatter matrix of all features, colored by species.
# scatter_matrix returns 2D array of Axes


**Exercise (3 mins.)**

To illustrate how classifiers work, write down a set of rules for classifying iris species in the following form:

1. If XYZ, choose Species A.
2. Otherwise if ABC, choose Species B.
3. Otherwise, choose Species C.

Don't expect perfect results -- in real machine learning problems, perfect accuracy is rarely achievable.

$\blacksquare$


#### Example

In [None]:
# Define a new feature that represents petal area ("feature engineering").
# Iris petals are closer to an oval than a rectangle,
# so we use the formula for the area of an ellipse:
# r1 * r2 * 3.14.


In [None]:
# Description of petal_area, grouped by species.


In [None]:
# Box plot of petal_area, grouped by species.


In [None]:
# Only show irises with a petal_area between 5 and 8.


My set of rules for predicting species:

- If petal_area is less than 2, predict **setosa**.
- Else if petal_area is less than 6, predict **versicolor**.
- Otherwise, predict **virginica**.

**Exercise (6 mins.)** Implement these rules to make your own classifier!

Write a function that accepts a row of data and returns a predicted species. Then, apply that function to `iris` to make predictions for all existing rows of data and check the accuracy of your predictions.

In [None]:
# Starter code

def predict_flower(row):
    if row.loc['petal_area'] < 2:
        prediction = 'Iris-setosa'
#     What about the other cases?
    return prediction

# Apply your classifier row-wise
iris.loc[:, 'prediction'] = None


$\blacksquare$
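One possible way to finish the exercise is sketched below. It runs on a tiny hand-made DataFrame rather than `iris`, so the rows and the accuracy are illustrative only. The labels `'Iris-versicolor'` and `'Iris-virginica'` are assumed to follow the same naming pattern as `'Iris-setosa'` in the starter code.

```python
import pandas as pd

# Hypothetical rows with petal_area already computed; not taken from iris.data.
toy = pd.DataFrame({
    'petal_area': [0.9, 4.7, 5.5, 12.0],
    'species': ['Iris-setosa', 'Iris-versicolor',
                'Iris-virginica', 'Iris-virginica'],
})

def predict_flower(row):
    # Every branch returns a value, so no row is left without a prediction.
    if row.loc['petal_area'] < 2:
        return 'Iris-setosa'
    elif row.loc['petal_area'] < 6:
        return 'Iris-versicolor'
    return 'Iris-virginica'

# Apply the classifier row-wise (axis=1 passes one row at a time).
toy['prediction'] = toy.apply(predict_flower, axis=1)

# Fraction of rows where the prediction matches the true species.
accuracy = (toy['prediction'] == toy['species']).mean()
print(accuracy)
```

The third toy row (area 5.5) is a virginica that the rules label versicolor, which shows how a fixed threshold misclassifies flowers near the boundary.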

### Examine results

In [None]:
iris.head()

In [None]:
# Let's see what percentage your manual classifier gets correct!
# 0.3333 means 1/3 are classified correctly


In [None]:
# Create a scatterplot of PETAL LENGTH versus PETAL WIDTH and color by SPECIES and by PREDICTED SPECIES.


<a id="human-learning-on-the-iris-dataset"></a>
## Human Learning on the Iris Data Set
---

How did we (as humans) predict the species of an iris?

1. We observed that the different species had (somewhat) dissimilar measurements.
2. We focused on features that seemed to correlate with the response.
3. We created a set of rules (using those features) to predict the species of an unknown iris.

We assumed that if an **unknown iris** had measurements similar to **previous irises**, then its species was most likely the same as those previous irises.

<a id="k-nearest-neighbors-knn-classification"></a>
## K-Nearest Neighbors (KNN) Classification
---

Predict that the value of the target variable for an iris is the most popular value among its K "nearest neighbors."

Which points count as "nearest neighbors" depends on how you measure distance. The most common approach is to use Euclidean distance (square root of the sum of squared differences) in the feature space.
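The voting rule can be sketched from scratch in a few lines. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the measurements are made up for the example, and ties are broken by whichever label is counted first.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the most common label among the k training points nearest to query."""
    # Euclidean distance from the query to every training point.
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote among the neighbors' labels.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (sepal_length, sepal_width) measurements, for illustration only.
points = [(5.0, 3.4), (4.9, 3.1), (6.4, 3.2), (6.9, 3.1), (5.1, 3.5)]
labels = ['setosa', 'setosa', 'versicolor', 'versicolor', 'setosa']

prediction = knn_predict(points, labels, query=(5.2, 3.3), k=3)
print(prediction)  # the three nearest points are all setosa
```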

The plots below illustrate KNN for various k and two features: `x='sepal_length'` and `y='sepal_width'`. The points are the values in the training set, and the background colors indicate what we would predict for values in the test set.

<a id="knn-classification-map-for-iris-k"></a>
### KNN Classification Map for Iris (K=1)

![1NN classification map](../assets/images/iris_01nn_map.png)

### KNN Classification Map for Iris (K=5)

![5NN classification map](../assets/images/iris_05nn_map.png)

### KNN Classification Map for Iris (K=15)

![15NN classification map](../assets/images/iris_15nn_map.png)

<a id="knn-classification-map-for-iris-k"></a>
### KNN Classification Map for Iris (K=50)

![50NN classification map](../assets/images/iris_50nn_map.png)

**Exercise (2 mins., post to Slack right away.)**

- How does increasing $k$ affect the bias and the variance of a KNN model?

- How can you choose a good $k$ for a particular application?

$\blacksquare$

# KNN Applied to NBA Stats

For the rest of the lesson, we will be using a dataset containing the 2015 season statistics for ~500 NBA players. This dataset leads to a nice choice of K, as we'll see below. The columns we'll use for features (and the target 'pos') are:


| Column | Meaning |
| ---    | ---     |
| pos | C: Center. F: Forward. G: Guard |
| ast | Assists per game | 
| stl | Steals per game | 
| blk | Blocks per game |
| tov | Turnovers per game | 
| pf  | Personal fouls per game | 

For information about the other columns, see [this glossary](https://www.basketball-reference.com/about/glossary.html).

In [None]:
# Read the NBA data into a DataFrame.
path = Path('..', 'assets', 'data', 'NBA_players_2015.csv')
nba = pd.read_csv(path)

In [None]:
nba.head()

In [None]:
nba.shape

In [None]:
# Map positions to numbers.


In [None]:
# Create feature matrix (X).


In [None]:
# Create response vector (y).


<a id="using-the-traintest-split-procedure-k"></a>
### Using the Train/Test Split Procedure (K=1)

In [None]:
# Import estimator class and other sklearn tools


In [None]:
# 1. Split X and y into training and testing sets (using `random_state` for reproducibility).


In [None]:
# 2. Train the estimator on the training set (using K=1).


In [None]:
# 3. Test the estimator on the testing set and check the accuracy.


In [None]:
# Repeat for K=50.


**Exercise (2 mins., post to Slack right away)**

- What accuracy would you expect a KNN model with $k=1$ to achieve on the *training set*? Would we expect accuracy on the training set to be higher or lower with $k=50$?

$\blacksquare$
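The three numbered steps in this section can be sketched end to end as follows. The data here is synthetic (`make_blobs` stands in for the NBA feature matrix and response vector), so the accuracy it prints says nothing about the NBA data.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X and y: 300 points, 5 features, 3 classes.
X, y = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

# 1. Split into training and testing sets (default test size is 25%).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Train the estimator on the training set, using K=1.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# 3. Predict on the testing set and check the accuracy.
y_pred = knn.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)
```

Repeating for K=50 only requires changing `n_neighbors`; the split should stay the same so the two scores are comparable.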

#### Comparing Testing Accuracy With Null Accuracy

For a classification model, a null model **always predicts the most frequent class**. For example, if most players in our data set are Centers, we would always predict Center. It is important to make sure that your model is outperforming the null model.
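As a rough sketch of the idea, null accuracy is just the share of observations that belong to the most frequent class. The encoded labels below are made up for illustration and are not taken from the NBA data.

```python
import numpy as np

# Hypothetical encoded positions (0, 1, 2), for illustration only.
y_true = np.array([2, 1, 2, 0, 2, 1, 2, 1, 2, 0])

# Count each class and pick the most frequent one.
classes, counts = np.unique(y_true, return_counts=True)
most_frequent = classes[np.argmax(counts)]

# The null model predicts that class for every observation.
null_predictions = np.full_like(y_true, most_frequent)
null_accuracy = (null_predictions == y_true).mean()
print(most_frequent, null_accuracy)
```

Here class 2 appears in 5 of 10 rows, so a model needs to beat 50% accuracy to be doing anything useful.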

In [None]:
# first create an array with the same shape as y
# then fill it in with the most common value -- NumPy "broadcasts" the addition over the whole array


In [None]:
# Then compare predicting the most frequent class every time to the true values.


<a id="tuning-a-knn-model"></a>
## Getting Probabilities from a KNN Model

In [None]:
# Instantiate the estimator class (using the value K=5).


In [None]:
# Fit the estimator with data.


A classification estimator's `.predict` method returns the estimator's "categorical" predictions -- in this case, 0, 1, or 2 indicating whether the estimator thinks each player is most likely a center, forward, or guard.

A classification estimator also has a `.predict_proba` method that returns the *probabilities* that the estimator assigns to each class -- in this case, the probability that a given player is a center, is a forward, or is a guard. The `predict` method just returns the class corresponding to the highest of these probabilities.

For KNN, the probabilities that `.predict_proba` returns are just the class frequencies among the given point's K nearest neighbors.
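A small check of that claim, assuming the default uniform neighbor weights: the sketch below fits a KNN estimator on a tiny made-up data set and recomputes the probabilities by hand from the neighbors' labels.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny hypothetical one-feature data set with two classes.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 1, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

query = np.array([[1.4]])
proba = knn.predict_proba(query)

# Recover the same numbers by hand: look up the 3 nearest neighbors
# and compute the fraction of them that belongs to each class.
_, neighbor_idx = knn.kneighbors(query)
neighbor_labels = y[neighbor_idx[0]]
by_hand = np.array([(neighbor_labels == c).mean() for c in knn.classes_])

print(proba, by_hand, knn.predict(query))
```

The three nearest neighbors of 1.4 carry labels 0, 1, and 0, so class 0 gets probability 2/3 and `.predict` returns 0.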

In [None]:
# Calculate predicted probabilities of class membership.
# Each row sums to one and contains the probabilities of the point being a 0-Center, 1-Forward, 2-Guard.


<a id="what-happen-if-we-view-the-accuracy-of-our-training-data"></a>
### Accuracy as a Function of $k$

In [None]:
# Store k and associated training scores in a DataFrame


In [None]:
# Plot training scores against k


**Exercise (2 mins., post to Slack right away.)**

- Why does the accuracy on the training set decrease as $k$ increases?

$\blacksquare$

#### Search for the "best" value of K.

In [None]:
# Calculate TRAINING ERROR and TESTING ERROR for K=1 through 100.


In [None]:
# Add test scores to `scores_df`


In [None]:
# Plot test scores against k


In [None]:
# Plot train scores and test scores together


In [None]:
# Find the minimum testing error and the associated K value.


- **Training error** decreases as model complexity increases (lower value of K).
- **Testing error** is minimized at the optimum model complexity.

**Evaluating training and testing error:**

- If training error is unacceptably high, then you have a bias problem.
- If training error is low enough but there is a big gap between training and test error, then you have a variance problem.
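The search over K described in this section can be sketched as a simple loop. The data is synthetic (`make_classification` is a stand-in for the NBA features), so the best K it reports is not the value for the NBA data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 400 points, 5 features, 3 classes.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Record training and testing error for K=1 through 100.
scores = []
for k in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append({
        'k': k,
        'train_error': 1 - knn.score(X_train, y_train),
        'test_error': 1 - knn.score(X_test, y_test),
    })

# Lowest testing error; ties go to the smallest K.
best = min(scores, key=lambda row: row['test_error'])
print(best)
```

The list of dictionaries can be passed straight to `pd.DataFrame` for plotting. Because one train/test split is noisy, the "best" K found this way is an estimate rather than a fixed property of the data.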

**Conclusions**

- When using KNN on this data set with these features, the **best value for K** is likely to be around 14.
- Given the statistics of an **unknown player**, we estimate that we would be able to correctly predict his position about 74% of the time.

<a id="standardizing-features"></a>
## Standardizing Features
---

Many machine learning models are sensitive to feature scale. 

> KNN in particular is sensitive to feature scale because it (by default) uses the Euclidean distance metric. To determine closeness, Euclidean distance sums the squared differences along each axis. So, if one axis has large differences and another has small differences, the former axis will contribute much more to the distance than the latter axis.

This means that it matters whether our features have similar variance to each other. (Centering features around zero does not change Euclidean distances, but standardization conventionally does both.)

In the case of KNN on the iris data set, imagine we measure sepal length in kilometers, but we measure sepal width in millimeters. Our data will show variation in sepal width, but almost no variation in sepal length.

Unfortunately, KNN cannot automatically adjust to this. Other models are sensitive to scale as well; even linear regression is, once you add regularization.
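The kilometers-versus-millimeters scenario can be made concrete with a small sketch. The measurements are made up, and the helper names `to_mixed_units` and `nearest` are hypothetical; the point is only that a change of units flips which flower counts as the nearest neighbor.

```python
import math

def to_mixed_units(point_cm):
    """Convert (sepal_length_cm, sepal_width_cm) to (length in km, width in mm)."""
    length_cm, width_cm = point_cm
    return (length_cm * 1e-5, width_cm * 10)

def nearest(query, candidates):
    """Return the name of the candidate closest to query in Euclidean distance."""
    return min(candidates, key=lambda name: math.dist(query, candidates[name]))

# Hypothetical measurements in centimeters.
query = (5.1, 3.5)
others = {'b': (7.0, 3.55), 'c': (5.1, 3.8)}

# With both features in centimeters, 'c' is closest (same length, similar width).
nearest_cm = nearest(query, others)

# With length in km and width in mm, length differences all but vanish,
# so 'b' becomes closest despite its very different length.
nearest_mixed = nearest(
    to_mixed_units(query),
    {name: to_mixed_units(p) for name, p in others.items()},
)
print(nearest_cm, nearest_mixed)  # c b
```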

Fortunately, this is an easy fix.

<a id="use-standardscaler-to-standardize-our-data"></a>
### Use `StandardScaler` to Standardize our Data

StandardScaler standardizes our data by subtracting the mean from each feature and dividing by its standard deviation.
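A minimal sketch of the workflow, using randomly generated features on very different scales rather than the NBA data. The scaler learns its means and standard deviations from the training set only and then applies those same values to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales, plus arbitrary labels.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50, 10, size=200),
    rng.normal(0.5, 0.01, size=200),
])
y = rng.integers(0, 3, size=200)

# Split first, so the test set plays no part in fitting the scaler.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1. The scaled test columns will be close to that but not exact, which is expected.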

In [None]:
# Create feature matrix (X).


In [None]:
# Create the train/test split.
# Notice that we create the train/test split before fitting the StandardScaler


In [None]:
# Instantiate and fit `StandardScaler`.


#### Fit a KNN estimator and look at the testing error.
Can you find a number of neighbors that improves our results from before?

In [None]:
# Calculate testing error.


<a id="comparing-knn-with-other-models"></a>
## Comparing KNN With Other Models
---

**Advantages of KNN:**

- It's simple to understand and explain.
- Model training is fast.
- It can be used for classification and regression! (For regression, take the average value of the K nearest points.)
- Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.
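The regression case mentioned in the list above can be sketched with scikit-learn's `KNeighborsRegressor`. The data is a made-up one-feature example chosen so the averaging is easy to verify by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature data with a numeric target.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 10.0, 20.0, 30.0])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)

# The two nearest neighbors of 1.4 are x=1 and x=2,
# so the prediction is the average of their targets, 10 and 20.
prediction = reg.predict(np.array([[1.4]]))
print(prediction)
```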

**Disadvantages of KNN:**

- It must store all of the training data.
- Its prediction phase can be slow when n is large.
- It is sensitive to irrelevant features.
- It is sensitive to the scale of the data.
- Accuracy is (generally) not competitive with the best supervised learning methods.