# Chapter 06 - Statistical Machine Learning

In [1]:
import numpy as np 
import pandas as pd
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

import math
import os
import random
from pathlib import Path
from collections import defaultdict
from itertools import product

## K-Nearest Neighbors

The idea behind KNN is pretty simple:

1. Find K records that have similar features (i.e., similar predictor values).
2. For classification, find out what the majority class is among those similar records and assign that class to the new record.
3. For prediction (also called KNN regression), find the average among those similar records, and predict that average for the new record.
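
The three steps above can be sketched from scratch in a few lines. This is a minimal illustration, not how scikit-learn implements KNN: the helper name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, regression=False):
    """Predict for one record from its k nearest training records (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: Euclidean distance from the new record to every training record
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    neighbor_values = [y_train[i] for i in nearest]
    if regression:
        # Step 3: average the neighbors' numeric outcomes
        return float(np.mean(neighbor_values))
    # Step 2: majority class among the neighbors
    return Counter(neighbor_values).most_common(1)[0][0]

# Toy data, made up for illustration
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
labels = ['paid off', 'paid off', 'default', 'default', 'default']
amounts = [10.0, 12.0, 11.0, 50.0, 52.0]

predicted_class = knn_predict(X_toy, labels, [1.0, 0.9], k=3)
predicted_value = knn_predict(X_toy, amounts, [1.0, 0.9], k=3, regression=True)
print(predicted_class, predicted_value)
```

The three nearest records to the new point are the first three rows, so the majority vote is `paid off` and the regression prediction is the mean of 10, 12, and 11.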

### A Small Example: Predicting Loan Default

In [2]:
LOAN200_CSV = '../data/loan200.csv'
LOAN3000_CSV = '../data/loan3000.csv'
LOAN_DATA_CSV = '../data/loan_data.csv.gz'

In [3]:
loan200 = pd.read_csv(LOAN200_CSV)
loan200.head()

Unnamed: 0,outcome,payment_inc_ratio,dti
0,target,9.0,22.5
1,default,5.46933,21.33
2,paid off,6.90294,8.97
3,paid off,11.148,1.83
4,default,3.7212,10.81


In [4]:
predictors = ['payment_inc_ratio', 'dti']
outcome = 'outcome'

newloan = loan200.loc[0:0, predictors]
X = loan200.loc[1:, predictors]
y = loan200.loc[1:, outcome]

In [5]:
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X, y)
knn.predict(newloan)
print(knn.predict_proba(newloan))

[[0.45 0.55]]


The probability of belonging to a class is given by the proportion of the K nearest neighbors that belong to that class. scikit-learn orders the classes alphabetically, so the first column of the output corresponds to `default`: the probability of default is estimated at 9/20, or 0.45.
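
To see where these proportions come from, the neighbors can be retrieved with `kneighbors` and counted by hand. The sketch below uses simulated data (not the loan data), and the names `X_demo`, `y_demo`, and `knn_demo` are placeholders. It assumes the default uniform weighting of `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Simulated two-feature data with a noisy class boundary (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
noise = rng.normal(scale=0.5, size=100)
y_demo = np.where(X_demo[:, 0] + noise > 0, 'paid off', 'default')

knn_demo = KNeighborsClassifier(n_neighbors=20).fit(X_demo, y_demo)
new_point = np.array([[0.1, -0.2]])

# Indices of the 20 nearest training records
_, idx = knn_demo.kneighbors(new_point)
neighbor_labels = y_demo[idx[0]]

# Share of each class among the neighbors, in the order of knn_demo.classes_
manual_proba = np.array([(neighbor_labels == c).mean() for c in knn_demo.classes_])

print(knn_demo.classes_)
print(manual_proba)
print(knn_demo.predict_proba(new_point)[0])
```

The hand-counted shares should match the row returned by `predict_proba`, and printing `classes_` shows which column belongs to which class.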

### Distance Metrics

In measuring the distance between two vectors, we must watch out for differences in scale between variables. Features measured on a larger scale will dominate the distance calculation and have almost all the influence over the algorithm. To mitigate this problem, we can standardize the features, for example by converting them to z-scores.
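
One way to standardize is scikit-learn's `StandardScaler`, which subtracts each column's mean and divides by its standard deviation. The sketch below is a hedged illustration on simulated data; the feature names `ratio` and `balance` are hypothetical and only stand in for two variables on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Two hypothetical features on very different scales
ratio = rng.uniform(0, 20, size=200)         # values in the tens
balance = rng.uniform(0, 50_000, size=200)   # values in the tens of thousands
X_raw = np.column_stack([ratio, balance])
y_raw = np.where(ratio > 10, 'default', 'paid off')

# z-score each column: z = (x - mean) / std
scaler = StandardScaler()
X_std = scaler.fit_transform(X_raw)
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))

# Fit KNN on the standardized features
knn_std = KNeighborsClassifier(n_neighbors=5).fit(X_std, y_raw)

# A new record must be transformed with the same fitted scaler before prediction
new_record = scaler.transform(np.array([[12.0, 30_000.0]]))
print(knn_std.predict(new_record))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the Euclidean distance merely because of its units.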

### Choosing K

Generally speaking, if K is too low (close to 1), we may be overfitting, that is, fitting the noise in the data. Higher values of K provide smoothing that reduces the risk of overfitting in the training data. On the other hand, if K is too high, we may oversmooth the data and miss out on KNN's ability to capture the local structure in the data, one of its main advantages.

The K that best balances between overfitting and oversmoothing is typically determined by accuracy metrics and, in particular, accuracy with holdout or validation data. There is no general rule about the best K—it depends greatly on the nature of the data. For highly structured data with little noise, smaller values of K work best. Borrowing a term from the signal processing community, this type of data is sometimes referred to as having a high signal-to-noise ratio (SNR). Examples of data with a typically high SNR are data sets for handwriting and speech recognition. For noisy data with less structure (data with a low SNR), such as the loan data, larger values of K are appropriate. Typically, values of K fall in the range 1 to 20. Often, an odd number is chosen to avoid ties.
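
A simple way to compare values of K is to hold out part of the data and measure accuracy on it for each candidate. The following is a minimal sketch on simulated data; the 70/30 split, the candidate grid of odd values, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Simulated noisy two-class data (illustration only)
rng = np.random.default_rng(2)
X_sim = rng.normal(size=(400, 2))
noise = rng.normal(scale=1.0, size=400)
y_sim = np.where(X_sim[:, 0] + X_sim[:, 1] + noise > 0, 'default', 'paid off')

X_train, X_valid, y_train, y_valid = train_test_split(
    X_sim, y_sim, test_size=0.3, random_state=0)

# Accuracy on the validation set for odd values of K from 1 to 19
holdout_accuracy = {}
for k in range(1, 21, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    holdout_accuracy[k] = model.score(X_valid, y_valid)

best_k = max(holdout_accuracy, key=holdout_accuracy.get)
print(holdout_accuracy)
print('best K on the validation set:', best_k)
```

A single split can be noisy, so in practice the comparison is often repeated with cross-validation before settling on a value.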

### KNN as a Feature Engine

In practical model fitting, KNN can be used to add “local knowledge” in a staged process with other classification techniques:

1. KNN is run on the data, and for each record, a classification (or quasi-probability of a class) is derived.
2. That result is added as a new feature to the record, and another classification method is then run on the data. The original predictor variables are thus used twice.

At first you might wonder whether this process, since it uses some predictors twice, causes a problem with multicollinearity (see “Multicollinearity”). This is not an issue, since the information being incorporated into the second-stage model is highly local, derived only from a few nearby records, and is therefore additional information and not redundant.

You can think of this staged use of KNN as a form of ensemble learning, in which multiple predictive modeling methods are used in conjunction with one another. It can also be considered as a form of feature engineering in which the aim is to derive features (predictor variables) that have predictive power. Often this involves some manual review of the data; KNN gives a fairly automatic way to do this.

For example, consider the King County housing data. In pricing a home for sale, a realtor will base the price on similar homes recently sold, known as “comps.” In essence, realtors are doing a manual version of KNN: by looking at the sale prices of similar homes, they can estimate what a home will sell for. We can create a new feature for a statistical model to mimic the real estate professional by applying KNN to recent sales. The predicted value is the sales price, and the existing predictor variables could include location, total square feet, type of structure, lot size, and number of bedrooms and bathrooms. The new predictor variable (feature) that we add via KNN is the KNN predictor for each record (analogous to the realtors’ comps). Since we are predicting a numerical value, the average of the K-Nearest Neighbors is used instead of a majority vote (known as KNN regression).
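
The comps idea can be sketched with `KNeighborsRegressor`. The example below uses simulated data rather than the King County file, and the column names (`sqft`, `bedrooms`, `lot_size`, `price`, `knn_price`) are hypothetical. Using out-of-fold predictions from `cross_val_predict` is a design choice made here, not something the text prescribes: it keeps a home's own sale price out of its own set of comps.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated sales (illustration only, not the King County data)
rng = np.random.default_rng(3)
n = 300
houses = pd.DataFrame({
    'sqft': rng.uniform(800, 4000, size=n),
    'bedrooms': rng.integers(1, 6, size=n),
    'lot_size': rng.uniform(2000, 20000, size=n),
})
houses['price'] = (150 * houses['sqft'] + 10_000 * houses['bedrooms']
                   + rng.normal(scale=20_000, size=n))

features = ['sqft', 'bedrooms', 'lot_size']

# Standardize, then average the sale prices of the 10 most similar homes
knn_reg = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))

# Out-of-fold predictions: each home is scored by a model that never saw it
houses['knn_price'] = cross_val_predict(
    knn_reg, houses[features], houses['price'], cv=5)

print(houses[['price', 'knn_price']].head())
```

The `knn_price` column can then be passed, alongside the original predictors, to a second-stage model such as a linear regression.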

Similarly, for the loan data, we can create features that represent different aspects of the loan process. For example, the following code would build a feature that represents a borrower’s creditworthiness:

In [6]:
loan_data = pd.read_csv(LOAN_DATA_CSV)

predictors = ['dti', 'revol_bal', 'revol_util', 'open_acc',
 'delinq_2yrs_zero', 'pub_rec_zero']
outcome = 'outcome'

X = loan_data[predictors]
y = loan_data[outcome]

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X, y)

loan_data['borrower_score'] = knn.predict_proba(X)[:, 1]
loan_data['borrower_score'].describe()

count    45342.000000
mean         0.498902
std          0.128736
min          0.050000
25%          0.400000
50%          0.500000
75%          0.600000
max          1.000000
Name: borrower_score, dtype: float64

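The result is a feature that predicts the likelihood a borrower will default based on their credit history. The columns of `predict_proba` follow the order of `knn.classes_`, so it is worth confirming that column 1 is the class you intend to score.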