# Class 4 Lab: K-nearest neighbors and scikit-learn

## Agenda

- **K-nearest neighbors (KNN)**
    - Review of the iris dataset
    - Visualizing the iris dataset
    - KNN classification
    - Review of supervised learning
- **scikit-learn**
    - Requirements for working with data in scikit-learn
    - scikit-learn's 4-step modeling pattern
    - Tuning a KNN model
    - Comparing KNN with other models

## The iris dataset

In [48]:
import csv

TitanicList = []

# the csv module in Python 2 expects the file to be opened in binary mode
with open('../homework/titanic.csv', 'rb') as csv_file_object:
    reader = csv.reader(csv_file_object)
    header = next(reader)  # read the header row so it is not stored as data
    for line in reader:
        # replace a missing age (column 5) with 28
        if line[5] == 'Nan' or line[5] == '':
            line[5] = 28
        TitanicList.append(line)

print TitanicList

[['1', '0', '3', 'Braund, Mr. Owen Harris', 'male', '22', '1', '0', 'A/5 21171', '7.25', '', 'S'], ['2', '1', '1', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'female', '38', '1', '0', 'PC 17599', '71.2833', 'C85', 'C'], ['3', '1', '3', 'Heikkinen, Miss. Laina', 'female', '26', '0', '0', 'STON/O2. 3101282', '7.925', '', 'S'], ['4', '1', '1', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'female', '35', '1', '0', '113803', '53.1', 'C123', 'S'], ['5', '0', '3', 'Allen, Mr. William Henry', 'male', '35', '0', '0', '373450', '8.05', '', 'S'], ['6', '0', '3', 'Moran, Mr. James', 'male', 28, '0', '0', '330877', '8.4583', '', 'Q'], ['7', '0', '1', 'McCarthy, Mr. Timothy J', 'male', '54', '0', '0', '17463', '51.8625', 'E46', 'S'], ['8', '0', '3', 'Palsson, Master. Gosta Leonard', 'male', '2', '3', '1', '349909', '21.075', '', 'S'], ['9', '1', '3', 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)', 'female', '27', '0', '2', '347742', '11.1333', '', 'S'], ['10', '1', '2', 'Nasser, 

In [49]:
# read the Titanic data into a DataFrame
import pandas as pd

col_names = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
#titanic = pd.read_csv()

titanic = pd.DataFrame(TitanicList, columns=col_names)


In [115]:
titanic.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 28 | 0 | 0 | 330877 | 8.4583 | | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | | C |


### Terminology

- **150 observations** (n=150): each observation is one iris flower
- **4 features** (p=4): sepal length, sepal width, petal length, and petal width
- **Response**: iris species
- **Classification problem** since response is categorical

![iris](images/iris_petal_sepal.png)

### Let's plot the iris dataset

...and see what we can learn.

In [50]:
# allow plots to appear in the notebook
%matplotlib inline

# create a custom colormap
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00'])

# map each passenger's sex to a number
titanic['Sex_num'] = titanic.Sex.map({'male':1, 'female':2})


## K-nearest neighbors (KNN) classification

1. Pick a value for K.
2. Search for the K observations in the data that are "nearest" to the measurements of the unknown iris.
    - Euclidean distance is often used as the distance metric, but other metrics are allowed.
3. Use the most popular response value from the K "nearest neighbors" as the predicted response value for the unknown iris.
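
The three steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the function name `knn_predict` and the toy data are made up for this example, and it assumes Python 3.8 or later (for `math.dist`), unlike the Python 2 cells in the rest of this lab.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training observation
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    # keep the k observations with the smallest distances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the most common label among those neighbors is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy data (made up for illustration): two features, two classes
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
train_y = ["a", "a", "a", "b", "b", "b"]

# Step 1: pick K (here K=3), then classify two unknown points
prediction_near_a = knn_predict(train_X, train_y, [1.1, 1.0], k=3)
prediction_near_b = knn_predict(train_X, train_y, [5.1, 5.0], k=3)
print(prediction_near_a, prediction_near_b)  # prints: a b
```

Sorting every distance is the simplest approach; it is fine for small data but is one reason prediction gets slow as the number of observations grows.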

In [4]:
titanic.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | 2 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S | 1 |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 28 | 0 | 0 | 330877 | 8.4583 | | Q | 1 |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S | 1 |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | | S | 1 |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | | S | 2 |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | | C | 2 |


### KNN classification map for iris (K=1)

![1NN classification map](images/iris_01nn_map.png)

### KNN classification map for iris (K=5)

![5NN classification map](images/iris_05nn_map.png)

### KNN classification map for iris (K=15)

![15NN classification map](images/iris_15nn_map.png)

### KNN classification map for iris (K=50)

![50NN classification map](images/iris_50nn_map.png)

**Question:** What's the "best" value for K in this case?

**Answer:** The value which produces the most accurate predictions on unseen data. We want to create a model that generalizes!
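
One common way to estimate accuracy on unseen data is to hold out part of the data as a test set and score each candidate K on it. The sketch below assumes Python 3 and a recent scikit-learn (one that provides `sklearn.model_selection`), and uses the copy of the iris data bundled with scikit-learn; the 40% test size and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the iris data that ships with scikit-learn
iris = load_iris()

# hold out 40% of the observations as "unseen" test data
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0, stratify=iris.target
)

# fit one model per candidate K and record its accuracy on the held-out data
accuracy_by_k = {}
for k in (1, 5, 15, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy_by_k[k] = knn.score(X_test, y_test)

for k, accuracy in accuracy_by_k.items():
    print(k, round(accuracy, 3))
```

A single split gives a noisy estimate, so the ranking of K values can change with a different `random_state`.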

## Review of supervised learning

![Supervised learning diagram](images/supervised_learning.png)

## Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be entirely **numeric**
3. Features and response should be **NumPy arrays** (or easily convertible to NumPy arrays)
4. Features and response should have **specific shapes** (outlined below)
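
Data read with the `csv` module arrives as strings, so the "entirely numeric" requirement usually needs an explicit conversion step. The following sketch assumes Python 3 and pandas; the small `raw` DataFrame and the names `X_demo` and `y_demo` are hypothetical stand-ins for illustration, not the notebook's own variables.

```python
import pandas as pd

# hypothetical rows that mimic string-typed columns read from a CSV file
raw = pd.DataFrame({
    'Pclass': ['3', '1', '3'],
    'Age': ['22', '38', '26'],
    'Fare': ['7.25', '71.2833', '7.925'],
    'Survived': ['0', '1', '1'],
})

# convert every column from strings to integers or floats
numeric = raw.apply(pd.to_numeric)

# requirement 1: features and response are separate objects
X_demo = numeric[['Pclass', 'Age', 'Fare']]  # feature matrix, shape (n, p)
y_demo = numeric['Survived']                 # response vector, shape (n,)

print(X_demo.dtypes)
print(X_demo.shape, y_demo.shape)
```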

In [51]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | 2 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S | 1 |


In [None]:
# store feature matrix in "X"
#feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
#X = iris[feature_cols]

In [59]:
# alternative ways to create "X"
X = titanic.drop(['PassengerId', 'Survived', 'Name', 'Ticket', 'Cabin', 'Embarked', 'Sex'], axis=1)
#X = iris.loc[:, 'sepal_length':'petal_width']
#X = iris.iloc[:, 0:4]

In [6]:
# store response vector in "y"
y = titanic.Survived

In [54]:
# check X's type
print type(X)
print type(X.values)

<class 'pandas.core.frame.DataFrame'>
<type 'numpy.ndarray'>


In [60]:
print X.values[1]


['1' '38' '1' '0' '71.2833' 2]


In [8]:
# check y's type
print type(y)
print type(y.values)

<class 'pandas.core.series.Series'>
<type 'numpy.ndarray'>


In [29]:
print y.values

[0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1
 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0
 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0
 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0
 0 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0
 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0
 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1
 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0
 0 0 1 1 0 1 0 0 1 0 0 0 

In [9]:
# check X's shape (n = number of observations, p = number of features)
print X.shape

(891, 7)


In [10]:
# check y's shape (single dimension with length n)
print y.shape

(891,)


## scikit-learn's 4-step modeling pattern

**Step 1:** Import the class you plan to use

In [61]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for "model"
- "Instantiate" means "make an instance of"

In [62]:
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults
- **QUESTION: How do we know what those defaults are? How do we even know what all the parameters are?**
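
One way to answer this from code is to ask the estimator itself. The sketch below assumes Python 3 and a recent scikit-learn; `get_params()` is available on scikit-learn estimators, and the variable name `knn_default` is just for this example.

```python
from sklearn.neighbors import KNeighborsClassifier

# instantiate with no arguments so every parameter takes its default value
knn_default = KNeighborsClassifier()

# get_params() returns a dict mapping parameter names to their current values
params = knn_default.get_params()
for name in sorted(params):
    print(name, '=', params[name])
```

Calling `help(KNeighborsClassifier)` prints the class docstring, which describes what each parameter does.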

**Step 3:** Fit the model with data 

- Model is "learning" the relationship between X and y in our "training data"
- Process through which learning occurs varies by model
- Subtlety: In general, this is training the model, although "instance-based" models such as KNN don't really have a training step.

In [63]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=1, p=2, weights='uniform')

- Once a model has been fit with data, it's called a "fitted model"

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Model uses the information it learned during the fitting / training process

In [64]:
# predict() expects a 2D input: one row per observation
knn.predict([[3, 22, 1, 0, 7.25, 1]])  # feature order: Pclass, Age, SibSp, Parch, Fare, Sex_num

array(['0'], dtype=object)

- Returns a NumPy array, and we keep track of what the numbers "mean"
- Can predict for multiple observations at once

In [65]:
X_new = [[3,22,1,0,7.25,1], [1,38,1,0,71.2833,2]]
knn.predict(X_new)

array(['0', '1'], dtype=object)

## Tuning a KNN model

In [66]:
# instantiate the model (using the value K=800)
knn = KNeighborsClassifier(n_neighbors=800)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)

array(['0', '0'], dtype=object)

In [67]:
# calculate predicted probabilities of class membership
knn.predict_proba(X_new)

array([[ 0.65875,  0.34125],
       [ 0.625  ,  0.375  ]])
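
With uniform weights (the default shown in the fitted model's parameters), each predicted probability is the fraction of the K neighbors that belong to that class. The sketch below reproduces the first row of the output above in plain Python 3. The neighbor counts 527 and 273 are inferred from 0.65875 × 800 and 0.34125 × 800 rather than read from the model, and `vote_fractions` is a hypothetical helper.

```python
from collections import Counter

def vote_fractions(neighbor_labels, classes):
    """Return, for each class, the share of neighbors carrying that label."""
    counts = Counter(neighbor_labels)
    k = len(neighbor_labels)
    return [counts[c] / k for c in classes]

# assumed neighbor labels: 527 of the 800 neighbors are '0', the other 273 are '1'
neighbor_labels = ['0'] * 527 + ['1'] * 273
probs = vote_fractions(neighbor_labels, classes=['0', '1'])
print(probs)
```

Because K=800 covers most of the 891 passengers, both rows end up close to the overall class proportions, which is why the two predictions are the same.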

In [69]:
# print distances to nearest neighbors (and their identities)
knn.kneighbors([[1, 38, 1, 0, 71.2833, 2]])  # 2D input: one row per observation

(array([[  0.        ,   3.01334679,   6.62674572,   7.87934455,
           7.89018054,   8.48537971,   8.77387365,  10.19476821,
          11.34346239,  11.43056546,  11.95891404,  12.01997749,
          12.274275  ,  12.3554776 ,  12.55537928,  12.64928175,
          12.64928175,  12.64928175,  12.68874812,  12.68874812,
          12.68874812,  12.68874812,  12.79351941,  12.92972269,
          14.09388339,  14.17510067,  14.17510067,  14.22163665,
          14.2798375 ,  14.61354252,  14.66284682,  14.78301919,
          14.81680317,  15.41576851,  15.50741303,  15.64505914,
          15.93777459,  16.04830081,  16.14528279,  16.14528279,
          16.14540612,  16.37075652,  16.43445152,  16.46329818,
          16.67133165,  17.05302393,  17.17703169,  17.2205679 ,
          17.23118565,  17.23118565,  17.45760223,  18.01860584,
          18.01860584,  18.01860584,  18.01860584,  18.11187971,
          18.11216215,  18.2382126 ,  18.29086179,  18.38520125,
          18.42911823,  1

## Comparing KNN with other models

Advantages of KNN:

- Simple to understand and explain
- Model fitting is fast
- Can be used for classification and regression!

Disadvantages of KNN:

- Must store all of the training data
- Prediction phase can be slow when n is large
- Sensitive to irrelevant features
- Sensitive to the scale of the data. Feature scaling is important!
- Accuracy is (generally) not competitive with the best supervised learning methods
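
To illustrate the point about scale, the sketch below uses the Age and Fare values of the first three passengers from the table earlier in this lab and measures how much of the squared Euclidean distance between two of them comes from Fare, before and after standardizing each feature. It assumes Python 3; `standardize` and `fare_share` are hypothetical helpers, and standardizing to mean 0 and standard deviation 1 is only one of several scaling choices.

```python
import statistics

# (Age, Fare) for Braund, Cumings and Heikkinen, taken from the table earlier in this lab
rows = [(22.0, 7.25), (38.0, 71.2833), (26.0, 7.925)]

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]

def fare_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the second feature (Fare)."""
    age_term = (a[0] - b[0]) ** 2
    fare_term = (a[1] - b[1]) ** 2
    return fare_term / (age_term + fare_term)

scaled = standardize(rows)
raw_share = fare_share(rows[0], rows[1])          # Braund vs. Cumings, raw units
scaled_share = fare_share(scaled[0], scaled[1])   # same pair after standardizing
print(round(raw_share, 2), round(scaled_share, 2))
```

On the raw values Fare accounts for most of the distance simply because its numbers are larger; after standardizing, Age and Fare contribute on a comparable footing.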