<img src="https://gist.githubusercontent.com/jakubczakon/10e5eb3d5024cc30cdb056d5acd3d92f/raw/5c464c16ccbc7150b4025e0a2a05b84ab99a7bc3/logo_DS_AI.png" alt="Drawing" width="600"/>

# deepsense.ai's workshop

# 1.2. *k* nearest neighbors (bikes)

In this notebook we use the *k* nearest neighbors algorithm to predict whether a day is a winter day.
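
As a reminder of the idea behind the algorithm, here is a minimal sketch of *k* nearest neighbors written by hand: find the *k* training points closest to a query point and take a majority vote over their labels. The helper `knn_predict` and the toy arrays are made up for illustration. The rest of the notebook uses scikit-learn's `KNeighborsClassifier` instead.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean (L2) distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# toy data (hypothetical, not from the bike dataset): two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([True, True, True, False, False, False])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_near_ones = knn_predict(X_toy, y_toy, np.array([1.0, 0.95]), k=3)
print(pred_near_origin, pred_near_ones)
```

This brute-force version computes every distance for each query, which is fine for a dataset of a few hundred rows.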

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

from sklearn.neighbors import KNeighborsClassifier

In [None]:
# again, let us use the Bike Sharing Dataset
df = pd.read_csv("data/Bike-Sharing-Dataset/day.csv")

## Initial exploration

In [None]:
# warning: the official description of the season codes is wrong!
seasons = {1: "winter", 2: "spring", 3: "summer", 4: "fall"}

In [None]:
# recoding seasons
df['season'] = df['season'].map(seasons)

In [None]:
# group by "season", select the "cnt" column, then take the mean
df.groupby("season")["cnt"].mean()

In [None]:
# let's define some colors we will be using
colors = {"winter": "#5555dd", "spring": "#55dd55", "summer": "#fcc969", "fall": "#dd5555"}

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
# temperature distribution in each season
# (kdeplot replaces distplot(hist=False), which is deprecated in recent seaborn)
for name, df_part in df.groupby("season")["temp"]:
    sns.kdeplot(df_part, label=name, color=colors[name], ax=ax)
ax.legend()

### Exercises

* Plot humidity by season.
* Plot casual and (on a separate plot) registered rentals by season.
* ★ Plot total usage by weekday.  

## Cross-validation

A random split into train and test sets is not a good idea here. Why?

Instead, we put the data from 2011 into the train set and the data from 2012 into the test set.

In [None]:
min(df.dteday), max(df.dteday)

In [None]:
train_mask = df.dteday < '2012-01-01'

In [None]:
df_train = df[train_mask]
df_test = df[~train_mask]

In [None]:
len(df_train), len(df_test)

In [None]:
df_train.tail()

In [None]:
df_test.head()
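
The mask above compares `dteday` as strings, which orders correctly only because the dates are stored in `YYYY-MM-DD` format. A slightly more defensive sketch parses the column into timestamps first. The toy frame below is made up for illustration and reuses only the column names from the dataset.

```python
import pandas as pd

# toy frame (hypothetical values) with the same column names as the dataset
df_toy = pd.DataFrame({
    "dteday": ["2011-12-30", "2011-12-31", "2012-01-01", "2012-01-02"],
    "cnt": [10, 20, 30, 40],
})

# parse the strings into timestamps so the comparison is chronological
dates = pd.to_datetime(df_toy["dteday"])
train_mask_toy = dates < pd.Timestamp("2012-01-01")

# everything before 2012 goes to train, the rest to test
df_train_toy = df_toy[train_mask_toy]
df_test_toy = df_toy[~train_mask_toy]
print(len(df_train_toy), len(df_test_toy))
```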

## *k* Nearest Neighbors for 3 variables

### Exercise

* What's the accuracy of the best constant model?

In [None]:
# let's predict whether the season is winter

feature_1 = "temp"
feature_2 = "casual"

# input
X_train = df_train[[feature_1, feature_2]]
X_test = df_test[[feature_1, feature_2]]

# output
Y_train = df_train["season"] == "winter"
Y_test = df_test["season"] == "winter"

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
Y_train.head()

In [None]:
Y_train.value_counts()

In [None]:
Y_test.head()

In [None]:
Y_test.value_counts()

In [None]:
data = X_train.assign(winter=Y_train)
# we could do the same with data by typing:
# data = X_train.copy()
# data['winter'] = Y_train

sns.lmplot(x=feature_1, y=feature_2, data=data, fit_reg=False, hue="winter")

In [None]:
sns.lmplot(x=feature_1, y=feature_2, data=X_test, fit_reg=False)

### Normalization

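- *k* nearest neighbors compares points by distance, so a feature measured on a larger scale (such as `casual` compared with `temp`) would dominate the result. We normalize the data so that no feature misleads the model merely because of its units.
- We transform the data so that each feature in the training set has mean 0 and standard deviation 1.
- Note that we transform both the train and the test data using the same statistics, computed on the training set only.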
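
The same idea can be sketched with scikit-learn's `StandardScaler`, which learns the statistics in `fit` and applies them in `transform`. The toy frames below are made up for illustration. One detail to keep in mind: `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to `ddof=1`, so the results differ slightly from the manual approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy data (hypothetical values); column names mirror the notebook's features
X_train_toy = pd.DataFrame({"temp": [0.2, 0.4, 0.6, 0.8],
                            "casual": [100.0, 300.0, 500.0, 700.0]})
X_test_toy = pd.DataFrame({"temp": [0.5, 0.9],
                           "casual": [400.0, 900.0]})

scaler = StandardScaler()
# fit learns the mean and std from the training data only
X_train_scaled = scaler.fit_transform(X_train_toy)
# the test data is transformed with the training statistics
X_test_scaled = scaler.transform(X_test_toy)

# the same transformation by hand; ddof=0 matches StandardScaler
m_toy = X_train_toy.mean()
s_toy = X_train_toy.std(ddof=0)
X_test_manual = ((X_test_toy - m_toy) / s_toy).to_numpy()

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After the transformation, the test set will generally not have mean 0 and standard deviation 1, and that is expected.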

In [None]:
m = X_train.mean()
s = X_train.std()

In [None]:
m

In [None]:
s

In [None]:
X_train = (X_train - m) / s
X_test = (X_test - m) / s

In [None]:
X_train.mean()

In [None]:
X_test.mean()

In [None]:
X_train.std()

In [None]:
X_test.std()

### Exercises

* We can normalize data 'on the fly' in two lines of code. Only one of the three propositions below does this correctly. Which one and why?
  * Proposition A
  ```
  X_train = (X_train - X_train.mean()) / X_train.std()
  X_test = (X_test - X_test.mean()) / X_test.std()
  ```
  
  * Proposition B
  ```
  X_train = (X_train - X_train.mean()) / X_train.std()
  X_test = (X_test - X_train.mean()) / X_train.std()
  ```
  
  * Proposition C
  ```
  X_test = (X_test - X_train.mean()) / X_train.std()
  X_train = (X_train - X_train.mean()) / X_train.std()
  ```
* Data can be normalized in other ways as well. One commonly used solution is to squeeze the data into a standard interval.
  * What `m` and `s` should we choose to squeeze training data into interval `[0,1]`?
  * ★ What `m` and `s` should we choose to squeeze training data into interval `[-1,1]`?

In [None]:
data = X_train.assign(winter=Y_train)
sns.lmplot(x=feature_1, y=feature_2, data=data, fit_reg=False, hue="winter")

In [None]:
sns.lmplot(x=feature_1, y=feature_2, data=X_test, fit_reg=False)

### Training the model

In [None]:
# creating a knn classifier
knn = KNeighborsClassifier(n_neighbors=3)

In [None]:
# training KNN model on data
knn.fit(X_train, Y_train)

In [None]:
# score - fraction of correct answers (accuracy)
# on the train set
knn.score(X_train, Y_train)

In [None]:
# score - fraction of correct answers (accuracy)
# on the test set
knn.score(X_test, Y_test)
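
To make the meaning of `score` concrete, the sketch below checks on made-up toy data that, for a classifier, it equals the fraction of predictions that match the true labels (the same number `accuracy_score` returns). All array names here are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# toy training data (hypothetical): one feature, two clusters
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array([True, True, True, False, False, False])

# toy evaluation data; the last label deliberately disagrees with its neighbors
X_new = np.array([[0.05], [0.15], [1.05], [0.3]])
y_new = np.array([True, True, False, False])

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

score = knn_toy.score(X_new, y_new)
# the same quantity computed by hand: share of matching predictions
manual = np.mean(knn_toy.predict(X_new) == y_new)
print(score, manual, accuracy_score(y_new, knn_toy.predict(X_new)))
```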

In [None]:
# let's check some other k
test_score_list = []
k_list = range(1, 201)

for k in k_list:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, Y_train)
    test_score_list.append(knn.score(X_test, Y_test))

In [None]:
# best k and best score
k_list[np.argmax(test_score_list)], test_score_list[np.argmax(test_score_list)]

In [None]:
sns.lineplot(x=k_list, y=test_score_list)

### Exercises

* What is the score for *k* around 200 and why? Answer without computations.
* What is the smallest *k* for which the score is the same as for 200 and why? Answer without computations.

In [None]:
# let's check k=41
knn = KNeighborsClassifier(n_neighbors=41)
# training the KNN model on data
knn.fit(X_train, Y_train)

In [None]:
knn.score(X_train, Y_train)

In [None]:
knn.score(X_test, Y_test)

### Exercises

* ★ Plot the scores for *k* between 75 and 100. Why does the plot form a 'ridge'?
* Repeat the procedure, but this time use `registered` instead of `temp`. Analyze the results.
* ★ Repeat the procedure for distance L1 instead of L2. Compare the results.
* ★★ Plot the decision boundaries.
* ★ We chose 'the best' *k* using the score on the test set. Is that reasonable?