# On k-Nearest Neighbors Modeling

The idea behind k-nearest neighbors is to predict the value of the dependent variable for a new point from the values of that variable at _nearby_ points.
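To make the idea concrete, here is a minimal sketch of a nearest-neighbor vote written from scratch. The function name `knn_predict`, the toy points, and the labels are all made up for illustration, and Euclidean distance is assumed; this is not how scikit-learn implements it internally.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points closest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among those neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two small clusters with different labels.
train_points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
train_labels = ['low', 'low', 'low', 'high', 'high', 'high']

prediction = knn_predict(train_points, train_labels, query=(2, 2), k=3)
print(prediction)
```

The query point `(2, 2)` sits next to the first cluster, so its three nearest neighbors all carry the label `'low'`.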

We have a dataset of people who regularly check the weather. We're purveyors of SmartWatches, and we're interested in whether people would be likely to check the weather on a SmartWatch, if they had one.

As predictors of this, we'll focus on gender and age group.

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
weather = pd.read_csv('weather-check.csv')

In [20]:
weather.head()

| Unnamed: 0 | RespondentID | Do you typically check a daily weather report? | How do you typically check the weather? | A specific website or app (please provide the answer) | If you had a smartwatch (like the soon to be released Apple Watch), how likely or unlikely would you be to check the weather on that device? | Age | What is your gender? | How much total combined money did all members of your HOUSEHOLD earn last year? | US Region |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3887201482 | Yes | The default weather app on your phone | - | Very likely | 30 - 44 | Male | $50,000 to $74,999 | South Atlantic |
| 1 | 3887159451 | Yes | The default weather app on your phone | - | Very likely | 18 - 29 | Male | Prefer not to answer | - |
| 2 | 3887152228 | Yes | The default weather app on your phone | - | Very likely | 30 - 44 | Male | $100,000 to $124,999 | Middle Atlantic |
| 3 | 3887145426 | Yes | The default weather app on your phone | - | Somewhat likely | 30 - 44 | Male | Prefer not to answer | - |
| 4 | 3887021873 | Yes | A specific website or app (please provide the ... | Iphone app | Very likely | 30 - 44 | Male | $150,000 to $174,999 | Middle Atlantic |


In [21]:
weather.rename({'If you had a smartwatch (like the soon to be released Apple Watch), how likely or unlikely would you be to check the weather on that device?': 'watch'}, axis=1, inplace=True)

In [22]:
weather['watch'].value_counts()

Very likely          362
Somewhat likely      274
Very unlikely        208
Somewhat unlikely     73
-                     11
Name: watch, dtype: int64

In [23]:
# Encode the ordinal responses as integers.
# '-' (no answer) becomes its own class, 0.
watch_dict = {'Very likely': 4, 'Somewhat likely': 3,
              'Somewhat unlikely': 2, 'Very unlikely': 1,
              '-': 0}
weather['watch'] = weather['watch'].map(watch_dict)

In [24]:
weather.rename({'What is your gender?': 'gender'}, axis=1, inplace=True)

In [25]:
weather['gender'].value_counts()

Female    527
Male      389
-          12
Name: gender, dtype: int64

In [26]:
# Note: '-' (no answer) is coded as 0, the same value as 'Male'.
gender_dict = {'Female': 1, 'Male': 0, '-': 0}
weather['gender'] = weather['gender'].map(gender_dict)

In [27]:
weather['Age'].value_counts()

45 - 59    278
60+        258
30 - 44    204
18 - 29    176
-           12
Name: Age, dtype: int64

In [28]:
age_dict = {'60+': 4, '45 - 59': 3, '30 - 44': 2, '18 - 29': 1, '-': 0}
weather['Age'] = weather['Age'].map(age_dict)

In [29]:
X = weather[['gender', 'Age']]
y = weather['watch']

In [30]:
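# Default split: 75% train, 25% test. No random_state is set,
# so the split (and the score below) will differ from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y)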

In [31]:
knn = KNeighborsClassifier(n_neighbors=1, p=100)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=100,
           weights='uniform')

In [32]:
knn.score(X_test, y_test)

0.31896551724137934
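One way to put this accuracy in context is to compare it with always guessing the most common answer. The sketch below is only a rough reference point: it reuses the class counts printed by `weather['watch'].value_counts()` earlier, which cover the whole dataset rather than the test split, and the variable names are chosen here for illustration.

```python
# Class counts copied from the value_counts() output for the 'watch' column.
counts = {
    'Very likely': 362,
    'Somewhat likely': 274,
    'Very unlikely': 208,
    'Somewhat unlikely': 73,
    '-': 11,
}

total = sum(counts.values())
# The label a "always predict the majority class" rule would choose.
majority_label = max(counts, key=counts.get)
# Share of respondents that rule would get right on the full dataset.
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 3))
```

On these counts the majority-class share comes out at roughly 0.39, so the 1-nearest-neighbor score above does not beat that simple rule on this particular split.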

In [33]:
knn.predict_proba([[1, 1]])

array([[0., 0., 0., 0., 1.]])

In [34]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, knn.predict(X_test))

array([[ 5,  0,  0,  0,  0],
       [ 0,  0,  7,  2, 47],
       [ 0,  0,  1,  1, 18],
       [ 1,  0, 11,  9, 52],
       [ 0,  0, 17,  2, 59]])

## Now the Question: How the Heck Did This Work?

More precisely:
- What are these different parameters? (Especially: 'metric', 'p', 'weights', and 'n_neighbors')
- How is distance calculated? (What is 'Minkowski'?)
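As a first step toward the distance question, here is a small sketch of the Minkowski distance, which sums the absolute coordinate differences raised to the power `p` and then takes the `p`-th root. The helper `minkowski_distance` and the two sample points are written here for illustration and are not part of scikit-learn. With `p=1` it gives the Manhattan distance, with `p=2` the Euclidean distance, and for a large `p` such as the `p=100` used above it approaches the largest single-coordinate difference.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# Two hypothetical (gender, Age) points using the integer codes from above.
a = (1, 1)
b = (0, 4)

manhattan = minkowski_distance(a, b, p=1)    # |1 - 0| + |1 - 4|
euclidean = minkowski_distance(a, b, p=2)    # sqrt(1**2 + 3**2)
large_p = minkowski_distance(a, b, p=100)    # close to max(1, 3)

print(manhattan, euclidean, large_p)
```

As for the other parameters in the list: `n_neighbors` is the k in k-nearest neighbors, and `weights='uniform'` means each of those neighbors gets an equal vote.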