# On k-Nearest Neighbors Modeling

The idea of k-nearest neighbors is to make predictions about the values of your dependent variable of interest based on the values of that variable for _nearby_ points.
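
To make the idea concrete before turning to scikit-learn, here is a minimal from-scratch sketch of a k-nearest neighbors classifier. The function name `knn_predict` and the toy points are made up for illustration, and it assumes Euclidean distance with a simple majority vote; it is not how scikit-learn implements the search internally.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points (sort by distance only)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two clusters with labels 'a' and 'b'
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']

prediction_near_origin = knn_predict(points, labels, (0.5, 0.5), k=3)
prediction_far = knn_predict(points, labels, (5.5, 5.5), k=3)
```

A query point takes the label that is most common among the training points closest to it.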

We have a dataset of people who regularly check the weather. We're purveyors of SmartWatches, and we're interested in whether people would be likely to check the weather on a SmartWatch if they had one.

As predictors of this, we'll focus on gender and age group.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
weather = pd.read_csv('weather-check.csv')

In [2]:
weather.head()

Unnamed: 0,RespondentID,Do you typically check a daily weather report?,How do you typically check the weather?,A specific website or app (please provide the answer),"If you had a smartwatch (like the soon to be released Apple Watch), how likely or unlikely would you be to check the weather on that device?",Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,3887201482,Yes,The default weather app on your phone,-,Very likely,30 - 44,Male,"$50,000 to $74,999",South Atlantic
1,3887159451,Yes,The default weather app on your phone,-,Very likely,18 - 29,Male,Prefer not to answer,-
2,3887152228,Yes,The default weather app on your phone,-,Very likely,30 - 44,Male,"$100,000 to $124,999",Middle Atlantic
3,3887145426,Yes,The default weather app on your phone,-,Somewhat likely,30 - 44,Male,Prefer not to answer,-
4,3887021873,Yes,A specific website or app (please provide the ...,Iphone app,Very likely,30 - 44,Male,"$150,000 to $174,999",Middle Atlantic


In [3]:
# Let's take the 'if you had a smartwatch' column as our target
# ... and rename it.
weather.rename({'If you had a smartwatch (like the soon to be released Apple Watch), how likely or unlikely would you be to check the weather on that device?': 'watch'}, 
               axis=1, inplace=True)


In [4]:
# Let's look at the different values of this column.

weather['watch'].value_counts()

Very likely          362
Somewhat likely      274
Very unlikely        208
Somewhat unlikely     73
-                     11
Name: watch, dtype: int64

In [5]:
# Let's build a dictionary and replace these text
# values with numbers. We'll code the '-' (no answer)
# responses as 0. (Is this a good idea in this case?)

watch_dict = {'Very likely':4, 'Somewhat likely':3, 'Somewhat unlikely':2, 'Very unlikely':1, '-':0}
weather['watch'] = weather['watch'].map(watch_dict)
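
One possible answer to the question in the comment: coding "no answer" as 0 turns it into a fifth class that the classifier will try to predict. A hedged alternative is to leave `'-'` out of the dictionary so that `.map()` yields `NaN`, then drop those rows. The toy Series below is a made-up stand-in for the survey column, not the real data.

```python
import pandas as pd

# Toy stand-in for the survey column (values made up for illustration)
responses = pd.Series(['Very likely', '-', 'Somewhat unlikely', 'Very unlikely', '-'])

likelihood_codes = {'Very likely': 4, 'Somewhat likely': 3,
                    'Somewhat unlikely': 2, 'Very unlikely': 1}

# '-' is not in the dictionary, so .map() turns it into NaN
coded = responses.map(likelihood_codes)

# Drop the unanswered rows rather than treating them as a fifth class
answered = coded.dropna().astype(int)
```

Whether dropping rows is appropriate depends on how many are missing and why. Here only a handful of respondents left the question blank.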


In [6]:
# Let's do the same thing with the 'what is your gender'
# column: rename and remap

weather.rename({'What is your gender?': 'gender'}, 
               axis=1, inplace=True)


In [7]:
# values
weather['gender'].value_counts()


Female    527
Male      389
-          12
Name: gender, dtype: int64

In [8]:
# remap (note: '-' (no answer) is coded 0, the same as 'Male')
gender_dict = {'Female': 1, 'Male': 0, '-': 0}
weather['gender'] = weather['gender'].map(gender_dict)

In [9]:
weather['gender']

0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10     0
11     0
12     0
13     0
14     0
15     0
16     0
17     0
18     1
19     0
20     0
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     1
29     0
      ..
898    1
899    0
900    1
901    1
902    0
903    1
904    0
905    1
906    1
907    1
908    1
909    1
910    1
911    1
912    1
913    1
914    0
915    1
916    1
917    1
918    1
919    1
920    1
921    1
922    1
923    1
924    1
925    1
926    0
927    0
Name: gender, Length: 928, dtype: int64

In [10]:
# Age column: check values

weather['Age'].value_counts()

45 - 59    278
60+        258
30 - 44    204
18 - 29    176
-           12
Name: Age, dtype: int64

In [11]:
# Let's replace these values
# with numbers also. We'll code
# '-' (no answer) as 0. Is this
# appropriate in this case?

age_dict = {'60+': 4, '45 - 59':3, '30 - 44':2, '18 - 29':1, '-':0}
weather['Age'] = weather['Age'].map(age_dict)

In [12]:
weather['Age']

0      2
1      1
2      2
3      2
4      2
5      1
6      2
7      0
8      2
9      2
10     1
11     2
12     0
13     2
14     1
15     1
16     2
17     2
18     2
19     1
20     1
21     2
22     2
23     2
24     0
25     2
26     1
27     2
28     3
29     1
      ..
898    1
899    4
900    2
901    2
902    3
903    4
904    4
905    3
906    3
907    2
908    3
909    2
910    2
911    3
912    2
913    3
914    2
915    3
916    3
917    3
918    2
919    1
920    3
921    2
922    4
923    3
924    2
925    4
926    0
927    0
Name: Age, Length: 928, dtype: int64

In [13]:
# Let's use gender and age as our predictors.
# Our target will be 'watch'.

X = weather[['gender', 'Age']]
y = weather['watch']



In [14]:
# Train-test splitting

X_train, X_test, y_train, y_test = train_test_split(X,y)
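
By default `train_test_split` shuffles randomly and holds out 25% of the rows, so every run gives a different split and a different score. The sketch below shows one way to make the split reproducible and keep class proportions balanced. The toy DataFrame and the names `X_toy`, `y_toy`, `X_tr`, `X_te` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded survey data (values made up for illustration)
toy = pd.DataFrame({
    'gender': [0, 1] * 20,
    'Age':    [1, 2, 3, 4] * 10,
    'watch':  [1, 2, 3, 4] * 10,
})
X_toy = toy[['gender', 'Age']]
y_toy = toy['watch']

# random_state fixes the shuffle; stratify keeps class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
```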


In [15]:
# Now let's fit a knn classifier to our training data
# I had the best luck with p=100, n_neighbors=1
# (with n_neighbors=1, the weights setting does not change the predictions)
knn = KNeighborsClassifier(n_neighbors=1,p=100, weights = 'distance')
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=100,
           weights='uniform')

#### KNeighborsClassifier possible input parameters
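- n_neighbors: number of closest neighbors to use
- weights: we may want to weight/bias the classifier so that closer points count MORE (possibilities: 'uniform', 'distance')
- metric: distance metric; the default is 'minkowski'
- p: power parameter for the Minkowski metric (p=1 is Manhattan distance, p=2 is Euclidean distance; the default is 2)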
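
Rather than trying parameter values by hand and comparing test scores, one common approach is a cross-validated grid search on the training data. This is a minimal sketch on synthetic data from `make_classification`, used only so it runs on its own; the grid values are arbitrary choices, not recommendations for the weather data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Candidate settings (arbitrary illustrative values)
param_grid = {
    'n_neighbors': [1, 5, 15],
    'p': [1, 2],
    'weights': ['uniform', 'distance'],
}

# 5-fold cross-validation scores every combination in the grid
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
```

Because the settings are chosen by cross-validation, the test set stays untouched until the final evaluation.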

In [16]:
# We can score it on our test set.
# What is the score here?

knn.score(X_test, y_test) 
# For a classifier, .score() returns the accuracy, not the R^2

0.22844827586206898

In [23]:
# Let's use the predict_proba() method on a possible input

knn.predict_proba([[3,3]]) # columns are [gender, Age]: gender=3 is outside the 0/1 coding, Age=3 is '45 - 59'
# (an 18-29 year old female would be [[1, 1]])

# each entry is the predicted probability of the matching class in knn.classes_ (ascending order)

array([[0., 0., 0., 1., 0.]])

In [18]:
knn.classes_

array([0, 1, 2, 3, 4])
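
The columns of `predict_proba` line up with `classes_`, so the two can be zipped together to label each probability. A small sketch on a made-up training set (the names `X_small`, `y_small`, and `model` are hypothetical, and `n_neighbors=3` is chosen so the probabilities are not all 0 or 1):

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set: columns are [gender, Age], labels are 'watch' codes
X_small = [[0, 1], [0, 2], [1, 1], [1, 3], [0, 4], [1, 4]]
y_small = [4, 4, 3, 2, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_small, y_small)

# One row of probabilities for one query point
probabilities = model.predict_proba([[1, 1]])[0]

# classes_ gives the label for each probability column, in the same order
class_probabilities = {
    int(label): float(prob)
    for label, prob in zip(model.classes_, probabilities)
}
```

With uniform weights, each probability is the fraction of the k nearest neighbors that carry that label.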

In [24]:
# Let's import the confusion matrix metric and construct
# a matrix for our data

from sklearn.metrics import confusion_matrix
#confusion matrix: table of counts; rows are the true classes, columns are the predicted classes

confusion_matrix(y_test, knn.predict(X_test))

array([[ 3,  0,  0,  0,  0],
       [ 0,  6,  9, 44,  0],
       [ 0,  7,  2, 12,  0],
       [ 0,  8, 14, 42,  0],
       [ 0, 18, 19, 48,  0]])

In [26]:
(42+2+6+3) / 232 #sum of numbers on main diagonal (correctly classified) divided by total

#number should match knn.score

0.22844827586206898
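
The same arithmetic can be done with NumPy instead of typing the diagonal by hand. The matrix below is copied from the confusion matrix output above; the per-class recall is an extra illustration, not part of the original notebook.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([[ 3,  0,  0,  0,  0],
               [ 0,  6,  9, 44,  0],
               [ 0,  7,  2, 12,  0],
               [ 0,  8, 14, 42,  0],
               [ 0, 18, 19, 48,  0]])

# Accuracy: correctly classified points (main diagonal) over all points
accuracy = np.trace(cm) / cm.sum()

# Recall per true class: diagonal entry divided by its row total
recall_per_class = np.diag(cm) / cm.sum(axis=1)
```

The last column is all zeros, so this classifier never predicted class 4 ('Very likely') on the test set.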

## Now the question: how the heck did this work?

More precisely:
- What are these different parameters? (Especially: 'metric', 'p',
    'weights', and 'n_neighbors')
- How is distance calculated? (What is 'Minkowski'?)
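
As a starting point for the distance question, the Minkowski distance between two points a and b is (sum over i of |a_i - b_i|^p)^(1/p). The sketch below computes it directly with NumPy. The function name `minkowski_distance` and the two encoded respondents are made up for illustration.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance: (sum_i |a_i - b_i|**p) ** (1/p)."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

# Two hypothetical respondents encoded as [gender, Age]
a = [0, 1]
b = [1, 4]

manhattan = minkowski_distance(a, b, p=1)   # |1| + |3|
euclidean = minkowski_distance(a, b, p=2)   # sqrt(1 + 9)
large_p = minkowski_distance(a, b, p=100)   # close to max(|1|, |3|)
```

With a very large `p`, such as the `p=100` used above, the largest single coordinate difference dominates, so the distance behaves much like the Chebyshev (maximum) distance.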