# Supervised Learning and K Nearest Neighbors Exercises

## Introduction

We will be using customer churn data from the telecom industry for the first week's exercises. The data file is called 
`Orange_Telecom_Churn_Data.csv`. We will load this data together, do some preprocessing, and use K-nearest neighbors to predict customer churn based on account characteristics.

In [5]:
from __future__ import print_function
import os
data_path = ['data']

## Question 1

* Begin by importing the data. Examine the columns and data.
* Notice that the data contains a state, area code, and phone number. Do you think these are good features to use when building a machine learning model? Why or why not? 

We will not be using them, so they can be dropped from the data.
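One way to approach the question above is to check how many distinct values each identifier-like column takes. The sketch below uses a small made-up table (the rows and values are invented for illustration and are not taken from the churn data) and reports the fraction of unique values per column. A column that is unique for nearly every row, such as a phone number, behaves like a row ID and gives a model nothing to generalize from.

```python
import pandas as pd

# Hypothetical toy rows; these values are invented for illustration only.
toy = pd.DataFrame({
    'state': ['KS', 'OH', 'NJ', 'OH', 'OK', 'KS'],
    'area_code': [415, 415, 408, 510, 415, 408],
    'phone_number': ['382-4657', '371-7191', '358-1921',
                     '375-9999', '330-6626', '391-8027'],
})

# Fraction of distinct values per column: 1.0 means every row is unique.
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)
```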

In [19]:
import pandas as pd

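# Build the path to the CSV file and load it into a DataFrame
filepath = os.sep.join(data_path + ['Orange_Telecom_Churn_Data.csv'])
data = pd.read_csv(filepath)

# Drop the identifier-like columns that will not be used as features
data.drop(['state','area_code','phone_number'], axis = 1, inplace = True)

data.head()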

| | account_length | intl_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | 107 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | 137 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | 84 | yes | no | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | 75 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


## Question 2

* Notice that some of the columns are categorical data and some are floats. These features will need to be numerically encoded using one of the methods from the lecture.
* Finally, remember from the lecture that K-nearest neighbors requires scaled data. Scale the data using one of the scaling methods discussed in the lecture.
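To see why scaling matters for a distance-based method, the sketch below uses two made-up features on very different scales (the values are invented for illustration) and compares each feature's contribution to the squared Euclidean distance between two rows, before and after `StandardScaler`. It is an illustration under those assumptions, not part of the exercise solution.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: minutes (in the hundreds) and a 0/1 plan flag.
X_toy = np.array([
    [265.1, 0.0],
    [161.6, 1.0],
    [243.4, 0.0],
    [299.4, 1.0],
])

# Unscaled: the difference between rows 0 and 1 is dominated by the minutes column.
raw_diff = X_toy[0] - X_toy[1]
print(raw_diff ** 2)

# StandardScaler gives each column zero mean and unit variance,
# so both columns contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_diff = X_scaled[0] - X_scaled[1]
print(scaled_diff ** 2)
```

Without scaling, the nearest neighbors would be chosen almost entirely by the minutes column, and the plan flag would be effectively ignored.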

In [27]:
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical columns with pandas.get_dummies;
# drop_first=True keeps a single 0/1 column for each binary category
data_encoded = pd.get_dummies(data, columns=['intl_plan', 'voice_mail_plan', 'churned'], drop_first=True)

# Standardize every column to zero mean and unit variance
# (StandardScaler is used here rather than MinMaxScaler)
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data_encoded), columns=data_encoded.columns)


scaled_data.head()



| | account_length | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | intl_plan_yes | voice_mail_plan_yes | churned_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.698941 | 1.273145 | 1.573802 | 0.502824 | 1.574074 | -0.064032 | -0.060077 | -0.063849 | 0.876999 | -0.446928 | 0.876286 | -0.094809 | -0.584236 | -0.095509 | -0.436676 | -0.32324 | 1.66712 | -0.405816 |
| 1 | 0.169849 | 1.346973 | -0.346802 | 1.158422 | -0.347082 | -0.101621 | 0.141693 | -0.101089 | 1.068992 | 0.154374 | 1.069818 | 1.245227 | -0.584236 | 1.245982 | -0.436676 | -0.32324 | 1.66712 | -0.405816 |
| 2 | 0.925695 | -0.572549 | 1.171125 | 0.704546 | 1.171286 | -1.571562 | 0.494791 | -1.572084 | -0.748012 | 0.204483 | -0.746737 | 0.701969 | 0.229917 | 0.695971 | -1.202236 | -0.32324 | -0.599837 | -0.405816 |
| 3 | -0.409634 | -0.572549 | 2.210292 | -1.463971 | 2.210457 | -2.744745 | -0.614946 | -2.745155 | -0.06911 | -0.547145 | -0.069377 | -1.326194 | 1.044069 | -1.329681 | 0.328885 | 3.093675 | -0.599837 | -0.405816 |
| 4 | -0.636388 | -0.572549 | -0.252163 | 0.654116 | -0.252115 | -1.035419 | 1.100103 | -1.034426 | -0.267041 | 1.056327 | -0.267307 | -0.058592 | -0.584236 | -0.055264 | 1.094445 | 3.093675 | -0.599837 | -0.405816 |


## Question 3

* Separate the feature columns (everything except `churned`) from the label (`churned`). This will create two tables.
* Fit a K-nearest neighbors model with a value of `k=3` to this data and predict the outcome on the same data.

In [31]:
from sklearn.neighbors import KNeighborsClassifier


X = scaled_data.drop('churned_True', axis=1)  # features: every scaled column except the label
y = data_encoded['churned_True']  # label: taken from the unscaled frame so it stays boolean


knn = KNeighborsClassifier(n_neighbors=3)


knn.fit(X, y)

predictions = knn.predict(X)

print(predictions[:10])




[False False False False False  True False False False False]


## Question 4

Ways to measure error haven't been discussed in class yet, but accuracy is an easy one to understand: it is simply the fraction of labels that were correctly predicted (either true or false).

* Write a function to calculate accuracy using the actual and predicted labels.
* Using the function, calculate the accuracy of this K-nearest neighbors model on the data.

In [36]:
def accuracy(real, predicted):
    return sum(real == predicted) / float(real.shape[0])

print(accuracy(y, predictions))  

print(y.head())  
print(predictions[:5])  


0.9396
0    False
1    False
2    False
3    False
4    False
Name: churned_True, dtype: bool
[False False False False False]


## Question 5

* Fit the K-nearest neighbors model again with `n_neighbors=3`, but this time use distance for the weights. Calculate the accuracy using the function you created above.
* Fit another K-nearest neighbors model. This time use uniform weights, but set the power parameter for the Minkowski distance metric to 1 (`p=1`), i.e. Manhattan distance.

When weighted distances are used for part 1 of this question, a value of 1.0 should be returned for the accuracy. Why do you think this is? *Hint:* we are predicting on the same data that was used for fitting, and with KNN the model *is* the data. We will learn how to avoid this pitfall in the next lecture.
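As a reminder of what `p` controls: the Minkowski distance between two points is the sum of the absolute coordinate differences, each raised to the power `p`, with the total then raised to `1/p`. Setting `p=1` gives Manhattan distance and `p=2` gives Euclidean distance. A minimal sketch with two made-up points (the `minkowski` helper is written here for illustration and is not a scikit-learn function):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7
euclidean = minkowski(a, b, p=2)  # sqrt(9 + 16) = 5
print(manhattan, euclidean)
```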

In [40]:
from sklearn.neighbors import KNeighborsClassifier


knn_distance = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_distance.fit(X, y)  
y_pred_distance = knn_distance.predict(X)


print(accuracy(y, y_pred_distance)) 


knn_manhattan = KNeighborsClassifier(n_neighbors=3, weights='uniform', p=1)
knn_manhattan.fit(X, y)
y_pred_manhattan = knn_manhattan.predict(X)


print(accuracy(y, y_pred_manhattan))


1.0
0.9416
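The 1.0 above can be reproduced on synthetic data. In the sketch below (random features and random labels, generated only for illustration), every training point is its own nearest neighbor at distance zero. With `weights='distance'`, a zero-distance neighbor determines the prediction, so accuracy on the training data is perfect even though the labels are pure noise.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration: labels are unrelated to the features.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 5))
y_toy = rng.randint(0, 2, size=200)

knn_toy = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_toy.fit(X_toy, y_toy)

# Predict on the same rows the model was fit on.
train_accuracy = (knn_toy.predict(X_toy) == y_toy).mean()
print(train_accuracy)
```

A perfect score here says nothing about how the model would perform on customers it has not seen.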


## Question 6

* Fit a K-nearest neighbors model using values of `k` (`n_neighbors`) ranging from 1 to 20. Use uniform weights (the default). The power parameter for the Minkowski distance (`p`) can be set to either 1 or 2; just be consistent. Store the accuracy and the value of `k` used from each of these fits in a list or dictionary.
* Plot (or view the table of) `accuracy` vs `k`. What do you notice happens when `k=1`? Why do you think this is? *Hint:* it's for the same reason discussed above.

In [42]:
!pip install matplotlib seaborn
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

score_list = []

# Fit one model per k, using uniform weights and the default p=2 throughout
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn.fit(X, y)
    y_pred = knn.predict(X)
    score = accuracy(y, y_pred)
    score_list.append((k, score))

# Collect the (k, accuracy) pairs into a table for viewing and plotting
score_df = pd.DataFrame(score_list, columns=['k', 'accuracy'])

sns.set_context('talk')
sns.set_style('ticks')
sns.set_palette('dark')

plt.figure(figsize=(10, 6))
ax = sns.lineplot(data=score_df, x='k', y='accuracy')
ax.set(xlabel='k', ylabel='Accuracy')
ax.set_xticks(range(1, 21))
plt.title("KNN Accuracy vs. k")
plt.show()