<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# KNN Classification and Imputation: Cell Phone Churn Data

_Authors: Kiefer Katovich (SF)_

---

In this lab you will practice using KNN for classification (and a little bit for regression as well).

The dataset covers "churn" in cell phone plans. It records how each account holder used their phone and whether or not they churned.

Our goal is to predict whether a user will churn or not based on the other features.

We will also be using the KNN model to **impute** missing data. There are a couple of columns in the dataset with missing values, and we can build KNN models to predict what those missing values will most likely be. This is a more advanced imputation method than just filling in the mean or median.
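To make the imputation idea concrete, here is a minimal sketch on made-up data. The frame `toy` and its columns `x1`, `x2`, and `target_col` are hypothetical stand-ins, not columns of the churn data. A KNN regressor is fit on the rows where the column is known, then used to predict the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: two complete predictors and one column with gaps.
rng = np.random.RandomState(0)
toy = pd.DataFrame({'x1': rng.normal(size=50), 'x2': rng.normal(size=50)})
toy['target_col'] = 2 * toy['x1'] - toy['x2']
toy.loc[[3, 17, 41], 'target_col'] = np.nan

# Fit on rows where the column is present, predict the rows where it is missing.
missing = toy['target_col'].isnull()
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(toy.loc[~missing, ['x1', 'x2']], toy.loc[~missing, 'target_col'])
toy.loc[missing, 'target_col'] = knn.predict(toy.loc[missing, ['x1', 'x2']])

print(toy['target_col'].isnull().sum())
```

For a categorical column, a `KNeighborsClassifier` would play the same role as the regressor here.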

In [10]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

### 1. Load the cell phone "churn" data containing some missing values.

In [11]:
churn = pd.read_csv('../assets/data/churn_missing.csv')

### 2. Examine the data. What columns have missing values?

Remember to use our standard four or five commands (`head`, `describe`, `info`, `isnull`, `dtypes`, ...).

In [3]:
# A:

### 3. Convert the `vmail_plan` and `intl_plan` columns to binary integer columns.

If a value is missing, do not fill it in with a new value. Preserve the missing values.

In [4]:
# A: Some code to help you. It maps 'yes' to 1 and 'no' to 0, and leaves missing values untouched.
churn.loc[:,'vmail_plan'] = churn.vmail_plan.map(lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)
churn.loc[:,'intl_plan'] = churn.intl_plan.map(lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)

### 4. Create dummy-coded columns for state and concatenate them to the churn dataset.

> **Remember:** You will need to leave out one of the state dummy coded columns to serve as the "reference" column since we will be using these for modeling.

Use `pd.get_dummies(..., drop_first=True)`.
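As a small illustration of what `drop_first=True` does, the sketch below uses a made-up four-row frame (`toy` is hypothetical, not the churn data). With three distinct states, only two dummy columns come back; the dropped one becomes the reference level.

```python
import pandas as pd

# Hypothetical toy column with three distinct states.
toy = pd.DataFrame({'state': ['CA', 'NY', 'TX', 'CA']})

# drop_first=True drops the first category in sorted order ('CA' here),
# so a row with all dummies equal to 0 means "CA".
dummies = pd.get_dummies(toy['state'], drop_first=True)
print(dummies.columns.tolist())
```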

In [6]:
# A:
states = pd.get_dummies(churn.state, drop_first=True)
states.head(3)

Unnamed: 0,AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
states.shape

(3333, 50)

In [8]:
## Concatenate back to a single dataset
churn = pd.concat([churn, states], axis=1)

### 5. Create a version of the churn data that has no missing values.

Use `dropna()`, then report the shape of the result.
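As a quick reminder of how `dropna()` behaves, here is a sketch on a made-up frame (`toy` and its columns are hypothetical). By default it drops every row that has at least one missing value and returns a new frame, leaving the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 is missing 'a', row 2 is missing 'b'.
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                    'b': ['x', 'y', None, 'z']})

# dropna() returns a copy without the incomplete rows.
toy_nona = toy.dropna()
print(toy.shape, toy_nona.shape)
```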

In [6]:
# A:

churn_nona = 

### 6. Create a target vector and predictor matrix.

- Target should be the `churn` column.
- Predictor matrix should be all columns except `area_code`, `state`, and `churn`.

In [None]:
# A:
X = churn_nona.drop(['area_code','state','churn'], axis =1)
y = churn_nona.churn.values

### 7. Calculate the baseline accuracy for `churn`.

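What fraction of the churn target values (`y`) equal 1? This is just the mean of the column (why is that?). The baseline accuracy is the proportion of the majority class, that is, the larger of this mean and one minus it.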
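A minimal sketch of the calculation on a made-up target vector (`y_toy` is hypothetical). Because the values are 0 and 1, the sum counts the 1s, so the mean is the fraction of 1s. The baseline is the accuracy you would get by always predicting the more common class.

```python
import numpy as np

# Hypothetical binary target: 2 churners out of 10 accounts.
y_toy = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

# Mean of a 0/1 vector is the fraction of 1s.
churn_rate = y_toy.mean()

# Always guessing the majority class is right this often.
baseline = max(churn_rate, 1 - churn_rate)
print(churn_rate, baseline)
```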

In [8]:
# A:


### 8. Cross-validate a KNN model predicting `churn`. 

- Number of neighbors should be 5.
- Make sure to standardize the predictor matrix.
- Set cross-validation folds to 10.

Report the mean cross-validated accuracy.
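One way to combine these three requirements is sketched below on synthetic data from `make_classification` (the names `X_toy`, `y_toy`, and `model` are hypothetical). Putting the scaler in a pipeline is a design choice rather than something the lab prescribes: it refits the scaler on each training fold, so the held-out fold does not influence the scaling. Standardizing the whole matrix once up front is the simpler alternative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictor matrix and target.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Standardize, then classify with the 5 nearest neighbors.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 10-fold cross-validation; the default scoring for a classifier is accuracy.
scores = cross_val_score(model, X_toy, y_toy, cv=10)
print(np.mean(scores))
```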

In [9]:
# A:

### 9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.

Plot the cross-validated mean accuracy against k. What is the best accuracy?

In [10]:
# A:
k_values = list(range(1, 50, 2))
accs = []
for k in k_values:
    knn = #fill in here
    scores = cross_val_score(#fill in here)
    accs.append(np.mean(#fill in here))

In [None]:
fig, ax = plt.subplots(figsize=(8,5))
ax.plot(k_values, accs, lw=3)
ax.set_xlabel('k (number of neighbors)')
ax.set_ylabel('mean cross-validated accuracy')
plt.show()

print(np.max(accs))
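It is often useful to know which k produced the best accuracy, not just the accuracy itself. The sketch below runs the same kind of search on synthetic data (`X_toy`, `y_toy`, `accs_toy`, and the pipeline are hypothetical stand-ins, not the lab's answer) and uses `np.argmax` to recover the winning k.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized predictors and the target.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

k_values = list(range(1, 50, 2))  # odd k from 1 to 49
accs_toy = []
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_toy, y_toy, cv=10)
    accs_toy.append(np.mean(scores))

# argmax returns the position of the first maximum; map it back to its k.
best_idx = int(np.argmax(accs_toy))
best_k, best_acc = k_values[best_idx], accs_toy[best_idx]
print(best_k, best_acc)
```

If several values of k tie, `np.argmax` returns the smallest of them.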