# KNN classifier

The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification (and also regression). In k-NN classification, the input consists of the k closest training examples in the feature space.
The output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.
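
The plurality vote described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not how sklearn implements it: the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for this example, and ties are broken arbitrarily (first label encountered).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by a plurality vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters in 2D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.1, 0.1]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([1.0, 0.9]), k=3)
print(pred_a, pred_b)
```

Each query is assigned the label of the cluster it sits in, because all three of its nearest neighbors belong to that cluster.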

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

In [None]:
# call the seeding function (assigning to it would overwrite it and seed nothing)
np.random.seed(10)

## Data loading

We will use the iris data-set, which we can load directly from the sklearn library. This data-set has 3 class labels.

In [None]:
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
X = iris.data
y = iris.target
print(X.shape, y.shape)

## First KNN classifier
Let's train a KNN classifier with $k = 5$.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

As you know, there is no real training step in KNN classification: we simply memorize the features and class labels and use them to classify a new, unseen data point (a query). Therefore the usual notion of a train/test split makes little sense here. However, we still split the data so that the points in the test set can serve as queries to the KNN classifier, which lets us compute its accuracy.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

In [None]:
knn.fit(X_train, y_train)

In [None]:
# test accuracy
knn.score(X_test, y_test)

### Decision Boundary

As we want to show the decision boundary in a 2D plot, we use only the first 2 dimensions of the iris data-set and fit the KNN classifier on them.

In [None]:
disp_X = iris.data[:, :2]
disp_y = iris.target
knn.fit(disp_X, disp_y)

In [None]:
from matplotlib.colors import ListedColormap
# defining color-maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = disp_X[:, 0].min() - 1, disp_X[:, 0].max() + 1
y_min, y_max = disp_X[:, 1].min() - 1, disp_X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(disp_X[:, 0], disp_X[:, 1], c=disp_y, cmap=cmap_bold,
            edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i)"
          % (5))

plt.show()

### Changing the value of $k$: over-fitting / under-fitting
The choice of $k$ in KNN classifier can lead to over-fitting or under-fitting. Therefore we need to tune the value of $k$.
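
To make the over-fitting / under-fitting trade-off concrete, the sketch below compares training and test accuracy for a small, a moderate, and a large $k$. The specific values `(1, 15, 100)` and the variable names are assumptions chosen for illustration; the split mirrors the one used earlier in this notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)

train_acc, test_acc = {}, {}
for k in (1, 15, 100):  # illustrative values only
    clf_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # accuracy on the memorized points vs. accuracy on unseen queries
    train_acc[k] = clf_k.score(X_tr, y_tr)
    test_acc[k] = clf_k.score(X_te, y_te)
    print(k, train_acc[k], test_acc[k])
```

With $k = 1$ every training point is its own nearest neighbor, so the training accuracy says nothing about how the classifier generalizes. A gap between training and test accuracy at small $k$ hints at over-fitting, while low accuracy on both at large $k$ hints at under-fitting.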

#### Exercise:
Tune the value of $k$ and find the best value. What are the precision and recall of the KNN classifier with the best $k$?

In [None]:
scores = []
for k in range(1, 100, 5):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

In [None]:
plt.plot(range(1, 100, 5), scores)
plt.ylabel('accuracy', fontsize=15)
plt.xlabel('$k$', fontsize=15)

What would happen if we set $k = $ size of training set?

### Decision boundaries for different values of $k$

Let's visualize the decision boundaries for $k = 1, 20, 60, 90$.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(10, 7))

# The mesh depends only on the data, so we build it once, outside the loop
x_min, x_max = disp_X[:, 0].min() - 1, disp_X[:, 0].max() + 1
y_min, y_max = disp_X[:, 1].min() - 1, disp_X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

for ax, k in zip(axes.ravel(), [1, 20, 60, 90]):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(disp_X, disp_y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    ax.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    ax.scatter(disp_X[:, 0], disp_X[:, 1], c=disp_y, cmap=cmap_bold,
               edgecolor='k', s=20)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_title("$k$ = {}".format(k))
plt.show()

### Exercise: skewed classes

Skewed classes occur when the classes in our data are unbalanced (e.g. we have more training examples of one class than of the others). In such situations, a KNN classifier's performance in predicting queries degrades. This is due to the majority voting that the KNN classifier uses to predict the class label of a query.
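
The effect of majority voting on an under-represented class can be seen on a tiny 1-D example. The data below is made up purely for illustration (10 points of class 0, only 2 points of class 1), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data (made up): 10 points of class 0 and only 2 points of class 1
X_major = np.linspace(1.0, 1.9, 10).reshape(-1, 1)
X_minor = np.array([[0.0], [0.1]])
X_skew_toy = np.vstack([X_major, X_minor])
y_skew_toy = np.array([0] * 10 + [1] * 2)

# The query sits right between the two class-1 points
query = np.array([[0.05]])

pred_k1 = KNeighborsClassifier(n_neighbors=1).fit(X_skew_toy, y_skew_toy).predict(query)[0]
pred_k5 = KNeighborsClassifier(n_neighbors=5).fit(X_skew_toy, y_skew_toy).predict(query)[0]
print(pred_k1, pred_k5)
```

With $k = 1$ the query gets label 1, but with $k = 5$ the neighborhood contains both class-1 points plus three far-away class-0 points, so the uniform vote goes to class 0: there are simply not enough minority points to win the vote.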

Below we have split the data into train and test sets manually so that class labels 0 and 1 are over-represented in the training set. Apply a KNN classifier with $k = 16$ and observe the accuracy on the test set (queries). How would you resolve this problem?


In [None]:
# Shuffle each class once and slice the SAME permutation, so that the
# train and test indices of a class never overlap
y0_indices = np.random.permutation(np.where(y==0)[0])
y0_train_indices, y0_test_indices = y0_indices[:45], y0_indices[45:]
#-----
y1_indices = np.random.permutation(np.where(y==1)[0])
y1_train_indices, y1_test_indices = y1_indices[:45], y1_indices[45:]
#-----
y2_indices = np.random.permutation(np.where(y==2)[0])
y2_train_indices, y2_test_indices = y2_indices[:30], y2_indices[30:]

In [None]:
test_indices = np.concatenate((y0_test_indices, y1_test_indices, y2_test_indices))
train_indices = np.concatenate((y0_train_indices, y1_train_indices, y2_train_indices))

In [None]:
X_train_skewed = X[train_indices]
y_train_skewed = y[train_indices]
X_test_skewed = X[test_indices]
y_test_skewed = y[test_indices]

In [None]:
clf = KNeighborsClassifier(n_neighbors=16, weights='uniform')
clf.fit(X_train_skewed, y_train_skewed)
print("accuracy = ", clf.score(X_test_skewed, y_test_skewed))

Try to improve the accuracy for this skewed data-set.

In [None]:
# TO DO: improve the accuracy

## Exercise: Predicting churn with KNN classifier

Now let's observe how the KNN classifier performs at predicting churn.

### Data loading

First we load the data and remove the rows which don't have any value in the `TotalCharges` column (as we did in Assignment 3).


In [None]:
df = pd.read_csv("PATH_TO_CHURN_DATA") #or load it from github
# TotalCharges is read as strings (dtype object); keep only the rows whose value is a plain non-negative number
z = df["TotalCharges"].map(lambda x: x.replace('.', '', 1).isdigit())
df = df[z]
df.reset_index(inplace=True)
print(df.shape)
df["TotalCharges"] = df["TotalCharges"].astype(float)
df.head()

Use the following features:
`tenure`, `MonthlyCharges`, `TotalCharges`, `gender`, `PhoneService`, `TechSupport`, `StreamingTV`, `PaperlessBilling`

### Categorical Encoding

Apply the categorical encoding of your choice to the categorical features. Note that we will use the same features as in Assignment 3.
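
As a reminder of what an encoding does, here is a minimal sketch of one possible choice, one-hot encoding with `pd.get_dummies`. The rows in `toy` are made up for illustration and are not taken from the churn data; other encodings (for example mapping Yes/No to 1/0) would work as well.

```python
import pandas as pd

# Toy rows (made up for illustration; not taken from the churn data)
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "gender": ["Female", "Male", "Male"],
    "TechSupport": ["No", "Yes", "No"],
})

# One-hot encoding: each category becomes its own indicator column,
# while the numerical column is left untouched
toy_encoded = pd.get_dummies(toy, columns=["gender", "TechSupport"])
print(toy_encoded.columns.tolist())
```

Since KNN relies on distances between feature vectors, the way categories are turned into numbers directly affects which points count as neighbors.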

In [None]:
# TO DO: categorical encoding

### Train-Test split

Split your data into train and test sets. Take 20% of the data for the test set. It is better to fix the `random_state` to ensure reproducibility of your results! (However, note that in general your learning algorithm should perform (almost) the same way with any random split of the data, given that the split does not make the data imbalanced and that there is a sufficiently large number of data samples.)

In [None]:
# TO DO: train-test split

### Hyper-parameter tuning
Find the best `k` for a KNN classifier trained on this data-set. Also play with other parameters of the KNN classifier, e.g. what is the difference between giving *uniform* weights to the points in a neighbourhood and giving weights based on the inverse distance (between a point in the neighbourhood and the query point)?
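
The two weighting schemes can be written as one weighted vote. The notation is introduced here only for illustration: $N_k(x)$ is the set of the $k$ nearest training points of a query $x$, $d(x, x_i) > 0$ is the distance to neighbour $x_i$, and $y_i$ is its label.

$$
\hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} w_i \, \mathbb{1}[y_i = c],
\qquad
w_i =
\begin{cases}
1 & \text{(uniform weights)} \\
\dfrac{1}{d(x, x_i)} & \text{(inverse-distance weights)}
\end{cases}
$$

With uniform weights every neighbour counts equally, so the sum is just a vote count. With inverse-distance weights a close neighbour contributes more than a distant one, so a few nearby points can outvote a larger number of far-away ones.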

In [None]:
# TO DO: hyper-parameter tuning

### Decision boundary
Visualize the decision boundary of your best KNN classifier. You can use the same code as above to plot the decision boundary. Note that for plotting the decision boundary you need to train the classifier with only 2 features. Which features would you choose?


In [None]:
# TO DO: decision boundary