# Chapter 3
## Geometry and Nearest Neighbors

In this chapter, we view data through a geometric lens. A feature vector lets us treat a collection of feature values as a single point in space, with one coordinate per feature.

Procedure for mapping data -> features:
* For real-valued data -> individual features
* For boolean data -> 0/1 for each feature
* For categorical data -> 0/1 binary indicators for each possible value

For example, a color feature with the values red, green and blue would receive an encoding of isRed, isGreen and isBlue.
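The mapping rules above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `encode` and the sample values are hypothetical and are not part of the dataset used later in this chapter.

```python
def encode(real_value, flag, category, categories):
    """Build a feature vector from one real, one boolean and one categorical value.

    Hypothetical helper for illustration only.
    """
    real_part = [float(real_value)]                          # real value -> one feature
    bool_part = [int(flag)]                                  # boolean -> 0/1
    one_hot_part = [int(category == c) for c in categories]  # one 0/1 indicator per value
    return real_part + bool_part + one_hot_part


colors = ["red", "green", "blue"]  # indicator order: isRed, isGreen, isBlue
vector = encode(4.5, True, "green", colors)
print(vector)  # [4.5, 1, 0, 1, 0]
```

Only the indicator that matches the category is set to 1, so a categorical feature with three possible values contributes three coordinates to the vector.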


## Data Preparation

In [133]:
# K-Nearest Neighbor Data

# Class Ratings dataset from CIML.
# Columns: Course Rating (Label), Easy, AI, Systems, Theory, Morning. 

In [179]:
import pandas as pd
from collections import Counter


df = pd.read_csv("class_ratings.csv")
lh = df.copy()

# Convert integer evaluations to neg/pos (0/1) evaluations.
lh.loc[df["y"] >= 0, "y"] = 1
lh.loc[df["y"] < 0, "y"] = 0

# Replace "n" and "y" with 0 and 1. 
lh.replace("y", 1, inplace=True)
lh.replace("n", 0, inplace=True)

In [180]:
lh.head()

Unnamed: 0,y,easy,ai,sys,thy,morning
0,1,1,1,0,1,0
1,1,1,1,0,1,0
2,1,0,1,0,0,0
3,1,0,0,0,1,0
4,1,0,1,1,0,1


In [181]:
# 80-20 split for training and validation
training_lh = lh.drop(df.index[[2, 4, 15, 17]])
validation_lh = lh.iloc[[2, 4, 15, 17]]
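The split above holds out four hand-picked rows. As an alternative sketch, a random hold-out of the same proportion could be drawn with `DataFrame.sample`. The toy frame, the variable names and the seed below are assumptions made for illustration; they are not the class ratings data.

```python
import pandas as pd

# Toy stand-in for the ratings table (hypothetical values).
toy = pd.DataFrame({
    "y":    [1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    "easy": [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
})

# Draw 20% of the rows at random for validation; the seed keeps it repeatable.
validation = toy.sample(frac=0.2, random_state=0)

# Everything that was not sampled becomes training data.
training = toy.drop(validation.index)

print(len(training), len(validation))  # 8 2
```

Dropping by `validation.index` guarantees that the two sets do not share any rows.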

In [186]:
# iloc selects rows by integer position (0-based), regardless of the index labels
# loc selects rows by index label (the row "id")
validation_lh.iloc[0]

y          1
easy       0
ai         1
sys        0
thy        0
morning    0
Name: 2, dtype: int64
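The difference between `iloc` and `loc` matters once rows have been dropped, because positions and labels no longer line up. A minimal sketch on a made-up frame (the column `v` and the labels 10 to 13 are hypothetical):

```python
import pandas as pd

frame = pd.DataFrame({"v": [100, 101, 102, 103]}, index=[10, 11, 12, 13])
subset = frame.drop([10, 12])        # remaining index labels: 11 and 13

first_by_position = subset.iloc[0]   # position 0 -> the row labelled 11
row_by_label = subset.loc[13]        # the row whose index label is 13

print(first_by_position.name, first_by_position["v"])  # 11 101
print(row_by_label.name, row_by_label["v"])            # 13 103
```

This is the same effect seen above: `validation_lh.iloc[0]` returns the first held-out row, whose label (`Name`) is 2.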

## K-Nearest Neighbors

In the K-Nearest Neighbors algorithm, we predict the label of a particular instance by looking at the k examples nearest to it in feature space. Each of those neighbors votes, and the most common label is the result.

A small k tends to overfit, because the prediction follows individual (possibly noisy) examples. A large k tends to underfit, because the prediction drifts toward the overall majority label.
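The effect of k can be seen on a tiny one-dimensional example. The sketch below is separate from the implementation used in this chapter: the function `knn_label`, the points and the labels are made up for illustration, and distance is plain absolute difference rather than Hamming distance.

```python
from collections import Counter


def knn_label(points, labels, k, query):
    """Majority label among the k points closest to query (1-D, absolute distance)."""
    order = sorted(range(len(points)), key=lambda i: abs(points[i] - query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


points = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0]
labels = [0, 0, 1, 0, 1, 1, 1]  # the label 1 at 3.0 is deliberate noise

noisy = knn_label(points, labels, 1, 3.1)        # k=1 copies the noisy neighbor
smoothed = knn_label(points, labels, 3, 3.1)     # k=3 outvotes the noise
global_vote = knn_label(points, labels, 7, 1.5)  # k=n ignores locality entirely

print(noisy, smoothed, global_vote)  # 1 0 1
```

With k=1 the query inherits the noisy label, with k=3 the two clean neighbors win, and with k equal to the dataset size every query gets the overall majority label even though the points around 1.5 are all labelled 0.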

In [188]:
def kNearestNeighbors(dataset, k, x):
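    '''
    Computes the vote for x using the k-Nearest Neighbors algorithm.
    Naive implementation: computes the distance to all n rows and sorts them,
    so each query costs O(n * d + n log n) for d features.

    Params
    dataset: a pandas DataFrame with features and a label column "y"
    k: the number of nearest neighbors to consider
    x: the input feature vector

    Returns the vote sum: +1 for each positive neighbor, -1 for each negative one.
    Hamming distance is used to measure the distance between points.
    '''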
    # Create a dictionary of distances
    distances = dict()
    for index, row in dataset.iterrows():
        feature_vec = list(row.drop('y'))
        distances[index] = distance(x, feature_vec)
    
    # Sort row ids from closest to farthest
    dist_sorted = sorted(distances, key=distances.get) 

    # Vote based on the k nearest values
    y_sum = 0
    for i in range(k):
        feature_id = dist_sorted[i]
        val = dataset.loc[feature_id]["y"]
        y_sum += 1 if val else -1
        
    # Negative sum: dislike; non-negative sum: like
    return y_sum 

In [174]:
def distance(x1, x2):
    '''Hamming distance: the number of positions at which x1 and x2 differ.
    x1 and x2 are n-dim binary arrays of equal length.'''
    return sum([x1[i] != x2[i] for i in range(len(x1))])

a_test = [1, 0, 1, 0]
b_test = [0, 1, 1, 1]
assert distance(a_test, b_test) == 3
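For longer vectors the same Hamming distance can be computed without a Python loop. The following is a sketch of an equivalent using NumPy; the name `hamming_distance` and the explicit length check are additions for illustration, not part of the original notebook.

```python
import numpy as np


def hamming_distance(x1, x2):
    """Count the positions at which two equal-length sequences differ."""
    a, b = np.asarray(x1), np.asarray(x2)
    if a.shape != b.shape:
        raise ValueError("vectors must have the same length")
    # a != b is an elementwise boolean array; count its True entries.
    return int(np.count_nonzero(a != b))


print(hamming_distance([1, 0, 1, 0], [0, 1, 1, 1]))  # 3
```

On the test vectors above it gives the same answer as `distance`, since they differ in positions 0, 1 and 3.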

In [171]:
# Validation phase: "Predicted" is the vote sum (negative: dislike, non-negative: like)
k = 3
for index, row in validation_lh.iterrows():
    x = list(row.drop("y"))
    predicted = kNearestNeighbors(training_lh, k, x)
    print("Data ID[{0}] Predicted: {1}, Actual: {2}".format(index, predicted, row["y"]))

Data ID[2] Predicted: 3, Actual: 1
Data ID[4] Predicted: -1, Actual: 1
Data ID[15] Predicted: -3, Actual: 0
Data ID[17] Predicted: -1, Actual: 0
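The predictions printed above are vote sums, while the actual values are 0/1 labels. To compare them directly, the sums can be thresholded at zero, following the "non-negative means like" convention in `kNearestNeighbors`. A small sketch using the four results from the output above (the variable names are chosen here for illustration):

```python
# Vote sums and true labels copied from the validation output above.
vote_sums = [3, -1, -3, -1]
actual = [1, 1, 0, 0]

# Non-negative vote sum -> like (1); negative vote sum -> dislike (0).
predicted_labels = [1 if s >= 0 else 0 for s in vote_sums]

correct = sum(p == a for p, a in zip(predicted_labels, actual))
accuracy = correct / len(actual)

print(predicted_labels, accuracy)  # [1, 0, 0, 0] 0.75
```

With k = 3, three of the four held-out rows are classified correctly; the miss is row 4, which is predicted as a dislike but is actually a like.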
