# Chapter 12. k-Nearest Neighbors

In [8]:
from __future__ import division
from collections import Counter
from linear_algebra import distance
from statistics import mean
import math, random
import matplotlib.pyplot as plt

Imagine that you're trying to predict how I'm going to vote in the next presidential election.  
If you know nothing else about me, one sensible approach is to look at how my *neighbors* are planning to vote.  
Now imagine that you know more about me than just geography -- my age, my income, marital status, and so on.  
To the extent that my behavior is influenced (or characterized) by those things, looking only at the neighbors who are close to me along all of those dimensions seems likely to be an even better predictor than looking at all of my neighbors.  
This is the idea behind [nearest neighbors classification](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

## The Model

Nearest neighbors requires:
- Some notion of distance.
- An assumption that points that are close to one another are similar. 
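
For numeric vectors, one common notion of distance is the Euclidean (straight-line) distance. The sketch below assumes that this is what the imported `distance` function computes; the name `euclidean_distance` is made up here so that it does not clash with it.

```python
import math

def euclidean_distance(v, w):
    """Straight-line distance between two equal-length numeric vectors.
    (Hypothetical stand-in for the imported `distance` function.)"""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    # square root of the sum of squared componentwise differences
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
```

Any other function that returns smaller values for more similar points could play the same role, as long as the second requirement still holds for it.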

KNN deliberately neglects a lot of information, since the prediction for each new point depends only on the handful of points closest to it.  
What's more, KNN is probably not going to help you understand the drivers of whatever phenomenon you're looking at.  
In the general situation, we have some data points and a corresponding set of labels. The labels can be binary (e.g., True/False), categorical, or numeric.  

For our example, the data points will be vectors, so we can use the `distance` function from Chapter 4.  
Let's say we've picked a number `k` like 3 or 5.  
When we want to classify some new data point, we find the `k` nearest labeled points and let them vote on the new output.  
To do this, we need a function that counts votes:

In [9]:
def raw_majority_vote(labels):
    votes = Counter(labels)
    winner, _ = votes.most_common(1)[0]
    return winner
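
As a quick check, the sketch below restates `raw_majority_vote` so that it runs on its own and applies it to a made-up list of neighbor labels with a clear winner.

```python
from collections import Counter

def raw_majority_vote(labels):
    votes = Counter(labels)                 # label -> number of votes
    winner, _ = votes.most_common(1)[0]     # (label, count) with the most votes
    return winner

# hypothetical labels of the three nearest neighbors
print(raw_majority_vote(["python", "java", "python"]))   # python
```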

That's a good start, but what about tie votes?  
We have several options:
- Pick one of the winners at random.
- Weight the votes by distance and pick the weighted winner.
- Reduce `k` until we find a unique winner.  
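
The second option could look roughly like the following. This is only a sketch: the function name `weighted_majority_vote`, the inverse-distance weighting, and the small `epsilon` guard against division by zero are all assumptions made for illustration, not something defined elsewhere in this chapter.

```python
from collections import defaultdict

def weighted_majority_vote(labels, distances, epsilon=1e-9):
    """Each neighbor votes with weight 1 / (distance + epsilon),
    so closer neighbors count for more. (Hypothetical helper.)"""
    totals = defaultdict(float)
    for label, d in zip(labels, distances):
        totals[label] += 1.0 / (d + epsilon)
    # the label with the largest total weight wins
    return max(totals, key=totals.get)

# two votes each, but the "a" neighbors are closer
print(weighted_majority_vote(["a", "b", "b", "a"],
                             [1.0, 2.0, 3.0, 1.5]))   # a
```

Exact ties in total weight are still possible in principle, though they are much less likely with real-valued distances than ties in raw vote counts.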

Let's implement the third option:

In [10]:
def majority_vote(labels):
    """ assumes that labels are ordered from nearest to farthest """
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count for count in vote_counts.values() if count == winner_count])
    if num_winners == 1:
        return winner   # unique winner, so return it
    else:
        return majority_vote(labels[:-1])   # try again without the farthest data point
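
To see the tie-breaking in action, the sketch below restates the function so that it runs on its own and calls it on a made-up list of labels in which two labels tie.

```python
from collections import Counter

def majority_vote(labels):
    """assumes that labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count for count in vote_counts.values()
                       if count == winner_count])
    if num_winners == 1:
        return winner                       # unique winner, so return it
    return majority_vote(labels[:-1])       # drop the farthest label and retry

# "a" and "b" tie 2-2, so the farthest label ("a") is dropped and "b" wins
print(majority_vote(["a", "b", "c", "b", "a"]))   # b
```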

This approach will work eventually, since in the worst case we go all the way down to just one label, at which point that one label wins.  
With this function, it's easy to create a classifier:

In [11]:
def knn_classify(k, labeled_points, new_point):
    """ each labeled point should be a pair (point, label) """
    # order the labeled points from nearest to farthest
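    # (index into the pair instead of unpacking it in the lambda's
    # arguments, which is a syntax error on Python 3)
    by_distance = sorted(labeled_points,
                         key=lambda pair: distance(pair[0], new_point))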
    # find the labels for the closest k
    k_nearest_labels = [label for _, label in by_distance[:k]]
    # and let them vote
    return majority_vote(k_nearest_labels)
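
Before turning to real data, here is a small self-contained sketch of the whole pipeline on made-up points. The `distance` function below is assumed to be Euclidean and stands in for the one from Chapter 4. The `key` function indexes into each pair rather than unpacking it in the lambda's argument list, which is one way to write it so that it also runs on Python 3.

```python
import math
from collections import Counter

def distance(v, w):
    # assumed Euclidean; stand-in for the Chapter 4 function
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def majority_vote(labels):
    """assumes that labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count for count in vote_counts.values()
                       if count == winner_count])
    if num_winners == 1:
        return winner
    return majority_vote(labels[:-1])

def knn_classify(k, labeled_points, new_point):
    """each labeled point should be a pair (point, label)"""
    by_distance = sorted(labeled_points,
                         key=lambda pair: distance(pair[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return majority_vote(k_nearest_labels)

# made-up training data: two well-separated clusters
labeled_points = [([0, 0], "red"),  ([1, 0], "red"),  ([0, 1], "red"),
                  ([5, 5], "blue"), ([6, 5], "blue"), ([5, 6], "blue")]

print(knn_classify(3, labeled_points, [1, 1]))   # red
print(knn_classify(3, labeled_points, [4, 4]))   # blue
```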

Let's take a look at how this works.

## Example: Favorite Languages