# An Introduction to K-Nearest Neighbors

## Introduction to the Data

Before we get started with the K-Nearest Neighbors (kNN) algorithm, let's take a look at our data. Each row contains information on how a player performed during the 2013-2014 NBA season.<br>

Here are a few of the columns:
* `player` - The player's name
* `pos` - The player's position
* `g` - The number of games the player was in
* `gs` - The number of games in which the player started
* `pts` - The total points the player scored

See [this site](http://www.databasebasketball.com/about/aboutstats.htm) for descriptions of the remaining columns.

In [1]:
import pandas as pd
nba = pd.read_csv("data/nba_2013.csv")

# The names of the columns in the data
print(nba.columns.values)

['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p'
 'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb'
 'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']


## Understanding the KNN Algorithm

The kNN algorithm is based on the idea that we can predict values we don't know by matching them with the most similar values we do know.<br>

Imagine that we have three different types of cars:

```python
car,horsepower,racing_stripes,is_fast
Honda Accord,180,False,False
Yugo,500,True,True
Delorean DMC-12,200,True,True
```

Let's say that we now have another car:

```python
Chevrolet Camaro,400,True,Unknown
```

We don't know whether or not this car is fast, but we can make a prediction based on the most similar car whose speed we know. In this case, we would compare the `horsepower` and `racing_stripes` values to find the most similar car, which is the `Yugo`. Because the Yugo is fast, we would predict that the Camaro is also fast. This is an example of 1-nearest neighbors, because we only looked at the most similar car.<br>

If we performed a 2-nearest neighbors, we would end up with two `True` values (for the Delorean and the Yugo), which would average out to `True`.<br>

If we did 3-nearest neighbors, we would end up with two True values and a `False` value, which would average out to `True`.

## Finding Simliar Rows With Euclidean Distance

Before we can make predictions with kNN, we need to find some way to figure out which data rows are "closest" to the row we're trying to predict.<br>

A simple way to do this is to use Euclidean distance. The formula is:

$$
\sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2}
$$

Let's say we have these two rows (we've converted True/False to 1/0), and we want to find the distance between them:

```python
Honda Accord,180,0
Chevrolet Camaro,400,1
```

We'd only select the numeric columns. The distance becomes $\sqrt{(180-400)^2 + (0-1)^2}$, which is about equal to `220`.

* Create a function for calculating the Euclidean distance between two pandas series objects.
* Use the function to find the Euclidean distance between `selected_player` and each row in `nba`.
  * Use the `.apply(func, axis=1)` method on dataframes to apply function `func` to each row.
  * The function should take `row` as its first argument.
  * Only use the columns in `distance_columns` to compute the distance. - See the [documentation on the apply() method](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html) if you get stuck.
* Assign the resulting pandas series to `lebron_distance`.

In [13]:
import math

def euclidean_distance(row):
    inner_value = 0
    
    for k in distance_columns:
        inner_value += (row[k] - selected_player[k]) ** 2
    
    return math.sqrt(inner_value)

In [14]:
selected_player = nba[nba["player"] == "LeBron James"].iloc[0]
distance_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 
                    'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 
                    'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 
                    'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']

In [15]:
lebron_distance = nba.apply(euclidean_distance, axis=1)

## Normalizing Columns

You may have noticed that `horsepower` in the last example had a much larger impact on the final distance than `racing_stripes` did. That's because `horsepower` values are much larger in absolute terms, and therefore dwarf the impact of `racing_stripes` values in the Euclidean distance calculations.<br>

This can be bad, because having larger values doesn't necessarily make a variable better at predicting which rows are similar.<br>

A simple way to deal with this problem is to normalize all of the columns to have a mean of 0 and a standard deviation of 1. This ensures that no single column has a dominant impact on the Euclidean distance calculations.<br>

To set a column's mean to 0, we have to find its current mean, then subtract it from every value in that column. To set the standard deviation to 1, we divide every value in the column by the current standard deviation. The formula is:

$$x = \frac{x-\mu}{\sigma}$$

* Normalize the columns in `nba_numeric`.
  * Using `.mean()` on a dataframe will return the mean of each column.
  * Using `.std()` will return the standard deviation of each column.
* Assign the result to `nba_normalized`.

In [32]:
nba_numeric = nba[distance_columns]
nba_normalized = (nba_numeric - nba_numeric.mean())/nba_numeric.std()

In [33]:
nba_normalized.head()

Unnamed: 0,age,g,gs,mp,fg,fga,fg.,x3p,x3pa,x3p.,...,ft.,orb,drb,trb,ast,stl,blk,tov,pf,pts
0,-0.835906,0.384886,-0.862207,-0.435088,-0.738401,-0.768505,0.319884,-0.700282,-0.716608,-0.117009,...,-0.389712,0.26069,-0.129462,-0.013116,-0.64522,-0.468056,0.06141,-0.66765,0.226515,-0.734621
1,-1.550487,1.095711,-0.187863,-0.045011,-0.581271,-0.649215,0.674593,-0.778936,-0.829601,,...,-0.88295,1.387883,0.18702,0.565852,-0.530733,0.02068,1.065446,-0.01376,1.363938,-0.534801
2,0.116868,-0.010016,-0.4576,-0.308035,-0.290291,-0.405214,0.84688,-0.778936,-0.829601,,...,-0.520826,0.743773,0.28334,0.436083,-0.568895,-0.439307,0.385292,-0.524113,0.029924,-0.328603
3,0.355062,0.779789,1.599148,1.465144,1.577804,1.590172,0.228673,1.737992,1.430256,0.898007,...,0.578033,-0.38342,0.462221,0.216475,1.033919,-0.123066,-0.68352,1.18238,0.423107,1.729123
4,-0.359519,0.108454,0.149309,-0.31918,-0.331028,-0.475703,1.110379,-0.778936,-0.822068,-1.808704,...,0.709147,0.614951,0.138859,0.291341,-0.55363,-0.468056,0.709175,-0.141348,1.139262,-0.400878


## Finding the Nearest Neighbor

Now we know enough to find the nearest neighbor of a given row. Instead of the Euclidean distance formula, we can use the `distance.euclidean` function from `scipy.spatial`, which is a much faster way to calculate Euclidean distance.

* Find the player who's most similar to LeBron James by our distance metric.
  * You can accomplish this by finding the second lowest value in the `euclidean_distances` series (the lowest value will correspond to Lebron, as he is most similar to himself), and then cross-referencing the NBA dataframe with the same index.
* Assign the name of the player to `most_similar_to_lebron`.

In [36]:
from scipy.spatial import distance

# Fill in the NA values in nba_normalized
nba_normalized.fillna(0, inplace=True)

# Find the normalized vector for Lebron James
lebron_normalized = nba_normalized[nba["player"] == "LeBron James"]

# Find the distance between Lebron James and everyone else.
euclidean_distances = nba_normalized.apply(\
                        lambda row: distance.euclidean(row, lebron_normalized), 
                                            axis=1)

In [66]:
most_similar_to_lebron_index = euclidean_distances.sort_values(ascending=True).index[1]

most_similar_to_lebron = nba.iloc[most_similar_to_lebron_index]['player']
most_similar_to_lebron

'Carmelo Anthony'

## Generating Training and Testing Tests

Now that we know how to find the nearest neighbors, we can make predictions on a test set.<br>

First, we have to generate testing and training sets. We'll use random sampling to do this. We'll randomly shuffle the index of the `nba` dataframe, and then pick rows using the randomly shuffled values.<br>

If we didn't do this, we'd end up predicting and training on the same data set, which would overfit. We could do cross-validation also, which would be slightly better, but also slightly more complex.

In [71]:
nba.shape

(481, 31)

In [70]:
nba.dropna(how='any', axis=0).shape

(403, 31)

In [72]:
nba = nba.dropna(how='any', axis=0)

In [73]:
import random
from numpy.random import permutation

# Randomly shuffle the index of nba
random_indices = permutation(nba.index)

# Set a cutoff for how many items we want in the test set (in this case 1/3 of the items)
test_cutoff = math.floor(len(nba)/3)

# Generate the test set by taking the first 1/3 of the randomly shuffled indices
test = nba.loc[random_indices[1:test_cutoff]]

# Generate the train set with the rest of the data
train = nba.loc[random_indices[test_cutoff:]]

## Using sklearn

Instead of having to do it all ourselves, we can use the [kNN implementation in scikit-learn](http://scikit-learn.org/stable/modules/neighbors.html). While scikit-learn (Sklearn for short) makes a regressor and a classifier available, we'll be using the regressor, as we have continuous values to predict on.<br>

Sklearn performs the normalization and distance finding automatically, and lets us specify how many neighbors we want to look at.

In [74]:
# The columns that we'll be using to make predictions
x_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf']
# The column we want to predict
y_column = ["pts"]

from sklearn.neighbors import KNeighborsRegressor

# Create the kNN model
knn = KNeighborsRegressor(n_neighbors=5)

# Fit the model on the training data
knn.fit(train[x_columns], train[y_column])

# Make predictions on the test set using the fit model
predictions = knn.predict(test[x_columns])

## Computing Error

Now that we know our predictions, we can compute the error involved as [mean squared error](http://en.wikipedia.org/wiki/Mean_squared_error). The formula is:

$$\frac{1}{n}\sum_{i=1}^{n}(\hat{y_{i}} - y_{i})^{2}$$

In [75]:
from sklearn.metrics import mean_squared_error

In [76]:
actual = test[y_column]

mse = mean_squared_error(actual, predictions)

In [77]:
mse

8801.5864661654141