In [1]:
import pandas as pd
nba = pd.read_csv("nba_2013.csv")

# The names of the columns in the data
print(nba.columns.values)

['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p'
 'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb'
 'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']


Here are a few of the columns:

* `player` - The player's name
* `pos` - The player's position
* `g` - The number of games the player was in
* `gs` - The number of games in which the player started
* `pts` - The total points the player scored

See [this site](https://www.rotowire.com/) for descriptions of the remaining columns.

The **KNN algorithm** is based on the idea that we can predict values we don't know by matching them with the most similar values we do know.

Before we can make predictions with KNN, we need to find some way to figure out which data rows are **closest** to the row we're trying to predict.

A simple way to do this is to use Euclidean distance. The formula is:

$$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}$$

Let's say we have these two rows and we want to find the distance between them:

| car | horsepower | racing_stripes |
|---|---|---|
| Honda Accord | 180 | 0 |
| Chevrolet Camaro | 400 | 1 |

We'd only select the numeric columns, `horsepower` and `racing_stripes`. The distance becomes $\sqrt{(180 - 400)^2 + (0 - 1)^2} = \sqrt{48401}$, which is about equal to 220.
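
The same calculation can be checked in plain Python. This is only a minimal sketch: the helper name `euclidean` and the tuple layout are illustrative assumptions, not part of the notebook's code.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Numeric columns only: (horsepower, racing_stripes)
honda_accord = (180, 0)
chevrolet_camaro = (400, 1)

car_distance = euclidean(honda_accord, chevrolet_camaro)
print(round(car_distance, 3))  # 220.002
```

Almost all of the distance comes from the `horsepower` difference of 220; the `racing_stripes` difference of 1 barely changes the result.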

In [2]:
# Create a function for calculating the Euclidean distance between a row and the selected player.
import math

# .iloc[0] turns the one-row DataFrame into a Series, so each column lookup returns a scalar
selected_player = nba[nba["player"] == "LeBron James"].iloc[0]
distance_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 
                    'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb',
                    'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']

def euclidean_distance(row):
    # Sum the squared differences over every numeric column, then take the square root
    squared_sum = 0
    for col in distance_columns:
        squared_sum += (row[col] - selected_player[col]) ** 2
    return math.sqrt(squared_sum)

In [35]:
lebron_distance = nba.apply(euclidean_distance, axis = 1)
lebron_distance.head(5)

0    3475.792868
1            NaN
2            NaN
3    1189.554979
4    3216.773098
dtype: float64

We may have noticed that `horsepower` in the above example had a much larger impact on the final distance than `racing_stripes` did. That's because `horsepower` values are much larger in absolute terms, and therefore dwarf the impact of `racing_stripes` values in the Euclidean distance calculations.

This can be bad, because having larger values doesn't necessarily make a variable better at predicting which rows are similar.

A simple way to deal with this problem is to normalize all of the columns to have a **mean** of `0` and a **standard deviation** of `1`. This ensures that no single column has a dominant impact on the Euclidean distance calculations.
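
To see the effect on the car example, here is a small sketch using only the standard library. The helper `zscore` is a hypothetical name, and `statistics.stdev` is used because it computes the sample standard deviation, which is also the default for pandas' `.std()`.

```python
from statistics import mean, stdev

def zscore(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]

# The two car rows from the earlier example, one list per column
horsepower = [180, 400]
racing_stripes = [0, 1]

z_horsepower = zscore(horsepower)
z_racing_stripes = zscore(racing_stripes)
print(z_horsepower)
print(z_racing_stripes)
```

After scaling, both columns take values of roughly -0.707 and 0.707, so they contribute equally to the distance. With only two rows this is a toy case, but the same rescaling is what the next cell applies to every NBA column.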

In [36]:
nba_numeric = nba[distance_columns]
nba_normalized = (nba_numeric - nba_numeric.mean())/nba_numeric.std()
nba_normalized.head()

| | age | g | gs | mp | fg | fga | fg. | x3p | x3pa | x3p. | ... | ft. | orb | drb | trb | ast | stl | blk | tov | pf | pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.835906 | 0.384886 | -0.862207 | -0.435088 | -0.738401 | -0.768505 | 0.319884 | -0.700282 | -0.716608 | -0.117009 | ... | -0.389712 | 0.26069 | -0.129462 | -0.013116 | -0.64522 | -0.468056 | 0.06141 | -0.66765 | 0.226515 | -0.734621 |
| 1 | -1.550487 | 1.095711 | -0.187863 | -0.045011 | -0.581271 | -0.649215 | 0.674593 | -0.778936 | -0.829601 | NaN | ... | -0.88295 | 1.387883 | 0.18702 | 0.565852 | -0.530733 | 0.02068 | 1.065446 | -0.01376 | 1.363938 | -0.534801 |
| 2 | 0.116868 | -0.010016 | -0.4576 | -0.308035 | -0.290291 | -0.405214 | 0.84688 | -0.778936 | -0.829601 | NaN | ... | -0.520826 | 0.743773 | 0.28334 | 0.436083 | -0.568895 | -0.439307 | 0.385292 | -0.524113 | 0.029924 | -0.328603 |
| 3 | 0.355062 | 0.779789 | 1.599148 | 1.465144 | 1.577804 | 1.590172 | 0.228673 | 1.737992 | 1.430256 | 0.898007 | ... | 0.578033 | -0.38342 | 0.462221 | 0.216475 | 1.033919 | -0.123066 | -0.68352 | 1.18238 | 0.423107 | 1.729123 |
| 4 | -0.359519 | 0.108454 | 0.149309 | -0.31918 | -0.331028 | -0.475703 | 1.110379 | -0.778936 | -0.822068 | -1.808704 | ... | 0.709147 | 0.614951 | 0.138859 | 0.291341 | -0.55363 | -0.468056 | 0.709175 | -0.141348 | 1.139262 | -0.400878 |


Now we know enough to find the nearest neighbor of a given row. Instead of the Euclidean distance formula, we can use the `distance.euclidean` function from `scipy.spatial`, which is a much faster way to calculate Euclidean distance.

In [37]:
from scipy.spatial import distance

# Fill in the NA values in nba_normalized
nba_normalized = nba_normalized.fillna(0)

# Find the normalized vector for LeBron James (.iloc[0] gives the 1-D Series that distance.euclidean expects)
lebron_normalized = nba_normalized[nba["player"] == "LeBron James"].iloc[0]

# Find the distance between LeBron James and everyone else.
euclidean_distances = nba_normalized.apply(lambda row: distance.euclidean(row, lebron_normalized), axis=1)

In [38]:
distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)

In [39]:
distance_frame.head()

| | dist | idx |
|---|---|---|
| 225 | 0.0 | 225 |
| 17 | 4.171854 | 17 |
| 136 | 4.206786 | 136 |
| 128 | 4.382582 | 128 |
| 185 | 4.489928 | 185 |


In [40]:
second_smallest = distance_frame.iloc[1]["idx"]
second_smallest

17.0

In [41]:
most_similar_to_lebron = nba.loc[int(second_smallest)]["player"]
most_similar_to_lebron

'Carmelo Anthony'
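
The same idea extends from one nearest neighbor to the `k` nearest. The sketch below is an illustration rather than part of the original notebook: the helper `k_nearest` is a hypothetical name, and the distances are copied from the `distance_frame.head()` output above.

```python
import heapq

# Distances copied from the distance_frame.head() output (row index -> dist)
dist_by_idx = {225: 0.0, 17: 4.171854, 136: 4.206786, 128: 4.382582, 185: 4.489928}

def k_nearest(dist_by_idx, query_idx, k):
    """Return the indices of the k closest rows, skipping the query row itself."""
    candidates = ((d, i) for i, d in dist_by_idx.items() if i != query_idx)
    return [i for _, i in heapq.nsmallest(k, candidates)]

nearest_three = k_nearest(dist_by_idx, query_idx=225, k=3)
print(nearest_three)  # [17, 136, 128]
```

Row 225 is LeBron James himself (distance 0), which is why it is excluded before the closest rows are picked.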

Now that we know how to find the nearest neighbors, we can make predictions on a test set.

First, we have to generate **testing** and **training sets**.

In [44]:
from numpy.random import permutation
import math

nba.fillna(0, inplace=True)

# Randomly shuffle the index of nba
random_indices = permutation(nba.index)
# Set a cutoff for how many items we want in the test set (in this case 1/3 of the items)
test_cutoff = math.floor(len(nba)/3)
# Generate the test set by taking the first 1/3 of the randomly shuffled indices
test = nba.loc[random_indices[:test_cutoff]]
# Generate the train set with the rest of the data
train = nba.loc[random_indices[test_cutoff:]]
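
Because the shuffle above is not seeded, each run produces a different split and therefore a different error at the end. One possible way to make a split repeatable is sketched below with the standard library's `random` module instead of `numpy.random.permutation`; the function name `split_indices` and the seed value are assumptions for illustration.

```python
import random

def split_indices(n_rows, seed=1):
    """Shuffle row positions and return (test, train) lists, with 1/3 of rows in test."""
    rng = random.Random(seed)      # fixed seed, so the shuffle is the same on every run
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cutoff = n_rows // 3           # same cutoff as math.floor(len(nba) / 3)
    return indices[:cutoff], indices[cutoff:]

test_idx, train_idx = split_indices(9)
print(len(test_idx), len(train_idx))  # 3 6
```

Every row position lands in exactly one of the two lists, and calling `split_indices(9)` again returns the same split.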

Instead of having to do it all ourselves, we can use the [KNN implementation in scikit-learn](http://scikit-learn.org/stable/modules/neighbors.html). While scikit-learn (Sklearn for short) makes a **regressor** and a **classifier** available, we'll be using the **regressor**, as we have continuous values to predict on.

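Sklearn handles the distance calculations and the neighbor search for us, and lets us specify how many neighbors we want to look at. It does not normalize the features, though: `KNeighborsRegressor` uses the columns exactly as given, so columns with large values will dominate the distances unless we scale them first.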

In [45]:
# The columns that we'll be using to make predictions
x_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 
             'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb',
             'trb', 'ast', 'stl', 'blk', 'tov', 'pf']

# The column we want to predict
y_column = ["pts"]

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Create the kNN model
knn = KNeighborsRegressor(n_neighbors=5)

# Fit the model on the training data
knn.fit(train[x_columns], train[y_column])

# Make predictions on the test set using the fit model
predictions = knn.predict(test[x_columns])
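
To make concrete what a KNN regressor computes, here is a rough pure-Python sketch: find the `k` training rows closest to the query and average their target values. It assumes Euclidean distance and an unweighted average, which correspond to `KNeighborsRegressor`'s default settings, but it is not scikit-learn's implementation. The function `knn_predict` and the toy data are hypothetical.

```python
import math

def knn_predict(train_rows, train_targets, query, k=5):
    """Predict a value for `query` as the mean target of its k nearest training rows."""
    scored = []
    for row, target in zip(train_rows, train_targets):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, query)))
        scored.append((dist, target))
    scored.sort(key=lambda pair: pair[0])   # closest rows first
    nearest = scored[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical toy data: (games, minutes played) -> points
toy_rows = [(10, 100), (20, 300), (60, 1500), (70, 2000), (80, 2800)]
toy_points = [30, 90, 700, 1000, 1600]

toy_prediction = knn_predict(toy_rows, toy_points, query=(65, 1800), k=2)
print(toy_prediction)  # 850.0
```

The two closest rows are `(70, 2000)` and `(60, 1500)`, so the prediction is the mean of 1000 and 700.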

Now that we know our predictions, we can compute the error involved as [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error). 

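$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

Here $\hat{y}_i$ is the predicted value for row $i$, $y_i$ is the actual value, and $n$ is the number of predictions.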

In [46]:
actual = test[y_column]
mse = (((predictions - actual) ** 2).sum()) / len(predictions)
mse

pts    8652.99225
dtype: float64
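
Mean squared error is expressed in squared points, which is hard to interpret directly. Taking the square root gives the root mean squared error (RMSE), which is in the same units as `pts`. A small sketch using the value reported above (the variable names are illustrative):

```python
import math

mse_pts = 8652.99225            # value reported by the cell above
rmse_pts = math.sqrt(mse_pts)   # back in the same units as pts
print(round(rmse_pts, 2))  # 93.02
```

For this particular split, the predictions are off by roughly 93 total points per player in the RMSE sense.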