# Exercise 10

## KNN exercise with NBA player data

## Introduction

- NBA player statistics from 2014-2015 (partial season): [data](https://github.com/justmarkham/DAT4-students/blob/master/kerry/Final/NBA_players_2015.csv), [data dictionary](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/nba_paper.pdf)
- **Goal:** Predict player position using assists, steals, blocks, turnovers, and personal fouls

## Read the data into Pandas

In [1]:
# read the data into a DataFrame
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT4-students/master/kerry/Final/NBA_players_2015.csv'
nba = pd.read_csv(url, index_col=0)

In [2]:
# examine the columns
nba.columns

Index(['season_end', 'player', 'pos', 'age', 'bref_team_id', 'g', 'gs', 'mp',
       'fg', 'fga', 'fg_', 'x3p', 'x3pa', 'x3p_', 'x2p', 'x2pa', 'x2p_', 'ft',
       'fta', 'ft_', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf',
       'pts', 'G', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'TRB%', 'AST%', 'STL%',
       'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM',
       'BPM', 'VORP'],
      dtype='object')

In [24]:
nba.head()

Unnamed: 0,season_end,player,pos,age,bref_team_id,g,gs,mp,fg,fga,...,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,pos_num
0,2015,Quincy Acy,F,24,NYK,52,21,19.2,2.2,4.6,...,14.7,0.6,0.5,1.0,0.05,-2.6,-0.7,-3.4,-0.3,1
1,2015,Jordan Adams,G,20,MEM,18,0,7.3,1.0,2.1,...,17.7,0.0,0.2,0.2,0.076,-2.3,1.8,-0.5,0.0,2
2,2015,Steven Adams,C,21,OKC,51,50,24.2,3.0,5.5,...,14.8,1.0,1.8,2.8,0.109,-2.0,2.0,-0.1,0.6,0
3,2015,Jeff Adrien,F,28,MIN,17,0,12.6,1.1,2.6,...,14.1,0.2,0.2,0.4,0.093,-2.6,0.8,-1.8,0.0,1
4,2015,Arron Afflalo,G,29,TOT,60,54,32.5,5.0,11.8,...,19.6,1.4,0.7,2.1,0.051,-0.2,-1.4,-1.6,0.2,2


In [23]:
nba.shape

(478, 50)

In [3]:
# examine the positions
nba.pos.value_counts()

G    200
F    199
C     79
Name: pos, dtype: int64

## Create X and y

Use the following features: assists, steals, blocks, turnovers, personal fouls

In [4]:
# map positions to numbers
nba['pos_num'] = nba.pos.map({'C':0, 'F':1, 'G':2})

In [6]:
# create feature matrix (X)
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']
X = nba[feature_cols]

In [7]:
# alternative way to create X
X = nba.loc[:, 'ast':'pf']

In [8]:
# create response vector (y)
y = nba.pos_num
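
Because the model will return the numeric codes rather than the position letters, it can help to keep an inverse mapping around for reading predictions. The following is a minimal sketch on a hypothetical five-row stand-in for the `pos` column; the names `pos_to_num` and `num_to_pos` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `pos` column of the NBA data
pos = pd.Series(['F', 'G', 'C', 'F', 'G'], name='pos')

# Same mapping as used in the exercise
pos_to_num = {'C': 0, 'F': 1, 'G': 2}

# Inverse mapping (added for illustration) to decode numeric predictions
num_to_pos = {num: label for label, num in pos_to_num.items()}

pos_num = pos.map(pos_to_num)      # letters -> numbers
decoded = pos_num.map(num_to_pos)  # numbers -> letters

print(pos_num.tolist())
print(decoded.tolist())
```

With this in place, a prediction of `2` decodes to `'G'` (guard) and `1` to `'F'` (forward).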

# Exercise 10.1

* Split the data into training and test sets
* Train a KNN model (K=5)
* Evaluate the accuracy

In [10]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, random_state=123)

In [12]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
y_pred_prob = knn.predict_proba(X_test)


In [13]:
# accuracy: fraction of test-set predictions that match the true labels
(y_pred == y_test).mean()

0.69999999999999996
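
The manual accuracy computation above should agree with scikit-learn's built-in helpers. The sketch below checks this on synthetic data generated with `make_classification` (an assumption made so the example runs without downloading the NBA file); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data with 5 features, standing in for the NBA features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_hat = knn_demo.predict(X_te)

manual_acc = (y_hat == y_te).mean()       # same computation as in the exercise
metric_acc = accuracy_score(y_te, y_hat)  # scikit-learn metric function
score_acc = knn_demo.score(X_te, y_te)    # classifier's own score method

print(manual_acc, metric_acc, score_acc)
```

All three values are the same quantity; `knn.score(X_test, y_test)` is simply the shortest way to obtain it.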

# Exercise 10.2

Predict player position and calculate predicted probability of each position

Predict for a player with these statistics: 1 assist, 1 steal, 0 blocks, 1 turnover, 2 personal fouls

In [14]:
# create a 2D array (one row, five features) to represent a player
import numpy as np
player = np.array([1, 1, 0, 1, 2]).reshape(1, -1) 

In [21]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(player)
y_pred_prob = knn.predict_proba(player)

print(y_pred)
print(y_pred_prob)

# 2: Guard (mapping is C=0, F=1, G=2)

[2]
[[ 0.   0.4  0.6]]
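
With the default uniform weights, the probabilities returned by `predict_proba` are the fractions of the K nearest neighbors that belong to each class, so `[0, 0.4, 0.6]` with K=5 corresponds to 0, 2, and 3 neighbors. The sketch below reproduces that pattern on a small hand-made dataset (hypothetical points, not NBA data) by counting neighbor labels directly.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2D training points with class labels 0, 1, 2
X_small = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2],
                    [5, 5], [6, 5], [5, 6]], dtype=float)
y_small = np.array([1, 1, 2, 2, 2, 0, 0, 0])

knn_small = KNeighborsClassifier(n_neighbors=5).fit(X_small, y_small)
query = np.array([[0.5, 0.5]])

# Indices of the 5 nearest training points to the query
_, neighbor_idx = knn_small.kneighbors(query)
neighbor_labels = y_small[neighbor_idx[0]]

# Count votes per class and convert to fractions
vote_fractions = np.bincount(neighbor_labels, minlength=3) / 5
proba = knn_small.predict_proba(query)[0]

print(vote_fractions)
print(proba)
```

The predicted class is the one with the most votes, which is why the player above is assigned class 2.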


In [20]:
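# Not a real accuracy: y_pred now holds only the single prediction for `player`,
# so this is the fraction of test-set players labeled 2 (guards)
(y_pred == y_test).mean()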

0.36666666666666664

# Exercise 10.3

Repeat steps 10.1 and 10.2 using K=50

In [15]:
# 10.1 with K=50
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
y_pred_prob = knn.predict_proba(X_test)
(y_pred== y_test).mean()


0.625

In [22]:
# 10.2 with K=50
import numpy as np
player = np.array([1, 1, 0, 1, 2]).reshape(1, -1) 

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
y_pred = knn.predict(player)
y_pred_prob = knn.predict_proba(player)

print(y_pred)
# 1: Forward (mapping is C=0, F=1, G=2)
print(y_pred_prob)

[1]
[[ 0.06  0.52  0.42]]
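
The probabilities above are multiples of 1/50 (0.06 = 3/50, 0.52 = 26/50, 0.42 = 21/50), just as the K=5 probabilities were multiples of 1/5. The sketch below illustrates this on synthetic data from `make_classification` (an assumption, used so the example is self-contained); the names `votes_by_k` and `query` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data with 5 features, standing in for the NBA features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

query = X_demo[:1]  # a single observation, shaped (1, 5)

votes_by_k = {}
for k in (5, 50):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
    proba = knn_k.predict_proba(query)[0]
    # Multiplying by k recovers the number of neighbors voting for each class
    votes_by_k[k] = proba * k
    print(k, proba, votes_by_k[k])
```

A larger K gives finer-grained probabilities and smoother decision boundaries, at the cost of letting more distant players influence the vote.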


# Exercise 10.4 (3 points)

Explore the features to decide which ones are predictive

In [28]:
feature_cols=['player', 'pos', 'age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg_', 'x3p', 'x3pa', 'x3p_', 'x2p', 'x2pa', 'x2p_', 'ft',
       'fta', 'ft_', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf','pts', 'G', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'TRB%', 
              'AST%', 'STL%','BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM',
       'BPM', 'VORP']

X = nba[feature_cols]
y = nba.pos_num


In [30]:
nba.head()

Unnamed: 0,season_end,player,pos,age,bref_team_id,g,gs,mp,fg,fga,...,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,pos_num
0,2015,Quincy Acy,F,24,NYK,52,21,19.2,2.2,4.6,...,14.7,0.6,0.5,1.0,0.05,-2.6,-0.7,-3.4,-0.3,1
1,2015,Jordan Adams,G,20,MEM,18,0,7.3,1.0,2.1,...,17.7,0.0,0.2,0.2,0.076,-2.3,1.8,-0.5,0.0,2
2,2015,Steven Adams,C,21,OKC,51,50,24.2,3.0,5.5,...,14.8,1.0,1.8,2.8,0.109,-2.0,2.0,-0.1,0.6,0
3,2015,Jeff Adrien,F,28,MIN,17,0,12.6,1.1,2.6,...,14.1,0.2,0.2,0.4,0.093,-2.6,0.8,-1.8,0.0,1
4,2015,Arron Afflalo,G,29,TOT,60,54,32.5,5.0,11.8,...,19.6,1.4,0.7,2.1,0.051,-0.2,-1.4,-1.6,0.2,2


In [29]:
from sklearn.feature_selection import SelectPercentile, f_classif
sel = SelectPercentile(f_classif, percentile=50)
sel.fit(X, y)
sel.get_support()

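ValueError: could not convert string to float: 'C'

The selection fails because `feature_cols` includes the string columns `player` and `pos`, and `f_classif` requires numeric input.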
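
One possible way around this error is to keep only numeric columns before running the selector. The sketch below uses a small synthetic DataFrame (hypothetical column names `signal_a`, `signal_b`, `noise_a`, `noise_b`) rather than the NBA data, and fills missing values with the column median, which is an assumption and not necessarily the right choice for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)
n = 90
y_toy = np.repeat([0, 1, 2], n // 3)

# Synthetic frame mixing string columns, informative and noise features
toy = pd.DataFrame({
    'player': [f'player_{i}' for i in range(n)],
    'pos': np.array(['C', 'F', 'G'])[y_toy],
    'signal_a': y_toy + rng.normal(scale=0.3, size=n),
    'signal_b': -2 * y_toy + rng.normal(scale=0.3, size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
    'pos_num': y_toy,
})
toy.loc[0, 'noise_a'] = np.nan  # f_classif also rejects missing values

# Keep numeric columns only and drop the target itself
X_num = toy.select_dtypes(include='number').drop(columns='pos_num')
X_num = X_num.fillna(X_num.median())  # assumption: median imputation

sel = SelectPercentile(f_classif, percentile=50).fit(X_num, toy['pos_num'])
selected = X_num.columns[sel.get_support()].tolist()
print(selected)
```

Applied to the notebook's data, the same idea would mean building `X` from `nba.select_dtypes(include='number')` without `pos_num`, and checking for missing values before fitting.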