# Classification with 2017 and 2018 team stats

In this notebook, we perform classification with the team statistics data from the 2016-2017 and 2017-2018 seasons. More specifically, we first apply a PCA with 7 components to the 2016-2017 team data. We then train different machine learning classifiers on part of the transformed 2016-2017 data. We try to find the classification algorithm that, when trained on the 2016-2017 data, has the greatest accuracy and smallest training time. The following table summarizes what we find in this notebook.

| Method | Approximate Training time | '16-'17 percent accuracy | '17-'18 percent accuracy |
| :-----------------------------------------: | ------------------- | ----------------------------- | ------------------------- |
| Logistic Regression | $0.007$ sec | $79.0\%$ | $76.3\%$ |
| Decision Tree | $0.014$ sec | $67.2\%$ | $65.6\%$ |
| Support Vector Machine | $1$ hr $25$ min | $77.7\%$ | $75.0\%$ |
| Random Forest | $0.046$ sec | $73.4\%$ | $70.5\%$ |
| Gradient Tree Boosting | $0.934$ sec | $75.2\%$ | $72.3\%$ |
| k-Nearest Neighbors (k-NN)| $0.004$ sec | $70.5\%$ | $68.8\%$ |
| Voting (k-NN, Log. Reg., and Ran. For.) | $0.049$ sec | $76.3\%$ | $75.0\%$ | 



We begin by importing some necessary libraries.

In [None]:
%matplotlib inline

import numpy as np

import pandas as pd
pd.set_option('display.max_columns',None)
from pandas.testing import assert_frame_equal

import matplotlib.pyplot as plt
import seaborn as sns

import matplotlib as mpl
mpl.rcParams.update({'axes.titlesize' : 20,
                     'axes.labelsize' : 18,
                     'legend.fontsize': 16})

# Set default seaborn plotting style
sns.set_style('white')

from datetime import datetime
import time
from nose.tools import assert_equal

import sqlite3

## Collecting 2016-17 and 2017-18 team stats data

We begin by importing all of the team stats data from the 2016-17 and 2017-18 seasons. Recall that we stored the team stats data for the past 10 seasons (2009-10 and on) in the file 'all_team_stats_2009_to_2018.csv'. Unfortunately, this file does not directly contain the year in which each season occurred. We will need the file 'all_games_04_on.csv', which does contain this information. We use SQL on these two files to filter to the games occurring in the past two seasons.

In [None]:
predicted_18 = lr_model.predict(trans_team_stats_18)

lr_score_18 = 100.0 * accuracy_score(team_stats_18.loc[:,'won'], predicted_18)

print("Logistic regression score for '17-'18 data: {0:4.1f}%\n".format(lr_score_18))

print(classification_report(team_stats_18.loc[:,'won'], predicted_18))

### Decision tree classification

We will now run the same process using decision trees.

In [None]:
from sklearn.tree import DecisionTreeClassifier

start_time = time.time()

dtc = DecisionTreeClassifier(random_state=23)

dtc = dtc.fit(x_train_17, y_train_17)

predicted = dtc.predict(x_test_17)

print('Training decision tree classifier took ' + str(time.time() - start_time) + ' seconds.')

In [None]:
dtc_score = 100.0 * accuracy_score(y_test_17,predicted)

print("Decision tree score for '16-'17 data: {0:4.1f}%\n".format(dtc_score))

print(classification_report(y_test_17,predicted))

In [None]:
predicted_18 = dtc.predict(trans_team_stats_18)

dtc_score_18 = 100.0 * accuracy_score(team_stats_18.loc[:,'won'], predicted_18)

print("Decision tree score for '17-'18 data: {0:4.1f}%\n".format(dtc_score_18))

print(classification_report(team_stats_18.loc[:,'won'], predicted_18))

We find that a Decision Tree Classifier with default parameters predicts the '16-'17 testing data with $67.2\%$ accuracy and the '17-'18 games with $65.6\%$ accuracy. Training the decision tree classifier also takes only a split second.

### Support vector machine classification

We now employ a Support Vector Machine with linear kernel for classification. I suspect that it will take longer, but be more accurate.

In [None]:
from sklearn.svm import SVC

start_time = time.time()

svc_model = SVC(kernel='linear', C=1E6, random_state=23)
svc_model = svc_model.fit(x_train_17, y_train_17)
predicted = svc_model.predict(x_test_17)

print('Training SVC took ' + str(time.time() - start_time) + ' seconds.')


In [None]:
svc_score = 100.0 * accuracy_score(y_test_17,predicted)

print("SVC score for '16-'17 data: {0:4.1f}%\n".format(svc_score))

print(classification_report(y_test_17,predicted))

In [None]:
predicted_18 = svc_model.predict(trans_team_stats_18)

svc_score_18 = 100.0 * accuracy_score(team_stats_18.loc[:,'won'], predicted_18)

print("SVC score for '17-'18 data: {0:4.1f}%\n".format(svc_score_18))

print(classification_report(team_stats_18.loc[:,'won'], predicted_18))

We find that the Support Vector Classifier achieves accuracy similar to that of Logistic Regression. The main difference is training time: it took around 1 hour 25 minutes to train the Support Vector Classifier, as recorded in the summary table, while it took less than 1 second to train the Logistic Regression model.

### Classification with Random forests

While classifying with Decision Trees had very fast training speed, it was much less accurate than classifying with Logistic Regression and Support Vector Classifiers. Employing many Decision Trees in the form of a Random Forest may be more accurate.



In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

start_time = time.time()

#input estimators
clf1 = KNeighborsClassifier(n_neighbors=5)
clf2 = LogisticRegression(random_state=23)
clf3 = RandomForestClassifier(random_state=23, max_depth=10)

#create list of tuples, matching names to input estimator
est_list = [('knn', clf1), ('lr', clf2), ('rfc', clf3)]

#soft voting- classifying based on summed classification probs
vclf = VotingClassifier(estimators=est_list, voting='soft') 

#fit to '16-'17 team stats
vclf = vclf.fit(x_train_17, y_train_17)

print('Training Voting Classifier took ' + str(time.time() - start_time) + ' seconds.')

In [None]:
vclf_score = 100.0 * vclf.score(x_test_17, y_test_17)

print("Voting Classifier score for '16-'17 data using K-NN, Logistic Regression, and Random Forest: {0:4.1f}%\n".format(vclf_score))

In [None]:
vclf_score_18 = 100.0 * vclf.score(trans_team_stats_18, team_stats_18.loc[:,'won'])

print("Voting Classifier score for '17-'18 data using K-NN, Logistic Regression, and Random Forest: {0:4.1f}%\n".format(vclf_score_18))

The voting classifier performs similarly to the Support Vector Classifier. The training time is still sub-second.
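To make the soft voting rule concrete, here is a minimal sketch on synthetic data (a stand-in generated with `make_classification`, not the NBA team stats). Under that assumption, it checks that the ensemble's probabilities are the plain average of its members' probabilities and that the predicted class is the one with the largest average.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 7 PCA components (not the NBA data)
X, y = make_classification(n_samples=300, n_features=7, random_state=23)

estimators = [
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('lr', LogisticRegression(random_state=23)),
    ('rfc', RandomForestClassifier(n_estimators=10, max_depth=10, random_state=23)),
]
vclf_demo = VotingClassifier(estimators=estimators, voting='soft').fit(X, y)

# Soft voting averages the class probabilities of the fitted members
member_probs = np.array([est.predict_proba(X) for est in vclf_demo.estimators_])
avg_probs = member_probs.mean(axis=0)

# The ensemble predicts the class with the largest averaged probability
manual_pred = vclf_demo.classes_[np.argmax(avg_probs, axis=1)]

print(np.allclose(avg_probs, vclf_demo.predict_proba(X)))
print((manual_pred == vclf_demo.predict(X)).all())
```

Because the members are weighted equally here, a single confident member can outvote two uncertain ones, which is the main difference from hard (majority) voting.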

## Summary

In this notebook, we applied a Principal Component Analysis with the team stats from the 2016-17 NBA season. We used the elbow method to decide to keep the 7 most important components, which explained $77.2\%$ of the variance. We then transformed the 2017-18 team stats data using this fitted PCA. 

We then applied several different machine learning classification methods to this data. We trained the methods on the same $70\%$ of the '16-'17 data. The methods worked with varying training speed and accuracy, summarized as follows.

Each item describes a method, its training time, its accuracy in classifying '16-'17 data, and its accuracy in classifying '17-'18 data.

- __Logistic Regression__, Training time: $0.007$ sec, '16-'17 classification: $79.0\%$, '17-'18 classification: $76.3\%$
- __Decision Tree__, TT: $0.014$ sec, '16-'17: $67.2\%$, '17-'18: $65.6\%$ 
- __Support Vector Machine__, TT: $1$ hr $25$ min, '16-'17: $77.7\%$, '17-'18: $75.0\%$ 
- __Random Forest__, TT: $0.046$ sec, '16-'17: $73.4\%$, '17-'18: $70.5\%$
- __Gradient Tree Boosting__, TT: $0.934$ sec, '16-'17: $75.2\%$, '17-'18: $72.3\%$
- __k-Nearest Neighbors (k-NN)__, TT: $0.004$ sec, '16-'17: $70.5\%$, '17-'18: $68.8\%$
- __Voting (k-NN, Log. Reg., $\&$ Ran. For.)__, TT: $0.049$ sec, '16-'17: $76.3\%$, '17-'18: $75.0\%$

Surprisingly, the best method was the first one we used: simple Logistic Regression. It had the best accuracy for both seasons and the second-fastest training time, clocking in at just under a hundredth of a second.

More work can be done on why Logistic Regression performed the best, what happens when we include more components, and which stats are most influential towards winning.
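A comparison like the one summarized above could be assembled with a small loop. The following is only a sketch under stated assumptions: it uses synthetic data from `make_classification` rather than the team stats, includes just two of the methods, and times the call to `fit` alone (the `results` dictionary and variable names are illustrative, not from the notebook).

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the transformed team stats (not the NBA data)
X, y = make_classification(n_samples=600, n_features=7, random_state=23)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=23)

models = {
    'Logistic Regression': LogisticRegression(random_state=23),
    'Decision Tree': DecisionTreeClassifier(random_state=23),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(x_tr, y_tr)  # time the fit only
    elapsed = time.perf_counter() - start
    results[name] = (elapsed, 100.0 * model.score(x_te, y_te))

for name, (elapsed, acc) in results.items():
    print('{0}: {1:.4f} sec, {2:4.1f}% accuracy'.format(name, elapsed, acc))
```

`time.perf_counter` is used here because it is intended for measuring short intervals; the timings will vary from machine to machine and run to run.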

## Applying Logistic Regression to all '16-'17 and '17-'18 team stats

We will conclude by applying Logistic Regression to all of the team stats from the '16-'17 and '17-'18 seasons. Recall that we applied a PCA, keeping 7 components, to the '16-'17 team stats. We then applied Logistic Regression to the transformed stats, keeping in mind that the 7 kept components only explained $77.2\%$ of the variance. As Logistic Regression was very quick, we will apply it to all 27 original team stats.

Recall that all of the team stats for the '16-'17 and '17-'18 seasons were stored in the DataFrames `num_team_stats_17` and `num_team_stats_18`, respectively.

In [None]:
#scale stats
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

scaled_team_stats_17 = ss.fit_transform(num_team_stats_17)

In [None]:
#train/test split keeping 30% for testing
x_train_all_17, x_test_all_17, y_train_all_17, y_test_all_17 = train_test_split(scaled_team_stats_17, \
                                                                team_stats_17.loc[:,'won'],\
                                                                test_size=0.3, \
                                                                stratify=team_stats_17.loc[:,'won'],\
                                                                random_state=23)

In [None]:
start_time = time.time()

#fit logistic regression to '16-'17 data
#high value for C to reduce regularization

model = LogisticRegression(C=1E6, random_state=23)
lr_model = model.fit(x_train_all_17, y_train_all_17)
predicted = lr_model.predict(x_test_all_17)


print('Training took ' + str(time.time() - start_time) + ' seconds.')

In [None]:
lr_score_all = 100.0 * accuracy_score(y_test_all_17,predicted)

print("Logistic regression score for '16-'17 data: {0:4.1f}%\n".format(lr_score_all))

print(classification_report(y_test_all_17,predicted))

In [None]:
scaled_team_stats_18 = ss.transform(num_team_stats_18)

predicted_all_18 = lr_model.predict(scaled_team_stats_18)

lr_score_all_18 = 100.0 * accuracy_score(team_stats_18.loc[:,'won'], predicted_all_18)

print("Logistic regression score for '17-'18 data: {0:4.1f}%\n".format(lr_score_all_18))

print(classification_report(team_stats_18.loc[:,'won'], predicted_all_18))

We find that the accuracy of Logistic Regression increases by $6.6\%$ and $10.2\%$ when we move from 7 principal components to all of the stats. The training time is also still negligible. (Training the support vector machine would have taken many more hours if we had used all of the stats.) More work could be done on tuning parameters and on studying how the accuracy of Logistic Regression varies by team.
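One possible variation on the scale-then-classify steps above is to chain them in a scikit-learn `Pipeline`, so that the scaler's means and standard deviations are learned from the training rows only. The sketch below assumes synthetic data with 27 features as a stand-in for the raw team stats; `max_iter=1000` is an added assumption to give the solver room to converge with weak regularization.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 27 raw team stats (not the NBA data)
X, y = make_classification(n_samples=500, n_features=27,
                           n_informative=10, random_state=23)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=23)

# Scaler statistics are learned from the training rows only
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(C=1e6, max_iter=1000, random_state=23))
pipe.fit(x_train, y_train)

test_accuracy = 100.0 * pipe.score(x_test, y_test)
print("Pipeline accuracy on held-out rows: {0:4.1f}%".format(test_accuracy))
```

The fitted pipeline can then score another season's raw stats directly with `pipe.score(...)`, since scaling is applied inside the pipeline.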

In [None]:
#save DataFrames in tables

all_team_stats.to_sql(name='team_stats_tb', con=con, if_exists='replace', \
                     index=False, chunksize=1000)

all_game_info_17.to_sql(name='game_info_17_tb', con=con, if_exists='replace', \
                       index=False, chunksize=1000)

all_game_info_18.to_sql(name='game_info_18_tb', con=con, if_exists='replace', \
                       index=False, chunksize=1000)

We now check that the first few rows of each table look correct.

In [None]:
#all team stats table
sql_team_stats_access = "\
SELECT * \
FROM team_stats_tb \
LIMIT 3 \
"

cur.execute(sql_team_stats_access)

for row in cur:
    print(row)

In [None]:
sql_game_17_access = "\
SELECT * \
FROM game_info_17_tb \
LIMIT 3\
"

cur.execute(sql_game_17_access)

for row in cur:
    print(row)

In [None]:
sql_game_18_access = "\
SELECT * \
FROM game_info_18_tb \
LIMIT 3\
"

cur.execute(sql_game_18_access)

for row in cur:
    print(row)

We are now ready to filter the team stats data to the 2016-17 and 2017-18 seasons.

In [None]:
#filter team stats for 2016-17 season

sql_game_stats_17_join = "\
SELECT DISTINCT game_tb.season_end_year, game_tb.season_type, \
game_tb.game_date, game_tb.matchup_id AS game_matchup_id, \
team_tb.* \
FROM game_info_17_tb AS game_tb \
JOIN team_stats_tb AS team_tb \
ON game_tb.matchup_id = team_tb.matchup_id \
"

team_stats_17 = pd.read_sql(sql_game_stats_17_join, con)

print('Number of team stat rows: ' + str(team_stats_17.shape[0]) + ' (2 for each distinct game)')

team_stats_17.head()

In [None]:
#filter to team stats data for 2017-2018 season

sql_game_stats_18_join = "\
SELECT DISTINCT game_tb.season_end_year, game_tb.season_type, \
game_tb.game_date, game_tb.matchup_id AS game_matchup_id, \
team_tb.* \
FROM game_info_18_tb AS game_tb \
JOIN team_stats_tb AS team_tb \
ON game_tb.matchup_id = team_tb.matchup_id \
"

team_stats_18 = pd.read_sql(sql_game_stats_18_join, con)

print('Number of team stat rows: ' + str(team_stats_18.shape[0]) + ' (2 for each distinct game)')

team_stats_18.head()

We conclude this section by finding the summary statistics of these team stats.

In [None]:
#summary stats of 2016-17 team stats
team_stats_17.describe()

In [None]:
#summary stats of 2017-18 team stats
team_stats_18.describe()

## PCA with components on team stats

We now have two DataFrames: the team stats data for all games during the 2016-2017 season and the same for the 2017-2018 season. The 2016-17 DataFrame has the team stats for 1309 games while the 2017-18 DataFrame has them for 1312 games.

We will run Principal Component Analysis on these two DataFrames, keeping some number of principal components. We will take into account all of the team stats, from `first_qtr_points` to `flagrant_fouls`. We will fit our PCA to the 2016-2017 team stats data. We will then transform the data from both seasons with this fitted PCA model. The exact number of components we keep will depend on how much variance we wish to explain relative to the complexity of our data.
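The idea of fitting on one season and transforming another can be sketched on random data. The arrays `season_a` and `season_b` below are hypothetical stand-ins, not the team stats. The point of the sketch is that `transform` reuses the mean and component directions learned during `fit`, so the second data set is centered with the first data set's mean.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(23)
# Hypothetical stand-ins for two seasons of stats (rows = team-games, columns = stats)
season_a = rng.normal(size=(200, 5))
season_b = rng.normal(loc=0.3, size=(150, 5))

pca_demo = PCA(n_components=2).fit(season_a)
trans_b = pca_demo.transform(season_b)

# transform() centers with the mean learned from season_a, then projects
manual_b = (season_b - pca_demo.mean_) @ pca_demo.components_.T

print(np.allclose(trans_b, manual_b))
```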

We begin by extracting the numerical stats from the two team stats DataFrames. 

In [None]:
#all column names (including season end year)
all_stat_names = team_stats_17.columns.tolist()

#find indices for first qtr points and flagrant fouls
first_qtr_points_idx = all_stat_names.index('first_qtr_points')
flagrant_fouls_idx = all_stat_names.index('flagrant_fouls')

#stat names for numerical stats
num_stat_names = all_stat_names[first_qtr_points_idx: flagrant_fouls_idx+1]

In [None]:
#extract 2016-17 numerical team stats data
num_team_stats_17 = team_stats_17[num_stat_names]

#2017-18 numerical team stats data
num_team_stats_18 = team_stats_18[num_stat_names]

In [None]:
#check first few rows of restricted 2016-17 team stats
num_team_stats_17.head()

In [None]:
from sklearn.decomposition import PCA

#conduct PCA to understand how much variance explained with components
pca_exploring = PCA()

#fit PCA to '16-'17 team stats
pca_exploring.fit(num_team_stats_17)

#explained variance of components
exp_vars = pca_exploring.explained_variance_ratio_

print('Variance: Projected dimension')
print('-----------------------------')

for idx, row in enumerate(pca_exploring.components_):
    output = '{0:4.1f}%:     '.format(100.0*exp_vars[idx])
    output += " + ".join("{0:5.2f} * {1:s}".format(val, name) \
                        for val, name in zip(row, num_stat_names))
    print(output + '\n')

In [None]:
#print cumulative explained variances by number of principal components
cum_exp_vars = []

#total number of principal components
num_cmpts = pca_exploring.explained_variance_ratio_.shape[0]

for idx in range(num_cmpts):
    if idx == 0: #first include variance of first component
        cum_exp_vars.append(100 * pca_exploring.explained_variance_ratio_[0])
    else:
        cum_exp_vars.append(cum_exp_vars[idx-1] + 100 * pca_exploring.explained_variance_ratio_[idx])

for idx in range(num_cmpts):
    if idx == 0:
        print('1 component: {0:4.1f}% variance explained'.format(cum_exp_vars[0]))
        
    else:
        print('{0} components: {1:4.1f}% variance explained'.format(idx+1, cum_exp_vars[idx]))

In [None]:
#show explained variance vs. number of components 
        
plt.plot(np.arange(1, num_cmpts + 1), cum_exp_vars, 'bo')
plt.title('Explained variance by number of components')
plt.ylabel('Explained variance (%)')
plt.xlabel('Number of components')
plt.show()

Looking at this graph, we choose 7 components. Up to 7 components, each added component increases the explained variance by nearly $5\%$. From there on, each added component increases it by at most $3.2\%$. Thus, we refit our PCA keeping only 7 components. We then transform the '16-'17 and '17-'18 team stats data with this PCA.
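The per-component gains and running totals used in this kind of elbow argument can also be computed with `np.cumsum`. The following sketch uses synthetic data whose columns have decreasing spread (an assumption made for illustration, not the NBA data).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(23)
# Synthetic data whose columns have decreasing spread (not the NBA data)
data = rng.normal(size=(300, 6)) * np.array([5.0, 4.0, 3.0, 1.0, 0.5, 0.25])

# Explained variance of each component, in percent
ratios = 100 * PCA().fit(data).explained_variance_ratio_

# Running total of explained variance, in percent
cum_exp = np.cumsum(ratios)

for n, (gain, total) in enumerate(zip(ratios, cum_exp), start=1):
    print('{0} components: +{1:4.1f}% (total {2:5.1f}%)'.format(n, gain, total))
```

An elbow shows up as the point where the gain column drops off sharply.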

In [None]:
#keep 7 components of PCA
pca_7_cmpts = PCA(n_components=7)

#fit to '16-'17 team stats 
pca_7_cmpts.fit(num_team_stats_17)

In [None]:
#transform '16-'17 data with PCA
trans_team_stats_17 = pca_7_cmpts.transform(num_team_stats_17)

print(trans_team_stats_17[:5,:])

In [None]:
#transform '17-'18 data with fitted PCA
trans_team_stats_18 = pca_7_cmpts.transform(num_team_stats_18)

print(trans_team_stats_18[:5,:])

## Classification with transformed '16-'17 and '17-'18 team stats data

We will now apply several classification methods to our transformed data, keeping track of training time and accuracy. 

We train our models on $70\%$ of the '16-'17 team stats data. We then use each model to classify the testing '16-'17 data and _all_ of the '17-'18 data. We start by splitting the data and use the same split for all the methods. We will stratify according to wins and losses, so there are the same proportions of winning and losing teams in the training and testing data. We will also fix a random state for reproducibility.
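The effect of stratifying can be checked on a toy example. The arrays `features` and `won` below are hypothetical, with exactly as many wins as losses. The sketch shows that both pieces of a stratified split keep that balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(23)
# Hypothetical features and balanced win/loss labels
features = rng.normal(size=(1000, 7))
won = np.repeat([0, 1], 500)

x_tr, x_te, y_tr, y_te = train_test_split(features, won, test_size=0.3,
                                          stratify=won, random_state=23)

print('Win share in training rows:', y_tr.mean())
print('Win share in testing rows: ', y_te.mean())
```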

In [None]:
from sklearn.model_selection import train_test_split

#split '16-'17 transformed team stats with 30% for testing 
x_train_17, x_test_17, y_train_17, y_test_17 = train_test_split(trans_team_stats_17, \
                                                                team_stats_17.loc[:,'won'],\
                                                                test_size=0.3, \
                                                                stratify=team_stats_17.loc[:,'won'],\
                                                                random_state=23)

### Logistic regression 

We begin by classifying with logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression

start_time = time.time()

#fit logistic regression to '16-'17 data
#high value for C to reduce regularization

model = LogisticRegression(C=1E6, random_state=23)
lr_model = model.fit(x_train_17, y_train_17)
predicted = lr_model.predict(x_test_17)


print('Training took ' + str(time.time() - start_time) + ' seconds.')

In [None]:
lr_model.coef_

In [None]:
from sklearn.metrics import accuracy_score, classification_report

lr_score = 100.0 * accuracy_score(y_test_17,predicted)

print("Logistic regression score for '16-'17 data: {0:4.1f}%\n".format(lr_score))

print(classification_report(y_test_17,predicted))

$79\%$ accuracy is a nice baseline for prediction, especially considering how quickly our Logistic Regression model trained. __Remember that the trivial method of classifying all games as wins would be $50\%$ accurate, since every game contributes one winning team and one losing team.__ 
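The $50\%$ figure for the trivial method can be reproduced with scikit-learn's `DummyClassifier`. This is a sketch on made-up labels in which every game contributes one win row and one loss row; the feature array is a placeholder that the dummy model ignores.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Each game contributes one winning row and one losing row
won = np.tile([1, 0], 100)
features = np.zeros((won.size, 1))  # ignored by the dummy model

# Always predict a win, regardless of the input
baseline = DummyClassifier(strategy='constant', constant=1).fit(features, won)
baseline_acc = 100.0 * accuracy_score(won, baseline.predict(features))

print("Always-win baseline: {0:4.1f}%".format(baseline_acc))
```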

We now check how accurate this model is for classifying wins for all games during the '17-'18 season. We will find that it is $76\%$ accurate.

In [None]:
from sklearn.ensemble import RandomForestClassifier

start_time = time.time()

#max_features='sqrt' matches the former 'auto' setting for classifiers
rfc = RandomForestClassifier(n_estimators=10, max_features='sqrt', \
                            min_samples_split=2, random_state=23)

#Fit to '16-'17 data
rfc = rfc.fit(x_train_17, y_train_17)

print('Training Random Forest Classifier took ' + str(time.time() - start_time) + ' seconds.')


In [None]:
rfc_score = 100.0 * rfc.score(x_test_17, y_test_17)

print("Random Forest Classifier score for '16-'17 data: {0:4.1f}%\n".format(rfc_score))

In [None]:
rfc_score_18 = 100.0 * rfc.score(trans_team_stats_18, team_stats_18.loc[:,'won'])

print("Random Forest Classifier score for '17-'18 data: {0:4.1f}%\n".format(rfc_score_18))

We find that the Random Forest Classifier is slightly less accurate than classifying with Logistic Regression and the Support Vector Classifier. It is still very fast with split second speed to train.

### Gradient Tree Boosting

We continue to use decision trees, but in a slightly different way with boosting. We will test classification with Gradient Tree Boosting.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

start_time = time.time()

gbtc = GradientBoostingClassifier(n_estimators=500, max_depth=3, random_state=23)

#fit to '16-'17 team stats
gbtc = gbtc.fit(x_train_17, y_train_17)

print('Training Gradient Boosting Classifier took ' + str(time.time() - start_time) + ' seconds.')

In [None]:
gbtc_score = 100.0 * gbtc.score(x_test_17, y_test_17)

print("Gradient Boosting Classifier score for '16-'17 data: {0:4.1f}%\n".format(gbtc_score))

In [None]:
gbtc_score_18 = 100.0 * gbtc.score(trans_team_stats_18, team_stats_18.loc[:,'won'])

print("Gradient Boosting Classifier score for '17-'18 data: {0:4.1f}%\n".format(gbtc_score_18))

Gradient Boosting was slightly more accurate than the Random Forest. While it took just under a second to train, Gradient Boosting still took over 100 times longer to train than Logistic Regression.

### k-Nearest Neighbors classifier

We will now use a somewhat more intuitive classifier: the k-Nearest Neighbors Classifier. Recall that it labels new data by considering the labels of nearby training points. Unfortunately, it suffers from the _curse of dimensionality_: our data has too many features (7) relative to the number of data points (around 2000). For this reason, we will classify according to the first 4 principal components.
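As a small illustration of how the neighbor vote works, the sketch below uses a tiny hypothetical training set of six points in the plane (not the team stats). It looks up the 3 nearest training points for a new point and confirms that the prediction is the most common label among them.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny hypothetical training set: two clusters in the plane
x_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knc_demo = KNeighborsClassifier(n_neighbors=3).fit(x_toy, y_toy)

new_point = np.array([[0.15, 0.15]])

# Indices of the 3 closest training points
_, nbr_idx = knc_demo.kneighbors(new_point)
nbr_labels = y_toy[nbr_idx[0]]

# The prediction is the most common label among those neighbors
manual_label = np.bincount(nbr_labels).argmax()

print(manual_label == knc_demo.predict(new_point)[0])
```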

In [None]:
from sklearn import neighbors

start_time = time.time()

#number of neighbors
nbrs=10

knc = neighbors.KNeighborsClassifier(n_neighbors=nbrs)

#train model with best 4 principal components of '16-'17 data
knc = knc.fit(x_train_17[:,:4], y_train_17)

print('Training K-nearest Neighbors Classifier took ' + str(time.time() - start_time) + ' seconds.')

In [None]:
knc_score = 100.0 * knc.score(x_test_17[:,:4], y_test_17)

print("K-nearest Neighbors Classifier score for '16-'17 data: {0:4.1f}%\n".format(knc_score))

In [None]:
knc_score_18 = 100.0 * knc.score(trans_team_stats_18[:,:4], team_stats_18.loc[:,'won'])

print("K-nearest Neighbors Classifier score for '17-'18 data: {0:4.1f}%\n".format(knc_score_18))

While the training time is as low as that of any other method, the accuracy is a few percentage points lower than the others.

### Voting Classification

We have tested several machine learning methods of classification in this notebook. We will conclude by seeing what happens when we combine a few of the methods by way of Voting Classification. 

In [None]:
#DataFrame of all team stats since 2009-2010 season
all_team_stats = pd.read_csv('all_team_stats_2009_to_2018.csv').loc[:,'team':]

all_team_stats.head()

In [None]:
#DataFrame of info on all games since 2004-2005 season 
all_game_info = pd.read_csv('all_games_04_on.csv').loc[:,'team':]

#restrict to games during 2016-2017 season
season_17_bool = all_game_info['season_end_year'] == 2017
all_game_info_17 = all_game_info[season_17_bool]

#restrict to games during 2017-2018 season
season_18_bool = all_game_info['season_end_year'] == 2018
all_game_info_18 = all_game_info[season_18_bool]

In [None]:
print('Number of 2016-17 game info rows: ' + str(all_game_info_17.shape[0]))

all_game_info_17.head()

In [None]:
print('Number of 2017-18 game info rows: ' + str(all_game_info_18.shape[0]))

all_game_info_18.head()

We now employ SQLite to match up the matchup IDs for each year with the corresponding team stats.
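The matching step can be sketched end to end on miniature tables. Everything below is hypothetical: the two small DataFrames, the table names `game_tb` and `team_tb`, and the in-memory database stand in for the real files and database. The sketch writes both tables with `to_sql` and then joins them on `matchup_id` to keep only one season's rows.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature versions of the two tables
games = pd.DataFrame({'matchup_id': [1, 2, 3],
                      'season_end_year': [2017, 2017, 2018]})
stats = pd.DataFrame({'matchup_id': [1, 1, 2, 2, 3, 3],
                      'team': ['A', 'B', 'C', 'D', 'A', 'C'],
                      'won': [1, 0, 0, 1, 1, 0]})

# In-memory database, discarded when the connection closes
con_demo = sqlite3.connect(':memory:')
games.to_sql('game_tb', con_demo, index=False)
stats.to_sql('team_tb', con_demo, index=False)

query = """
SELECT g.season_end_year, t.*
FROM game_tb AS g
JOIN team_tb AS t ON g.matchup_id = t.matchup_id
WHERE g.season_end_year = 2017
"""
stats_17_demo = pd.read_sql(query, con_demo)
con_demo.close()

print(stats_17_demo)
```

Each game appears twice in the result, once per team, which is why the joined table has two rows for every distinct matchup.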

In [None]:
home_dir = !echo $HOME

#Define data directory
database_dir = home_dir[0] + '/database'

print(f'Database will persist at {database_dir}\n')

In [None]:
%%bash -s "$database_dir"

#passed Python variable, later accessed with $1

#check if directory exists
if [ -d "$1" ] ; then

    echo "Directory already exists."

else
    #otherwise create the database directory
    
    mkdir "$1"
    echo "creating database directory"

fi

In [None]:
con = sqlite3.connect("stats_629.db")

cur = con.cursor()