# Final Capstone Project

This final capstone project looks at sports betting.

Using machine learning, I want to analyze sports betting odds and build a model that predicts the results of games and whether or not a gambler should bet money on a given game. Such a model would be valuable to gamblers who want to make informed bets on sports games.

In [1]:
#pip install hvplot bokeh -U

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import sklearn
import scipy
sns.set_style('white')

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

ModuleNotFoundError: No module named 'google'

# Data Wrangling

In [None]:
# Import odds
odds_raw = pd.read_csv("/content/gdrive/My Drive/Colab Datasets/BSKB_NBA 3in1_ML_Opening odds_02 June 2019.csv")

odds_raw.head()

In [None]:
# Import teams
teams_raw = pd.read_csv("/content/gdrive/My Drive/Colab Datasets/nba_team_ratings.csv")

teams_raw.head()

## About the Data

The first dataset contains the moneyline odds and final scores of NBA games from the 2009-2010 NBA regular season through the 2018-2019 NBA playoffs, up to Game 1 of the NBA Finals. The dataset is from indatabet.com. It is a .csv file that I converted from a .xlsb file. The original file used sub-columns under main columns. For example, 'Teams' has the 'Home' and 'Away' teams in separate columns, and 'Teams ID' likewise has the 'H' (home) and 'A' (away) teams' abbreviations. It also has odds from three different online bookmakers: Matchbook, Pinnacle, and Bet365. I will need to clean the data so that all the columns are on the same level, and remove the columns that are not necessary.
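The sub-column layout described above can be illustrated on a toy frame. This is only a sketch under assumptions: the team names and the exact column labels below are made up for the example, and the notebook itself renames the columns by hand rather than building the names automatically.

```python
import pandas as pd

# Toy frame mimicking the exported layout: main headers sit in the column
# names and the sub-headers ('Home'/'Away', 'H'/'A') sit in the first data row.
raw = pd.DataFrame(
    [["Home", "Away", "H", "A"],
     ["Boston Celtics", "Miami Heat", "BOS", "MIA"]],
    columns=["Teams", "Unnamed: 1", "Teams ID", "Unnamed: 3"],
)

# Carry each main header forward over the 'Unnamed: n' placeholders
names = pd.Series(raw.columns)
main = names.where(~names.str.startswith("Unnamed")).ffill()
sub = raw.iloc[0]

# Drop the sub-header row and give every column a single flat name
flat = raw.iloc[1:].reset_index(drop=True)
flat.columns = [f"{m} {s}" for m, s in zip(main, sub)]
print(flat.columns.tolist())
```

After this step every column sits on one level, which is the shape the rest of the cleaning assumes.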

The second dataset contains stats and advanced metrics for each NBA team in a given season. It contains each team's record, offensive rating, defensive rating, and net rating, as well as the adjusted versions of those ratings.

In [None]:
# Get list of columns
odds_raw.columns

In [None]:
# Create a new DataFrame as a copy so the raw data stays unchanged
odds_df = odds_raw.copy()

In [None]:
# Drop unnecessary columns
odds_df = odds_raw.drop(['Kick-off', 'Date of the game', 'Unnamed: 5', 'Country', 'League', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Draw', 'Unnamed: 21', 'Unnamed: 22', '1 q ', 'Unnamed: 24', 'Unnamed: 25'],  axis=1)

In [None]:
# Rename columns
odds_df.rename(columns={'Teams ': 'Home Team',
                   'Unnamed: 11': 'Away Team',
                   'Teams ID': 'Home ID',
                   'Unnamed: 13': 'Away ID',
                   'FT +OT Scores': 'Home Score',
                   'Unnamed: 15': 'Away Score',
                   'Unnamed: 16': 'Winner',
                   'OT Money Line': 'matchbook H',
                   'Unnamed: 27': 'matchbook A',
                   'OT Money Line.1': 'pinnacle H',
                   'Unnamed: 29': 'pinnacle A',
                   'OT Money Line.2': 'bet365 H',
                   'Unnamed: 31': 'bet365 A'}, inplace=True)

In [None]:
# Remove first two rows which were previously the names of the sub columns
odds_df = odds_df.iloc[2:]

In [None]:
# Check data types and for null values
odds_df.info()

In [None]:
# Sum of null values
odds_df.isna().sum()

We do not have the odds of 5,795 games from Matchbook, 156 games from Pinnacle, and 160 games from Bet365. 
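With that many Matchbook prices missing, one possible alternative to filling gaps with a column-wide mean is to average only the bookmakers that actually quoted a game. The sketch below is not what this notebook does; the odds values are made up, and the `n_books H` column is a hypothetical helper introduced only for this example.

```python
import numpy as np
import pandas as pd

# Made-up home odds; NaN marks a bookmaker with no price for that game
odds = pd.DataFrame({
    "matchbook H": [np.nan, 1.80, np.nan],
    "pinnacle H": [1.50, 1.90, np.nan],
    "bet365 H": [1.60, 2.00, 2.40],
})

home_cols = ["matchbook H", "pinnacle H", "bet365 H"]

# Row-wise mean that skips missing bookmakers instead of imputing them
odds["AVG H"] = odds[home_cols].mean(axis=1, skipna=True)

# Hypothetical helper column: how many bookmakers contributed to each average
odds["n_books H"] = odds[home_cols].notna().sum(axis=1)
print(odds)
```

Keeping a count of contributing bookmakers makes it possible to filter out games whose average rests on a single price.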

In [None]:
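# Convert object columns to numeric where possible (convert_objects was removed in pandas 1.0)
numeric = odds_df.apply(pd.to_numeric, errors='coerce').dropna(axis='columns', how='all')
odds_df = odds_df.copy()
odds_df[numeric.columns] = numeric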

In [None]:
# Check data types again
odds_df.info()

In [None]:
# Fill missing numeric values with the column mean
odds_df = odds_df.fillna(odds_df.mean(numeric_only=True))

In [None]:
# Check where null values are
odds_df[odds_df.isna().any(axis=1)]

In [None]:
# Add columns for the average odds of the three different sportsbooks (Matchbook, Pinnacle, and Bet365)
odds_df['AVG H'] = ((odds_df['matchbook H'] + odds_df['pinnacle H'] + odds_df['bet365 H']) / 3)
odds_df['AVG A'] = ((odds_df['matchbook A'] + odds_df['pinnacle A'] + odds_df['bet365 A']) / 3)
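Averaged odds can also be read as win probabilities. The sketch below assumes the odds are in decimal format, which the dataset description does not state, and the function name `implied_probabilities` is hypothetical. It inverts each price and rescales the pair so the two probabilities sum to 1, which removes the bookmaker's margin.

```python
def implied_probabilities(home_odds: float, away_odds: float) -> tuple[float, float]:
    """Convert a pair of decimal odds into win probabilities that sum to 1."""
    raw_home = 1.0 / home_odds
    raw_away = 1.0 / away_odds
    # The raw inverses sum to more than 1 when the bookmaker keeps a margin
    overround = raw_home + raw_away
    return raw_home / overround, raw_away / overround


# Made-up prices for a home favorite
p_home, p_away = implied_probabilities(1.50, 2.70)
print(round(p_home, 3), round(p_away, 3))
```

If the odds turn out to be in another format, such as American moneyline, the inversion step would need to change accordingly.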

In [None]:
# Create binary indicator columns for a home win (Winner_B) and an away win (Loser_B)
odds_df['Winner_B'] = np.where(odds_df['Winner']=='H', 1, 0)
odds_df['Loser_B'] = np.where(odds_df['Winner']=='A', 1, 0)

In [None]:
odds_df.head()

In [None]:
teams_raw.head()

In [None]:
# Merge the dataframes to include home team metrics
merged = pd.merge(odds_df, 
                        teams_raw[['Team', 'MOV', 'ORtg', 'DRtg', 'NRtg', 'MOV/A', 'ORtg/A', 'DRtg/A', 'NRtg/A', 'Seasons']].rename(columns={
                            'DRtg': 'DRtg_Home',
                            'DRtg/A': 'DRtg/A_Home',
                            'MOV': 'MOV_Home',
                            'MOV/A': 'MOV/A_Home',
                            'NRtg': 'NRtg_Home',
                            'NRtg/A': 'NRtg/A_Home',
                            'ORtg': 'ORtg_Home',
                            'ORtg/A': 'ORtg/A_Home'}),
                        how='left',
                        left_on=['Home Team', 'Seasons'],
                        right_on=['Team', 'Seasons'],
                        suffixes=('', '_this_shouldnt_happen'))

In [None]:
# Merge again to include away team metrics
merged = merged.merge(teams_raw[['Team', 'MOV', 'ORtg', 'DRtg', 'NRtg', 'MOV/A', 'ORtg/A', 'DRtg/A', 'NRtg/A', 'Seasons']].rename(columns={
                            'DRtg': 'DRtg_Away',
                            'DRtg/A': 'DRtg/A_Away',
                            'MOV': 'MOV_Away',
                            'MOV/A': 'MOV/A_Away',
                            'NRtg': 'NRtg_Away',
                            'NRtg/A': 'NRtg/A_Away',
                            'ORtg': 'ORtg_Away',
                            'ORtg/A': 'ORtg/A_Away',}), 
                   how='left',
                   left_on=['Away Team', 'Seasons'],
                   right_on=['Team', 'Seasons'],
                   suffixes=('', '_away_team'))

In [None]:
merged.columns

In [None]:
# Remove redundant columns for home and away teams
merged = merged.drop(['Team', 'Team_away_team'],  axis='columns')

In [None]:
merged.info()

In [None]:
# Sum of null values
merged.isna().sum()
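Null counts after a left merge do not show why a row failed to match. As a sketch on made-up data (the team spellings and rating values are illustrative only), the `indicator` and `validate` options of `merge` can list the games that found no rating and guard against duplicate team/season rows.

```python
import pandas as pd

# Made-up games and ratings; one team name is spelled differently on purpose
games = pd.DataFrame({
    "Home Team": ["Boston Celtics", "Miami Heat", "LA Clippers"],
    "Seasons": ["2009/2010", "2009/2010", "2009/2010"],
})
ratings = pd.DataFrame({
    "Team": ["Boston Celtics", "Miami Heat", "Los Angeles Clippers"],
    "Seasons": ["2009/2010", "2009/2010", "2009/2010"],
    "NRtg": [3.9, 2.1, -6.0],
})

checked = games.merge(
    ratings,
    how="left",
    left_on=["Home Team", "Seasons"],
    right_on=["Team", "Seasons"],
    indicator=True,          # adds a '_merge' column describing each row's match
    validate="many_to_one",  # raises if a team/season pair repeats in ratings
)

# Games that found no matching rating row
unmatched = checked.loc[checked["_merge"] == "left_only", ["Home Team", "Seasons"]]
print(unmatched)
```

Listing the unmatched rows first shows whether dropping them loses real games or only rows with inconsistent team names.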

In [None]:
# Drop NA
merged = merged.dropna()

# Data Exploration

In [None]:
# How many total
merged.shape

We have data from 12,697 games over the span of 10 NBA seasons (from 2009 to 2019).

In [None]:
# Summary statistics
merged.describe()

In [None]:
# How many games from each season
b = merged.Seasons.value_counts().reindex(['2009/2010', '2010/2011', '2011/2012', '2012/2013', '2013/2014',
                                  '2014/2015', '2015/2016', '2016/2017', '2017/2018', '2018/2019']).plot(kind="bar",
                                                                                                        figsize=(14,8))
b.axes.set_title("Games per Season",fontsize=30)
b.set_xlabel("Season", fontsize=20)
b.set_ylabel("Number of Games", fontsize=20)

There are 30 teams in the NBA. Each team plays 82 regular season games, which comes out to 1,230 games in total because every game involves two teams. The number of playoff games differs every year, as some series require more games to decide the outcome. The 2011/2012 NBA season was shortened to 66 games per team due to a lockout.
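The regular season totals follow from simple arithmetic, sketched below. The helper name `regular_season_games` is hypothetical, and the lockout total is derived from the 66-game figure rather than read from the dataset.

```python
def regular_season_games(teams: int, games_per_team: int) -> int:
    # Each game involves two teams, so per-team totals count every game twice
    return teams * games_per_team // 2


full_season = regular_season_games(30, 82)
lockout_season = regular_season_games(30, 66)
print(full_season, lockout_season)
```

Bars in the chart that rise above these totals reflect playoff games played on top of the regular season.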

In [None]:
import hvplot.pandas
import holoviews

holoviews.extension('bokeh')

merged.hvplot(x='YY', by='Home Team', y='NRtg_Home', height=950, width=800)

In [None]:
holoviews.extension('bokeh')

merged.hvplot(x='YY', by='Home Team', y='NRtg_Home', height=800, kind='scatter')

In [None]:
plt.figure(figsize=(30, 15))
sns.heatmap(merged.corr(numeric_only=True), cmap='coolwarm', annot=True)

# Models

In [None]:
# Split data

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

X = pd.get_dummies(
    merged.drop(
        [
            "MM",
            "DD",
            "Winner_B",
            "Loser_B",
            "Winner",
            "matchbook H",
            "matchbook A",
            "pinnacle H",
            "pinnacle A",
            "bet365 H",
            "bet365 A",
            "Home Score",
            "Away Score"
        ],
        axis='columns',
    )
    , drop_first=True
)
Y = merged["Winner_B"]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)


### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logr = LogisticRegression(solver='lbfgs')
logr.fit(X_train, Y_train)

In [None]:
print('Logistic Regression Scores')
print('Training score: ', logr.score(X_train, Y_train))
print('Test score: ', logr.score(X_test, Y_test))
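Accuracy scores are easier to interpret next to a naive baseline, such as always predicting the more common outcome. The sketch below uses synthetic labels; the 60% home-win rate is an assumed value for illustration, not a figure from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Synthetic labels: 1 = home win, drawn with an assumed 60% home-win rate
y = (rng.random(1000) < 0.6).astype(int)
X = np.zeros((len(y), 1))  # this baseline ignores the features entirely

# Always predict whichever class is most frequent in the training labels
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)
print("Baseline accuracy:", baseline_accuracy)
```

A model is only informative to the extent that its test score beats this kind of baseline on the same data.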

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X_train, Y_train)

In [None]:
print('Decision Tree Scores')
print('Training score: ', tree.score(X_train, Y_train))
print('Test score: ', tree.score(X_test, Y_test))
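A decision tree with no depth limit can fit the training set almost perfectly, so its training score says little about generalization. The split cell imports `cross_val_score` but the notebook never calls it; the sketch below shows one way it could be used to compare tree depths. It runs on synthetic data from `make_classification`, and the depth values are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the game features and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Mean 5-fold cross-validated accuracy for a few tree depths (None = unlimited)
cv_means = {}
for depth in (2, 5, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[depth] = scores.mean()

for depth, mean_score in cv_means.items():
    print(f"max_depth={depth}: {mean_score:.3f}")
```

Cross-validated scores vary less with the particular split than a single train/test comparison does.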

In [None]:
# See which features are most important
tree_features = pd.Series(data=tree.feature_importances_, index=X_train.columns)
tree_features.sort_values(ascending=False)

### Random Forest

In [None]:
from sklearn import ensemble

rfc = ensemble.RandomForestClassifier(n_estimators=500)
rfc.fit(X_train, Y_train)

In [None]:
print('Random Forest Scores')
print('Training score: ', rfc.score(X_train, Y_train))
print('Test score: ', rfc.score(X_test, Y_test))

In [None]:
# See which features are most important
rfc_features = pd.Series(data=rfc.feature_importances_, index=X_train.columns)
rfc_features.sort_values(ascending=False)

### KNN

In [None]:
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier(n_neighbors=2, weights='distance')
knn.fit(X_train, Y_train)

In [None]:
print('KNN Scores. K = 2')
print('Training score: ', knn.score(X_train, Y_train))
print('Test score: ', knn.score(X_test, Y_test))

In [None]:
knn = neighbors.KNeighborsClassifier(n_neighbors=20, weights='distance')
knn.fit(X_train, Y_train)

print('KNN Scores. K = 20')
print('Training score: ', knn.score(X_train, Y_train))
print('Test score: ', knn.score(X_test, Y_test))

In [None]:
knn = neighbors.KNeighborsClassifier(n_neighbors=10, weights='distance')
knn.fit(X_train, Y_train)

print('KNN Scores. K = 10')
print('Training score: ', knn.score(X_train, Y_train))
print('Test score: ', knn.score(X_test, Y_test))
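Two properties of KNN are worth keeping in mind when reading these scores. With `weights='distance'`, every training point is its own nearest neighbor at distance zero, so the training score is typically 1.0 whatever the value of K. KNN is also sensitive to feature scale. The sketch below, on synthetic data, standardizes the features in a pipeline and compares the same K values by cross-validation; it is an illustration, not the notebook's procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the game features and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

results = {}
for k in (2, 10, 20):
    # Scaling inside the pipeline is refit on each training fold
    pipe = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=k, weights="distance"),
    )
    results[k] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

best_k = max(results, key=results.get)
print(results, "best K:", best_k)
```

Putting the scaler inside the pipeline keeps information from the held-out fold out of the scaling step.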

### XGBoost

In [None]:
# Import only the classifier; the name xgb is used for the fitted model below
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, Y_train)

In [None]:
print('XGBoost Scores')
print('Training score: ', xgb.score(X_train, Y_train))
print('Test score: ', xgb.score(X_test, Y_test))

In [None]:
# See which features are most important
xgb_features = pd.Series(data=xgb.feature_importances_, index=X_train.columns)
xgb_features.sort_values(ascending=False)

### Neural Network (Multi Layer Perceptron)

In [None]:
import tensorflow as tf
import keras
from keras import backend as K
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.optimizers import RMSprop

In [None]:
Y = merged[["Winner_B", "Loser_B"]]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

In [None]:
from keras.models import Sequential

# Start with a simple sequential model
model = Sequential()

# Add dense layers to create a fully connected MLP
model.add(Dense(64, activation='relu', input_shape=(X_train.shape[1],)))

# Dropout layers randomly zero a fraction of activations during training to fight overfitting
model.add(Dropout(0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2, activation='softmax'))

model.summary()

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

In [None]:
history = model.fit(X_train, Y_train,
                    batch_size=128,
                    epochs=50,
                    verbose=1,
                    validation_data=(X_test, Y_test))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
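For reference, the two-unit softmax output and the categorical cross-entropy loss used by this network can be written as follows. The symbols are notation introduced here, not names from the code: z_H and z_A are the final-layer outputs for a home and an away win, y_{i,H} and y_{i,A} correspond to `Winner_B` and `Loser_B` for game i, and N is the number of games in a batch.

```latex
p_k = \frac{e^{z_k}}{e^{z_H} + e^{z_A}}, \quad k \in \{H, A\},
\qquad
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \left( y_{i,H} \log p_{i,H} + y_{i,A} \log p_{i,A} \right)
```

Because p_H + p_A = 1, the two-column target carries the same information as the single `Winner_B` label used by the other models.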

# Feature Importances

In [None]:
tree_importance = tree.feature_importances_
rfc_importance = rfc.feature_importances_
xgb_importance = xgb.feature_importances_

tree_importance = 100.0 * (tree_importance / tree_importance.sum())
rfc_importance = 100.0 * (rfc_importance / rfc_importance.sum())
xgb_importance = 100.0 * (xgb_importance / xgb_importance.sum())


tree_df = pd.DataFrame(data={'Percent_Importance':tree_importance, 'Feature':X_train.columns})
tree_df.sort_values('Percent_Importance', axis=0, ascending=False, inplace=True)

rfc_df = pd.DataFrame(data={'Percent_Importance':rfc_importance, 'Feature':X_train.columns})
rfc_df.sort_values('Percent_Importance', axis=0, ascending=False, inplace=True)

xgb_df = pd.DataFrame(data={'Percent_Importance':xgb_importance, 'Feature':X_train.columns})
xgb_df.sort_values('Percent_Importance', axis=0, ascending=False, inplace=True)
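The percent scaling above can be checked on a small example. In scikit-learn's tree-based models `feature_importances_` is normalized to sum to 1, so multiplying by 100 and dividing by the sum simply expresses each value as a percentage. The sketch below uses synthetic data and an arbitrary small forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the game features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Same scaling as the cell above: express each importance as a percentage
importances = forest.feature_importances_
percent = 100.0 * importances / importances.sum()
print(percent.round(1))
```

Importances from different model types are computed differently, so the percentages are best compared within a model rather than across models.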

In [None]:
f = plt.figure(figsize=(30, 15))

ax1 = plt.subplot(1, 3, 1)
sns.barplot(y='Feature', x='Percent_Importance', data=tree_df.iloc[:10,:], palette='Blues_r')
plt.xlabel('')
ax1.tick_params(axis='y', which='major', pad=30)
plt.title('Decision Tree \n', fontsize=20)

ax2 = plt.subplot(1, 3, 2)
sns.barplot(y='Feature', x='Percent_Importance', data=rfc_df.iloc[:10,:], palette='Blues_r')
plt.xlabel('\n Percent Importance', fontsize=50)
ax2.tick_params(axis='y', which='major', pad=30)
plt.title('Random Forest \n', fontsize=20)

ax3 = plt.subplot(1, 3, 3)
sns.barplot(y='Feature', x='Percent_Importance', data=xgb_df.iloc[:10,:], palette='Blues_r')
plt.xlabel('')
ax3.tick_params(axis='y', which='major', pad=30)
plt.title('XGBoost \n', fontsize=20)

plt.tight_layout()
plt.suptitle('              Feature Importances by Model \n', fontsize=40)
plt.subplots_adjust(top=0.88)
plt.show()

# Conclusion

The accuracy scores of most of the models attempted were similar; the neural network was the exception and did not perform as well as the others. Given the dataset, we are able to predict the winners of NBA games with around 65-70% accuracy. The most important features in predicting a winner were the adjusted team ratings. A bettor could use this model and information when betting on the moneyline odds for NBA games.
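Whether a predicted winner is worth betting on depends on the price as well as the prediction. The sketch below assumes decimal odds and a win probability supplied by the bettor, for example from a classifier's `predict_proba`; neither assumption comes from the notebook, and the function names are hypothetical.

```python
def expected_profit(win_probability: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit of one bet at decimal odds, given a win probability."""
    win_amount = stake * (decimal_odds - 1.0)  # profit when the bet wins
    return win_probability * win_amount - (1.0 - win_probability) * stake


def should_bet(win_probability: float, decimal_odds: float) -> bool:
    # Bet only when the expected profit is positive
    return expected_profit(win_probability, decimal_odds) > 0.0


# Made-up example: the same price is attractive at 70% but not at 60%
print(should_bet(0.70, 1.50), should_bet(0.60, 1.50))
```

Under these assumptions the break-even probability is 1 divided by the decimal odds, so a bet has positive expected profit only when the estimated probability exceeds that value.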

To improve these models, we could use more data. Moneyline odds tend to change depending on who is playing in the game. For example, a team might be missing its best player, which moves the odds in favor of the opposing team. In other words, we would want data on the players who play in each game. We could also use different advanced metrics, such as team tendencies and playstyles.