# Bayesian Prediction Machine Learning Modeling

This Jupyter Notebook takes a deeper dive into the Bayesian predictions for MLB games from 2015-2018. We take the output of the head_to_head.py Python script and add some new features that should help our algorithms perform better.

For background, the Bayesian model (which calculates a winning probability from each team's home / away record and overall winning percentage) is roughly 55% accurate for each season.
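
The head_to_head.py script is not reproduced in this notebook, so the exact calculation is not shown here. As a hedged reading, the sample rows in the data preview are consistent with a naive-Bayes-style combination of a team's overall and venue-specific winning percentages, followed by a rescaling across the two teams. The function names below are hypothetical, and the formula is inferred from the output columns rather than taken from the script.

```python
def combined_win_prob(total_wins, total_losses, venue_wins, venue_losses):
    """Combine overall and venue-specific win rates (inferred formula, an assumption)."""
    overall = total_wins / (total_wins + total_losses)
    venue = venue_wins / (venue_wins + venue_losses)
    both_win = overall * venue
    both_lose = (1 - overall) * (1 - venue)
    return both_win / (both_win + both_lose)


def head_to_head_probs(home_prob, away_prob):
    """Rescale the two team-level values so that they sum to 1."""
    total = home_prob + away_prob
    return home_prob / total, away_prob / total


# Records taken from the 2018-10-23 row (Red Sox at home against the Dodgers)
home = combined_win_prob(115, 56, 59, 26)   # overall record, then home record
away = combined_win_prob(99, 75, 50, 37)    # overall record, then away record
home_win, away_win = head_to_head_probs(home, away)
print(round(home, 6), round(away, 6), round(home_win, 6), round(away_win, 6))
```

For that row the printed values should line up with the HomeTeamBayesianWinningProb, AwayTeamBayesianWinningProb, HomeWinningProb and AwayWinningProb columns, which is what suggests this reading in the first place.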

In [1]:
# Import the libraries 

import pandas as pd
import numpy as np
from datetime import datetime

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split



In [3]:
# Import Data that is the output of the head_to_head.py script for seasons 2015-2018
HeadToHead_All = pd.read_excel('Bayesian Head to Head 2015-2018.xlsx')
HeadToHead_All.tail()

Unnamed: 0,Season,Date,HomeTeam,AwayTeam,HomeTeamTotalWins,HomeTeamTotalLosses,AwayTeamTotalWins,AwayTeamTotalLosses,HomeTeamHomeWins,HomeTeamHomeLosses,AwayTeamAwayWins,AwayTeamAwayLosses,HomeTeamBayesianWinningProb,AwayTeamBayesianWinningProb,HomeWinningProb,AwayWinningProb,WinningTeam,CorrectPrediction
9855,2018,2018-10-23,Boston Red Sox,Los Angeles Dodgers,115,56,99,75,59,26,50,37,0.823322,0.640777,0.562341,0.437659,Boston Red Sox,1.0
9856,2018,2018-10-24,Boston Red Sox,Los Angeles Dodgers,116,56,99,76,60,26,50,38,0.826996,0.631539,0.567005,0.432995,Boston Red Sox,1.0
9857,2018,2018-10-26,Los Angeles Dodgers,Boston Red Sox,99,77,117,56,49,38,56,30,0.623762,0.795918,0.439368,0.560632,Los Angeles Dodgers,0.0
9858,2018,2018-10-27,Los Angeles Dodgers,Boston Red Sox,100,77,117,57,50,38,56,31,0.630835,0.787595,0.444742,0.555258,Boston Red Sox,1.0
9859,2018,2018-10-28,Los Angeles Dodgers,Boston Red Sox,100,78,118,57,50,39,57,31,0.621736,0.791946,0.439799,0.560201,Boston Red Sox,1.0


In [5]:
# Drop all of the rows that have NULL values: these are records from too early in the
# season for the Bayesian calculations to work correctly
HeadToHead_All.dropna(inplace=True)

## Create the game number of the series
I wanted to add a few additional features to the dataset, including which game of the series it is. For example, if a team comes to play the Rockies for 4 games, I want to know which game of the 4-game series each one is. It might help the algorithms determine whether there is a pattern of home teams winning the first game, or it might have no effect at all. Either way, it is not too difficult to add.

In [6]:
# Need to sort the games by Season, Home Team, Date, and Away Team for our algorithm to work
series_df = HeadToHead_All.sort_values(['Season','HomeTeam', 'Date','AwayTeam'])

In [7]:
i = 0

series_vals = []

for game_idx, (season, date, HomeTeam, AwayTeam) in zip(list(series_df.index)
                                                        , series_df[['Season','Date','HomeTeam','AwayTeam']].values
                                                       ):
    
    # If it is the very first value, it won't have any previous teams, so it will have to be the first series game
    if i == 0:
        series_game_num = 1
        
        
    else:
        current_away_team = AwayTeam
        current_home_team = HomeTeam
        
        # Want to see what the previous home / away teams were in the last record to see if 
        # it is the same series or a new series
        prev_away_team = series_df.loc[prev_game_idx]['AwayTeam']
        prev_home_team = series_df.loc[prev_game_idx]['HomeTeam']
        
        # If it is the same series
        if current_away_team == prev_away_team and current_home_team == prev_home_team:
            series_game_num += 1
        # If it is a new series
        else:
            series_game_num = 1
    
    prev_game_idx = game_idx
    series_vals.append(series_game_num)
    i +=1


In [8]:
series_df['SeriesGameNum'] = series_vals
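
The same numbering can also be produced without an explicit Python loop. The sketch below is one possible vectorized version, not the code used for the results in this notebook. The function name `add_series_game_num` is hypothetical, and the demo rows are a shuffled copy of a few Rockies home games. Like the loop, it starts a new series whenever the home / away pairing differs from the previous row after sorting.

```python
import pandas as pd


def add_series_game_num(games):
    """Return a sorted copy of `games` with a SeriesGameNum column."""
    out = games.sort_values(['Season', 'HomeTeam', 'Date', 'AwayTeam']).copy()

    # A new series starts whenever the home/away pairing differs from the previous row
    matchup = out[['HomeTeam', 'AwayTeam']]
    new_series = matchup.ne(matchup.shift()).any(axis=1)

    # A running count of series starts gives every row a series id
    series_id = new_series.cumsum()

    # Position of the game within its series, counted from 1
    out['SeriesGameNum'] = out.groupby(series_id).cumcount() + 1
    return out


# Demo rows, deliberately out of order to show that the function sorts first
demo = pd.DataFrame({
    'Season': [2015, 2015, 2015, 2015, 2015],
    'Date': ['2015-04-21', '2015-04-12', '2015-04-20', '2015-04-11', '2015-04-22'],
    'HomeTeam': ['Colorado Rockies'] * 5,
    'AwayTeam': ['San Diego Padres', 'Chicago Cubs', 'San Diego Padres',
                 'Chicago Cubs', 'San Diego Padres'],
})
print(add_series_game_num(demo)[['Date', 'AwayTeam', 'SeriesGameNum']])
```

Both versions share one limitation: two separate series against the same opponent are counted as one if no other home series falls between them in the sorted data.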

In [9]:
# Make sure it worked by taking a sample
series_df[series_df['HomeTeam'] == 'Colorado Rockies'][
    ['Season','Date','HomeTeam','AwayTeam','SeriesGameNum']].head(15)

Unnamed: 0,Season,Date,HomeTeam,AwayTeam,SeriesGameNum
68,2015,2015-04-11,Colorado Rockies,Chicago Cubs,1
83,2015,2015-04-12,Colorado Rockies,Chicago Cubs,2
187,2015,2015-04-20,Colorado Rockies,San Diego Padres,1
196,2015,2015-04-21,Colorado Rockies,San Diego Padres,2
211,2015,2015-04-22,Colorado Rockies,San Diego Padres,3
225,2015,2015-04-23,Colorado Rockies,San Diego Padres,4
240,2015,2015-04-24,Colorado Rockies,San Francisco Giants,1
253,2015,2015-04-25,Colorado Rockies,San Francisco Giants,2
400,2015,2015-05-06,Colorado Rockies,Arizona Diamondbacks,1
401,2015,2015-05-06,Colorado Rockies,Arizona Diamondbacks,2


In [10]:
# Add in some more features
series_df['HomeTeamTotalGames'] = series_df['HomeTeamTotalWins'] + series_df['HomeTeamTotalLosses']
series_df['AwayTeamTotalGames'] = series_df['AwayTeamTotalWins'] + series_df['AwayTeamTotalLosses']

# Absolute spread between the home winning probability and the away winning probability
series_df['BayesianProbSpread'] = abs(series_df['HomeWinningProb'] - series_df['AwayWinningProb'])

# Get the day of the year to show if it is early in the season or late in the season
series_df['DayOfYear'] = series_df['Date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timetuple().tm_yday)
series_df.tail()

Unnamed: 0,Season,Date,HomeTeam,AwayTeam,HomeTeamTotalWins,HomeTeamTotalLosses,AwayTeamTotalWins,AwayTeamTotalLosses,HomeTeamHomeWins,HomeTeamHomeLosses,...,AwayTeamBayesianWinningProb,HomeWinningProb,AwayWinningProb,WinningTeam,CorrectPrediction,SeriesGameNum,HomeTeamTotalGames,AwayTeamTotalGames,BayesianProbSpread,DayOfYear
9713,2018,2018-09-22,Washington Nationals,New York Mets,77,77,72,82,37,39,...,0.461235,0.513505,0.486495,Washington Nationals,1.0,3,154,154,0.027009,265
9728,2018,2018-09-23,Washington Nationals,New York Mets,78,77,72,83,38,39,...,0.452101,0.523519,0.476481,New York Mets,0.0,4,155,155,0.047038,266
9740,2018,2018-09-24,Washington Nationals,Miami Marlins,78,78,62,93,38,40,...,0.242424,0.667732,0.332268,Washington Nationals,1.0,1,156,155,0.335463,267
9754,2018,2018-09-25,Washington Nationals,Miami Marlins,79,78,62,94,39,40,...,0.236867,0.677171,0.322829,Washington Nationals,1.0,2,157,156,0.354341,268
9770,2018,2018-09-26,Washington Nationals,Miami Marlins,80,78,62,95,40,40,...,0.231487,0.686254,0.313746,Washington Nationals,1.0,3,158,157,0.372507,269
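
As a side note, the day-of-year feature can be computed without a row-by-row `apply`. The sketch below bundles the four new features into one hypothetical helper, `add_basic_features`. It assumes the `Date` column holds either 'YYYY-MM-DD' strings or datetime values, and the single demo row reuses the figures from the 2018-09-22 Nationals game.

```python
import pandas as pd


def add_basic_features(games):
    """Return a copy of `games` with the four extra feature columns."""
    out = games.copy()
    out['HomeTeamTotalGames'] = out['HomeTeamTotalWins'] + out['HomeTeamTotalLosses']
    out['AwayTeamTotalGames'] = out['AwayTeamTotalWins'] + out['AwayTeamTotalLosses']
    out['BayesianProbSpread'] = (out['HomeWinningProb'] - out['AwayWinningProb']).abs()
    # pd.to_datetime handles both date strings and datetime values
    out['DayOfYear'] = pd.to_datetime(out['Date']).dt.dayofyear
    return out


demo = pd.DataFrame({
    'Date': ['2018-09-22'],
    'HomeTeamTotalWins': [77], 'HomeTeamTotalLosses': [77],
    'AwayTeamTotalWins': [72], 'AwayTeamTotalLosses': [82],
    'HomeWinningProb': [0.513505], 'AwayWinningProb': [0.486495],
})
print(add_basic_features(demo).iloc[0])
```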


In [11]:
# Create the target variable that the models will predict.
# It doesn't matter whether we predict if the home team will win or if the away team will win,
# we're just trying to determine how accurate the models are

def home_team_winner(row):
    if row['HomeTeam'] == row['WinningTeam']:
        return 1 
    else:
        return 0
    
series_df['HomeTeamWin'] = series_df.apply(home_team_winner, axis = 1)
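
The same target column can also be built with a single column comparison instead of a row-wise `apply`. This is a minimal sketch on two made-up rows, not a change to the cell above.

```python
import pandas as pd

# Two illustrative rows: the home team wins the first game and loses the second
demo = pd.DataFrame({
    'HomeTeam': ['Los Angeles Dodgers', 'Los Angeles Dodgers'],
    'WinningTeam': ['Los Angeles Dodgers', 'Boston Red Sox'],
})

# True/False comparison, converted to 1/0
demo['HomeTeamWin'] = (demo['HomeTeam'] == demo['WinningTeam']).astype(int)
print(demo['HomeTeamWin'].tolist())  # prints [1, 0]
```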

In [12]:
Final_Data = series_df.drop(['Date','Season','HomeTeam','CorrectPrediction','AwayTeam','WinningTeam'], axis = 1)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(Final_Data.loc[:, ~Final_Data.columns.isin(['HomeTeamWin'])]
                                                   , Final_Data.loc[:, 'HomeTeamWin']
                                                   )

In [24]:
rfc = RandomForestClassifier(500, random_state = 534)
rfc.fit(X_train, y_train)
print('-- Random Forest -- ')
print('Training Accuracy: ', accuracy_score(y_train, rfc.predict(X_train)))
print('Testing Accuracy: ', accuracy_score(y_test, rfc.predict(X_test)))
print('Whole Dataset: ', accuracy_score(series_df['HomeTeamWin'],rfc.predict(series_df.loc[:, X_train.columns])))
print('\n')

lr = LogisticRegression(random_state = 534)
lr.fit(X_train, y_train)
print('-- Logistic Regression -- ')
print('Training Accuracy: ', accuracy_score(y_train, lr.predict(X_train)))
print('Testing Accuracy: ', accuracy_score(y_test, lr.predict(X_test)))
print('Whole Dataset: ', accuracy_score(series_df['HomeTeamWin'],lr.predict(series_df.loc[:, X_train.columns])))
print('\n')

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('-- K Nearest Neighbors -- ')
print('Training Accuracy: ', accuracy_score(y_train, knn.predict(X_train)))
print('Testing Accuracy: ', accuracy_score(y_test, knn.predict(X_test)))
print('Whole Dataset: ', accuracy_score(series_df['HomeTeamWin'],knn.predict(series_df.loc[:, X_train.columns])))
print('\n')

sv = SVC()
sv.fit(X_train, y_train)
print('-- SVC -- ')
print('Training Accuracy: ', accuracy_score(y_train, sv.predict(X_train)))
print('Testing Accuracy: ', accuracy_score(y_test, sv.predict(X_test)))
print('Whole Dataset: ', accuracy_score(series_df['HomeTeamWin'],sv.predict(series_df.loc[:, X_train.columns])))

-- Random Forest -- 
Training Accuracy:  0.9976545253863135
Testing Accuracy:  0.5198675496688742
Whole Dataset:  0.8782077814569537


-- Logistic Regression -- 
Training Accuracy:  0.5605684326710817
Testing Accuracy:  0.5401490066225165
Whole Dataset:  0.5554635761589404


-- K Nearest Neighbors -- 
Training Accuracy:  0.697985651214128
Testing Accuracy:  0.5227649006622517
Whole Dataset:  0.6541804635761589


-- SVC -- 
Training Accuracy:  0.9235651214128036
Testing Accuracy:  0.5202814569536424
Whole Dataset:  0.8227442052980133


In [22]:
# What features did the Random Forest look at the most?
pd.DataFrame(list(zip(rfc.feature_importances_, X_train.columns)), columns = ['Feature Importance','Feature']
            ).sort_values('Feature Importance',ascending = False)

Unnamed: 0,Feature Importance,Feature
9,0.085121,AwayTeamBayesianWinningProb
8,0.082683,HomeTeamBayesianWinningProb
10,0.076372,HomeWinningProb
11,0.075085,AwayWinningProb
15,0.07352,BayesianProbSpread
16,0.065384,DayOfYear
13,0.055248,HomeTeamTotalGames
14,0.055203,AwayTeamTotalGames
3,0.055162,AwayTeamTotalLosses
2,0.055113,AwayTeamTotalWins


# Conclusion

Even though the Bayesian probabilities only look at a small subset of the data, they make correct predictions roughly 55% of the time, while the more complex machine learning models are correct only about 52%-54% of the time on data they have not seen before (see the testing accuracies in the run shown above). The Random Forest and SVC score above 92% on the training data but only about 52% on the test data, which indicates overfitting. That is why, for very random outcomes (like who will win a baseball game), it is sometimes better to use the simplest model to make predictions instead of models that are difficult to interpret.
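
To make this comparison on exactly the same games, the Bayesian pick can be scored on the held-out rows only. The sketch below is one way to do that. The helper name `baseline_accuracies` is hypothetical, and the "always pick the home team" baseline is an extra reference point that is not part of the analysis above. The demo frame is toy data, not the real results.

```python
import pandas as pd


def baseline_accuracies(games, test_index):
    """Score two simple baselines on the held-out rows only."""
    held_out = games.loc[test_index]
    return {
        # Share of held-out games where the Bayesian favorite won
        'bayesian_pick': held_out['CorrectPrediction'].mean(),
        # Share of held-out games won by the home team
        'always_home': held_out['HomeTeamWin'].mean(),
    }


# Toy data with a non-default index, as would be the case after dropna()
demo = pd.DataFrame(
    {'CorrectPrediction': [1.0, 0.0, 1.0, 1.0], 'HomeTeamWin': [1, 1, 0, 0]},
    index=[10, 11, 12, 13],
)
print(baseline_accuracies(demo, [10, 12, 13]))
```

In this notebook the call would presumably be `baseline_accuracies(series_df, X_test.index)`, since `train_test_split` keeps the original row labels on `X_test`.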