# Baseball Homerun Predictions for Y2021

## Introduction

We will attempt to predict the total number of homeruns hit in the 2021 MLB season. Because that season is already complete, the actual total is known, which gives us a direct measure of how accurate our model is.

We use player batting statistics from the 2000-2020 seasons, organized by year. The data come from http://www.seanlahman.com/baseball-archive/statistics/ (the 2020 comma-delimited version). That archive contains many different datasets, from which we took Batting.csv. The original Batting.csv goes back to 1871, which is of little help when modeling modern baseball, so our version of Batting.csv at https://github.com/dswetlik/BaseballHRPrediction/blob/master/Batting.csv has been cut down from a century and a half to two decades.

We still need to clean up the dataset a little more before we begin, because we want to compare league-wide yearly totals rather than individual player records. We will sum every statistic within each year and then fit models to those yearly totals.

Our modeling process is fairly straightforward. First, we fit an ordinary least squares model using all of the predictors to obtain their p-values. Next, we assess collinearity with a VIF analysis. Then we perform subset selection, trying several techniques such as forward subset selection. Finally, we use the selected model to predict the number of homeruns in the next year and assess our accuracy.

One initial disclaimer: the 2020 season was drastically shorter than every other season in the dataset because of the global pandemic, so its totals are much smaller. However, we expect the decrease in homeruns and games played to be reflected proportionately in the other predictors, so the 2020 data can still be used.

## Data Setup

In [1]:
# Basics and Plotting
import pandas as pd
import numpy as np
import scipy as scp
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import seaborn as sns
from itertools import chain, combinations

# Sklearn Models
import sklearn.linear_model as skl_lm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold, cross_val_score, cross_validate
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

# Alternative models
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.formula.api as smf

In [2]:
baseball = pd.read_csv("https://raw.githubusercontent.com/dswetlik/BaseballHRPrediction/master/Batting.csv")

In [3]:
baseball

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abbotje01,2000,1,CHA,AL,80,215,31,59,15,...,29,2,1,21,38,1,2,2,1,2
1,abbotku01,2000,1,NYN,NL,79,157,22,34,7,...,12,1,1,14,51,2,1,0,1,2
2,abbotpa01,2000,1,SEA,AL,35,5,1,2,1,...,0,0,0,0,1,0,0,1,0,0
3,abreubo01,2000,1,PHI,NL,154,576,103,182,42,...,79,28,8,100,116,9,1,0,3,12
4,aceveju01,2000,1,MIL,NL,62,1,1,0,0,...,0,0,0,1,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29518,zimmebr02,2020,1,BAL,AL,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29519,zimmejo02,2020,1,DET,AL,3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29520,zimmeky01,2020,1,KCA,AL,16,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29521,zuberty01,2020,1,KCA,AL,23,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Below we drop the columns playerID, teamID, stint, and lgID, which we judged inconsequential or irrelevant for determining league-wide homerun counts.

In [5]:
# Drop identifier columns only; the counting statistics (SB, CS, HBP, SH, SF, GIDP) are needed later.
# "Unnamed: 0" is the row index that was saved into the CSV. errors="ignore" keeps the cell safe to re-run.
baseball.drop(columns=["Unnamed: 0","playerID","teamID","stint","lgID"], errors="ignore", inplace=True)
baseball.rename(columns={"2B": "Double", "3B": "Triple"}, inplace=True)
baseball.head()

After dropping those columns, the remaining columns are as follows:


| Num | ID     | Name                       |
|-----|--------|----------------------------|
|  0  | yearID | Year                       |
|  1  | G      | Games                      |
|  2  | AB     | At Bats                    |
|  3  | R      | Runs                       |
|  4  | H      | Hits                       |
|  5  | Double | Doubles                    |
|  6  | Triple | Triples                    |
|  7  | HR     | Homeruns                   |
|  8  | RBI    | Runs Batted In             |
|  9  | SB     | Stolen Bases               |
| 10  | CS     | Caught Stealing            |
| 11  | BB     | Base on Balls              |
| 12  | SO     | Strikeouts                 |
| 13  | IBB    | Intentional Walks          |
| 14  | HBP    | Hit By Pitch               |
| 15  | SH     | Sacrifice Hits             |
| 16  | SF     | Sacrifice Flies            |
| 17  | GIDP   | Grounded into Double Plays |

This is almost usable for what we want, but it is still organized per-player, and we want it to be based on the year's total statistics. We will go through and create a new dataset now based on years.

In [None]:
# Sum every counting statistic across all players within each season (2000-2020)
stat_columns = ['G','AB','R','H','Double','Triple','HR','RBI','SB','CS','BB','SO','IBB','HBP','SH','SF','GIDP']
newBaseball = (
    baseball[baseball['yearID'].between(2000, 2020)]
    .groupby('yearID', as_index=False)[stat_columns]
    .sum()
)
newBaseball

In [None]:
newBaseball.plot.bar(x='yearID', y='HR', rot=90)

There appears to be a general upward trend in homeruns hit each year, with a steep drop in 2020 because COVID-19 shortened the season dramatically. We expect the values of all the predictors to reflect this decrease proportionately. In 2021, we expect the number of homeruns to return to the rising trend of the last decade. In the baseball community, this rise is sometimes described as a new "live ball era".
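One way to check the proportionality assumption is to divide each season's totals by its games played, so that a shortened season is put on the same scale as a full one. The sketch below uses made-up totals (not real MLB figures) and hypothetical variable names; note that in this dataset G is the sum of games played over all players, so the result would be a rate per player-game.

```python
import numpy as np

# Hypothetical season totals, invented purely for illustration (not real MLB figures).
# Columns: games (G), homeruns (HR), strikeouts (SO)
seasons = np.array([2018, 2019, 2020])
totals = np.array([
    [1000.0, 120.0, 850.0],   # full season
    [1000.0, 140.0, 880.0],   # full season
    [ 400.0,  54.0, 356.0],   # shortened season
])

# Divide every statistic by that season's games so that seasons of
# different lengths are expressed on the same per-game scale.
games = totals[:, [0]]
per_game = totals[:, 1:] / games

for year, (hr_rate, so_rate) in zip(seasons, per_game):
    print(f"{year}: HR/G = {hr_rate:.3f}, SO/G = {so_rate:.3f}")
```

If the per-game rates of the short season sit close to those of its neighbors, the raw totals have shrunk roughly in proportion, which is what the argument above relies on.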

Now that we have our data laid out in terms of total stats per year, we can continue.

In [None]:
mod = smf.ols(formula='HR ~ 1 + yearID + G + AB + R + H + Double + Triple + RBI + SB + CS + BB + SO + IBB + HBP + SH + SF + GIDP', data = newBaseball)

In [None]:
res = mod.fit()
res.summary()

We ran a VIF analysis to measure collinearity among the predictors and found that many of them had a VIF above 10, which indicates substantial collinearity. We will not drop any predictors because of this, but it is important to keep in mind as we move on to subset selection.

In [None]:
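# Compute VIFs on the predictors only (HR is the response), with an intercept column included
X_vif = sm.add_constant(newBaseball.drop(columns=['HR']))
vif = pd.DataFrame()
vif['X'] = X_vif.columns
vif['vif'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif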
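For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The following is a minimal sketch of that definition using only NumPy and synthetic data; the function name `vif_by_hand` and the toy variables are hypothetical and are not part of the analysis above.

```python
import numpy as np

def vif_by_hand(X):
    """Return the VIF of each column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        total_ss = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - (residuals @ residuals) / total_ss
        result.append(1.0 / (1.0 - r_squared))
    return np.array(result)

# Synthetic data (not baseball statistics): b is nearly a copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.05 * rng.normal(size=200)
c = rng.normal(size=200)
vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs)
```

In this toy setup the two near-duplicate columns should show very large VIFs while the independent column stays close to 1, which is the pattern the rule of thumb of 10 is meant to flag.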

## Subset Selection

After several attempts with different subset selection methods, we decided that forward subset selection was the best choice for the number of predictors we have. We will use the forward subset selection algorithm to determine the best subset (combination) of predictors. At each step below we keep the predictors already chosen, try adding each remaining predictor in turn, and keep the addition that gives the lowest BIC. The integers in the code are column positions in newBaseball; position 7 (HR) is excluded because it is the response. We repeat the process for several iterations to find the least complex model with the lowest BIC. This will not necessarily be the best model, however.
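The step-by-step cells that follow can be thought of as one greedy loop. Below is a minimal sketch of that loop under stated assumptions: it uses NumPy and synthetic data instead of statsmodels and newBaseball, the helper names `ols_bic` and `forward_select` are hypothetical, and the BIC is the Gaussian form n·ln(SSR/n) + k·ln(n), which differs from the statsmodels value only by a constant that depends on n. Unlike the cells below, it also compares the first predictor against an intercept-only model.

```python
import numpy as np

def ols_bic(X, y, columns):
    """Gaussian BIC (up to an additive constant) of OLS on an intercept plus the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ssr = np.sum((y - design @ coef) ** 2)
    return n * np.log(ssr / n) + design.shape[1] * np.log(n)

def forward_select(X, y):
    """Greedily add the column that lowers BIC the most; stop when no addition helps."""
    selected = []
    remaining = list(range(X.shape[1]))
    best_bic = ols_bic(X, y, selected)  # intercept-only starting point
    while remaining:
        bic, j = min((ols_bic(X, y, selected + [c]), c) for c in remaining)
        if bic >= best_bic:
            break  # same stopping rule as comparing one step's BIC with the previous step's
        best_bic = bic
        selected.append(j)
        remaining.remove(j)
    return selected, best_bic

# Synthetic data: only columns 0 and 3 actually drive the response
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=100)
selected, best_bic = forward_select(X, y)
print(selected, round(float(best_bic), 2))
```

Because the search is greedy, it fits far fewer models than an exhaustive best-subset search, at the cost of possibly missing a better combination.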

### Predictor 1

In [None]:
metric_store = [[],[]]
for i, combination in enumerate(combinations([0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17],1), 1):
    x_data = sm.add_constant(newBaseball.iloc[:,list(combination)])
    mod  = sm.OLS(newBaseball.HR, x_data).fit()
    metric_store[0].append(list(combination))
    metric_store[1].append(mod.bic)

In [None]:
metric_store

In [None]:
metric_store[0][np.argmin(metric_store[1])], np.min(metric_store[1])

### Predictor 2

In [None]:
metric_store = [[],[]]
for i, combination in enumerate(combinations([0,1,2,3,4,5,6,9,10,11,12,13,14,15,16,17],1), 1):
    x_data = sm.add_constant(newBaseball.iloc[:,[8] + list(combination)])
    mod  = sm.OLS(newBaseball.HR, x_data).fit()
    metric_store[0].append(list(combination))
    metric_store[1].append(mod.bic)

In [None]:
metric_store[0][np.argmin(metric_store[1])], np.min(metric_store[1])

### Predictor 3

In [None]:
metric_store = [[],[]]
for i, combination in enumerate(combinations([0,1,2,3,4,5,6,9,10,11,12,13,14,15,17],1), 1):
    x_data = sm.add_constant(newBaseball.iloc[:,[8,16] + list(combination)])
    mod  = sm.OLS(newBaseball.HR, x_data).fit()
    metric_store[0].append(list(combination))
    metric_store[1].append(mod.bic)

In [None]:
metric_store[0][np.argmin(metric_store[1])], np.min(metric_store[1])

### Predictor 4

In [None]:
metric_store = [[],[]]
for i, combination in enumerate(combinations([0,1,2,3,4,5,6,9,10,11,12,13,14,17],1), 1):
    x_data = sm.add_constant(newBaseball.iloc[:,[8,15,16] + list(combination)])
    mod  = sm.OLS(newBaseball.HR, x_data).fit()
    metric_store[0].append(list(combination))
    metric_store[1].append(mod.bic)

In [None]:
metric_store[0][np.argmin(metric_store[1])], np.min(metric_store[1])

### Predictor 5

In [None]:
metric_store = [[],[]]
for i, combination in enumerate(combinations([0,1,2,3,4,5,6,9,11,12,13,14,17],1), 1):
    x_data = sm.add_constant(newBaseball.iloc[:,[8,10,15,16] + list(combination)])
    mod  = sm.OLS(newBaseball.HR, x_data).fit()
    metric_store[0].append(list(combination))
    metric_store[1].append(mod.bic)

In [None]:
metric_store[0][np.argmin(metric_store[1])], np.min(metric_store[1])

We stop here because the lowest BIC found at Predictor 5 is higher than the lowest BIC found at Predictor 4. Our best subset is therefore columns [8,10,15,16], that is, RBI, CS, SH, and SF.

In [None]:
mod = smf.ols(formula='HR ~ 1 + RBI + CS + SH + SF', data = newBaseball)
res = mod.fit()
res.summary()
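Once a model like this is fitted, assessing accuracy amounts to predicting a season that was held out and comparing with the known total. The sketch below illustrates that idea with NumPy on synthetic numbers (invented for illustration, not real season totals); it assumes the predictor totals for the season being predicted are available, and all variable names are hypothetical.

```python
import numpy as np

# Synthetic "season totals", invented for illustration only
rng = np.random.default_rng(1)
n_seasons = 21
predictors = rng.uniform(50, 150, size=(n_seasons, 4))   # stand-ins for RBI, CS, SH, SF
true_coef = np.array([2.0, -1.0, 0.5, 1.5])
hr = 100 + predictors @ true_coef + rng.normal(scale=5, size=n_seasons)

# Fit on every season except the last, which plays the role of the season to predict
design = np.column_stack([np.ones(n_seasons), predictors])
coef, *_ = np.linalg.lstsq(design[:-1], hr[:-1], rcond=None)

# Predict the held-out season and compare with its known total
predicted = design[-1] @ coef
actual = hr[-1]
percent_error = 100 * abs(predicted - actual) / abs(actual)
print(f"predicted={predicted:.1f}, actual={actual:.1f}, error={percent_error:.2f}%")
```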

In [None]:
plt.figure(dpi = 150)
# Each selected predictor is plotted against HR; the x-axis is that predictor's season total
plt.plot(newBaseball['RBI'], newBaseball['HR'], '.', markersize=10, markeredgecolor="black", color="goldenrod", label="RBI")
plt.plot(newBaseball['CS'], newBaseball['HR'], '.', markersize=10, markeredgecolor="black", color="red", label="CS")
plt.plot(newBaseball['SH'], newBaseball['HR'], '.', markersize=10, markeredgecolor="black", color="green", label="SH")
plt.plot(newBaseball['SF'], newBaseball['HR'], '.', markersize=10, markeredgecolor="black", color="blue", label="SF")
plt.xlabel("Season total of predictor")
plt.ylabel("Home Runs")
plt.legend()
plt.grid()
plt.show()