## Baseball Home Run Predictions for 2021

We will attempt to predict the total number of home runs hit in the 2021 MLB season. Because that season is already complete, the actual total is known, which gives us a direct measure of how accurate our model is.

In [1]:
# Basics and Plotting
import pandas as pd
import numpy as np
import scipy as scp
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import seaborn as sns
from itertools import chain, combinations

# Sklearn Models
import sklearn.linear_model as skl_lm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold, cross_val_score, cross_validate
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

# Alternative models
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.formula.api as smf

In [2]:
baseball = pd.read_csv("https://raw.githubusercontent.com/dswetlik/BaseballHRPrediction/master/Batting.csv")

In [3]:
baseball

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,aardsda01,2010,1,SEA,AL,53,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,abadfe01,2010,1,HOU,NL,22,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,abreubo01,2010,1,LAA,AL,154,573,88,146,41,...,78,24,10,87,132,3,2,0,5,13
3,abreuto01,2010,1,ARI,NL,81,193,16,45,11,...,13,2,1,4,47,0,0,0,4,8
4,accarje01,2010,1,TOR,AL,5,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15919,zimmebr02,2020,1,BAL,AL,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15920,zimmejo02,2020,1,DET,AL,3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15921,zimmeky01,2020,1,KCA,AL,16,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15922,zuberty01,2020,1,KCA,AL,23,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Below we drop the columns playerID, teamID, stint, and lgID, which we judged irrelevant for determining league-wide home run counts. We also rename 2B and 3B to Double and Triple, since column names that begin with a digit cannot be written directly in a model formula.

In [6]:
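# Drop identifier columns that are not used as predictors.
baseball.drop(columns=["playerID","teamID","stint","lgID"], inplace=True)
# Give 2B and 3B names that are valid identifiers.
baseball.rename(columns={"2B": "Double", "3B": "Triple"}, inplace=True)
baseball.head()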

Unnamed: 0,yearID,G,AB,R,H,Double,Triple,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,2010,53,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2010,22,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
2,2010,154,573,88,146,41,1,20,78,24,10,87,132,3,2,0,5,13
3,2010,81,193,16,45,11,1,1,13,2,1,4,47,0,0,0,4,8
4,2010,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


This is almost what we need, but the data is still organized per player, and we want league-wide totals for each year. We now build a new dataset with one row per year.

In [83]:
# Counting stats to total over all players in each season.
statCols = ['G','AB','R','H','Double','Triple','HR','RBI','SB','CS','BB','SO','IBB','HBP','SH','SF','GIDP']

baseballYearTotal = []
for i in range(2010,2021):
    # All player rows for season i.
    baseballYear = baseball.loc[baseball['yearID'] == i]
    # One league-wide total per stat, in the same order as statCols.
    totals = [baseballYear[col].sum() for col in statCols]
    baseballYearTotal.append([i] + totals)

newBaseball = pd.DataFrame(baseballYearTotal, columns=['yearID'] + statCols)
newBaseball

Unnamed: 0,yearID,G,AB,R,H,Double,Triple,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,2010,68921,165353,21308,42554,8486,866,4613,20288,2959,1129,15778,34306,1216,1549,1544,1301,3719
1,2011,68729,165705,20808,42267,8399,898,4552,19804,3279,1261,15018,34488,1231,1554,1667,1274,3523
2,2012,69519,165251,21017,42063,8261,927,4934,19998,3229,1136,14709,36426,1055,1494,1479,1223,3614
3,2013,69268,166070,20255,42093,8222,772,4661,19271,2693,1007,14640,36710,1018,1536,1383,1219,3732
4,2014,69564,165614,19761,41595,8137,849,4186,18745,2764,1035,14020,37441,985,1652,1343,1277,3609
5,2015,70534,165488,20647,42106,8242,939,4909,19650,2505,1064,14073,37446,951,1602,1200,1232,3739
6,2016,70451,165561,21744,42276,8254,873,5610,20745,2537,1001,15088,38982,932,1651,1025,1214,3719
7,2017,70743,165567,22582,42215,8397,795,6105,21558,2527,934,15829,40104,970,1763,925,1168,3804
8,2018,71590,165432,21630,41018,8264,847,5585,20606,2474,958,15686,41207,929,1922,823,1235,3457
9,2019,71684,166651,23467,42039,8531,785,6776,22471,2280,832,15895,42823,753,1984,776,1150,3463
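
The loop above totals each statistic season by season. As a minimal sketch of an alternative, the same kind of per-year totals can be produced with a pandas `groupby`. The `players` frame below holds made-up numbers for illustration only and is not taken from Batting.csv.

```python
import pandas as pd

# Toy per-player rows (made-up numbers, not taken from Batting.csv).
players = pd.DataFrame({
    "yearID": [2010, 2010, 2011, 2011, 2011],
    "G":      [154, 81, 22, 5, 100],
    "HR":     [20, 1, 0, 0, 12],
})

# One row per season, each stat summed over all players in that season.
yearly = players.groupby("yearID", as_index=False).sum()
print(yearly)
```

Applied to the real data, this approach would sum every remaining numeric column, so it assumes the identifier columns have already been dropped.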


With the data now expressed as league-wide totals per year, we can fit a first regression model for HR.

In [105]:
mod = smf.ols(formula='HR ~ G + AB + R + H + Double + Triple + RBI + SB + CS', data = newBaseball)

In [106]:
res = mod.fit()
res.summary()


0,1,2,3
Dep. Variable:,HR,R-squared:,1.0
Model:,OLS,Adj. R-squared:,
Method:,Least Squares,F-statistic:,
Date:,"Tue, 30 Nov 2021",Prob (F-statistic):,
Time:,17:44:06,Log-Likelihood:,206.35
No. Observations:,11,AIC:,-390.7
Df Residuals:,0,BIC:,-386.3
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-839.9026,inf,-0,,,
G,0.5146,inf,0,,,
AB,-0.3737,inf,-0,,,
R,-9.5916,inf,-0,,,
H,0.6287,inf,0,,,
Double,0.5229,inf,0,,,
Triple,-3.3573,inf,-0,,,
RBI,9.7671,inf,0,,,
SB,0.9945,inf,0,,,

0,1,2,3
Omnibus:,27.181,Durbin-Watson:,0.046
Prob(Omnibus):,0.0,Jarque-Bera (JB):,24.457
Skew:,-2.568,Prob(JB):,4.89e-06
Kurtosis:,8.195,Cond. No.,3620000.0
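
The summary above reports 11 observations, 0 residual degrees of freedom, an R-squared of 1.0, and infinite standard errors. That pattern is what least squares produces whenever the number of fitted parameters equals the number of observations: the fit passes through every point, whether or not the predictors carry any information. The sketch below illustrates this with random numbers (not the baseball data), using NumPy's `lstsq` rather than statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_predictors = 11, 10

# Pure noise: the response has no real relationship to the predictors.
X = rng.normal(size=(n_obs, n_predictors))
y = rng.normal(size=n_obs)

# Design matrix with an intercept column, giving 11 parameters in total.
A = np.column_stack([np.ones(n_obs), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ coef
df_resid = n_obs - A.shape[1]
r_squared = 1 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

print("residual degrees of freedom:", df_resid)
print("largest absolute residual:", np.abs(residuals).max())
print("R-squared:", r_squared)
```

Because a perfect fit here says nothing about predictive value, a model on yearly totals would need fewer predictors than observations before its R-squared or standard errors could be interpreted.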


In [12]:
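# Variance inflation factor for every column of the per-player data.
# Note: this includes HR itself and adds no intercept column.
vif = pd.DataFrame()
vif['X'] = baseball.columns
vif['vif'] = [variance_inflation_factor(baseball.values, i) for i in range(len(baseball.columns))]
vif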

Unnamed: 0,X,vif
0,yearID,2.839975
1,G,13.092229
2,AB,184.898913
3,R,73.586918
4,H,175.443509
5,Double,23.122008
6,Triple,2.950846
7,HR,23.462745
8,RBI,59.835585
9,SB,4.254755


In [11]:
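# mod2 is not defined in any cell shown above. Judging from the summary below,
# it is an OLS model of HR fit on the per-player `baseball` data (15,924 rows).
res2 = mod2.fit()
res2.summary()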

0,1,2,3
Dep. Variable:,HR,R-squared:,0.716
Model:,OLS,Adj. R-squared:,0.716
Method:,Least Squares,F-statistic:,4465.0
Date:,"Tue, 30 Nov 2021",Prob (F-statistic):,0.0
Time:,14:11:31,Log-Likelihood:,-43772.0
No. Observations:,15924,AIC:,87560.0
Df Residuals:,15914,BIC:,87640.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-212.5017,19.661,-10.808,0.000,-251.040,-173.964
yearID,0.1056,0.010,10.827,0.000,0.087,0.125
Triple,0.2674,0.031,8.663,0.000,0.207,0.328
SB,-0.0013,0.011,-0.126,0.900,-0.022,0.020
CS,0.0522,0.033,1.586,0.113,-0.012,0.117
IBB,0.8258,0.019,42.987,0.000,0.788,0.863
HBP,0.6739,0.016,41.912,0.000,0.642,0.705
SH,-0.3958,0.017,-22.692,0.000,-0.430,-0.362
SF,0.9254,0.029,32.036,0.000,0.869,0.982

0,1,2,3
Omnibus:,6358.717,Durbin-Watson:,1.948
Prob(Omnibus):,0.0,Jarque-Bera (JB):,101730.987
Skew:,1.492,Prob(JB):,0.0
Kurtosis:,15.018,Cond. No.,1320000.0
