## EECS 731 Project 4
### Adam Podgorny

In [536]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

In [537]:
bb = pd.read_csv("mlb_elo.csv")

In [538]:
bb.columns

Index(['date', 'season', 'neutral', 'playoff', 'team1', 'team2', 'elo1_pre',
       'elo2_pre', 'elo_prob1', 'elo_prob2', 'elo1_post', 'elo2_post',
       'rating1_pre', 'rating2_pre', 'pitcher1', 'pitcher2', 'pitcher1_rgs',
       'pitcher2_rgs', 'pitcher1_adj', 'pitcher2_adj', 'rating_prob1',
       'rating_prob2', 'rating1_post', 'rating2_post', 'score1', 'score2'],
      dtype='object')

In [539]:
bb['team1'].unique()

array(['TBD', 'NYY', 'SDP', 'LAD', 'ATL', 'CHC', 'OAK', 'CLE', 'MIN',
       'STL', 'CHW', 'ARI', 'TOR', 'WSN', 'TEX', 'SFG', 'KCR', 'SEA',
       'MIL', 'BOS', 'PIT', 'NYM', 'CIN', 'ANA', 'FLA', 'COL', 'PHI',
       'HOU', 'DET', 'BAL', 'WS9', 'BL2', 'CL3', 'LS2', 'PHP', 'ML3',
       'BSP', 'CL6', 'CN3', 'PH4', 'SR2', 'RC2', 'TL2', 'BR4', 'IN3',
       'KC2', 'WS8', 'DTN', 'NY4', 'SLU', 'KCN', 'BFN', 'PRO', 'RIC',
       'CL2', 'TL1', 'CL5', 'IN2', 'WS7', 'WOR', 'TRN', 'CN1', 'SR1',
       'ML1', 'IN1', 'SL2', 'LS1', 'HR1', 'NY2', 'PH1', 'NH1', 'PH2',
       'BR2', 'SL1', 'KEO', 'WS6', 'PH3', 'BL1', 'WS5', 'ELI', 'BL4',
       'BR1', 'CL1', 'MID', 'TRO', 'WS4', 'WS3', 'FW1', 'RC1'],
      dtype=object)

In [540]:
bb = bb.drop(['elo1_post', "elo2_post", "rating1_post", "rating2_post", "pitcher1_adj", "pitcher2_adj"], axis=1)
len(bb['team1'].unique())
bb = bb.drop(['date'], axis=1)

89 teams. That's a lot, with a lot of turnover and changes in teams. Let's restrict this to only the last 20 years, for relevance.

I would also note that since we want to do regression analysis, presumably before a game, having the post-game adjustments in the feature list wouldn't be particularly useful, as that would require knowing the outcome before the game is played. While one could regress on them, this would mean running multiple sets of regressions over mostly similar features. This seems potentially perilous and introduces a circular dependence: the post-game features are derived from the score, yet they would also be used to predict the score. It is also more in the spirit of this analysis to exclude post-game knowledge as features, so these will be excluded.
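As a minimal sketch of the same idea, the post-game columns can be selected by suffix rather than listed by hand. The helper name `drop_post_game_columns` and the toy frame are made up for illustration, and the sketch only covers the `_post` suffix (the `pitcher*_adj` columns dropped above would still need to be named explicitly).

```python
import pandas as pd

def drop_post_game_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns whose name ends in '_post'."""
    post_cols = [c for c in df.columns if c.endswith("_post")]
    return df.drop(columns=post_cols)

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "elo1_pre": [1500.0],
    "elo1_post": [1503.2],
    "rating1_pre": [1498.0],
    "rating1_post": [1501.0],
    "score1": [4.0],
})

pre_game = drop_post_game_columns(toy)
print(list(pre_game.columns))
```

Selecting by suffix keeps the filter working if more `_post` columns show up later, at the cost of relying on the naming convention.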

There are a lot of empty entries in the pitcher_rgs columns. Let's pare down a little and see what happens.

In [541]:
new_bb = bb[bb['season'] > 2000]

In [542]:
new_bb['team1'].unique()

array(['TBD', 'NYY', 'SDP', 'LAD', 'ATL', 'CHC', 'OAK', 'CLE', 'MIN',
       'STL', 'CHW', 'ARI', 'TOR', 'WSN', 'TEX', 'SFG', 'KCR', 'SEA',
       'MIL', 'BOS', 'PIT', 'NYM', 'CIN', 'ANA', 'FLA', 'COL', 'PHI',
       'HOU', 'DET', 'BAL'], dtype=object)

In [543]:
len(new_bb['team1'].unique())

30

Much better. This also solves the problem of trying to do a regression for teams that could never have matched up, and for which no such information exists. This should make the regressions far more useful, too.

In [544]:
new_bb.corr()

Unnamed: 0,season,neutral,elo1_pre,elo2_pre,elo_prob1,elo_prob2,rating1_pre,rating2_pre,pitcher1_rgs,pitcher2_rgs,rating_prob1,rating_prob2,score1,score2
season,1.0,0.028694,0.008582,0.008821,-0.000485,0.000485,0.008379,0.008931,0.016534,0.021214,-0.012488,0.012488,-0.03296,-0.031084
neutral,0.028694,1.0,-0.002598,0.015899,-0.025794,0.025794,-0.002621,0.016968,-0.007272,0.005295,-0.025144,0.025144,-0.005183,0.010755
elo1_pre,0.008582,-0.002598,1.0,-0.009351,0.714969,-0.714969,0.987906,-0.012772,0.373241,0.003824,0.655151,-0.655151,0.063117,-0.089282
elo2_pre,0.008821,0.015899,-0.009351,1.0,-0.704723,0.704723,-0.014074,0.987883,0.015175,0.372343,-0.644216,0.644216,-0.082932,0.070817
elo_prob1,-0.000485,-0.025794,0.714969,-0.704723,1.0,-1.0,0.709652,-0.698678,0.25509,-0.255977,0.91531,-0.91531,0.102233,-0.113309
elo_prob2,0.000485,0.025794,-0.714969,0.704723,-1.0,1.0,-0.709652,0.698678,-0.25509,0.255977,-0.91531,0.91531,-0.102233,0.113309
rating1_pre,0.008379,-0.002621,0.987906,-0.014074,0.709652,-0.709652,1.0,-0.01792,0.370587,0.002018,0.664216,-0.664216,0.063352,-0.091042
rating2_pre,0.008931,0.016968,-0.012772,0.987883,-0.698678,0.698678,-0.01792,1.0,0.013583,0.368764,-0.652074,0.652074,-0.0845,0.072113
pitcher1_rgs,0.016534,-0.007272,0.373241,0.015175,0.25509,-0.25509,0.370587,0.013583,1.0,0.029458,0.417766,-0.417766,0.01285,-0.115695
pitcher2_rgs,0.021214,0.005295,0.003824,0.372343,-0.255977,0.255977,0.002018,0.368764,0.029458,1.0,-0.417911,0.417911,-0.118875,0.021924


In [545]:
len(new_bb)

47730

Should be enough samples


Let's one-hot encode the teams, as that _may_ be relevant. Same for playoff. I don't quite understand how these things work, but they are probably signifiers of something.

In [546]:
new_bb

Unnamed: 0,season,neutral,playoff,team1,team2,elo1_pre,elo2_pre,elo_prob1,elo_prob2,rating1_pre,rating2_pre,pitcher1,pitcher2,pitcher1_rgs,pitcher2_rgs,rating_prob1,rating_prob2,score1,score2
0,2020,0,d,TBD,NYY,1566.075394,1557.931446,0.561368,0.438632,1562.032965,1562.805994,,,,,0.519347,0.480653,,
1,2020,0,d,NYY,TBD,1557.931446,1566.075394,0.530387,0.469613,1562.805994,1562.032965,,,,,0.517478,0.482522,,
2,2020,0,d,NYY,TBD,1557.931446,1566.075394,0.530387,0.469613,1562.805994,1562.032965,,,,,0.517478,0.482522,,
3,2020,0,d,TBD,NYY,1566.075394,1557.931446,0.561368,0.438632,1562.032965,1562.805994,,,,,0.519347,0.480653,,
4,2020,0,d,TBD,NYY,1566.075394,1557.931446,0.561368,0.438632,1562.032965,1562.805994,,,,,0.525179,0.474821,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47725,2001,0,,CLE,CHW,1534.350000,1517.815000,0.558071,0.441929,1532.804000,1519.894000,Bartolo Colon,welld001,58.750,55.656,0.606725,0.393275,4.0,7.0
47726,2001,0,,CIN,ATL,1527.274000,1523.864000,0.539365,0.460635,1529.321000,1523.448000,harnp001,burkj001,52.378,50.404,0.546062,0.453938,4.0,10.0
47727,2001,0,,CHC,WSN,1462.510000,1461.765000,0.535551,0.464449,1460.738000,1462.150000,liebj001,vazqj001,53.174,52.762,0.536206,0.463794,4.0,5.0
47728,2001,0,,BAL,BOS,1488.060000,1515.815000,0.494596,0.505404,1484.973000,1517.331000,hentp001,martp001,47.558,77.188,0.491141,0.508859,2.0,1.0


In [547]:
new_bb = pd.get_dummies(new_bb, columns=['team1', 'team2', 'playoff'])

Now, let's deal with those pesky rgs values

In [548]:
p1_rgs = new_bb[['pitcher1', 'pitcher1_rgs']]
p2_rgs = new_bb[['pitcher2', 'pitcher2_rgs']]
p1_rgs = p1_rgs.rename(columns={'pitcher1': 'pitcher', 'pitcher1_rgs': 'rgs'})
p2_rgs = p2_rgs.rename(columns={'pitcher2': 'pitcher', 'pitcher2_rgs': 'rgs'})
pitchers = pd.concat([p1_rgs, p2_rgs], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
pitchers = pitchers.groupby("pitcher").mean()
pitchers

Unnamed: 0_level_0,rgs
pitcher,Unnamed: 1_level_1
A.J. Burnett,52.837856
A.J. Cole,47.080876
A.J. Griffin,50.924327
Aaron Blair,41.183305
Aaron Brooks,45.844149
...,...
younj002,46.769200
zambc001,54.084107
zambv001,49.699800
zerbc001,47.574000


That's a lot of pitchers, possibly too many to one-hot encode. We can maybe drop the pitcher column _after_ we get the rgs values we need, and taking the per-pitcher mean seems to yield a correct mapping. For pitchers with a NaN rgs, we could apply a function that fills in their average. My concern, of course, is that the NaN entries may have pulled down some averages.
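A quick way to check that concern on a toy example: pandas skips NaN values when it computes a group mean, so a missing rgs should only drag an average down if it has been replaced by zero first. The frame below is made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Pitcher "a" has one missing rgs value; pitcher "b" has none.
toy = pd.DataFrame({
    "pitcher": ["a", "a", "a", "b"],
    "rgs": [50.0, 60.0, np.nan, 40.0],
})

# Default behaviour: the NaN is ignored, so "a" averages its two real values.
mean_default = toy.groupby("pitcher")["rgs"].mean()

# For contrast: filling the NaN with zero is what would pull the mean down.
mean_zero_filled = toy.fillna({"rgs": 0.0}).groupby("pitcher")["rgs"].mean()

print(mean_default)
print(mean_zero_filled)
```

If the two results differ for a pitcher, the difference comes entirely from the zero fill, not from the NaN itself.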

In [549]:
pitchers_ = pd.concat([p1_rgs, p2_rgs], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
pitchers_ = pitchers_.dropna()
pitchers_ = pitchers_.groupby('pitcher').mean()
len(pitchers)

1665

Much better: this doesn't drop anyone and doesn't average in zeros, which would drive the scores down. Let's use that, then create a function to map each pitcher's average onto their NaN values. Though it now strikes me, as I look into why these fields are missing, that the games in question are in the future, so filling them in isn't very helpful anyway. We obviously cannot evaluate a regression on games with no outcome yet, so let's drop the games in October that haven't happened yet. These are the ones with no pitcher selected, so we can drop on that criterion.
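Dropping on the pitcher criterion specifically can be sketched with `dropna(subset=...)`, which only looks at the named columns. The `attendance` column here is hypothetical and exists only to show the difference from a blanket `dropna()`, which removes a row if any column is missing.

```python
import numpy as np
import pandas as pd

# Toy frame: the middle row is an unplayed game with no pitchers selected.
toy = pd.DataFrame({
    "pitcher1": ["Bartolo Colon", None, "harnp001"],
    "pitcher2": ["welld001", None, "burkj001"],
    "score1": [4.0, np.nan, 4.0],
    "attendance": [np.nan, np.nan, 30000.0],  # hypothetical, unrelated column
})

# Only rows missing a pitcher are dropped.
played = toy.dropna(subset=["pitcher1", "pitcher2"])

# A blanket dropna also removes the first game because attendance is missing.
fully_complete = toy.dropna()

print(len(played), len(fully_complete))
```

In the notebook the two approaches may well give the same rows; the subset form just makes the intended criterion explicit.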

In [550]:
new_bb = new_bb.dropna()
new_bb = new_bb.drop(['pitcher1', 'pitcher2'], axis=1)
new_bb

Unnamed: 0,season,neutral,elo1_pre,elo2_pre,elo_prob1,elo_prob2,rating1_pre,rating2_pre,pitcher1_rgs,pitcher2_rgs,...,team2_SFG,team2_STL,team2_TBD,team2_TEX,team2_TOR,team2_WSN,playoff_c,playoff_d,playoff_l,playoff_w
14,2020,0,1595.308574,1504.062746,0.707763,0.292237,1603.668369,1509.101046,56.107674,51.086009,...,0,0,0,0,0,0,1,0,0,0
15,2020,0,1519.167846,1556.011017,0.475376,0.524624,1529.732319,1561.379468,56.485235,53.414925,...,0,0,0,0,0,0,1,0,0,0
16,2020,0,1523.681570,1518.731209,0.555323,0.444677,1531.647539,1512.135815,49.911607,51.893876,...,0,1,0,0,0,0,1,0,0,0
17,2020,0,1563.142855,1499.523060,0.662064,0.337936,1559.189080,1489.605354,54.763433,57.611584,...,0,0,0,0,1,0,1,0,0,0
18,2020,0,1543.679344,1509.780682,0.609304,0.390696,1529.402486,1517.203515,54.223156,54.018904,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47725,2001,0,1534.350000,1517.815000,0.558071,0.441929,1532.804000,1519.894000,58.750000,55.656000,...,0,0,0,0,0,0,0,0,0,0
47726,2001,0,1527.274000,1523.864000,0.539365,0.460635,1529.321000,1523.448000,52.378000,50.404000,...,0,0,0,0,0,0,0,0,0,0
47727,2001,0,1462.510000,1461.765000,0.535551,0.464449,1460.738000,1462.150000,53.174000,52.762000,...,0,0,0,0,0,1,0,0,0,0
47728,2001,0,1488.060000,1515.815000,0.494596,0.505404,1484.973000,1517.331000,47.558000,77.188000,...,0,0,0,0,0,0,0,0,0,0


In [551]:
new_bb.corr()

Unnamed: 0,season,neutral,elo1_pre,elo2_pre,elo_prob1,elo_prob2,rating1_pre,rating2_pre,pitcher1_rgs,pitcher2_rgs,...,team2_SFG,team2_STL,team2_TBD,team2_TEX,team2_TOR,team2_WSN,playoff_c,playoff_d,playoff_l,playoff_w
season,1.000000,0.028719,0.007900,0.008560,-0.000961,0.000961,0.007691,0.008640,0.016435,0.020976,...,-0.000203,-0.001158,0.000721,0.000654,0.001643,0.002208,0.031333,-0.002135,-0.003838,0.000480
neutral,0.028719,1.000000,-0.002589,0.015905,-0.025793,0.025793,-0.002612,0.016975,-0.007271,0.005299,...,-0.004680,0.004476,-0.004654,0.009324,-0.004639,-0.004652,-0.000608,-0.002022,-0.001687,-0.001195
elo1_pre,0.007900,-0.002589,1.000000,-0.009633,0.714938,-0.714938,0.987903,-0.013099,0.373242,0.003650,...,-0.016020,-0.037037,0.038682,0.027412,0.034614,-0.015468,0.027098,0.107766,0.104198,0.083738
elo2_pre,0.008560,0.015905,-0.009633,1.000000,-0.704984,0.704984,-0.014352,0.987889,0.015195,0.372385,...,0.027399,0.141689,-0.027627,0.021162,0.028146,-0.039376,0.014251,0.102601,0.105055,0.087277
elo_prob1,-0.000961,-0.025793,0.714938,-0.704984,1.000000,-1.000000,0.709616,-0.698987,0.255056,-0.256224,...,-0.029734,-0.125025,0.045812,0.004525,0.004571,0.016596,0.016110,0.018715,0.011062,0.005153
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
team2_WSN,0.002208,-0.004652,-0.015468,-0.039376,0.016596,-0.016596,-0.014881,-0.027618,-0.003185,0.011466,...,-0.034605,-0.034916,-0.034415,-0.034426,-0.034302,1.000000,-0.004494,0.001104,-0.008989,0.001003
playoff_c,0.031333,-0.000608,0.027098,0.014251,0.016110,-0.016110,0.027437,0.014784,0.027068,0.028702,...,0.005079,0.004959,0.005154,-0.004497,0.005198,-0.004494,1.000000,-0.001953,-0.001630,-0.001154
playoff_d,-0.002135,-0.002022,0.107766,0.102601,0.018715,-0.018715,0.104646,0.099057,0.083532,0.071579,...,0.005283,0.023692,0.001094,-0.001833,-0.009053,0.001104,-0.001953,1.000000,-0.005423,-0.003839
playoff_l,-0.003838,-0.001687,0.104198,0.105055,0.011062,-0.011062,0.101058,0.101851,0.061525,0.064940,...,0.004804,0.035529,-0.007249,-0.002023,-0.003693,-0.008989,-0.001630,-0.005423,1.000000,-0.003204


In [552]:
scores = new_bb[['score1', "score2"]]
features = new_bb.drop(['score1', 'score2'], axis=1)

train_x, test_x, train_y, test_y = train_test_split(features, scores, test_size=0.15, random_state=1)

In [553]:
features

Unnamed: 0,season,neutral,elo1_pre,elo2_pre,elo_prob1,elo_prob2,rating1_pre,rating2_pre,pitcher1_rgs,pitcher2_rgs,...,team2_SFG,team2_STL,team2_TBD,team2_TEX,team2_TOR,team2_WSN,playoff_c,playoff_d,playoff_l,playoff_w
14,2020,0,1595.308574,1504.062746,0.707763,0.292237,1603.668369,1509.101046,56.107674,51.086009,...,0,0,0,0,0,0,1,0,0,0
15,2020,0,1519.167846,1556.011017,0.475376,0.524624,1529.732319,1561.379468,56.485235,53.414925,...,0,0,0,0,0,0,1,0,0,0
16,2020,0,1523.681570,1518.731209,0.555323,0.444677,1531.647539,1512.135815,49.911607,51.893876,...,0,1,0,0,0,0,1,0,0,0
17,2020,0,1563.142855,1499.523060,0.662064,0.337936,1559.189080,1489.605354,54.763433,57.611584,...,0,0,0,0,1,0,1,0,0,0
18,2020,0,1543.679344,1509.780682,0.609304,0.390696,1529.402486,1517.203515,54.223156,54.018904,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47725,2001,0,1534.350000,1517.815000,0.558071,0.441929,1532.804000,1519.894000,58.750000,55.656000,...,0,0,0,0,0,0,0,0,0,0
47726,2001,0,1527.274000,1523.864000,0.539365,0.460635,1529.321000,1523.448000,52.378000,50.404000,...,0,0,0,0,0,0,0,0,0,0
47727,2001,0,1462.510000,1461.765000,0.535551,0.464449,1460.738000,1462.150000,53.174000,52.762000,...,0,0,0,0,0,1,0,0,0,0
47728,2001,0,1488.060000,1515.815000,0.494596,0.505404,1484.973000,1517.331000,47.558000,77.188000,...,0,0,0,0,0,0,0,0,0,0


In [554]:
lr = LinearRegression().fit(train_x, train_y)

In [555]:
lr.score(test_x, test_y) ##Well that is awful



0.0346659701120243

In [556]:
lr_test = lr.predict(test_x)
mean_squared_error(test_y, lr_test) ##I like this better than the R2

9.756940550308968

In [557]:
#Let's try a tree
dt = DecisionTreeRegressor(random_state=0)
dt.fit(train_x, train_y)
dt.score(test_x, test_y)



-0.9109666588032151

In [558]:
dt_test = dt.predict(test_x)
mean_squared_error(test_y, dt_test) 

19.314752724224643

In [559]:
new_bb2020 = new_bb[new_bb['season']==2020]

Let's contextualize this to just one year to account for things like hirings and firings that will absolutely affect things.

In [560]:
scores = new_bb2020[['score1', "score2"]]
features = new_bb2020.drop(['score1', 'score2'], axis=1)

In [561]:
train_x, test_x, train_y, test_y = train_test_split(features, scores, test_size=0.15, random_state=1)
lr = LinearRegression().fit(train_x, train_y)
lr.score(train_x, train_y) ##Well that is awful too (note this one is the training-set score). Too many features, I think. Let's try this with a tree



0.13550195713770002

In [562]:
lr_test = lr.predict(test_x)
mean_squared_error(test_y, lr_test) ##MARGINALLY BETTER! But...only having the year didn't mean much, interesting

9.42574906769783

Let's try a decision tree regressor again

In [563]:
dt = DecisionTreeRegressor(random_state=0)
dt.fit(train_x, train_y)
dt.score(test_x, test_y)



-0.8459461151455824

In [564]:
dt_test = dt.predict(test_x)
mean_squared_error(test_y, dt_test) 

17.28102189781022

Okay, clearly, something isn't working here. I think the best thing to do would be to pare down the low-correlation features for the trees. Let's recheck that corr matrix just for 2020.

In [565]:
new_bb2020.corr()

Unnamed: 0,season,neutral,elo1_pre,elo2_pre,elo_prob1,elo_prob2,rating1_pre,rating2_pre,pitcher1_rgs,pitcher2_rgs,...,team2_SFG,team2_STL,team2_TBD,team2_TEX,team2_TOR,team2_WSN,playoff_c,playoff_d,playoff_l,playoff_w
season,,,,,,,,,,,...,,,,,,,,,,
neutral,,,,,,,,,,,...,,,,,,,,,,
elo1_pre,,,1.000000,-0.133949,0.755515,-0.755515,0.936667,-0.159700,0.296117,-0.048795,...,0.093816,-0.067186,0.000340,-0.018751,0.077428,-0.009905,0.115075,,,
elo2_pre,,,-0.133949,1.000000,-0.747795,0.747795,-0.147034,0.935846,-0.051594,0.295095,...,-0.075136,0.100127,0.221291,-0.169786,-0.053032,0.155734,0.036944,,,
elo_prob1,,,0.755515,-0.747795,1.000000,-1.000000,0.721694,-0.721823,0.233559,-0.222364,...,0.111486,-0.109591,-0.145743,0.099424,0.092909,-0.119769,0.085715,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
team2_WSN,,,-0.009905,0.155734,-0.119769,0.119769,-0.030757,0.113654,-0.085580,0.042589,...,-0.034091,-0.033499,-0.034091,-0.034091,-0.035249,1.000000,-0.021344,,,
playoff_c,,,0.115075,0.036944,0.085715,-0.085715,0.109786,0.043106,0.113204,0.131300,...,-0.021344,0.033870,-0.021344,-0.021344,0.082528,-0.021344,1.000000,,,
playoff_d,,,,,,,,,,,...,,,,,,,,,,
playoff_l,,,,,,,,,,,...,,,,,,,,,,


Okay, let's try again

In [566]:
new_bb = bb[bb['season'] == 2020]
new_bb = new_bb.drop(['playoff', 'team1', 'team2', 'season'], axis=1)
new_bb = new_bb.dropna()

In [567]:
new_bb.corr()

Unnamed: 0,neutral,elo1_pre,elo2_pre,elo_prob1,elo_prob2,rating1_pre,rating2_pre,pitcher1_rgs,pitcher2_rgs,rating_prob1,rating_prob2,score1,score2
neutral,,,,,,,,,,,,,
elo1_pre,,1.0,-0.133949,0.755515,-0.755515,0.936667,-0.1597,0.296117,-0.048795,0.66164,-0.66164,0.042088,-0.122971
elo2_pre,,-0.133949,1.0,-0.747795,0.747795,-0.147034,0.935846,-0.051594,0.295095,-0.65328,0.65328,-0.067058,0.066604
elo_prob1,,0.755515,-0.747795,1.0,-1.0,0.721694,-0.721823,0.233559,-0.222364,0.872321,-0.872321,0.071452,-0.128191
elo_prob2,,-0.755515,0.747795,-1.0,1.0,-0.721694,0.721823,-0.233559,0.222364,-0.872321,0.872321,-0.071452,0.128191
rating1_pre,,0.936667,-0.147034,0.721694,-0.721694,1.0,-0.171965,0.348018,-0.045383,0.710042,-0.710042,0.060292,-0.120573
rating2_pre,,-0.1597,0.935846,-0.721823,0.721823,-0.171965,1.0,-0.053327,0.337788,-0.708201,0.708201,-0.091325,0.062982
pitcher1_rgs,,0.296117,-0.051594,0.233559,-0.233559,0.348018,-0.053327,1.0,0.083349,0.458152,-0.458152,-0.015409,-0.145689
pitcher2_rgs,,-0.048795,0.295095,-0.222364,0.222364,-0.045383,0.337788,0.083349,1.0,-0.435673,0.435673,-0.139865,-0.030536
rating_prob1,,0.66164,-0.65328,0.872321,-0.872321,0.710042,-0.708201,0.458152,-0.435673,1.0,-1.0,0.112736,-0.137639


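I wonder if I screwed something up here, as neutral isn't correlated with anything. Its correlations are all NaN rather than merely small, which most likely means the column is constant in the 2020 games (a zero-variance column has no defined correlation), so it carries no information here.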
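One way to tell a broken column from a constant one is to count its distinct values. The toy frame below (made-up numbers) reproduces the symptom: a column with a single value gets NaN for every correlation, including with itself.

```python
import pandas as pd

# Toy frame: "neutral" never varies, the other columns do.
toy = pd.DataFrame({
    "neutral": [0, 0, 0, 0],
    "elo1_pre": [1500.0, 1520.0, 1480.0, 1510.0],
    "score1": [3.0, 5.0, 2.0, 4.0],
})

n_distinct = toy["neutral"].nunique()
corr = toy.corr()

print(n_distinct)                    # number of distinct values in the column
print(corr["neutral"].isna().all())  # True when every correlation is NaN
```

Running `nunique()` on the real `neutral` column for 2020 would confirm or rule out this explanation.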

In [568]:
new_bb = new_bb.drop(['neutral', 'pitcher1', 'pitcher2'], axis=1)

In [569]:
new_bb.corr()

Unnamed: 0,elo1_pre,elo2_pre,elo_prob1,elo_prob2,rating1_pre,rating2_pre,pitcher1_rgs,pitcher2_rgs,rating_prob1,rating_prob2,score1,score2
elo1_pre,1.0,-0.133949,0.755515,-0.755515,0.936667,-0.1597,0.296117,-0.048795,0.66164,-0.66164,0.042088,-0.122971
elo2_pre,-0.133949,1.0,-0.747795,0.747795,-0.147034,0.935846,-0.051594,0.295095,-0.65328,0.65328,-0.067058,0.066604
elo_prob1,0.755515,-0.747795,1.0,-1.0,0.721694,-0.721823,0.233559,-0.222364,0.872321,-0.872321,0.071452,-0.128191
elo_prob2,-0.755515,0.747795,-1.0,1.0,-0.721694,0.721823,-0.233559,0.222364,-0.872321,0.872321,-0.071452,0.128191
rating1_pre,0.936667,-0.147034,0.721694,-0.721694,1.0,-0.171965,0.348018,-0.045383,0.710042,-0.710042,0.060292,-0.120573
rating2_pre,-0.1597,0.935846,-0.721823,0.721823,-0.171965,1.0,-0.053327,0.337788,-0.708201,0.708201,-0.091325,0.062982
pitcher1_rgs,0.296117,-0.051594,0.233559,-0.233559,0.348018,-0.053327,1.0,0.083349,0.458152,-0.458152,-0.015409,-0.145689
pitcher2_rgs,-0.048795,0.295095,-0.222364,0.222364,-0.045383,0.337788,0.083349,1.0,-0.435673,0.435673,-0.139865,-0.030536
rating_prob1,0.66164,-0.65328,0.872321,-0.872321,0.710042,-0.708201,0.458152,-0.435673,1.0,-1.0,0.112736,-0.137639
rating_prob2,-0.66164,0.65328,-0.872321,0.872321,-0.710042,0.708201,-0.458152,0.435673,-1.0,1.0,-0.112736,0.137639


Let's look at these correlations more closely. The elo_probs are perfectly anticorrelated (-1.0), so one column carries all the information and we can nix the other. Ditto the rating_probs, so let's exclude the secondary (team 2) column of each.
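Spotting such pairs by eye works for a dozen columns, but it can also be sketched in code. The helper `perfectly_correlated_pairs` and the toy values are hypothetical; the only thing taken from the data is that the two probabilities sum to one.

```python
import pandas as pd

def perfectly_correlated_pairs(df: pd.DataFrame, tol: float = 1e-9):
    """List column pairs whose absolute Pearson correlation is 1 within tol."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            if corr.loc[first, second] >= 1.0 - tol:
                pairs.append((first, second))
    return pairs

# Toy frame with made-up values; elo_prob2 is the complement of elo_prob1.
toy = pd.DataFrame({
    "elo_prob1": [0.56, 0.53, 0.61, 0.47],
    "elo1_pre": [1566.0, 1558.0, 1544.0, 1519.0],
})
toy["elo_prob2"] = 1.0 - toy["elo_prob1"]

pairs = perfectly_correlated_pairs(toy)
# Keep the first column of each pair and drop the second.
reduced = toy.drop(columns=[second for _, second in pairs])
print(pairs, list(reduced.columns))
```

The tolerance is there because floating-point correlations of exact complements can land a hair away from 1.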

Okay, let's facet this

In [570]:
new_bb = new_bb.drop(['elo_prob2', 'rating_prob2'], axis=1)

In [571]:
new_bb

Unnamed: 0,elo1_pre,elo2_pre,elo_prob1,rating1_pre,rating2_pre,pitcher1_rgs,pitcher2_rgs,rating_prob1,score1,score2
14,1595.308574,1504.062746,0.707763,1603.668369,1509.101046,56.107674,51.086009,0.724542,4.0,2.0
15,1519.167846,1556.011017,0.475376,1529.732319,1561.379468,56.485235,53.414925,0.471800,9.0,10.0
16,1523.681570,1518.731209,0.555323,1531.647539,1512.135815,49.911607,51.893876,0.535651,4.0,7.0
17,1563.142855,1499.523060,0.662064,1559.189080,1489.605354,54.763433,57.611584,0.594481,8.0,2.0
18,1543.679344,1509.780682,0.609304,1529.402486,1517.203515,54.223156,54.018904,0.548883,5.0,3.0
...,...,...,...,...,...,...,...,...,...,...
919,1537.750707,1487.408999,0.605383,1539.417329,1478.540428,59.143800,56.199388,0.602440,4.0,6.0
920,1490.427814,1420.458239,0.632029,1521.707154,1436.198577,58.879686,49.587500,0.665662,7.0,1.0
921,1521.484830,1528.909097,0.523836,1529.122226,1528.008772,64.418447,55.793568,0.563735,1.0,0.0
922,1561.949414,1487.819917,0.637581,1584.386566,1453.743200,48.700313,49.333010,0.670766,8.0,1.0


In [572]:
new_bb.to_csv("baseball.csv")

So there is some bimodality in the ratings/Elos. That is good to know. The Elo also seems to skew things, but it's difficult to see just by looking. I wonder if writing in a feature column for 'win' would be useful, set to zero when team1 wins and 1 when team2 wins. But that is a proxy for the outcome, and not something we'd have at the time of regression for future data, so I may exclude it. This produces a bit of a problem for feature engineering. The other issue is the probability. Having spent a lot of time on chess websites, I know the win probability is computed from the Elo ratings, so there would be a ton of information bleedthrough between those features.
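For reference, the standard chess-style Elo expected score can be sketched as below. This is the textbook formula, not necessarily what produced `elo_prob1`: the first 2020 row above has an Elo edge of about 8 points, which this formula maps to roughly 0.51, while the dataset lists 0.56, so the dataset presumably folds in further adjustments (home field would be one guess).

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score for A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Equal ratings give an even chance; the two sides always sum to one.
even = elo_expected_score(1500.0, 1500.0)
favourite = elo_expected_score(1900.0, 1500.0)
underdog = elo_expected_score(1500.0, 1900.0)
print(even, favourite, underdog)
```

Either way, the probability is a deterministic function of the rating difference, which is why keeping both the ratings and the probability gives the regression largely redundant inputs.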

In [573]:
scores = new_bb[['score1', "score2"]]
features = new_bb.drop(['score1', 'score2'], axis=1)

train_x, test_x, train_y, test_y = train_test_split(features, scores, test_size=0.15, random_state=1)

In [574]:
lr = LinearRegression().fit(train_x, train_y)
lr.score(test_x, test_y)



0.01111096031820791

In [575]:
lr_test = lr.predict(test_x)
mean_squared_error(test_y, lr_test) ##EVEN BETTER

9.257590462166782

In [576]:
dt = DecisionTreeRegressor(random_state=0)
dt.fit(train_x, train_y)
dt.score(test_x, test_y)



-0.9894114098390511

In [577]:
dt_test = dt.predict(test_x)
mean_squared_error(test_y, dt_test) 

18.624087591240876

I must confess my surprise that the decision tree is performing _worse_ than the simple linear regression. My guess is that the tree is having a hard time with how much mutual information there is between the features, and not only that, the tree has to try to do two regressions at once, and they may not play nicely together.
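Another plausible contributor, offered here as an assumption rather than a diagnosis, is overfitting: a `DecisionTreeRegressor` with default settings grows until it memorizes the training targets, noise included. The sketch below uses synthetic data (weak signal, heavy noise, all values made up) to compare a fully grown tree with a depth-limited one.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a weak linear signal buried in noise, loosely score-like.
rng = np.random.default_rng(0)
n_samples = 3000
x = rng.normal(size=(n_samples, 3))
y = 4.5 + 0.3 * x[:, 0] + rng.normal(scale=3.0, size=n_samples)

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.15, random_state=1
)

# Default settings: the tree keeps splitting until leaves are pure.
deep = DecisionTreeRegressor(random_state=0).fit(train_x, train_y)
# Depth-limited tree: each leaf averages over many noisy samples.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(train_x, train_y)

deep_mse = mean_squared_error(test_y, deep.predict(test_x))
shallow_mse = mean_squared_error(test_y, shallow.predict(test_x))
print(deep_mse, shallow_mse)
```

In a setup like this the fully grown tree tends to come out noticeably worse on the test split, because each prediction inherits the noise of a single training game. Trying `max_depth` or `min_samples_leaf` on the real data would show whether the same thing is happening here.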

So from here on out, I will spend my efforts trying to use simple linear regressions.


Let's see if this works over the last 20 years, now that we know dropping certain columns really, really helps. I should explain my reticence to use rederived features. I haven't seen a good mutual information algorithm for continuous features in scikit-learn, and correlation is not always the best measure, as we see here. To keep the scaling intact, I didn't want to flatten or normalize too much. I thought about deriving an Elo score by subtracting 1500, as that is what Elo scores are generally standardized to. But a regression line should be able to account for that through its intercept and weight accordingly. As such, it is already normalized in some sense.
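The point about subtracting 1500 can be checked on synthetic data: shifting a feature by a constant changes only the intercept of an ordinary least-squares fit, not the slope or the predictions. The numbers below are made up, and plain NumPy least squares stands in for `LinearRegression`.

```python
import numpy as np

# Synthetic Elo-like feature and a noisy score that depends on it weakly.
rng = np.random.default_rng(0)
elo = rng.normal(loc=1500.0, scale=40.0, size=200)
score = 4.5 + 0.01 * (elo - 1500.0) + rng.normal(scale=3.0, size=200)

def fit_predict(feature, target):
    """Ordinary least squares with an intercept; returns (coefficients, fitted values)."""
    design = np.column_stack([np.ones_like(feature), feature])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design @ coef

coef_raw, pred_raw = fit_predict(elo, score)
coef_centered, pred_centered = fit_predict(elo - 1500.0, score)

# Same slope and same predictions; only the intercept moves.
print(coef_raw, coef_centered)
```

Centering would still matter for regularized models or for reading the intercept as "score at an average rating", but not for the plain fit used here.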

In [578]:
final_bb = bb[bb['season'] > 2000]
final_bb = final_bb.dropna()
final_bb = final_bb.drop(['pitcher1', 'pitcher2'], axis=1)
final_bb = final_bb.drop(['playoff', 'team1', 'team2', 'season'], axis=1)


In [579]:
scores = final_bb[['score1', "score2"]]
features = final_bb.drop(['score1', 'score2'], axis=1)

train_x, test_x, train_y, test_y = train_test_split(features, scores, test_size=0.15, random_state=1)

In [580]:
lr = LinearRegression().fit(train_x, train_y)
lr.score(test_x, test_y)



-0.01485936075841005

In [581]:
lr_test = lr.predict(test_x)
mean_squared_error(test_y, lr_test)

8.945383142120093

Interestingly, using all 20 years still gives a mean squared error of only ~8.9. Now, this may be because many games have zero scores, and the metric reflects that, but the faceting suggests that is a small fraction of the overall games, about 5.5 percent for the score columns.
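An MSE of ~8.9 is hard to judge without a reference point. A minimal sketch of one such reference is the error from always predicting the training-set mean score; an R² near zero, as seen above, suggests the regression is roughly on par with that baseline. The helper name `mean_baseline_mse` and the toy scores are made up for illustration.

```python
import numpy as np

def mean_baseline_mse(train_y, test_y) -> float:
    """MSE on test_y when every prediction is the per-column mean of train_y."""
    train_y = np.asarray(train_y, dtype=float)
    test_y = np.asarray(test_y, dtype=float)
    prediction = train_y.mean(axis=0)  # one constant per target column
    return float(np.mean((test_y - prediction) ** 2))

# Toy (score1, score2) pairs, not taken from the dataset.
train_scores = np.array([[4.0, 2.0], [9.0, 10.0], [4.0, 7.0], [8.0, 2.0]])
test_scores = np.array([[5.0, 3.0], [1.0, 0.0]])

baseline = mean_baseline_mse(train_scores, test_scores)
print(baseline)
```

Calling this with the notebook's `train_y` and `test_y` would give a number to set next to the regression's MSE.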

I know I said this was final, but I am really curious now.

In [582]:
final_bb = bb[bb['season'] > 1915] ##Fields look more standardized then
final_bb = final_bb.dropna()
final_bb = final_bb.drop(['pitcher1', 'pitcher2'], axis=1)
final_bb = final_bb.drop(['playoff', 'team1', 'team2', 'season'], axis=1)
scores = final_bb[['score1', "score2"]]
features = final_bb.drop(['score1', 'score2'], axis=1)

train_x, test_x, train_y, test_y = train_test_split(features, scores, test_size=0.15, random_state=1)

In [583]:
lr = LinearRegression().fit(train_x, train_y)
lr.score(test_x, test_y)



-0.013709814115447957

In [584]:
lr_test = lr.predict(test_x)
mean_squared_error(test_y, lr_test)

9.409769310210773

So including all this extra data doesn't throw off the regression very much. Interesting! This suggests the relationship is fairly stable over time.