# MLB Hall of Fame

Using Lahman's Baseball Database to predict which players will be inducted into the MLB Hall of Fame, with data visualization and analysis of current Hall of Famers and future prospects.

#### Dataset sources:   
Sean Lahman, Lahman's Baseball Database: [MLB_Kaggle](https://www.kaggle.com/seanlahman/the-history-of-baseball/data)

We limit ourselves to the following .csv files in this analysis:
- `all_star.csv`: All-Star appearances
- `batting.csv` and `pitching.csv`: Batting data and pitching data, used to evaluate player value for HOF candidacy
- `player_award_vote.csv`: MVP, Cy Young, and other votes
- `hall_of_fame.csv`: Response variable, induction into HOF
- `player.csv`: Dataset of player information to match up names, etc.

In [1]:
# import packages and data
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline

# block warnings
import warnings
warnings.filterwarnings("ignore")

# players
players = pd.read_csv('model_data/players_df.csv')
pitchers = pd.read_csv('model_data/pitchers.csv')
pitching_metrics = pd.read_csv('model_data/pitching_hof.csv')
hof_vote = pd.read_csv('data/hall_of_fame.csv')

# pitching predictions
pitch_active = pd.read_csv('predict/pitchers-active.csv')
pitch_historical = pd.read_csv('predict/pitchers-historical.csv')

for i in [players, pitch_historical, pitch_active, pitchers, pitching_metrics, hof_vote]:
    i.set_index('player_id', inplace = True)

## Summary of methodology
We looked at batting and pitching data separately: pitching data only for pitchers, and batting data only for hitters. Fielding data was excluded on the heuristic that a player is rarely selected for the Hall of Fame purely on fielding play. Postseason data was also excluded, although it would have been useful to look at. Only BBWAA voting for the Hall of Fame was considered, meaning players elected through special elections or veterans elections are marked as "0" in the `inducted` column.

We added aggregate metrics to each of these datasets: percentage of All-Star seasons, MVP awards, Cy Young awards, and certain milestones such as 3000 strikeouts/hits, 500 HR, 300 W/SV, respectively - this is in addition to batting average, ERA, etc. that are the more traditional season-to-season data points.
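As a rough illustration of how the rate statistics and milestone flags could be derived from career counting totals, here is a minimal sketch. The function name `career_rates` and the input dictionary are hypothetical; the keys mirror columns used later in this notebook (`ipouts`, `h`, `er`, `so`, `BB`, `w`), and `sv` for saves is an assumed name. The formulas are the standard definitions (ERA = 9 × ER / IP, WHIP = (BB + H) / IP), not necessarily the exact feature code behind `model_data/pitchers.csv`.

```python
def career_rates(totals):
    """Derive rate stats and milestone flags from career counting totals.

    `totals` is a hypothetical dict keyed like the columns in this notebook;
    it assumes at least one out recorded and at least one walk issued.
    """
    innings = totals["ipouts"] / 3.0  # three outs per inning pitched
    return {
        "ERA": 9.0 * totals["er"] / innings,
        "WHIP": (totals["BB"] + totals["h"]) / innings,
        "K/9": 9.0 * totals["so"] / innings,
        "BB/9": 9.0 * totals["BB"] / innings,
        "K/BB": totals["so"] / totals["BB"],
        # milestone flags as 0/1 indicators
        "3000K": int(totals["so"] >= 3000),
        "300W": int(totals["w"] >= 300),
        "300SV": int(totals["sv"] >= 300),
    }


# Made-up career line, chosen so the arithmetic is easy to check by hand
example = {"ipouts": 2700, "h": 800, "er": 300, "so": 900, "BB": 200, "w": 70, "sv": 0}
rates = career_rates(example)
```

Working from `ipouts` rather than a decimal innings figure avoids the ambiguity of baseball's ".1" and ".2" innings notation.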

We reduced each dataset to the statistics most relevant to that type of player (ex: using ERA rather than raw earned runs allowed for a pitcher), then standard-scaled the features and ran PCA on each dataset, keeping enough components to explain 95% of the variance. Afterwards, we ran a series of relatively simple regression and decision models to find the best-performing one. This was done with `random_state = 19` for consistency.

For pitchers, we settled on a single-bagged depth-4 Decision Tree; for hitters, the model is currently incomplete.
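The modelling steps described above (standard scaling, PCA at 95% explained variance, a bagged depth-4 decision tree, `random_state = 19`) could be wired together roughly as follows with scikit-learn. This is a sketch under assumptions, not the notebook's actual training code: the data is synthetic, and because the text does not say exactly what "single-bagged" means, the number of bagged trees is left at the library default.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

SEED = 19  # the random_state quoted in the methodology

# Synthetic stand-in for the pitcher feature matrix (assumption: 8 features,
# the last 4 nearly duplicating the first 4 so PCA has something to compress)
rng = np.random.default_rng(SEED)
X = rng.normal(size=(200, 8))
X[:, 4:] = X[:, :4] + 0.1 * rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # imbalanced toy label

model = make_pipeline(
    StandardScaler(),
    # a float n_components keeps the fewest components reaching that variance share
    PCA(n_components=0.95, svd_solver="full"),
    BaggingClassifier(DecisionTreeClassifier(max_depth=4, random_state=SEED),
                      random_state=SEED),
)
model.fit(X, y)

kept_variance = model.named_steps["pca"].explained_variance_ratio_.sum()
predictions = model.predict(X)
```

Putting the scaler and PCA inside the pipeline means they are fit only on whatever data is passed to `fit`, which keeps test-set information out of the preprocessing when the pipeline is cross-validated.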

Overall, we obtained strong accuracy on the pitching dataset, at some cost to predictive ability. That predictive ability is also hard to assess, given that Trevor Hoffman is the only pitcher in our dataset who has been inducted since the end of the dataset. Others are likely to follow in future seasons, however.

Here, we analyze the errors and visualize some of the differences between players in our datasets.

### Pitchers: preparation
Historical data mispredictions in both training and test sets.

In [2]:
wrong_pitchers = pitch_historical[pitch_historical.inducted != pitch_historical.predicted]
wrong_pitchers = wrong_pitchers.merge(players[['name']], left_index = True, right_index = True)
wrong_pitchers

player_id,inducted,predicted,name
pennohe01,1.0,0.0,Herb Pennock
bunniji01,0.0,1.0,Jim Bunning
passecl01,0.0,1.0,Claude Passeau
stottme01,0.0,1.0,Mel Stottlemyre
lyonste01,1.0,0.0,Ted Lyons
stiebda01,0.0,1.0,Dave Stieb
wadderu01,0.0,1.0,Rube Waddell
tiantlu01,0.0,1.0,Luis Tiant
vanceda01,1.0,0.0,Dazzy Vance
lolicmi01,0.0,1.0,Mickey Lolich
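To see how these ten errors split into false negatives and false positives, the rows of the table above can be tallied directly. A minimal sketch: the `inducted` and `predicted` values are copied from the table, while the `error_type` column is a name introduced here for illustration.

```python
import pandas as pd

# (inducted, predicted) pairs copied from the misprediction table above
rows = {
    "pennohe01": (1, 0), "bunniji01": (0, 1), "passecl01": (0, 1),
    "stottme01": (0, 1), "lyonste01": (1, 0), "stiebda01": (0, 1),
    "wadderu01": (0, 1), "tiantlu01": (0, 1), "vanceda01": (1, 0),
    "lolicmi01": (0, 1),
}
errors = pd.DataFrame.from_dict(rows, orient="index", columns=["inducted", "predicted"])
errors.index.name = "player_id"

# Every row here is already a misprediction, so the true label alone
# determines the error type: inducted but missed -> false negative
errors["error_type"] = errors["inducted"].map({1: "false negative", 0: "false positive"})
counts = errors["error_type"].value_counts()
```

On these rows the tally is three false negatives against seven false positives.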


In [3]:
# attach names and career statistics to the mispredicted pitchers
wrong_pitchers = players[['name']].merge(pitchers.merge(wrong_pitchers[['predicted']],
                        left_index = True, right_index = True),
                        left_index = True, right_index = True)

In [None]:
pitcher_hof = pitch_historical[pitch_historical.inducted == 1]
pitcher_hof = pitchers.merge(pitcher_hof[['predicted']], left_index = True, right_index = True)
hof_pitchers = players[['name']].merge(pitcher_hof, left_index = True, right_index = True)
hof_pitchers

### Pitchers: false negatives

Among the Hall of Fame pitchers, the mispredicted ones typically had higher ERAs and WHIPs; perhaps the one questionable exception is Hoyt Wilhelm.

In [None]:
sns.set()
sns.lmplot(data = hof_pitchers, x = 'ERA', y = 'WHIP', hue = 'predicted', truncate = True)

In [None]:
fn_pitchers = pd.DataFrame(hof_pitchers[hof_pitchers.inducted != hof_pitchers.predicted])
fn_pitchers.drop(labels = ['ipouts', 'h', 'er', 'hr', 'inducted'], axis = 1, inplace = True)
fn_pitchers

In [None]:
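# select the numeric columns before averaging so the string 'name' column is not aggregated
stat_cols = ['w', 'l', 'so', 'BB', 'ERA', 'WHIP', 'BAA', 'K/9', 'BB/9', 'K/BB', 'VOTE%']
hof_pitchers.groupby('predicted')[stat_cols].mean()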

If we look at these players, we can see that the majority of them had higher ERAs and BAAs than the average inductee, stats that may suggest less dominance. Typically, they also had fewer wins, even though their number of losses was roughly similar. Moreover, they tended to be less sure-fire selections in the voting, apart from Ted Lyons, who finished his career back in 1945.

Poor Hoyt Wilhelm had a powerful pedigree as a pitcher but, interestingly, performs only decently in each of the cumulative metrics often used to evaluate pitchers. He lacks the wins to be a "stellar" starting pitcher, or the saves to be one as a reliever. The versatility he would have brought to a team could justify his inclusion, but for the sake of classification here, he is incorrectly omitted.

Early Wynn, the other interesting example, had 300 wins but was only barely voted in - and was mispredicted here. The high number of losses and high walk count likely hurt him, despite his Cy Young Award and 9x All-Star selection. An alternative explanation could be his "late-bloomer status" - Wynn didn't get his second midseason recognition until he was on the wrong side of 30. [Baseball-Reference: Early Wynn](https://www.baseball-reference.com/players/w/wynnea01.shtml)

### Pitchers: false positives

In [None]:
pitching_metrics.head()

In [None]:
fp_pitchers = wrong_pitchers[wrong_pitchers.inducted == 0]
fp_pitchers = fp_pitchers[['name', 'w', 'l', 'so', 'BB', 'ERA', 'WHIP', 'BAA', 'K/9', 'BB/9', 'K/BB', 'VOTE%']]
fp_pitchers = pitching_metrics[['seasons_played', 'All-Star', 'TripleCrown', 'CYA', '3000K', '300W', '300SV']].merge(fp_pitchers, left_index = True, right_index = True)
fp_pitchers

In order to determine the number of mispredictions, we first have to look at the other dataset: non-BBWAA votes.

In [None]:
hof_noBBWAA = hof_vote[['votedby', 'inducted']]
hof_noBBWAA = hof_noBBWAA[hof_noBBWAA.votedby != 'BBWAA']
fp_pitchers.merge(hof_noBBWAA, left_index = True, right_index = True)

From this, we can see that Rube Waddell and Jim Bunning were in fact inducted into the Hall of Fame, just not through the BBWAA vote. The others' relatively good cumulative metrics may have resulted in their false positive classification, although this suggests the model was undervaluing All-Star and Cy Young selections. That question often comes up in the actual votes (it has been used to question Andy Pettitte's candidacy, in fact).
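One way to keep such players from being counted as errors would be to build a second label recording induction by any voting body, alongside the BBWAA-only label. A minimal sketch on a hand-made frame: the player ids are placeholders, and the `'Y'`/`'N'` coding of `inducted` and the `votedby` labels are assumptions about `hall_of_fame.csv`, not values checked against the file.

```python
import pandas as pd

# Toy stand-in for hall_of_fame.csv; every value here is an illustrative assumption
toy_votes = pd.DataFrame(
    {
        "player_id": ["aaa01", "aaa01", "bbb01", "ccc01"],
        "votedby": ["BBWAA", "Veterans", "BBWAA", "BBWAA"],
        "inducted": ["N", "Y", "Y", "N"],
    }
)

is_in = toy_votes["inducted"].eq("Y")
by_bbwaa = is_in & toy_votes["votedby"].eq("BBWAA")

labels = pd.DataFrame(
    {
        # 1 only if a BBWAA ballot inducted the player (the label used for training)
        "inducted_bbwaa": by_bbwaa.groupby(toy_votes["player_id"]).max(),
        # 1 if any voting body inducted the player
        "inducted_any": is_in.groupby(toy_votes["player_id"]).max(),
    }
).astype(int)

# Players the BBWAA-only label treats as non-inductees despite being in the Hall
inducted_elsewhere = labels[(labels["inducted_bbwaa"] == 0) & (labels["inducted_any"] == 1)]
```

Scoring predictions against `inducted_any` as well as `inducted_bbwaa` would separate genuine false positives from players who simply entered the Hall by another route.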

Roger Clemens by statistics should be a surefire Hall-of-Famer, but he is a relatively unsurprising omission, due to the association with PEDs.

### Pitchers: visualizations
More data visualizations to compare Hall of Fame and non Hall of Fame groups

In [None]:
pitchers_complete = pitchers.merge(pitch_historical[['predicted']], left_index = True, right_index = True)
pitchers_complete = players[['name']].merge(pitchers_complete, left_index = True, right_index = True)
pitchers_complete.head()

As we can see, Hall of Fame pitchers tend to have lower ERAs, more strikeouts, and lower WHIPs than typical pitchers. That's not particularly surprising.

In [None]:
sns.set()
# `height` is the seaborn 0.9+ name for the figure-size argument formerly called `size`
sns.lmplot(data = pitchers_complete, x = 'ERA', y = 'so', hue = 'inducted', \
           truncate = True, fit_reg=False, height = 5, \
           palette={1: "#000080", 0:"#ff6961"})
sns.lmplot(data = pitchers_complete, x = 'ERA', y = 'WHIP', hue = 'inducted', \
           truncate = True, fit_reg=False, height = 5, \
           palette={1: "#000080", 0:"#ff6961"})

Unsurprisingly, higher win and shutout totals correlate with induction, while ERA, WHIP, and BAA correlate inversely - to be expected given that lower values in the latter categories signify dominance. What is also interesting to note is how saves appear "inversely correlated"; however, this is likely due to the dominance of starting pitchers in this group. Restricting the data to relief pitchers alone would likely reverse this apparent trend.

Having higher hits, strikeouts, and walks is also correlated - the first and last of these is probably explained by how successful pitchers get more opportunities over time, leading to more innings and this natural positive correlation.
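Checking the saves correlation for relievers alone would first require a role label for each pitcher. One simple possibility is to classify by the share of appearances that were starts. This is a hypothetical sketch: it assumes a games-started count (called `gs` here) is available alongside games (`g`), and the 0.5 cutoff is an arbitrary illustrative threshold, not a value from the analysis.

```python
def pitcher_role(games, games_started, starter_share=0.5):
    """Label a pitcher 'starter' or 'reliever' from career appearance counts.

    A pitcher counts as a starter when at least `starter_share` of his
    appearances were starts; the default cutoff is illustrative only.
    """
    if games <= 0:
        raise ValueError("games must be positive")
    return "starter" if games_started / games >= starter_share else "reliever"


# Made-up (g, gs) career lines
careers = {"starter_like": (500, 480), "reliever_like": (1000, 10), "swingman": (600, 250)}
roles = {name: pitcher_role(g, gs) for name, (g, gs) in careers.items()}
```

With a role column in place, the correlation matrix could be computed separately for each group.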

In [None]:
# correlations between certain statistics
pitchers_complete_corr = pitchers_complete.copy()
pitchers_complete_corr.drop(labels = ['name', 'g', 'l', 'ipouts', 'er', 'hr', 'K/BB', 'BB/9', 'VOTE%', 'predicted'], inplace = True, axis = 1)
f, ax = plt.subplots()
f.set_size_inches(10, 7)
sns.heatmap(pitchers_complete_corr.corr())

When we compare the distributions of some of our key statistics for pitchers, it becomes very clear that Hall of Fame pitchers have significant dominance over the rest of the field. Lower ERAs, higher strikeouts, lower WHIPs, and higher win counts are consistent - and often these are significantly better than the typical performance of a pitcher. Although there is overlap between a Hall of Fame category and a really good general pitcher - which makes our classification job more difficult - it is pretty easy to see that there is a clear performance/ability difference.

Similarly here, separating by relievers and starters would likely show us even greater disparity between what is a typical pitcher and the best of the best.

In [None]:
# distribution map
f.set_size_inches(10, 7)
pairplot_pitchers = pitchers_complete_corr[['w', 'so', 'ERA', 'WHIP', 'inducted']]
sns.pairplot(pairplot_pitchers, kind="scatter", hue = 'inducted', palette="Set2")

### Pitchers: prediction
Here, let's remove Rube Waddell first, as we have seen he is eventually inducted into the Hall of Fame through the Veterans vote. We also remove Roger Clemens as we discussed above.

Still, we have several omissions that are unexpected: Justin Verlander, K-Rod, Max Scherzer, Clayton Kershaw. This can be seen through examining the pitch_active dataset.

In [None]:
pitch_active.sort_values(by = ['All-Star'], inplace = True, ascending = False)
pitcher_future_hof = pitch_active[pitch_active.prediction == 1]
# drop into a new frame rather than in place on a slice (avoids SettingWithCopyWarning)
pitcher_future_hof = pitcher_future_hof.drop(labels = ['clemero02', 'wadderu01'], axis = 0)

In [None]:
# concatenating to statistics
pitcher_future_hof = pitchers.merge(pitcher_future_hof[['prediction']], left_index = True, right_index = True)
hof_pitchers_future = players[['name']].merge(pitcher_future_hof, left_index = True, right_index = True)
# align the column name with the historical frame ('predicted')
hof_pitchers_future.rename({'prediction': 'predicted'}, axis = 1, inplace = True)

# statistics
hof_pitchers_future

In [None]:
# present and future
pitch_predict = pd.concat([hof_pitchers, hof_pitchers_future], axis = 0, sort = True)

sns.set()
f, (ax1, ax2) = plt.subplots(1, 2)
f.set_size_inches(10, 5)
sns.stripplot(data = pitch_predict, x = 'ERA', y = 'WHIP', hue = 'inducted', size = 5, \
           palette={1: "#000080", 0:"#ff6961"}, ax = ax1)
sns.stripplot(data = pitch_predict, x = 'ERA', y = 'so', hue = 'inducted', size = 5, \
           palette={1: "#000080", 0:"#ff6961"}, ax = ax2)
# stripplot treats each distinct ERA as its own category, so thin the tick labels to every fifth one
for i in [ax1, ax2]:
    i.set_xticks(np.arange(0,50,5))


In terms of statistical performance, C.C. Sabathia, Johan Santana, and Mark Buehrle are in the most danger. Buehrle was never really regarded as a standout pitcher and, despite being consistently solid, never had the stretch of dominance that other Hall of Famers enjoyed. C.C. may also be hurt by some older-age regression (although that isn't fully captured in this dataset), but his career WAR numbers and potential future World Series titles could boost his case.

Johan Santana is similarly unlikely to qualify in the end, due to his relatively short-lived career. Billy Wagner has failed to reach the Hall of Fame vote threshold for several years, actually, despite his statistical excellence - a lack of innings pitched, for example, may have been a limiting factor despite over 400 saves and a nearly 12 K/9 rate. [SI: Billy Wagner](https://www.si.com/mlb/2016/12/19/jaws-2017-hall-of-fame-ballot-billy-wagner), [MLB: Billy Wagner](https://www.mlb.com/news/billy-wagner-on-hall-of-fame-ballot/c-156876028).

The others in this group, however, seem to be strong candidates. Hoffman has already been elected; Halladay is widely expected to qualify for the Hall of Fame, and has been regarded as the best of his generation [538: Roy Halladay](https://fivethirtyeight.com/features/roy-halladay-was-the-greatest-pitcher-of-his-generation/). 

Price and Hernandez had dominant careers and can make a stronger case in their thirties by avoiding the same regression. Mariano Rivera, by comparison, is widely regarded as the most dominant closer of all time - and it wouldn't be surprising if the discussion were more about whether he becomes the first unanimous selection than about whether he is voted in.

### Pitchers: missing predictions
Let's look at some of the more dominant pitchers as of late who were not predicted to make the Hall of Fame here:  Justin Verlander, Francisco Rodriguez, Max Scherzer, Clayton Kershaw, Adam Wainwright, Zack Greinke, Chris Sale, Corey Kluber.

Of these players: Kluber, Sale, Scherzer, and Kershaw aren't actually in the dataset. Let's look at the stats up until 2015 for these other players.

In [None]:
missing = ['verlaju01', 'rodrifr03', 'wainwad01', 'greinza01']
four_missing_pitchers = pitchers.loc[missing]
stat_cols = ['w', 'l', 'so', 'BB', 'ERA', 'WHIP', 'BAA', 'K/9', 'BB/9', 'K/BB']
four_missing_pitchers = four_missing_pitchers[stat_cols]
# select the numeric columns first so the string 'name' column is not averaged
hof_averages = pd.DataFrame({'HOF_mean': hof_pitchers[stat_cols].mean(),
                            'HOF_std': hof_pitchers[stat_cols].std()}).T
pitching_comparison = pd.concat([four_missing_pitchers, hof_averages], axis = 0).T
pitching_comparison['plus'] = pitching_comparison['HOF_mean'] + pitching_comparison['HOF_std']
pitching_comparison['minus'] = pitching_comparison['HOF_mean'] - pitching_comparison['HOF_std']
pitching_comparison.T

Perhaps it's simply that these players haven't finished full careers, so they are undervalued because they haven't accumulated the same number of saves, wins, strikeouts, etc. as the retirees. On rate stats such as ERA or WHIP, they generally seem fine, with the possible exception of Verlander and Greinke, who both had several "down-seasons" with higher run-scoring. Let's look at histograms of wins and strikeouts.
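The mean plus-or-minus one standard deviation band from the comparison table can equivalently be read as a z-score: how many standard deviations a player sits from the Hall of Fame average. A minimal sketch using only the standard library; the reference win totals are made-up numbers, not the actual `hof_pitchers` data.

```python
from statistics import mean, stdev


def z_score(value, reference):
    """Number of sample standard deviations `value` lies from the mean of `reference`."""
    return (value - mean(reference)) / stdev(reference)


# Hypothetical career win totals for a reference group (not real HOF data)
reference_wins = [200, 250, 300, 350, 400]
z = z_score(150, reference_wins)  # negative: below the reference mean
```

`statistics.stdev` is the sample standard deviation, which matches the default used by pandas' `DataFrame.std()`.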

In [None]:
fig, ax = plt.subplots()
sns.distplot(hof_pitchers['w'], bins = 20, kde=True, rug=False, ax = ax, label = 'HOF')
sns.distplot(four_missing_pitchers['w'], bins = 4, kde=True, rug=False, ax = ax, label = 'Current Stars')
sns.distplot(fp_pitchers['w'], bins = 8, kde=True, rug=False, ax = ax, label = 'False Positives')
ax.legend()

In [None]:
fig, ax = plt.subplots()
sns.distplot(hof_pitchers['so'], bins = 20, kde=True, rug=False, ax = ax, label = 'HOF')
sns.distplot(four_missing_pitchers['so'], bins = 4, kde=True, rug=False, ax = ax, label = 'Current Stars')
sns.distplot(fp_pitchers['so'], bins = 8, kde=True, rug=False, ax = ax, label = 'False Positives')
ax.legend()

As we can see, the current stars' distributions of wins and strikeouts have lower means than even the false positives, who in turn fall short of typical HOF performance. Although all sets have closers/relievers in them, they are dominated by starters, so we have not directly removed the relievers (but maybe should have, to avoid being so lazy). This seems to suggest that there isn't anything wrong with these players' performances, just that they have to keep it up and rack up the numbers.

We can look at ERA/WHIP/K-per-9 to further validate this belief:

In [None]:
f, (ax1, ax2, ax3) = plt.subplots(1, 3)
f.set_size_inches(20, 5)
sns.distplot(hof_pitchers['ERA'], bins = 20, kde=True, rug=False, ax = ax1, label = 'HOF')
sns.distplot(four_missing_pitchers['ERA'], bins = 4, kde=True, rug=False, ax = ax1, label = 'Current Stars')
sns.distplot(hof_pitchers_future['ERA'], bins = 8, kde=True, rug=False, ax = ax1, label = 'Future Inductees')
ax1.legend()
ax1.set_title('ERA')

sns.distplot(hof_pitchers['WHIP'], bins = 20, kde=True, rug=False, ax = ax2, label = 'HOF')
sns.distplot(four_missing_pitchers['WHIP'], bins = 4, kde=True, rug=False, ax = ax2, label = 'Current Stars')
sns.distplot(hof_pitchers_future['WHIP'], bins = 8, kde=True, rug=False, ax = ax2, label = 'Future Inductees')
ax2.legend()
ax2.set_title('WHIP')

sns.distplot(hof_pitchers['K/9'], bins = 20, kde=True, rug=False, ax = ax3, label = 'HOF')
sns.distplot(four_missing_pitchers['K/9'], bins = 4, kde=True, rug=False, ax = ax3, label = 'Current Stars')
sns.distplot(hof_pitchers_future['K/9'], bins = 8, kde=True, rug=False, ax = ax3, label = 'Future Inductees')
ax3.legend()
ax3.set_title('K/9')

In fact, by their 2015 statistics, these four players on average were striking out batters at a better rate than even the Hall of Famers. Their performance is generally on par with the future inductees we had (Halladay, Rivera, etc.), albeit slightly poorer. Even so, this comparison shows that they have had Hall-worthy performance. Although their totals have yet to accumulate, they're likely to be future inductees.

## Concessions
There are numerous potential issues with the analysis and classification that must be addressed:
1. Hall of Fame voting is non-stationary; the model assumes voting is consistent across more than 60 years of different committees. This simplifies our analysis greatly and gives us more data, but it is a fundamental flaw in the approach.
2. The models were trained solely on BBWAA voting, even though many of the current Hall of Fame members (including some that were mispredicted here) were voted in by other means.
3. Out-of-sample (test) fit is less informative here because of the overwhelmingly greater proportion of players who failed to make it to the Hall of Fame.
4. Not all statistics have been adequately recorded, and the ways that awards were given out differed greatly over the years.
5. The players themselves changed characteristics over the years; an interesting analysis would be to look at the seasonal averages over time of different baseball traits.
6. Players were not grouped by position, or by any category beyond batter/pitcher (ex: starter/reliever). Finer segmentation could lead to more effective models, given the differences between such groups in the data. Clustering could potentially have been done first to find them.
7. Reputation was not taken into account in any of these cases; steroids are a key issue that has hurt many players' candidacies, but we did not have data for it.
8. Data is only available up to the 2015 season, meaning my own intuition is limited to that period. Moreover, recent stars such as Clayton Kershaw and Max Scherzer aren't included in predicting Hall of Famers because they had only recently made their debuts at that time.
9. There was very little pitcher data to validate the model on. Hold-outs could have been used instead of relying on Trevor Hoffman as the only out-of-sample example.
10. Predictions varied significantly with the model used - even slight changes, such as the depth of a decision tree, would drastically change them.
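Concession 3 can be made concrete with a small, entirely hypothetical example: when only a handful of candidates are inducted, a model that never predicts induction still scores high accuracy while finding no Hall of Famers. The 5-in-100 split below is an illustrative assumption, not the ratio in our data.

```python
def accuracy(actual, predicted):
    """Share of labels predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted):
    """Share of true inductees (label 1) that the model also labels 1."""
    positives = [p for a, p in zip(actual, predicted) if a == 1]
    return sum(positives) / len(positives)


# Hypothetical ballot: 5 inductees among 100 candidates
actual = [1] * 5 + [0] * 95
always_no = [0] * 100  # a "model" that never predicts induction

baseline_accuracy = accuracy(actual, always_no)
baseline_recall = recall(actual, always_no)
```

Here the do-nothing baseline reaches 95% accuracy with zero recall, which is why recall or precision on the inducted class is a more telling yardstick than raw accuracy.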

## Reflections
Working with baseball data was fun and rewarding, between creating features such as our aggregate measures, reducing dimensionality and scaling the data, and then this final analysis of predicting future players. As it stands, the hitter model is incomplete, although the pitcher one is finished. The difficulties seen in this dataset are relatively expected, however - baseball is known for being a highly quantified sport, especially with the recent success of sabermetric-based teams like the Astros (and Billy Beane's Moneyball, the original spark of interest in baseball statistics for me), and more data will increase complexity.

Thank you to Kaggle and to Sean Lahman for his dataset and this wealth of information. Future analysis could extend the prediction period, scale performance by season to avoid penalizing players with fewer seasons of experience (especially in forward-looking predictions), use finer classes within the pitcher/batter categories, and shorten the training timeframe so that the model reflects more modern BBWAA voting.
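The per-season scaling suggested above could look roughly like this. The helper name `per_season` is hypothetical, `seasons_played` mirrors the column of that name in `pitching_metrics`, and the career lines are made up for illustration.

```python
def per_season(totals, seasons_played, counting_stats=("w", "so")):
    """Convert career counting totals into per-season averages."""
    if seasons_played <= 0:
        raise ValueError("seasons_played must be positive")
    return {f"{stat}_per_season": totals[stat] / seasons_played for stat in counting_stats}


# A retired veteran and an active pitcher on the same pace (made-up numbers)
veteran = per_season({"w": 300, "so": 3000}, seasons_played=20)
active = per_season({"w": 150, "so": 1500}, seasons_played=10)
```

On raw totals the active pitcher looks half as accomplished; on a per-season basis the two are identical, which is the effect the scaling is meant to capture.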