# Using Historical Data to Predict Batting Success: Step 2

Authored by: Donna J. Harris (994042890)

Email: harr2890@mylaurier.ca

For: CP640 Machine Learning (S22) with Professor Elham Harirpoush

## Notebook Series

Just a word about the presentation of this project code.

The code is organized into a series of locally executed Jupyter notebooks, one per step, which must be executed in sequence. This is `harr2890_project_step2_hof_data_prep`, the second of XXXXX notebooks.   TODO

## *Step 2 - Data Preparation for a Hall of Fame Approach*

This notebook covers the second phase of data preparation: structuring and splitting the data into the state needed for the experiments and modelling, which follow a Hall of Fame approach.

Here, we will be combining the main source data with our list of Hall of Fame inductees and preparing the data for exploration and modelling based on various **classification** techniques in a subsequent notebook.

We will also be splitting the data, based on the year 2000, so that later on in the process we can run controlled tests on unseen data which can be used to manually evaluate the results and predictions of the modelling.

## Environment Setup

Import libraries and set up the environment for our work, including configuring pandas to show all dataframe columns.

In [1]:
import pandas as pd

pd.set_option('display.max_columns', None)

### Pre-Conditions

Step 1 must be run completely before running this notebook.

The `data` folder must exist with the following prepared data files:
- `./data/core_mlb_dataset.csv`
- `./data/hof_dataset.csv`

## Loading Prepared Data Files

Load in the two stored data files: 
- Major League Baseball batting data (`./data/core_mlb_dataset.csv`)
- Baseball Hall of Fame inductee data (`./data/hof_dataset.csv`)

so we can continue with preparing this data.


In [2]:
core_mlb_dataset = "./data/core_mlb_dataset.csv"
df = pd.read_csv(core_mlb_dataset)
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
0,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
1,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
2,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
3,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
4,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3715517,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
3715518,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
3715519,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
3715520,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021


In [3]:
hof_dataset = "./data/hof_dataset.csv"
hof = pd.read_csv(hof_dataset)
hof

Unnamed: 0,ID,Inductee
0,hodgegi01,1
1,kaatji01,1
2,minosmi01,1
3,olivato01,1
4,ortizda01,1
...,...,...
263,cobbty01,1
264,johnswa01,1
265,mathech01,1
266,ruthba01,1


## Preprocessing (Continued from the Step 1 Notebook)

### The Hall of Fame Approach

First, a statement of the approach taken here with the Hall of Fame induction data.

The overarching goal is to use Major League Baseball data to predict batting success.

One possible measure of success is a batter's induction into the Major League Baseball Hall of Fame. With the batting data available, predicting batting success based on induction into the Hall of Fame seems like an interesting approach and problem to attempt. (**Note:** There are many issues with this approach, which will be explored in more depth within the Step 3 Notebook, as well as the written report.)

In Step 3's Notebook, the goal will be to look at a baseball player's career batting data and predict whether or not that player is in the Hall of Fame.

**Note:** If a player is not yet eligible for induction into the Hall of Fame (because they are currently playing or were recently active), a positive prediction would represent a strong possibility that they will be inducted once they become eligible.

In order to prepare the data for this approach, we will do the following:
1.  Generate career batting statistics for player data.
2.  Arbitrarily split the data, based on the year 2000. (Get the names of these players.)
3.  Filter out players with minimal batting data.
4.  Generate the **calculated** batting statistics for players.
5.  Label the career batting data with the Hall of Fame induction information.
6.  Create dataframes based on the calculated batting data, split by the list of names created in stage 2.

The end result will be two labelled dataframes containing player data for:
- players whose career began before 2000
- players whose career began during or after 2000

These will be stored and used in the Step 3 Notebook work. 

### 1. Career Batting Statistics

First, we group the game-based statistics by player and season.

In [43]:
filterable = ['ID', 'Player', 'Season']
# Sum only the counting statistics; the grouping keys become the index.
columns = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH','SF']
group_alldata = df.groupby(filterable)[columns].sum()
group_alldata

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
ID,Player,Season,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0


Then, we'll remove the grouping so we have independent seasonal statistic records for each player's season.

In [44]:
group_alldata = group_alldata.reset_index()
group_alldata

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0


Next, because we're interested in career statistics, we want to sum the seasonal statistics into career totals. We grouped by season first so that we can count the number of seasons each player appeared in the Major Leagues. (**Note:** We have to count the seasons explicitly because not all careers are continuous, so we won't assume they are.)

Before generating the career statistics, we will add a new column to the dataframe that will generate our "Number of Seasons" statistic when we sum the other career statistics.

In [46]:
group_alldata['Number of Seasons'] = 1
group_alldata

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,1
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,1
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,1
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,1
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,1
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,1
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,1
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,1
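
The counting trick described above can be checked on a tiny example. The sketch below uses made-up seasonal records (the IDs and values are hypothetical, not taken from the dataset), sums a column of ones to get the season count, and compares the result against `nunique`, an alternative that this notebook does not use.

```python
import pandas as pd

# Hypothetical seasonal records: player "a" skips 2002, so the career is not continuous.
seasons = pd.DataFrame({
    "ID": ["a", "a", "a", "b"],
    "Season": [2000, 2001, 2003, 2001],
    "H": [10, 20, 30, 5],
})

# One row per player-season, so a column of ones sums to the number of seasons.
seasons["Number of Seasons"] = 1
career = seasons.drop(columns="Season").groupby("ID").sum().reset_index()

# Cross-check: count distinct seasons per player directly.
check = seasons.groupby("ID")["Season"].nunique()
```

Both approaches rely on there being exactly one row per player and season, which is what the earlier grouping step produces.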


### 2. Splitting the Data by the Year 2000 (Getting the List of Names)

While we still have seasonal data, now is a good time to prepare for our data splitting at the end.

Right now, we want two lists of players:
- those who started their careers before the year 2000
- those who started their careers in 2000 or later

We aren't worried about their stats at this point; we just want our two lists of names.

First, let's get a list of all of the players who played in seasons before 2000.

In [48]:
career_pre_2000 = group_alldata.copy()
career_pre_2000 = career_pre_2000[career_pre_2000['Season'] < 2000].copy()
career_pre_2000

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,1
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,1
5,aaronha01,Henry Aaron,1956,660,609,106,200,34,14,26,92,37,54,2,5,7,1
6,aaronha01,Henry Aaron,1957,400,372,71,130,17,4,29,78,26,34,0,0,2,1
7,aaronha01,Henry Aaron,1958,664,601,109,196,34,4,30,95,59,49,1,0,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,1
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,1
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,1
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,1


Next, let's simplify the list to just be the IDs.

In [50]:
career_pre_2000 = career_pre_2000[['ID']].copy()
career_pre_2000 = career_pre_2000.drop_duplicates(subset=['ID'], keep='first')
career_pre_2000 = career_pre_2000.sort_values(by='ID', ascending=True)
career_pre_2000

Unnamed: 0,ID
3,aaronha01
26,aaronto01
33,aasedo01
41,abbotje01
46,abbotji01
...,...
69993,zuberjo01
70007,zupcibo01
70011,zupofr01
70014,zuvelpa01


Now, let's get a list of all of the players who played in seasons 2000 and onward.

In [65]:
career_2000_onward = group_alldata.copy()
career_2000_onward = career_2000_onward[career_2000_onward['Season'] > 1999].copy()
career_2000_onward

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,1
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,1
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,1
34,abadan01,Andy Abad,2001,1,1,0,0,0,0,0,0,0,0,0,0,0,1
35,abadan01,Andy Abad,2003,19,17,1,2,0,0,0,0,2,5,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70002,zuninmi01,Mike Zunino,2017,435,387,52,97,25,0,25,64,39,160,8,0,1,1
70003,zuninmi01,Mike Zunino,2018,405,373,37,75,18,0,20,44,24,150,6,0,2,1
70004,zuninmi01,Mike Zunino,2019,289,266,30,44,10,1,9,32,20,98,3,0,0,1
70005,zuninmi01,Mike Zunino,2020,84,75,8,11,4,0,4,10,6,37,3,0,0,1


Again, let's simplify the list to just be the IDs.

In [66]:
career_2000_onward = career_2000_onward[['ID']].copy()
career_2000_onward = career_2000_onward.drop_duplicates(subset=['ID'], keep='first')
career_2000_onward = career_2000_onward.sort_values(by='ID', ascending=True)
career_2000_onward

Unnamed: 0,ID
0,aardsda01
34,abadan01
37,abadfe01
40,abbotco01
44,abbotje01
...,...
69967,zoccope01
69981,zoskyed01
69995,zuberty01
69996,zuletju01


It's important to note that we have a good potential for overlap between these two lists. We can see this with the ID `'abbotje01'` appearing on both lists.

Let's confirm this in the full data:

In [69]:
group_alldata.loc[group_alldata['ID'] == 'abbotje01']

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
41,abbotje01,Jeff Abbott,1997,38,38,8,10,1,0,1,2,0,6,0,0,0,1
42,abbotje01,Jeff Abbott,1998,261,244,33,68,14,1,12,41,9,28,0,2,5,1
43,abbotje01,Jeff Abbott,1999,64,57,5,9,0,0,2,6,5,12,0,1,1,1
44,abbotje01,Jeff Abbott,2000,242,215,31,59,15,1,3,29,21,38,2,2,1,1
45,abbotje01,Jeff Abbott,2001,46,42,5,11,3,0,0,5,3,7,1,0,0,1


This confirms that Jeff Abbott played from 1997-2001.

For players like Jeff, we want them to appear in the first list only because their career started before 2000. We need to find these players and then remove them from the second list.

In [70]:
# IDs present in both lists: players whose careers span the year 2000.
overlapped_names = pd.merge(career_pre_2000, career_2000_onward, on=['ID'], how='inner')
overlapped_names

Unnamed: 0,ID
0,abbotje01
1,abbotku01
2,abreubo01
3,aceveju01
4,adamste01
...,...
952,younger02
953,youngke01
954,zaungr01
955,zeileto01


We see Jeff Abbott (`'abbotje01'`) first on this overlapped list.

Next, we want to remove Jeff and the other players represented by this list from the `career_2000_onward` list.

First, a quick sanity check on our numbers:

In [71]:
print("career_pre_2000 length: ", len(career_pre_2000))
print("career_2000_onward length - BEFORE: ", len(career_2000_onward))
print("overlapped_names length: ", len(overlapped_names))

print("\n\nWe expect a career_2000_onward length of", len(career_2000_onward)-len(overlapped_names),"after the removal.")

career_pre_2000 length:  10365
career_2000_onward length - BEFORE:  4928
overlapped_names length:  957


We expect a career_2000_onward length of 3971 after the removal.


In [72]:
# Keep only the IDs that do not also appear in the overlap list (vectorized, no loop).
career_2000_onward = career_2000_onward[~career_2000_onward['ID'].isin(overlapped_names['ID'])].copy()
    
career_2000_onward

Unnamed: 0,ID
0,aardsda01
34,abadan01
37,abadfe01
40,abbotco01
58,abbotpa01
...,...
69953,zobribe01
69967,zoccope01
69995,zuberty01
69996,zuletju01


In [73]:
print("career_2000_onward length - AFTER: ", len(career_2000_onward))

career_2000_onward length - AFTER:  3971


We see the numbers align with what we expect and no longer see Jeff Abbott (`'abbotje01'`) in the `career_2000_onward` list.

And for thoroughness, we can confirm definitively that he is no longer in this list.

In [63]:
career_2000_onward.loc[career_2000_onward['ID'] == 'abbotje01']

Unnamed: 0,ID


Now we have our two lists of IDs, which we will use toward the end of this notebook.
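
As a point of comparison, the same two lists could also be derived from each player's first recorded season. The following is a minimal sketch on toy data, not the method used in this notebook; the seasons shown are illustrative and the variable names `debut`, `pre_ids`, and `post_ids` are hypothetical.

```python
import pandas as pd

# Toy seasonal records; 'abbotje01' has seasons on both sides of 2000.
seasons = pd.DataFrame({
    "ID": ["abbotje01", "abbotje01", "aardsda01", "aaronha01"],
    "Season": [1999, 2000, 2006, 1954],
})

# A player's career start is their earliest season in the data.
debut = seasons.groupby("ID")["Season"].min()

# Each player lands in exactly one list, so no overlap removal is needed.
pre_ids = sorted(debut[debut < 2000].index)
post_ids = sorted(debut[debut >= 2000].index)
```

Because every ID has a single debut season, this variant assigns a player whose career spans the year 2000 to the earlier list only.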

### 3. Filtering Out Batters with Minimal Plate Appearances

Now, we will delete the `'Season'` column from the seasonal dataframe (`group_alldata`), since it is no longer needed, and then sum the remaining columns for each player.

Note that we will only maintain the main dataframe and then use the lists of names we just generated to split the completed data at the end.

In [74]:
del group_alldata['Season']
career_alldata = group_alldata.groupby(['ID','Player']).sum()
career_alldata

Unnamed: 0_level_0,Unnamed: 1_level_0,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
ID,Player,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
aardsda01,David Aardsma,5,4,0,0,0,0,0,0,0,2,0,1,0,3
aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,23
aaronto01,Tommie Aaron,1045,944,99,216,42,6,13,94,85,145,0,9,6,7
aasedo01,Don Aase,5,5,0,0,0,0,0,0,0,3,0,0,0,1
abadan01,Andy Abad,25,21,1,2,0,0,0,0,4,5,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuninmi01,Mike Zunino,2835,2559,308,518,111,5,141,345,198,981,58,8,12,9
zupcibo01,Bob Zupcic,886,795,93,199,47,4,7,80,57,137,6,20,8,4
zupofr01,Frank Zupo,8,7,1,2,1,0,0,0,1,2,0,0,0,3
zuvelpa01,Paul Zuvella,545,491,40,109,17,2,2,20,34,50,2,18,0,8


Before moving forward, we will ungroup the dataframe again.

In [75]:
career_alldata = career_alldata.reset_index()
career_alldata

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
0,aardsda01,David Aardsma,5,4,0,0,0,0,0,0,0,2,0,1,0,3
1,aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,23
2,aaronto01,Tommie Aaron,1045,944,99,216,42,6,13,94,85,145,0,9,6,7
3,aasedo01,Don Aase,5,5,0,0,0,0,0,0,0,3,0,0,0,1
4,abadan01,Andy Abad,25,21,1,2,0,0,0,0,4,5,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14331,zuninmi01,Mike Zunino,2835,2559,308,518,111,5,141,345,198,981,58,8,12,9
14332,zupcibo01,Bob Zupcic,886,795,93,199,47,4,7,80,57,137,6,20,8,4
14333,zupofr01,Frank Zupo,8,7,1,2,1,0,0,0,1,2,0,0,0,3
14334,zuvelpa01,Paul Zuvella,545,491,40,109,17,2,2,20,34,50,2,18,0,8


Now, let's make our lives easier by filtering out players with minimal batting data.

Not only should this help our predictions by reducing the number of records for players who will not be Hall of Fame inductees (and there are a lot of those), but filtering now also avoids some extra data cleaning that is only needed for players with very few plate appearances. For example, division-by-zero issues arise when a player has 0 `'AB'`:

In [76]:
career_alldata[career_alldata['AB'] == 0 ]

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
536,bakertr01,Tracy Baker,1,0,0,0,0,0,0,0,0,0,0,1,0,1
628,barneho01,Honey Barnes,1,0,0,0,0,0,0,0,1,0,0,0,0,1
629,barneja01,Jacob Barnes,1,0,0,0,0,0,0,0,1,0,0,0,0,1
689,bartosh01,Shawn Barton,1,0,0,0,0,0,0,0,0,0,0,1,0,1
722,batscbi01,Bill Batsch,1,0,0,0,0,0,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13537,wardbr01,Bryan Ward,1,0,0,0,0,0,0,0,1,0,0,0,0,1
13680,wellsbo01,Bob Wells,1,0,1,0,0,0,0,0,1,0,0,0,0,1
13969,wilsoga03,Gary Wilson,1,0,0,0,0,0,0,0,0,0,0,1,0,1
14204,yeabsbe01,Bert Yeabsley,1,0,0,0,0,0,0,0,1,0,0,0,0,1
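
To make the division-by-zero concern concrete, here is a minimal sketch on toy data (the values are hypothetical). pandas does not raise an error when dividing 0 by 0; it produces `NaN`, which would then have to be cleaned up before modelling.

```python
import pandas as pd

# Two hypothetical players: the first has no at-bats at all.
toy = pd.DataFrame({"H": [0, 2], "AB": [0, 8]})

# Batting average is H / AB; the first row is 0 / 0.
avg = toy["H"] / toy["AB"]

# Filtering on AB first removes the problematic row.
safe = toy[toy["AB"] > 0]
safe_avg = safe["H"] / safe["AB"]
```

Filtering out low-activity players before calculating ratios, as this notebook does, sidesteps the issue instead of patching the `NaN` values afterwards.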


First, we can filter out players who have appeared in the Major Leagues in fewer than ten seasons.

Generally speaking, a player must have played in at least ten Major League seasons to be eligible for election.
(Reference: https://baseballhall.org/hall-of-famers/rules/bbwaa-rules-for-election)

In [77]:
career_filtered = career_alldata[career_alldata['Number of Seasons'] > 9 ].copy()
career_filtered

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
1,aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,23
17,abernte02,Ted Abernathy,204,180,12,25,3,0,0,9,7,74,2,15,0,14
24,abreubo01,Bobby Abreu,10081,8480,1453,2470,574,59,288,1363,1476,1840,33,7,85,18
42,adairje01,Jerry Adair,4314,4019,376,1022,163,19,57,365,207,497,17,41,30,13
51,adamsbo03,Bobby Adams,4335,3846,557,1036,180,47,36,294,394,426,16,74,5,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14316,zimmery01,Ryan Zimmerman,7402,6654,963,1846,417,22,284,1061,646,1384,31,1,69,16
14320,ziskri01,Richie Zisk,5737,5144,681,1477,245,26,207,792,533,910,12,7,41,13
14321,zitoba01,Barry Zito,418,344,12,35,0,0,0,11,18,99,0,56,0,13
14323,zobribe01,Ben Zobrist,6836,5880,883,1566,349,44,167,768,832,994,31,26,67,14


Next, we want to filter out players who are primarily pitchers, since our focus is batting. Pitchers who are not also regular batters (unlike Babe Ruth or Shohei Ohtani) have significantly fewer plate appearances than position players, but may still have played a high number of seasons.

For example, Ted Abernathy, Barry Zito, and Bill Zuber are all pitchers, according to Baseball-Reference.com.

We will arbitrarily select a number of plate appearances to filter out some of this noise from the data.

In [78]:
temp = career_filtered[career_filtered['PA'] < 1000]
temp = temp.sort_values(by='PA', ascending=False)
temp.head(10)

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
7568,lopated01,Eddie Lopat,995,887,86,187,26,5,5,77,72,93,0,35,1,12
5225,haneyla01,Larry Haney,984,919,68,198,30,1,12,73,44,175,3,13,5,12
7220,lawve01,Vern Law,984,867,87,190,35,7,11,90,41,174,5,61,7,16
10713,reynoal01,Allie Reynolds,982,857,47,140,19,4,1,80,61,187,5,59,0,13
6856,kneppbo01,Bob Knepper,982,840,47,115,28,2,6,59,41,333,2,90,9,15
10035,perryji01,Jim Perry,980,889,72,177,22,3,5,59,32,180,3,51,5,14
9285,newcodo01,Don Newcombe,970,863,91,234,32,3,15,108,85,145,3,16,3,10
11597,schumha02,Hal Schumacher,961,896,86,181,23,4,15,102,32,196,5,28,0,13
11084,rommeed01,Eddie Rommel,960,827,79,165,23,6,1,62,62,102,0,67,0,13
11268,ryanno01,Nolan Ryan,957,852,40,94,10,2,2,36,38,371,0,65,2,14


Nine of these ten players were pitchers, according to Baseball-Reference.com.

There are many batters who don't get many plate appearances and bounce back and forth between the minor and major leagues over many seasons, so we don't want to be too aggressive here.

Removing players with fewer than 1,000 plate appearances over their career will reduce some of the noise created by career pitchers.

In [79]:
career_filtered = career_filtered[career_filtered['PA'] > 999 ].copy()
career_filtered

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
1,aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,23
24,abreubo01,Bobby Abreu,10081,8480,1453,2470,574,59,288,1363,1476,1840,33,7,85,18
42,adairje01,Jerry Adair,4314,4019,376,1022,163,19,57,365,207,497,17,41,30,13
51,adamsbo03,Bobby Adams,4335,3846,557,1036,180,47,36,294,394,426,16,74,5,14
63,adamsma01,Matt Adams,2614,2421,297,624,130,6,118,399,165,643,12,0,16,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14304,zernigu01,Gus Zernial,4361,3940,551,1049,152,21,227,749,375,731,24,2,20,11
14312,zimmedo01,Don Zimmer,3523,3218,342,758,127,21,90,348,242,662,13,36,14,12
14316,zimmery01,Ryan Zimmerman,7402,6654,963,1846,417,22,284,1061,646,1384,31,1,69,16
14320,ziskri01,Richie Zisk,5737,5144,681,1477,245,26,207,792,533,910,12,7,41,13


### 4. Career Calculated Batting Statistics

We now calculate batting statistics for our players with 10+ seasons and 1,000+ plate appearances.

Below are the statistical formulae (and simplifications) for career calculations for:
- batting average (`'AVG'`)
- slugging percentage (`'SLG'`)
- on base percentage (`'OBP'`)
- on base percentage plus slugging (`'OPS'`)

#### Our Calculated Statistics Formulae

_AVG (float) = H / AB_


_SLG (float) = total_bases / AB_

- _total_bases (int) =_ 1 * 1B + 2 * 2B + 3 * 3B + 4 * HR
- _1B (int) =_ H - (2B + 3B + HR)


_OBP (float) = (H + BB + HBP)/(AB + BB + HBP + SF)_


_OPS (float) = SLG + OBP_

##### Simplification of SLG formula


SLG = 
- total_bases / AB
- ( 1 * 1B                       + 2 * 2B + 3 * 3B + 4 * HR ) / AB
- ( 1 * \[ H - (2B + 3B + HR) \] + 2 * 2B + 3 * 3B + 4 * HR ) / AB
- ( H - 2B - 3B - HR             + 2 * 2B + 3 * 3B + 4 * HR ) / AB
- ( H - 2B + 2 * 2B - 3B + 3 * 3B - HR + 4 * HR ) / AB
- ( H + (-2B + 2\*2B) + (-3B + 3\*3B) + (-HR + 4\*HR) ) / AB
- ( H + 2B + 2 * 3B + 3 * HR ) / AB
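
As a quick numerical check of the simplification, the sketch below evaluates both forms of the SLG formula using Henry Aaron's career totals as they appear in the tables of this notebook. The variable names are illustrative only.

```python
# Henry Aaron's career totals, copied from the career table in this notebook.
H, doubles, triples, HR, AB = 3703, 614, 96, 740, 12121

# Definition: singles are the hits that were not extra-base hits.
singles = H - (doubles + triples + HR)
total_bases = 1 * singles + 2 * doubles + 3 * triples + 4 * HR
slg_definition = total_bases / AB

# Simplified form: ( H + 2B + 2 * 3B + 3 * HR ) / AB
slg_simplified = (H + doubles + 2 * triples + 3 * HR) / AB

print(round(slg_simplified, 6))  # 0.555152
```

Both expressions reduce to 6729 total bases over 12121 at-bats, which agrees with the `'SLG'` value shown for Henry Aaron in the calculated statistics.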

Time to calculate the statistics (all real values) for all remaining players, adding new columns for each.

In [80]:
career_filtered['AVG'] = career_filtered['H'] / (career_filtered['AB']*1.0)
career_filtered['SLG'] = (career_filtered['H'] + career_filtered['2B'] + 2*career_filtered['3B'] + 3*career_filtered['HR']) / (career_filtered['AB']*1.0)
career_filtered['OBP'] = (career_filtered['H'] + career_filtered['BB'] + career_filtered['HBP']) / ((career_filtered['AB'] + career_filtered['BB'] + career_filtered['HBP'] + career_filtered['SF'])*1.0) 
career_filtered['OPS'] = career_filtered['SLG'] + career_filtered['OBP']

career_filtered

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons,AVG,SLG,OBP,OPS
1,aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,23,0.305503,0.555152,0.374276,0.929429
24,abreubo01,Bobby Abreu,10081,8480,1453,2470,574,59,288,1363,1476,1840,33,7,85,18,0.291274,0.474764,0.394977,0.869741
42,adairje01,Jerry Adair,4314,4019,376,1022,163,19,57,365,207,497,17,41,30,13,0.254292,0.346852,0.291598,0.638451
51,adamsbo03,Bobby Adams,4335,3846,557,1036,180,47,36,294,394,426,16,74,5,14,0.269371,0.368695,0.339357,0.708052
63,adamsma01,Matt Adams,2614,2421,297,624,130,6,118,399,165,643,12,0,16,10,0.257745,0.462619,0.306427,0.769046
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14304,zernigu01,Gus Zernial,4361,3940,551,1049,152,21,227,749,375,731,24,2,20,11,0.266244,0.488325,0.332186,0.820511
14312,zimmedo01,Don Zimmer,3523,3218,342,758,127,21,90,348,242,662,13,36,14,12,0.235550,0.371970,0.290508,0.662478
14316,zimmery01,Ryan Zimmerman,7402,6654,963,1846,417,22,284,1061,646,1384,31,1,69,16,0.277427,0.474752,0.340946,0.815698
14320,ziskri01,Richie Zisk,5737,5144,681,1477,245,26,207,792,533,910,12,7,41,13,0.287131,0.465591,0.352880,0.818471


### 5. Adding Hall of Fame Inductee Labelling

Next, let's label the data, marking whether each player is currently a member of the Major League Baseball Hall of Fame.

In [81]:
career_hof = pd.merge(hof, career_filtered, on="ID", how="right")
career_hof

Unnamed: 0,ID,Inductee,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons,AVG,SLG,OBP,OPS
0,aaronha01,1.0,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,23,0.305503,0.555152,0.374276,0.929429
1,abreubo01,,Bobby Abreu,10081,8480,1453,2470,574,59,288,1363,1476,1840,33,7,85,18,0.291274,0.474764,0.394977,0.869741
2,adairje01,,Jerry Adair,4314,4019,376,1022,163,19,57,365,207,497,17,41,30,13,0.254292,0.346852,0.291598,0.638451
3,adamsbo03,,Bobby Adams,4335,3846,557,1036,180,47,36,294,394,426,16,74,5,14,0.269371,0.368695,0.339357,0.708052
4,adamsma01,,Matt Adams,2614,2421,297,624,130,6,118,399,165,643,12,0,16,10,0.257745,0.462619,0.306427,0.769046
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1814,zernigu01,,Gus Zernial,4361,3940,551,1049,152,21,227,749,375,731,24,2,20,11,0.266244,0.488325,0.332186,0.820511
1815,zimmedo01,,Don Zimmer,3523,3218,342,758,127,21,90,348,242,662,13,36,14,12,0.235550,0.371970,0.290508,0.662478
1816,zimmery01,,Ryan Zimmerman,7402,6654,963,1846,417,22,284,1061,646,1384,31,1,69,16,0.277427,0.474752,0.340946,0.815698
1817,ziskri01,,Richie Zisk,5737,5144,681,1477,245,26,207,792,533,910,12,7,41,13,0.287131,0.465591,0.352880,0.818471


The first thing we notice is the NaN values under the new `'Inductee'` column. This is easy to address, as these are all non-inductees and can be replaced with a value of zero.

In [82]:
hof_is_nan = career_hof.loc[pd.isna(career_hof['Inductee'])]
series = hof_is_nan.index
career_hof.loc[series, 'Inductee'] = 0
career_hof['Inductee'] = career_hof['Inductee'].astype(int)
career_hof

Unnamed: 0,ID,Inductee,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons,AVG,SLG,OBP,OPS
0,aaronha01,1,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,23,0.305503,0.555152,0.374276,0.929429
1,abreubo01,0,Bobby Abreu,10081,8480,1453,2470,574,59,288,1363,1476,1840,33,7,85,18,0.291274,0.474764,0.394977,0.869741
2,adairje01,0,Jerry Adair,4314,4019,376,1022,163,19,57,365,207,497,17,41,30,13,0.254292,0.346852,0.291598,0.638451
3,adamsbo03,0,Bobby Adams,4335,3846,557,1036,180,47,36,294,394,426,16,74,5,14,0.269371,0.368695,0.339357,0.708052
4,adamsma01,0,Matt Adams,2614,2421,297,624,130,6,118,399,165,643,12,0,16,10,0.257745,0.462619,0.306427,0.769046
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1814,zernigu01,0,Gus Zernial,4361,3940,551,1049,152,21,227,749,375,731,24,2,20,11,0.266244,0.488325,0.332186,0.820511
1815,zimmedo01,0,Don Zimmer,3523,3218,342,758,127,21,90,348,242,662,13,36,14,12,0.235550,0.371970,0.290508,0.662478
1816,zimmery01,0,Ryan Zimmerman,7402,6654,963,1846,417,22,284,1061,646,1384,31,1,69,16,0.277427,0.474752,0.340946,0.815698
1817,ziskri01,0,Richie Zisk,5737,5144,681,1477,245,26,207,792,533,910,12,7,41,13,0.287131,0.465591,0.352880,0.818471


Let's just do a sanity check for any null or NaN values in the dataframe.

In [83]:
career_hof.isnull().values.any()

False
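
The labelling pattern used in this section (merge, then replace the missing labels with zero) can be sketched compactly on toy data. This is an illustrative variant that uses `fillna`, not the exact code from the cells above; the two-row `stats` frame is a cut-down stand-in for the career data.

```python
import pandas as pd

# Toy stand-ins for the Hall of Fame list and the career statistics.
hof = pd.DataFrame({"ID": ["aaronha01"], "Inductee": [1]})
stats = pd.DataFrame({"ID": ["aaronha01", "abreubo01"], "H": [3703, 2470]})

# Keep every player in stats; players missing from hof get NaN for 'Inductee'.
labelled = stats.merge(hof, on="ID", how="left")

# Non-inductees become 0, and the column goes back to integer type.
labelled["Inductee"] = labelled["Inductee"].fillna(0).astype(int)
```

A left merge from the statistics side plays the same role as the right merge used in this notebook: every player with career data is kept, whether or not they appear in the Hall of Fame list.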

All that's left to do before splitting the data up is reordering the columns.

In [84]:
columns_hof = ['ID', 'Player', 'Number of Seasons', 
               'PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF', 
               'AVG', 'SLG', 'OBP', 'OPS', 
               'Inductee']

career_hof = career_hof[columns_hof]
career_hof

Unnamed: 0,ID,Player,Number of Seasons,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,aaronha01,Henry Aaron,23,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,0.305503,0.555152,0.374276,0.929429,1
1,abreubo01,Bobby Abreu,18,10081,8480,1453,2470,574,59,288,1363,1476,1840,33,7,85,0.291274,0.474764,0.394977,0.869741,0
2,adairje01,Jerry Adair,13,4314,4019,376,1022,163,19,57,365,207,497,17,41,30,0.254292,0.346852,0.291598,0.638451,0
3,adamsbo03,Bobby Adams,14,4335,3846,557,1036,180,47,36,294,394,426,16,74,5,0.269371,0.368695,0.339357,0.708052,0
4,adamsma01,Matt Adams,10,2614,2421,297,624,130,6,118,399,165,643,12,0,16,0.257745,0.462619,0.306427,0.769046,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1814,zernigu01,Gus Zernial,11,4361,3940,551,1049,152,21,227,749,375,731,24,2,20,0.266244,0.488325,0.332186,0.820511,0
1815,zimmedo01,Don Zimmer,12,3523,3218,342,758,127,21,90,348,242,662,13,36,14,0.235550,0.371970,0.290508,0.662478,0
1816,zimmery01,Ryan Zimmerman,16,7402,6654,963,1846,417,22,284,1061,646,1384,31,1,69,0.277427,0.474752,0.340946,0.815698,0
1817,ziskri01,Richie Zisk,13,5737,5144,681,1477,245,26,207,792,533,910,12,7,41,0.287131,0.465591,0.352880,0.818471,0


### 6. Complete the Data Splitting by Creating Dataframes

With the career data in order and labelled with the Hall of Fame induction data, it is time to use the lists we created earlier in the notebook and split the data into our arbitrarily grouped dataframes.

All players who started their careers before the year 2000 will be in the first dataframe, `pre_2000`, and all players who started their careers in 2000 or later (some of whom will be currently active players) will be in the second dataframe, `from_2000`.

We already have our list of names from earlier work, so now it is a matter of merging data to make new dataframes specific to the chosen eras.

First, our pre-2000 players.

In [89]:
pre_2000 = pd.merge(career_hof, career_pre_2000, on='ID', how='inner')
pre_2000

Unnamed: 0,ID,Player,Number of Seasons,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,aaronha01,Henry Aaron,23,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,0.305503,0.555152,0.374276,0.929429,1
1,abreubo01,Bobby Abreu,18,10081,8480,1453,2470,574,59,288,1363,1476,1840,33,7,85,0.291274,0.474764,0.394977,0.869741,0
2,adairje01,Jerry Adair,13,4314,4019,376,1022,163,19,57,365,207,497,17,41,30,0.254292,0.346852,0.291598,0.638451,0
3,adamsbo03,Bobby Adams,14,4335,3846,557,1036,180,47,36,294,394,426,16,74,5,0.269371,0.368695,0.339357,0.708052,0
4,adamssp01,Sparky Adams,13,6175,5558,839,1588,249,49,9,390,453,222,28,95,0,0.285714,0.353005,0.342606,0.695611,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1509,zaungr01,Gregg Zaun,16,4042,3489,431,878,194,9,88,446,479,544,29,14,31,0.251648,0.388077,0.344091,0.732168,0
1510,zeileto01,Todd Zeile,16,8649,7573,986,2004,397,23,253,1110,945,1279,42,8,81,0.264624,0.423346,0.346140,0.769487,0
1511,zernigu01,Gus Zernial,11,4361,3940,551,1049,152,21,227,749,375,731,24,2,20,0.266244,0.488325,0.332186,0.820511,0
1512,zimmedo01,Don Zimmer,12,3523,3218,342,758,127,21,90,348,242,662,13,36,14,0.235550,0.371970,0.290508,0.662478,0


Then finally, our 2000 onward players.

In [91]:
from_2000 = pd.merge(career_hof, career_2000_onward, on='ID', how='inner')
from_2000

Unnamed: 0,ID,Player,Number of Seasons,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,adamsma01,Matt Adams,10,2614,2421,297,624,130,6,118,399,165,643,12,0,16,0.257745,0.462619,0.306427,0.769046,0
1,alonsyo01,Yonder Alonso,10,3773,3362,390,872,181,2,100,426,366,648,16,1,28,0.259369,0.403629,0.332450,0.736078,0
2,altuvjo01,Jose Altuve,11,6346,5778,883,1777,340,29,164,639,443,753,54,26,45,0.307546,0.461578,0.359810,0.821389,0
3,andinro01,Robert Andino,10,1491,1344,153,313,58,1,18,97,113,313,6,21,7,0.232887,0.317708,0.293878,0.611586,0
4,andruel01,Elvis Andrus,13,7620,6863,953,1864,328,50,79,673,547,1043,50,103,56,0.271601,0.368498,0.327435,0.695933,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
300,youngde03,Delmon Young,10,4371,4108,473,1162,218,11,109,566,179,784,40,1,43,0.282863,0.420886,0.316018,0.736904,0
301,younger03,Eric Young Jr.,10,1926,1725,253,422,67,22,13,112,147,350,23,26,5,0.244638,0.331594,0.311579,0.643173,0
302,youngmi02,Michael Young,14,8612,7918,1136,2375,441,60,185,1030,575,1235,22,25,72,0.299949,0.440894,0.346105,0.786999,0
303,zimmery01,Ryan Zimmerman,16,7402,6654,963,1846,417,22,284,1061,646,1384,31,1,69,0.277427,0.474752,0.340946,0.815698,0
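
One sanity check worth considering after a split like this is that the two dataframes are disjoint and together account for every labelled player. The sketch below shows the idea on toy data; the frames and IDs are hypothetical stand-ins for `career_hof`, `career_pre_2000`, and `career_2000_onward`.

```python
import pandas as pd

# Toy stand-ins for the labelled career data and the two ID lists.
career = pd.DataFrame({"ID": ["a", "b", "c"], "Inductee": [1, 0, 0]})
pre_ids = pd.DataFrame({"ID": ["a", "b"]})
post_ids = pd.DataFrame({"ID": ["c"]})

# Same inner merges as in the cells above, applied to the toy frames.
pre = pd.merge(career, pre_ids, on="ID", how="inner")
post = pd.merge(career, post_ids, on="ID", how="inner")

# The split should neither lose nor duplicate players.
no_overlap = set(pre["ID"]).isdisjoint(post["ID"])
full_coverage = len(pre) + len(post) == len(career)
```

If either flag came back `False` on the real data, it would suggest that an ID is missing from both lists or present in both.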


## It's Time to Save Our Three New Dataframes

Let's capture these three dataframes and have them be ready for Step 3's Notebook.

In [93]:
import os

# Create the output folder if it does not already exist.
os.makedirs('./data', exist_ok=True)
    
alldata_csv = "./data/step2_alldata.csv"
career_hof.to_csv(alldata_csv, index=False)

pre_2000_csv = "./data/step2_pre_2000.csv"
pre_2000.to_csv(pre_2000_csv, index=False)

from_2000_csv = "./data/step2_from_2000.csv"
from_2000.to_csv(from_2000_csv, index=False)

## Concluding Notebook Comments

**Note:** At this point, we will conclude this notebook for organizational purposes, as it is a logical point for saving and launching into the following experimentation and modelling based on the data prepared here.

Saving the data files in various states makes it easier to re-run parts of the overall project without having to re-run every aspect.

The purpose of this notebook is to continue on with the data preparation started in step 1, by structuring and splitting up the data to the state where the experiments and modelling will be conducted.

The *next* notebook in the series is: `harr2890_project_step3_hof_modelling`, where the saved data files will be loaded and ......   XXXXXX  TODO