# Using Historical Data to Predict Batting Success: Step 2 of 5

Authored by: Donna J. Harris (994042890)

Email: harr2890@mylaurier.ca

For: CP640 Machine Learning (S22) with Professor Elham Harirpoush

## Notebook Series

A brief note on how the project code is presented.

The code is organized into a series of locally executed Jupyter notebooks, one per step, which must be run in sequence to follow the flow of the entire project.

This is `harr2890_project_step2_hof_data_prep`, the second of five notebooks.

## *Step 2 - Data Preparation for a Hall of Fame Approach*

This notebook covers the second phase of data preparation: structuring and splitting the data into the state needed for the experiments and modelling based on a Hall of Fame Approach.

Here, we will be combining the main source data with our list of Hall of Fame inductees and preparing the data for exploration and modelling based on various **classification** techniques in a subsequent notebook.

We will also split the data at the year 2000, so that throughout the process there is an option to run controlled tests on entirely unseen data, which can be used to manually evaluate the results and predictions of the modelling.

## Environment Setup

Import the libraries and set up the environment for our work, including displaying all dataframe columns.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

### Pre-Conditions

Step 1 (`harr2890_project_step1_data_prep`) must be run completely before running this notebook.

The `data` folder must exist with the following prepared data files:
- `./data/core_mlb_dataset.csv`
- `./data/hof_dataset.csv`
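
A minimal sketch of how these pre-conditions could be checked before loading anything. The helper name `missing_files` is hypothetical and not part of the original notebook; the paths are the two listed above.

```python
from pathlib import Path

# Files that Step 1 is expected to have written (paths from the list above).
REQUIRED_FILES = [
    "./data/core_mlb_dataset.csv",
    "./data/hof_dataset.csv",
]

def missing_files(paths):
    """Return the subset of `paths` that does not exist on disk (hypothetical helper)."""
    return [p for p in paths if not Path(p).is_file()]

missing = missing_files(REQUIRED_FILES)
if missing:
    print("Run the Step 1 notebook first; missing:", missing)
```

Failing early with a clear message is easier to diagnose than a `FileNotFoundError` raised partway through the notebook.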

## Loading Prepared Data Files

Load in the two stored data files: 
- Major League Baseball batting data (`./data/core_mlb_dataset.csv`)
- Baseball Hall of Fame inductee data (`./data/hof_dataset.csv`)

so we can continue with preparing this data.


In [2]:
core_mlb_dataset = "./data/core_mlb_dataset.csv"
df = pd.read_csv(core_mlb_dataset)
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
0,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
1,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
2,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
3,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
4,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3715517,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
3715518,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
3715519,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
3715520,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021


In [3]:
hof_dataset = "./data/hof_dataset.csv"
hof = pd.read_csv(hof_dataset)
hof

Unnamed: 0,ID,Inductee
0,hodgegi01,1
1,kaatji01,1
2,minosmi01,1
3,olivato01,1
4,ortizda01,1
...,...,...
263,cobbty01,1
264,johnswa01,1
265,mathech01,1
266,ruthba01,1


## Preprocessing (Continued from the Step 1 Notebook)

### The Hall of Fame Approach

This approach to predicting batting success uses current Hall of Fame induction data.

The overarching goal is to use Major League Baseball data to predict batting success.

One possible measure of success is a batter's induction into the Major League Baseball Hall of Fame. With the batting data available, it seems like an interesting approach and problem to attempt to predict batting success based on induction into the Hall of Fame. (**Note:** There are many issues with this approach, which will be explored in more depth within the Step 3 Notebook, as well as the final written report.)

In Step 3's Notebook (`harr2890_project_step3_hof_modelling`), the goal will be to look at a baseball player's **career** batting data and predict whether or not that player is in the Hall of Fame.

**Note:** In the case that the player is not yet eligible for induction into the Hall of Fame (because they are currently playing or recently active) the prediction would represent the strong possibility that they would be inducted in the future.

### Overview of Tasks

In order to prepare the data for this approach, we will do the following:
1.  Generate seasonal batting statistics for player data.
2.  Arbitrarily split the data, based on the year 2000. (Get the names of these players.)
3.  Gather first five seasons of statistics for all players.
4.  Calculate statistics for five season careers.
5.  Label the career batting data with the Hall of Fame induction information.
6.  Create dataframes based on the calculated batting data, split by the list of names created in task 2.


The end result will be two labelled dataframes containing player data for:
- players whose career began before 2000
- players whose career began during or after 2000

These will be stored and used in the Step 3 Notebook work (`harr2890_project_step3_hof_modelling`). 

### 1. Generate Seasonal Batting Statistics

First, we group the game-based statistics by player and season.

In [4]:
filterable = ['ID', 'Player', 'Season']
# Only the numeric statistics are summed; the grouping keys are already listed in `filterable`.
columns = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']
group_alldata = df.groupby(filterable)
group_alldata = group_alldata[columns].sum().copy()
group_alldata

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0


Then, we'll remove the grouping so we have independent seasonal statistic records for each player's season.

In [5]:
seasonal_data = group_alldata.reset_index()
seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0
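
The grouping and index reset above can also be written as a single step. The sketch below uses a tiny invented dataframe (not the project data) and passes `as_index=False` so that the grouping keys stay as ordinary columns.

```python
import pandas as pd

# Toy game-level records (invented values, for illustration only).
games = pd.DataFrame({
    "ID":     ["a01", "a01", "a01", "b01"],
    "Player": ["Ann A", "Ann A", "Ann A", "Bob B"],
    "Season": [1990, 1990, 1991, 1990],
    "AB":     [4, 3, 5, 4],
    "H":      [2, 1, 0, 3],
})

# One row per player-season; as_index=False keeps ID/Player/Season as columns.
seasonal = games.groupby(["ID", "Player", "Season"], as_index=False)[["AB", "H"]].sum()
print(seasonal)
```

Keeping the two-step version, as the notebook does, has the advantage of showing the grouped structure before it is flattened.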


### 2. Splitting the Data by the Year 2000 (Getting the List of Names)

While we still have seasonal data, now is a good time to prepare for our data splitting at the end.

Right now, we want two lists of players:
- those who started their careers before the year 2000
- those who started their careers in 2000 or later

We aren't worried about their stats at this point; we just want our two lists of names.

First, let's get a list of all of the players who played in seasons before 2000.

In [6]:
career_pre_2000 = seasonal_data.copy()
career_pre_2000 = career_pre_2000[career_pre_2000['Season'] < 2000].copy()
career_pre_2000

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
5,aaronha01,Henry Aaron,1956,660,609,106,200,34,14,26,92,37,54,2,5,7
6,aaronha01,Henry Aaron,1957,400,372,71,130,17,4,29,78,26,34,0,0,2
7,aaronha01,Henry Aaron,1958,664,601,109,196,34,4,30,95,59,49,1,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0


Next, let's simplify the list to just be the IDs.

In [7]:
career_pre_2000 = career_pre_2000[['ID']].copy()
career_pre_2000 = career_pre_2000.drop_duplicates(subset=['ID'], keep='first')
career_pre_2000 = career_pre_2000.sort_values(by='ID', ascending=True)
career_pre_2000

Unnamed: 0,ID
3,aaronha01
26,aaronto01
33,aasedo01
41,abbotje01
46,abbotji01
...,...
69993,zuberjo01
70007,zupcibo01
70011,zupofr01
70014,zuvelpa01


Now, let's get a list of all of the players who played in seasons 2000 and onward.

Again, let's simplify the list to just be the IDs.

In [8]:
career_from_2000 = seasonal_data.copy()
career_from_2000 = career_from_2000[career_from_2000['Season'] > 1999].copy()
career_from_2000

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
34,abadan01,Andy Abad,2001,1,1,0,0,0,0,0,0,0,0,0,0,0
35,abadan01,Andy Abad,2003,19,17,1,2,0,0,0,0,2,5,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70002,zuninmi01,Mike Zunino,2017,435,387,52,97,25,0,25,64,39,160,8,0,1
70003,zuninmi01,Mike Zunino,2018,405,373,37,75,18,0,20,44,24,150,6,0,2
70004,zuninmi01,Mike Zunino,2019,289,266,30,44,10,1,9,32,20,98,3,0,0
70005,zuninmi01,Mike Zunino,2020,84,75,8,11,4,0,4,10,6,37,3,0,0


In [9]:
career_from_2000 = career_from_2000[['ID']].copy()
career_from_2000 = career_from_2000.drop_duplicates(subset=['ID'], keep='first')
career_from_2000 = career_from_2000.sort_values(by='ID', ascending=True)
career_from_2000

Unnamed: 0,ID
0,aardsda01
34,abadan01
37,abadfe01
40,abbotco01
44,abbotje01
...,...
69967,zoccope01
69981,zoskyed01
69995,zuberty01
69996,zuletju01


Note that these two lists are likely to overlap, because a player whose career spans the year 2000 has seasons on both sides of the split. We can see this with the ID `'abbotje01'` appearing on both lists.

Let's confirm this in the full data:

In [10]:
seasonal_data.loc[seasonal_data['ID'] == 'abbotje01']

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
41,abbotje01,Jeff Abbott,1997,38,38,8,10,1,0,1,2,0,6,0,0,0
42,abbotje01,Jeff Abbott,1998,261,244,33,68,14,1,12,41,9,28,0,2,5
43,abbotje01,Jeff Abbott,1999,64,57,5,9,0,0,2,6,5,12,0,1,1
44,abbotje01,Jeff Abbott,2000,242,215,31,59,15,1,3,29,21,38,2,2,1
45,abbotje01,Jeff Abbott,2001,46,42,5,11,3,0,0,5,3,7,1,0,0


This confirms that Jeff Abbott played from 1997-2001.

For players like Jeff, we want them to appear in the first list only because their career started before 2000. We need to find these players and then remove them from the second list.

In [11]:
# An inner merge keeps only the IDs that appear in both lists.
overlapped_names = pd.merge(career_pre_2000, career_from_2000, on=['ID'], how='inner')
overlapped_names

Unnamed: 0,ID
0,abbotje01
1,abbotku01
2,abreubo01
3,aceveju01
4,adamste01
...,...
952,younger02
953,youngke01
954,zaungr01
955,zeileto01


We see Jeff Abbott (`'abbotje01'`) first on this overlapped list.

Next, we want to remove Jeff and the other players represented by this list from the `career_from_2000` list.

First, a quick sanity check on our numbers:

In [12]:
print("career_pre_2000 length: ", len(career_pre_2000))
print("career_from_2000 length - BEFORE: ", len(career_from_2000))
print("overlapped_names length: ", len(overlapped_names))

print("\n\nWe expect a career_from_2000 length of", len(career_from_2000)-len(overlapped_names),"after the removal.")

career_pre_2000 length:  10365
career_from_2000 length - BEFORE:  4928
overlapped_names length:  957


We expect a career_from_2000 length of 3971 after the removal.


In [13]:
# Keep only the players whose IDs are not in the overlapped list.
career_from_2000 = career_from_2000[~career_from_2000['ID'].isin(overlapped_names['ID'])].copy()

career_from_2000

Unnamed: 0,ID
0,aardsda01
34,abadan01
37,abadfe01
40,abbotco01
58,abbotpa01
...,...
69953,zobribe01
69967,zoccope01
69995,zuberty01
69996,zuletju01


In [14]:
print("career_from_2000 length - AFTER: ", len(career_from_2000))

career_from_2000 length - AFTER:  3971


We see the numbers align with what we expect and no longer see Jeff Abbott (`'abbotje01'`) in the `career_from_2000` list.

And for thoroughness, we can confirm definitively that he is no longer in this list.

In [15]:
career_from_2000.loc[career_from_2000['ID'] == 'abbotje01']

Unnamed: 0,ID
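
An alternative sketch, assuming the same rule (a player belongs to the era in which their career started): compute each player's earliest season first and split on that, which avoids the overlap removal altogether. The data and the variable names below are invented for illustration.

```python
import pandas as pd

# Toy seasonal records (invented values, for illustration only).
seasons = pd.DataFrame({
    "ID":     ["a01", "a01", "b01", "b01", "c01"],
    "Season": [1998, 2000, 2001, 2002, 1975],
})

# Earliest season on record for each player.
debut = seasons.groupby("ID")["Season"].min()

# Each player lands in exactly one list, based on their debut season.
pre_2000_ids = sorted(debut[debut < 2000].index)
from_2000_ids = sorted(debut[debut >= 2000].index)

print(pre_2000_ids)   # ['a01', 'c01']
print(from_2000_ids)  # ['b01']
```

Player `a01` has seasons on both sides of 2000 but appears only in the first list, which is the same outcome the merge-and-remove steps produce for Jeff Abbott.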


We now have our two lists of IDs, which we will use toward the end of this notebook. Next, we return to shaping the five-season player data.

### 3.  Gathering the First 5 Seasons of Statistics

Next, we want to get all of the unique player IDs for the new dataframe. We will then sort the seasonal data by season so that each player's earliest seasons come first.

In [16]:
player_names = seasonal_data.reset_index()
player_names = player_names[['ID']].copy()
player_names = player_names.drop_duplicates(subset=['ID'], keep='first')
player_names

Unnamed: 0,ID
0,aardsda01
3,aaronha01
26,aaronto01
33,aasedo01
34,abadan01
...,...
69998,zuninmi01
70007,zupcibo01
70011,zupofr01
70014,zuvelpa01


Sort the dataframe by season.

In [17]:
seasonal_sorted = seasonal_data.sort_values(by=['Season'], ascending=True)
seasonal_sorted

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
21668,gatinfr01,Frank Gatins,1901,207,198,21,45,7,2,1,21,5,27,2,2,0
7533,brownfr01,Fred Brown,1901,15,14,1,2,0,0,0,2,0,2,0,1,0
40448,mccange01,Gene McCann,1901,13,10,2,0,0,0,0,0,2,4,0,1,0
15398,denzero01,Roger Denzer,1901,25,22,0,2,1,0,0,1,2,8,1,0,0
30788,jennihu01,Hughie Jennings,1901,345,302,38,79,21,2,1,39,25,25,12,6,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42312,mileywa01,Wade Miley,2021,59,54,9,10,3,0,0,3,3,13,0,2,0
10254,castier01,Erick Castillo,2021,9,8,0,2,0,0,0,0,1,1,0,0,0
18738,feltnry01,Ryan Feltner,2021,1,1,0,0,0,0,0,0,0,1,0,0,0
22364,gittech01,Chris Gittens,2021,44,36,1,4,0,0,1,5,7,13,0,0,1


Capture the first five seasons for every player who has played at least five seasons.

In [18]:
first_five_seasons = pd.DataFrame({'ID': pd.Series(dtype='str'), 
                                  'Player': pd.Series(dtype='str'), 
                                  'Season': pd.Series(dtype='int'),
                                  'PA': pd.Series(dtype='int'),
                                  'AB': pd.Series(dtype='int'),
                                  'R': pd.Series(dtype='int'),
                                  'H': pd.Series(dtype='int'),
                                  '2B': pd.Series(dtype='int'),
                                  '3B': pd.Series(dtype='int'),
                                  'HR': pd.Series(dtype='int'),
                                  'RBI': pd.Series(dtype='int'),
                                  'BB': pd.Series(dtype='int'),
                                  'SO': pd.Series(dtype='int'),
                                  'HBP': pd.Series(dtype='int'),
                                  'SH': pd.Series(dtype='int'),
                                  'SF': pd.Series(dtype='int')
                                  })

first_n_seasons = 5

for player_id in player_names['ID']:
    n_seasons = seasonal_sorted[seasonal_sorted['ID'] == player_id][0:first_n_seasons]
    
    if len(n_seasons) == first_n_seasons:
        first_five_seasons = pd.concat([first_five_seasons, n_seasons], ignore_index = True)

first_five_seasons

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
0,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
1,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
2,aaronha01,Henry Aaron,1956,660,609,106,200,34,14,26,92,37,54,2,5,7
3,aaronha01,Henry Aaron,1957,400,372,71,130,17,4,29,78,26,34,0,0,2
4,aaronha01,Henry Aaron,1958,664,601,109,196,34,4,30,95,59,49,1,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27910,zuverge01,George Zuverink,1954,76,66,2,9,1,0,0,3,1,15,0,9,0
27911,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
27912,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
27913,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
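
The loop above grows the dataframe with one `pd.concat` call per player, which can be slow on tens of thousands of seasonal rows. A possible vectorized sketch is shown below on invented data. It assumes each (ID, Season) pair appears only once, as is the case after the seasonal grouping, and it is not the code the notebook actually ran.

```python
import pandas as pd

# Toy seasonal records (invented): a01 has 6 seasons, b01 has only 2.
seasons = pd.DataFrame({
    "ID":     ["a01"] * 6 + ["b01"] * 2,
    "Season": [1995, 1991, 1992, 1990, 1994, 1993, 2001, 2000],
    "H":      [50, 10, 20, 5, 40, 30, 7, 8],
})

first_n_seasons = 5

ordered = seasons.sort_values(["ID", "Season"])
# Number of seasons each player has on record, repeated on every row.
n_played = ordered.groupby("ID")["Season"].transform("size")
# Keep only players with enough seasons, then take their earliest n seasons.
first_five = (
    ordered[n_played >= first_n_seasons]
    .groupby("ID")
    .head(first_n_seasons)
    .reset_index(drop=True)
)
print(first_five)
```

Here `b01` is dropped for having fewer than five seasons, and `a01` keeps 1990 through 1994 only.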


Sum the statistics for the first five seasons.

In [19]:
five_season_career_stats = first_five_seasons.copy()
del five_season_career_stats['Season']
five_season_career_stats = five_season_career_stats.groupby(['ID','Player']).sum()
five_season_career_stats

ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
aaronha01,Henry Aaron,2898,2652,450,846,149,37,125,440,199,237,9,18,20
aaronto01,Tommie Aaron,923,828,93,191,38,6,11,84,79,130,0,9,6
abbotje01,Jeff Abbott,651,596,82,157,33,2,18,83,38,91,3,5,7
abbotku01,Kurt Abbott,1528,1398,181,358,72,19,43,165,91,395,14,18,7
aberal01,Al Aber,107,91,5,12,0,0,0,5,7,37,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zoskyed01,Eddie Zosky,53,50,4,8,1,2,0,3,1,13,0,1,1
zuberbi01,Bill Zuber,50,46,1,3,0,1,0,0,1,12,0,3,0
zuninmi01,Mike Zunino,1682,1512,169,316,68,2,75,197,114,564,39,8,9
zuvelpa01,Paul Zuvella,303,269,20,57,10,1,0,7,25,22,1,8,0


### 4. Calculate Statistics for Five-Season Careers

Below are the statistical formulae (and simplifications) for our calculated statistics for:
- batting average (`'AVG'`)
- slugging percentage (`'SLG'`)
- on base percentage (`'OBP'`)
- on base percentage plus slugging (`'OPS'`)

#### Our Calculated Statistics Formulae

_AVG (float) = H / AB_


_SLG (float) = total_bases / AB_

- _total_bases (int) =_ 1 * 1B + 2 * 2B + 3 * 3B + 4 * HR
- _1B (int) =_ H - (2B + 3B + HR)


_OBP (float) = (H + BB + HBP)/(AB + BB + HBP + SF)_


_OPS (float) = SLG + OBP_

##### Simplification of SLG formula


SLG = 
- total_bases / AB
- ( 1 * 1B                       + 2 * 2B + 3 * 3B + 4 * HR ) / AB
- ( 1 * \[ H - (2B + 3B + HR) \] + 2 * 2B + 3 * 3B + 4 * HR ) / AB
- ( H - 2B - 3B - HR             + 2 * 2B + 3 * 3B + 4 * HR ) / AB
- ( H - 2B + 2 * 2B - 3B + 3 * 3B - HR + 4 * HR ) / AB
- ( H + (-2B + 2\*2B) + (-3B + 3\*3B) + (-HR + 4\*HR) ) / AB
- ( H + 2B + 2 * 3B + 3 * HR ) / AB
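
As a quick check on the simplification, the sketch below evaluates both forms of the SLG formula, along with AVG, OBP and OPS, using the five-season totals listed earlier in this notebook for Henry Aaron (`aaronha01`). The variable names are chosen for illustration only.

```python
# Henry Aaron's first-five-season totals, as listed in the summed table.
H, AB = 846, 2652
doubles, triples, HR = 149, 37, 125
BB, HBP, SF = 199, 9, 20

# SLG via total bases, with singles derived from hits.
singles = H - (doubles + triples + HR)
total_bases = 1 * singles + 2 * doubles + 3 * triples + 4 * HR
slg_full = total_bases / AB

# SLG via the simplified numerator.
slg_simple = (H + doubles + 2 * triples + 3 * HR) / AB

avg = H / AB
obp = (H + BB + HBP) / (AB + BB + HBP + SF)
ops = slg_simple + obp

print(slg_full == slg_simple)  # True
print(round(avg, 6), round(slg_simple, 6), round(obp, 6), round(ops, 6))
```

Both SLG forms share the numerator 1444, so they agree exactly, and the rounded values come out to 0.319005, 0.544495, 0.365972 and 0.910467.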

Time to calculate the statistics (all real values) for all remaining players, adding new columns for each.

In [20]:
stats = five_season_career_stats  # alias to the same dataframe, for readability
stats['AVG'] = stats['H'] / stats['AB']  # batting average
stats['SLG'] = (stats['H'] + stats['2B'] + 2*stats['3B'] + 3*stats['HR']) / stats['AB']  # simplified SLG
stats['OBP'] = (stats['H'] + stats['BB'] + stats['HBP']) / (stats['AB'] + stats['BB'] + stats['HBP'] + stats['SF'])
stats['OPS'] = stats['SLG'] + stats['OBP']  # on-base plus slugging

five_season_career_stats

ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
aaronha01,Henry Aaron,2898,2652,450,846,149,37,125,440,199,237,9,18,20,0.319005,0.544495,0.365972,0.910467
aaronto01,Tommie Aaron,923,828,93,191,38,6,11,84,79,130,0,9,6,0.230676,0.330918,0.295728,0.626646
abbotje01,Jeff Abbott,651,596,82,157,33,2,18,83,38,91,3,5,7,0.263423,0.416107,0.307453,0.723561
abbotku01,Kurt Abbott,1528,1398,181,358,72,19,43,165,91,395,14,18,7,0.256080,0.427039,0.306623,0.733661
aberal01,Al Aber,107,91,5,12,0,0,0,5,7,37,0,8,1,0.131868,0.131868,0.191919,0.323787
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zoskyed01,Eddie Zosky,53,50,4,8,1,2,0,3,1,13,0,1,1,0.160000,0.260000,0.173077,0.433077
zuberbi01,Bill Zuber,50,46,1,3,0,1,0,0,1,12,0,3,0,0.065217,0.108696,0.085106,0.193802
zuninmi01,Mike Zunino,1682,1512,169,316,68,2,75,197,114,564,39,8,9,0.208995,0.405423,0.280167,0.685591
zuvelpa01,Paul Zuvella,303,269,20,57,10,1,0,7,25,22,1,8,0,0.211896,0.256506,0.281356,0.537862


Because we've completed a number of calculations, we should look for null or NaN values. (Because we crafted this data in previous steps, we know that any null values detected are newly introduced.)

In [21]:
five_season_career_stats.isnull().values.any()

False
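
Had the check returned `True`, the next step would be to find where the missing values are. A minimal sketch on invented data, where a player with zero at-bats produces a `0 / 0` division:

```python
import pandas as pd

# Toy totals (invented): the second player has no at-bats at all.
stats = pd.DataFrame({"H": [10, 0], "AB": [40, 0]})
stats["AVG"] = stats["H"] / stats["AB"]  # 0 / 0 yields NaN in pandas

# Per-column count of missing values, then the affected rows.
null_counts = stats.isnull().sum()
bad_rows = stats[stats.isnull().any(axis=1)]
print(null_counts)
print(bad_rows)
```

The per-column counts show which calculated statistic is affected, and the row filter shows which players would need to be dropped or handled separately.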

There were no null or NaN values in the dataframe, so we can proceed. First, let's reset the index.

In [22]:
five_season_career_stats = five_season_career_stats.reset_index()
five_season_career_stats

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aaronha01,Henry Aaron,2898,2652,450,846,149,37,125,440,199,237,9,18,20,0.319005,0.544495,0.365972,0.910467
1,aaronto01,Tommie Aaron,923,828,93,191,38,6,11,84,79,130,0,9,6,0.230676,0.330918,0.295728,0.626646
2,abbotje01,Jeff Abbott,651,596,82,157,33,2,18,83,38,91,3,5,7,0.263423,0.416107,0.307453,0.723561
3,abbotku01,Kurt Abbott,1528,1398,181,358,72,19,43,165,91,395,14,18,7,0.256080,0.427039,0.306623,0.733661
4,aberal01,Al Aber,107,91,5,12,0,0,0,5,7,37,0,8,1,0.131868,0.131868,0.191919,0.323787
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5578,zoskyed01,Eddie Zosky,53,50,4,8,1,2,0,3,1,13,0,1,1,0.160000,0.260000,0.173077,0.433077
5579,zuberbi01,Bill Zuber,50,46,1,3,0,1,0,0,1,12,0,3,0,0.065217,0.108696,0.085106,0.193802
5580,zuninmi01,Mike Zunino,1682,1512,169,316,68,2,75,197,114,564,39,8,9,0.208995,0.405423,0.280167,0.685591
5581,zuvelpa01,Paul Zuvella,303,269,20,57,10,1,0,7,25,22,1,8,0,0.211896,0.256506,0.281356,0.537862


### 5. Adding Hall of Fame Inductee Labelling

Next, let's label the data, marking whether each player is currently a member of the Major League Baseball Hall of Fame.

In [23]:
five_year_hof = pd.merge(hof, five_season_career_stats, on="ID", how="right")
five_year_hof

Unnamed: 0,ID,Inductee,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aaronha01,1.0,Henry Aaron,2898,2652,450,846,149,37,125,440,199,237,9,18,20,0.319005,0.544495,0.365972,0.910467
1,aaronto01,,Tommie Aaron,923,828,93,191,38,6,11,84,79,130,0,9,6,0.230676,0.330918,0.295728,0.626646
2,abbotje01,,Jeff Abbott,651,596,82,157,33,2,18,83,38,91,3,5,7,0.263423,0.416107,0.307453,0.723561
3,abbotku01,,Kurt Abbott,1528,1398,181,358,72,19,43,165,91,395,14,18,7,0.256080,0.427039,0.306623,0.733661
4,aberal01,,Al Aber,107,91,5,12,0,0,0,5,7,37,0,8,1,0.131868,0.131868,0.191919,0.323787
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5578,zoskyed01,,Eddie Zosky,53,50,4,8,1,2,0,3,1,13,0,1,1,0.160000,0.260000,0.173077,0.433077
5579,zuberbi01,,Bill Zuber,50,46,1,3,0,1,0,0,1,12,0,3,0,0.065217,0.108696,0.085106,0.193802
5580,zuninmi01,,Mike Zunino,1682,1512,169,316,68,2,75,197,114,564,39,8,9,0.208995,0.405423,0.280167,0.685591
5581,zuvelpa01,,Paul Zuvella,303,269,20,57,10,1,0,7,25,22,1,8,0,0.211896,0.256506,0.281356,0.537862


The first thing we notice is the NaN values under the new `'Inductee'` column. This is easy to address, as these are all non-inductees and can be replaced with a value of zero.

In [24]:
# Players without a match in the Hall of Fame list are non-inductees.
five_year_hof['Inductee'] = five_year_hof['Inductee'].fillna(0).astype(int)
five_year_hof

Unnamed: 0,ID,Inductee,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aaronha01,1,Henry Aaron,2898,2652,450,846,149,37,125,440,199,237,9,18,20,0.319005,0.544495,0.365972,0.910467
1,aaronto01,0,Tommie Aaron,923,828,93,191,38,6,11,84,79,130,0,9,6,0.230676,0.330918,0.295728,0.626646
2,abbotje01,0,Jeff Abbott,651,596,82,157,33,2,18,83,38,91,3,5,7,0.263423,0.416107,0.307453,0.723561
3,abbotku01,0,Kurt Abbott,1528,1398,181,358,72,19,43,165,91,395,14,18,7,0.256080,0.427039,0.306623,0.733661
4,aberal01,0,Al Aber,107,91,5,12,0,0,0,5,7,37,0,8,1,0.131868,0.131868,0.191919,0.323787
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5578,zoskyed01,0,Eddie Zosky,53,50,4,8,1,2,0,3,1,13,0,1,1,0.160000,0.260000,0.173077,0.433077
5579,zuberbi01,0,Bill Zuber,50,46,1,3,0,1,0,0,1,12,0,3,0,0.065217,0.108696,0.085106,0.193802
5580,zuninmi01,0,Mike Zunino,1682,1512,169,316,68,2,75,197,114,564,39,8,9,0.208995,0.405423,0.280167,0.685591
5581,zuvelpa01,0,Paul Zuvella,303,269,20,57,10,1,0,7,25,22,1,8,0,0.211896,0.256506,0.281356,0.537862
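
The same labelling can be sketched without a merge, assuming every row of the Hall of Fame list marks an inductee (as in the rows displayed when the file was loaded). The dataframes below are invented stand-ins, not the project data.

```python
import pandas as pd

# Toy stand-ins (invented) for the inductee list and the career statistics.
hof = pd.DataFrame({"ID": ["a01"], "Inductee": [1]})
careers = pd.DataFrame({"ID": ["a01", "b01", "c01"], "H": [846, 191, 157]})

# 1 if the player's ID appears in the inductee list, otherwise 0.
careers["Inductee"] = careers["ID"].isin(hof["ID"]).astype(int)
print(careers)
```

This produces an integer label directly, so no NaN replacement or type conversion is needed afterwards.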


Let's just do a sanity check for any null or NaN values in the dataframe.

In [25]:
five_year_hof.isnull().values.any()

False

All that's left to do before splitting the data up is reordering the columns.

In [26]:
columns_hof = ['ID', 'Player', 
               'PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF', 
               'AVG', 'SLG', 'OBP', 'OPS', 
               'Inductee']

five_year_hof = five_year_hof[columns_hof]
five_year_hof

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,aaronha01,Henry Aaron,2898,2652,450,846,149,37,125,440,199,237,9,18,20,0.319005,0.544495,0.365972,0.910467,1
1,aaronto01,Tommie Aaron,923,828,93,191,38,6,11,84,79,130,0,9,6,0.230676,0.330918,0.295728,0.626646,0
2,abbotje01,Jeff Abbott,651,596,82,157,33,2,18,83,38,91,3,5,7,0.263423,0.416107,0.307453,0.723561,0
3,abbotku01,Kurt Abbott,1528,1398,181,358,72,19,43,165,91,395,14,18,7,0.256080,0.427039,0.306623,0.733661,0
4,aberal01,Al Aber,107,91,5,12,0,0,0,5,7,37,0,8,1,0.131868,0.131868,0.191919,0.323787,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5578,zoskyed01,Eddie Zosky,53,50,4,8,1,2,0,3,1,13,0,1,1,0.160000,0.260000,0.173077,0.433077,0
5579,zuberbi01,Bill Zuber,50,46,1,3,0,1,0,0,1,12,0,3,0,0.065217,0.108696,0.085106,0.193802,0
5580,zuninmi01,Mike Zunino,1682,1512,169,316,68,2,75,197,114,564,39,8,9,0.208995,0.405423,0.280167,0.685591,0
5581,zuvelpa01,Paul Zuvella,303,269,20,57,10,1,0,7,25,22,1,8,0,0.211896,0.256506,0.281356,0.537862,0


### 6. Complete the Data Splitting by Creating Dataframes

With the career data in order and labelled with the Hall of Fame induction data, it is time to use the lists we created earlier in the notebook and split the data into our arbitrarily grouped dataframes.

All players who started their careers before the year 2000 will be in the first dataframe, `pre_2000`, and all players who started their careers in 2000 or later (some of whom will be currently active players) will be in the second dataframe, `from_2000`.

We already have our list of names from earlier work, so now it is a matter of merging data to make new dataframes specific to the chosen eras.

First, our pre-2000 players.

In [27]:
pre_2000 = pd.merge(five_year_hof, career_pre_2000, on='ID', how='inner')
pre_2000

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,aaronha01,Henry Aaron,2898,2652,450,846,149,37,125,440,199,237,9,18,20,0.319005,0.544495,0.365972,0.910467,1
1,aaronto01,Tommie Aaron,923,828,93,191,38,6,11,84,79,130,0,9,6,0.230676,0.330918,0.295728,0.626646,0
2,abbotje01,Jeff Abbott,651,596,82,157,33,2,18,83,38,91,3,5,7,0.263423,0.416107,0.307453,0.723561,0
3,abbotku01,Kurt Abbott,1528,1398,181,358,72,19,43,165,91,395,14,18,7,0.256080,0.427039,0.306623,0.733661,0
4,aberal01,Al Aber,107,91,5,12,0,0,0,5,7,37,0,8,1,0.131868,0.131868,0.191919,0.323787,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4274,zoldasa01,Sam Zoldak,209,195,11,33,5,0,0,6,7,39,0,7,0,0.169231,0.194872,0.198020,0.392892,0
4275,zoskyed01,Eddie Zosky,53,50,4,8,1,2,0,3,1,13,0,1,1,0.160000,0.260000,0.173077,0.433077,0
4276,zuberbi01,Bill Zuber,50,46,1,3,0,1,0,0,1,12,0,3,0,0.065217,0.108696,0.085106,0.193802,0
4277,zuvelpa01,Paul Zuvella,303,269,20,57,10,1,0,7,25,22,1,8,0,0.211896,0.256506,0.281356,0.537862,0


Let's reduce the dataset by requiring that a player have at least 2000 plate appearances over the five seasons.

In [28]:
pre_2000 = pre_2000[pre_2000['PA'] >= 2000].copy()
pre_2000

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,aaronha01,Henry Aaron,2898,2652,450,846,149,37,125,440,199,237,9,18,20,0.319005,0.544495,0.365972,0.910467,1
8,abreubo01,Bobby Abreu,2165,1829,312,572,117,29,65,273,316,413,5,4,11,0.312739,0.515036,0.413235,0.928270,0
13,adamsbu01,Buster Adams,2084,1821,259,487,85,11,48,234,208,255,13,42,0,0.267435,0.405272,0.346719,0.751991,0
16,adamssp01,Sparky Adams,2263,2025,300,591,87,17,7,148,167,75,14,40,0,0.291852,0.361975,0.349955,0.711930,0
18,adcocjo01,Joe Adcock,2425,2235,273,632,114,20,72,322,156,232,7,23,4,0.282774,0.448322,0.330974,0.779296,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4252,younger01,Eric Young Sr.,2051,1784,306,528,74,22,25,193,205,127,32,18,12,0.295964,0.404148,0.376291,0.780439,0
4259,yountro01,Robin Yount,2869,2647,306,717,118,23,26,252,148,307,5,47,22,0.270873,0.362297,0.308292,0.670589,1
4265,zarilal01,Al Zarilla,2014,1796,224,499,88,25,27,216,161,183,15,42,0,0.277840,0.399777,0.342292,0.742069,0
4268,zeileto01,Todd Zeile,2462,2152,278,571,118,12,51,297,276,331,7,1,26,0.265335,0.402416,0.347013,0.749430,0


Then finally, our 2000 onward players.

In [29]:
from_2000 = pd.merge(five_year_hof, career_from_2000, on='ID', how='inner')
from_2000

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,abreujo02,José Abreu,3213,2913,398,858,180,13,146,488,209,624,67,0,24,0.294542,0.515620,0.352941,0.868561,0
1,abreuto01,Tony Abreu,611,575,60,147,39,6,6,60,22,116,5,1,8,0.255652,0.375652,0.285246,0.660898,0
2,acevejo01,Jose Acevedo,120,101,2,8,2,0,0,4,5,61,1,12,1,0.079208,0.099010,0.129630,0.228640,0
3,ackledu01,Dustin Ackley,2277,2064,255,503,94,18,46,212,186,410,5,11,11,0.243702,0.373547,0.306267,0.679813,0
4,adamecr01,Cristhian Adames,367,328,31,70,9,4,2,22,30,77,5,4,0,0.213415,0.283537,0.289256,0.572793,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1299,zimmejo02,Jordan Zimmermann,239,202,11,34,4,0,1,11,8,58,0,28,1,0.168317,0.202970,0.199052,0.402022,0
1300,zimmery01,Ryan Zimmerman,2626,2363,350,672,161,12,91,364,228,447,10,1,23,0.284384,0.478206,0.346799,0.825004,0
1301,zitoba01,Barry Zito,28,26,0,1,0,0,0,0,0,13,0,2,0,0.038462,0.038462,0.038462,0.076923,0
1302,zobribe01,Ben Zobrist,1784,1520,217,384,74,13,52,223,221,295,8,12,23,0.252632,0.421053,0.345937,0.766989,0


And, again, reducing the dataset by requiring at least 2000 plate appearances over five seasons.

In [30]:
from_2000 = from_2000[from_2000['PA'] >= 2000].copy()
from_2000

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,abreujo02,José Abreu,3213,2913,398,858,180,13,146,488,209,624,67,0,24,0.294542,0.515620,0.352941,0.868561,0
3,ackledu01,Dustin Ackley,2277,2064,255,503,94,18,46,212,186,410,5,11,11,0.243702,0.373547,0.306267,0.679813,0
13,albieoz01,Ozzie Albies,2440,2243,365,613,137,25,90,311,163,422,16,2,16,0.273295,0.477040,0.324856,0.801896,0
22,altuvjo01,Jose Altuve,2932,2721,341,830,162,14,36,226,146,308,24,17,24,0.305035,0.414553,0.343053,0.757607,0
24,alvarpe01,Pedro Alvarez,2293,2063,240,484,90,6,104,324,211,678,9,1,9,0.234610,0.435288,0.307155,0.742444,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1289,youklke01,Kevin Youkilis,2269,1922,325,555,138,8,66,314,277,397,42,0,28,0.288762,0.471904,0.385192,0.857096,0
1291,youngch04,Chris Young,2566,2281,328,549,136,14,98,296,244,596,14,11,16,0.240684,0.441473,0.315851,0.757324,0
1292,youngde03,Delmon Young,2464,2311,288,675,137,8,59,344,102,429,23,1,27,0.292081,0.434877,0.324807,0.759684,0
1295,youngmi02,Michael Young,2516,2317,353,666,110,30,56,282,147,396,5,25,22,0.287441,0.433319,0.328382,0.761701,0
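
A small sketch of applying the same "at least 2000 plate appearances" rule to both dataframes through one helper, so the threshold is defined in a single place. The names `MIN_PA` and `with_min_pa` are hypothetical and the data is invented.

```python
import pandas as pd

MIN_PA = 2000  # minimum plate appearances over the five seasons

def with_min_pa(frame, min_pa=MIN_PA):
    """Return a copy of the rows with at least `min_pa` plate appearances."""
    return frame[frame["PA"] >= min_pa].copy()

# Toy data (invented), including a player with exactly 2000 plate appearances.
toy = pd.DataFrame({"ID": ["a01", "b01", "c01"], "PA": [1999, 2000, 2898]})
print(with_min_pa(toy))
```

The boundary case matters: a player with exactly 2000 plate appearances is kept by `>= 2000` and by `> 1999`, but dropped by `> 2000`.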


## It's Time to Save the Career & Hall of Fame Data

Let's capture these three dataframes and have them be ready for Step 3's Notebook.

In [32]:
import os

# Create the data folder if it does not already exist.
os.makedirs('./data', exist_ok=True)
    
alldata_csv = "./data/step2_alldata.csv"
five_year_hof.to_csv(alldata_csv, index=False)

pre_2000_csv = "./data/step2_pre_2000.csv"
pre_2000.to_csv(pre_2000_csv, index=False)

from_2000_csv = "./data/step2_from_2000.csv"
from_2000.to_csv(from_2000_csv, index=False)
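
Before moving on to Step 3, it can be worth confirming that a saved file reloads with the same shape. The sketch below does a round trip through a temporary folder with invented data; the file name `roundtrip.csv` is illustrative only.

```python
import os
import tempfile

import pandas as pd

# Toy labelled data (invented) standing in for one of the saved dataframes.
frame = pd.DataFrame({
    "ID": ["a01", "b01"],
    "AVG": [0.319005, 0.230676],
    "Inductee": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.csv")
    frame.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(frame.columns)
same_rows = len(reloaded) == len(frame)
print(same_columns, same_rows)
```

Writing with `index=False`, as the notebook does, is what keeps an `Unnamed: 0` column from appearing when the file is read back.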

## Concluding Notebook Comments

**Note:** At this point, we will conclude this notebook for organizational purposes.

Saving the data files in various states makes it easier to re-run parts of the overall project without having to re-run every aspect.

The **purpose of this notebook** is to prepare the Hall of Fame and Career MLB data for use in the next notebook.

**The *next* notebook in the series is: `harr2890_project_step3_hof_modelling`,** where the saved data files will be loaded and experimentation and modelling will take place using **the Hall of Fame Approach**.