# Using Historical Data to Predict Batting Success: Step 4

Authored by: Donna J. Harris (994042890)

Email: harr2890@mylaurier.ca

For: CP640 Machine Learning (S22) with Professor Elham Harirpoush

## Notebook Series

Just a word about the presentation of this project code.

The code is organized into a series of locally executed Jupyter notebooks, one per step, which must be executed in sequence. This is `harr2890_project_step4_ops_data_prep`, the Step 4 notebook in the series.

## *Step 4 - Data Preparation for an OPS Approach*

This notebook encompasses a third phase of data preparation, following the preparation in the Step 1 Notebook. From here, we continue structuring and splitting the data into the state in which the experiments and modelling will be conducted, based on an On-base Plus Slugging (OPS) approach.

Here, we will be extracting data and generating multiple seasons of the OPS statistic, preparing the data for exploration and modelling based on various **regression** techniques in a subsequent notebook.

We will also be splitting the data, based on the year 2000, so that later on in the process we can run controlled tests on unseen data which can be used to manually evaluate the results and predictions of the modelling.

## Environment Setup

Import and establish environment for our work, including showing all dataframe column values.

In [1]:
import pandas as pd

pd.set_option('display.max_columns', None)

### Pre-Conditions

Step 1 must be run completely before running this notebook.

The `data` folder must exist with the following prepared data file:
- `./data/core_mlb_dataset.csv`

## Loading Prepared Data Files

Load in the Major League Baseball batting data (`./data/core_mlb_dataset.csv`) so we can continue with preparing this data.


In [2]:
core_mlb_dataset = "./data/core_mlb_dataset.csv"
df = pd.read_csv(core_mlb_dataset)
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
0,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
1,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
2,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
3,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
4,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3715517,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
3715518,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
3715519,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
3715520,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021


## Preprocessing (Continued from the Step 1 Notebook)

### The OPS Approach

A statistic that has become a very important measurement of a batter's productivity and efficacy at the plate is the On-base Plus Slugging (OPS) statistic, which (as the name implies) is the sum of two other batting statistics:  On-Base Percentage (OBP) and Slugging Percentage (SLG). Generally speaking, an OPS value that is close to (or over) 1.000 indicates a very good batting performance.

Because the overarching goal is to use Major League Baseball data to predict batting success, predicting OPS is a promising way to estimate a batter's future success at the plate.

By using OPS values from multiple seasons, we should be able to use **regression** techniques to predict the future OPS values of a batter.

In order to prepare the data for this approach, we will do the following:
1.  Generate seasonal batting statistics for player data.
2.  Arbitrarily split the data, based on the year 2000. (Get the names of these players.)
3.  Filter out players with minimal batting data.
4.  Create dataframes based on the calculated batting data, split by the lists of names gathered when splitting the data.

The end result will be two labelled dataframes containing player OPS data for:
- players whose career began before 2000
- players whose career began during or after 2000

These will be stored and used in the Step 5 Notebook work. 

### 1. Seasonal Batting Statistics

First, we group the game-based statistics by player and season.

In [47]:
filterable = ['ID', 'Player', 'Season']
# Only the numeric statistics are summed; the grouping keys ('ID', 'Player', 'Season') become the index.
columns = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']
group_alldata = df.groupby(filterable)
group_alldata = group_alldata[columns].sum().copy()
group_alldata

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0


Then, we'll remove the grouping so we have independent seasonal statistic records for each player's season.

In [48]:
seasonal_data = group_alldata.reset_index()
seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0
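
As a small illustration of the grouping step, the sketch below sums a few game rows into season totals. The rows are the first Ed Delahanty and Joe Dolan games shown in the loaded data above, trimmed to three statistic columns. The names `games` and `season_totals` are illustrative only, and passing `as_index=False` is an alternative to the `reset_index()` call used above that keeps the grouping keys as ordinary columns.

```python
import pandas as pd

# Three game rows taken from the head of the loaded dataset (trimmed to PA, AB, H).
games = pd.DataFrame({
    'ID':     ['delahed01', 'dolanjo02', 'delahed01'],
    'Player': ['Ed Delahanty', 'Joe Dolan', 'Ed Delahanty'],
    'Season': [1901, 1901, 1901],
    'PA':     [5, 5, 4],
    'AB':     [4, 5, 4],
    'H':      [2, 1, 0],
})

# Sum the per-game statistics for each player and season.
# as_index=False returns 'ID', 'Player' and 'Season' as regular columns.
season_totals = games.groupby(['ID', 'Player', 'Season'], as_index=False)[['PA', 'AB', 'H']].sum()
print(season_totals)
```

Each player-season pair collapses to one row, so the two Delahanty games become a single 1901 record.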


Next, because we're interested in seasonal statistics -- especially OPS -- we need to calculate those values seasonally for each player.

**Note:** The statement and simplification of these statistical calculations are outlined in detail in the ***Step 2 Notebook***, section 4.

In [49]:
# Batting average: hits per at-bat.
seasonal_data['AVG'] = seasonal_data['H'] / seasonal_data['AB']
# Slugging: total bases per at-bat. 'H' counts every hit once, so doubles, triples and home runs add 1, 2 and 3 extra bases.
seasonal_data['SLG'] = (seasonal_data['H'] + seasonal_data['2B'] + 2*seasonal_data['3B'] + 3*seasonal_data['HR']) / seasonal_data['AB']
# On-base percentage: times on base per (AB + BB + HBP + SF).
seasonal_data['OBP'] = (seasonal_data['H'] + seasonal_data['BB'] + seasonal_data['HBP']) / (seasonal_data['AB'] + seasonal_data['BB'] + seasonal_data['HBP'] + seasonal_data['SF'])
seasonal_data['OPS'] = seasonal_data['SLG'] + seasonal_data['OBP']

seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,0.279915,0.446581,0.322068,0.768649
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,0.313953,0.539867,0.366261,0.906129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,0.185185,0.222222,0.214286,0.436508
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,0.117647,0.117647,0.166667,0.284314
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,0.071429,0.071429,0.133333,0.204762
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,0.222222,0.444444,0.300000,0.744444
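
As a sanity check on the formulas used in this cell, a single row can be recomputed by hand. The sketch below uses Henry Aaron's 1954 line from the table above. The helper `ops_components` is a hypothetical name introduced only for this illustration and is not part of the notebook's pipeline.

```python
def ops_components(h, doubles, triples, hr, ab, bb, hbp, sf):
    """Return (AVG, SLG, OBP, OPS) for one season of counting statistics."""
    avg = h / ab
    # 'h' already counts every hit once, so extra-base hits add 1, 2 or 3 bases on top.
    slg = (h + doubles + 2 * triples + 3 * hr) / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    return avg, slg, obp, slg + obp

# Henry Aaron, 1954: H=131, 2B=27, 3B=6, HR=13, AB=468, BB=28, HBP=3, SF=4.
avg, slg, obp, ops = ops_components(h=131, doubles=27, triples=6, hr=13,
                                    ab=468, bb=28, hbp=3, sf=4)
print(f"AVG={avg:.6f} SLG={slg:.6f} OBP={obp:.6f} OPS={ops:.6f}")
```

The printed values should agree with the `AVG`, `SLG`, `OBP` and `OPS` columns of that row (0.279915, 0.446581, 0.322068 and 0.768649).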


Because we've completed a number of calculations, we should look for null or NaN values. (Because we crafted this data in previous steps, we know that any null values detected are newly introduced.)

In [50]:
seasonal_data.isnull().values.any()

True

As anticipated, there are null values from those calculations, so let's find them and resolve the issues.

Let's look at `'AVG'`...

In [51]:
avg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['AVG'])]
avg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
75,abernte01,Ted Abernathy,1942,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,
1361,anderla02,Larry Andersen,1994,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
1445,andrena01,Nate Andrews,1937,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,
69326,yeabsbe01,Bert Yeabsley,1919,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
69766,zannido01,Dom Zanni,1959,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
69807,zavadcl01,Clay Zavada,2009,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,


... and resolve by setting to 0.0, then confirm.

In [52]:
series = avg_is_nan.index
seasonal_data.loc[series, 'AVG'] = 0.0

avg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['AVG'])]
avg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


`'AVG'` is addressed, so let's look at `'SLG'`...

In [53]:
slg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['SLG'])]
slg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
75,abernte01,Ted Abernathy,1942,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,
1361,anderla02,Larry Andersen,1994,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
1445,andrena01,Nate Andrews,1937,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,
69326,yeabsbe01,Bert Yeabsley,1919,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
69766,zannido01,Dom Zanni,1959,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
69807,zavadcl01,Clay Zavada,2009,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,


... and resolve by setting to 0.0, then confirm.

In [54]:
series = slg_is_nan.index
seasonal_data.loc[series, 'SLG'] = 0.0

slg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['SLG'])]
slg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


`'SLG'` is addressed, so let's look at `'OBP'`...

In [55]:
obp_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OBP'])]
obp_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
1847,arroylu01,Luis Arroyo,1963,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
2179,avilalu01,Luis Avilan,2013,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
2180,avilalu01,Luis Avilan,2014,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68213,wilsoga03,Gary Wilson,1995,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
68486,winklda01,Dan Winkler,2019,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
69153,wuertmi01,Michael Wuertz,2006,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,


... and resolve by setting to 0.0, then confirm.

In [56]:
series = obp_is_nan.index
seasonal_data.loc[series, 'OBP'] = 0.0

obp_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OBP'])]
obp_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


`'OBP'` is addressed, so, let's finish this up and look at `'OPS'`...

In [57]:
ops_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OPS'])]
ops_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
75,abernte01,Ted Abernathy,1942,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,
1361,anderla02,Larry Andersen,1994,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
1445,andrena01,Nate Andrews,1937,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,
69326,yeabsbe01,Bert Yeabsley,1919,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
69766,zannido01,Dom Zanni,1959,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
69807,zavadcl01,Clay Zavada,2009,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,


... and resolve by recalculating `'OPS'` as `'SLG'` + `'OBP'`...

In [58]:
series = ops_is_nan.index

# Recompute OPS for the affected rows in one vectorized assignment.
seasonal_data.loc[series, 'OPS'] = seasonal_data.loc[series, 'SLG'] + seasonal_data.loc[series, 'OBP']


... and then confirm.

In [59]:
ops_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OPS'])]
ops_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


And lastly, let's confirm we addressed all of the issues.

In [60]:
seasonal_data.isnull().values.any()

False

Success. The calculated statistics are now filled out in the `seasonal_data` table.

In [61]:
seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,0.279915,0.446581,0.322068,0.768649
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,0.313953,0.539867,0.366261,0.906129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,0.185185,0.222222,0.214286,0.436508
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,0.117647,0.117647,0.166667,0.284314
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,0.071429,0.071429,0.133333,0.204762
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,0.222222,0.444444,0.300000,0.744444
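
The null values above come from seasons with zero at-bats, where the statistic works out to 0/0, which pandas reports as NaN. The sketch below reproduces that on three toy rows modelled on cases seen above (a walk-only season, a sacrifice-only season, and George Zuverink's 1958 line) and shows a more compact clean-up, assuming the same policy of filling missing rates with 0.0 before adding them. The `toy` dataframe is illustrative and is not the notebook's data.

```python
import pandas as pd

toy = pd.DataFrame({
    'AB':  [0, 0, 9],
    'H':   [0, 0, 2],
    '2B':  [0, 0, 0],
    '3B':  [0, 0, 1],
    'HR':  [0, 0, 0],
    'BB':  [1, 0, 1],
    'HBP': [0, 0, 0],
    'SF':  [0, 0, 0],
})

# Zero at-bats gives 0/0, which pandas evaluates to NaN rather than raising an error.
toy['AVG'] = toy['H'] / toy['AB']
toy['SLG'] = (toy['H'] + toy['2B'] + 2 * toy['3B'] + 3 * toy['HR']) / toy['AB']
toy['OBP'] = (toy['H'] + toy['BB'] + toy['HBP']) / (toy['AB'] + toy['BB'] + toy['HBP'] + toy['SF'])

# Fill the undefined rates with 0.0 first, then compute OPS from the cleaned columns.
rate_columns = ['AVG', 'SLG', 'OBP']
toy[rate_columns] = toy[rate_columns].fillna(0.0)
toy['OPS'] = toy['SLG'] + toy['OBP']
print(toy[['AVG', 'SLG', 'OBP', 'OPS']])
```

Filling before summing means `OPS` never needs a separate repair pass, because it is only computed once its inputs are complete.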


### 2. Gathering a Decade of Statistics

We start from the seasonal data, from which we will get all player IDs and names for the new dataframe.

In [79]:
seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,0.279915,0.446581,0.322068,0.768649
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,0.313953,0.539867,0.366261,0.906129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,0.185185,0.222222,0.214286,0.436508
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,0.117647,0.117647,0.166667,0.284314
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,0.071429,0.071429,0.133333,0.204762
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,0.222222,0.444444,0.300000,0.744444


Make sure the data is sorted by earliest season played.


In [80]:
seasonal_sorted = seasonal_data.sort_values(by=['Season'], ascending=True)
seasonal_sorted

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
21668,gatinfr01,Frank Gatins,1901,207,198,21,45,7,2,1,21,5,27,2,2,0,0.227273,0.297980,0.253659,0.551638
7533,brownfr01,Fred Brown,1901,15,14,1,2,0,0,0,2,0,2,0,1,0,0.142857,0.142857,0.142857,0.285714
40448,mccange01,Gene McCann,1901,13,10,2,0,0,0,0,0,2,4,0,1,0,0.000000,0.000000,0.166667,0.166667
15398,denzero01,Roger Denzer,1901,25,22,0,2,1,0,0,1,2,8,1,0,0,0.090909,0.136364,0.200000,0.336364
30788,jennihu01,Hughie Jennings,1901,345,302,38,79,21,2,1,39,25,25,12,6,0,0.261589,0.354305,0.342183,0.696488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42312,mileywa01,Wade Miley,2021,59,54,9,10,3,0,0,3,3,13,0,2,0,0.185185,0.240741,0.228070,0.468811
10254,castier01,Erick Castillo,2021,9,8,0,2,0,0,0,0,1,1,0,0,0,0.250000,0.250000,0.333333,0.583333
18738,feltnry01,Ryan Feltner,2021,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
22364,gittech01,Chris Gittens,2021,44,36,1,4,0,0,1,5,7,13,0,0,1,0.111111,0.194444,0.250000,0.444444


In [81]:
decade_stats = seasonal_sorted.reset_index()
decade_stats = decade_stats[['ID', 'Player']].copy()
decade_stats = decade_stats.drop_duplicates(subset=['ID'], keep='first')
decade_stats

Unnamed: 0,ID,Player
0,gatinfr01,Frank Gatins
1,brownfr01,Fred Brown
2,mccange01,Gene McCann
3,denzero01,Roger Denzer
4,jennihu01,Hughie Jennings
...,...,...
70019,baragca01,Caleb Baragar
70024,castier01,Erick Castillo
70025,feltnry01,Ryan Feltner
70026,gittech01,Chris Gittens


In [82]:
decade_stats = decade_stats.reindex(columns = decade_stats.columns.tolist() 
                                  + [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
decade_stats

Unnamed: 0,ID,Player,1,2,3,4,5,6,7,8,9,10
0,gatinfr01,Frank Gatins,,,,,,,,,,
1,brownfr01,Fred Brown,,,,,,,,,,
2,mccange01,Gene McCann,,,,,,,,,,
3,denzero01,Roger Denzer,,,,,,,,,,
4,jennihu01,Hughie Jennings,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
70019,baragca01,Caleb Baragar,,,,,,,,,,
70024,castier01,Erick Castillo,,,,,,,,,,
70025,feltnry01,Ryan Feltner,,,,,,,,,,
70026,gittech01,Chris Gittens,,,,,,,,,,


Calculate the OPS for each player's first 10 seasons.

**NOTE:** This one takes a bit more time...

In [83]:
first_n_seasons = 10

for player_id in decade_stats['ID']:
    # seasonal_sorted is in chronological order, so the first rows are the player's earliest seasons.
    n_seasons = seasonal_sorted[seasonal_sorted['ID'] == player_id][0:first_n_seasons]

    # Only populate players who have the full number of seasons.
    if len(n_seasons) == first_n_seasons:
        ten_year_ops_list = n_seasons['OPS'].to_list()

        # Columns 1..10 hold the OPS for the player's 1st..10th season.
        for i, ops_value in enumerate(ten_year_ops_list, start=1):
            decade_stats.loc[decade_stats['ID'] == player_id, i] = ops_value

decade_stats

Unnamed: 0,ID,Player,1,2,3,4,5,6,7,8,9,10
0,gatinfr01,Frank Gatins,,,,,,,,,,
1,brownfr01,Fred Brown,,,,,,,,,,
2,mccange01,Gene McCann,,,,,,,,,,
3,denzero01,Roger Denzer,,,,,,,,,,
4,jennihu01,Hughie Jennings,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
70019,baragca01,Caleb Baragar,,,,,,,,,,
70024,castier01,Erick Castillo,,,,,,,,,,
70025,feltnry01,Ryan Feltner,,,,,,,,,,
70026,gittech01,Chris Gittens,,,,,,,,,,


The rows shown here are not populated because these players don't have ten or more seasons of data, and many players fall into this category.

To make this more evident, we'll remove rows with NaN values.

In [86]:
decade_stats = decade_stats.dropna().copy()

In [87]:
decade_stats

Unnamed: 0,ID,Player,1,2,3,4,5,6,7,8,9,10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417
...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493


Rename the OPS seasonal columns.

In [88]:
decade_stats.rename(columns = {1:'OPS Y1', 2:'OPS Y2', 3:'OPS Y3', 4:'OPS Y4', 5:'OPS Y5',
                            6:'OPS Y6', 7:'OPS Y7', 8:'OPS Y8', 9:'OPS Y9', 10:'OPS Y10'
                            }, inplace = True)


In [89]:
decade_stats

Unnamed: 0,ID,Player,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417
...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493
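
The per-player loop used above is easy to follow but slow on a large dataset. As an alternative, the same wide layout can be sketched with `groupby(...).cumcount()` and `pivot`. This is a minimal sketch on toy data, not the notebook's method: the helper `first_n_wide`, the `toy` dataframe and its values are assumptions made for illustration, and the result is ordered by `ID` rather than by first season played.

```python
import pandas as pd

# Toy stand-in for the sorted seasonal data (values are illustrative, not real statistics).
toy = pd.DataFrame({
    'ID':     ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'Season': [1990, 1991, 1992, 1991, 1992, 2000, 2001, 2003],
    'OPS':    [0.70, 0.75, 0.80, 0.60, 0.65, 0.90, 0.85, 0.95],
})

def first_n_wide(seasonal, stat, n):
    """Return one row per player holding `stat` for their first `n` seasons."""
    ordered = seasonal.sort_values(['ID', 'Season'])
    # Number each player's seasons 1, 2, 3, ... in chronological order.
    ordered = ordered.assign(year=ordered.groupby('ID').cumcount() + 1)
    first_n = ordered[ordered['year'] <= n]
    wide = first_n.pivot(index='ID', columns='year', values=stat)
    # Guarantee columns 1..n exist, then drop players with fewer than n seasons.
    wide = wide.reindex(columns=range(1, n + 1)).dropna()
    wide.columns = [f'{stat} Y{year}' for year in wide.columns]
    return wide.reset_index()

wide_ops = first_n_wide(toy, 'OPS', n=3)
print(wide_ops)
```

Here player `b` has only two seasons and is dropped, mirroring the `dropna()` step above. Calling the helper with `'AVG'` instead of `'OPS'` would produce the matching batting-average columns.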


Repeat for a decade of AVG data.

In [90]:
decade_stats = decade_stats.reindex(columns = decade_stats.columns.tolist() 
                                  + [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
decade_stats

Unnamed: 0,ID,Player,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10,1,2,3,4,5,6,7,8,9,10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333,,,,,,,,,,
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387,,,,,,,,,,
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333,,,,,,,,,,
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071,,,,,,,,,,
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489,,,,,,,,,,
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537,,,,,,,,,,
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754,,,,,,,,,,
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493,,,,,,,,,,


In [91]:
first_n_seasons = 10

for player_id in decade_stats['ID']:
    # seasonal_sorted is in chronological order, so the first rows are the player's earliest seasons.
    n_seasons = seasonal_sorted[seasonal_sorted['ID'] == player_id][0:first_n_seasons]

    # Only populate players who have the full number of seasons.
    if len(n_seasons) == first_n_seasons:
        ten_year_avg_list = n_seasons['AVG'].to_list()

        # Columns 1..10 hold the AVG for the player's 1st..10th season.
        for i, avg_value in enumerate(ten_year_avg_list, start=1):
            decade_stats.loc[decade_stats['ID'] == player_id, i] = avg_value

decade_stats

Unnamed: 0,ID,Player,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10,1,2,3,4,5,6,7,8,9,10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333,0.000000,0.076923,0.164179,0.190909,0.134146,0.268293,0.165217,0.282609,0.234375,0.200000
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387,0.250000,0.177215,0.216495,0.145631,0.222222,0.126437,0.181818,0.144330,0.170732,0.261364
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333,0.153846,0.076923,0.206522,0.114754,0.215190,0.203883,0.242105,0.295238,0.221239,0.254098
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071,0.000000,0.250000,0.163043,0.190909,0.225806,0.000000,0.174603,0.166667,0.181818,0.250000
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417,0.200000,0.315217,0.268116,0.325203,0.299685,0.319444,0.375546,0.377079,0.314496,0.394231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489,0.500000,0.196970,0.200000,0.243346,0.272277,0.259259,0.253165,0.228448,0.166667,0.052632
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537,0.204082,0.293515,0.278351,0.231405,0.167331,0.231672,0.265509,0.222930,0.284404,0.252149
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754,0.289157,0.247525,0.244444,0.265421,0.281250,0.278438,0.292419,0.263819,0.296610,0.223301
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493,0.209677,0.285285,0.254941,0.243655,0.256262,0.254980,0.251269,0.269531,0.200000,0.254237


Check for NaN values.

In [92]:
decade_stats.isnull().values.any()

False

Rename the AVG seasonal columns.

In [93]:
decade_stats.rename(columns = {1:'AVG Y1', 2:'AVG Y2', 3:'AVG Y3', 4:'AVG Y4', 5:'AVG Y5',
                            6:'AVG Y6', 7:'AVG Y7', 8:'AVG Y8', 9:'AVG Y9', 10:'AVG Y10'
                            }, inplace = True)

In [94]:
decade_stats

Unnamed: 0,ID,Player,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10,AVG Y1,AVG Y2,AVG Y3,AVG Y4,AVG Y5,AVG Y6,AVG Y7,AVG Y8,AVG Y9,AVG Y10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333,0.000000,0.076923,0.164179,0.190909,0.134146,0.268293,0.165217,0.282609,0.234375,0.200000
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387,0.250000,0.177215,0.216495,0.145631,0.222222,0.126437,0.181818,0.144330,0.170732,0.261364
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333,0.153846,0.076923,0.206522,0.114754,0.215190,0.203883,0.242105,0.295238,0.221239,0.254098
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071,0.000000,0.250000,0.163043,0.190909,0.225806,0.000000,0.174603,0.166667,0.181818,0.250000
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417,0.200000,0.315217,0.268116,0.325203,0.299685,0.319444,0.375546,0.377079,0.314496,0.394231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489,0.500000,0.196970,0.200000,0.243346,0.272277,0.259259,0.253165,0.228448,0.166667,0.052632
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537,0.204082,0.293515,0.278351,0.231405,0.167331,0.231672,0.265509,0.222930,0.284404,0.252149
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754,0.289157,0.247525,0.244444,0.265421,0.281250,0.278438,0.292419,0.263819,0.296610,0.223301
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493,0.209677,0.285285,0.254941,0.243655,0.256262,0.254980,0.251269,0.269531,0.200000,0.254237


### 6. Complete the Data Splitting by Creating Dataframes

TODO

...

...




With the career data in order and labelled with the Hall of Fame induction data, it is time to use the lists we created earlier in the notebook and split the data into our arbitrarily grouped dataframes.

All players who started their careers before the year 2000 will be in the first dataframe, `pre_2000`, and all players who started their careers in 2000 or later (some of whom will be currently active players) will be in the second dataframe, `from_2000`.

We already have our list of names from earlier work, so now it is a matter of merging data to make new dataframes specific to the chosen eras.
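
One possible way to carry out a split like this is sketched below on toy data: find each player's first season, attach it to the per-player table, and filter on the year 2000. Everything here is an assumption for illustration, including the `seasons` and `wide` dataframes, the `Debut` column and the `*_sketch` names. The notebook's own lists of names and merge inputs are not reproduced.

```python
import pandas as pd

# Toy seasonal records and a toy per-player table (illustrative values only).
seasons = pd.DataFrame({
    'ID':     ['a', 'a', 'b', 'b', 'c'],
    'Season': [1998, 1999, 2000, 2001, 1999],
})
wide = pd.DataFrame({'ID': ['a', 'b', 'c'], 'OPS Y1': [0.7, 0.8, 0.6]})

# Earliest season per player, as a two-column dataframe: ID, Debut.
debut = seasons.groupby('ID')['Season'].min().rename('Debut').reset_index()

# Attach the debut year, then split on the year 2000.
labelled = wide.merge(debut, on='ID', how='inner')
pre_2000_sketch = labelled[labelled['Debut'] < 2000].drop(columns='Debut')
from_2000_sketch = labelled[labelled['Debut'] >= 2000].drop(columns='Debut')

print(pre_2000_sketch)
print(from_2000_sketch)
```

Using an inner merge means only players present in both tables survive, which matches the `how='inner'` choice in the cells that follow.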

First, our pre-2000 players.

In [89]:
## TODO

pre_2000 = pd.merge(XXXX_NAMED, career_pre_2000, on='ID', how='inner')
pre_2000

Unnamed: 0,ID,Player,Number of Seasons,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,aaronha01,Henry Aaron,23,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,0.305503,0.555152,0.374276,0.929429,1
1,abreubo01,Bobby Abreu,18,10081,8480,1453,2470,574,59,288,1363,1476,1840,33,7,85,0.291274,0.474764,0.394977,0.869741,0
2,adairje01,Jerry Adair,13,4314,4019,376,1022,163,19,57,365,207,497,17,41,30,0.254292,0.346852,0.291598,0.638451,0
3,adamsbo03,Bobby Adams,14,4335,3846,557,1036,180,47,36,294,394,426,16,74,5,0.269371,0.368695,0.339357,0.708052,0
4,adamssp01,Sparky Adams,13,6175,5558,839,1588,249,49,9,390,453,222,28,95,0,0.285714,0.353005,0.342606,0.695611,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1509,zaungr01,Gregg Zaun,16,4042,3489,431,878,194,9,88,446,479,544,29,14,31,0.251648,0.388077,0.344091,0.732168,0
1510,zeileto01,Todd Zeile,16,8649,7573,986,2004,397,23,253,1110,945,1279,42,8,81,0.264624,0.423346,0.346140,0.769487,0
1511,zernigu01,Gus Zernial,11,4361,3940,551,1049,152,21,227,749,375,731,24,2,20,0.266244,0.488325,0.332186,0.820511,0
1512,zimmedo01,Don Zimmer,12,3523,3218,342,758,127,21,90,348,242,662,13,36,14,0.235550,0.371970,0.290508,0.662478,0


Then finally, our 2000 onward players.

In [91]:
## TODO

from_2000 = pd.merge(XXXX_NAMED, career_2000_onward, on='ID', how='inner')
from_2000

Unnamed: 0,ID,Player,Number of Seasons,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,Inductee
0,adamsma01,Matt Adams,10,2614,2421,297,624,130,6,118,399,165,643,12,0,16,0.257745,0.462619,0.306427,0.769046,0
1,alonsyo01,Yonder Alonso,10,3773,3362,390,872,181,2,100,426,366,648,16,1,28,0.259369,0.403629,0.332450,0.736078,0
2,altuvjo01,Jose Altuve,11,6346,5778,883,1777,340,29,164,639,443,753,54,26,45,0.307546,0.461578,0.359810,0.821389,0
3,andinro01,Robert Andino,10,1491,1344,153,313,58,1,18,97,113,313,6,21,7,0.232887,0.317708,0.293878,0.611586,0
4,andruel01,Elvis Andrus,13,7620,6863,953,1864,328,50,79,673,547,1043,50,103,56,0.271601,0.368498,0.327435,0.695933,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
300,youngde03,Delmon Young,10,4371,4108,473,1162,218,11,109,566,179,784,40,1,43,0.282863,0.420886,0.316018,0.736904,0
301,younger03,Eric Young Jr.,10,1926,1725,253,422,67,22,13,112,147,350,23,26,5,0.244638,0.331594,0.311579,0.643173,0
302,youngmi02,Michael Young,14,8612,7918,1136,2375,441,60,185,1030,575,1235,22,25,72,0.299949,0.440894,0.346105,0.786999,0
303,zimmery01,Ryan Zimmerman,16,7402,6654,963,1846,417,22,284,1061,646,1384,31,1,69,0.277427,0.474752,0.340946,0.815698,0


## It's Time to Save Our Three New Dataframes

Let's capture these three dataframes so they are ready for the Step 5 Notebook.


In [93]:
## TODO .... what files????

import os

if not os.path.exists('./data'):
    os.makedirs('./data')
    
alldata_csv = "./data/step2_alldata.csv"
career_hof.to_csv(alldata_csv, index=False)

pre_2000_csv = "./data/step2_pre_2000.csv"
pre_2000.to_csv(pre_2000_csv, index=False)

from_2000_csv = "./data/step2_from_2000.csv"
from_2000.to_csv(from_2000_csv, index=False)
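
As a quick check that a saved file can be picked up again by a later notebook, the sketch below writes a small dataframe to CSV and reads it back. It uses a temporary directory and a hypothetical file name (`example_ops.csv`) so that it does not touch the project's `./data` folder. The single row reuses Babe Ruth's `OPS Y1` value from the table above.

```python
import tempfile
from pathlib import Path

import pandas as pd

frame = pd.DataFrame({'ID': ['ruthba01'], 'OPS Y1': [0.5]})

with tempfile.TemporaryDirectory() as tmp:
    # Create the folder if it does not exist yet (no error if it already does).
    data_dir = Path(tmp) / 'data'
    data_dir.mkdir(parents=True, exist_ok=True)

    csv_path = data_dir / 'example_ops.csv'  # hypothetical file name
    # index=False avoids writing the row index as an extra unnamed column.
    frame.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

print(loaded)
```

Writing with `index=False` is what keeps an `Unnamed: 0` column from appearing when the file is loaded again.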

## Concluding Notebook Comments

**Note:** At this point, we will conclude this notebook for organizational purposes, as it is a logical point for saving and launching into the following experimentation and modelling based on the data prepared here.

Saving the data files in various states makes it easier to re-run parts of the overall project without having to re-run every aspect.

The purpose of this notebook is to continue on with the data preparation started in step 1, by structuring and splitting up the data to the state where the experiments and modelling will be conducted.

The *next* notebook in the series is: `harr2890_project_step5_ops_modelling`, where the saved data files will be loaded.