# Using Historical Data to Predict Batting Success: Step 4

Authored by: Donna J. Harris (994042890)

Email: harr2890@mylaurier.ca

For: CP640 Machine Learning (S22) with Professor Elham Harirpoush

## Notebook Series

Just a word about the presentation of this project code.

The code is organized into a series of locally executed Jupyter notebooks, one per step, which must be executed in sequence. This is `harr2890_project_step4_ops_data_prep`, the Step 4 notebook in the series.

## *Step 4 - Data Preparation for an OPS Approach*

This notebook covers a third phase of data preparation, following the preparation done in the Step 1 Notebook. From here, we continue structuring and splitting the data into the state needed for the experiments and modelling, which will be based on an On-base Plus Slugging (OPS) approach.

Here, we will be extracting data and generating multiple seasons of the OPS statistic, preparing the data for exploration and modelling based on various **regression** techniques in a subsequent notebook.

We will also be splitting the data, based on the year 2000, so that later on in the process we can run controlled tests on unseen data which can be used to manually evaluate the results and predictions of the modelling.

## Environment Setup

Import and establish environment for our work, including showing all dataframe column values.

In [1]:
import pandas as pd

pd.set_option('display.max_columns', None)

### Pre-Conditions

Step 1 must be run completely before running this notebook.

The `data` folder must exist with the following prepared data file:
- `./data/core_mlb_dataset.csv`
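
As a small sketch of how this pre-condition could be checked before any work starts, the helper below reports which required files are missing. The function name `missing_inputs` is hypothetical and not part of the notebook series; only the file path comes from the list above.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the subset of `paths` that do not exist as files on disk."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# The only input this notebook expects, produced by the Step 1 notebook.
required = ["./data/core_mlb_dataset.csv"]

missing = missing_inputs(required)
if missing:
    print("Run the Step 1 notebook first; missing:", missing)
```

Running a check like this at the top of the notebook gives a clear message instead of a file-not-found error partway through.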

##  Loading Prepared Data Files

Load in the Major League Baseball batting data (`./data/core_mlb_dataset.csv`) so we can continue with preparing this data.


In [119]:
core_mlb_dataset = "./data/core_mlb_dataset.csv"
df = pd.read_csv(core_mlb_dataset)
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
0,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
1,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
2,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
3,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
4,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3715517,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
3715518,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
3715519,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
3715520,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021


## Preprocessing (Continued from the Step 1 Notebook)

### The OPS Approach

A statistic that has become a very important measurement of a batter's productivity and efficacy at the plate is the On-base Plus Slugging (OPS) statistic, which (as the name implies) is the sum of two other batting statistics:  On-Base Percentage (OBP) and Slugging Percentage (SLG). Generally speaking, an OPS value that is close to (or over) 1.000 indicates a very good batting performance.
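
To make the sum concrete, here is a minimal worked example using the standard OBP and SLG definitions and Henry Aaron's 1954 season totals, which appear in the seasonal tables later in this notebook. The function names are illustrative only.

```python
def on_base_pct(h, bb, hbp, ab, sf):
    """OBP: times on base divided by at bats + walks + hit by pitch + sacrifice flies."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


def slugging_pct(h, doubles, triples, hr, ab):
    """SLG: total bases per at bat (H + 2B + 2*3B + 3*HR equals total bases)."""
    return (h + doubles + 2 * triples + 3 * hr) / ab


# Henry Aaron, 1954 season totals
obp = on_base_pct(h=131, bb=28, hbp=3, ab=468, sf=4)
slg = slugging_pct(h=131, doubles=27, triples=6, hr=13, ab=468)
ops = obp + slg

print(round(obp, 6), round(slg, 6), round(ops, 6))
# 0.322068 0.446581 0.768649
```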

Because the overarching goal is to use Major League Baseball data to predict batting success, OPS prediction has the potential to be an approach that helps to predict a batter's success at the plate.

By using OPS values from multiple seasons, we should be able to use **regression** techniques to predict the future OPS values of a batter.

In order to prepare the data for this approach, we will do the following:
1.  Generate seasonal batting statistics for player data.
2.  Arbitrarily split the data, based on the year 2000. (Get the names of these players.)
3.  Filter out players with minimal batting data.
4.  Create dataframes based on the calculated batting data, split by the lists of names gathered when splitting the data.

The end result will be two labelled dataframes containing player OPS data for:
- players whose career began before 2000
- players whose career began during or after 2000

These will be stored and used in the Step 5 Notebook work. 
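
One way to express "career began before 2000" is to look at each player's first recorded season. The sketch below does that on invented toy data; the IDs are made up, and this is not necessarily how the lists are derived later in this notebook.

```python
import pandas as pd

# Toy seasonal records (hypothetical player IDs).
seasons = pd.DataFrame({
    "ID": ["vet01", "vet01", "vet01", "new01", "new01", "old01"],
    "Season": [1998, 1999, 2000, 2000, 2001, 1975],
})

# Earliest season on record for each player.
first_season = seasons.groupby("ID")["Season"].min()

began_before_2000 = sorted(first_season[first_season < 2000].index)
began_2000_onward = sorted(first_season[first_season >= 2000].index)

print(began_before_2000)  # ['old01', 'vet01']
print(began_2000_onward)  # ['new01']
```

A player such as `vet01`, whose career spans the year 2000, lands in exactly one group under this rule.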

### 1. Seasonal Batting Statistics

First, we group the game-based statistics by player and season.

In [120]:
filterable = ['ID', 'Player', 'Season']
# only the counting statistics are summed; the grouping keys already form the group index
columns = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']
group_alldata = df.groupby(filterable)
group_alldata = group_alldata[columns].sum().copy()
group_alldata

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0


Then, we'll remove the grouping so we have independent seasonal statistic records for each player's season.

In [121]:
seasonal_data = group_alldata.reset_index()
seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0
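
The group-then-reset pattern above can also be written as a single step with `as_index=False`, which keeps the grouping keys as ordinary columns. This is a sketch on invented toy data, not a replacement for the cells above.

```python
import pandas as pd

# Toy game-level records (hypothetical players).
games = pd.DataFrame({
    "ID": ["a01", "a01", "a01", "b01"],
    "Player": ["Ann A", "Ann A", "Ann A", "Bob B"],
    "Season": [2001, 2001, 2002, 2001],
    "AB": [4, 3, 5, 4],
    "H": [2, 1, 0, 3],
})

stat_columns = ["AB", "H"]

# One row per player-season, with ID, Player and Season kept as columns.
seasonal = games.groupby(["ID", "Player", "Season"], as_index=False)[stat_columns].sum()
print(seasonal)
```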


Next, because we're interested in seasonal statistics -- especially OPS -- we need to calculate those values seasonally for each player.

**Note:** The statement and simplification of these statistical calculations are outlined in detail in the ***Step 2 Notebook***, section 4.

In [122]:
seasonal_data['AVG'] = seasonal_data['H'] / (seasonal_data['AB']*1.0)
seasonal_data['SLG'] = (seasonal_data['H'] + seasonal_data['2B'] + 2*seasonal_data['3B'] + 3*seasonal_data['HR']) / (seasonal_data['AB']*1.0)
seasonal_data['OBP'] = (seasonal_data['H'] + seasonal_data['BB'] + seasonal_data['HBP']) / ((seasonal_data['AB'] + seasonal_data['BB'] + seasonal_data['HBP'] + seasonal_data['SF'])*1.0) 
seasonal_data['OPS'] = seasonal_data['SLG'] + seasonal_data['OBP']

seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,0.279915,0.446581,0.322068,0.768649
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,0.313953,0.539867,0.366261,0.906129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,0.185185,0.222222,0.214286,0.436508
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,0.117647,0.117647,0.166667,0.284314
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,0.071429,0.071429,0.133333,0.204762
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,0.222222,0.444444,0.300000,0.744444


Because we've completed a number of calculations, we should look for null or NaN values. (Because we crafted this data in previous steps, we know that any null values detected are newly introduced.)

In [123]:
seasonal_data.isnull().values.any()

True

As anticipated, there are null values from those calculations, so let's find them and resolve the issues.
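
The nulls come from zero denominators: in pandas, dividing 0 by 0 yields NaN rather than raising an error. The toy rows below (invented values) mirror the two situations that show up in the data: a season with only a walk, and a season with only a sacrifice hit.

```python
import pandas as pd

toy = pd.DataFrame({
    "H":   [2, 0, 0],
    "AB":  [4, 0, 0],   # rows 1 and 2 have no at bats
    "BB":  [1, 1, 0],   # row 1 is a single walk
    "HBP": [0, 0, 0],
    "SF":  [0, 0, 0],
})

toy["AVG"] = toy["H"] / toy["AB"]
toy["OBP"] = (toy["H"] + toy["BB"] + toy["HBP"]) / (
    toy["AB"] + toy["BB"] + toy["HBP"] + toy["SF"]
)

# Count NaN values per column and keep only the affected columns.
nan_counts = toy.isna().sum()
print(nan_counts[nan_counts > 0])
```

A per-column count like this shows at a glance which statistics need attention, before inspecting each one individually.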

Let's look at `'AVG'`...

In [124]:
avg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['AVG'])]
avg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
75,abernte01,Ted Abernathy,1942,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,
1361,anderla02,Larry Andersen,1994,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
1445,andrena01,Nate Andrews,1937,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,
69326,yeabsbe01,Bert Yeabsley,1919,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
69766,zannido01,Dom Zanni,1959,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
69807,zavadcl01,Clay Zavada,2009,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,


... and resolve by setting to 0.0, then confirm.

In [125]:
series = avg_is_nan.index
seasonal_data.loc[series, 'AVG'] = 0.0

avg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['AVG'])]
avg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


`'AVG'` is addressed, so let's look at `'SLG'`...

In [126]:
slg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['SLG'])]
slg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
75,abernte01,Ted Abernathy,1942,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,
1361,anderla02,Larry Andersen,1994,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
1445,andrena01,Nate Andrews,1937,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,
69326,yeabsbe01,Bert Yeabsley,1919,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
69766,zannido01,Dom Zanni,1959,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
69807,zavadcl01,Clay Zavada,2009,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,


... and resolve by setting to 0.0, then confirm.

In [127]:
series = slg_is_nan.index
seasonal_data.loc[series, 'SLG'] = 0.0

slg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['SLG'])]
slg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


`'SLG'` is addressed, so let's look at `'OBP'`...

In [128]:
obp_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OBP'])]
obp_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
1847,arroylu01,Luis Arroyo,1963,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
2179,avilalu01,Luis Avilan,2013,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
2180,avilalu01,Luis Avilan,2014,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68213,wilsoga03,Gary Wilson,1995,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
68486,winklda01,Dan Winkler,2019,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
69153,wuertmi01,Michael Wuertz,2006,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,


... and resolve by setting to 0.0, then confirm.

In [129]:
series = obp_is_nan.index
seasonal_data.loc[series, 'OBP'] = 0.0

obp_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OBP'])]
obp_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


`'OBP'` is addressed, so, let's finish this up and look at `'OPS'`...

In [130]:
ops_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OPS'])]
ops_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
75,abernte01,Ted Abernathy,1942,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,
1361,anderla02,Larry Andersen,1994,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
1445,andrena01,Nate Andrews,1937,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,
69326,yeabsbe01,Bert Yeabsley,1919,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
69766,zannido01,Dom Zanni,1959,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
69807,zavadcl01,Clay Zavada,2009,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,


... and resolve by recalculating by adding `'SLG'` + `'OBP'`...

In [131]:
series = ops_is_nan.index

for record in series:
    seasonal_data.loc[record, 'OPS'] = seasonal_data.loc[record, 'SLG'] + seasonal_data.loc[record, 'OBP']


... and then confirm.

In [132]:
ops_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OPS'])]
ops_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


And lastly, let's confirm we addressed all of the issues.

In [133]:
seasonal_data.isnull().values.any()

False

Success. The calculated statistics are now filled out in the `seasonal_data` table.

In [134]:
seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,0.279915,0.446581,0.322068,0.768649
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,0.313953,0.539867,0.366261,0.906129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,0.185185,0.222222,0.214286,0.436508
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,0.117647,0.117647,0.166667,0.284314
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,0.071429,0.071429,0.133333,0.204762
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,0.222222,0.444444,0.300000,0.744444
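
The column-by-column fixes above could also be expressed in vectorized form. The following is a sketch on toy values, assuming the same policy as above: missing rate statistics become 0.0 and OPS is then recomputed as SLG + OBP.

```python
import pandas as pd

toy = pd.DataFrame({
    "AVG": [0.25, None, None],
    "SLG": [0.50, None, None],
    "OBP": [0.25, 1.0, None],
})

rate_columns = ["AVG", "SLG", "OBP"]

# Replace every missing rate statistic with 0.0 in one call.
toy[rate_columns] = toy[rate_columns].fillna(0.0)

# Recompute OPS for all rows at once instead of looping over records.
toy["OPS"] = toy["SLG"] + toy["OBP"]
print(toy)
```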


### 2.  Gathering a Decade of Statistics

Get all player IDs and names for the new dataframe.

First, make sure the data is sorted by season, so that each player's earliest season comes first.


In [136]:
seasonal_sorted = seasonal_data.sort_values(by=['Season'], ascending=True)
seasonal_sorted

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
21668,gatinfr01,Frank Gatins,1901,207,198,21,45,7,2,1,21,5,27,2,2,0,0.227273,0.297980,0.253659,0.551638
7533,brownfr01,Fred Brown,1901,15,14,1,2,0,0,0,2,0,2,0,1,0,0.142857,0.142857,0.142857,0.285714
40448,mccange01,Gene McCann,1901,13,10,2,0,0,0,0,0,2,4,0,1,0,0.000000,0.000000,0.166667,0.166667
15398,denzero01,Roger Denzer,1901,25,22,0,2,1,0,0,1,2,8,1,0,0,0.090909,0.136364,0.200000,0.336364
30788,jennihu01,Hughie Jennings,1901,345,302,38,79,21,2,1,39,25,25,12,6,0,0.261589,0.354305,0.342183,0.696488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42312,mileywa01,Wade Miley,2021,59,54,9,10,3,0,0,3,3,13,0,2,0,0.185185,0.240741,0.228070,0.468811
10254,castier01,Erick Castillo,2021,9,8,0,2,0,0,0,0,1,1,0,0,0,0.250000,0.250000,0.333333,0.583333
18738,feltnry01,Ryan Feltner,2021,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
22364,gittech01,Chris Gittens,2021,44,36,1,4,0,0,1,5,7,13,0,0,1,0.111111,0.194444,0.250000,0.444444


Start building the `decade_stats` dataframe: one row per player, holding the player's ID and name.

In [137]:
decade_stats = seasonal_sorted.reset_index()
decade_stats = decade_stats[['ID', 'Player']].copy()
decade_stats = decade_stats.drop_duplicates(subset=['ID'], keep='first')
decade_stats

Unnamed: 0,ID,Player
0,gatinfr01,Frank Gatins
1,brownfr01,Fred Brown
2,mccange01,Gene McCann
3,denzero01,Roger Denzer
4,jennihu01,Hughie Jennings
...,...,...
70019,baragca01,Caleb Baragar
70024,castier01,Erick Castillo
70025,feltnry01,Ryan Feltner
70026,gittech01,Chris Gittens


Build out placeholder columns for 10 seasons of OPS data.

In [138]:
decade_stats = decade_stats.reindex(columns = decade_stats.columns.tolist() 
                                  + [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
decade_stats

Unnamed: 0,ID,Player,1,2,3,4,5,6,7,8,9,10
0,gatinfr01,Frank Gatins,,,,,,,,,,
1,brownfr01,Fred Brown,,,,,,,,,,
2,mccange01,Gene McCann,,,,,,,,,,
3,denzero01,Roger Denzer,,,,,,,,,,
4,jennihu01,Hughie Jennings,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
70019,baragca01,Caleb Baragar,,,,,,,,,,
70024,castier01,Erick Castillo,,,,,,,,,,
70025,feltnry01,Ryan Feltner,,,,,,,,,,
70026,gittech01,Chris Gittens,,,,,,,,,,


Fill in the OPS for each player's first 10 seasons.

**NOTE:** This one takes a bit more time...

In [139]:
first_n_seasons = 10

for player_id in decade_stats['ID']:
    # seasonal_sorted is ordered by season, so these are the player's earliest seasons
    n_seasons = seasonal_sorted[seasonal_sorted['ID'] == player_id][0:first_n_seasons]

    # only players with at least ten seasons are filled in; the rest stay NaN
    if len(n_seasons) == first_n_seasons:
        ten_year_ops_list = n_seasons['OPS'].to_list()

        for year, ops_value in enumerate(ten_year_ops_list, start=1):
            decade_stats.loc[decade_stats['ID'] == player_id, year] = ops_value

decade_stats

Unnamed: 0,ID,Player,1,2,3,4,5,6,7,8,9,10
0,gatinfr01,Frank Gatins,,,,,,,,,,
1,brownfr01,Fred Brown,,,,,,,,,,
2,mccange01,Gene McCann,,,,,,,,,,
3,denzero01,Roger Denzer,,,,,,,,,,
4,jennihu01,Hughie Jennings,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
70019,baragca01,Caleb Baragar,,,,,,,,,,
70024,castier01,Erick Castillo,,,,,,,,,,
70025,feltnry01,Ryan Feltner,,,,,,,,,,
70026,gittech01,Chris Gittens,,,,,,,,,,


The table doesn't look populated, but that is expected: many players have fewer than ten seasons of data, so their rows are left empty, and the rows displayed here happen to be among them.

To make this more evident, we'll remove rows with NaN values.

In [140]:
decade_stats = decade_stats.dropna().copy()
decade_stats

Unnamed: 0,ID,Player,1,2,3,4,5,6,7,8,9,10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417
...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493


Rename the placeholder columns to label them as seasonal OPS statistics.

In [142]:
decade_stats.rename(columns = {1:'OPS Y1', 2:'OPS Y2', 3:'OPS Y3', 4:'OPS Y4', 5:'OPS Y5',
                            6:'OPS Y6', 7:'OPS Y7', 8:'OPS Y8', 9:'OPS Y9', 10:'OPS Y10'
                            }, inplace = True)
decade_stats

Unnamed: 0,ID,Player,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417
...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493
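
The placeholder, fill, drop, and rename steps above can also be sketched without a per-player loop, by numbering each player's seasons with `groupby(...).cumcount()` and pivoting to wide form. This is an alternative technique, not the approach used in the cells above, and it runs here on invented toy data with three seasons instead of ten.

```python
import pandas as pd

# Toy seasonal OPS values (hypothetical players), deliberately out of order.
seasons = pd.DataFrame({
    "ID":     ["a", "a", "a", "a", "b", "b", "c", "c", "c"],
    "Season": [2003, 2001, 2002, 2004, 2010, 2011, 1990, 1992, 1991],
    "OPS":    [0.7, 0.5, 0.6, 0.8, 0.4, 0.45, 0.9, 0.95, 0.85],
})

first_n = 3  # the notebook uses 10

# Number each player's seasons 1, 2, 3, ... in chronological order.
ordered = seasons.sort_values(["ID", "Season"])
ordered["career_year"] = ordered.groupby("ID").cumcount() + 1

# Keep the first N seasons and spread them into one column per career year.
first = ordered[ordered["career_year"] <= first_n]
wide = first.pivot(index="ID", columns="career_year", values="OPS")

# Players with fewer than N seasons have gaps, so drop them.
wide = wide.dropna()
wide.columns = [f"OPS Y{n}" for n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Player `b` has only two seasons and is dropped, matching the "ten seasons or more" rule applied above.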


Repeat the process for a decade of AVG data. (It is not yet certain whether AVG will be used in the modelling; it is blocked out here in case it is.)

In [143]:
decade_stats = decade_stats.reindex(columns = decade_stats.columns.tolist() 
                                  + [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
decade_stats

Unnamed: 0,ID,Player,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10,1,2,3,4,5,6,7,8,9,10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333,,,,,,,,,,
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387,,,,,,,,,,
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333,,,,,,,,,,
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071,,,,,,,,,,
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489,,,,,,,,,,
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537,,,,,,,,,,
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754,,,,,,,,,,
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493,,,,,,,,,,


In [144]:
first_n_seasons = 10

for player_id in decade_stats['ID']:
    # seasonal_sorted is ordered by season, so these are the player's earliest seasons
    n_seasons = seasonal_sorted[seasonal_sorted['ID'] == player_id][0:first_n_seasons]

    if len(n_seasons) == first_n_seasons:
        ten_year_avg_list = n_seasons['AVG'].to_list()

        for year, avg_value in enumerate(ten_year_avg_list, start=1):
            decade_stats.loc[decade_stats['ID'] == player_id, year] = avg_value

decade_stats

Unnamed: 0,ID,Player,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10,1,2,3,4,5,6,7,8,9,10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333,0.000000,0.076923,0.164179,0.190909,0.134146,0.268293,0.165217,0.282609,0.234375,0.200000
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387,0.250000,0.177215,0.216495,0.145631,0.222222,0.126437,0.181818,0.144330,0.170732,0.261364
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333,0.153846,0.076923,0.206522,0.114754,0.215190,0.203883,0.242105,0.295238,0.221239,0.254098
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071,0.000000,0.250000,0.163043,0.190909,0.225806,0.000000,0.174603,0.166667,0.181818,0.250000
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417,0.200000,0.315217,0.268116,0.325203,0.299685,0.319444,0.375546,0.377079,0.314496,0.394231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489,0.500000,0.196970,0.200000,0.243346,0.272277,0.259259,0.253165,0.228448,0.166667,0.052632
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537,0.204082,0.293515,0.278351,0.231405,0.167331,0.231672,0.265509,0.222930,0.284404,0.252149
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754,0.289157,0.247525,0.244444,0.265421,0.281250,0.278438,0.292419,0.263819,0.296610,0.223301
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493,0.209677,0.285285,0.254941,0.243655,0.256262,0.254980,0.251269,0.269531,0.200000,0.254237


Null values are not expected here, but let's do a sanity check and look for null/NaN values anyway.

In [145]:
decade_stats.isnull().values.any()

False

Rename the placeholder columns to label them as seasonal AVG statistics.

In [146]:
decade_stats.rename(columns = {1:'AVG Y1', 2:'AVG Y2', 3:'AVG Y3', 4:'AVG Y4', 5:'AVG Y5',
                            6:'AVG Y6', 7:'AVG Y7', 8:'AVG Y8', 9:'AVG Y9', 10:'AVG Y10'
                            }, inplace = True)
decade_stats

Unnamed: 0,ID,Player,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10,AVG Y1,AVG Y2,AVG Y3,AVG Y4,AVG Y5,AVG Y6,AVG Y7,AVG Y8,AVG Y9,AVG Y10
533,willeed01,Ed Willett,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333,0.000000,0.076923,0.164179,0.190909,0.134146,0.268293,0.165217,0.282609,0.234375,0.200000
813,daussho01,Hooks Dauss,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387,0.250000,0.177215,0.216495,0.145631,0.222222,0.126437,0.181818,0.144330,0.170732,0.261364
865,coopewi01,Wilbur Cooper,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333,0.153846,0.076923,0.206522,0.114754,0.215190,0.203883,0.242105,0.295238,0.221239,0.254098
1012,mamaual01,Al Mamaux,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071,0.000000,0.250000,0.163043,0.190909,0.225806,0.000000,0.174603,0.166667,0.181818,0.250000
1194,ruthba01,Babe Ruth,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417,0.200000,0.315217,0.268116,0.325203,0.299685,0.319444,0.375546,0.377079,0.314496,0.394231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61489,perezhe01,Hernan Perez,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489,0.500000,0.196970,0.200000,0.243346,0.272277,0.259259,0.253165,0.228448,0.166667,0.052632
61553,gomesya01,Yan Gomes,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537,0.204082,0.293515,0.278351,0.231405,0.167331,0.231672,0.265509,0.222930,0.284404,0.252149
61558,simmoan01,Andrelton Simmons,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754,0.289157,0.247525,0.244444,0.265421,0.281250,0.278438,0.292419,0.263819,0.296610,0.223301
61607,mercejo03,Jordy Mercer,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493,0.209677,0.285285,0.254941,0.243655,0.256262,0.254980,0.251269,0.269531,0.200000,0.254237


### Career Stats

This process is similar to the one in the Step 2 Notebook, sections 1, 2, and 4.

First, we group the game-based statistics by player and season.

In [147]:
filterable = ['ID', 'Player', 'Season']
# only the counting statistics are summed; the grouping keys already form the group index
columns = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']
group_alldata = df.groupby(filterable)
group_alldata = group_alldata[columns].sum().copy()
group_alldata

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0


Then, we'll remove the grouping so we have independent seasonal statistic records for each player's season.

In [148]:
group_alldata = group_alldata.reset_index()
group_alldata

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0


Next, because we're interested in career statistics, we want to sum seasonal statistics to generate career statistics. Each seasonal record is first given a `Number of Seasons` value of 1. For more details, see the ***Step 2 Notebook***, section 1.

In [149]:
group_alldata['Number of Seasons'] = 1
group_alldata

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,1
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,1
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,1
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,1
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,1
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,1
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,1
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,1
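
A constant column of 1s becomes a season count once records are summed per player. The sketch below shows the idea on invented toy data; that this is the purpose of the `Number of Seasons` column here is an assumption drawn from the surrounding text.

```python
import pandas as pd

# Toy seasonal records (hypothetical players).
seasonal = pd.DataFrame({
    "ID": ["a01", "a01", "b01"],
    "Season": [2001, 2002, 2001],
    "AB": [400, 450, 30],
    "H": [100, 135, 6],
})

# Every seasonal record counts as one season.
seasonal["Number of Seasons"] = 1

# Summing per player turns the 1s into the number of seasons played.
career = seasonal.groupby("ID", as_index=False)[["AB", "H", "Number of Seasons"]].sum()
print(career)
```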


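But first, we want to split the data based on the year 2000. We start by collecting the IDs of players who have seasons before 2000 and the IDs of players who have seasons in 2000 or later.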

For more details, see the ***Step 2 Notebook***, section 2.

In [150]:
career_pre_2000 = group_alldata.copy()
career_pre_2000 = career_pre_2000[career_pre_2000['Season'] < 2000].copy()
career_pre_2000

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,1
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,1
5,aaronha01,Henry Aaron,1956,660,609,106,200,34,14,26,92,37,54,2,5,7,1
6,aaronha01,Henry Aaron,1957,400,372,71,130,17,4,29,78,26,34,0,0,2,1
7,aaronha01,Henry Aaron,1958,664,601,109,196,34,4,30,95,59,49,1,0,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,1
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,1
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,1
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,1


In [151]:
career_pre_2000 = career_pre_2000[['ID']].copy()
career_pre_2000 = career_pre_2000.drop_duplicates(subset=['ID'], keep='first')
career_pre_2000 = career_pre_2000.sort_values(by='ID', ascending=True)
career_pre_2000

Unnamed: 0,ID
3,aaronha01
26,aaronto01
33,aasedo01
41,abbotje01
46,abbotji01
...,...
69993,zuberjo01
70007,zupcibo01
70011,zupofr01
70014,zuvelpa01


In [152]:
# Keep only the seasons from 2000 onward
career_2000_onward = group_alldata[group_alldata['Season'] >= 2000].copy()
career_2000_onward

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,1
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,1
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,1
34,abadan01,Andy Abad,2001,1,1,0,0,0,0,0,0,0,0,0,0,0,1
35,abadan01,Andy Abad,2003,19,17,1,2,0,0,0,0,2,5,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70002,zuninmi01,Mike Zunino,2017,435,387,52,97,25,0,25,64,39,160,8,0,1,1
70003,zuninmi01,Mike Zunino,2018,405,373,37,75,18,0,20,44,24,150,6,0,2,1
70004,zuninmi01,Mike Zunino,2019,289,266,30,44,10,1,9,32,20,98,3,0,0,1
70005,zuninmi01,Mike Zunino,2020,84,75,8,11,4,0,4,10,6,37,3,0,0,1


In [153]:
career_2000_onward = career_2000_onward[['ID']].copy()
career_2000_onward = career_2000_onward.drop_duplicates(subset=['ID'], keep='first')
career_2000_onward = career_2000_onward.sort_values(by='ID', ascending=True)
career_2000_onward

Unnamed: 0,ID
0,aardsda01
34,abadan01
37,abadfe01
40,abbotco01
44,abbotje01
...,...
69967,zoccope01
69981,zoskyed01
69995,zuberty01
69996,zuletju01


In [154]:
# Players who appear in both eras
overlapped_names = pd.merge(career_pre_2000, career_2000_onward, on=['ID'], how='inner')
overlapped_names

Unnamed: 0,ID
0,abbotje01
1,abbotku01
2,abreubo01
3,aceveju01
4,adamste01
...,...
952,younger02
953,youngke01
954,zaungr01
955,zeileto01


In [155]:
# Remove players who also appear before 2000, so the two lists do not overlap
career_2000_onward = career_2000_onward[~career_2000_onward['ID'].isin(overlapped_names['ID'])].copy()

career_2000_onward

Unnamed: 0,ID
0,aardsda01
34,abadan01
37,abadfe01
40,abbotco01
58,abbotpa01
...,...
69953,zobribe01
69967,zoccope01
69995,zuberty01
69996,zuletju01
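
The split above can also be expressed in terms of each player's first season. The following is only a sketch on made-up rows, assuming one row per player-season as in `group_alldata`; the names `first_season`, `pre_ids`, and `post_ids` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: b01 plays on both sides of the year 2000
toy = pd.DataFrame({
    "ID": ["a01", "a01", "b01", "b01", "c01"],
    "Season": [1998, 1999, 1999, 2000, 2003],
})

# Earliest season in which each player appears
first_season = toy.groupby("ID")["Season"].min()

# Players who debuted before 2000 vs. those who debuted in 2000 or later
pre_ids = sorted(first_season[first_season < 2000].index)
post_ids = sorted(first_season[first_season >= 2000].index)

print(pre_ids)   # ['a01', 'b01']
print(post_ids)  # ['c01']
```

A player such as `b01`, whose career spans both eras, lands in the pre-2000 group only, which matches the effect of removing the overlapped IDs from `career_2000_onward`.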


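This gives us the two lists of player IDs, `career_pre_2000` and `career_2000_onward`, which we will use in the final stage of this notebook to split the data.

Now the seasonal statistics can be summed into career statistics. First, a reminder of the seasonal data: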

In [156]:
group_alldata

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,1
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,1
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,1
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,1
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,1
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,1
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,1
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,1


In [157]:
del group_alldata['Season']
career_alldata = group_alldata.groupby(['ID','Player']).sum()
career_alldata

ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
aardsda01,David Aardsma,5,4,0,0,0,0,0,0,0,2,0,1,0,3
aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,23
aaronto01,Tommie Aaron,1045,944,99,216,42,6,13,94,85,145,0,9,6,7
aasedo01,Don Aase,5,5,0,0,0,0,0,0,0,3,0,0,0,1
abadan01,Andy Abad,25,21,1,2,0,0,0,0,4,5,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuninmi01,Mike Zunino,2835,2559,308,518,111,5,141,345,198,981,58,8,12,9
zupcibo01,Bob Zupcic,886,795,93,199,47,4,7,80,57,137,6,20,8,4
zupofr01,Frank Zupo,8,7,1,2,1,0,0,0,1,2,0,0,0,3
zuvelpa01,Paul Zuvella,545,491,40,109,17,2,2,20,34,50,2,18,0,8


In [158]:
career_alldata = career_alldata.reset_index()
career_alldata

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
0,aardsda01,David Aardsma,5,4,0,0,0,0,0,0,0,2,0,1,0,3
1,aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,23
2,aaronto01,Tommie Aaron,1045,944,99,216,42,6,13,94,85,145,0,9,6,7
3,aasedo01,Don Aase,5,5,0,0,0,0,0,0,0,3,0,0,0,1
4,abadan01,Andy Abad,25,21,1,2,0,0,0,0,4,5,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14331,zuninmi01,Mike Zunino,2835,2559,308,518,111,5,141,345,198,981,58,8,12,9
14332,zupcibo01,Bob Zupcic,886,795,93,199,47,4,7,80,57,137,6,20,8,4
14333,zupofr01,Frank Zupo,8,7,1,2,1,0,0,0,1,2,0,0,0,3
14334,zuvelpa01,Paul Zuvella,545,491,40,109,17,2,2,20,34,50,2,18,0,8


Instead of the filtering we did in Step 2's Notebook, we will want to filter on the names in the `decade_stats` dataframe.

In [159]:
decade_stats_names = decade_stats[['ID']].copy()
decade_stats_names

Unnamed: 0,ID
533,willeed01
813,daussho01
865,coopewi01
1012,mamaual01
1194,ruthba01
...,...
61489,perezhe01
61553,gomesya01
61558,simmoan01
61607,mercejo03


Next, we filter the career data down to the players in `decade_stats` by left-merging it onto the list of IDs.

In [160]:
career_alldata = pd.merge(decade_stats_names, career_alldata, on='ID', how='left')
career_alldata

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons
0,willeed01,Ed Willett,704,649,56,129,25,10,5,68,25,174,6,24,0,10
1,daussho01,Hooks Dauss,1324,1125,107,212,41,15,6,112,145,288,8,41,0,15
2,coopewi01,Wilbur Cooper,1320,1232,111,295,34,18,6,103,46,139,4,35,0,15
3,mamaual01,Al Mamaux,466,421,39,77,12,2,1,31,27,90,3,12,0,12
4,ruthba01,Babe Ruth,10624,8400,2173,2872,506,137,714,2215,2063,1331,45,45,0,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2394,perezhe01,Hernan Perez,1846,1745,188,436,74,11,45,180,76,397,1,12,12,10
2395,gomesya01,Yan Gomes,3274,3006,369,742,158,8,117,416,185,794,50,2,31,10
2396,simmoan01,Andrelton Simmons,4731,4366,493,1156,200,23,70,437,297,448,26,13,29,10
2397,mercejo03,Jordy Mercer,3416,3104,327,796,173,15,66,308,246,589,27,22,17,10
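
To illustrate how a left merge acts as a filter, here is a small sketch on made-up data. The frames `names` and `career` are hypothetical stand-ins for `decade_stats_names` and `career_alldata`, and the `validate` argument is an optional safeguard that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical stand-ins for decade_stats_names and career_alldata
names = pd.DataFrame({"ID": ["c01", "a01"]})
career = pd.DataFrame({
    "ID": ["a01", "b01", "c01"],
    "AB": [100, 250, 40],
})

# Left merge: one output row per ID in `names`, in the order of `names`.
# validate="one_to_one" raises a MergeError if either side repeats an ID.
filtered = pd.merge(names, career, on="ID", how="left", validate="one_to_one")
print(filtered)

# A left merge leaves NaN where an ID has no match, so check for gaps
print(filtered["AB"].notna().all())
```

Player `b01` is dropped because it is absent from `names`, and the result follows the row order of `names` rather than that of `career`.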


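Next, we calculate the career rate statistics from the career totals: batting average (AVG), slugging percentage (SLG), on-base percentage (OBP), and OPS.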

In [162]:
# Batting average: hits per at-bat
career_alldata['Career AVG'] = career_alldata['H'] / career_alldata['AB']
# Slugging: total bases per at-bat (H counts every hit once; extra bases are added on top)
career_alldata['Career SLG'] = (career_alldata['H'] + career_alldata['2B'] + 2*career_alldata['3B'] + 3*career_alldata['HR']) / career_alldata['AB']
# On-base percentage: times on base per qualifying plate appearance
career_alldata['Career OBP'] = (career_alldata['H'] + career_alldata['BB'] + career_alldata['HBP']) / (career_alldata['AB'] + career_alldata['BB'] + career_alldata['HBP'] + career_alldata['SF'])
# OPS: on-base plus slugging
career_alldata['Career OPS'] = career_alldata['Career SLG'] + career_alldata['Career OBP']

career_alldata

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons,Career AVG,Career SLG,Career OBP,Career OPS
0,willeed01,Ed Willett,704,649,56,129,25,10,5,68,25,174,6,24,0,10,0.198767,0.291217,0.235294,0.526511
1,daussho01,Hooks Dauss,1324,1125,107,212,41,15,6,112,145,288,8,41,0,15,0.188444,0.267556,0.285603,0.553158
2,coopewi01,Wilbur Cooper,1320,1232,111,295,34,18,6,103,46,139,4,35,0,15,0.239448,0.310877,0.269111,0.579987
3,mamaual01,Al Mamaux,466,421,39,77,12,2,1,31,27,90,3,12,0,12,0.182898,0.228029,0.237251,0.465279
4,ruthba01,Babe Ruth,10624,8400,2173,2872,506,137,714,2215,2063,1331,45,45,0,22,0.341905,0.689762,0.473925,1.163687
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2394,perezhe01,Hernan Perez,1846,1745,188,436,74,11,45,180,76,397,1,12,12,10,0.249857,0.382235,0.279716,0.661951
2395,gomesya01,Yan Gomes,3274,3006,369,742,158,8,117,416,185,794,50,2,31,10,0.246840,0.421490,0.298594,0.720084
2396,simmoan01,Andrelton Simmons,4731,4366,493,1156,200,23,70,437,297,448,26,13,29,10,0.264773,0.369217,0.313480,0.682697
2397,mercejo03,Jordy Mercer,3416,3104,327,796,173,15,66,308,246,589,27,22,17,10,0.256443,0.385631,0.314968,0.700599
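
As a quick worked check of these formulas, the sketch below recomputes one row by hand using Babe Ruth's career totals as displayed in the table above. The helper `career_rates` is a hypothetical name introduced only for this illustration.

```python
def career_rates(h, doubles, triples, hr, ab, bb, hbp, sf):
    """Return (AVG, SLG, OBP, OPS) from career counting totals."""
    # H already counts each hit once, so add one extra base per double,
    # two per triple, and three per home run to get total bases
    total_bases = h + doubles + 2 * triples + 3 * hr
    avg = h / ab
    slg = total_bases / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    return avg, slg, obp, slg + obp


# Babe Ruth's totals as shown in the output above
avg, slg, obp, ops = career_rates(
    h=2872, doubles=506, triples=137, hr=714, ab=8400, bb=2063, hbp=45, sf=0
)
print(f"{avg:.6f} {slg:.6f} {obp:.6f} {ops:.6f}")
# 0.341905 0.689762 0.473925 1.163687
```

The printed values agree with the `Career AVG`, `Career SLG`, `Career OBP`, and `Career OPS` columns for that row. Note that this plain-Python helper would raise `ZeroDivisionError` for a player with zero at-bats, whereas the pandas column arithmetic produces `NaN` or `inf` instead.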


Next, we merge the career data with the `decade_stats` dataframe.

No career statistics are removed at this point; columns can be deselected later, either here or in Step 5.

In [166]:
mergeables = ['ID', 'Player']
merged_data = pd.merge(career_alldata, decade_stats, on=mergeables, how='left')

In [167]:
merged_data

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons,Career AVG,Career SLG,Career OBP,Career OPS,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10,AVG Y1,AVG Y2,AVG Y3,AVG Y4,AVG Y5,AVG Y6,AVG Y7,AVG Y8,AVG Y9,AVG Y10
0,willeed01,Ed Willett,704,649,56,129,25,10,5,68,25,174,6,24,0,10,0.198767,0.291217,0.235294,0.526511,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333,0.000000,0.076923,0.164179,0.190909,0.134146,0.268293,0.165217,0.282609,0.234375,0.200000
1,daussho01,Hooks Dauss,1324,1125,107,212,41,15,6,112,145,288,8,41,0,15,0.188444,0.267556,0.285603,0.553158,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387,0.250000,0.177215,0.216495,0.145631,0.222222,0.126437,0.181818,0.144330,0.170732,0.261364
2,coopewi01,Wilbur Cooper,1320,1232,111,295,34,18,6,103,46,139,4,35,0,15,0.239448,0.310877,0.269111,0.579987,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333,0.153846,0.076923,0.206522,0.114754,0.215190,0.203883,0.242105,0.295238,0.221239,0.254098
3,mamaual01,Al Mamaux,466,421,39,77,12,2,1,31,27,90,3,12,0,12,0.182898,0.228029,0.237251,0.465279,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071,0.000000,0.250000,0.163043,0.190909,0.225806,0.000000,0.174603,0.166667,0.181818,0.250000
4,ruthba01,Babe Ruth,10624,8400,2173,2872,506,137,714,2215,2063,1331,45,45,0,22,0.341905,0.689762,0.473925,1.163687,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417,0.200000,0.315217,0.268116,0.325203,0.299685,0.319444,0.375546,0.377079,0.314496,0.394231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2394,perezhe01,Hernan Perez,1846,1745,188,436,74,11,45,180,76,397,1,12,12,10,0.249857,0.382235,0.279716,0.661951,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489,0.500000,0.196970,0.200000,0.243346,0.272277,0.259259,0.253165,0.228448,0.166667,0.052632
2395,gomesya01,Yan Gomes,3274,3006,369,742,158,8,117,416,185,794,50,2,31,10,0.246840,0.421490,0.298594,0.720084,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537,0.204082,0.293515,0.278351,0.231405,0.167331,0.231672,0.265509,0.222930,0.284404,0.252149
2396,simmoan01,Andrelton Simmons,4731,4366,493,1156,200,23,70,437,297,448,26,13,29,10,0.264773,0.369217,0.313480,0.682697,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754,0.289157,0.247525,0.244444,0.265421,0.281250,0.278438,0.292419,0.263819,0.296610,0.223301
2397,mercejo03,Jordy Mercer,3416,3104,327,796,173,15,66,308,246,589,27,22,17,10,0.256443,0.385631,0.314968,0.700599,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493,0.209677,0.285285,0.254941,0.243655,0.256262,0.254980,0.251269,0.269531,0.200000,0.254237


### 6. Complete the Data Splitting by Creating Dataframes

With the career data in order and merged with the `OPS Y1` through `OPS Y10` and `AVG Y1` through `AVG Y10` columns from `decade_stats`, it is time to use the lists we created earlier in the notebook and split the data into our arbitrarily grouped dataframes.

All players who started their careers before the year 2000 will be in the first dataframe, `pre_2000`, and all players who started their careers in 2000 or later (some of whom will be currently active players) will be in the second dataframe, `from_2000`.

We already have our lists of player IDs from earlier work, so now it is a matter of merging data to make new dataframes specific to the chosen eras.

First, our pre-2000 players.

In [169]:
# Keep only the players whose careers started before 2000
pre_2000 = pd.merge(merged_data, career_pre_2000, on='ID', how='inner')
pre_2000

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons,Career AVG,Career SLG,Career OBP,Career OPS,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10,AVG Y1,AVG Y2,AVG Y3,AVG Y4,AVG Y5,AVG Y6,AVG Y7,AVG Y8,AVG Y9,AVG Y10
0,willeed01,Ed Willett,704,649,56,129,25,10,5,68,25,174,6,24,0,10,0.198767,0.291217,0.235294,0.526511,0.000000,0.219780,0.367510,0.496266,0.379030,0.774655,0.456006,0.692935,0.621154,0.533333,0.000000,0.076923,0.164179,0.190909,0.134146,0.268293,0.165217,0.282609,0.234375,0.200000
1,daussho01,Hooks Dauss,1324,1125,107,212,41,15,6,112,145,288,8,41,0,15,0.188444,0.267556,0.285603,0.553158,1.000000,0.560802,0.578379,0.470550,0.731898,0.400920,0.559740,0.379742,0.542872,0.634387,0.250000,0.177215,0.216495,0.145631,0.222222,0.126437,0.181818,0.144330,0.170732,0.261364
2,coopewi01,Wilbur Cooper,1320,1232,111,295,34,18,6,103,46,139,4,35,0,15,0.239448,0.310877,0.269111,0.579987,0.368132,0.219780,0.462535,0.260180,0.455696,0.519551,0.591622,0.679654,0.537423,0.611333,0.153846,0.076923,0.206522,0.114754,0.215190,0.203883,0.242105,0.295238,0.221239,0.254098
3,mamaual01,Al Mamaux,466,421,39,77,12,2,1,31,27,90,3,12,0,12,0.182898,0.228029,0.237251,0.465279,0.000000,0.550000,0.363387,0.473844,0.475806,0.000000,0.479365,0.464706,0.431818,0.991071,0.000000,0.250000,0.163043,0.190909,0.225806,0.000000,0.174603,0.166667,0.181818,0.250000
4,ruthba01,Babe Ruth,10624,8400,2173,2872,506,137,714,2215,2063,1331,45,45,0,22,0.341905,0.689762,0.473925,1.163687,0.500000,0.952325,0.730611,0.856730,0.967903,1.114268,1.379841,1.355617,1.103689,1.313417,0.200000,0.315217,0.268116,0.325203,0.299685,0.319444,0.375546,0.377079,0.314496,0.394231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1991,berkmla01,Lance Berkman,7814,6491,1145,1905,422,30,366,1234,1201,1300,66,1,54,15,0.293483,0.536897,0.406042,0.942939,0.707851,0.949396,1.050683,0.982479,0.927351,1.015958,0.934124,1.040773,0.896031,0.986336,0.236559,0.297450,0.331023,0.292388,0.288104,0.316176,0.292735,0.315299,0.278075,0.312274
1992,burnea.01,A.J. Burnett,575,493,20,56,8,3,4,19,20,241,2,59,1,17,0.113590,0.166329,0.151163,0.317491,0.235294,0.877143,0.215385,0.370402,0.476190,0.275862,0.450840,0.000000,0.000000,0.000000,0.117647,0.280000,0.080000,0.105263,0.142857,0.137931,0.147059,0.000000,0.000000,0.000000
1993,garcifr03,Freddy Garcia,99,77,1,12,2,0,0,4,2,26,0,19,1,13,0.155844,0.181818,0.175000,0.356818,0.500000,1.333333,0.285714,0.833333,0.400000,0.000000,0.000000,0.400000,0.594118,0.000000,0.250000,0.666667,0.142857,0.333333,0.200000,0.000000,0.000000,0.200000,0.235294,0.000000
1994,weaveje01,Jeff Weaver,239,208,13,43,6,1,0,13,6,73,1,24,0,10,0.206731,0.245192,0.232558,0.477750,1.250000,0.000000,0.000000,0.571429,0.496781,0.507143,0.378788,0.500000,0.461538,0.400000,0.500000,0.000000,0.000000,0.285714,0.214286,0.228571,0.133333,0.000000,0.230769,0.200000


Then finally, our 2000 onward players.

In [170]:
# Keep only the players whose careers started in 2000 or later
from_2000 = pd.merge(merged_data, career_2000_onward, on='ID', how='inner')
from_2000

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Number of Seasons,Career AVG,Career SLG,Career OBP,Career OPS,OPS Y1,OPS Y2,OPS Y3,OPS Y4,OPS Y5,OPS Y6,OPS Y7,OPS Y8,OPS Y9,OPS Y10,AVG Y1,AVG Y2,AVG Y3,AVG Y4,AVG Y5,AVG Y6,AVG Y7,AVG Y8,AVG Y9,AVG Y10
0,pierrju01,Juan Pierre,8280,7525,1073,2217,255,94,18,517,464,479,102,167,20,14,0.294618,0.360664,0.343114,0.703779,0.673488,0.793087,0.675184,0.733904,0.780886,0.679930,0.717426,0.684746,0.654683,0.757490,0.310000,0.327391,0.287162,0.305389,0.325959,0.275915,0.291845,0.293413,0.282667,0.307895
1,nadyxa01,Xavier Nady,3241,2969,365,797,159,7,104,410,189,626,58,7,18,12,0.268441,0.431795,0.322820,0.754615,2.000000,0.711731,0.716789,0.759777,0.790246,0.805425,0.867406,0.738916,0.659671,0.646219,1.000000,0.266846,0.246753,0.260736,0.279915,0.278422,0.304505,0.285714,0.255521,0.247573
2,moellch01,Chad Moeller,1539,1392,146,315,74,7,29,132,108,331,15,16,8,11,0.226293,0.352011,0.287590,0.639602,0.534307,0.627880,0.851913,0.769733,0.568145,0.623844,0.506279,0.407581,0.640350,0.751334,0.210938,0.232143,0.285714,0.267782,0.208202,0.206030,0.183673,0.160714,0.230769,0.258427
3,schnebr01,Brian Schneider,3570,3165,284,781,167,9,67,387,331,526,22,26,26,13,0.246761,0.368720,0.319977,0.688698,0.563379,0.859248,0.798068,0.702540,0.724134,0.739263,0.649357,0.661366,0.706638,0.626961,0.234783,0.317073,0.275362,0.229851,0.256881,0.268293,0.256098,0.235294,0.256716,0.217647
4,wellski01,Kip Wells,371,331,30,63,10,1,4,17,4,143,2,34,0,11,0.190332,0.262840,0.204748,0.467588,0.000000,0.333333,0.460317,0.482310,0.541456,0.400484,0.181818,0.716981,0.000000,0.181818,0.000000,0.166667,0.190476,0.191176,0.186047,0.157895,0.090909,0.320755,0.000000,0.090909
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
398,perezhe01,Hernan Perez,1846,1745,188,436,74,11,45,180,76,397,1,12,12,10,0.249857,0.382235,0.279716,0.661951,1.000000,0.444664,0.533333,0.583502,0.730326,0.703826,0.676495,0.641605,0.333333,0.195489,0.500000,0.196970,0.200000,0.243346,0.272277,0.259259,0.253165,0.228448,0.166667,0.052632
399,gomesya01,Yan Gomes,3274,3006,369,742,158,8,117,416,185,794,50,2,31,10,0.246840,0.421490,0.298594,0.720084,0.630983,0.825949,0.784906,0.658537,0.527451,0.707728,0.761775,0.704177,0.787218,0.722537,0.204082,0.293515,0.278351,0.231405,0.167331,0.231672,0.265509,0.222930,0.284404,0.252149
400,simmoan01,Andrelton Simmons,4731,4366,493,1156,200,23,70,437,297,448,26,13,29,10,0.264773,0.369217,0.313480,0.682697,0.750827,0.691599,0.617196,0.659624,0.689723,0.751810,0.754196,0.673284,0.702389,0.557754,0.289157,0.247525,0.244444,0.265421,0.281250,0.278438,0.292419,0.263819,0.296610,0.223301
401,mercejo03,Jordy Mercer,3416,3104,327,796,173,15,66,308,246,589,27,22,17,10,0.256443,0.385631,0.314968,0.700599,0.635674,0.771547,0.692806,0.613224,0.701352,0.732539,0.695653,0.747463,0.472727,0.671493,0.209677,0.285285,0.254941,0.243655,0.256262,0.254980,0.251269,0.269531,0.200000,0.254237
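
After a split like this, it can be worth confirming that the two dataframes are disjoint and together cover every row of the merged data. In the outputs above, `merged_data` ends at index 2397 while `pre_2000` and `from_2000` end at 1994 and 401, which would leave one row unaccounted for, so a check of this kind may be worth running on the real frames. The sketch below uses made-up data and hypothetical names (`merged`, `pre_ids`, `post_ids`).

```python
import pandas as pd

# Hypothetical stand-ins for merged_data and the two ID lists
merged = pd.DataFrame({"ID": ["a01", "b01", "c01"], "OPS": [0.7, 0.8, 0.6]})
pre_ids = pd.DataFrame({"ID": ["a01", "b01"]})
post_ids = pd.DataFrame({"ID": ["c01"]})

# Inner merges keep only the rows whose ID is in the given list
pre = pd.merge(merged, pre_ids, on="ID", how="inner")
post = pd.merge(merged, post_ids, on="ID", how="inner")

# IDs that landed in both splits (should be empty)
overlap = set(pre["ID"]) & set(post["ID"])
# IDs in the merged data that landed in neither split (should be empty)
missing = set(merged["ID"]) - set(pre["ID"]) - set(post["ID"])

print(len(merged), len(pre), len(post))
print("overlap:", overlap, "missing:", missing)
```

If `missing` is non-empty on the real data, it lists the player IDs to investigate.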


## It's Time to Save Our Three New Dataframes

Let's capture these three dataframes (`merged_data`, `pre_2000`, and `from_2000`) and have them ready for the Step 5 Notebook.


In [171]:
import os

# Create the output folder if it does not already exist
os.makedirs('./data', exist_ok=True)

alldata_csv = "./data/step4_alldata.csv"
merged_data.to_csv(alldata_csv, index=False)

pre_2000_csv = "./data/step4_pre_2000.csv"
pre_2000.to_csv(pre_2000_csv, index=False)

from_2000_csv = "./data/step4_from_2000.csv"
from_2000.to_csv(from_2000_csv, index=False)
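
A save step like this can be verified with a quick round trip. The sketch below is an illustration only: it writes a made-up frame to a temporary directory (standing in for `./data`) and reads it back; `toy`, `out_dir`, and `out_file` are hypothetical names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for one of the saved dataframes
toy = pd.DataFrame({"ID": ["a01", "b01"], "Career OPS": [0.703779, 0.754615]})

# A temporary directory stands in for ./data in this sketch
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

    out_file = out_dir / "toy.csv"
    # index=False avoids an extra unnamed index column when the file is read back
    toy.to_csv(out_file, index=False)

    reloaded = pd.read_csv(out_file)

# Compare with the default float tolerance rather than exact equality
pd.testing.assert_frame_equal(reloaded, toy)
print(list(reloaded.columns))
```

Reading the file back immediately confirms that the column names and row count survive the save before the next notebook depends on them.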

## Concluding Notebook Comments

**Note:** At this point, we will conclude this notebook for organizational purposes, as it is a logical point for saving and launching into the following experimentation and modelling based on the data prepared here.

Saving the data files in various states makes it easier to re-run parts of the overall project without having to re-run every aspect.

The purpose of this notebook is to continue the data preparation started in Step 1 by structuring and splitting the data into the state in which the experiments and modelling will be conducted.

The *next* notebook in the series is `harr2890_project_step5_ops_modelling`, where the saved data files will be loaded and used for exploration and regression modelling.