# Using Historical Data to Predict Batting Success: Step 4 of 5

Authored by: Donna J. Harris (994042890)

Email: harr2890@mylaurier.ca

For: CP640 Machine Learning (S22) with Professor Elham Harirpoush

## Notebook Series

Just a word about the presentation of this project code.

The code is organized into a series of locally executed Jupyter notebooks, one per step, which must be run in sequence to follow the flow of the entire project.

This is `harr2890_project_step4_ops_data_prep`, the fourth of five notebooks.

## *Step 4 - Data Preparation for an OPS Approach*

This notebook covers a third phase of data preparation, building on the preparation done in the Step 1 Notebook (`harr2890_project_step1_data_prep`). From here, we continue structuring and splitting the data until it is ready for the experiments and modelling based on On-base Plus Slugging (OPS), a.k.a. the OPS Approach.

Here, we will be extracting data and generating multiple seasons of the OPS statistic, preparing the data for exploration and modelling based on various **regression** techniques in a subsequent notebook.

## Environment Setup

Import and establish environment for our work, including showing all dataframe column values.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

### Pre-Conditions

Step 1 (`harr2890_project_step1_data_prep`) must be run completely before running this notebook.

The `data` folder must exist with the following prepared data file:
- `./data/core_mlb_dataset.csv`
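
A quick existence check can confirm this pre-condition before anything is loaded. The sketch below is only an illustration; the helper name `missing_files` is hypothetical and not part of the project code.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# Hypothetical check for the file this notebook expects from Step 1.
required = ["./data/core_mlb_dataset.csv"]
missing = missing_files(required)
if missing:
    print("Run the Step 1 notebook first; missing:", missing)
```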

##  Loading Prepared Data Files

Load in the Major League Baseball batting data (`./data/core_mlb_dataset.csv`) so we can continue with preparing this data.


In [2]:
core_mlb_dataset = "./data/core_mlb_dataset.csv"
df = pd.read_csv(core_mlb_dataset)
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
0,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
1,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
2,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
3,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
4,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3715517,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
3715518,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
3715519,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
3715520,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021


## Preprocessing (Continued from the Step 1 Notebook)

### The OPS Approach

A statistic that has become a very important measurement of a batter's productivity and efficacy at the plate is the On-base Plus Slugging (OPS) statistic, which (as the name implies) is the sum of two other batting statistics:  On-Base Percentage (OBP) and Slugging Percentage (SLG). Generally speaking, an OPS value that is close to (or over) 1.000 indicates extremely good batting performance.

Because the overarching goal is to use Major League Baseball data to predict batting success, OPS prediction has the potential to be a useful approach.

By using OPS values from multiple seasons, we will use **regression** techniques to predict the future OPS values of a batter.
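
For reference, the definitions used by the calculations in this notebook can be written as follows, where each symbol is the dataset column of the same name (`H` counts all hits, so the SLG numerator equals total bases):

$$
\mathrm{SLG} = \frac{\mathrm{H} + \mathrm{2B} + 2\,(\mathrm{3B}) + 3\,(\mathrm{HR})}{\mathrm{AB}}
\qquad
\mathrm{OBP} = \frac{\mathrm{H} + \mathrm{BB} + \mathrm{HBP}}{\mathrm{AB} + \mathrm{BB} + \mathrm{HBP} + \mathrm{SF}}
$$

$$
\mathrm{OPS} = \mathrm{OBP} + \mathrm{SLG}
$$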

### Overview of Tasks

To prepare the data for this approach, we will do the following:
1.  Generate seasonal batting statistics for player data, including calculated statistics per season.
2.  Gather the first ten seasons of statistics for all players, including capturing seasonal OPS statistics.
3.  Calculate the career OPS for the players with ten or more seasons, using all seasons on record and not just the first ten.
4.  Combine the seasonal OPS statistics with the Career OPS in the same dataframe.
5.  Create the dataframe filled with OPS batting data.


The end result will be one labelled dataframe containing player OPS data across multiple seasons and their career OPS.

This will be stored and used in the Step 5 Notebook work.

### 1. Seasonal Batting Statistics

First, we group the game-based statistics by player and season.

In [3]:
filterable = ['ID', 'Player', 'Season']
# Only the numeric counting stats are summed; the grouping keys are excluded.
columns = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']
group_alldata = df.groupby(filterable)
group_alldata = group_alldata[columns].sum().copy()
group_alldata

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0


Then, we'll remove the grouping so we have independent seasonal statistic records for each player's season.

In [4]:
seasonal_data = group_alldata.reset_index()
seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0
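
The group-then-flatten pattern used above can be seen on a toy game log. This is a minimal sketch with made-up numbers and only two stat columns; it also shows `as_index=False`, one possible way to get the flat table in a single step.

```python
import pandas as pd

# Toy game log: two players, made-up numbers for illustration only.
games = pd.DataFrame({
    'ID':     ['p1', 'p1', 'p1', 'p2'],
    'Player': ['Player One', 'Player One', 'Player One', 'Player Two'],
    'Season': [2001, 2001, 2002, 2001],
    'AB':     [4, 3, 5, 4],
    'H':      [2, 1, 0, 1],
})

# Sum the counting stats per player-season; as_index=False keeps the
# grouping keys as ordinary columns, so no reset_index() is needed.
seasons = games.groupby(['ID', 'Player', 'Season'], as_index=False)[['AB', 'H']].sum()
print(seasons)
```

Player One's two 2001 games collapse into a single 2001 row, while the 2002 game stays a separate record.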


Next, because we're interested in seasonal statistics -- especially OPS -- we need to calculate those values seasonally for each player.

**Note:** The statement and simplification of these statistical calculations are outlined in detail in the ***Step 2 Notebook***, section 4.

In [5]:
seasonal_data['AVG'] = seasonal_data['H'] / (seasonal_data['AB']*1.0)
seasonal_data['SLG'] = (seasonal_data['H'] + seasonal_data['2B'] + 2*seasonal_data['3B'] + 3*seasonal_data['HR']) / (seasonal_data['AB']*1.0)
seasonal_data['OBP'] = (seasonal_data['H'] + seasonal_data['BB'] + seasonal_data['HBP']) / ((seasonal_data['AB'] + seasonal_data['BB'] + seasonal_data['HBP'] + seasonal_data['SF'])*1.0) 
seasonal_data['OPS'] = seasonal_data['SLG'] + seasonal_data['OBP']

seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,0.279915,0.446581,0.322068,0.768649
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,0.313953,0.539867,0.366261,0.906129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,0.185185,0.222222,0.214286,0.436508
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,0.117647,0.117647,0.166667,0.284314
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,0.071429,0.071429,0.133333,0.204762
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,0.222222,0.444444,0.300000,0.744444
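
As a check on the formulas, the snippet below recomputes Henry Aaron's 1954 line by hand from the counting statistics in the table above. The variable names are illustrative only.

```python
# Counting stats for Henry Aaron's 1954 season, copied from the table above.
h, doubles, triples, hr = 131, 27, 6, 13
ab, bb, hbp, sf = 468, 28, 3, 4

# H already counts every hit once, so extra-base hits add 1, 2 or 3 more bases.
slg = (h + doubles + 2 * triples + 3 * hr) / ab
obp = (h + bb + hbp) / (ab + bb + hbp + sf)
ops = slg + obp

print(round(slg, 6), round(obp, 6), round(ops, 6))
# -> 0.446581 0.322068 0.768649
```

These agree with the `SLG`, `OBP` and `OPS` values in his 1954 row.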


Because we've completed a number of calculations, we should look for null or NaN values. (Because we crafted this data in previous steps, we know that any null values detected are newly introduced.)

In [6]:
seasonal_data.isnull().values.any()

True

As anticipated, there are null values from those calculations, so let's find them and resolve the issues.
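
The nulls come from seasons with a zero denominator: pandas evaluates `0 / 0` as NaN rather than raising an error. A minimal sketch on toy data, which also shows a per-column count of missing values as an alternative to the single True/False check:

```python
import pandas as pd

# Toy data: the second row has no at-bats, as in a one-walk or one-bunt season.
toy = pd.DataFrame({'H': [1, 0], 'AB': [4, 0]})
toy['AVG'] = toy['H'] / toy['AB']   # second row is 0 / 0 -> NaN

# Count missing values per column rather than a single True/False.
print(toy.isna().sum())
```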

Let's look at `'AVG'`...

In [7]:
avg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['AVG'])]
avg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
75,abernte01,Ted Abernathy,1942,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,
1361,anderla02,Larry Andersen,1994,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
1445,andrena01,Nate Andrews,1937,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,
69326,yeabsbe01,Bert Yeabsley,1919,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
69766,zannido01,Dom Zanni,1959,1,0,0,0,0,0,0,0,1,0,0,0,0,,,1.0,
69807,zavadcl01,Clay Zavada,2009,1,0,0,0,0,0,0,0,0,0,0,1,0,,,,


... and resolve by setting to 0.0, then confirm.

In [8]:
series = avg_is_nan.index
seasonal_data.loc[series, 'AVG'] = 0.0

avg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['AVG'])]
avg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


`'AVG'` is addressed, so let's look at `'SLG'`...

In [9]:
slg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['SLG'])]
slg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
75,abernte01,Ted Abernathy,1942,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,
1361,anderla02,Larry Andersen,1994,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
1445,andrena01,Nate Andrews,1937,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,
69326,yeabsbe01,Bert Yeabsley,1919,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
69766,zannido01,Dom Zanni,1959,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,,1.0,
69807,zavadcl01,Clay Zavada,2009,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,,,


... and resolve by setting to 0.0, then confirm.

In [10]:
series = slg_is_nan.index
seasonal_data.loc[series, 'SLG'] = 0.0

slg_is_nan = seasonal_data.loc[pd.isna(seasonal_data['SLG'])]
slg_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


`'SLG'` is addressed, so let's look at `'OBP'`...

In [11]:
obp_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OBP'])]
obp_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
1847,arroylu01,Luis Arroyo,1963,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
2179,avilalu01,Luis Avilan,2013,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
2180,avilalu01,Luis Avilan,2014,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68213,wilsoga03,Gary Wilson,1995,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
68486,winklda01,Dan Winkler,2019,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
69153,wuertmi01,Michael Wuertz,2006,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,,


... and resolve by setting to 0.0, then confirm.

In [12]:
series = obp_is_nan.index
seasonal_data.loc[series, 'OBP'] = 0.0

obp_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OBP'])]
obp_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


`'OBP'` is addressed, so, let's finish this up and look at `'OPS'`...

In [13]:
ops_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OPS'])]
ops_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
75,abernte01,Ted Abernathy,1942,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
636,alfonan01,Antonio Alfonseca,2003,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,
844,almanar01,Armando Almanza,2004,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,
1361,anderla02,Larry Andersen,1994,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
1445,andrena01,Nate Andrews,1937,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69277,yancyhu01,Hugh Yancy,1974,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,
69326,yeabsbe01,Bert Yeabsley,1919,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
69766,zannido01,Dom Zanni,1959,1,0,0,0,0,0,0,0,1,0,0,0,0,0.0,0.0,1.0,
69807,zavadcl01,Clay Zavada,2009,1,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,


... and resolve by recalculating by adding `'SLG'` + `'OBP'`...

In [14]:
series = ops_is_nan.index

# Recompute OPS for the affected rows in one vectorized assignment.
seasonal_data.loc[series, 'OPS'] = seasonal_data.loc[series, 'SLG'] + seasonal_data.loc[series, 'OBP']

... and then confirm.

In [15]:
ops_is_nan = seasonal_data.loc[pd.isna(seasonal_data['OPS'])]
ops_is_nan

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS


And lastly, let's confirm we addressed all of the issues.

In [16]:
seasonal_data.isnull().values.any()

False

Success. The calculated statistics are now filled out in the `seasonal_data` table.

In [17]:
seasonal_data

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
2,aardsda01,David Aardsma,2015,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
3,aaronha01,Henry Aaron,1954,509,468,58,131,27,6,13,69,28,39,3,6,4,0.279915,0.446581,0.322068,0.768649
4,aaronha01,Henry Aaron,1955,665,602,106,189,37,9,27,106,49,61,3,7,4,0.313953,0.539867,0.366261,0.906129
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70023,zuverge01,George Zuverink,1955,29,27,1,5,1,0,0,0,1,7,0,1,0,0.185185,0.222222,0.214286,0.436508
70024,zuverge01,George Zuverink,1956,22,17,0,2,0,0,0,2,1,7,0,4,0,0.117647,0.117647,0.166667,0.284314
70025,zuverge01,George Zuverink,1957,17,14,1,1,0,0,0,0,1,5,0,2,0,0.071429,0.071429,0.133333,0.204762
70026,zuverge01,George Zuverink,1958,10,9,0,2,0,1,0,2,1,2,0,0,0,0.222222,0.444444,0.300000,0.744444
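
The column-by-column clean-up above makes each fix easy to inspect. For comparison, a more compact version of the same idea is sketched below on toy data; it is an alternative under the same assumption that missing rate statistics should become 0.0, not the code used in this notebook.

```python
import pandas as pd

nan = float('nan')
toy = pd.DataFrame({
    'AVG': [nan, nan, 0.25],
    'SLG': [nan, nan, 0.50],
    'OBP': [1.0, nan, 0.25],
})

# Fill the three rate statistics in one step...
rate_cols = ['AVG', 'SLG', 'OBP']
toy[rate_cols] = toy[rate_cols].fillna(0.0)

# ...then rebuild OPS from the cleaned columns so it cannot stay NaN.
toy['OPS'] = toy['SLG'] + toy['OBP']
print(toy)
```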


### 2.  Gathering the First Ten Seasons of Statistics

To get all player IDs and names for the new dataframe, first sort the seasonal data so that the earliest seasons played come first.

In [18]:
seasonal_sorted = seasonal_data.sort_values(by=['Season'], ascending=True)
seasonal_sorted

Unnamed: 0,ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
21668,gatinfr01,Frank Gatins,1901,207,198,21,45,7,2,1,21,5,27,2,2,0,0.227273,0.297980,0.253659,0.551638
7533,brownfr01,Fred Brown,1901,15,14,1,2,0,0,0,2,0,2,0,1,0,0.142857,0.142857,0.142857,0.285714
40448,mccange01,Gene McCann,1901,13,10,2,0,0,0,0,0,2,4,0,1,0,0.000000,0.000000,0.166667,0.166667
15398,denzero01,Roger Denzer,1901,25,22,0,2,1,0,0,1,2,8,1,0,0,0.090909,0.136364,0.200000,0.336364
30788,jennihu01,Hughie Jennings,1901,345,302,38,79,21,2,1,39,25,25,12,6,0,0.261589,0.354305,0.342183,0.696488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42312,mileywa01,Wade Miley,2021,59,54,9,10,3,0,0,3,3,13,0,2,0,0.185185,0.240741,0.228070,0.468811
10254,castier01,Erick Castillo,2021,9,8,0,2,0,0,0,0,1,1,0,0,0,0.250000,0.250000,0.333333,0.583333
18738,feltnry01,Ryan Feltner,2021,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000
22364,gittech01,Chris Gittens,2021,44,36,1,4,0,0,1,5,7,13,0,0,1,0.111111,0.194444,0.250000,0.444444


Start building the `ten_season_stats` dataframe, starting with creating a list of all the players in the seasonal dataframe.

In [19]:
ten_season_stats = seasonal_sorted.reset_index()
ten_season_stats = ten_season_stats[['ID', 'Player']].copy()
ten_season_stats = ten_season_stats.drop_duplicates(subset=['ID'], keep='first')
ten_season_stats

Unnamed: 0,ID,Player
0,gatinfr01,Frank Gatins
1,brownfr01,Fred Brown
2,mccange01,Gene McCann
3,denzero01,Roger Denzer
4,jennihu01,Hughie Jennings
...,...,...
70019,baragca01,Caleb Baragar
70024,castier01,Erick Castillo
70025,feltnry01,Ryan Feltner
70026,gittech01,Chris Gittens
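
The sort-then-deduplicate step can be illustrated on a toy table. This sketch uses hypothetical player IDs and a stable sort so that ties keep their original order; `keep='first'` then retains each player's earliest row.

```python
import pandas as pd

toy = pd.DataFrame({
    'ID':     ['p2', 'p1', 'p1', 'p2', 'p3'],
    'Season': [1905, 1903, 1904, 1906, 1904],
})

# Order by season, then keep only the first (earliest) row per player.
ordered = toy.sort_values(by=['Season'], kind='stable')
players = ordered.drop_duplicates(subset=['ID'], keep='first')
print(players)
```

Each player appears exactly once, listed in order of debut season.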


Next, build out placeholder columns for 10 years of seasonal OPS data.

In [20]:
ten_season_stats = ten_season_stats.reindex(columns = ten_season_stats.columns.tolist() 
                                  + [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
ten_season_stats

Unnamed: 0,ID,Player,1,2,3,4,5,6,7,8,9,10
0,gatinfr01,Frank Gatins,,,,,,,,,,
1,brownfr01,Fred Brown,,,,,,,,,,
2,mccange01,Gene McCann,,,,,,,,,,
3,denzero01,Roger Denzer,,,,,,,,,,
4,jennihu01,Hughie Jennings,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
70019,baragca01,Caleb Baragar,,,,,,,,,,
70024,castier01,Erick Castillo,,,,,,,,,,
70025,feltnry01,Ryan Feltner,,,,,,,,,,
70026,gittech01,Chris Gittens,,,,,,,,,,
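
`reindex` adds any requested column that does not yet exist and fills it with NaN, which is what creates the empty placeholders. A small sketch with three placeholder seasons instead of ten, using made-up players:

```python
import pandas as pd

players = pd.DataFrame({'ID': ['p1', 'p2'],
                        'Player': ['Player One', 'Player Two']})

# Three placeholder seasons instead of ten, to keep the output short.
season_cols = list(range(1, 4))
wide = players.reindex(columns=players.columns.tolist() + season_cols)
print(wide)
```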


Collect the seasonal OPS values for the first ten seasons of every player who has ten or more seasons on record.

**NOTE:** This step takes a bit more time than most to generate the values.

In [None]:
first_n_seasons = 10

for player_id in ten_season_stats['ID']:
    # seasonal_sorted is in chronological order, so this slice is the player's earliest seasons.
    n_seasons = seasonal_sorted[seasonal_sorted['ID'] == player_id][0:first_n_seasons]

    # Only fill in players with a full ten seasons on record.
    if len(n_seasons) == first_n_seasons:
        ten_season_ops_list = n_seasons['OPS'].to_list()

        # Placeholder columns are numbered 1..10, hence start=1.
        for season_number, ops_value in enumerate(ten_season_ops_list, start=1):
            ten_season_stats.loc[ten_season_stats['ID'] == player_id, season_number] = ops_value

ten_season_stats

At first glance, the table may not look populated. That is expected: many players have fewer than ten seasons of data, so their rows are left empty.

To make this more evident, we'll remove rows with NaN values. (That is, we'll remove the players with fewer than ten seasons in the Major Leagues.)

In [None]:
ten_season_stats = ten_season_stats.dropna().copy()
ten_season_stats
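
The loop above filters the full table once per player, which is why it is slow. One possible vectorized alternative is sketched below on toy data with `n = 3` seasons instead of ten; it numbers each player's seasons with `groupby().cumcount()`, pivots to one row per player, and drops incomplete rows. This is an illustration under those assumptions, not the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'ID':     ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'Season': [2001, 2002, 2003, 2001, 2002, 1999, 2000, 2001],
    'OPS':    [0.7, 0.8, 0.9, 0.5, 0.6, 0.65, 0.75, 0.85],
})
n = 3  # stand-in for first_n_seasons = 10

# Number each player's seasons 1, 2, 3, ... in chronological order.
toy = toy.sort_values(['ID', 'Season'])
toy['season_no'] = toy.groupby('ID').cumcount() + 1

# Keep the first n seasons and spread them into one row per player.
first_n = toy[toy['season_no'] <= n]
wide = first_n.pivot(index='ID', columns='season_no', values='OPS')

# Players with fewer than n seasons have NaN in the later columns.
wide = wide.dropna()
print(wide)
```

Player `b` has only two seasons, so its row is dropped, mirroring the `dropna()` step above.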

Now, renaming the columns will make their contents more obvious.

In [None]:
ten_season_stats.rename(columns = {1:'OPS Y1', 2:'OPS Y2', 3:'OPS Y3', 4:'OPS Y4', 5:'OPS Y5',
                            6:'OPS Y6', 7:'OPS Y7', 8:'OPS Y8', 9:'OPS Y9', 10:'OPS Y10'
                            }, inplace = True)
ten_season_stats
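
The rename mapping follows a regular pattern, so it could also be generated rather than typed out. A small sketch on a toy frame with three seasons:

```python
import pandas as pd

wide = pd.DataFrame({'ID': ['p1'], 1: [0.7], 2: [0.8], 3: [0.9]})

# Build {1: 'OPS Y1', 2: 'OPS Y2', ...} instead of writing each pair by hand.
new_names = {i: f'OPS Y{i}' for i in range(1, 4)}
wide = wide.rename(columns=new_names)
print(wide.columns.tolist())
```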

###  3. Calculate Career OPS

One thing still missing is the Career OPS statistic for these players. We cannot derive it from the `ten_season_stats` dataframe because that covers only the first ten seasons, which is not the full career for most of these players.

This process will be similar to that done in the Step 2 Notebook, throughout sections 1, 2, and 4.

First, we group the game-based statistics by player and season.

In [None]:
filterable = ['ID', 'Player', 'Season']
# Only the numeric counting stats are summed; the grouping keys are excluded.
columns = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']
group_alldata = df.groupby(filterable)
group_alldata = group_alldata[columns].sum().copy()
group_alldata

Then, we'll remove the grouping so we have independent seasonal statistic records for each player's season.

In [None]:
group_alldata = group_alldata.reset_index()
group_alldata

Instead of the filtering we did in the Step 2 Notebook, we want to filter on the player IDs in the `ten_season_stats` dataframe, so we'll make that list now.

In [None]:
ten_season_stats_names = ten_season_stats[['ID']].copy()
ten_season_stats_names

Merge the career data with `ten_season_stats_names` to filter out players we aren't interested in.

In [None]:
career_alldata = pd.merge(ten_season_stats_names, group_alldata, on='ID', how='left')
career_alldata
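
A left merge against a one-column table of IDs acts as a filter: only seasons belonging to the listed players survive. The sketch below shows this on toy data and compares it with an `isin` mask, an equivalent alternative when every listed ID is present in the seasonal data.

```python
import pandas as pd

keep = pd.DataFrame({'ID': ['p1', 'p3']})
seasons = pd.DataFrame({
    'ID':     ['p1', 'p1', 'p2', 'p3'],
    'Season': [2001, 2002, 2001, 2001],
    'H':      [10, 20, 30, 40],
})

# Filter by merging: rows for 'p2' have no partner in `keep` and are left out.
via_merge = pd.merge(keep, seasons, on='ID', how='left')

# Same filter expressed as a boolean mask.
via_isin = seasons[seasons['ID'].isin(keep['ID'])].reset_index(drop=True)

print(via_merge)
```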

Sum all the career statistics for these players...

In [None]:
career_stats = career_alldata.copy()
del career_stats['Season']
career_stats = career_stats.groupby(['ID','Player']).sum()
career_stats

... then generate calculated batting statistics, in order to find Career OPS. (**Note:** we omitted `'AVG'`  as it is not necessary for calculating `'OPS'` and will not be used in the Step 5 Notebook.)

In [None]:
#career_stats['Career AVG'] = career_stats['H'] / (career_stats['AB']*1.0)
career_stats['Career SLG'] = (career_stats['H'] + career_stats['2B'] + 2*career_stats['3B'] + 3*career_stats['HR']) / (career_stats['AB']*1.0)
career_stats['Career OBP'] = (career_stats['H'] + career_stats['BB'] + career_stats['HBP']) / ((career_stats['AB'] + career_stats['BB'] + career_stats['HBP'] + career_stats['SF'])*1.0) 
career_stats['Career OPS'] = career_stats['Career SLG'] + career_stats['Career OBP']

career_stats
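
Summing the counting statistics first matters: career OPS is weighted by playing time, so it generally differs from the plain average of seasonal OPS values. A toy illustration with made-up numbers (singles only, no walks); the helper `ops` is defined here only for the example.

```python
import pandas as pd

# Two made-up seasons: a long good one and a short poor one.
seasons = pd.DataFrame({
    'AB': [400, 100], 'H': [120, 10],
    '2B': [0, 0], '3B': [0, 0], 'HR': [0, 0],
    'BB': [0, 0], 'HBP': [0, 0], 'SF': [0, 0],
})


def ops(s):
    """OPS from counting stats; works on a DataFrame or a summed Series."""
    slg = (s['H'] + s['2B'] + 2 * s['3B'] + 3 * s['HR']) / s['AB']
    obp = (s['H'] + s['BB'] + s['HBP']) / (s['AB'] + s['BB'] + s['HBP'] + s['SF'])
    return slg + obp


mean_of_seasons = ops(seasons).mean()   # average of 0.6 and 0.2
career = ops(seasons.sum())             # 130 hits in 500 at-bats

print(round(mean_of_seasons, 3), round(career, 3))
```

The unweighted mean comes out at 0.4, while the career figure computed from totals is 0.52.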

Then, we'll remove the grouping so we have independent career statistics for each player.

In [None]:
career_stats = career_stats.reset_index()
career_stats

### 4. Merge Career Data with the Seasonal OPS

Prepare a dataframe with `'Career OPS'` to merge with the seasonal OPS data.

In [None]:
career_ops = career_stats[['ID', 'Player', 'Career OPS']].copy()
career_ops

Merge to get our dataframe ready for Step 5 and modelling.

In [None]:
mergeables = ['ID', 'Player']
merged_data = pd.merge(career_ops, ten_season_stats, on=mergeables, how='left')
merged_data

One last sanity check for null values.

In [None]:
merged_data.isnull().values.any()
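
Beyond the null check, `pd.merge` itself can help verify a merge through its `validate` and `indicator` parameters. A minimal sketch on toy data, assuming one row per player on each side:

```python
import pandas as pd

career = pd.DataFrame({'ID': ['p1', 'p2'], 'Career OPS': [0.75, 0.5]})
seasonal = pd.DataFrame({'ID': ['p1', 'p2'], 'OPS Y1': [0.5, 0.25]})

# validate raises MergeError if either side has duplicate keys;
# indicator adds a '_merge' column saying where each row came from.
merged = pd.merge(career, seasonal, on='ID', how='left',
                  validate='one_to_one', indicator=True)

# Rows that found no partner on the right-hand side.
unmatched = merged[merged['_merge'] != 'both']
print(len(unmatched))
```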

## It's Time to Save The OPS Seasonal Data

In [None]:
import os

# Create the data folder if needed; exist_ok avoids a separate existence check.
os.makedirs('./data', exist_ok=True)

ops_data_csv = "./data/step4_ops_data.csv"
merged_data.to_csv(ops_data_csv, index=False)
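
Because the next notebook depends on this file, it can be worth confirming that a saved frame reads back unchanged. The sketch below demonstrates the round trip on a toy frame in a temporary directory, so it does not touch the project's `data` folder.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'ID': ['p1', 'p2'], 'Career OPS': [0.75, 0.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'step4_ops_data.csv')
    frame.to_csv(path, index=False)   # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

# Raises AssertionError if columns, dtypes or values differ.
pd.testing.assert_frame_equal(frame, reloaded)
```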

## Concluding Notebook Comments

**Note:** At this point, we will conclude this notebook for organizational purposes.

Saving the data files in various states makes it easier to re-run parts of the overall project without having to re-run every aspect.

The **purpose of this notebook** is to prepare the OPS data for use in the next notebook.

**The *next* notebook in the series is: `harr2890_project_step5_ops_modelling`,** where the saved data file will be loaded and experimentation and modelling will take place using **the OPS Approach**.