# Using Historical Data to Predict Batting Success

Authored by: Donna J. Harris (994042890)

Email: harr2890@mylaurier.ca

For: CP640 Machine Learning (S22) with Professor Elham Harirpoush

## Introductory Comments

Based on the project proposal of the same name, this Jupyter Notebook explores Major League Baseball batting data from 1901 to 2021 to determine whether, and how, historical data can be used to predict batting success.

The Kaggle dataset being used as the primary data source can be found in the `data` folder of the project folder structure: `./data/mlbbatting1901-2021.csv`

Each data record in the original dataset represents an individual batter's performance in a single game. This is why there are so many records. In a single game, there will be at least 18 batters with plate appearances across both teams, and often more with player substitutions, especially across extra innings.

## Environment Setup

Import and establish environment for initial work, including showing all dataframe column values.

In [1]:
import pandas as pd

pd.set_option('display.max_columns', None)

## Preprocessing

### Original Data

Acquire the data from the Kaggle dataset and place into a dataframe.

In [2]:
original_data_source = "./data/mlbbatting1901-2021.csv"

df = pd.read_csv(original_data_source)

Confirm the original data has been loaded into the data frame. (It should start with the earliest records from 1901 and end with the latest records from 2021.)

In [3]:
print(df)

                ID            Player        Date   Tm  Opp    Rslt  PA  AB  R  \
0        crossmo01       Monte Cross  1901-04-18  PHI  BRO  L 7-12   5   4  2   
1        dahlebi01       Bill Dahlen  1901-04-18  BRO  PHI  W 12-7   5   4  2   
2         dalyto01          Tom Daly  1901-04-18  BRO  PHI  W 12-7   5   5  1   
3        davisle01       Lefty Davis  1901-04-18  BRO  PHI  W 12-7   5   5  1   
4        delahed01      Ed Delahanty  1901-04-18  PHI  BRO  L 7-12   5   4  1   
...            ...               ...         ...  ...  ...     ...  ..  .. ..   
4285624  woodfja01     Jake Woodford  2021-10-03  STL  CHC   L 2-3   2   1  0   
4285625  yastrmi01  Mike Yastrzemski  2021-10-03  SFG  SDP  W 11-4   4   3  1   
4285626  zimmebr01    Bradley Zimmer  2021-10-03  CLE  TEX   W 6-0   4   4  1   
4285627  zimmery01    Ryan Zimmerman  2021-10-03  WSN  BOS   L 5-7   4   3  0   
4285628  zuninmi01       Mike Zunino  2021-10-03  TBR  NYY   L 0-1   4   4  0   

         H  2B  3B  HR  RBI

In [4]:
print(df.columns)

Index(['ID', 'Player', 'Date', 'Tm', 'Opp', 'Rslt', 'PA', 'AB', 'R', 'H', '2B',
       '3B', 'HR', 'RBI', 'BB', 'IBB', 'SO', 'HBP', 'SH', 'SF', 'ROE', 'GDP',
       'SB', 'CS', 'WPA', 'RE24', 'aLI', 'BOP', 'Pos Summary', 'DFS(DK)',
       'DFS(FD)'],
      dtype='object')


We have our confirmation that there are 31 feature columns and 4,285,629 data records.

### Extract Data

There is some data that might be of use to us that is trapped in existing columns. First, we want to extract the result of the game and the score of the game from the `'Rslt'` column. Then, we want to extract the year the game was played (denoting the season) from the `'Date'` column.

In [5]:
df[['Result','Score']] = df['Rslt'].str.split(' ', expand=True)

df['Season'] = df['Date'].str[:4]
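
A minimal sketch of the same extraction on a three-row sample, assuming every `'Rslt'` value has the form `<W|L> <score>` and every `'Date'` value is a `YYYY-MM-DD` string, as in the records printed above. The `sample` frame and the use of `pd.to_datetime` are illustrative choices rather than the method used in this notebook; parsing the date yields an integer season directly instead of a four-character string.

```python
import pandas as pd

# Three rows copied from the printed records above (illustrative sample only)
sample = pd.DataFrame({
    "Rslt": ["L 7-12", "W 12-7", "W 6-0"],
    "Date": ["1901-04-18", "1901-04-18", "2021-10-03"],
})

# n=1 splits on the first space only, so exactly two columns come back
sample[["Result", "Score"]] = sample["Rslt"].str.split(" ", n=1, expand=True)

# Parse the date and keep the year as an integer season
sample["Season"] = pd.to_datetime(sample["Date"]).dt.year

print(sample[["Result", "Score", "Season"]])
```
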

### Column Removal

Several columns in the original dataset are of no use to this project, and we know this before going any further.

We no longer need `'Rslt'`, as we just split its interesting information into separate columns. Likewise with `'Date'`, we have what we need in the new `'Season'` column, so we can remove the `'Date'` column.

In [6]:
del df['Rslt']

In [7]:
del df['Date']

Daily fantasy sports points (used for fantasy leagues and betting) have no purpose within this project, so we can safely remove `'DFS(DK)'` and `'DFS(FD)'` from the data.

In [8]:
del df['DFS(DK)']
del df['DFS(FD)']

Similarly, to reduce complexity, we are not considering any statistics relating to fielding or base running/stealing. As such, we can remove `'SB'` and `'CS'`, which are the number of stolen bases and times caught stealing, respectively, as well as the `'Pos Summary'` (position summary) data.

In [9]:
del df['SB']
del df['CS']
del df['Pos Summary']

To further reduce complexity, we will remove the `'IBB'` (intentional walks) column as this is a subset of values tracked under the walks column (`'BB'`).  (Reference: https://en.wikipedia.org/wiki/Base_on_balls#Intentional_base_on_balls)

In [10]:
del df['IBB']

We can remove the `'GDP'` column, which represents the number of times a player hits into a double play (two outs). While this statistic has some bearing on the success of a batter, for this project we will exclude this nuance and focus on run production and getting on base through other statistical means.

In [11]:
del df['GDP']

Similarly to `'GDP'`, we will disregard the `'ROE'` column, which represents the number of times a player reaches base due to a fielding error by the opposing team. Like `'GDP'`, this statistic has some bearing on the success of a batter, but it leans more toward the player's running ability and street smarts, as well as a bit of luck. Again, for this project, these nuances will be excluded for simplicity.

In [12]:
del df['ROE']
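
The individual `del` statements in this section could also be collected into a single call. The sketch below assumes only the column names printed earlier and applies `DataFrame.drop` to an empty frame that mirrors the column layout after the extraction step. The names `layout`, `extracted_columns`, and `columns_to_drop` are introduced here for illustration.

```python
import pandas as pd

# Column names as printed for the original dataset
original_columns = [
    "ID", "Player", "Date", "Tm", "Opp", "Rslt", "PA", "AB", "R", "H", "2B",
    "3B", "HR", "RBI", "BB", "IBB", "SO", "HBP", "SH", "SF", "ROE", "GDP",
    "SB", "CS", "WPA", "RE24", "aLI", "BOP", "Pos Summary", "DFS(DK)",
    "DFS(FD)",
]
extracted_columns = ["Result", "Score", "Season"]

# Empty frame with the same layout as the data after the extraction step
layout = pd.DataFrame(columns=original_columns + extracted_columns)

# Every column removed in this section, listed once
columns_to_drop = [
    "Rslt", "Date", "DFS(DK)", "DFS(FD)", "SB", "CS",
    "Pos Summary", "IBB", "GDP", "ROE",
]
trimmed = layout.drop(columns=columns_to_drop)

print(len(trimmed.columns), list(trimmed.columns))
```

Keeping the names in one list makes it easy to count how many columns were removed and to compare the result with the printed `df.columns`.
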

Before continuing on to value checking, let's look and see where the data is at after these processing operations.

In [13]:
print(df)

                ID            Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
0        crossmo01       Monte Cross  PHI  BRO   5   4  2  2   1   0   0  1.0   
1        dahlebi01       Bill Dahlen  BRO  PHI   5   4  2  3   0   0   0  0.0   
2         dalyto01          Tom Daly  BRO  PHI   5   5  1  2   1   0   0  3.0   
3        davisle01       Lefty Davis  BRO  PHI   5   5  1  1   0   0   0  0.0   
4        delahed01      Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0  0.0   
...            ...               ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
4285624  woodfja01     Jake Woodford  STL  CHC   2   1  0  0   0   0   0  0.0   
4285625  yastrmi01  Mike Yastrzemski  SFG  SDP   4   3  1  1   1   0   0  2.0   
4285626  zimmebr01    Bradley Zimmer  CLE  TEX   4   4  1  2   0   0   0  1.0   
4285627  zimmery01    Ryan Zimmerman  WSN  BOS   4   3  0  0   0   0   0  1.0   
4285628  zuninmi01       Mike Zunino  TBR  NYY   4   4  0  0   0   0   0  0.0   

         BB  SO  HBP  SH   

In [14]:
print(df.columns)

Index(['ID', 'Player', 'Tm', 'Opp', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR',
       'RBI', 'BB', 'SO', 'HBP', 'SH', 'SF', 'WPA', 'RE24', 'aLI', 'BOP',
       'Result', 'Score', 'Season'],
      dtype='object')


We added three new columns (`'Result'`, `'Score'`, and `'Season'`) -- which we may or may not need later -- and also removed ten existing columns. So, this looks correct: 31 original features + 3 new features - 10 removed features = 24 feature columns, matching the 24 names printed above.

We still have 4,285,629 data records, as we have not removed any records yet.

### Value Checking and Data Validation

**ID - Player ID**

Should be unique to each player. This will indicate how many players we currently have, based on player ID.

In [15]:
print(pd.unique(df['ID']))

print("\nNumber of unique player IDs: ", len(pd.unique(df['ID'])))

['crossmo01' 'dahlebi01' 'dalyto01' ... 'adonjo01' 'paynety01' 'stridsp01']

Number of unique player IDs:  15985


**Player - Player Name**

The number of unique names should be roughly the same as the number of player IDs. Discrepancies are possible (errors in spelling, etc.), but this is not highly significant for our statistics since we will key everything on the more reliable Player ID. We will hang onto this field to help identify players by name.

In [16]:
print(pd.unique(df['Player']))

print("\nNumber of unique player names: ", len(pd.unique(df['Player'])))

['Monte Cross' 'Bill Dahlen' 'Tom Daly' ... 'Joan Adon' 'Tyler Payne'
 'Spencer Strider']

Number of unique player names:  15595


**Note:** We found there were 15,985 unique IDs and 15,595 unique player names, which is a difference of 390 in favour of the IDs. This could easily be explained by different players over the years with the same name. This is a nuance that will be disregarded for this project. As previously stated, we will use the ID for all but human identification purposes.
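
One way to test the "same name, different player" explanation is to count distinct IDs per name. This is a sketch on made-up records: the IDs and names below are hypothetical and do not come from the dataset, and `ids_per_name` and `shared_names` are names introduced here.

```python
import pandas as pd

# Hypothetical records: two different players share the name "John Smith"
sample = pd.DataFrame({
    "ID": ["smithjo01", "smithjo02", "doeja01", "smithjo01"],
    "Player": ["John Smith", "John Smith", "Jane Doe", "John Smith"],
})

# Number of distinct player IDs behind each name
ids_per_name = sample.groupby("Player")["ID"].nunique()

# Names that map to more than one ID
shared_names = ids_per_name[ids_per_name > 1]

print("Unique IDs:  ", sample["ID"].nunique())
print("Unique names:", sample["Player"].nunique())
print(shared_names)
```

Applied to the full dataframe, the same grouping would list the names that account for the gap between the two counts.
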

**Season - Year the game was played in**
(Extracted from `'Date'`)

The values should all be visibly in YYYY format between the years of 1901 and 2021, inclusive.

Ideally, these values will be Integers -- but they have started as strings.

In [17]:
print(pd.unique(df['Season']))

['1901' '1902' '1903' '1904' '1905' '1906' '1907' '1908' '1909' '1910'
 '1911' '1912' '1913' '1914' '1915' '1916' '1917' '1918' '1919' '1920'
 '1921' '1922' '1923' '1924' '1925' '1926' '1927' '1928' '1929' '1930'
 '1931' '1932' '1933' '1934' '1935' '1936' '1937' '1938' '1939' '1940'
 '1941' '1942' '1943' '1944' '1945' '1946' '1947' '1948' '1949' '1950'
 '1951' '1952' '1953' '1954' '1955' '1956' '1957' '1958' '1959' '1960'
 '1961' '1962' '1963' '1964' '1965' '1966' '1967' '1968' '1969' '1970'
 '1971' '1972' '1973' '1974' '1975' '1976' '1977' '1978' '1979' '1980'
 '1981' '1982' '1983' '1984' '1985' '1986' '1987' '1988' '1989' '1990'
 '1991' '1992' '1993' '1994' '1995' '1996' '1997' '1998' '1999' '2000'
 '2001' '2002' '2003' '2004' '2005' '2006' '2007' '2008' '2009' '2010'
 '2011' '2012' '2013' '2014' '2015' '2016' '2017' '2018' '2019' '2020'
 '2021']


These values are all visibly four-digit years, so let's convert them to actual integers in the dataframe.

In [18]:
df['Season'] = df['Season'].astype(int)
# test
print(pd.unique(df['Season']))

[1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914
 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928
 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942
 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
 2013 2014 2015 2016 2017 2018 2019 2020 2021]
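
The visual inspection above can also be expressed as a programmatic check. The following is a sketch on a small sample, assuming the season strings should be exactly four digits and fall between 1901 and 2021 inclusive; the variable names are illustrative.

```python
import pandas as pd

# A few season strings in the same form as the extracted column
seasons = pd.Series(["1901", "1950", "2021"])

# Every value must consist of exactly four digits
looks_like_year = seasons.str.fullmatch(r"\d{4}")

# After conversion, every value must fall inside the dataset's span
as_int = seasons.astype(int)
in_range = as_int.between(1901, 2021)

all_valid = bool(looks_like_year.all() and in_range.all())
print("All seasons valid:", all_valid)
```
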


**Tm - Player's Team** and **Opp - Opponent**

The team values (both the player's team and the opposing team) should all be visibly in ZZZ format, belonging to a recognizable team between the 1901-2021 seasons. These are fields that will likely get dropped later on, but I'm keeping them until I know for sure I don't want them.

In [19]:
print("Player's Team (Tm):\n", pd.unique(df['Tm']))
print("\nOpposing Team (Opp):\n", pd.unique(df['Opp']))

Player's Team (Tm):
 ['PHI' 'BRO' 'BSN' 'NYG' 'STL' 'CHC' 'PIT' 'CIN' 'CLE' 'CHW' 'MLA' 'DET'
 'PHA' 'WSH' 'BOS' 'BLA' 'SLB' 'NYY' 'BUF' 'BAL' 'PBS' 'BTT' 'CHI' 'IND'
 'SLM' 'KCP' 'NEW' 'MLN' 'KCA' 'LAD' 'SFG' 'WSA' 'MIN' 'LAA' 'HOU' 'NYM'
 'CAL' 'ATL' 'OAK' 'KCR' 'MON' 'SDP' 'SEP' 'MIL' 'TEX' 'SEA' 'TOR' 'FLA'
 'COL' 'ANA' 'ARI' 'TBD' 'WSN' 'TBR' 'MIA']

Opposing Team (Opp):
 ['BRO' 'PHI' 'NYG' 'BSN' 'CHC' 'STL' 'CIN' 'PIT' 'CHW' 'CLE' 'DET' 'MLA'
 'WSH' 'PHA' 'BLA' 'BOS' 'SLB' 'NYY' 'BAL' 'BUF' 'BTT' 'PBS' 'KCP' 'SLM'
 'IND' 'CHI' 'NEW' 'MLN' 'KCA' 'SFG' 'LAD' 'WSA' 'MIN' 'LAA' 'HOU' 'NYM'
 'CAL' 'ATL' 'OAK' 'MON' 'SDP' 'KCR' 'SEP' 'MIL' 'TEX' 'SEA' 'TOR' 'COL'
 'FLA' 'ANA' 'ARI' 'TBD' 'WSN' 'TBR' 'MIA']


**PA - Plate Appearances**

Plate Appearances should be an integer value, in the range from 1 to some upper value. (A 0 would indicate the player didn't bat in the game, which would mean there should not be a record.)

The upper value will vary, although (speaking as a baseball fan) five plate appearances is pretty standard in a regular, nine-inning, low scoring game. But as soon as you get into higher scores and/or extra inning games, players can be up to bat many times.

In [20]:
print(pd.unique(df['PA']))

[ 5  4  3  1  2  6  7  8  9 10 11 12]


**Note:** 12 was the upper value. I'm curious which era these are from, so let's take a look:

In [21]:
print(df[df['PA'] == 12])

                ID         Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
2062446  millafe01   Felix Millan  NYM  STL  12  10  1  4   0   0   0  0.0   
2062447  milnejo01    John Milner  NYM  STL  12  10  0  2   1   0   0  1.0   
2458599  baineha01  Harold Baines  CHW  MIL  12  10  1  2   1   0   1  1.0   
2458643   fiskca01   Carlton Fisk  CHW  MIL  12  11  1  3   1   0   0  1.0   
2458686    lawru01       Rudy Law  CHW  MIL  12  11  1  4   0   0   0  1.0   

         BB  SO  HBP  SH   SF    WPA   RE24    aLI  BOP Result Score  Season  
2062446   1   0    0   1  0.0  0.060  0.422  1.894    2      L   3-4    1974  
2062447   2   3    0   0  0.0 -0.250 -0.399  2.147    4      L   3-4    1974  
2458599   2   0    0   0  0.0  0.195 -0.083  2.204    5      W   7-6    1984  
2458643   1   3    0   0  0.0 -0.237 -0.649  2.455    2      W   7-6    1984  
2458686   1   0    0   0  0.0  0.511  1.816  1.958    1      W   7-6    1984  
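
The same lookup can be written without hard-coding the value 12. This sketch uses a few values copied from the printed records, selects the rows at the column maximum, and counts them per season; `longest` and `per_season` are names introduced here.

```python
import pandas as pd

# Values copied from the printed records above, plus one ordinary game
sample = pd.DataFrame({
    "Player": ["Felix Millan", "John Milner", "Harold Baines", "Monte Cross"],
    "PA": [12, 12, 12, 5],
    "Season": [1974, 1974, 1984, 1901],
})

# Rows whose plate appearances equal the column maximum
max_pa = sample["PA"].max()
longest = sample[sample["PA"] == max_pa]

# How many such records fall in each season
per_season = longest.groupby("Season").size()
print(per_season)
```
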


**AB - At Bats**

Similarly to Plate Appearances, At Bats should be an integer value, in the range from 0 to some upper value. (Here, a 0 would indicate that none of the player's plate appearances statistically counted as an At Bat, for example because each one ended in a walk.)

The upper value should not exceed the upper value of Plate Appearances, which was 12. Note that 12 is possible, but does not have to be a value in this collection of data.

In [22]:
print(pd.unique(df['AB']))

[ 4  5  3  2  1  6  0  7  8  9 10 11]


**Note:** 11 was the upper value, which is less than 12 (the maximum number of plate appearances).

We should validate that there are no records where there are more At Bats than Plate Appearances.

In [23]:
print("-----------------------------------------")
print("All records with PA < AB")
print("-----------------------------------------")
print(df[df['PA'] < df['AB']])

-----------------------------------------
All records with PA < AB
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**Note:** The validation checks pass, as we have found no records where PA < AB. No extra investigation or validation is required here.
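
Since the same "one column must never exceed another" comparison recurs for several statistics, it could be wrapped in a small helper. This is a sketch only: `find_violations` is a hypothetical function, not part of pandas or of this notebook, and the sample values are invented, with the last row deliberately invalid.

```python
import pandas as pd

def find_violations(frame, total, part):
    """Return the rows of `frame` where column `part` exceeds column `total`."""
    return frame[frame[total] < frame[part]]

# Invented sample; the last row is deliberately invalid (AB > PA)
sample = pd.DataFrame({"PA": [5, 4, 1], "AB": [4, 4, 2]})

bad = find_violations(sample, "PA", "AB")
print(bad)
```

An empty result means the check passes, exactly as with the explicit comparison above.
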

**R - Runs**

Runs should also be an integer value, in the range of 0 and some upper value. The upper value can be, at most, one larger than the number of plate appearances. In general, that's 12+1 for this dataset, but that's a highly unlikely value to see as number of runs. (We'll address individual records in the next step.)

In [24]:
print(pd.unique(df['R']))

[2 1 0 4 3 5 6]


**Note:** We want to do some validation within each individual data record to look for instances where the number of plate appearances is less than the number of runs (e.g., when a player pinch runs for a teammate and then scores a run they have 0 plate appearances and 1 run). And within that subset of records, look for instances where there is more than a difference of one between the `'PA'` and `'R'` values. (If we find any instances with a difference larger than one, we may have a data issue.)

In [25]:
records_pa_lt_r = df[df['PA'] < df['R']]
print("-----------------------------------------")
print("All records with PA < R")
print("-----------------------------------------")
print(records_pa_lt_r)

print("\n\n-----------------------------------------")
print("All records with PA < R where PA != R-1")
print("-----------------------------------------")
print(records_pa_lt_r[ records_pa_lt_r['PA'] != ((records_pa_lt_r['R']-1)) ])


-----------------------------------------
All records with PA < R
-----------------------------------------
                ID          Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
97903     lowebo01      Bobby Lowe  DET  WSH   1   1  2  1   1   0   0  0.0   
203202   stanljo02     Joe Stanley  CHC  BSN   1   0  2  0   0   0   0  0.0   
221883   collibi02    Bill Collins  BSN  STL   1   1  2  1   0   0   0  NaN   
223671   keelewi01   Willie Keeler  NYG  CHC   1   1  2  1   0   0   0  NaN   
230698   butlear01      Art Butler  BSN  PHI   1   1  2  0   0   0   0  NaN   
...            ...             ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
4245067  hamilbi02  Billy Hamilton  CHW  MIN   1   1  2  0   0   0   0  0.0   
4256455  dubonma01  Mauricio Dubon  SFG  PHI   1   1  2  1   0   0   0  0.0   
4262325  whiteel04       Eli White  TEX  OAK   1   1  2  0   0   0   0  0.0   
4264040   kempto01       Tony Kemp  OAK  LAA   1   1  2  1   0   0   0  0.0   
4268197  davisjo05  Jon

**Note:** The validation checks pass, as we have found no records where PA < R and PA != R-1. No extra investigation or validation is required here.

**H - Hits**

Hits should also be an integer value, in the range of 0 and some upper value. The upper value can be, at most, the number of plate appearances. In general, that's 12 for this dataset, but that's a highly unlikely value to see as number of hits. (We'll address individual records in the next step.)

In [26]:
pd.unique(df['H'])

array([2, 3, 1, 0, 4, 5, 6, 9, 7])

**Note:** We should also confirm that there are never more hits than plate appearances within individual records.

In [27]:
records_pa_lt_h = df[df['PA'] < df['H']]
print("-----------------------------------------")
print("All records with PA < H")
print("-----------------------------------------")
print(records_pa_lt_h)

-----------------------------------------
All records with PA < H
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**Note:** The validation checks pass, as we have found no records where PA < H. No extra investigation or validation is required here.

**2B - Doubles**, **3B - Triples**, and **HR - Home Runs**

All of these extra base hits must be integer values, within the range of 0 up to the number of Plate Appearances.

In [28]:
print("2B: ", pd.unique(df['2B']))
print("3B: ", pd.unique(df['3B']))
print("HR: ", pd.unique(df['HR']))

2B:  [1 0 2 4 3]
3B:  [0 1 3 2]
HR:  [0 1 2 3 4]


**Note:** These are all reasonable values at a glance.

We should also confirm that these values are never greater than the number of Plate Appearances, At Bats, or Hits within individual records. (Note that Hits (`'H'`) represents all kinds of hits, not just single-base hits.)

In [29]:
records_pa_lt_2b = df[df['PA'] < df['2B']]
print("-----------------------------------------")
print("All records with PA < 2B")
print("-----------------------------------------")
print(records_pa_lt_2b)


records_pa_lt_3b = df[df['PA'] < df['3B']]
print("\n-----------------------------------------")
print("All records with PA < 3B")
print("-----------------------------------------")
print(records_pa_lt_3b)


records_pa_lt_hr = df[df['PA'] < df['HR']]
print("\n-----------------------------------------")
print("All records with PA < HR")
print("-----------------------------------------")
print(records_pa_lt_hr)


records_ab_lt_2b = df[df['AB'] < df['2B']]
print("\n-----------------------------------------")
print("All records with AB < 2B")
print("-----------------------------------------")
print(records_ab_lt_2b)


records_ab_lt_3b = df[df['AB'] < df['3B']]
print("\n-----------------------------------------")
print("All records with AB < 3B")
print("-----------------------------------------")
print(records_ab_lt_3b)


records_ab_lt_hr = df[df['AB'] < df['HR']]
print("\n-----------------------------------------")
print("All records with AB < HR")
print("-----------------------------------------")
print(records_ab_lt_hr)


records_h_lt_2b = df[df['H'] < df['2B']]
print("\n-----------------------------------------")
print("All records with H < 2B")
print("-----------------------------------------")
print(records_h_lt_2b)


records_h_lt_3b = df[df['H'] < df['3B']]
print("\n-----------------------------------------")
print("All records with H < 3B")
print("-----------------------------------------")
print(records_h_lt_3b)


records_h_lt_hr = df[df['H'] < df['HR']]
print("\n-----------------------------------------")
print("All records with H < HR")
print("-----------------------------------------")
print(records_h_lt_hr)

-----------------------------------------
All records with PA < 2B
-----------------------------------------
                ID          Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
1275367  robined01  Eddie Robinson  CHW  SLB   1   1  0  0   2   0   0  0.0   

         BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score  Season  
1275367   0   0    0   0 NaN  NaN   NaN  NaN    4      L  6-10    1950  

-----------------------------------------
All records with PA < 3B
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with PA < HR
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with 

**Note:** These validation checks caught a couple of problematic records. Looking at the entirety of these records, it looks like a simple data error, where the number of Hits needs to be updated to reflect the number of extra base hits. Instead of removing all records for these players, we will make these small data adjustments.

**First for Ed Robinson:**

In [30]:
ed_robinson = 1275367

df.loc[[ed_robinson]]

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,WPA,RE24,aLI,BOP,Result,Score,Season
1275367,robined01,Eddie Robinson,CHW,SLB,1,1,0,0,2,0,0,0.0,0,0,0,0,,,,,4,L,6-10,1950


For **Ed Robinson's record** (1275367), we will use the doubles (`'2B'`) statistic value as the value for Hits, Plate Appearances, and At Bats. These are reasonable guesses, based on the record.

In [31]:
df.at[ed_robinson, 'H'] = df.at[ed_robinson, '2B']
df.at[ed_robinson, 'AB'] = df.at[ed_robinson, '2B']
df.at[ed_robinson, 'PA'] = df.at[ed_robinson, '2B']

df.loc[[ed_robinson]]

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,WPA,RE24,aLI,BOP,Result,Score,Season
1275367,robined01,Eddie Robinson,CHW,SLB,2,2,0,2,2,0,0,0.0,0,0,0,0,,,,,4,L,6-10,1950
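
The three `df.at` assignments above could equally be written as one `.loc` assignment over several columns. This sketch rebuilds only the relevant fields of the record, using the values printed above before the correction; the one-row `sample` frame is illustrative.

```python
import pandas as pd

# Relevant fields of the record, with the values shown before the correction
row = 1275367
sample = pd.DataFrame({"PA": [1], "AB": [1], "H": [0], "2B": [2]}, index=[row])

# Copy the doubles count into Hits, At Bats and Plate Appearances in one step
sample.loc[row, ["H", "AB", "PA"]] = sample.at[row, "2B"]

print(sample)
```
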


**Next for Joe Tipton:**

In [32]:
joe_tipton = 1272928

df.loc[[joe_tipton]]

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,WPA,RE24,aLI,BOP,Result,Score,Season
1272928,tiptojo01,Joe Tipton,PHA,CHW,3,2,0,0,0,1,0,0.0,1,1,0,0,,,,,8,L,3-10,1950


For **Joe Tipton's record** (1272928), we will use the triples (`'3B'`) statistic value as the value for Hits. The statistics for Plate Appearances and At Bats may well be correct; because they are not obviously wrong, we won't change them. This is a reasonable guess, based on the record.

In [33]:
df.at[joe_tipton, 'H'] = df.at[joe_tipton, '3B']

df.loc[[joe_tipton]]

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,WPA,RE24,aLI,BOP,Result,Score,Season
1272928,tiptojo01,Joe Tipton,PHA,CHW,3,2,0,1,0,1,0,0.0,1,1,0,0,,,,,8,L,3-10,1950


**Re-testing the original checks**

Re-running the tests that found these data issues should now pass and not introduce any new issues.

In [34]:
records_pa_lt_2b = df[df['PA'] < df['2B']]
print("-----------------------------------------")
print("All records with PA < 2B")
print("-----------------------------------------")
print(records_pa_lt_2b)


records_pa_lt_3b = df[df['PA'] < df['3B']]
print("\n-----------------------------------------")
print("All records with PA < 3B")
print("-----------------------------------------")
print(records_pa_lt_3b)


records_pa_lt_hr = df[df['PA'] < df['HR']]
print("\n-----------------------------------------")
print("All records with PA < HR")
print("-----------------------------------------")
print(records_pa_lt_hr)


records_ab_lt_2b = df[df['AB'] < df['2B']]
print("\n-----------------------------------------")
print("All records with AB < 2B")
print("-----------------------------------------")
print(records_ab_lt_2b)


records_ab_lt_3b = df[df['AB'] < df['3B']]
print("\n-----------------------------------------")
print("All records with AB < 3B")
print("-----------------------------------------")
print(records_ab_lt_3b)


records_ab_lt_hr = df[df['AB'] < df['HR']]
print("\n-----------------------------------------")
print("All records with AB < HR")
print("-----------------------------------------")
print(records_ab_lt_hr)


records_h_lt_2b = df[df['H'] < df['2B']]
print("\n-----------------------------------------")
print("All records with H < 2B")
print("-----------------------------------------")
print(records_h_lt_2b)


records_h_lt_3b = df[df['H'] < df['3B']]
print("\n-----------------------------------------")
print("All records with H < 3B")
print("-----------------------------------------")
print(records_h_lt_3b)


records_h_lt_hr = df[df['H'] < df['HR']]
print("\n-----------------------------------------")
print("All records with H < HR")
print("-----------------------------------------")
print(records_h_lt_hr)

-----------------------------------------
All records with PA < 2B
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with PA < 3B
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with PA < HR
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with AB < 2B
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA

**SUCCESS!!** These data issues have been resolved. `'2B'`, `'3B'`, and `'HR'` are now validated.
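
The nine pairwise comparisons used for these checks follow one pattern, so they could be generated in a loop that reports only the failing pairs. In this sketch the first two sample rows carry the pre-correction values of the two problematic records and the third is an ordinary record; the `violations` dictionary is a name introduced here.

```python
from itertools import product

import pandas as pd

# Rows 0 and 1 hold the pre-correction values of the two problem records;
# row 2 is an ordinary, consistent record
sample = pd.DataFrame({
    "PA": [1, 3, 5],
    "AB": [1, 2, 4],
    "H":  [0, 0, 2],
    "2B": [2, 0, 1],
    "3B": [0, 1, 0],
    "HR": [0, 0, 0],
})

# Count the rows that violate each "total >= part" rule
violations = {}
for total, part in product(["PA", "AB", "H"], ["2B", "3B", "HR"]):
    count = int((sample[total] < sample[part]).sum())
    if count:
        violations[f"{total} < {part}"] = count

print(violations)
```

After the corrections are applied, the dictionary would be empty.
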

**RBI - RBIs (Runs Batted In)**

This is an important statistic, especially for calculated batting statistics which may prove useful later.

RBIs should be an integer value, in the range of 0 and some upper value. The upper value is dependent on the number of runners on base at the time of the plate appearance, which we do not know directly from the data. We can estimate a maximum possible value of Plate Appearances * 4 (the maximum number of runs possible to bat in). This maximum value would be statistically possible but highly unlikely. But it means that if we see a value higher than 12 * 4 = 48 in a single game then it is definitely out of range.


In [35]:
print(pd.unique(df['RBI']))

[ 1.  0.  3.  4.  2.  5.  6.  8.  7. nan  9. 12. 11. 10.]


**Note:** Of course, we don't see anything nearly as extreme as 48, but we have multiple problems here: (1) RBI is stored as a Float value, which makes no sense in this context, and (2) we have a NaN/undefined value to deal with.

While it would be nice to convert to integer values, the presence of NaN values blocks our carrying out this operation. So, first, we have to make a decision about how to deal with the NaN values.
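
For context, the sketch below shows why the plain conversion fails, along with pandas' nullable `Int64` dtype, which can hold integers alongside missing values. This alternative is not the approach taken in this notebook, and it only changes the storage type: the missing values themselves still need a decision.

```python
import pandas as pd

# Float RBI values with one missing entry, as in the column above
rbi = pd.Series([1.0, 0.0, float("nan"), 3.0])

# A plain integer cast cannot represent NaN and raises an error
try:
    rbi.astype(int)
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# The nullable integer dtype keeps the gap as a missing value
rbi_nullable = rbi.astype("Int64")

print("Plain cast failed:", plain_cast_failed)
print(rbi_nullable)
```
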

Let's look at how many records include these NaN values and how many unique players are impacted by this data issue.

In [36]:
rbi_is_nan = df.loc[pd.isna(df['RBI'])]
print(rbi_is_nan)
print()

print("\nUnique Players with this RBI-NaN problem:")
rbi_nan_players = pd.unique(rbi_is_nan['ID'])
print(rbi_nan_players)
print("Count: ",len(rbi_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(rbi_is_nan)/len(df)*100)
print("There are",len(rbi_nan_players),"players impacted by these",len(rbi_is_nan),"records.")
print("")

               ID           Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
41584   becklja01     Jake Beckley  CIN  PIT   4   4  2  2   0   0   0  NaN   
41594   corcoto01   Tommy Corcoran  CIN  PIT   4   4  0  0   0   0   0  NaN   
41598   donlimi01      Mike Donlin  CIN  PIT   4   4  1  1   0   0   0  NaN   
41615   kellejo01       Joe Kelley  CIN  PIT   4   3  0  1   0   0   0  NaN   
41621   magooge01    George Magoon  CIN  PIT   3   3  0  1   0   0   0  NaN   
...           ...              ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
476024  wambsbi01  Bill Wambsganss  CLE  SLB   4   4  0  1   0   0   0  NaN   
476025  weavebu01      Buck Weaver  CHW  DET   5   4  1  3   0   2   0  NaN   
476027  wilkiro01    Roy Wilkinson  CHW  DET   3   3  1  1   1   0   0  NaN   
476029   woodjo02   Smoky Joe Wood  CLE  SLB   4   4  0  1   0   0   0  NaN   
476031  youngra01      Ralph Young  DET  CHW   5   4  0  2   0   0   0  NaN   

        BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Re

While the number of overall records impacted is less than 1%, the affected players have many more records than just the ones with this problem, which means that it is safest to remove all records for these players -- not only the problematic ones. Since all of the impacted data records are from seasons early in the 20th century, this does not raise any great concerns for this project.

In [37]:
# the removal: drop every record belonging to a player with at least one NaN RBI
df = df[~df['ID'].isin(rbi_nan_players)]

# test
print("\nTest for an empty list of RBI==NaN values, after removing rows:\n\n", df.loc[pd.isna(df['RBI'])])
print("\n\nReduced to",len(df),"records in main data frame after removing those players records entirely.")


Test for an empty list of RBI==NaN values, after removing rows:

 Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


Reduced to 3715522 records in main data frame after removing those players records entirely.


**Finally,** we can address our other issue: converting the type for RBI from float to integer.

In [38]:
df['RBI'] = df['RBI'].astype(int)
# test
print(pd.unique(df['RBI']))

[ 0  1  3  2  4  5  6  7  8 12  9 11 10]


**SUCCESS!!** No NaN values and all Integer values. And all of these are reasonable values at a glance.
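
The whole RBI clean-up (find the affected players, remove all of their records, then convert the type) can be traced on a small example. The IDs and values below are hypothetical; only the sequence of operations mirrors the steps above.

```python
import pandas as pd

# Hypothetical players: "a01" has one record with a missing RBI value
sample = pd.DataFrame({
    "ID":  ["a01", "a01", "b01", "c01", "c01"],
    "RBI": [1.0, float("nan"), 2.0, 0.0, 3.0],
})

# Players with at least one missing RBI value
affected = sample.loc[sample["RBI"].isna(), "ID"].unique()

# Drop every record of those players, including their complete ones
kept = sample[~sample["ID"].isin(affected)].copy()

# With no NaN left, the float column converts cleanly to integers
kept["RBI"] = kept["RBI"].astype(int)

print(kept)
```

Note that the valid first record of player `a01` is removed as well, which is the trade-off accepted above.
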

We should also confirm that RBI values are never greater than four times the number of Plate Appearances within individual records, the upper bound estimated above. (Note: There are many different scenarios for earning an RBI, but they all require a corresponding Plate Appearance.)

In [39]:
records_pa4_lt_rbi = df[df['PA']*4 < (df['RBI']) ]
print("\n-----------------------------------------")
print("All records with PA*4 < RBI")
print("-----------------------------------------")
print(records_pa4_lt_rbi)


-----------------------------------------
All records with PA*4 < RBI
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**Note:** All RBI-related data issues have been resolved and our check within a record has passed. `'RBI'` is now validated.

**BB - Base On Balls (Walks)**

Walks should be an integer value, within the range of 0 up to the number of Plate Appearances.

In [40]:
pd.unique(df['BB'])

array([1, 0, 2, 3, 4, 5, 6])

**Note:** These are all reasonable values at a glance.

We should also confirm that Walk values are never greater than the number of Plate Appearances within individual records.

In [41]:
records_pa_lt_bb = df[df['PA'] < (df['BB']) ]
print("\n-----------------------------------------")
print("All records with PA < BB")
print("-----------------------------------------")
print(records_pa_lt_bb)


-----------------------------------------
All records with PA < BB
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**HBP - Hit By Pitch**

Hit By Pitch should be an integer value, within the range of 0 up to the number of Plate Appearances.

Reference: https://en.wikipedia.org/wiki/Base_on_balls
HBP is **not** recorded as a walk/BB. ("A hit by pitch is not counted statistically as a walk, though the effect is mostly the same, with the batter receiving a free pass to first base.")

In [42]:
pd.unique(df['HBP'])

array([0, 1, 2, 3])

**Note:** These are all reasonable values at a glance.

We should also confirm that HBP values are never greater than the number of Plate Appearances within individual records.

In [43]:
records_pa_lt_hbp = df[df['PA'] < (df['HBP']) ]
print("\n-----------------------------------------")
print("All records with PA < HBP")
print("-----------------------------------------")
print(records_pa_lt_hbp)


-----------------------------------------
All records with PA < HBP
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**SO - Strikeouts**

Strikeouts should be an integer value, within the range of 0 up to the number of Plate Appearances.

In [44]:
print(pd.unique(df['SO']))

[0 2 1 3 4 5 6]


**Note:** These are all reasonable values at a glance.

We should also confirm that Strikeout values are never greater than the number of Plate Appearances or At Bats within individual records.

In [45]:
records_pa_lt_so = df[df['PA'] < (df['SO']) ]
print("\n-----------------------------------------")
print("All records with PA < SO")
print("-----------------------------------------")
print(records_pa_lt_so)

records_ab_lt_so = df[df['AB'] < (df['SO']) ]
print("\n-----------------------------------------")
print("All records with AB < SO")
print("-----------------------------------------")
print(records_ab_lt_so)


-----------------------------------------
All records with PA < SO
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with AB < SO
-----------------------------------------
                ID       Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  BB  \
1252967  frienow01  Owen Friend  SLB  NYY   2   0  1  0   0   0   0    0   1   

         SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score  Season  
1252967   1    0   1 NaN  NaN   NaN  NaN    8      L  9-11    1950  


**Note:** We found one record with a strikeout but no at bats -- which doesn't make sense.

In [46]:
owen_friend = 1252967
df.loc[[owen_friend]]

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,WPA,RE24,aLI,BOP,Result,Score,Season
1252967,frienow01,Owen Friend,SLB,NYY,2,0,1,0,0,0,0,0,1,1,0,1,,,,,8,L,9-11,1950


Looking closer, we can make a minor data adjustment for this record and assume that there should be one At Bat instead of 0. In this game, the player struck out once, which means there was at least one At Bat.

Note: we also see a Sacrifice Hit (`'SH'`) listed. Sacrifice Hits and Sacrifice Flies (`'SF'`) do not count as times At Bat. They do, however, count as Plate Appearances.

These statistics demonstrate there must have been at least one At Bat. There was also one Walk (`'BB'`), which should mean there were at least three Plate Appearances, between the Sacrifice Hit, the Walk, and the Strikeout.

The most reasonable guess here is to set PA = 3 and AB = 1.
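
The reasoning above leans on how a plate appearance is accounted for: each one ends as an At Bat, a Walk, a Hit By Pitch, a Sacrifice Hit, or a Sacrifice Fly. A minimal sketch of that arithmetic for this one record follows. It assumes the missing SF value is 0 and ignores rare outcomes such as catcher's interference; the variable names are illustrative only.

```python
# Values from the Owen Friend record above.
# SF is NaN in that record and is assumed to be 0 here.
bb, hbp, sh, sf, so = 1, 0, 1, 0, 1

# A strikeout is charged as an At Bat, so AB must be at least SO.
min_ab = so

# Each plate appearance falls into exactly one of these categories
# (rare outcomes such as catcher's interference are ignored in this sketch).
min_pa = min_ab + bb + hbp + sh + sf

print("Minimum AB:", min_ab)
print("Minimum PA:", min_pa)
```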

In [47]:
df.at[owen_friend, 'PA'] = 3
df.at[owen_friend, 'AB'] = 1

df.loc[[owen_friend]]

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,WPA,RE24,aLI,BOP,Result,Score,Season
1252967,frienow01,Owen Friend,SLB,NYY,3,1,1,0,0,0,0,0,1,1,0,1,,,,,8,L,9-11,1950


Re-Testing:

In [48]:
records_pa_lt_so = df[df['PA'] < (df['SO']) ]
print("\n-----------------------------------------")
print("All records with PA < SO")
print("-----------------------------------------")
print(records_pa_lt_so)

records_ab_lt_so = df[df['AB'] < (df['SO']) ]
print("\n-----------------------------------------")
print("All records with AB < SO")
print("-----------------------------------------")
print(records_ab_lt_so)


-----------------------------------------
All records with PA < SO
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with AB < SO
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**SH - Sacrifice Hits**

Sacrifice Hits should be an integer value, within the range of 0 up to the number of Plate Appearances.

Note: Sacrifice plays do not count against the batter and, as such, don't count as At Bats.

References:
https://www.mlb.com/glossary/standard-stats/sacrifice-bunt
https://www.baseball-reference.com/bullpen/Sacrifice_hit

In [49]:
pd.unique(df['SH'])

array([0, 1, 2, 3, 4])

**Note:** These are all reasonable values at a glance.

We should also confirm that Sacrifice Hits values are never greater than the number of Plate Appearances within individual records.

In [50]:
records_pa_lt_sh = df[df['PA'] < (df['SH']) ]
print("\n-----------------------------------------")
print("All records with PA < SH")
print("-----------------------------------------")
print(records_pa_lt_sh)


-----------------------------------------
All records with PA < SH
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**SF - Sacrifice Flies**

Sacrifice Flies should be an integer value, within the range of 0 up to the number of Plate Appearances.

Note: Sacrifice plays do not count against the batter and, as such, don't count as At Bats.

Reference:
https://www.mlb.com/glossary/standard-stats/sacrifice-fly

In [51]:
pd.unique(df['SF'])

array([nan,  0.,  1.,  2.,  3.])

**Note:** We find that our Sacrifice Flies data is non-integer -- as with the RBI data processing, this is because there are NaN values present.

We will take the same approach for investigation, record removal, and conversion as with RBI.

In [52]:
sf_is_nan = df.loc[pd.isna(df['SF'])]
print(sf_is_nan)
print()

print("\nUnique Players with this SF-NaN problem:")
sf_nan_players = pd.unique(sf_is_nan['ID'])
print(sf_nan_players)
print("Count: ",len(sf_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(sf_is_nan)/len(df)*100)
print("There are",len(sf_nan_players),"players impacted by these",len(sf_is_nan),"records.")
print("")

                ID           Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
4        delahed01     Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0    0   
5        dolanjo02        Joe Dolan  PHI  BRO   5   5  0  1   0   0   0    1   
21       childcu01     Cupid Childs  CHC  STL   5   5  1  1   0   0   0    0   
22       crolifr01     Fred Crolius  BSN  NYG   4   4  0  0   0   0   0    1   
28       delahed01     Ed Delahanty  PHI  BRO   4   4  0  0   0   0   0    0   
...            ...              ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
1355415  wyrosjo01  Johnny Wyrostek  PHI  BRO   4   4  0  1   0   0   0    0   
1355416   yosted01       Eddie Yost  WSH  PHA   5   3  1  2   1   0   0    1   
1355417  youngbo01      Bobby Young  SLB  CHW   4   4  0  0   0   0   0    0   
1355418  zernigu01      Gus Zernial  PHA  WSH   5   4  2  3   0   0   0    3   
1670929  bankser01      Ernie Banks  CHC  LAD   1   1  0  0   0   0   0    0   

         BB  SO  HBP  SH  SF    WPA   R

We see here that a much larger percentage of our dataset is impacted (21%), so before removing all of these records and all of the player data associated, we need to consider the potential importance of the Sacrifice Flies feature for the project.

Removing the SF column itself is not an option if we want to calculate the OBP (On Base Percentage) statistic.

Rather than leaving NaN values, we will replace them with zeroes. This won't be accurate in all cases (since Sacrifice Flies haven't always been tracked), but the worst case is that the affected records underreport this statistic. Because SF appears only in the denominator of OBP, that makes OBP slightly higher for those records than it would otherwise be. Still, it is the best we can do without completely removing the data.
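
To make the effect of this choice concrete, here is a small sketch on made-up numbers (not taken from the dataset). It fills missing SF values with 0 and then compares OBP with and without one sacrifice fly, using the OBP formula given later in this notebook. Note that SF appears only in the denominator of that formula, so recording 0 in place of a real sacrifice fly nudges OBP up rather than down. The helper name `obp` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'H':   [2, 1],
    'AB':  [4, 3],
    'BB':  [1, 0],
    'HBP': [0, 0],
    'SF':  [np.nan, 1.0],   # first row stands in for a record with untracked SF
})

# Replace missing SF values with 0, then convert the column to integer.
toy['SF'] = toy['SF'].fillna(0).astype(int)

def obp(h, ab, bb, hbp, sf):
    """On Base Percentage: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

obp_with_sf = obp(h=1, ab=3, bb=0, hbp=0, sf=1)     # 1 / 4
obp_without_sf = obp(h=1, ab=3, bb=0, hbp=0, sf=0)  # 1 / 3
print(toy['SF'].tolist(), obp_with_sf, obp_without_sf)
```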

In [87]:
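# Index labels of the records where SF is NaN (found in the previous cell)
series_of_sf = sf_is_nan.index

# Replace the missing values with 0, then convert the column to integer
df.loc[series_of_sf, 'SF'] = 0
df['SF'] = df['SF'].astype(int)
df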

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
4,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
5,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
21,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
22,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
28,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285624,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
4285625,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
4285626,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
4285627,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021


In [86]:
pd.unique(df['SF'])

array([0, 1, 2, 3])

In [55]:
records_pa_lt_sf = df[df['PA'] < (df['SF']) ]
print("\n-----------------------------------------")
print("All records with PA < SF")
print("-----------------------------------------")
print(records_pa_lt_sf)


-----------------------------------------
All records with PA < SF
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**WPA - Win Probability Added**


Reference:
https://en.wikipedia.org/wiki/Win_probability_added

In [56]:
pd.unique(df['WPA'])

array([   nan,  0.03 , -0.042, ...,  0.974,  0.947, -0.655])

In [57]:
wpa_is_nan = df.loc[pd.isna(df['WPA'])]
print(wpa_is_nan)
print()

print("\nUnique Players with this WPA-NaN problem:")
wpa_nan_players = pd.unique(wpa_is_nan['ID'])
print(wpa_nan_players)
print("Count: ",len(wpa_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(wpa_is_nan)/len(df)*100)
print("There are",len(wpa_nan_players),"players impacted by these",len(wpa_is_nan),"records.")
print("")

                ID         Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
4        delahed01   Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0    0   
5        dolanjo02      Joe Dolan  PHI  BRO   5   5  0  1   0   0   0    1   
21       childcu01   Cupid Childs  CHC  STL   5   5  1  1   0   0   0    0   
22       crolifr01   Fred Crolius  BSN  NYG   4   4  0  0   0   0   0    1   
28       delahed01   Ed Delahanty  PHI  BRO   4   4  0  0   0   0   0    0   
...            ...            ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
1981210  raderdo02     Doug Rader  HOU  ATL   4   4  1  2   0   0   1    1   
1981254  watsobo01     Bob Watson  HOU  ATL   4   4  0  1   0   0   0    1   
1981257  williea02  Earl Williams  ATL  HOU   4   4  0  2   0   0   0    1   
1981259  wilsodo01     Don Wilson  HOU  ATL   3   2  0  0   0   0   0    0   
1981262   wynnji01       Jim Wynn  HOU  ATL   5   4  1  1   0   0   0    1   

         BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score 

**RE24 - Base-Out Runs Added**
This statistic represents run expectancy based on the 24 possible base-out states.

Reference:
https://thebaseballscholar.com/2017/08/14/sabermetrics-101-re24/

In [58]:
pd.unique(df['RE24'])

array([   nan,  1.036, -0.138, ...,  5.928, -3.176,  5.142])

In [59]:
re24_is_nan = df.loc[pd.isna(df['RE24'])]
print(re24_is_nan)
print()

print("\nUnique Players with this RE24-NaN problem:")
re24_nan_players = pd.unique(re24_is_nan['ID'])
print(re24_nan_players)
print("Count: ",len(re24_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(re24_is_nan)/len(df)*100)
print("There are",len(re24_nan_players),"players impacted by these",len(re24_is_nan),"records.")
print("")

                ID         Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
4        delahed01   Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0    0   
5        dolanjo02      Joe Dolan  PHI  BRO   5   5  0  1   0   0   0    1   
21       childcu01   Cupid Childs  CHC  STL   5   5  1  1   0   0   0    0   
22       crolifr01   Fred Crolius  BSN  NYG   4   4  0  0   0   0   0    1   
28       delahed01   Ed Delahanty  PHI  BRO   4   4  0  0   0   0   0    0   
...            ...            ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
1981210  raderdo02     Doug Rader  HOU  ATL   4   4  1  2   0   0   1    1   
1981254  watsobo01     Bob Watson  HOU  ATL   4   4  0  1   0   0   0    1   
1981257  williea02  Earl Williams  ATL  HOU   4   4  0  2   0   0   0    1   
1981259  wilsodo01     Don Wilson  HOU  ATL   3   2  0  0   0   0   0    0   
1981262   wynnji01       Jim Wynn  HOU  ATL   5   4  1  1   0   0   0    1   

         BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score 

**aLI - Average Leverage Index**

Reference:
https://www.azsnakepit.com/2021/11/9/22742763/what-average-leverage-index-revealed-about-the-2021-diamondbacks

In [60]:
pd.unique(df['aLI'])

array([  nan, 0.14 , 1.187, ..., 4.373, 3.954, 7.45 ])

In [61]:
ali_is_nan = df.loc[pd.isna(df['aLI'])]
print(ali_is_nan)
print()

print("\nUnique Players with this aLI-NaN problem:")
ali_nan_players = pd.unique(ali_is_nan['ID'])
print(ali_nan_players)
print("Count: ",len(ali_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(ali_is_nan)/len(df)*100)
print("There are",len(ali_nan_players),"players impacted by these",len(ali_is_nan),"records.")
print("")

                ID         Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
4        delahed01   Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0    0   
5        dolanjo02      Joe Dolan  PHI  BRO   5   5  0  1   0   0   0    1   
21       childcu01   Cupid Childs  CHC  STL   5   5  1  1   0   0   0    0   
22       crolifr01   Fred Crolius  BSN  NYG   4   4  0  0   0   0   0    1   
28       delahed01   Ed Delahanty  PHI  BRO   4   4  0  0   0   0   0    0   
...            ...            ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
1981210  raderdo02     Doug Rader  HOU  ATL   4   4  1  2   0   0   1    1   
1981254  watsobo01     Bob Watson  HOU  ATL   4   4  0  1   0   0   0    1   
1981257  williea02  Earl Williams  ATL  HOU   4   4  0  2   0   0   0    1   
1981259  wilsodo01     Don Wilson  HOU  ATL   3   2  0  0   0   0   0    0   
1981262   wynnji01       Jim Wynn  HOU  ATL   5   4  1  1   0   0   0    1   

         BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score 

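**Note:** An interesting pattern shows up across the WPA, RE24, and aLI statistics. The NaN listings above begin and end on the same records, which suggests all three are missing for the same early stretch of games and were only calculated for later seasons.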
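
One way to check this observation directly is to compare the NaN masks of the three columns and look at which seasons they cover. The sketch below uses a tiny made-up frame in place of `df`; the name `toy` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Season': [1901, 1950, 1980, 2021],
    'WPA':    [np.nan, np.nan, 0.03, -0.042],
    'RE24':   [np.nan, np.nan, 1.036, -0.138],
    'aLI':    [np.nan, np.nan, 0.14, 1.187],
})

advanced = ['WPA', 'RE24', 'aLI']
nan_mask = toy[advanced].isna()

# True when every row is missing either all three statistics or none of them.
same_rows = bool((nan_mask.all(axis=1) == nan_mask.any(axis=1)).all())

# Seasons spanned by the records with missing values.
missing_seasons = toy.loc[nan_mask.any(axis=1), 'Season']
print(same_rows, missing_seasons.min(), missing_seasons.max())
```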

***A DECISION ... FOR WPA, RE24, and aLI STATISTICS***

After looking at this data, the nature of the statistics, and the goals of this project, the best decision is to remove these three features entirely. They are advanced statistics that fall outside the scope of this project.

So, we will remove these columns from the data frame.

In [62]:
del df['WPA']
del df['RE24']
del df['aLI']

In [63]:
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,BOP,Result,Score,Season
4,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,3,L,7-12,1901
5,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,7,L,7-12,1901
21,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,1,W,8-7,1901
22,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,6,W,7-0,1901
28,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,3,L,2-10,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285624,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,9,L,2-3,2021
4285625,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,7,W,11-4,2021
4285626,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,6,W,6-0,2021
4285627,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,5,L,5-7,2021


Considering what remains, `'BOP'` (Batting Order Position) and the extracted `'Score'`, I do not foresee any use for either of these, so I will remove these columns as well.

At this time, I have decided to keep the `'Tm'` and `'Opp'` columns in the data for context but do not anticipate using them in any of the learning aspects of the project.

In [64]:
del df['BOP']
del df['Score']

In [65]:
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
4,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
5,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
21,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
22,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
28,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285624,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
4285625,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
4285626,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
4285627,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021


**Result**

I am intentionally keeping `'Result'` at this time, so that I have options to understand how many winning games a player participated in, although there is a good chance this level of nuance will end up being outside of the scope of the project.

These values are extracted strings and should be one of 3 values: "W", "L", or "T". (Note that "T" represents a tied game, which is relatively unusual in baseball.)

In [66]:
print(pd.unique(df['Result']))

['L' 'W' 'T']


We might decide later to convert this to numeric values, if it proves to be helpful.
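
If that conversion is needed, one simple option is a dictionary lookup with `Series.map`. The encoding below (win = 1, loss = 0, tie = 0.5) is purely a hypothetical choice for illustration, as are the names `results` and `result_codes`.

```python
import pandas as pd

# Made-up sample of 'Result' values.
results = pd.Series(['L', 'W', 'T', 'W'])

# Hypothetical encoding: win = 1, loss = 0, tie = 0.5
result_codes = {'W': 1.0, 'L': 0.0, 'T': 0.5}
numeric = results.map(result_codes)
print(numeric.tolist())
```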

### Looking at Player Data

So far, we've looked at data from the perspective of each individual game played by an individual player. Now that the data is healthier than when we started, we want to look at individual players during their careers.

Keeping in mind that our goal is to find trends that might predict a batter's success in the future, we will want to be able to combine their statistics over time but also attempt to find ways to examine their data during parts of their careers.

In [67]:
print("There are",len(pd.unique(df['ID'])),"unique players remaining in this dataset.")

There are 14336 unique players remaining in this dataset.


Let's try splitting the data into three arbitrary categories by season: early (before 1977), mid (1977 through 2011), and later (2012 onward):

In [68]:
df_early = df[df['Season'] < 1977].copy()

In [69]:
df_early

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
4,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
5,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
21,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
22,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
28,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2150289,rudolke01,Ken Rudolph,STL,PIT,3,3,0,0,0,0,0,0,0,0,0,0,0,L,1976
2150290,stennre01,Rennie Stennett,PIT,STL,4,4,0,0,0,0,0,0,0,0,0,0,0,W,1976
2150291,taverfr01,Frank Taveras,PIT,STL,1,1,0,0,0,0,0,0,0,1,0,0,0,W,1976
2150292,templga01,Garry Templeton,STL,PIT,2,2,0,0,0,0,0,0,0,0,0,0,0,L,1976


In [70]:
df_mid = df[df['Season'] > 1976].copy()

In [71]:
df_mid = df_mid[df_mid['Season'] < 2012].copy()

In [72]:
df_mid

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
2150294,almonbi01,Bill Almon,SDP,CIN,4,4,0,0,0,0,0,0,0,1,0,0,0,L,1977
2150295,armbred01,Ed Armbrister,CIN,SDP,1,1,0,0,0,0,0,0,0,0,0,0,0,W,1977
2150296,baezjo01,Jose Baez,SEA,CAL,4,4,0,2,0,0,0,0,0,0,0,0,0,L,1977
2150297,baylodo01,Don Baylor,CAL,SEA,5,3,1,1,1,0,0,1,2,0,0,0,0,W,1977
2150298,bochtbr01,Bruce Bochte,CAL,SEA,5,3,2,2,0,0,0,0,2,0,0,0,0,W,1977
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3805700,youngch04,Chris Young,ARI,LAD,2,2,0,0,0,0,0,0,0,2,0,0,0,L,2011
3805701,youngde03,Delmon Young,DET,CLE,4,4,1,1,0,0,0,0,0,1,0,0,0,W,2011
3805702,younger03,Eric Young Jr.,COL,SFG,5,5,1,2,0,1,0,0,0,2,0,0,0,W,2011
3805703,youngmi02,Michael Young,TEX,LAA,4,4,1,1,0,0,0,0,0,0,0,0,0,W,2011


In [73]:
df_later = df[df['Season'] > 2011].copy()

In [74]:
df_later

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
3805705,ackledu01,Dustin Ackley,SEA,OAK,5,5,2,2,0,0,1,2,0,1,0,0,0,W,2012
3805706,allenbr01,Brandon Allen,OAK,SEA,4,4,0,0,0,0,0,0,0,2,0,0,0,L,2012
3805707,carpmi01,Mike Carp,SEA,OAK,4,4,0,0,0,0,0,0,0,0,0,0,0,W,2012
3805708,cespeyo01,Yoenis Céspedes,OAK,SEA,4,3,0,1,1,0,0,0,0,2,1,0,0,L,2012
3805709,crispco01,Coco Crisp,OAK,SEA,5,5,0,0,0,0,0,0,0,1,0,0,0,L,2012
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285624,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
4285625,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
4285626,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
4285627,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021
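
The same three-way split could also be expressed as a single labelled column, which keeps the season boundaries in one place. This is only a sketch on made-up seasons; the `Era` column name and its labels are assumptions, not something the notebook defines.

```python
import pandas as pd

toy = pd.DataFrame({'Season': [1901, 1976, 1977, 2011, 2012, 2021]})

# Right-inclusive bins: (1900, 1976], (1976, 2011], (2011, 2021]
toy['Era'] = pd.cut(
    toy['Season'],
    bins=[1900, 1976, 2011, 2021],
    labels=['early', 'mid', 'later'],
)

print(toy)
```

Filtering with `toy[toy['Era'] == 'mid']` would then select the same seasons as `df_mid`.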


In [76]:
### TEST -- but if successful might do it for the entire df for career/lifetime stats as a record

temp = df_mid.groupby('ID')
temp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12b5673a0>

In [88]:
summables = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']
temp_mids = temp[summables].sum()
temp_mids

Unnamed: 0_level_0,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
aardsda01,4,3,0,0,0,0,0,0,0,1,0,1,0
aasedo01,5,5,0,0,0,0,0,0,0,3,0,0,0
abadan01,25,21,1,2,0,0,0,0,4,5,0,0,0
abadfe01,1,1,0,0,0,0,0,0,0,1,0,0,0
abbotje01,651,596,82,157,33,2,18,83,38,91,3,5,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
zoskyed01,53,50,4,8,1,2,0,3,1,13,0,1,1
zuberjo01,151,136,13,34,7,1,3,16,12,20,1,1,1
zuletju01,191,174,23,43,11,0,9,36,10,51,6,0,1
zupcibo01,886,795,93,199,47,4,7,80,57,137,6,20,8


In [79]:
temp_player_names = df_mid[['Player','ID']]
temp_player_names

Unnamed: 0,Player,ID
2150294,Bill Almon,almonbi01
2150295,Ed Armbrister,armbred01
2150296,Jose Baez,baezjo01
2150297,Don Baylor,baylodo01
2150298,Bruce Bochte,bochtbr01
...,...,...
3805700,Chris Young,youngch04
3805701,Delmon Young,youngde03
3805702,Eric Young Jr.,younger03
3805703,Michael Young,youngmi02


In [89]:
print(temp_mids)
pd.merge(temp_mids, temp_player_names, on = "ID", how = "right")


            PA   AB   R    H  2B  3B  HR  RBI  BB   SO  HBP  SH  SF
ID                                                                 
aardsda01    4    3   0    0   0   0   0    0   0    1    0   1   0
aasedo01     5    5   0    0   0   0   0    0   0    3    0   0   0
abadan01    25   21   1    2   0   0   0    0   4    5    0   0   0
abadfe01     1    1   0    0   0   0   0    0   0    1    0   0   0
abbotje01  651  596  82  157  33   2  18   83  38   91    3   5   7
...        ...  ...  ..  ...  ..  ..  ..  ...  ..  ...  ...  ..  ..
zoskyed01   53   50   4    8   1   2   0    3   1   13    0   1   1
zuberjo01  151  136  13   34   7   1   3   16  12   20    1   1   1
zuletju01  191  174  23   43  11   0   9   36  10   51    6   0   1
zupcibo01  886  795  93  199  47   4   7   80  57  137    6  20   8
zuvelpa01  545  491  40  109  17   2   2   20  34   50    2  18   0

[5825 rows x 13 columns]


Unnamed: 0,ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Player
0,almonbi01,3549,3225,366,816,134,25,35,287,246,617,6,43,29,Bill Almon
1,armbred01,91,78,9,20,4,3,1,5,10,21,0,2,1,Ed Armbrister
2,baezjo01,391,355,43,87,14,2,1,19,25,27,1,10,0,Jose Baez
3,baylodo01,6722,5846,904,1506,265,13,266,979,583,775,201,6,86,Don Baylor
4,bochtbr01,4802,4196,525,1198,210,16,90,535,526,543,8,31,41,Bruce Bochte
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1655406,youngch04,3225,2848,417,683,174,17,118,367,324,735,18,12,23,Chris Young
1655407,youngde03,2967,2784,342,802,158,9,71,408,125,514,25,1,32,Delmon Young
1655408,younger03,479,427,67,105,10,4,1,19,47,82,3,1,1,Eric Young Jr.
1655409,youngmi02,7396,6788,1005,2061,388,52,169,917,499,1082,20,25,64,Michael Young
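
Because `temp_player_names` still has one row per game, the right merge above repeats each player's totals once for every game played (the index runs past 1.6 million). If one row per player is wanted instead, the names can be de-duplicated before merging. A minimal sketch on made-up rows, assuming each ID maps to a single spelling of the player's name; `games`, `totals`, `names`, and `career` are illustrative names.

```python
import pandas as pd

# Made-up per-game rows shaped like the real data.
games = pd.DataFrame({
    'ID':     ['baezjo01', 'baezjo01', 'almonbi01'],
    'Player': ['Jose Baez', 'Jose Baez', 'Bill Almon'],
    'PA':     [4, 5, 4],
    'H':      [2, 1, 0],
})

# Totals per player, keeping 'ID' as an ordinary column.
totals = games.groupby('ID', as_index=False)[['PA', 'H']].sum()

# One (ID, Player) pair per player instead of one per game.
names = games[['ID', 'Player']].drop_duplicates(subset='ID')

career = totals.merge(names, on='ID', how='left')
print(career)
```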


In [188]:
### TEST

tester = df_mid[df_mid['ID'] == 'baezjo01'].head()
tester.at[2151314,'Player'] = "DJ Harris"
tester.at[2151314, 'ID'] = 'aaadjhtest01'
tester.at[2151774,'Season'] = 1978
tester



Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,Result,Season
2150296,baezjo01,Jose Baez,SEA,CAL,4,4,0,2,0,0,0,0,0,0,0,0,L,1977
2150834,baezjo01,Jose Baez,SEA,CAL,5,5,0,1,0,0,0,0,0,0,0,0,L,1977
2151141,baezjo01,Jose Baez,SEA,MIN,4,4,0,1,0,0,0,0,0,0,0,0,L,1977
2151314,aaadjhtest01,DJ Harris,SEA,MIN,5,5,0,2,1,0,0,0,0,0,0,0,L,1977
2151774,baezjo01,Jose Baez,SEA,MIN,3,3,0,0,0,0,0,0,0,0,0,0,W,1978


In [189]:
##

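# 1. Collapse by player and season
groupables = ['ID','Season']

temp_tester = tester.groupby(groupables)
# Sum only the numeric columns; string columns such as 'Player' cannot be meaningfully summed
temp_player_seasons = temp_tester.sum(numeric_only=True)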
temp_player_seasons


Unnamed: 0_level_0,Unnamed: 1_level_0,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH
ID,Season,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
aaadjhtest01,1977,5,5,0,2,1,0,0,0,0,0,0,0
baezjo01,1977,13,13,0,4,0,0,0,0,0,0,0,0
baezjo01,1978,3,3,0,0,0,0,0,0,0,0,0,0


In [182]:
# 2. Count number of games per season

temp_counter = tester.groupby(groupables)['ID'].count().reset_index(name='Games Played')
temp_counter

Unnamed: 0,ID,Season,Games Played
0,aaadjhtest01,1977,1
1,baezjo01,1977,4


In [196]:
# 3. ...

temp_names = tester[['ID','Player']]
temp_names

Unnamed: 0,ID,Player
2150296,baezjo01,Jose Baez
2150834,baezjo01,Jose Baez
2151141,baezjo01,Jose Baez
2151314,aaadjhtest01,DJ Harris
2151774,baezjo01,Jose Baez


In [187]:
# 4. Combine some things

temp_result = pd.merge(temp_player_seasons,temp_names,on='ID',how='right')
temp_result

Unnamed: 0,ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,Player
0,baezjo01,16,16,0,4,0,0,0,0,0,0,0,0,Jose Baez
1,baezjo01,16,16,0,4,0,0,0,0,0,0,0,0,Jose Baez
2,baezjo01,16,16,0,4,0,0,0,0,0,0,0,0,Jose Baez
3,aaadjhtest01,5,5,0,2,1,0,0,0,0,0,0,0,DJ Harris
4,baezjo01,16,16,0,4,0,0,0,0,0,0,0,0,Jose Baez
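
Steps 1 to 4 can also be folded into a single grouped aggregation, which avoids the merge and its duplicated rows. The following is a sketch on made-up rows shaped like `tester`, not the approach the notebook settles on; `Games_Played` and `season_stats` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    'ID':     ['baezjo01', 'baezjo01', 'baezjo01', 'aaadjhtest01'],
    'Player': ['Jose Baez', 'Jose Baez', 'Jose Baez', 'DJ Harris'],
    'Season': [1977, 1977, 1978, 1977],
    'PA':     [4, 5, 3, 5],
    'H':      [2, 1, 0, 2],
})

season_stats = (
    toy.groupby(['ID', 'Season'])
       .agg(
           Player=('Player', 'first'),     # name carried along with the group
           Games_Played=('PA', 'count'),   # one row per game, so count = games
           PA=('PA', 'sum'),
           H=('H', 'sum'),
       )
       .reset_index()
)
print(season_stats)
```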


In [90]:
## ALL stats
summables = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']
temp_career_stats = temp[summables].sum()
temp_career_stats



Unnamed: 0_level_0,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
aardsda01,4,3,0,0,0,0,0,0,0,1,0,1,0
aasedo01,5,5,0,0,0,0,0,0,0,3,0,0,0
abadan01,25,21,1,2,0,0,0,0,4,5,0,0,0
abadfe01,1,1,0,0,0,0,0,0,0,1,0,0,0
abbotje01,651,596,82,157,33,2,18,83,38,91,3,5,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
zoskyed01,53,50,4,8,1,2,0,3,1,13,0,1,1
zuberjo01,151,136,13,34,7,1,3,16,12,20,1,1,1
zuletju01,191,174,23,43,11,0,9,36,10,51,6,0,1
zupcibo01,886,795,93,199,47,4,7,80,57,137,6,20,8


singles (int) = H - (2B + 3B + HR)

total_bases (int) = 1 * singles + 2 * 2B + 3 * 3B + 4 * HR


AVG (float) = H / AB

SLG (float) = total_bases / AB

OBP (float) = (H + BB + HBP)/(AB + BB + HBP + SF)

OPS (float) = SLG + OBP

Reference (useful for testing against): https://en.wikipedia.org/wiki/On-base_plus_slugging

RC (float) = total_bases * ( (H + BB) / (AB + BB) )

ISO (float) = (1 * 2B + 2 * 3B + 3 * HR) / AB

PA/SO (float) = PA/SO

Simplification of the SLG formula:

SLG = ( 1 * (H - (2B + 3B + HR)) + 2 * 2B + 3 * 3B + 4 * HR ) / AB

= ( H - 2B - 3B - HR + 2 * 2B + 3 * 3B + 4 * HR ) / AB

= ( H - 2B + 2 * 2B - 3B + 3 * 3B - HR + 4 * HR ) / AB

= ( H + 2B + 2 * 3B + 3 * HR ) / AB
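
As a quick check on the simplification, the sketch below computes SLG both ways, along with AVG, OBP, and OPS, for one line of counting statistics. The numbers are the `abbotje01` totals from the summed table above; the variable names (`D` for 2B, `T` for 3B) are illustrative only.

```python
# Counting stats for abbotje01, copied from the summed table above.
AB, H, D, T, HR = 596, 157, 33, 2, 18   # D = 2B, T = 3B
BB, HBP, SF = 38, 3, 7

singles = H - (D + T + HR)
total_bases = 1 * singles + 2 * D + 3 * T + 4 * HR

slg_long = total_bases / AB                  # original form
slg_short = (H + D + 2 * T + 3 * HR) / AB    # simplified form

avg = H / AB
obp = (H + BB + HBP) / (AB + BB + HBP + SF)
ops = slg_short + obp

print(round(avg, 6), round(slg_short, 6), round(obp, 6), round(ops, 6))
```

Both SLG forms reduce to 248 total bases over 596 At Bats, so they agree exactly.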

In [124]:
## add lifetime percentages
# Note: a zero denominator yields inf (or NaN for 0/0) rather than raising an error.
s = temp_career_stats  # short alias; the assignments below modify temp_career_stats itself

# total bases = H + 2B + 2*3B + 3*HR (see the SLG simplification above)
total_bases = s['H'] + s['2B'] + 2*s['3B'] + 3*s['HR']

s['AVG'] = s['H'] / s['AB']
s['SLG'] = total_bases / s['AB']
s['OBP'] = (s['H'] + s['BB'] + s['HBP']) / (s['AB'] + s['BB'] + s['HBP'] + s['SF'])
s['OPS'] = s['SLG'] + s['OBP']
s['RC'] = total_bases * ((s['H'] + s['BB']) / (s['AB'] + s['BB']))
s['ISO'] = (s['2B'] + 2*s['3B'] + 3*s['HR']) / s['AB']

s['PA_SO'] = s['PA'] / s['SO']
s['PA_BB'] = s['PA'] / s['BB']
s['SO_BB'] = s['SO'] / s['BB']

# RBI_PA appears in the output below, so it is computed here alongside its inverse
s['RBI_PA'] = s['RBI'] / s['PA']
s['PA_RBI'] = s['PA'] / s['RBI']


temp_career_stats

Unnamed: 0_level_0,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,RBI_PA,PA_RBI
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1
aardsda01,4,3,0,0,0,0,0,0,0,1,0,1,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,4.000000,inf,inf,0.000000,inf
aasedo01,5,5,0,0,0,0,0,0,0,3,0,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.666667,inf,inf,0.000000,inf
abadan01,25,21,1,2,0,0,0,0,4,5,0,0,0,0.095238,0.095238,0.240000,0.335238,0.480000,0.000000,5.000000,6.250000,1.250000,0.000000,inf
abadfe01,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,inf,inf,0.000000,inf
abbotje01,651,596,82,157,33,2,18,83,38,91,3,5,7,0.263423,0.416107,0.307453,0.723561,76.277603,0.152685,7.153846,17.131579,2.394737,0.127496,7.843373
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zoskyed01,53,50,4,8,1,2,0,3,1,13,0,1,1,0.160000,0.260000,0.173077,0.433077,2.294118,0.100000,4.076923,53.000000,13.000000,0.056604,17.666667
zuberjo01,151,136,13,34,7,1,3,16,12,20,1,1,1,0.250000,0.382353,0.313333,0.695686,16.162162,0.132353,7.550000,12.583333,1.666667,0.105960,9.437500
zuletju01,191,174,23,43,11,0,9,36,10,51,6,0,1,0.247126,0.465517,0.308901,0.774418,23.331522,0.218391,3.745098,19.100000,5.100000,0.188482,5.305556
zupcibo01,886,795,93,199,47,4,7,80,57,137,6,20,8,0.250314,0.345912,0.302540,0.648452,82.629108,0.095597,6.467153,15.543860,2.403509,0.090293,11.075000
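
The `inf` entries in the ratio columns above come from dividing by zero (for example, a batter with no walks). If missing values are preferable to infinities for later modelling, one option is to blank out zero denominators before dividing. The following is a minimal sketch under that assumption, not the notebook's own method; `safe_ratio` and the `demo` frame are hypothetical names, and the three rows reuse counts from the table above.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator):
    """Divide two Series, giving NaN (not inf) wherever the denominator is 0."""
    # Replacing 0 with NaN makes the division propagate NaN instead of inf.
    return numerator / denominator.replace(0, np.nan)


# Small illustrative frame; counts copied from the career table above.
demo = pd.DataFrame(
    {"PA": [4, 25, 651], "SO": [1, 5, 91], "BB": [0, 4, 38]},
    index=pd.Index(["aardsda01", "abadan01", "abbotje01"], name="ID"),
)

demo["PA_BB"] = safe_ratio(demo["PA"], demo["BB"])
demo["SO_BB"] = safe_ratio(demo["SO"], demo["BB"])
print(demo)
```

With this approach the row for `aardsda01` (zero walks) holds `NaN` in both ratio columns, which pandas aggregations such as `mean()` skip by default.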


In [115]:
print(df[df['Player'] == "Kelly Gruber"])

                ID        Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  BB  \
2457869  grubeke01  Kelly Gruber  TOR  KCR   1   1  0  0   0   0   0    0   0   
2460312  grubeke01  Kelly Gruber  TOR  MIN   2   2  0  0   0   0   0    0   0   
2489714  grubeke01  Kelly Gruber  TOR  NYY   1   1  0  0   0   0   0    0   0   
2492546  grubeke01  Kelly Gruber  TOR  DET   2   2  0  0   0   0   0    0   0   
2493086  grubeke01  Kelly Gruber  TOR  BOS   3   3  0  0   0   0   0    0   0   
...            ...           ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...  ..   
2880328  grubeke01  Kelly Gruber  CAL  KCR   4   4  1  3   1   0   1    4   0   
2880787  grubeke01  Kelly Gruber  CAL  MIN   5   5  1  3   1   0   0    2   0   
2881079  grubeke01  Kelly Gruber  CAL  MIN   4   4  0  1   0   0   0    0   0   
2881367  grubeke01  Kelly Gruber  CAL  MIN   3   3  0  0   0   0   0    0   0   
2881875  grubeke01  Kelly Gruber  CAL  OAK   2   2  0  0   0   0   0    0   0   

         SO  HBP  SH  SF Re

In [106]:
temp_career_stats.loc[['bellge02']]

ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO
bellge02,6592,6123,812,1702,308,34,265,1002,331,771,49,0,83,0.277968,0.469214,0.316125,0.78534,904.990548,0.191246


In [108]:
temp_career_stats.loc[['cartejo01']]

ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO
cartejo01,9154,8422,1167,2184,432,53,396,1445,527,1387,90,10,105,0.259321,0.46426,0.306321,0.770581,1184.491005,0.204939


In [110]:
temp_career_stats.loc[['boggswa01']]

ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO
boggswa01,10740,9180,1513,3010,578,61,118,1014,1412,745,23,29,96,0.327887,0.442702,0.414994,0.857695,1696.65861,0.114815


In [112]:
temp_career_stats.loc[['cansejo01']]

ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO
cansejo01,8129,7057,1186,1877,340,14,462,1407,906,1942,84,1,81,0.265977,0.514525,0.352731,0.867256,1269.003265,0.248548


In [117]:
temp_career_stats.loc[['grubeke01']]

ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO
grubeke01,3442,3159,420,818,148,24,117,443,197,504,36,15,35,0.258943,0.432099,0.306682,0.738781,412.835221,0.173156


In [137]:

# Group the game-level rows by player and season; groupby is lazy, so nothing is aggregated yet.
temp2 = df_mid.groupby(['ID', 'Player', 'Season'])
temp2

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12731d910>

In [138]:
# Sum the counting stats within each player-season group.
t2 = temp2.sum()
t2

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0
aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0
aasedo01,Don Aase,1989,5,5,0,0,0,0,0,0,0,3,0,0,0
abadan01,Andy Abad,2001,1,1,0,0,0,0,0,0,0,0,0,0,0
abadan01,Andy Abad,2003,19,17,1,2,0,0,0,0,2,5,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuvelpa01,Paul Zuvella,1985,210,190,16,48,8,1,0,4,16,14,0,4,0
zuvelpa01,Paul Zuvella,1986,57,48,2,4,1,0,0,2,5,4,0,4,0
zuvelpa01,Paul Zuvella,1987,36,34,2,6,0,0,0,0,0,4,0,2,0
zuvelpa01,Paul Zuvella,1988,146,130,9,30,5,1,0,7,8,13,0,8,0


In [139]:
# Add per-season rate statistics (each t2 row is one player-season, not a career).

t2['AVG'] = t2['H'] / t2['AB']

t2['SLG'] = (t2['H'] + t2['2B'] + 2*t2['3B'] + 3*t2['HR']) / t2['AB']

t2['OBP'] = (t2['H'] + t2['BB'] + t2['HBP']) / (t2['AB'] + t2['BB'] + t2['HBP'] + t2['SF'])

t2['OPS'] = t2['SLG'] + t2['OBP']

# Runs created (basic form): total bases * (H + BB) / (AB + BB).
t2['RC'] = (t2['H'] + t2['2B'] + 2*t2['3B'] + 3*t2['HR']) * ( (t2['H'] + t2['BB']) / (t2['AB'] + t2['BB']) )

t2['ISO'] = (t2['2B'] + 2*t2['3B'] + 3*t2['HR']) / t2['AB']

# Ratio statistics. A zero denominator yields inf (or NaN for 0/0).
t2['PA_SO'] = t2['PA'] / t2['SO']
t2['PA_BB'] = t2['PA'] / t2['BB']
t2['SO_BB'] = t2['SO'] / t2['BB']
t2['PA_RBI'] = t2['PA'] / t2['RBI']


t2

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,PA_RBI
aardsda01,David Aardsma,2006,3,2,0,0,0,0,0,0,0,0,0,1,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,inf,inf,,inf
aardsda01,David Aardsma,2008,1,1,0,0,0,0,0,0,0,1,0,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,inf,inf,inf
aasedo01,Don Aase,1989,5,5,0,0,0,0,0,0,0,3,0,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.666667,inf,inf,inf
abadan01,Andy Abad,2001,1,1,0,0,0,0,0,0,0,0,0,0,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,inf,inf,,inf
abadan01,Andy Abad,2003,19,17,1,2,0,0,0,0,2,5,0,0,0,0.117647,0.117647,0.210526,0.328173,0.421053,0.000000,3.800000,9.500,2.500,inf
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuvelpa01,Paul Zuvella,1985,210,190,16,48,8,1,0,4,16,14,0,4,0,0.252632,0.305263,0.310680,0.615943,18.019417,0.052632,15.000000,13.125,0.875,52.500000
zuvelpa01,Paul Zuvella,1986,57,48,2,4,1,0,0,2,5,4,0,4,0,0.083333,0.104167,0.169811,0.273978,0.849057,0.020833,14.250000,11.400,0.800,28.500000
zuvelpa01,Paul Zuvella,1987,36,34,2,6,0,0,0,0,0,4,0,2,0,0.176471,0.176471,0.176471,0.352941,1.058824,0.000000,9.000000,inf,inf,inf
zuvelpa01,Paul Zuvella,1988,146,130,9,30,5,1,0,7,8,13,0,8,0,0.230769,0.284615,0.275362,0.559978,10.188406,0.053846,11.230769,18.250,1.625,20.857143
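
The same rate formulas are applied to both the career totals and the per-season totals. One way to keep the two in sync is to put the formulas in a single helper and call it on either frame. The sketch below assumes a hypothetical function name, `add_rate_stats`, and checks it against George Bell's career counts as shown earlier in this notebook; it is an illustration, not the code that produced the tables above.

```python
import pandas as pd


def add_rate_stats(frame):
    """Return a copy of `frame` with rate stats derived from its counting columns."""
    out = frame.copy()
    extra_bases = out["2B"] + 2 * out["3B"] + 3 * out["HR"]
    total_bases = out["H"] + extra_bases

    out["AVG"] = out["H"] / out["AB"]
    out["SLG"] = total_bases / out["AB"]
    out["OBP"] = (out["H"] + out["BB"] + out["HBP"]) / (
        out["AB"] + out["BB"] + out["HBP"] + out["SF"]
    )
    out["OPS"] = out["SLG"] + out["OBP"]
    # Basic runs created: total bases * (H + BB) / (AB + BB).
    out["RC"] = total_bases * (out["H"] + out["BB"]) / (out["AB"] + out["BB"])
    out["ISO"] = extra_bases / out["AB"]
    return out


# One-row frame holding George Bell's career counts from the notebook output.
career = pd.DataFrame(
    {"AB": [6123], "H": [1702], "2B": [308], "3B": [34], "HR": [265],
     "BB": [331], "HBP": [49], "SF": [83]},
    index=pd.Index(["bellge02"], name="ID"),
)

career_rates = add_rate_stats(career)
print(career_rates[["AVG", "SLG", "OBP", "OPS", "RC", "ISO"]])
```

Because the helper returns a copy, it could be applied to the career frame and the per-season frame alike without modifying the summed counts in place.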


In [140]:
t2.loc[['bellge02']]

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,PA_RBI
bellge02,George Bell,1981,168,163,17,38,2,1,5,12,5,27,0,0,0,0.233129,0.349693,0.255952,0.605646,14.589286,0.116564,6.222222,33.6,5.4,14.0
bellge02,George Bell,1983,118,112,5,30,5,4,2,17,4,17,2,0,0,0.267857,0.4375,0.305085,0.742585,14.362069,0.169643,6.941176,29.5,4.25,6.941176
bellge02,George Bell,1984,643,606,85,177,39,4,26,87,24,86,8,0,3,0.292079,0.49835,0.326053,0.824403,96.352381,0.206271,7.476744,26.791667,3.583333,7.390805
bellge02,George Bell,1985,667,607,87,167,28,6,28,95,43,90,8,0,8,0.275124,0.479407,0.327327,0.806734,94.015385,0.204283,7.411111,15.511628,2.093023,7.021053
bellge02,George Bell,1986,690,641,101,198,38,6,31,108,41,62,2,0,6,0.308892,0.531981,0.349275,0.881257,119.5,0.223089,11.129032,16.829268,1.512195,6.388889
bellge02,George Bell,1987,667,610,111,188,32,4,47,134,39,75,7,0,9,0.308197,0.604918,0.35188,0.956798,129.064715,0.296721,8.893333,17.102564,1.923077,4.977612
bellge02,George Bell,1988,658,614,78,165,27,5,24,97,34,66,1,0,8,0.26873,0.446254,0.304414,0.750668,84.145062,0.177524,9.969697,19.352941,1.941176,6.783505
bellge02,George Bell,1989,664,613,88,182,41,2,18,104,33,60,4,0,14,0.2969,0.458401,0.329819,0.788221,93.521672,0.161501,11.066667,20.121212,1.818182,6.384615
bellge02,George Bell,1990,608,562,67,149,25,0,21,86,32,80,3,0,11,0.265125,0.421708,0.302632,0.72434,72.217172,0.156584,7.6,19.0,2.5,7.069767
bellge02,George Bell,1991,603,558,63,159,27,0,25,86,32,62,4,0,9,0.284946,0.467742,0.323383,0.791125,84.49322,0.182796,9.725806,18.84375,1.9375,7.011628


In [161]:
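# Average George Bell's first five seasons (per-season rows from t2).
bell_seasons = t2.loc[['bellge02']]
first_five = bell_seasons[0:5]

# Grouping by ID collapses the five season rows into one row of means.
# Note: rate columns (AVG, SLG, ...) are averaged unweighted across seasons.
first_five.groupby('ID').mean()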

ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,PA_RBI
bellge02,457.2,425.8,59.0,122.0,22.4,4.2,18.4,63.8,23.4,56.4,4.0,0.0,3.4,0.275416,0.459386,0.312739,0.772125,67.763824,0.18397,7.836057,24.446513,3.36771,8.348385
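
The `AVG` of 0.275416 above is the unweighted mean of five per-season averages, so the short 1981 and 1983 seasons count as much as the full ones. A pooled figure (total hits over total at-bats for the window) weights each season by its at-bats instead. The sketch below contrasts the two, using the at-bat and hit counts from George Bell's first five per-season rows; the variable names are illustrative only.

```python
import pandas as pd

# AB and H for George Bell's first five seasons, copied from the per-season table.
first_five = pd.DataFrame(
    {"AB": [163, 112, 606, 607, 641], "H": [38, 30, 177, 167, 198]},
    index=pd.Index([1981, 1983, 1984, 1985, 1986], name="Season"),
)

# Unweighted mean of the five per-season averages (what .mean() reports for AVG).
mean_of_rates = (first_five["H"] / first_five["AB"]).mean()

# Pooled average: total hits divided by total at-bats across the window.
pooled_avg = first_five["H"].sum() / first_five["AB"].sum()

print(f"mean of season AVGs: {mean_of_rates:.4f}")
print(f"pooled AVG:          {pooled_avg:.4f}")
```

Here the pooled value is about 0.2865 (610 hits in 2129 at-bats), noticeably higher than the mean of rates, because the two partial seasons had the lowest averages. Which figure is the better summary depends on how the window is meant to be used as a feature.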


In [165]:
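# Average George Bell's last five seasons (per-season rows from t2).
bell_seasons = t2.loc[['bellge02']]
last_five = bell_seasons[-5:]

# Grouping by ID collapses the five season rows into one row of means.
# Note: rate columns (AVG, SLG, ...) are averaged unweighted across seasons.
last_five.groupby('ID').mean()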

ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,PA_RBI
bellge02,596.2,554.0,65.6,147.8,27.4,0.8,20.4,90.4,28.2,69.6,4.2,0.0,9.8,0.263846,0.425826,0.298597,0.724422,72.442563,0.16198,8.83953,22.623265,2.630789,6.652131
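
The first-five and last-five windows above are built by hand for one player. If the same windows are wanted for every player, `groupby(...).head(n)` and `.tail(n)` can select the first or last `n` season rows per ID in one step. The following is a sketch under stated assumptions: the frame is a small excerpt of per-season rows shown earlier (not complete careers), it is already sorted by season within each ID, and the names `seasons`, `first_n`, and `last_n` are hypothetical.

```python
import pandas as pd

# Excerpt of per-season counts (AB, H) for two players; not full careers.
seasons = pd.DataFrame(
    {"AB": [163, 112, 606, 16, 13, 143], "H": [38, 30, 177, 1, 3, 28]},
    index=pd.MultiIndex.from_tuples(
        [
            ("bellge02", 1981), ("bellge02", 1983), ("bellge02", 1984),
            ("grubeke01", 1984), ("grubeke01", 1985), ("grubeke01", 1986),
        ],
        names=["ID", "Season"],
    ),
)

n = 2  # window length in seasons; the notebook uses 5

# head/tail keep the first/last n rows of each ID, relying on the existing row order.
first_n = seasons.groupby(level="ID").head(n).groupby(level="ID").sum()
last_n = seasons.groupby(level="ID").tail(n).groupby(level="ID").sum()

# Recompute the rate from the summed counts for each window.
first_n["AVG"] = first_n["H"] / first_n["AB"]
last_n["AVG"] = last_n["H"] / last_n["AB"]

print(first_n)
print(last_n)
```

Players with fewer than `n` seasons simply contribute all of their rows to both windows, which may or may not be desirable for the comparison.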


In [166]:
temp_career_stats.loc[['bellge02']]

ID,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,RBI_PA,PA_RBI
bellge02,6592,6123,812,1702,308,34,265,1002,331,771,49,0,83,0.277968,0.469214,0.316125,0.78534,904.990548,0.191246,8.549935,19.915408,2.329305,0.152002,6.578842


In [142]:
t2.loc[['cartejo01']]

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,PA_RBI
cartejo01,Joe Carter,1983,52,51,4,9,1,1,0,1,0,21,0,1,0,0.176471,0.235294,0.176471,0.411765,2.117647,0.058824,2.47619,inf,inf,52.0
cartejo01,Joe Carter,1984,257,244,31,67,6,1,13,41,11,48,1,0,1,0.27459,0.467213,0.307393,0.774606,34.870588,0.192623,5.354167,23.363636,4.363636,6.268293
cartejo01,Joe Carter,1985,523,489,64,128,27,0,15,59,25,74,2,3,4,0.261759,0.408998,0.298077,0.707075,59.533074,0.147239,7.067568,20.92,2.96,8.864407
cartejo01,Joe Carter,1986,709,663,108,200,36,9,29,121,32,95,5,1,8,0.301659,0.514329,0.334746,0.849075,113.830216,0.21267,7.463158,22.15625,2.96875,5.859504
cartejo01,Joe Carter,1987,629,588,83,155,27,2,32,106,27,105,9,1,4,0.263605,0.479592,0.30414,0.783732,83.453659,0.215986,5.990476,23.296296,3.888889,5.933962
cartejo01,Joe Carter,1988,670,621,85,168,36,6,27,98,35,82,7,1,6,0.270531,0.478261,0.313901,0.792162,91.907012,0.207729,8.170732,19.142857,2.342857,6.836735
cartejo01,Joe Carter,1989,705,651,84,158,32,4,35,105,39,112,8,2,5,0.242704,0.465438,0.291607,0.757045,86.508696,0.222734,6.294643,18.076923,2.871795,6.714286
cartejo01,Joe Carter,1990,697,634,79,147,27,1,24,115,48,93,7,0,8,0.231861,0.391167,0.289813,0.680981,70.909091,0.159306,7.494624,14.520833,1.9375,6.06087
cartejo01,Joe Carter,1991,706,638,89,174,42,3,33,108,49,112,10,0,9,0.272727,0.503135,0.330028,0.833163,104.196507,0.230408,6.303571,14.408163,2.285714,6.537037
cartejo01,Joe Carter,1992,683,622,97,164,30,7,34,119,36,109,11,1,13,0.263666,0.498392,0.309384,0.807776,94.224924,0.234727,6.266055,18.972222,3.027778,5.739496


In [143]:
t2.loc[['boggswa01']]

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,PA_RBI
boggswa01,Wade Boggs,1982,381,338,51,118,14,1,5,44,35,21,0,4,4,0.349112,0.440828,0.405836,0.846664,61.117962,0.091716,18.142857,10.885714,0.6,8.659091
boggswa01,Wade Boggs,1983,685,582,100,210,44,7,5,74,92,36,1,3,7,0.360825,0.486254,0.444282,0.930536,126.804154,0.12543,19.027778,7.445652,0.391304,9.256757
boggswa01,Wade Boggs,1984,726,625,109,203,31,4,6,55,89,44,0,8,4,0.3248,0.416,0.406685,0.822685,106.330532,0.0912,16.5,8.157303,0.494382,13.2
boggswa01,Wade Boggs,1985,758,653,107,240,42,3,8,78,96,61,4,3,2,0.367534,0.477795,0.450331,0.928126,139.962617,0.11026,12.42623,7.895833,0.635417,9.717949
boggswa01,Wade Boggs,1986,693,580,107,207,47,2,8,71,105,44,0,4,4,0.356897,0.486207,0.45283,0.939037,128.443796,0.12931,15.75,6.6,0.419048,9.760563
boggswa01,Wade Boggs,1987,667,551,108,200,40,6,24,89,105,48,2,1,8,0.362976,0.588022,0.460961,1.048983,150.640244,0.225045,13.895833,6.352381,0.457143,7.494382
boggswa01,Wade Boggs,1988,719,584,128,214,45,6,5,58,125,34,3,0,7,0.366438,0.489726,0.475661,0.965387,136.747532,0.123288,21.147059,5.752,0.272,12.396552
boggswa01,Wade Boggs,1989,742,621,113,205,51,7,3,54,107,51,7,0,7,0.330113,0.449275,0.429919,0.879194,119.571429,0.119163,14.54902,6.934579,0.476636,13.740741
boggswa01,Wade Boggs,1990,713,619,89,187,44,5,6,63,87,68,1,0,6,0.3021,0.418417,0.385694,0.804111,100.518414,0.116317,10.485294,8.195402,0.781609,11.31746
boggswa01,Wade Boggs,1991,641,546,93,181,42,2,8,51,89,32,0,0,6,0.331502,0.459707,0.421217,0.880924,106.724409,0.128205,20.03125,7.202247,0.359551,12.568627


In [144]:
t2.loc[['cansejo01']]

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,PA_RBI
cansejo01,Jose Canseco,1985,100,96,16,29,3,0,5,13,4,31,0,0,0,0.302083,0.489583,0.33,0.819583,15.51,0.1875,3.225806,25.0,7.75,7.692308
cansejo01,Jose Canseco,1986,682,600,85,144,29,1,33,117,65,175,8,0,9,0.24,0.456667,0.318182,0.774848,86.114286,0.216667,3.897143,10.492308,2.692308,5.82906
cansejo01,Jose Canseco,1987,691,630,81,162,35,3,31,113,50,157,2,0,9,0.257143,0.469841,0.309696,0.779537,92.282353,0.212698,4.401274,13.82,3.14,6.115044
cansejo01,Jose Canseco,1988,705,610,120,187,34,0,42,124,78,128,10,1,6,0.306557,0.568852,0.390625,0.959477,133.655523,0.262295,5.507812,9.038462,1.641026,5.685484
cansejo01,Jose Canseco,1989,258,227,40,61,9,1,17,57,23,69,2,0,6,0.268722,0.54185,0.333333,0.875184,41.328,0.273128,3.73913,11.217391,3.0,4.526316
cansejo01,Jose Canseco,1990,563,481,83,132,14,2,37,101,72,158,5,0,5,0.274428,0.54262,0.371226,0.913845,96.282098,0.268191,3.563291,7.819444,2.194444,5.574257
cansejo01,Jose Canseco,1991,665,572,115,152,32,1,44,122,78,152,9,0,6,0.265734,0.555944,0.359398,0.915343,112.523077,0.29021,4.375,8.525641,1.948718,5.45082
cansejo01,Jose Canseco,1992,512,439,74,107,15,0,26,87,63,128,6,0,4,0.243736,0.455581,0.34375,0.799331,67.729084,0.211845,4.0,8.126984,2.031746,5.885057
cansejo01,Jose Canseco,1993,253,231,30,59,14,1,10,46,16,62,3,0,3,0.255411,0.454545,0.3083,0.762846,31.882591,0.199134,4.080645,15.8125,3.875,5.5
cansejo01,Jose Canseco,1994,505,429,88,121,19,2,31,90,69,114,5,0,2,0.282051,0.552448,0.386139,0.938586,90.421687,0.270396,4.429825,7.318841,1.652174,5.611111


In [145]:
t2.loc[['grubeke01']]

ID,Player,Season,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS,RC,ISO,PA_SO,PA_BB,SO_BB,PA_RBI
grubeke01,Kelly Gruber,1984,16,16,1,1,0,0,1,2,0,5,0,0,0,0.0625,0.25,0.0625,0.3125,0.25,0.1875,3.2,inf,inf,8.0
grubeke01,Kelly Gruber,1985,13,13,0,3,0,0,0,1,0,3,0,0,0,0.230769,0.230769,0.230769,0.461538,0.692308,0.0,4.333333,inf,inf,13.0
grubeke01,Kelly Gruber,1986,152,143,12,28,4,1,5,15,5,27,0,2,2,0.195804,0.342657,0.22,0.562657,10.925676,0.146853,5.62963,30.4,5.4,10.133333
grubeke01,Kelly Gruber,1987,368,341,47,80,14,3,12,36,17,70,7,1,2,0.234604,0.398827,0.283379,0.682206,36.849162,0.164223,5.257143,21.647059,4.117647,10.222222
grubeke01,Kelly Gruber,1988,623,569,75,158,33,5,16,81,38,92,7,5,4,0.27768,0.43761,0.328479,0.766089,80.401977,0.15993,6.771739,16.394737,2.421053,7.691358
grubeke01,Kelly Gruber,1989,583,545,83,158,24,4,18,73,30,60,3,0,5,0.289908,0.447706,0.327616,0.775322,79.777391,0.157798,9.716667,19.433333,2.0,7.986301
grubeke01,Kelly Gruber,1990,662,592,92,162,36,6,31,118,48,94,8,1,13,0.273649,0.511824,0.329803,0.841628,99.421875,0.238176,7.042553,13.791667,1.958333,5.610169
grubeke01,Kelly Gruber,1991,474,429,58,108,18,2,20,65,31,70,6,3,5,0.251748,0.44289,0.307856,0.750746,57.413043,0.191142,6.771429,15.290323,2.258065,7.292308
grubeke01,Kelly Gruber,1992,481,446,42,102,16,3,11,43,26,72,4,1,4,0.2287,0.352018,0.275,0.627018,42.576271,0.123318,6.680556,18.5,2.769231,11.186047
grubeke01,Kelly Gruber,1993,70,65,10,18,3,0,3,9,2,11,1,2,0,0.276923,0.461538,0.308824,0.770362,8.955224,0.184615,6.363636,35.0,5.5,7.777778


### Visualization(s)

TODO: Consider before-and-after bar graphs.

## Model Selection

### Visualization(s)

## Model Evaluation

### Visualization(s)

## Concluding Comments