# Using Historical Data to Predict Batting Success

Authored by: Donna J. Harris (994042890)

Email: harr2890@mylaurier.ca

For: CP640 Machine Learning (S22) with Professor Elham Harirpoush

## Introductory Comments

Based on the project proposal of the same name, this Jupyter Notebook walks through the exploration and processing of Major League Baseball batting data from 1901 to 2021, with the aim of discovering whether, and how, historical data can be used to predict batting success.

The Kaggle dataset being used as the primary data source can be found in the `data` folder of the project folder structure: `./data/mlbbatting1901-2021.csv`

Each data record in the original dataset represents an individual batter's performance in a single game. This is why there are so many records. In a single game, there will be at least 18 batters with plate appearances across both teams, and often more with player substitutions, especially across extra innings.

## Environment Setup

Import and establish environment for initial work, including showing all dataframe column values.

In [489]:
import pandas as pd

pd.set_option('display.max_columns', None)

## Preprocessing

### Original Data

Acquire the data from the Kaggle dataset and place into a dataframe.

In [490]:
original_data_source = "./data/mlbbatting1901-2021.csv"

df = pd.read_csv(original_data_source)

Confirm the original data has been loaded into the data frame. (It should start with the earliest records from 1901 and end with the latest records from 2021.)

In [491]:
print(df)

                ID            Player        Date   Tm  Opp    Rslt  PA  AB  R  \
0        crossmo01       Monte Cross  1901-04-18  PHI  BRO  L 7-12   5   4  2   
1        dahlebi01       Bill Dahlen  1901-04-18  BRO  PHI  W 12-7   5   4  2   
2         dalyto01          Tom Daly  1901-04-18  BRO  PHI  W 12-7   5   5  1   
3        davisle01       Lefty Davis  1901-04-18  BRO  PHI  W 12-7   5   5  1   
4        delahed01      Ed Delahanty  1901-04-18  PHI  BRO  L 7-12   5   4  1   
...            ...               ...         ...  ...  ...     ...  ..  .. ..   
4285624  woodfja01     Jake Woodford  2021-10-03  STL  CHC   L 2-3   2   1  0   
4285625  yastrmi01  Mike Yastrzemski  2021-10-03  SFG  SDP  W 11-4   4   3  1   
4285626  zimmebr01    Bradley Zimmer  2021-10-03  CLE  TEX   W 6-0   4   4  1   
4285627  zimmery01    Ryan Zimmerman  2021-10-03  WSN  BOS   L 5-7   4   3  0   
4285628  zuninmi01       Mike Zunino  2021-10-03  TBR  NYY   L 0-1   4   4  0   

         H  2B  3B  HR  RBI

In [492]:
print(df.columns)

Index(['ID', 'Player', 'Date', 'Tm', 'Opp', 'Rslt', 'PA', 'AB', 'R', 'H', '2B',
       '3B', 'HR', 'RBI', 'BB', 'IBB', 'SO', 'HBP', 'SH', 'SF', 'ROE', 'GDP',
       'SB', 'CS', 'WPA', 'RE24', 'aLI', 'BOP', 'Pos Summary', 'DFS(DK)',
       'DFS(FD)'],
      dtype='object')


We have our confirmation that there are 31 feature columns and 4,285,629 data records.

### Extract Data

There is some data that might be of use to us that is trapped in existing columns. First, we want to extract the result of the game and the score of the game from the `'Rslt'` column. Then, we want to extract the year the game was played (denoting the season) from the `'Date'` column.

In [493]:
df[['Result','Score']] = df['Rslt'].str.split(' ', expand=True)

df['Season'] = df['Date'].str[:4]
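
As a minimal sketch, the same string operations can be tried in isolation on a tiny hand-made frame. The three rows copy `Rslt` and `Date` values visible in the printed output above. The `RunsFor` and `RunsAgainst` columns are hypothetical additions for illustration only; this notebook does not create them.

```python
import pandas as pd

# Tiny stand-in for the real data; values copied from the printed output above.
sample = pd.DataFrame({
    "Rslt": ["L 7-12", "W 12-7", "W 6-0"],
    "Date": ["1901-04-18", "1901-04-18", "2021-10-03"],
})

# Split "W 12-7" into the result letter and the score string.
sample[["Result", "Score"]] = sample["Rslt"].str.split(" ", n=1, expand=True)

# The first four characters of a YYYY-MM-DD date are the year (the season).
sample["Season"] = sample["Date"].str[:4]

# Hypothetical extra step: split the score into numeric runs for and against.
sample[["RunsFor", "RunsAgainst"]] = (
    sample["Score"].str.split("-", expand=True).astype(int)
)

print(sample[["Result", "Score", "Season", "RunsFor", "RunsAgainst"]])
```

Note that `Season` is still a string at this point, which is why a type conversion is needed later on.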

### Column Removal

The original dataset contains a number of columns that we already know, before going any further, we have no use for.

We no longer need `'Rslt'`, as we just split its interesting information into separate columns. Likewise with `'Date'`, we have what we need in the new `'Season'` column, so we can remove the `'Date'` column.

In [494]:
del df['Rslt']

In [495]:
del df['Date']

Daily fantasy sports points (used for fantasy leagues and betting) have no purpose within this project, so we can safely remove `'DFS(DK)'` and `'DFS(FD)'` from the data.

In [496]:
del df['DFS(DK)']
del df['DFS(FD)']

Similarly, to reduce complexity, we are not considering any statistics relating to fielding or base running/stealing. As such, we can remove `'SB'` and `'CS'`, which are the number of stolen bases and times caught stealing, respectively, as well as the `'Pos Summary'` (position summary) data.

In [497]:
del df['SB']
del df['CS']
del df['Pos Summary']

To further reduce complexity, we will remove the `'IBB'` (intentional walks) column as this is a subset of values tracked under the walks column (`'BB'`).  (Reference: https://en.wikipedia.org/wiki/Base_on_balls#Intentional_base_on_balls)

In [498]:
del df['IBB']

We can remove the `'GDP'` column, which represents the number of times a player hits into a double play (two outs). While this statistic has some bearing on the success of a batter, for this project we will exclude this nuance and focus on run production and getting on base through other statistical means.

In [499]:
del df['GDP']

As with `'GDP'`, we will disregard the `'ROE'` column, which represents the number of times a player reaches base due to a fielding error by the opposing team. This statistic also has some bearing on the success of a batter, but it leans more toward the player's running ability and street smarts, as well as a bit of luck. Again, for this project, these nuances will be excluded for simplicity.

In [500]:
del df['ROE']

Before continuing on to value checking, let's look and see where the data is at after these processing operations.

In [501]:
print(df)

                ID            Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
0        crossmo01       Monte Cross  PHI  BRO   5   4  2  2   1   0   0  1.0   
1        dahlebi01       Bill Dahlen  BRO  PHI   5   4  2  3   0   0   0  0.0   
2         dalyto01          Tom Daly  BRO  PHI   5   5  1  2   1   0   0  3.0   
3        davisle01       Lefty Davis  BRO  PHI   5   5  1  1   0   0   0  0.0   
4        delahed01      Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0  0.0   
...            ...               ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
4285624  woodfja01     Jake Woodford  STL  CHC   2   1  0  0   0   0   0  0.0   
4285625  yastrmi01  Mike Yastrzemski  SFG  SDP   4   3  1  1   1   0   0  2.0   
4285626  zimmebr01    Bradley Zimmer  CLE  TEX   4   4  1  2   0   0   0  1.0   
4285627  zimmery01    Ryan Zimmerman  WSN  BOS   4   3  0  0   0   0   0  1.0   
4285628  zuninmi01       Mike Zunino  TBR  NYY   4   4  0  0   0   0   0  0.0   

         BB  SO  HBP  SH   

In [502]:
print(df.columns)

Index(['ID', 'Player', 'Tm', 'Opp', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR',
       'RBI', 'BB', 'SO', 'HBP', 'SH', 'SF', 'WPA', 'RE24', 'aLI', 'BOP',
       'Result', 'Score', 'Season'],
      dtype='object')


We added three new columns (`'Result'`, `'Score'`, and `'Season'`), which we may or may not need later, and removed ten existing columns. So, this looks correct: 31 original features + 3 new features - 10 removed features = 24 feature columns.

We still have 4,285,629 data records, as we have not removed any records yet.
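
The column bookkeeping can be double-checked with a short sketch that uses only the column names printed above. An empty frame is enough for this, and `DataFrame.drop(columns=...)` is shown here as a one-call alternative to the individual `del` statements, not as the approach this notebook takes. The variable names are illustrative.

```python
import pandas as pd

# The 31 column names from the first print(df.columns) output.
original_columns = [
    "ID", "Player", "Date", "Tm", "Opp", "Rslt", "PA", "AB", "R", "H", "2B",
    "3B", "HR", "RBI", "BB", "IBB", "SO", "HBP", "SH", "SF", "ROE", "GDP",
    "SB", "CS", "WPA", "RE24", "aLI", "BOP", "Pos Summary", "DFS(DK)",
    "DFS(FD)",
]
added = ["Result", "Score", "Season"]
removed = ["Rslt", "Date", "DFS(DK)", "DFS(FD)", "SB", "CS", "Pos Summary",
           "IBB", "GDP", "ROE"]

# An empty frame carries the column labels without needing any data.
check = pd.DataFrame(columns=original_columns + added)
check = check.drop(columns=removed)  # raises KeyError if a name is misspelled

print(len(original_columns), "+", len(added), "-", len(removed), "=", check.shape[1])
```

The remaining labels match the second `print(df.columns)` output, in the same order.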

### Value Checking and Data Validation

**ID - Player ID**

Should be unique to each player. This will indicate how many players we currently have, based on player ID.

In [742]:
player_ids = pd.unique(df['ID'])
print(player_ids)

print("\nNumber of unique player IDs: ", len(player_ids))

['delahed01' 'dolanjo02' 'childcu01' ... 'adonjo01' 'paynety01'
 'stridsp01']

Number of unique player IDs:  14336


**Player - Player Name**

The number of unique player names should be roughly the same as the number of unique player IDs. Discrepancies are possible (errors in spelling, etc.), but this is not highly significant for our statistics since we will key everything on the more reliable Player ID. We will hang onto this field to help humans identify players by name.

In [504]:
print(pd.unique(df['Player']))

print("\nNumber of unique player names: ", len(pd.unique(df['Player'])))

['Monte Cross' 'Bill Dahlen' 'Tom Daly' ... 'Joan Adon' 'Tyler Payne'
 'Spencer Strider']

Number of unique player names:  15595


**Note:** We found 15,985 unique IDs and 15,595 unique player names, a difference of 390 in favour of the IDs. (The ID count of 14,336 printed above appears to come from a later re-run of that cell, numbered In [742], after player records had been removed further on in this notebook.) The difference is easily explained by different players over the years sharing the same name. This nuance will be disregarded for this project. As previously stated, we will use the ID for all but human identification purposes.
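
One way to test the shared-name explanation is to count how many distinct IDs sit behind each name. The sketch below uses made-up rows (the IDs and names are hypothetical, not taken from the dataset), and it assumes each ID maps to a single spelling of the name.

```python
import pandas as pd

# Hypothetical rows: two different players who share one name.
sample = pd.DataFrame({
    "ID": ["smithjo01", "smithjo02", "smithjo01", "doeja01"],
    "Player": ["John Smith", "John Smith", "John Smith", "Jane Doe"],
})

# Number of distinct IDs behind each player name.
ids_per_name = sample.groupby("Player")["ID"].nunique()
shared_names = ids_per_name[ids_per_name > 1]
print(shared_names)

# Difference between the two unique counts, as compared in the note above.
gap = sample["ID"].nunique() - sample["Player"].nunique()
print("Unique IDs minus unique names:", gap)
```

Running the same grouping the other way (distinct names per ID) would surface spelling variants instead.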

**Season - Year the game was played in**
(Extracted from `'Date'`)

The values should all be visibly in YYYY format between the years of 1901 and 2021, inclusive.

Ideally, these values will be Integers -- but they have started as strings.

In [505]:
print(pd.unique(df['Season']))

['1901' '1902' '1903' '1904' '1905' '1906' '1907' '1908' '1909' '1910'
 '1911' '1912' '1913' '1914' '1915' '1916' '1917' '1918' '1919' '1920'
 '1921' '1922' '1923' '1924' '1925' '1926' '1927' '1928' '1929' '1930'
 '1931' '1932' '1933' '1934' '1935' '1936' '1937' '1938' '1939' '1940'
 '1941' '1942' '1943' '1944' '1945' '1946' '1947' '1948' '1949' '1950'
 '1951' '1952' '1953' '1954' '1955' '1956' '1957' '1958' '1959' '1960'
 '1961' '1962' '1963' '1964' '1965' '1966' '1967' '1968' '1969' '1970'
 '1971' '1972' '1973' '1974' '1975' '1976' '1977' '1978' '1979' '1980'
 '1981' '1982' '1983' '1984' '1985' '1986' '1987' '1988' '1989' '1990'
 '1991' '1992' '1993' '1994' '1995' '1996' '1997' '1998' '1999' '2000'
 '2001' '2002' '2003' '2004' '2005' '2006' '2007' '2008' '2009' '2010'
 '2011' '2012' '2013' '2014' '2015' '2016' '2017' '2018' '2019' '2020'
 '2021']


We can see that every value is a four-digit year, so let's convert them from strings to actual integers in the dataframe.

In [506]:
df['Season'] = df['Season'].astype(int)
# test
print(pd.unique(df['Season']))

[1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914
 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928
 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942
 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
 2013 2014 2015 2016 2017 2018 2019 2020 2021]
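
The visual check can also be expressed programmatically. This is a minimal sketch on a small hypothetical series rather than the full column: `pd.to_numeric` with `errors="coerce"` turns anything non-numeric into NaN, and `Series.between` then tests the inclusive 1901 to 2021 range.

```python
import pandas as pd

# Small stand-in for df['Season'] before conversion (strings).
seasons = pd.Series(["1901", "1950", "2021"], name="Season")

# errors="coerce" yields NaN for non-numeric text instead of raising.
as_numbers = pd.to_numeric(seasons, errors="coerce")

# between() is inclusive on both ends; NaN compares as False.
in_range = as_numbers.between(1901, 2021)
all_valid = bool(in_range.all())
print("All seasons valid:", all_valid)

# Only convert once every value has passed the check.
if all_valid:
    seasons_int = as_numbers.astype(int)
    print(seasons_int.tolist())
```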


**Tm - Player's Team** and **Opp - Opponent**

The team values (both the player's team and the opposing team) should all be visibly in ZZZ format, belonging to a recognizable team between the 1901-2021 seasons. These are fields that will likely get dropped later on, but I'm keeping them until I know for sure I don't want them.

In [507]:
print("Player's Team (Tm):\n", pd.unique(df['Tm']))
print("\nOpposing Team (Opp):\n", pd.unique(df['Opp']))

Player's Team (Tm):
 ['PHI' 'BRO' 'BSN' 'NYG' 'STL' 'CHC' 'PIT' 'CIN' 'CLE' 'CHW' 'MLA' 'DET'
 'PHA' 'WSH' 'BOS' 'BLA' 'SLB' 'NYY' 'BUF' 'BAL' 'PBS' 'BTT' 'CHI' 'IND'
 'SLM' 'KCP' 'NEW' 'MLN' 'KCA' 'LAD' 'SFG' 'WSA' 'MIN' 'LAA' 'HOU' 'NYM'
 'CAL' 'ATL' 'OAK' 'KCR' 'MON' 'SDP' 'SEP' 'MIL' 'TEX' 'SEA' 'TOR' 'FLA'
 'COL' 'ANA' 'ARI' 'TBD' 'WSN' 'TBR' 'MIA']

Opposing Team (Opp):
 ['BRO' 'PHI' 'NYG' 'BSN' 'CHC' 'STL' 'CIN' 'PIT' 'CHW' 'CLE' 'DET' 'MLA'
 'WSH' 'PHA' 'BLA' 'BOS' 'SLB' 'NYY' 'BAL' 'BUF' 'BTT' 'PBS' 'KCP' 'SLM'
 'IND' 'CHI' 'NEW' 'MLN' 'KCA' 'SFG' 'LAD' 'WSA' 'MIN' 'LAA' 'HOU' 'NYM'
 'CAL' 'ATL' 'OAK' 'MON' 'SDP' 'KCR' 'SEP' 'MIL' 'TEX' 'SEA' 'TOR' 'COL'
 'FLA' 'ANA' 'ARI' 'TBD' 'WSN' 'TBR' 'MIA']
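
The three-letter format can be checked with a regular expression instead of by eye. In this sketch the first four codes come from the output above, while `"ny"` is a deliberately malformed, hypothetical value added to show what a failure looks like.

```python
import pandas as pd

# Four real codes from the output above plus one hypothetical bad value.
codes = pd.Series(["PHI", "BRO", "NYY", "TBR", "ny"])

# fullmatch requires the whole string to be exactly three capital letters.
is_three_letter = codes.str.fullmatch(r"[A-Z]{3}")
bad_codes = codes[~is_three_letter]

print("Codes failing the format check:", bad_codes.tolist())
```

The pattern only checks the shape of the code; whether a code belongs to a recognizable team still needs a human look at the list.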


**PA - Plate Appearances**

Appearances should be an integer value, between the range of 1 and some upper value. (A 0 would indicate the player didn't bat in the game which would mean there should not be a record.)

The upper value will vary, although (speaking as a baseball fan) five plate appearances is pretty standard in a regular, nine-inning, low scoring game. But as soon as you get into higher scores and/or extra inning games, players can be up to bat many times.

In [508]:
print(pd.unique(df['PA']))

[ 5  4  3  1  2  6  7  8  9 10 11 12]


**Note:** 12 was the upper value. I'm curious about what era these are from, so let's take a look:

In [509]:
print(df[df['PA'] == 12])

                ID         Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
2062446  millafe01   Felix Millan  NYM  STL  12  10  1  4   0   0   0  0.0   
2062447  milnejo01    John Milner  NYM  STL  12  10  0  2   1   0   0  1.0   
2458599  baineha01  Harold Baines  CHW  MIL  12  10  1  2   1   0   1  1.0   
2458643   fiskca01   Carlton Fisk  CHW  MIL  12  11  1  3   1   0   0  1.0   
2458686    lawru01       Rudy Law  CHW  MIL  12  11  1  4   0   0   0  1.0   

         BB  SO  HBP  SH   SF    WPA   RE24    aLI  BOP Result Score  Season  
2062446   1   0    0   1  0.0  0.060  0.422  1.894    2      L   3-4    1974  
2062447   2   3    0   0  0.0 -0.250 -0.399  2.147    4      L   3-4    1974  
2458599   2   0    0   0  0.0  0.195 -0.083  2.204    5      W   7-6    1984  
2458643   1   3    0   0  0.0 -0.237 -0.649  2.455    2      W   7-6    1984  
2458686   1   0    0   0  0.0  0.511  1.816  1.958    1      W   7-6    1984  


**AB - At Bats**

Similarly to Plate Appearances, At Bats should be an integer value. It should be between the range of 0 and some upper value. (Here, a 0 would indicate the player had one plate appearance that did not statistically count as an At Bat, such as a walk.)

The upper range should follow, and not exceed the upper value of Plate Appearances, which was 12. Note that 12 is possible, but does not have to be a value in this collection of data.

In [510]:
print(pd.unique(df['AB']))

[ 4  5  3  2  1  6  0  7  8  9 10 11]


**Note:** 11 was the upper value, which is less than 12 (the maximum number of plate appearances).

We should validate that there are no records where there are more At Bats than Plate Appearances.

In [511]:
print("-----------------------------------------")
print("All records with PA < AB")
print("-----------------------------------------")
print(df[df['PA'] < df['AB']])

-----------------------------------------
All records with PA < AB
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**Note:** The validation checks pass, as we have found no records where PA < AB. No extra investigation or validation is required here.

**R - Runs**

Runs should also be an integer value, in the range of 0 and some upper value. The upper value can be, at most, one larger than the number of plate appearances. In general, that's 12+1 for this dataset, but that's a highly unlikely value to see as number of runs. (We'll address individual records in the next step.)

In [512]:
print(pd.unique(df['R']))

[2 1 0 4 3 5 6]


**Note:** We want to do some validation within each individual data record to look for instances where the number of plate appearances is less than the number of runs (e.g., a player who enters as a pinch runner and scores, then later bats once and scores again, has 1 plate appearance and 2 runs). Within that subset of records, we then look for instances where the difference between the `'PA'` and `'R'` values is more than one. (If we find any, we may have a data issue.)

In [513]:
records_pa_lt_r = df[df['PA'] < df['R']]
print("-----------------------------------------")
print("All records with PA < R")
print("-----------------------------------------")
print(records_pa_lt_r)

print("\n\n-----------------------------------------")
print("All records with PA < R where PA != R-1")
print("-----------------------------------------")
print(records_pa_lt_r[ records_pa_lt_r['PA'] != ((records_pa_lt_r['R']-1)) ])


-----------------------------------------
All records with PA < R
-----------------------------------------
                ID          Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
97903     lowebo01      Bobby Lowe  DET  WSH   1   1  2  1   1   0   0  0.0   
203202   stanljo02     Joe Stanley  CHC  BSN   1   0  2  0   0   0   0  0.0   
221883   collibi02    Bill Collins  BSN  STL   1   1  2  1   0   0   0  NaN   
223671   keelewi01   Willie Keeler  NYG  CHC   1   1  2  1   0   0   0  NaN   
230698   butlear01      Art Butler  BSN  PHI   1   1  2  0   0   0   0  NaN   
...            ...             ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
4245067  hamilbi02  Billy Hamilton  CHW  MIN   1   1  2  0   0   0   0  0.0   
4256455  dubonma01  Mauricio Dubon  SFG  PHI   1   1  2  1   0   0   0  0.0   
4262325  whiteel04       Eli White  TEX  OAK   1   1  2  0   0   0   0  0.0   
4264040   kempto01       Tony Kemp  OAK  LAA   1   1  2  1   0   0   0  0.0   
4268197  davisjo05  Jon

**Note:** The validation checks pass, as we have found no records where PA < R and PA != R-1. No extra investigation or validation is required here.
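
Because both columns hold integers, the two-step filter (PA < R, then PA != R-1) is equivalent to asking whether runs exceed plate appearances by more than one. A minimal sketch of that single-step form follows. The `lowebo01` and `crossmo01` values are taken from the output above, and `hypothet01` is a made-up record included so that the filter has something to catch.

```python
import pandas as pd

sample = pd.DataFrame({
    "ID": ["lowebo01", "hypothet01", "crossmo01"],
    "PA": [1, 1, 5],
    "R":  [2, 3, 2],
})

# How many more runs than plate appearances each record has.
excess_runs = sample["R"] - sample["PA"]

# A surplus of exactly 1 is plausible; anything larger is suspicious.
suspicious = sample[excess_runs > 1]
print(suspicious)
```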

**H - Hits**

Hits should also be an integer value, in the range of 0 and some upper value. The upper value can be, at most, the number of plate appearances. In general, that's 12 for this dataset, but that's a highly unlikely value to see as number of hits. (We'll address individual records in the next step.)

In [514]:
pd.unique(df['H'])

array([2, 3, 1, 0, 4, 5, 6, 9, 7])

**Note:** We should also confirm that there are never more hits than plate appearances within individual records.

In [515]:
records_pa_lt_h = df[df['PA'] < df['H']]
print("-----------------------------------------")
print("All records with PA < H")
print("-----------------------------------------")
print(records_pa_lt_h)

-----------------------------------------
All records with PA < H
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**Note:** The validation checks pass, as we have found no records where PA < H. No extra investigation or validation is required here.

**2B - Doubles**, **3B - Triples**, and **HR - Home Runs**

All of these extra base hits must be integer values, within the range of 0 up to the number of Plate Appearances.

In [516]:
print("2B: ", pd.unique(df['2B']))
print("3B: ", pd.unique(df['3B']))
print("HR: ", pd.unique(df['HR']))

2B:  [1 0 2 4 3]
3B:  [0 1 3 2]
HR:  [0 1 2 3 4]


**Note:** These are all reasonable values at a glance.

We should also confirm that these values are never greater than the number of Plate Appearances, At Bats, or Hits within individual records. (Note that Hits (`'H'`) represents all kinds of hits, not just single-base hits.)

In [517]:
records_pa_lt_2b = df[df['PA'] < df['2B']]
print("-----------------------------------------")
print("All records with PA < 2B")
print("-----------------------------------------")
print(records_pa_lt_2b)


records_pa_lt_3b = df[df['PA'] < df['3B']]
print("\n-----------------------------------------")
print("All records with PA < 3B")
print("-----------------------------------------")
print(records_pa_lt_3b)


records_pa_lt_hr = df[df['PA'] < df['HR']]
print("\n-----------------------------------------")
print("All records with PA < HR")
print("-----------------------------------------")
print(records_pa_lt_hr)


records_ab_lt_2b = df[df['AB'] < df['2B']]
print("\n-----------------------------------------")
print("All records with AB < 2B")
print("-----------------------------------------")
print(records_ab_lt_2b)


records_ab_lt_3b = df[df['AB'] < df['3B']]
print("\n-----------------------------------------")
print("All records with AB < 3B")
print("-----------------------------------------")
print(records_ab_lt_3b)


records_ab_lt_hr = df[df['AB'] < df['HR']]
print("\n-----------------------------------------")
print("All records with AB < HR")
print("-----------------------------------------")
print(records_ab_lt_hr)


records_h_lt_2b = df[df['H'] < df['2B']]
print("\n-----------------------------------------")
print("All records with H < 2B")
print("-----------------------------------------")
print(records_h_lt_2b)


records_h_lt_3b = df[df['H'] < df['3B']]
print("\n-----------------------------------------")
print("All records with H < 3B")
print("-----------------------------------------")
print(records_h_lt_3b)


records_h_lt_hr = df[df['H'] < df['HR']]
print("\n-----------------------------------------")
print("All records with H < HR")
print("-----------------------------------------")
print(records_h_lt_hr)

-----------------------------------------
All records with PA < 2B
-----------------------------------------
                ID          Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
1275367  robined01  Eddie Robinson  CHW  SLB   1   1  0  0   2   0   0  0.0   

         BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score  Season  
1275367   0   0    0   0 NaN  NaN   NaN  NaN    4      L  6-10    1950  

-----------------------------------------
All records with PA < 3B
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with PA < HR
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with 

**Note:** These validation checks caught a couple of problematic records. Looking at the entirety of these records, it looks like a simple data error, where the number of Hits needs to be updated to reflect the number of extra base hits. Instead of removing all records for these players, we will make these small data adjustments.

**First for Eddie Robinson:**

In [518]:
ed_robinson = 1275367

df.loc[[ed_robinson]]

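| | ID | Player | Tm | Opp | PA | AB | R | H | 2B | 3B | HR | RBI | BB | SO | HBP | SH | SF | WPA | RE24 | aLI | BOP | Result | Score | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1275367 | robined01 | Eddie Robinson | CHW | SLB | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | 4 | L | 6-10 | 1950 |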


For **Eddie Robinson's record** (1275367), we will use the doubles (`'2B'`) statistic value as the value for Hits, Plate Appearances, and At Bats. These are reasonable guesses, based on the record.

In [519]:
df.at[ed_robinson, 'H'] = df.at[ed_robinson, '2B']
df.at[ed_robinson, 'AB'] = df.at[ed_robinson, '2B']
df.at[ed_robinson, 'PA'] = df.at[ed_robinson, '2B']

df.loc[[ed_robinson]]

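| | ID | Player | Tm | Opp | PA | AB | R | H | 2B | 3B | HR | RBI | BB | SO | HBP | SH | SF | WPA | RE24 | aLI | BOP | Result | Score | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1275367 | robined01 | Eddie Robinson | CHW | SLB | 2 | 2 | 0 | 2 | 2 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | 4 | L | 6-10 | 1950 |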


**Next for Joe Tipton:**

In [520]:
joe_tipton = 1272928

df.loc[[joe_tipton]]

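| | ID | Player | Tm | Opp | PA | AB | R | H | 2B | 3B | HR | RBI | BB | SO | HBP | SH | SF | WPA | RE24 | aLI | BOP | Result | Score | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1272928 | tiptojo01 | Joe Tipton | PHA | CHW | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0.0 | 1 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | 8 | L | 3-10 | 1950 |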


For **Joe Tipton's record** (1272928), we will use the triples (`'3B'`) statistic value as the value for Hits. The values for Plate Appearances and At Bats may well be correct: with 3 plate appearances, 2 at bats, and 1 walk, they are not obviously wrong, so we won't change them. This is a reasonable guess, based on the record.

In [521]:
df.at[joe_tipton, 'H'] = df.at[joe_tipton, '3B']

df.loc[[joe_tipton]]

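| | ID | Player | Tm | Opp | PA | AB | R | H | 2B | 3B | HR | RBI | BB | SO | HBP | SH | SF | WPA | RE24 | aLI | BOP | Result | Score | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1272928 | tiptojo01 | Joe Tipton | PHA | CHW | 3 | 2 | 0 | 1 | 0 | 1 | 0 | 0.0 | 1 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | 8 | L | 3-10 | 1950 |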
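
Both problem records share one property: the recorded Hits are fewer than the extra base hits combined. A stricter variant of the check, which is not the one used in this notebook, compares `'H'` against the sum of `'2B'`, `'3B'`, and `'HR'` rather than against each column separately. The sketch below uses the pre-correction values of the two records above plus Monte Cross's first row as a consistent example.

```python
import pandas as pd

# Pre-correction values for the two problem records, plus one clean record.
sample = pd.DataFrame({
    "ID": ["robined01", "tiptojo01", "crossmo01"],
    "H":  [0, 0, 2],
    "2B": [2, 0, 1],
    "3B": [0, 1, 0],
    "HR": [0, 0, 0],
})

# Every double, triple, and home run is also a hit, so H can never be
# smaller than their combined total.
extra_base_hits = sample[["2B", "3B", "HR"]].sum(axis=1)
inconsistent = sample[sample["H"] < extra_base_hits]
print(inconsistent)
```

The combined form would also catch a record with, say, one hit but one double and one triple, which the separate comparisons would miss.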


**Re-testing the original checks**

Re-running the tests that found these data issues should now pass and not introduce any new issues.

In [522]:
records_pa_lt_2b = df[df['PA'] < df['2B']]
print("-----------------------------------------")
print("All records with PA < 2B")
print("-----------------------------------------")
print(records_pa_lt_2b)


records_pa_lt_3b = df[df['PA'] < df['3B']]
print("\n-----------------------------------------")
print("All records with PA < 3B")
print("-----------------------------------------")
print(records_pa_lt_3b)


records_pa_lt_hr = df[df['PA'] < df['HR']]
print("\n-----------------------------------------")
print("All records with PA < HR")
print("-----------------------------------------")
print(records_pa_lt_hr)


records_ab_lt_2b = df[df['AB'] < df['2B']]
print("\n-----------------------------------------")
print("All records with AB < 2B")
print("-----------------------------------------")
print(records_ab_lt_2b)


records_ab_lt_3b = df[df['AB'] < df['3B']]
print("\n-----------------------------------------")
print("All records with AB < 3B")
print("-----------------------------------------")
print(records_ab_lt_3b)


records_ab_lt_hr = df[df['AB'] < df['HR']]
print("\n-----------------------------------------")
print("All records with AB < HR")
print("-----------------------------------------")
print(records_ab_lt_hr)


records_h_lt_2b = df[df['H'] < df['2B']]
print("\n-----------------------------------------")
print("All records with H < 2B")
print("-----------------------------------------")
print(records_h_lt_2b)


records_h_lt_3b = df[df['H'] < df['3B']]
print("\n-----------------------------------------")
print("All records with H < 3B")
print("-----------------------------------------")
print(records_h_lt_3b)


records_h_lt_hr = df[df['H'] < df['HR']]
print("\n-----------------------------------------")
print("All records with H < HR")
print("-----------------------------------------")
print(records_h_lt_hr)

-----------------------------------------
All records with PA < 2B
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with PA < 3B
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with PA < HR
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with AB < 2B
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA

**SUCCESS!!** These data issues have been resolved. `'2B'`, `'3B'`, and `'HR'` are now validated.
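
The nine pairwise comparisons above all follow the same pattern, so they could be written as a loop over column pairs. This is a sketch only: `find_violations` is a hypothetical helper name, and the second sample row reuses Eddie Robinson's pre-correction values so that some checks fail.

```python
import pandas as pd

def find_violations(frame, pairs):
    """Map each (upper, lower) pair to the rows where frame[upper] < frame[lower]."""
    return {
        (upper, lower): frame[frame[upper] < frame[lower]]
        for upper, lower in pairs
    }

# Every combination of PA/AB/H against 2B/3B/HR: nine checks in total.
pairs = [(upper, lower)
         for upper in ("PA", "AB", "H")
         for lower in ("2B", "3B", "HR")]

sample = pd.DataFrame({
    "PA": [5, 1], "AB": [4, 1], "H": [2, 0],
    "2B": [1, 2], "3B": [0, 0], "HR": [0, 0],
})

violations = find_violations(sample, pairs)
for (upper, lower), rows in violations.items():
    print(f"{upper} < {lower}: {len(rows)} record(s)")
```

Keeping the pairs in one list makes it easy to re-run the identical set of checks after a correction.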

**RBI - RBIs (Runs Batted In)**

This is an important statistic, especially for calculated batting statistics which may prove useful later.

RBIs should be an integer value, in the range of 0 and some upper value. The upper value is dependent on the number of runners on base at the time of the plate appearance, which we do not know directly from the data. We can estimate a maximum possible value of Plate Appearances * 4 (the maximum number of runs possible to bat in). This maximum value would be statistically possible but highly unlikely. But it means that if we see a value higher than 12 * 4 = 48 in a single game then it is definitely out of range.


In [523]:
print(pd.unique(df['RBI']))

[ 1.  0.  3.  4.  2.  5.  6.  8.  7. nan  9. 12. 11. 10.]


**Note:** Of course, we don't see anything nearly as extreme as 48, but we have multiple problems here: (1) RBI is stored as a Float value, which makes no sense in this context, and (2) we have a NaN/undefined value to deal with.

While it would be nice to convert to integer values, the presence of NaN values blocks our carrying out this operation. So, first, we have to make a decision about how to deal with the NaN values.
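
For context, pandas also offers a nullable integer dtype, `"Int64"` (capital I), which can hold whole numbers alongside missing values. It is mentioned here only as an alternative; this notebook resolves the NaN values by removing records instead. A minimal sketch on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Float column with a missing value, like the RBI column at this stage.
rbi = pd.Series([1.0, 0.0, np.nan, 3.0], name="RBI")

# The nullable dtype keeps the gap as <NA> and stores the rest as integers.
rbi_nullable = rbi.astype("Int64")

print(rbi_nullable.dtype)
print("Missing values:", rbi_nullable.isna().sum())
```

This changes only how the column is stored. The missing values themselves still have to be handled before any per-player totals can be trusted.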

Let's look at how many records include these NaN values and how many unique players are impacted by this data issue.

In [524]:
rbi_is_nan = df.loc[pd.isna(df['RBI'])]
print(rbi_is_nan)
print()

print("\nUnique Players with this RBI-NaN problem:")
rbi_nan_players = pd.unique(rbi_is_nan['ID'])
print(rbi_nan_players)
print("Count: ",len(rbi_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(rbi_is_nan)/len(df)*100)
print("There are",len(rbi_nan_players),"players impacted by these",len(rbi_is_nan),"records.")
print("")

               ID           Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
41584   becklja01     Jake Beckley  CIN  PIT   4   4  2  2   0   0   0  NaN   
41594   corcoto01   Tommy Corcoran  CIN  PIT   4   4  0  0   0   0   0  NaN   
41598   donlimi01      Mike Donlin  CIN  PIT   4   4  1  1   0   0   0  NaN   
41615   kellejo01       Joe Kelley  CIN  PIT   4   3  0  1   0   0   0  NaN   
41621   magooge01    George Magoon  CIN  PIT   3   3  0  1   0   0   0  NaN   
...           ...              ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
476024  wambsbi01  Bill Wambsganss  CLE  SLB   4   4  0  1   0   0   0  NaN   
476025  weavebu01      Buck Weaver  CHW  DET   5   4  1  3   0   2   0  NaN   
476027  wilkiro01    Roy Wilkinson  CHW  DET   3   3  1  1   1   0   0  NaN   
476029   woodjo02   Smoky Joe Wood  CLE  SLB   4   4  0  1   0   0   0  NaN   
476031  youngra01      Ralph Young  DET  CHW   5   4  0  2   0   0   0  NaN   

        BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Re

While the number of records with a missing RBI value is less than 1% of the overall total, those records are spread across many players, and each of those players also has other records without the problem. A player with gaps in their RBI data would have incomplete totals, so it is safest to remove all records for these players, not only the problematic ones. Since all of the impacted data records are from seasons early in the 20th century, this does not raise any great concerns for this project.
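
Before removing anything, it can help to measure how far the removal will reach. The sketch below uses a small made-up frame (the IDs, seasons, and values are hypothetical) to compare the number of records with a missing RBI against the number of records belonging to the affected players.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "ID":     ["a01", "a01", "a01", "b01", "b01", "c01"],
    "Season": [1905, 1906, 1907, 1905, 1950, 1950],
    "RBI":    [np.nan, 1.0, 0.0, np.nan, 2.0, 1.0],
})

missing = sample["RBI"].isna()
affected_ids = sample.loc[missing, "ID"].unique()

# Every record of an affected player, including the ones with valid RBI.
affected_rows = sample["ID"].isin(affected_ids)

print("Records with missing RBI:", missing.sum())
print("Records belonging to affected players:", affected_rows.sum())
print("Missing RBI by season:",
      sample.loc[missing].groupby("Season").size().to_dict())
```

In this toy frame, two missing values lead to five records being removed, which is the same kind of trade-off described above.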

In [525]:
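# the removal: keep only players with no missing RBI values
# (.copy() makes df an independent frame, so later column assignments are safe)
df = df[~df['ID'].isin(rbi_nan_players)].copy()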

# test
print("\nTest for an empty list of RBI==NaN values, after removing rows:\n\n", df.loc[pd.isna(df['RBI'])])
print("\n\nReduced to",len(df),"records in main data frame after removing those players records entirely.")


Test for an empty list of RBI==NaN values, after removing rows:

 Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


Reduced to 3715522 records in main data frame after removing those players records entirely.


**Finally,** we can address our other issue: converting the type for RBI from float to integer.

In [526]:
df['RBI'] = df['RBI'].astype(int)
# test
print(pd.unique(df['RBI']))

[ 0  1  3  2  4  5  6  7  8 12  9 11 10]


**SUCCESS!!** No NaN values and all Integer values. And all of these are reasonable values at a glance.

We should also confirm that RBI values never exceed four times the number of Plate Appearances within individual records, which is the upper bound estimated earlier. (Note: There are many different scenarios for earning an RBI, but they all require a corresponding Plate Appearance.)

In [527]:
records_pa4_lt_rbi = df[df['PA']*4 < (df['RBI']) ]
print("\n-----------------------------------------")
print("All records with PA*4 < RBI")
print("-----------------------------------------")
print(records_pa4_lt_rbi)


-----------------------------------------
All records with PA*4 < RBI
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**Note:** All RBI-related data issues have been resolved and our check within a record has passed. `'RBI'` is now validated.

**BB - Base On Balls (Walks)**

Walks should be an integer value, within the range of 0 up to the number of Plate Appearances.

In [528]:
pd.unique(df['BB'])

array([1, 0, 2, 3, 4, 5, 6])

**Note:** These are all reasonable values at a glance.

We should also confirm that Walk values are never greater than the number of Plate Appearances within individual records.

In [529]:
records_pa_lt_bb = df[df['PA'] < (df['BB']) ]
print("\n-----------------------------------------")
print("All records with PA < BB")
print("-----------------------------------------")
print(records_pa_lt_bb)


-----------------------------------------
All records with PA < BB
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**HBP - Hit By Pitch**

Hit By Pitch should be an integer value, within the range of 0 up to the number of Plate Appearances.

Reference: https://en.wikipedia.org/wiki/Base_on_balls
HBP is **not** recorded as a walk/BB. ("A hit by pitch is not counted statistically as a walk, though the effect is mostly the same, with the batter receiving a free pass to first base.")

In [530]:
pd.unique(df['HBP'])

array([0, 1, 2, 3])

**Note:** These are all reasonable values at a glance.

We should also confirm that HBP values are never greater than the number of Plate Appearances within individual records.

In [531]:
records_pa_lt_hbp = df[df['PA'] < (df['HBP']) ]
print("\n-----------------------------------------")
print("All records with PA < HBP")
print("-----------------------------------------")
print(records_pa_lt_hbp)


-----------------------------------------
All records with PA < HBP
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**SO - Strikeouts**

Strikeouts should be an integer value, within the range of 0 up to the number of Plate Appearances.

In [532]:
print(pd.unique(df['SO']))

[0 2 1 3 4 5 6]


**Note:** These are all reasonable values at a glance.

We should also confirm that Strikeout values are never greater than the number of Plate Appearances or At Bats within individual records.

In [533]:
records_pa_lt_so = df[df['PA'] < (df['SO']) ]
print("\n-----------------------------------------")
print("All records with PA < SO")
print("-----------------------------------------")
print(records_pa_lt_so)

records_ab_lt_so = df[df['AB'] < (df['SO']) ]
print("\n-----------------------------------------")
print("All records with AB < SO")
print("-----------------------------------------")
print(records_ab_lt_so)


-----------------------------------------
All records with PA < SO
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with AB < SO
-----------------------------------------
                ID       Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  BB  \
1252967  frienow01  Owen Friend  SLB  NYY   2   0  1  0   0   0   0    0   1   

         SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score  Season  
1252967   1    0   1 NaN  NaN   NaN  NaN    8      L  9-11    1950  


**Note:** We found one record with a strikeout but no at bats -- which doesn't make sense.

In [534]:
owen_friend = 1252967
df.loc[[owen_friend]]

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,WPA,RE24,aLI,BOP,Result,Score,Season
1252967,frienow01,Owen Friend,SLB,NYY,2,0,1,0,0,0,0,0,1,1,0,1,,,,,8,L,9-11,1950


Looking closer, we can make a minor data adjustment for this record and assume that there should be one At Bat instead of 0. In this game, the player struck out once, which means there was at least one At Bat.

Note: we also see a Sacrifice Hit (`'SH'`) listed. But Sacrifice Hits and Sacrifice Flies (`'SF'`) do not count as At Bats. They do, however, count as Plate Appearances.

The strikeout demonstrates there must have been at least one At Bat. There was also one Walk (`'BB'`), which means there were at least three Plate Appearances: the Sacrifice Hit, the Walk, and the Strikeout.

The most reasonable guess here is to set PA = 3 and AB = 1.

In [535]:
df.at[owen_friend, 'PA'] = 3
df.at[owen_friend, 'AB'] = 1

df.loc[[owen_friend]]

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,WPA,RE24,aLI,BOP,Result,Score,Season
1252967,frienow01,Owen Friend,SLB,NYY,3,1,1,0,0,0,0,0,1,1,0,1,,,,,8,L,9-11,1950
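The reasoning behind this correction can be written down as a small calculation. The sketch below is an illustration only: the function name `minimum_pa_and_ab` is hypothetical, and the missing `'SF'` value for this record is treated as 0, which is an assumption.

```python
def minimum_pa_and_ab(so, h, bb, hbp, sh, sf):
    """Lower bounds implied by a single box-score line.

    Every strikeout and every hit is a separate At Bat, and every At Bat,
    walk, hit-by-pitch, and sacrifice is a separate Plate Appearance.
    """
    min_ab = so + h
    min_pa = min_ab + bb + hbp + sh + sf
    return min_pa, min_ab

# Owen Friend's record from above; the NaN in 'SF' is treated as 0 (assumption)
min_pa, min_ab = minimum_pa_and_ab(so=1, h=0, bb=1, hbp=0, sh=1, sf=0)
print(min_pa, min_ab)
```

These are only lower bounds; the true values could be higher, but nothing else in the record suggests additional At Bats or Plate Appearances.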


Re-Testing:

In [536]:
records_pa_lt_so = df[df['PA'] < (df['SO']) ]
print("\n-----------------------------------------")
print("All records with PA < SO")
print("-----------------------------------------")
print(records_pa_lt_so)

records_ab_lt_so = df[df['AB'] < (df['SO']) ]
print("\n-----------------------------------------")
print("All records with AB < SO")
print("-----------------------------------------")
print(records_ab_lt_so)


-----------------------------------------
All records with PA < SO
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []

-----------------------------------------
All records with AB < SO
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**SH - Sacrifice Hits**

Sacrifice Hits should be an integer value, within the range of 0 up to the number of Plate Appearances.

Note: Sacrifice plays do not count against the batter and, as such, don't count as At Bats.

References:
https://www.mlb.com/glossary/standard-stats/sacrifice-bunt
https://www.baseball-reference.com/bullpen/Sacrifice_hit

In [537]:
pd.unique(df['SH'])

array([0, 1, 2, 3, 4])

**Note:** These are all reasonable values at a glance.

We should also confirm that Sacrifice Hit values are never greater than the number of Plate Appearances within individual records.

In [538]:
records_pa_lt_sh = df[df['PA'] < (df['SH']) ]
print("\n-----------------------------------------")
print("All records with PA < SH")
print("-----------------------------------------")
print(records_pa_lt_sh)


-----------------------------------------
All records with PA < SH
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []


**SF - Sacrifice Flies**

Sacrifice Flies should be an integer value, within the range of 0 up to the number of Plate Appearances.

Note: Sacrifice plays do not count against the batter and, as such, don't count as At Bats.

Reference:
https://www.mlb.com/glossary/standard-stats/sacrifice-fly

In [539]:
pd.unique(df['SF'])

array([nan,  0.,  1.,  2.,  3.])

**Note:** We find that our Sacrifice Flies data is non-integer -- as with the RBI data processing, this is because there are NaN values present.

We will start with the same investigation as with RBI, and then decide how to handle the affected records.

In [540]:
sf_is_nan = df.loc[pd.isna(df['SF'])]
print(sf_is_nan)
print()

print("\nUnique Players with this SF-NaN problem:")
sf_nan_players = pd.unique(sf_is_nan['ID'])
print(sf_nan_players)
print("Count: ",len(sf_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(sf_is_nan)/len(df)*100)
print("There are",len(sf_nan_players),"players impacted by these",len(sf_is_nan),"records.")
print("")

                ID           Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
4        delahed01     Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0    0   
5        dolanjo02        Joe Dolan  PHI  BRO   5   5  0  1   0   0   0    1   
21       childcu01     Cupid Childs  CHC  STL   5   5  1  1   0   0   0    0   
22       crolifr01     Fred Crolius  BSN  NYG   4   4  0  0   0   0   0    1   
28       delahed01     Ed Delahanty  PHI  BRO   4   4  0  0   0   0   0    0   
...            ...              ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
1355415  wyrosjo01  Johnny Wyrostek  PHI  BRO   4   4  0  1   0   0   0    0   
1355416   yosted01       Eddie Yost  WSH  PHA   5   3  1  2   1   0   0    1   
1355417  youngbo01      Bobby Young  SLB  CHW   4   4  0  0   0   0   0    0   
1355418  zernigu01      Gus Zernial  PHA  WSH   5   4  2  3   0   0   0    3   
1670929  bankser01      Ernie Banks  CHC  LAD   1   1  0  0   0   0   0    0   

         BB  SO  HBP  SH  SF    WPA   R

We see here that a much larger percentage of our dataset is impacted (21%), so before removing all of these records and all of the player data associated, we need to consider the potential importance of the Sacrifice Flies feature for the project.

Removing the SF column itself is not an option if we want to calculate the OBP (On Base Percentage) statistic.

Instead of removing records, we will replace the NaN values with zeroes. This won't be accurate in all cases (since Sacrifice Flies haven't always been tracked): the affected records will underreport this statistic and, because SF appears only in the denominator of OBP, will have a slightly inflated OBP. But it is the best we can do without completely removing the data.

In [541]:
# Replace missing Sacrifice Fly counts with 0, then convert the column to integer
df['SF'] = df['SF'].fillna(0).astype(int)
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,WPA,RE24,aLI,BOP,Result,Score,Season
4,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,,,,3,L,7-12,1901
5,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,,,,7,L,7-12,1901
21,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,,,,1,W,8-7,1901
22,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,,,,6,W,7-0,1901
28,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,,,,3,L,2-10,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285624,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,-0.045,-0.413,1.530,9,L,2-3,2021
4285625,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,0.023,1.648,0.570,7,W,11-4,2021
4285626,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,0.065,0.828,0.442,6,W,6-0,2021
4285627,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,-0.051,-0.356,1.565,5,L,5-7,2021


In [542]:
pd.unique(df['SF'])

array([0, 1, 2, 3])

In [543]:
records_pa_lt_sf = df[df['PA'] < (df['SF']) ]
print("\n-----------------------------------------")
print("All records with PA < SF")
print("-----------------------------------------")
print(records_pa_lt_sf)


-----------------------------------------
All records with PA < SF
-----------------------------------------
Empty DataFrame
Columns: [ID, Player, Tm, Opp, PA, AB, R, H, 2B, 3B, HR, RBI, BB, SO, HBP, SH, SF, WPA, RE24, aLI, BOP, Result, Score, Season]
Index: []
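For comparison, the sketch below shows the zero-fill used for `'SF'` next to one possible alternative, pandas' nullable `Int64` dtype, which stores integers while keeping missing values marked as missing. The toy series is made up for illustration, and the notebook itself does not use the nullable dtype.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative data only)
sf = pd.Series([np.nan, 0.0, 1.0, 2.0], name="SF")

zero_filled = sf.fillna(0).astype(int)  # missing values become 0
nullable = sf.astype("Int64")           # missing values stay <NA>

print(zero_filled.tolist())
print(nullable.isna().sum())
```

The zero-fill keeps later arithmetic simple, while the nullable version preserves the fact that the value was never recorded; which is preferable depends on how the column is used downstream.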


**WPA - Win Probability Added**


Reference:
https://en.wikipedia.org/wiki/Win_probability_added

In [544]:
pd.unique(df['WPA'])

array([   nan,  0.03 , -0.042, ...,  0.974,  0.947, -0.655])

In [545]:
wpa_is_nan = df.loc[pd.isna(df['WPA'])]
print(wpa_is_nan)
print()

print("\nUnique Players with this WPA-NaN problem:")
wpa_nan_players = pd.unique(wpa_is_nan['ID'])
print(wpa_nan_players)
print("Count: ",len(wpa_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(wpa_is_nan)/len(df)*100)
print("There are",len(wpa_nan_players),"players impacted by these",len(wpa_is_nan),"records.")
print("")

                ID         Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
4        delahed01   Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0    0   
5        dolanjo02      Joe Dolan  PHI  BRO   5   5  0  1   0   0   0    1   
21       childcu01   Cupid Childs  CHC  STL   5   5  1  1   0   0   0    0   
22       crolifr01   Fred Crolius  BSN  NYG   4   4  0  0   0   0   0    1   
28       delahed01   Ed Delahanty  PHI  BRO   4   4  0  0   0   0   0    0   
...            ...            ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
1981210  raderdo02     Doug Rader  HOU  ATL   4   4  1  2   0   0   1    1   
1981254  watsobo01     Bob Watson  HOU  ATL   4   4  0  1   0   0   0    1   
1981257  williea02  Earl Williams  ATL  HOU   4   4  0  2   0   0   0    1   
1981259  wilsodo01     Don Wilson  HOU  ATL   3   2  0  0   0   0   0    0   
1981262   wynnji01       Jim Wynn  HOU  ATL   5   4  1  1   0   0   0    1   

         BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score 

**RE24 - Base-Out Runs Added**

This statistic represents the change in run expectancy across the 24 base-out states (eight base-runner configurations times three out counts).

Reference:
https://thebaseballscholar.com/2017/08/14/sabermetrics-101-re24/

In [546]:
pd.unique(df['RE24'])

array([   nan,  1.036, -0.138, ...,  5.928, -3.176,  5.142])

In [547]:
re24_is_nan = df.loc[pd.isna(df['RE24'])]
print(re24_is_nan)
print()

print("\nUnique Players with this RE24-NaN problem:")
re24_nan_players = pd.unique(re24_is_nan['ID'])
print(re24_nan_players)
print("Count: ",len(re24_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(re24_is_nan)/len(df)*100)
print("There are",len(re24_nan_players),"players impacted by these",len(re24_is_nan),"records.")
print("")

                ID         Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
4        delahed01   Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0    0   
5        dolanjo02      Joe Dolan  PHI  BRO   5   5  0  1   0   0   0    1   
21       childcu01   Cupid Childs  CHC  STL   5   5  1  1   0   0   0    0   
22       crolifr01   Fred Crolius  BSN  NYG   4   4  0  0   0   0   0    1   
28       delahed01   Ed Delahanty  PHI  BRO   4   4  0  0   0   0   0    0   
...            ...            ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
1981210  raderdo02     Doug Rader  HOU  ATL   4   4  1  2   0   0   1    1   
1981254  watsobo01     Bob Watson  HOU  ATL   4   4  0  1   0   0   0    1   
1981257  williea02  Earl Williams  ATL  HOU   4   4  0  2   0   0   0    1   
1981259  wilsodo01     Don Wilson  HOU  ATL   3   2  0  0   0   0   0    0   
1981262   wynnji01       Jim Wynn  HOU  ATL   5   4  1  1   0   0   0    1   

         BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score 

**aLI - Average Leverage Index**

Reference:
https://www.azsnakepit.com/2021/11/9/22742763/what-average-leverage-index-revealed-about-the-2021-diamondbacks

In [548]:
pd.unique(df['aLI'])

array([  nan, 0.14 , 1.187, ..., 4.373, 3.954, 7.45 ])

In [549]:
ali_is_nan = df.loc[pd.isna(df['aLI'])]
print(ali_is_nan)
print()

print("\nUnique Players with this aLI-NaN problem:")
ali_nan_players = pd.unique(ali_is_nan['ID'])
print(ali_nan_players)
print("Count: ",len(ali_nan_players))

print()
print("\nPercentage of overall records (",len(df),") impacted: ", len(ali_is_nan)/len(df)*100)
print("There are",len(ali_nan_players),"players impacted by these",len(ali_is_nan),"records.")
print("")

                ID         Player   Tm  Opp  PA  AB  R  H  2B  3B  HR  RBI  \
4        delahed01   Ed Delahanty  PHI  BRO   5   4  1  2   0   0   0    0   
5        dolanjo02      Joe Dolan  PHI  BRO   5   5  0  1   0   0   0    1   
21       childcu01   Cupid Childs  CHC  STL   5   5  1  1   0   0   0    0   
22       crolifr01   Fred Crolius  BSN  NYG   4   4  0  0   0   0   0    1   
28       delahed01   Ed Delahanty  PHI  BRO   4   4  0  0   0   0   0    0   
...            ...            ...  ...  ...  ..  .. .. ..  ..  ..  ..  ...   
1981210  raderdo02     Doug Rader  HOU  ATL   4   4  1  2   0   0   1    1   
1981254  watsobo01     Bob Watson  HOU  ATL   4   4  0  1   0   0   0    1   
1981257  williea02  Earl Williams  ATL  HOU   4   4  0  2   0   0   0    1   
1981259  wilsodo01     Don Wilson  HOU  ATL   3   2  0  0   0   0   0    0   
1981262   wynnji01       Jim Wynn  HOU  ATL   5   4  1  1   0   0   0    1   

         BB  SO  HBP  SH  SF  WPA  RE24  aLI  BOP Result Score 

**Note:** The WPA, RE24, and aLI statistics show the same pattern: the NaN listings for all three begin and end at the same records, so these statistics appear to be missing for the same range of games.
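One way to check whether these gaps line up with particular seasons is to compute the share of missing values per season. The following is a sketch on made-up toy data; applying it to the real data frame would mean substituting `df` for `toy`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "Season": [1901, 1901, 1950, 1950, 2021, 2021],
    "WPA": [np.nan, np.nan, np.nan, 0.03, -0.042, 0.974],
})

# Fraction of records with a missing WPA value in each season
missing_share = toy["WPA"].isna().groupby(toy["Season"]).mean()
print(missing_share)
```

A share that drops from 1.0 to 0.0 at some season would indicate the point from which the statistic was recorded.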

***A DECISION ... FOR WPA, RE24, and aLI STATISTICS***

After looking at all of this data, the nature of the statistics, and the goals of this project, the best decision is to remove these three features entirely. These three statistics are advanced enough to be beyond the scope of this project.

So, we will remove these columns from the data frame.

In [550]:
del df['WPA']
del df['RE24']
del df['aLI']

In [551]:
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,BOP,Result,Score,Season
4,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,3,L,7-12,1901
5,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,7,L,7-12,1901
21,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,1,W,8-7,1901
22,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,6,W,7-0,1901
28,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,3,L,2-10,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285624,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,9,L,2-3,2021
4285625,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,7,W,11-4,2021
4285626,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,6,W,6-0,2021
4285627,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,5,L,5-7,2021


Considering what remains, `'BOP'` (Batting Order Position) and the extracted `'Score'`, I do not foresee any use for either of these, so I will remove these columns as well.

At this time, I have decided to keep the `'Tm'` and `'Opp'` columns in the data for context but do not anticipate using them in any of the learning aspects of the project.

In [552]:
del df['BOP']
del df['Score']

In [553]:
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
4,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
5,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
21,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
22,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
28,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285624,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
4285625,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
4285626,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
4285627,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021


**Result**

I am intentionally keeping `'Result'` at this time, so that I have options to understand how many winning games a player participated in, although there is a good chance this level of nuance will end up being outside of the scope of the project.

These values are extracted strings and should be one of 3 values: "W", "L", or "T". (Note that "T" represents a tied game, which is relatively unusual in baseball.)

In [554]:
print(pd.unique(df['Result']))

['L' 'W' 'T']


We might decide later to convert this to numeric values, if it proves to be helpful.
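If a numeric version of `'Result'` does turn out to be useful, one possible conversion is a simple mapping. The encoding below (win = 1.0, tie = 0.5, loss = 0.0) is a hypothetical choice for illustration, and the series is toy data.

```python
import pandas as pd

# Toy values, made up for illustration
results = pd.Series(["L", "W", "T", "W"], name="Result")

# Hypothetical encoding: win = 1.0, tie = 0.5, loss = 0.0
result_codes = {"W": 1.0, "T": 0.5, "L": 0.0}
numeric_result = results.map(result_codes)

print(numeric_result.tolist())
```

Any value missing from the mapping would become NaN, which doubles as a check that only the three expected values are present.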

### Looking at Player Data

So far, we've looked at data from the perspective of each individual game played by an individual player. Now that the data is healthier than when we started, we want to look at individual players over their careers.

Keeping in mind that our goal is to find trends that might predict a batter's success in the future, we will want to be able to combine their statistics over time. It might also be useful to find ways to examine their data during parts of their careers, but that's not currently a focus.
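Should partial-career views become relevant, grouping by both player and season is one way to obtain them. This is a sketch on made-up toy data with hypothetical player IDs; only the `'ID'`, `'Season'`, `'AB'`, and `'H'` column names are taken from the notebook.

```python
import pandas as pd

# Toy game-level records (hypothetical IDs and values)
games = pd.DataFrame({
    "ID": ["a01", "a01", "a01", "b01"],
    "Season": [1950, 1950, 1951, 1950],
    "AB": [4, 3, 5, 4],
    "H": [1, 2, 0, 3],
})

# One row per player per season, with counting statistics summed
by_season = games.groupby(["ID", "Season"])[["AB", "H"]].sum()
print(by_season)
```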

In [555]:
print("There are",len(pd.unique(df['ID'])),"unique players remaining in this dataset.")

There are 14336 unique players remaining in this dataset.


In [979]:
columns = ['ID', 'Player','PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']
summables = ['PA','AB','R','H','2B','3B','HR','RBI','BB','SO','HBP','SH', 'SF']

# Keep only the identifying columns and the counting statistics
career = df[columns].copy()
career

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
4,delahed01,Ed Delahanty,5,4,1,2,0,0,0,0,1,0,0,0,0
5,dolanjo02,Joe Dolan,5,5,0,1,0,0,0,1,0,0,0,0,0
21,childcu01,Cupid Childs,5,5,1,1,0,0,0,0,0,0,0,0,0
22,crolifr01,Fred Crolius,4,4,0,0,0,0,0,1,0,0,0,0,0
28,delahed01,Ed Delahanty,4,4,0,0,0,0,0,0,0,2,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4285624,woodfja01,Jake Woodford,2,1,0,0,0,0,0,0,0,0,0,1,0
4285625,yastrmi01,Mike Yastrzemski,4,3,1,1,1,0,0,2,1,1,0,0,0
4285626,zimmebr01,Bradley Zimmer,4,4,1,2,0,0,0,1,0,0,0,0,0
4285627,zimmery01,Ryan Zimmerman,4,3,0,0,0,0,0,1,1,2,0,0,0


In [980]:
career = career.groupby(['ID', 'Player'])
career = career[summables].sum().copy()
career

Unnamed: 0_level_0,Unnamed: 1_level_0,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
ID,Player,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
aardsda01,David Aardsma,5,4,0,0,0,0,0,0,0,2,0,1,0
aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120
aaronto01,Tommie Aaron,1045,944,99,216,42,6,13,94,85,145,0,9,6
aasedo01,Don Aase,5,5,0,0,0,0,0,0,0,3,0,0,0
abadan01,Andy Abad,25,21,1,2,0,0,0,0,4,5,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuninmi01,Mike Zunino,2835,2559,308,518,111,5,141,345,198,981,58,8,12
zupcibo01,Bob Zupcic,886,795,93,199,47,4,7,80,57,137,6,20,8
zupofr01,Frank Zupo,8,7,1,2,1,0,0,0,1,2,0,0,0
zuvelpa01,Paul Zuvella,545,491,40,109,17,2,2,20,34,50,2,18,0


**Note:** Now that we've created career statistics records for every player, we want to remove any records with 0 At Bats. We know that these players will not be in the Hall of Fame, and removing them avoids division by zero when calculating rate statistics such as AVG later.

In [981]:
career = career[career['AB'] > 0].copy()
career

Unnamed: 0_level_0,Unnamed: 1_level_0,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF
ID,Player,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
aardsda01,David Aardsma,5,4,0,0,0,0,0,0,0,2,0,1,0
aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120
aaronto01,Tommie Aaron,1045,944,99,216,42,6,13,94,85,145,0,9,6
aasedo01,Don Aase,5,5,0,0,0,0,0,0,0,3,0,0,0
abadan01,Andy Abad,25,21,1,2,0,0,0,0,4,5,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuninmi01,Mike Zunino,2835,2559,308,518,111,5,141,345,198,981,58,8,12
zupcibo01,Bob Zupcic,886,795,93,199,47,4,7,80,57,137,6,20,8
zupofr01,Frank Zupo,8,7,1,2,1,0,0,0,1,2,0,0,0
zuvelpa01,Paul Zuvella,545,491,40,109,17,2,2,20,34,50,2,18,0


### Different Statistical Formulae

singles (int) = H - (2B + 3B + HR)

total_bases (int) = 1 * singles + 2 * 2B + 3 * 3B + 4 * HR


**AVG (float) = H / AB**

SLG (float) = total_bases / AB

OBP (float) = (H + BB + HBP)/(AB + BB + HBP + SF)

**OPS (float) = SLG + OBP**

Reference (useful for checking results against): https://en.wikipedia.org/wiki/On-base_plus_slugging

RC (float) = total_bases * ( (H + BB) / (AB + BB) )

ISO (float) = (1 * 2B + 2 * 3B + 3 * HR) / AB

PA/SO (float) = PA/SO
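As a worked example of the formulae that are listed here but not computed later in the notebook (RC, ISO, and PA/SO), the sketch below uses Henry Aaron's career totals from the career table above. The variable names and the zero-strikeout guard are assumptions for illustration.

```python
# Henry Aaron's career totals, copied from the career table above
PA, AB, H, doubles, triples, HR, BB, SO = 13666, 12121, 3703, 614, 96, 740, 1372, 1357

singles = H - (doubles + triples + HR)
total_bases = singles + 2 * doubles + 3 * triples + 4 * HR

avg = H / AB
slg = total_bases / AB
iso = (doubles + 2 * triples + 3 * HR) / AB
rc = total_bases * (H + BB) / (AB + BB)

# Guard against players with no strikeouts (assumption: report infinity)
pa_per_so = PA / SO if SO > 0 else float("inf")

print(total_bases, round(iso, 6), round(rc, 1), round(pa_per_so, 2))
```

A useful sanity check is that ISO equals SLG minus AVG, since both are built from the same hit totals.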

#### Simplification of the SLG formula

SLG = ( 1 * (H - (2B + 3B + HR)) + 2 * 2B + 3 * 3B + 4 * HR ) / AB

= ( H - 2B - 3B - HR + 2 * 2B + 3 * 3B + 4 * HR ) / AB

= ( H - 2B + 2 * 2B - 3B + 3 * 3B - HR + 4 * HR ) / AB

= ( H + 2B + 2 * 3B + 3 * HR ) / AB
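The simplification can be checked numerically. The sketch below compares the long and short forms of the total-bases numerator for three players whose career totals appear in the career table above; the tuple layout and variable names are assumptions for illustration.

```python
# (H, 2B, 3B, HR, AB) career totals copied from the career table above
players = {
    "Henry Aaron": (3703, 614, 96, 740, 12121),
    "Tommie Aaron": (216, 42, 6, 13, 944),
    "Mike Zunino": (518, 111, 5, 141, 2559),
}

slg_values = {}
for name, (h, d, t, hr, ab) in players.items():
    long_form = 1 * (h - (d + t + hr)) + 2 * d + 3 * t + 4 * hr
    short_form = h + d + 2 * t + 3 * hr
    assert long_form == short_form  # both forms count the same total bases
    slg_values[name] = short_form / ab

for name, slg in slg_values.items():
    print(name, round(slg, 6))
```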

In [982]:
career['AVG'] = career['H'] / (career['AB']*1.0)

career['SLG'] = (career['H'] + career['2B'] + 2*career['3B'] + 3*career['HR']) / (career['AB']*1.0)

career['OBP'] = (career['H'] + career['BB'] + career['HBP']) / ((career['AB'] + career['BB'] + career['HBP'] + career['SF'])*1.0) 

career['OPS'] = career['SLG'] + career['OBP']

career

Unnamed: 0_level_0,Unnamed: 1_level_0,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
ID,Player,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
aardsda01,David Aardsma,5,4,0,0,0,0,0,0,0,2,0,1,0,0.000000,0.000000,0.000000,0.000000
aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,0.305503,0.555152,0.374276,0.929429
aaronto01,Tommie Aaron,1045,944,99,216,42,6,13,94,85,145,0,9,6,0.228814,0.327331,0.290821,0.618152
aasedo01,Don Aase,5,5,0,0,0,0,0,0,0,3,0,0,0,0.000000,0.000000,0.000000,0.000000
abadan01,Andy Abad,25,21,1,2,0,0,0,0,4,5,0,0,0,0.095238,0.095238,0.240000,0.335238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuninmi01,Mike Zunino,2835,2559,308,518,111,5,141,345,198,981,58,8,12,0.202423,0.415006,0.273788,0.688794
zupcibo01,Bob Zupcic,886,795,93,199,47,4,7,80,57,137,6,20,8,0.250314,0.345912,0.302540,0.648452
zupofr01,Frank Zupo,8,7,1,2,1,0,0,0,1,2,0,0,0,0.285714,0.428571,0.375000,0.803571
zuvelpa01,Paul Zuvella,545,491,40,109,17,2,2,20,34,50,2,18,0,0.221996,0.276986,0.275142,0.552128


### Getting the Supplemental Hall of Fame Data

In [935]:
HoF_data = pd.read_csv('./data/halloffamers.csv')
HoF_data

Unnamed: 0,Year,Name,Unnamed: 2,Voted By,Inducted As,Votes,% of Ballots
0,2022,Bud Fowler\fowlebu99,1858-1913,Veterans,Pioneer/Executive,,
1,2022,Gil Hodges\hodgegi01,1924-1972,Veterans,Player,,
2,2022,Jim Kaat\kaatji01,1938-Living,Veterans,Player,,
3,2022,Minnie Minoso\minosmi01,1923-2015,Veterans,Player,,
4,2022,Buck O'Neil\oneilbu01,1911-2006,Veterans,Pioneer/Executive,,
...,...,...,...,...,...,...,...
335,1936,Ty Cobb\cobbty01,1886-1961,BBWAA,Player,222.0,98.2%
336,1936,Walter Johnson\johnswa01,1887-1946,BBWAA,Player,189.0,83.6%
337,1936,Christy Mathewson\mathech01,1880-1925,BBWAA,Player,205.0,90.7%
338,1936,Babe Ruth\ruthba01,1895-1948,BBWAA,Player,215.0,95.1%


In [936]:
HoF_data[['Player','ID']] = HoF_data['Name'].str.split('\\', expand=True)
pd.unique(HoF_data['Inducted As'])

array(['Pioneer/Executive', 'Player', 'Manager', 'Umpire'], dtype=object)
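The `'Name'` column combines the display name and the player ID, separated by a backslash. A minimal sketch of the split on toy data follows; the `n=1` argument is an added precaution (an assumption, not in the notebook) so that each value is split at most once.

```python
import pandas as pd

# Two values in the same "Name\ID" format as the Hall of Fame file
toy = pd.DataFrame({"Name": ["Babe Ruth\\ruthba01", "Ty Cobb\\cobbty01"]})

# Split once on the backslash: the left part is the name, the right part the ID
toy[["Player", "ID"]] = toy["Name"].str.split("\\", n=1, expand=True)

print(toy[["Player", "ID"]])
```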

In [969]:
cols = ['ID', 'Inducted As']
HoF_ids = HoF_data[cols].copy()
HoF_ids

Unnamed: 0,ID,Inducted As
0,fowlebu99,Pioneer/Executive
1,hodgegi01,Player
2,kaatji01,Player
3,minosmi01,Player
4,oneilbu01,Pioneer/Executive
...,...,...
335,cobbty01,Player
336,johnswa01,Player
337,mathech01,Player
338,ruthba01,Player


In [970]:
# Keep only inductees who were inducted as players
HoF_ids = HoF_ids[HoF_ids['Inducted As'] == 'Player'].copy()
HoF_ids

Unnamed: 0,ID,Inducted As
1,hodgegi01,Player
2,kaatji01,Player
3,minosmi01,Player
5,olivato01,Player
6,ortizda01,Player
...,...,...
335,cobbty01,Player
336,johnswa01,Player
337,mathech01,Player
338,ruthba01,Player


In [971]:
# Combine with features: mark every remaining Hall of Fame player as an inductee (1)
del HoF_ids['Inducted As']
HoF_ids['Inductee'] = 1
HoF_ids

Unnamed: 0,ID,Inductee
1,hodgegi01,1
2,kaatji01,1
3,minosmi01,1
5,olivato01,1
6,ortizda01,1
...,...,...
335,cobbty01,1
336,johnswa01,1
337,mathech01,1
338,ruthba01,1


In [983]:
HoF_ids

Unnamed: 0,ID,Inductee
1,hodgegi01,1
2,kaatji01,1
3,minosmi01,1
5,olivato01,1
6,ortizda01,1
...,...,...
335,cobbty01,1
336,johnswa01,1
337,mathech01,1
338,ruthba01,1


In [984]:
career = pd.merge(HoF_ids, career, on="ID", how="right")
career

Unnamed: 0,ID,Inductee,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,,5,4,0,0,0,0,0,0,0,2,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aaronha01,1.0,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,0.305503,0.555152,0.374276,0.929429
2,aaronto01,,1045,944,99,216,42,6,13,94,85,145,0,9,6,0.228814,0.327331,0.290821,0.618152
3,aasedo01,,5,5,0,0,0,0,0,0,0,3,0,0,0,0.000000,0.000000,0.000000,0.000000
4,abadan01,,25,21,1,2,0,0,0,0,4,5,0,0,0,0.095238,0.095238,0.240000,0.335238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14259,zuninmi01,,2835,2559,308,518,111,5,141,345,198,981,58,8,12,0.202423,0.415006,0.273788,0.688794
14260,zupcibo01,,886,795,93,199,47,4,7,80,57,137,6,20,8,0.250314,0.345912,0.302540,0.648452
14261,zupofr01,,8,7,1,2,1,0,0,0,1,2,0,0,0,0.285714,0.428571,0.375000,0.803571
14262,zuvelpa01,,545,491,40,109,17,2,2,20,34,50,2,18,0,0.221996,0.276986,0.275142,0.552128


In [985]:
# Players absent from the Hall of Fame list have NaN here; mark them as non-inductees (0)
career['Inductee'] = career['Inductee'].fillna(0).astype(int)
career

Unnamed: 0,ID,Inductee,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,0,5,4,0,0,0,0,0,0,0,2,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aaronha01,1,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,0.305503,0.555152,0.374276,0.929429
2,aaronto01,0,1045,944,99,216,42,6,13,94,85,145,0,9,6,0.228814,0.327331,0.290821,0.618152
3,aasedo01,0,5,5,0,0,0,0,0,0,0,3,0,0,0,0.000000,0.000000,0.000000,0.000000
4,abadan01,0,25,21,1,2,0,0,0,0,4,5,0,0,0,0.095238,0.095238,0.240000,0.335238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14259,zuninmi01,0,2835,2559,308,518,111,5,141,345,198,981,58,8,12,0.202423,0.415006,0.273788,0.688794
14260,zupcibo01,0,886,795,93,199,47,4,7,80,57,137,6,20,8,0.250314,0.345912,0.302540,0.648452
14261,zupofr01,0,8,7,1,2,1,0,0,0,1,2,0,0,0,0.285714,0.428571,0.375000,0.803571
14262,zuvelpa01,0,545,491,40,109,17,2,2,20,34,50,2,18,0,0.221996,0.276986,0.275142,0.552128
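The merge followed by the NaN fill produces a 0/1 label. An alternative sketch, shown on toy data, derives the same label with a membership test; unlike a merge, it cannot add rows if an ID happened to appear twice in the Hall of Fame list. The frame names `career_toy` and `hof_toy` are hypothetical.

```python
import pandas as pd

career_toy = pd.DataFrame({
    "ID": ["aardsda01", "aaronha01", "aaronto01"],
    "H": [0, 3703, 216],
})
# The repeated ID is deliberate, to show that duplicates do not add rows
hof_toy = pd.DataFrame({"ID": ["aaronha01", "ruthba01", "aaronha01"]})

# 1 if the player's ID appears in the Hall of Fame list, otherwise 0
career_toy["Inductee"] = career_toy["ID"].isin(hof_toy["ID"]).astype(int)

print(career_toy)
```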


In [986]:
name_cols = ['ID', 'Player']
# Build an ID-to-name lookup from the game-level data
names = df[name_cols].drop_duplicates(subset=name_cols, keep='first')
names

Unnamed: 0,ID,Player
4,delahed01,Ed Delahanty
5,dolanjo02,Joe Dolan
21,childcu01,Cupid Childs
22,crolifr01,Fred Crolius
30,demonge01,Gene DeMontreville
...,...,...
4285119,houckta01,Tanner Houck
4285170,minteaj01,A.J. Minter
4285298,adonjo01,Joan Adon
4285517,paynety01,Tyler Payne


In [987]:
career = pd.merge(names, career, on="ID", how="right")
career

Unnamed: 0,ID,Player,Inductee,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,AVG,SLG,OBP,OPS
0,aardsda01,David Aardsma,0,5,4,0,0,0,0,0,0,0,2,0,1,0,0.000000,0.000000,0.000000,0.000000
1,aaronha01,Henry Aaron,1,13666,12121,2128,3703,614,96,740,2243,1372,1357,32,21,120,0.305503,0.555152,0.374276,0.929429
2,aaronto01,Tommie Aaron,0,1045,944,99,216,42,6,13,94,85,145,0,9,6,0.228814,0.327331,0.290821,0.618152
3,aasedo01,Don Aase,0,5,5,0,0,0,0,0,0,0,3,0,0,0,0.000000,0.000000,0.000000,0.000000
4,abadan01,Andy Abad,0,25,21,1,2,0,0,0,0,4,5,0,0,0,0.095238,0.095238,0.240000,0.335238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14259,zuninmi01,Mike Zunino,0,2835,2559,308,518,111,5,141,345,198,981,58,8,12,0.202423,0.415006,0.273788,0.688794
14260,zupcibo01,Bob Zupcic,0,886,795,93,199,47,4,7,80,57,137,6,20,8,0.250314,0.345912,0.302540,0.648452
14261,zupofr01,Frank Zupo,0,8,7,1,2,1,0,0,0,1,2,0,0,0,0.285714,0.428571,0.375000,0.803571
14262,zuvelpa01,Paul Zuvella,0,545,491,40,109,17,2,2,20,34,50,2,18,0,0.221996,0.276986,0.275142,0.552128


In [989]:
career = career[['ID', 'Player', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'AVG', 'SLG', 'OBP', 'OPS', 'Inductee']]
career

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,AVG,SLG,OBP,OPS,Inductee
0,aardsda01,David Aardsma,5,4,0,0,0,0,0,0,0,2,0.000000,0.000000,0.000000,0.000000,0
1,aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,0.305503,0.555152,0.374276,0.929429,1
2,aaronto01,Tommie Aaron,1045,944,99,216,42,6,13,94,85,145,0.228814,0.327331,0.290821,0.618152,0
3,aasedo01,Don Aase,5,5,0,0,0,0,0,0,0,3,0.000000,0.000000,0.000000,0.000000,0
4,abadan01,Andy Abad,25,21,1,2,0,0,0,0,4,5,0.095238,0.095238,0.240000,0.335238,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14259,zuninmi01,Mike Zunino,2835,2559,308,518,111,5,141,345,198,981,0.202423,0.415006,0.273788,0.688794,0
14260,zupcibo01,Bob Zupcic,886,795,93,199,47,4,7,80,57,137,0.250314,0.345912,0.302540,0.648452,0
14261,zupofr01,Frank Zupo,8,7,1,2,1,0,0,0,1,2,0.285714,0.428571,0.375000,0.803571,0
14262,zuvelpa01,Paul Zuvella,545,491,40,109,17,2,2,20,34,50,0.221996,0.276986,0.275142,0.552128,0


In [993]:
career[career['Inductee'] == 1 ]

Unnamed: 0,ID,Player,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,AVG,SLG,OBP,OPS,Inductee
1,aaronha01,Henry Aaron,13666,12121,2128,3703,614,96,740,2243,1372,1357,0.305503,0.555152,0.374276,0.929429,1
197,alomaro01,Roberto Alomar,10400,9073,1507,2724,504,80,210,1134,1032,1140,0.300231,0.442852,0.371245,0.814097,1
325,aparilu01,Luis Aparicio,10972,10003,1294,2610,383,91,80,770,713,718,0.260922,0.341398,0.309698,0.651095,1
331,applilu01,Luke Appling,10254,8856,1317,2750,441,101,45,1110,1313,531,0.310524,0.398374,0.400196,0.798570,1
395,ashburi01,Richie Ashburn,9442,8103,1286,2492,305,106,29,577,1169,553,0.307540,0.382081,0.396700,0.778780,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13903,wilsoha01,Hack Wilson,5555,4760,885,1460,266,67,244,1063,672,714,0.306723,0.544538,0.394718,0.939255,1
13949,winfida01,Dave Winfield,12358,11003,1669,3110,540,88,465,1833,1216,1686,0.282650,0.474507,0.352622,0.827129,1
14107,wynnea01,Early Wynn,1869,1675,135,361,59,5,17,174,138,323,0.215522,0.287164,0.275426,0.562590,1
14128,yastrca01,Carl Yastrzemski,13992,11988,1816,3419,646,59,452,1844,1845,1395,0.285202,0.462045,0.379453,0.841499,1


In [998]:
career.isnull().values.any()


False

### Converting to X and y

In [999]:
y = career[career.columns[-1]]
y = y.values
y

array([0, 1, 0, ..., 0, 0, 0])

In [1006]:
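# Keep only the numeric counting statistics (PA through SO) as features.
# This drops the string-based fields (ID, Player), the four rate statistics
# (AVG, SLG, OBP, OPS), and the HOF inductee column.
num = career.shape[1]
# Alternative that also keeps the rate statistics:
# X = career[career.columns[2:num-1]]

X = career[career.columns[2:num-5]]
X = X.values
X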


array([[    5,     4,     0, ...,     0,     0,     2],
       [13666, 12121,  2128, ...,  2243,  1372,  1357],
       [ 1045,   944,    99, ...,    94,    85,   145],
       ...,
       [    8,     7,     1, ...,     0,     1,     2],
       [  545,   491,    40, ...,    20,    34,    50],
       [  156,   133,     4, ...,     7,     7,    36]])
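
The positional slice `career.columns[2:num-5]` keeps the ten counting statistics from `PA` through `SO`. A minimal sketch of the same selection by column name, which may be easier to read and is not affected by column order. It uses a small stand-in frame built from three rows of the table above; `career_demo` and `feature_cols` are names introduced here for illustration only.

```python
import pandas as pd

# Stand-in for `career`: three rows copied from the table above.
columns = ['ID', 'Player', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI',
           'BB', 'SO', 'AVG', 'SLG', 'OBP', 'OPS', 'Inductee']
rows = [
    ['aardsda01', 'David Aardsma', 5, 4, 0, 0, 0, 0, 0, 0, 0, 2,
     0.000000, 0.000000, 0.000000, 0.000000, 0],
    ['aaronha01', 'Henry Aaron', 13666, 12121, 2128, 3703, 614, 96, 740, 2243, 1372, 1357,
     0.305503, 0.555152, 0.374276, 0.929429, 1],
    ['aaronto01', 'Tommie Aaron', 1045, 944, 99, 216, 42, 6, 13, 94, 85, 145,
     0.228814, 0.327331, 0.290821, 0.618152, 0],
]
career_demo = pd.DataFrame(rows, columns=columns)

# Hypothetical explicit feature list: the counting statistics only.
feature_cols = ['PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO']

X_demo = career_demo[feature_cols].to_numpy()
y_demo = career_demo['Inductee'].to_numpy()

# The named selection matches the positional slice used in the notebook.
num = career_demo.shape[1]
positional = career_demo[career_demo.columns[2:num - 5]].to_numpy()
assert (X_demo == positional).all()
```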

### Visualization(s)

TODO: Before-and-after bar graphs?

## Model Selection

### Logistic Regression

In [1007]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

# Split data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20)

# Import logistic regression 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Build a logistic regression object
LogReg = LogisticRegression(solver = 'newton-cg')

# Train(fit) model
LogReg.fit(X_train, y_train)

# Predict class labels (0 or 1) for the training and test sets
y_train_predict = LogReg.predict(X_train)
y_test_predict = LogReg.predict(X_test)

# Report log loss (computed here from the hard 0/1 predictions returned by predict)
print("The performance of the model")
print("--------------------------------------")
print('Log loss of the model for training set: %.3f' % log_loss(y_train,y_train_predict))
print('Log loss of the model for test set: %.3f' % log_loss(y_test,y_test_predict))

The performance of the model
--------------------------------------
Log loss of the model for training set: 0.303
Log loss of the model for test set: 0.363
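
Two details of the cell above are worth a second look. `log_loss` is defined on predicted probabilities, while the cell passes it the hard 0/1 labels from `predict`. The scaler is also fitted on all of `X` before the split, so the test rows influence the scaling. A minimal sketch of one possible alternative, run on synthetic stand-in data rather than the career table: the names `X_demo`, `y_demo`, and `model`, the `make_classification` settings, and the use of `stratify` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the career table (NOT the real data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=20,
)

# Split first, so the scaler never sees the test rows.
# stratify keeps the rare positive class present in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo,
)

# The pipeline fits StandardScaler on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression(solver='newton-cg'))
model.fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_te = model.predict_proba(X_te)[:, 1]
hard_te = model.predict(X_te)

loss_from_proba = log_loss(y_te, proba_te)
loss_from_labels = log_loss(y_te, hard_te)
print('Log loss from probabilities: %.3f' % loss_from_proba)
print('Log loss from hard labels:   %.3f' % loss_from_labels)
```

The probability-based figure reflects how confident the model was in each prediction. The label-based figure only counts errors, each at the maximum penalty.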


In [1009]:
# Report classification metrics for the training and test sets
print("The performance of the model:")
print("------------------------------")

# Confusion Matrix
from sklearn.metrics import confusion_matrix
print('Confusion matrix for training set')
print(confusion_matrix(y_train,y_train_predict))

print('Confusion matrix for test set')
print(confusion_matrix(y_test, y_test_predict),'\n')

# Accuracy Score
from sklearn.metrics import accuracy_score
print('Accuracy of the model for training set: %.3f' % accuracy_score(y_train,y_train_predict))
print('Accuracy of the model for test set: %.3f\n' % accuracy_score(y_test,y_test_predict))

# Precision score
from sklearn.metrics import precision_score
print('Precision of the model for training set: %.3f' % precision_score(y_train,y_train_predict))
print('Precision of the model for test set: %.3f\n' % precision_score(y_test,y_test_predict))

# Recall score
from sklearn.metrics import recall_score
print('Recall of the model for training set: %.3f' % recall_score(y_train,y_train_predict))
print('Recall of the model for test set: %.3f\n' % recall_score(y_test,y_test_predict))

# F1 Score
from sklearn.metrics import f1_score
print('F1-measure of the model for training set: %.3f' % f1_score(y_train,y_train_predict))
print('F1-measure of the model for test set: %.3f\n' % f1_score(y_test,y_test_predict))

# AUC score (computed from the hard 0/1 predictions, i.e. a single threshold)
from sklearn.metrics import roc_auc_score
print('AUC score of the model for training set: %.3f' % roc_auc_score(y_train,y_train_predict))
print('AUC score of the model for test set: %.3f\n' % roc_auc_score(y_test,y_test_predict))

# Log loss (also computed from the hard 0/1 predictions)
print('Log loss of the model for training set: %.3f' % log_loss(y_train,y_train_predict))
print('Log loss of the model for test set: %.3f\n' % log_loss(y_test,y_test_predict))


The performance of the model:
------------------------------
Confusion matrix for training set
[[11260     9]
 [   91    51]]
Confusion matrix for test set
[[2817    2]
 [  28    6]] 

Accuracy of the model for training set: 0.991
Accuracy of the model for test set: 0.989

Precision of the model for training set: 0.850
Precision of the model for test set: 0.750

Recall of the model for training set: 0.359
Recall of the model for test set: 0.176

F1-measure of the model for training set: 0.505
F1-measure of the model for test set: 0.286

AUC score of the model for training set: 0.679
AUC score of the model for test set: 0.588

Log loss of the model for training set: 0.303
Log loss of the model for test set: 0.363
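
The test-set scores can be reproduced by hand from the test confusion matrix, which helps in reading them. A worked sketch, assuming scikit-learn's layout (rows are true classes, columns are predicted classes) and assuming `log_loss` clipped the 0/1 predictions with `eps = 1e-15`. That value is an assumption about the scikit-learn version used.

```python
import math

# Test-set confusion matrix reported above.
tn, fp = 2817, 2
fn, tp = 28, 6
n = tn + fp + fn + tp

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single operating point,
# so the area under it reduces to the mean of sensitivity and specificity.
specificity = tn / (tn + fp)
auc_hard = (recall + specificity) / 2

# Assumption: predictions are clipped to [eps, 1 - eps] before taking logs,
# so every wrong label costs -log(eps) and every right one costs almost nothing.
eps = 1e-15
errors = fp + fn
log_loss_hard = (errors * -math.log(eps) + (n - errors) * -math.log(1 - eps)) / n

# Accuracy of always predicting the majority class (not inducted).
baseline_accuracy = (tn + fp) / n

print('%.3f %.3f %.3f %.3f %.3f %.3f'
      % (accuracy, precision, recall, f1, auc_hard, log_loss_hard))
print('Majority-class baseline accuracy: %.3f' % baseline_accuracy)
```

Under those assumptions the first line prints `0.989 0.750 0.176 0.286 0.588 0.363`, matching the test-set figures above. The second line prints a baseline accuracy of 0.988, so accuracy alone says little on data this imbalanced; recall and F1 are the more informative numbers here.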



### Visualization(s)

## Model Evaluation

### Visualization(s)

## Concluding Comments