### Research Question

In [1]:
import pandas as pd
import numpy as np

Our first step in cleaning the data was to import the eight .csv files that we downloaded from fangraphs.com. On FanGraphs, we requested team data for the years 2006-2019 and downloaded .csv files of the hitting, pitching, and fielding statistics from the dashboard, standard, and advanced categories (fielding has only dashboard and advanced files). We created a dataframe for each file so that we could clean the data in Python.

In [2]:
hitting_dashboard = pd.read_csv('hitting_dashboard.csv')
hitting_standard = pd.read_csv('hitting_standard.csv')
hitting_advanced = pd.read_csv('hitting_advanced.csv')
pitching_dashboard = pd.read_csv('pitching_dashboard.csv')
pitching_standard = pd.read_csv('pitching_standard.csv')
pitching_advanced = pd.read_csv('pitching_advanced.csv')
fielding_dashboard = pd.read_csv('fielding_dashboard.csv')
fielding_advanced = pd.read_csv('fielding_advanced.csv')

Next we displayed our column headers for the hitting .csv files so that we could eliminate any columns we don't want to use.

In [3]:
display(hitting_dashboard.columns)
display(hitting_standard.columns)
display(hitting_advanced.columns)

Index(['Season', 'Team', 'G', 'PA', 'HR', 'R', 'RBI', 'SB', 'BB%', 'K%', 'ISO',
       'BABIP', 'AVG', 'OBP', 'SLG', 'wOBA', 'wRC+', 'EV', 'BsR', 'Off', 'Def',
       'WAR'],
      dtype='object')

Index(['Season', 'Team', 'G', 'AB', 'PA', 'H', '1B', '2B', '3B', 'HR', 'R',
       'RBI', 'BB', 'IBB', 'SO', 'HBP', 'SF', 'SH', 'GDP', 'SB', 'CS', 'AVG'],
      dtype='object')

Index(['Season', 'Team', 'PA', 'BB%', 'K%', 'BB/K', 'AVG', 'OBP', 'SLG', 'OPS',
       'ISO', 'Spd', 'BABIP', 'UBR', 'wGDP', 'wSB', 'wRC', 'wRAA', 'wOBA',
       'wRC+'],
      dtype='object')

Then we selected which columns we wanted to include from the three hitting dataframes we created from the FanGraphs data. FanGraphs includes a lot of statistics in its datasheets, but we are only going to look at some of them in our analysis. Additionally, there are some redundancies across these dataframes (like 'PA'), so we also wanted to eliminate those. All of the dataframes still include the 'Season' and 'Team' columns so that we can combine them later.

In [4]:
hitting_dashboard = hitting_dashboard.drop(columns=['G', 'SB', 'BABIP', 'EV', 'BsR', 'Off', 'Def'])
hitting_standard = hitting_standard[['Season', 'Team', 'H']]
hitting_advanced = hitting_advanced[['Season', 'Team', 'OPS']]

Here we displayed pitching columns, so that we could do the same thing with the pitching dataframes that we did with the hitting dataframes.

In [5]:
display(pitching_dashboard.columns)
display(pitching_standard.columns)
display(pitching_advanced.columns)

Index(['Season', 'Team', 'W', 'L', 'SV', 'G', 'GS', 'IP', 'K/9', 'BB/9',
       'HR/9', 'BABIP', 'LOB%', 'GB%', 'HR/FB', 'EV', 'ERA', 'FIP', 'xFIP',
       'WAR'],
      dtype='object')

Index(['Season', 'Team', 'W', 'L', 'ERA', 'G', 'GS', 'CG', 'ShO', 'SV', 'HLD',
       'BS', 'IP', 'TBF', 'H', 'R', 'ER', 'HR', 'BB', 'IBB', 'HBP', 'WP', 'BK',
       'SO'],
      dtype='object')

Index(['Season', 'Team', 'K/9', 'BB/9', 'K/BB', 'HR/9', 'K%', 'BB%', 'K-BB%',
       'AVG', 'WHIP', 'BABIP', 'LOB%', 'ERA-', 'FIP-', 'xFIP-', 'ERA', 'FIP',
       'E-F', 'xFIP', 'SIERA'],
      dtype='object')

We selected which columns we wanted to include from the three pitching dataframes we created from the FanGraphs data. As with hitting, we are only going to look at some of the available statistics in our analysis. Additionally, there are some redundancies across these dataframes (like 'ERA'), so we also wanted to eliminate those. All of the dataframes still include the 'Season' and 'Team' columns so that we can combine them later.

In [6]:
pitching_dashboard = pitching_dashboard.drop(columns=['G', 'K/9', 'BB/9', 'HR/9', 'BABIP', 'LOB%', 'GB%', 'HR/FB', 'EV','xFIP'])
pitching_standard = pitching_standard[['Season', 'Team', 'SV', 'IP', 'H', 'R', 'ER', 'HR', 'BB', 'SO']]
pitching_advanced = pitching_advanced[['Season', 'Team', 'K%', 'BB%', 'WHIP']]

Finally, we displayed the column names for our fielding dataframes so that we could limit our fielding statistics.

In [7]:
display(fielding_dashboard.columns)
display(fielding_advanced.columns)

Index(['Season', 'Team', 'G', 'GS', 'Inn', 'PO', 'A', 'E', 'FE', 'TE', 'DP',
       'DPS', 'DPT', 'DPF', 'Scp', 'SB', 'CS', 'PB', 'WP', 'FP', 'TZ'],
      dtype='object')

Index(['Season', 'Team', 'Inn', 'rSZ', 'rCERA', 'rSB', 'rGDP', 'rARM', 'rGFP',
       'rPM', 'rTS', 'DRS', 'BIZ', 'Plays', 'RZR', 'OOZ', 'FSR', 'FRM', 'ARM',
       'DPR', 'RngR', 'ErrR', 'UZR', 'UZR/150', 'Def'],
      dtype='object')

We selected which columns we wanted to include from the two fielding dataframes we created from the FanGraphs data. Narrowing the columns was especially important for fielding, since many fielding statistics are closely related and some reflect overall team defense better than others. There are some redundancies across these dataframes, although none of the data we wanted was repeated in this case. Both of the dataframes still include the 'Season' and 'Team' columns so that we can combine them later.

In [8]:
fielding_dashboard = fielding_dashboard[['Season', 'Team', 'FP']]
fielding_advanced = fielding_advanced[['Season', 'Team', 'DRS', 'UZR']]

FanGraphs uses different labels in the 'Team' column of its fielding data (our fielding_dashboard and fielding_advanced dataframes): team nicknames rather than the conventional three-letter abbreviations used in the other dataframes. We had to convert these names to the same abbreviations so that teams are identified consistently. We did this by creating a dictionary with the needed changes for all 30 teams and then passing it to the dataframe's .replace() method.

In [9]:
team_name_dictionary = {'Angels': 'LAA', 'Braves': 'ATL', 'Astros': 'HOU', 'Athletics': 'OAK', 'Blue Jays': 'TOR', 'Royals': 'KCR', 'Tigers': 'DET', 
                       'Twins': 'MIN', 'White Sox': 'CHW', 'Yankees': 'NYY', 'Brewers': 'MIL', 'Cardinals': 'STL', 'Cubs': 'CHC', 'Devil Rays': 'TBR',
                       'Diamondbacks': 'ARI', 'Dodgers': 'LAD', 'Giants': 'SFG', 'Indians': 'CLE', 'Mariners': 'SEA', 'Marlins': 'MIA', 'Mets': 'NYM',
                       'Nationals': 'WSN', 'Orioles': 'BAL', 'Padres': 'SDP', 'Phillies': 'PHI', 'Pirates': 'PIT', 'Rangers': 'TEX', 'Red Sox': 'BOS',
                       'Reds': 'CIN', 'Rockies': 'COL', 'Rays': 'TBR'}
# A flat {old: new} dictionary replaces matching values wherever they appear,
# so no separate value argument is needed
fielding_dashboard = fielding_dashboard.replace(team_name_dictionary)
fielding_advanced = fielding_advanced.replace(team_name_dictionary)
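
A quick way to confirm that a mapping like this covered every team is to look for values that are still not one of the target abbreviations. The sketch below uses a small made-up dataframe and a shortened mapping (`toy_fielding`, `toy_mapping`, and the 'Expos' row are hypothetical and are not part of our data); it only illustrates the check.

```python
import pandas as pd

# Hypothetical toy data standing in for one of the fielding dataframes
toy_fielding = pd.DataFrame({
    'Season': [2006, 2006, 2006],
    'Team': ['Angels', 'Devil Rays', 'Expos'],  # 'Expos' is deliberately missing from the mapping
    'FP': [0.983, 0.981, 0.980],                # made-up values
})
toy_mapping = {'Angels': 'LAA', 'Devil Rays': 'TBR'}

# Replace nicknames with abbreviations in the 'Team' column only
toy_fielding['Team'] = toy_fielding['Team'].replace(toy_mapping)

# Anything that is not a known abbreviation was left unchanged by the mapping
unmapped = sorted(set(toy_fielding['Team']) - set(toy_mapping.values()))
print(unmapped)  # ['Expos']
```

An empty list would mean that every name in the 'Team' column was converted.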

There was also an inconsistency in the way two teams were named because they rebranded during the time frame that we are using. The Miami Marlins (MIA) used to be the Florida Marlins (FLA) and the Tampa Bay Rays (TBR) used to be the Tampa Bay Devil Rays (TBD). These are the same franchises, so they should have the same name in our data. We already resolved this in the fielding data through the way we constructed our dictionary, but we still had to resolve these inconsistencies in our other dataframes.

In [10]:
florida_renames = {'FLA': 'MIA', 'TBD': 'TBR'}
# Map the old franchise abbreviations to the current ones in every hitting and pitching dataframe
hitting_dashboard = hitting_dashboard.replace(florida_renames)
hitting_standard = hitting_standard.replace(florida_renames)
hitting_advanced = hitting_advanced.replace(florida_renames)
pitching_dashboard = pitching_dashboard.replace(florida_renames)
pitching_standard = pitching_standard.replace(florida_renames)
pitching_advanced = pitching_advanced.replace(florida_renames)

In order to combine all of our dataframes, we needed to make sure that all of the data was in the same order so we could concatenate across and include the proper data for each team in its respective row. To do this, we sorted each dataframe by season (chronologically) and then by team (alphabetically). We then reset the index and discarded the old one to ensure that we didn't have any unnecessary columns and that all of our data was in order.

In [11]:
hitting_dashboard = hitting_dashboard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
hitting_standard = hitting_standard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
hitting_advanced = hitting_advanced.sort_values(by=['Season', 'Team']).reset_index(drop=True)
pitching_dashboard = pitching_dashboard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
pitching_standard = pitching_standard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
pitching_advanced = pitching_advanced.sort_values(by=['Season', 'Team']).reset_index(drop=True)
fielding_dashboard = fielding_dashboard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
fielding_advanced = fielding_advanced.sort_values(by=['Season', 'Team']).reset_index(drop=True)
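
Because the later concatenation lines rows up by position, it can be worth confirming that the 'Season' and 'Team' columns agree row for row after sorting. The following is a minimal sketch on made-up data; `left`, `right`, and the helper `sort_by_keys` are hypothetical names used only for illustration.

```python
import pandas as pd

# Two hypothetical dataframes holding the same teams and seasons in different row orders
left = pd.DataFrame({'Season': [2007, 2006, 2006],
                     'Team': ['ATL', 'ATL', 'ARI'],
                     'H': [3, 2, 1]})
right = pd.DataFrame({'Season': [2006, 2007, 2006],
                      'Team': ['ARI', 'ATL', 'ATL'],
                      'OPS': [0.7, 0.9, 0.8]})

def sort_by_keys(df):
    """Sort by season, then team, and renumber the rows from 0."""
    return df.sort_values(by=['Season', 'Team']).reset_index(drop=True)

left_sorted = sort_by_keys(left)
right_sorted = sort_by_keys(right)

# True only if both dataframes list the same (Season, Team) pairs in the same order
keys_match = left_sorted[['Season', 'Team']].equals(right_sorted[['Season', 'Team']])
print(keys_match)  # True
```

If this check came back False for any pair of dataframes, concatenating them side by side would put one team's statistics in another team's row.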

Now that our dataframes were properly sorted, we were able to concatenate the data within each category. We will combine everything later, but to keep track of the columns in our subsequent renaming step we decided to keep the hitting, pitching, and fielding data separate for this part. In order to combine each category, we used the pandas .concat() function and specified axis=1 in order to "smoosh" together the columns. We had used the season and team data to make sure everything was in the same order, but after the concatenation those columns were duplicated, so we eliminated the repeats with the .duplicated() method.

In [12]:
hitting = pd.concat([hitting_dashboard, hitting_standard, hitting_advanced], axis=1)
hitting = hitting.loc[:,~hitting.columns.duplicated()]

pitching = pd.concat([pitching_dashboard, pitching_standard, pitching_advanced], axis=1)
pitching = pitching.loc[:,~pitching.columns.duplicated()]

fielding = pd.concat([fielding_dashboard, fielding_advanced], axis=1)
fielding = fielding.loc[:,~fielding.columns.duplicated()]
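
To make the de-duplication step concrete, here is a small sketch on made-up data (`toy_a` and `toy_b` are hypothetical) showing what `columns.duplicated()` flags after a column-wise concatenation.

```python
import pandas as pd

# Two hypothetical dataframes that share the 'Season' and 'Team' columns
toy_a = pd.DataFrame({'Season': [2006, 2006], 'Team': ['ARI', 'ATL'], 'H': [1, 2]})
toy_b = pd.DataFrame({'Season': [2006, 2006], 'Team': ['ARI', 'ATL'], 'OPS': [0.7, 0.8]})

# axis=1 places the columns side by side, so the shared columns appear twice
side_by_side = pd.concat([toy_a, toy_b], axis=1)
print(list(side_by_side.columns))  # ['Season', 'Team', 'H', 'Season', 'Team', 'OPS']

# duplicated() marks every repeat of a column name after its first occurrence
dup_mask = side_by_side.columns.duplicated()
print(list(dup_mask))  # [False, False, False, True, True, False]

# Keep only the columns that are not repeats
deduped = side_by_side.loc[:, ~dup_mask]
print(list(deduped.columns))  # ['Season', 'Team', 'H', 'OPS']
```

Note that this keeps the first copy of each repeated name without comparing the values, which is why the rows need to be in the same order beforehand.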

In [13]:
display(hitting.columns)
display(pitching.columns)
display(fielding.columns)

Index(['Season', 'Team', 'PA', 'HR', 'R', 'RBI', 'BB%', 'K%', 'ISO', 'AVG',
       'OBP', 'SLG', 'wOBA', 'wRC+', 'WAR', 'H', 'OPS'],
      dtype='object')

Index(['Season', 'Team', 'W', 'L', 'SV', 'GS', 'IP', 'ERA', 'FIP', 'WAR', 'H',
       'R', 'ER', 'HR', 'BB', 'SO', 'K%', 'BB%', 'WHIP'],
      dtype='object')

Index(['Season', 'Team', 'FP', 'DRS', 'UZR'], dtype='object')

Once we made the larger hitting, pitching, and fielding dataframes, we wanted to rename our columns so that they would be easier to work with later on. The first step for all of our new dataframes was to make the column names lowercase, which we did with colname.lower(). Then we renamed certain columns that had ambiguous titles, could be confused between hitting and pitching stats, or had problematic characters like % and +. For overlapping statistics (like 'H'), we used names such as hit_hits and pitch_hits to clarify which side those hits belonged to.

In [14]:
# Lowercase every column name, then rename ambiguous or awkward hitting columns
hitting_lower = [colname.lower() for colname in hitting.columns]
hitting.columns = hitting_lower
hitting = hitting.rename(columns = {'hr':'hit_hr', 'r':'runs_scored', 'bb%':'hit_bb_rate', 'k%':'hit_k_rate', 'avg':'bat_avg', 'wrc+':'wrc_plus', 'war':'hit_fwar', 'h':'hit_hits'})

# Same for pitching, using pitch_ prefixes where a name overlaps with a hitting column
pitching_lower = [colname.lower() for colname in pitching.columns]
pitching.columns = pitching_lower
pitching = pitching.rename(columns = {'w':'wins', 'l':'losses', 'sv':'saves', 'gs':'games', 'war':'pitch_fwar', 'h':'pitch_hits', 'r':'runs_allowed', 'hr': 'pitch_hr', 'k%':'pitch_k_rate', 'bb%':'pitch_bb_rate', 'bb':'pitch_bb','so':'pitch_so'})

# The fielding column names only need to be lowercased
fielding_lower = [colname.lower() for colname in fielding.columns]
fielding.columns = fielding_lower
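
One way to check that the renaming removed every hitting/pitching name clash is to intersect the two sets of column names. This is a sketch on empty, made-up dataframes; `toy_hitting`, `toy_pitching`, and the helper `clean_columns` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical dataframes with only column names (no rows are needed for this check)
toy_hitting = pd.DataFrame(columns=['Season', 'Team', 'HR', 'BB%', 'wRC+'])
toy_pitching = pd.DataFrame(columns=['Season', 'Team', 'HR', 'BB%', 'ERA'])

def clean_columns(df, renames):
    """Return a copy with lowercase column names and the given renames applied."""
    df = df.copy()
    df.columns = [colname.lower() for colname in df.columns]
    return df.rename(columns=renames)

toy_hitting = clean_columns(toy_hitting, {'hr': 'hit_hr', 'bb%': 'hit_bb_rate', 'wrc+': 'wrc_plus'})
toy_pitching = clean_columns(toy_pitching, {'hr': 'pitch_hr', 'bb%': 'pitch_bb_rate'})

# After renaming, only the key columns should appear in both dataframes
shared = sorted(set(toy_hitting.columns) & set(toy_pitching.columns))
print(shared)  # ['season', 'team']
```

Any other name in `shared` would be silently dropped from one side when duplicated columns are removed after concatenation.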

Finally we were ready to concatenate all of the data. We followed a similar process to when we concatenated all of the hitting/pitching/fielding data separately, but this time we made one big dataframe with all data we could want to access. We did a similar process to eliminate duplicated columns (team and season).

In [15]:
baseball_data = pd.concat([hitting, pitching, fielding], axis=1)
baseball_data = baseball_data.loc[:,~baseball_data.columns.duplicated()]
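
For comparison, a key-based merge is a common alternative to positional concatenation. It is not what we used above, but it matches rows on the season and team values themselves, so it does not depend on the row order. A minimal sketch on made-up data (`toy_a` and `toy_b` are hypothetical):

```python
import pandas as pd

# Hypothetical dataframes whose rows are in different orders
toy_a = pd.DataFrame({'season': [2006, 2006], 'team': ['ARI', 'ATL'], 'hit_hits': [1, 2]})
toy_b = pd.DataFrame({'season': [2006, 2006], 'team': ['ATL', 'ARI'], 'whip': [0.8, 0.7]})

# Match rows on (season, team); validate raises an error if a key repeats on either side
merged = toy_a.merge(toy_b, on=['season', 'team'], how='inner', validate='one_to_one')
print(list(merged.columns))  # ['season', 'team', 'hit_hits', 'whip']
```

With this approach the key columns appear only once in the result, so no de-duplication step is needed.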

Now we have to make sure our data is in a format that we can work with. Some of our data was stored as objects holding percent strings, but we want to be able to manipulate those values as floats. Below we converted these columns to floats; this applied to all of the columns with 'rate' in the title (e.g. 'pitch_k_rate'). To do this we treated the values as strings, stripped the % sign, converted the result to a float, and then divided by 100 (since we want a decimal representing a rate). ***NEED TO FIGURE OUT IF WE NEED TO USE DATETIME OBJECT

In [17]:
# Assign with [] rather than .loc[:, col] so each column is replaced outright and takes the float dtype
baseball_data['hit_bb_rate'] = baseball_data['hit_bb_rate'].str.rstrip('%').astype('float') / 100.0
baseball_data['hit_k_rate'] = baseball_data['hit_k_rate'].str.rstrip('%').astype('float') / 100.0
baseball_data['pitch_k_rate'] = baseball_data['pitch_k_rate'].str.rstrip('%').astype('float') / 100.0
baseball_data['pitch_bb_rate'] = baseball_data['pitch_bb_rate'].str.rstrip('%').astype('float') / 100.0
#baseball_data.loc[:,'season'] = pd.to_datetime(baseball_data.loc[:,'season'], format = "%Y")
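
Since the rule is "every column with 'rate' in its name", the same conversion can also be written as a loop over the matching columns. The sketch below runs on a made-up dataframe; `toy` is hypothetical, and the exact string format (for example '8.0%') is an assumption about how the percentages are stored.

```python
import pandas as pd

# Hypothetical percent strings; the exact format in the downloaded files is assumed
toy = pd.DataFrame({
    'team': ['AAA', 'BBB'],
    'hit_bb_rate': ['8.0%', '10.4%'],
    'hit_k_rate': ['15.2%', '16.4%'],
})

# Select the columns to convert by name
rate_cols = [col for col in toy.columns if 'rate' in col]

for col in rate_cols:
    # Strip the trailing % sign, parse as float, and rescale to a decimal rate
    toy[col] = toy[col].str.rstrip('%').astype(float) / 100.0

print(rate_cols)                       # ['hit_bb_rate', 'hit_k_rate']
print(toy[rate_cols].dtypes.tolist())  # both columns are now float64
```

Checking the dtypes afterwards confirms that the columns can be used in arithmetic.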

Although we already have data on wins, a team's success is usually measured by win percentage, which is what is used when determining playoff teams. Teams usually play the same number of games (162), but there is a chance for slight variation. To account for this, we added a column called win_pct, which we calculated from our wins and games columns.

In [18]:
baseball_data['win_pct'] = baseball_data['wins']/baseball_data['games']
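
As a small worked example of why the ratio matters, consider two made-up teams that played different numbers of games (`toy` and its values are hypothetical and are not taken from our data).

```python
import pandas as pd

# Hypothetical teams: BBB played one game fewer than the usual 162
toy = pd.DataFrame({'team': ['AAA', 'BBB'],
                    'wins': [81, 90],
                    'games': [162, 161]})

# Win percentage is wins divided by games played
toy['win_pct'] = toy['wins'] / toy['games']

print(toy.loc[0, 'win_pct'])            # 0.5
print(round(toy.loc[1, 'win_pct'], 3))  # 0.559
```

Dividing by the games actually played keeps teams comparable even when their schedules differ slightly.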

Here is our cleaned data:

In [19]:
display(baseball_data)

Unnamed: 0,season,team,pa,hit_hr,runs_scored,rbi,hit_bb_rate,hit_k_rate,iso,bat_avg,...,pitch_hr,pitch_bb,pitch_so,pitch_k_rate,pitch_bb_rate,whip,fp,drs,uzr,win_pct
0,2006,ARI,6330,160,773,743,0.080,0.152,0.157,0.267,...,168,536,1115,0.176,0.085,1.40,0.983,5,-27.7,0.469136
1,2006,ATL,6284,222,849,818,0.084,0.186,0.184,0.270,...,183,572,1049,0.165,0.090,1.46,0.984,-13,10.0,0.487654
2,2006,BAL,6240,164,768,727,0.076,0.141,0.146,0.277,...,216,613,1016,0.161,0.097,1.54,0.983,-13,4.8,0.432099
3,2006,BOS,6435,192,820,777,0.104,0.164,0.166,0.269,...,181,509,1070,0.170,0.081,1.44,0.989,-57,-21.9,0.530864
4,2006,CHC,6147,166,716,677,0.064,0.151,0.154,0.268,...,210,687,1250,0.196,0.108,1.45,0.982,-16,31.2,0.407407
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
415,2019,STL,6167,210,764,714,0.091,0.230,0.170,0.245,...,191,545,1399,0.231,0.090,1.27,0.989,91,32.8,0.561728
416,2019,TBR,6285,217,769,730,0.086,0.238,0.178,0.254,...,181,453,1621,0.266,0.074,1.17,0.985,53,-2.1,0.592593
417,2019,TEX,6204,223,810,765,0.086,0.254,0.183,0.248,...,241,583,1379,0.217,0.092,1.46,0.982,-52,-11.1,0.481481
418,2019,TOR,6091,247,726,697,0.084,0.249,0.192,0.236,...,228,604,1332,0.211,0.096,1.43,0.984,0,-24.9,0.413580
