# Project Phase II: Anna Asch and Anna Clemson

### Research Question(s)
1. In the post-steroid era, were the most successful teams built on elite hitting, pitching, or fielding? Using statistics that measure quality, is it more valuable to score runs or to prevent them? Is a run scored really worth the same as a run prevented?
2. Although all of this data comes from the same "era," the game has clearly changed within it. Does the data support the idea that the make-up of the most successful teams has changed over the last 15 years or so? (By make-up we mean, for example, whether hitting has become more important.)

Although there is some research on these topics, most of it is outdated or looks only at run totals or basic stats that are widely regarded as poor measures of a player's actual quality (e.g., batting average). We want to see whether, beyond just runs, quality offense, quality pitching, or quality fielding is most valuable for a team's success (measured, for example, by OPS for hitting or FIP for pitching).

### Data Cleaning

To begin, we imported the pandas and numpy packages, which provide the tools we needed to clean and reshape our data.

In [1]:
import pandas as pd
import numpy as np

Our first step in cleaning the data was to import 8 `.csv` files that we downloaded from [fangraphs.com](fangraphs.com). Using FanGraphs, we specified that we wanted team data for the years 2006-2019 and then downloaded `.csv` files of their compilations of various hitting, pitching, and fielding data using their dashboard, standard, and advanced categories. We created dataframes for each of these so that we could clean the data in Python.

In [2]:
hitting_dashboard = pd.read_csv('hitting_dashboard.csv')
hitting_standard = pd.read_csv('hitting_standard.csv')
hitting_advanced = pd.read_csv('hitting_advanced.csv')
pitching_dashboard = pd.read_csv('pitching_dashboard.csv')
pitching_standard = pd.read_csv('pitching_standard.csv')
pitching_advanced = pd.read_csv('pitching_advanced.csv')
fielding_dashboard = pd.read_csv('fielding_dashboard.csv')
fielding_advanced = pd.read_csv('fielding_advanced.csv')

Then we selected which columns we wanted to keep from the three hitting dataframes we created from the FanGraphs data. FanGraphs includes a lot of statistics in its datasheets, but we only look at some of them in our analysis. Additionally, there are some redundancies across these dataframes (like `PA`), so we also wanted to ensure that we eliminated those. All of the dataframes still include the `Season` and `Team` columns so that we can combine them later.

In [3]:
hitting_dashboard = hitting_dashboard.drop(columns=['G', 'SB', 'BABIP', 'EV', 'BsR', 'Off', 'Def'])
hitting_standard = hitting_standard[['Season', 'Team', 'H']]
hitting_advanced = hitting_advanced[['Season', 'Team', 'OPS']]

We selected which columns we wanted to include from the three pitching dataframes we created from the FanGraphs data. They include a lot of statistics in their datasheets, but we are only going to look at some of this data in our analysis. Additionally, there are some redundancies across these dataframes (like `ERA`), so we also wanted to ensure that we eliminated those. All of the dataframes still included the 'Season' and 'Team' columns so that we could combine them later.

In [4]:
pitching_dashboard = pitching_dashboard.drop(columns=['G', 'K/9', 'BB/9', 'HR/9', 'BABIP', 'LOB%', 'GB%', 'HR/FB', 'EV','xFIP'])
pitching_standard = pitching_standard[['Season', 'Team', 'SV', 'IP', 'H', 'R', 'ER', 'HR', 'BB', 'SO']]
pitching_advanced = pitching_advanced[['Season', 'Team', 'K%', 'BB%', 'WHIP']]

We selected which columns we wanted to include from the two fielding dataframes we created from the FanGraphs data. FanGraphs includes a lot of statistics in its datasheets, but we only look at some of them in our analysis. This was especially true for fielding, since many fielding statistics are closely related and some reflect overall team defense better than others. There are some redundancies across these dataframes, although none of the data we wanted was repeated in this case. Both of the dataframes still include the `Season` and `Team` columns so that we can combine them later.

In [5]:
fielding_dashboard = fielding_dashboard[['Season', 'Team', 'FP']]
fielding_advanced = fielding_advanced[['Season', 'Team', 'DRS', 'UZR']]

FanGraphs uses different names in the `Team` column for its fielding statistics (these are in our `fielding_dashboard` and `fielding_advanced` dataframes). We had to convert these names to match the conventional three-letter abbreviations used across the other dataframes to distinguish teams. We did this by creating a dictionary with the needed changes for all 30 teams and then passing it to the `.replace()` method of each dataframe.

In [6]:
team_name_dictionary = {'Angels': 'LAA', 'Braves': 'ATL', 'Astros': 'HOU', 'Athletics': 'OAK', 'Blue Jays': 'TOR', 'Royals': 'KCR', 'Tigers': 'DET', 
                       'Twins': 'MIN', 'White Sox': 'CHW', 'Yankees': 'NYY', 'Brewers': 'MIL', 'Cardinals': 'STL', 'Cubs': 'CHC', 'Devil Rays': 'TBR',
                       'Diamondbacks': 'ARI', 'Dodgers': 'LAD', 'Giants': 'SFG', 'Indians': 'CLE', 'Mariners': 'SEA', 'Marlins': 'MIA', 'Mets': 'NYM',
                       'Nationals': 'WSN', 'Orioles': 'BAL', 'Padres': 'SDP', 'Phillies': 'PHI', 'Pirates': 'PIT', 'Rangers': 'TEX', 'Red Sox': 'BOS',
                       'Reds': 'CIN', 'Rockies': 'COL', 'Rays': 'TBR'}
fielding_dashboard = fielding_dashboard.replace(team_name_dictionary, value=None)
fielding_advanced = fielding_advanced.replace(team_name_dictionary, value=None)
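
As a minimal sketch (not the code we ran), the same renaming can be limited to the `Team` column and checked for names that the dictionary misses. The toy dataframe, its values, and the helper name `to_abbreviation` are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-in for a FanGraphs fielding export (values are made up).
toy_fielding = pd.DataFrame({
    'Season': [2006, 2006, 2019],
    'Team': ['Devil Rays', 'Marlins', 'Rays'],
    'FP': [0.980, 0.982, 0.985],
})

# Subset of the team-name dictionary, just enough for the toy rows.
toy_names = {'Devil Rays': 'TBR', 'Marlins': 'MIA', 'Rays': 'TBR'}

def to_abbreviation(df, mapping):
    """Hypothetical helper: map full team names to abbreviations in 'Team' only."""
    out = df.copy()
    mapped = out['Team'].map(mapping)
    # Names missing from the mapping become NaN, so they are easy to detect.
    missing = out.loc[mapped.isna(), 'Team'].unique().tolist()
    if missing:
        raise ValueError(f'No abbreviation for: {missing}')
    out['Team'] = mapped
    return out

toy_clean = to_abbreviation(toy_fielding, toy_names)
print(toy_clean['Team'].tolist())
```

Mapping only the `Team` column avoids accidentally replacing a matching value in some other column, and the explicit check surfaces any team name the dictionary does not cover.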

There was also an inconsistency in the way two teams were named because they rebranded during the time frame that we are using. The Miami Marlins (`MIA`) used to be the Florida Marlins (`FLA`) and the Tampa Bay Rays (`TBR`) used to be the Tampa Bay Devil Rays (`TBD`). These are the same franchises, so they should have the same name in our data. We already resolved this in the fielding data through the way we constructed our dictionary, but we still had to resolve these inconsistencies in our other dataframes.

In [7]:
florida_renames = {'FLA': 'MIA', 'TBD': 'TBR'}
hitting_dashboard = hitting_dashboard.replace(florida_renames, value=None)
hitting_standard = hitting_standard.replace(florida_renames, value=None)
hitting_advanced = hitting_advanced.replace(florida_renames, value=None)
pitching_dashboard = pitching_dashboard.replace(florida_renames, value=None)
pitching_standard = pitching_standard.replace(florida_renames, value=None)
pitching_advanced = pitching_advanced.replace(florida_renames, value=None)

In order to combine all of our dataframes, we needed to make sure that all of the data was in the same order so we could concatenate across and include the proper data for each team in its respective row. To do this, we sorted each dataframe by `Season` (chronologically) and then by `Team` (alphabetically). We then reset the index and dropped the old index so that we didn't have any unnecessary columns and all of our data was in order.

In [8]:
hitting_dashboard = hitting_dashboard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
hitting_standard = hitting_standard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
hitting_advanced = hitting_advanced.sort_values(by=['Season', 'Team']).reset_index(drop=True)
pitching_dashboard = pitching_dashboard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
pitching_standard = pitching_standard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
pitching_advanced = pitching_advanced.sort_values(by=['Season', 'Team']).reset_index(drop=True)
fielding_dashboard = fielding_dashboard.sort_values(by=['Season', 'Team']).reset_index(drop=True)
fielding_advanced = fielding_advanced.sort_values(by=['Season', 'Team']).reset_index(drop=True)

Now that our dataframes were properly sorted, we were able to concatenate the data within each category. We combine everything later, but to keep track of the columns in our subsequent renaming step we decided to keep the hitting, pitching, and fielding data separate for this part. To combine each category, we used the `pd.concat()` function and specified `axis=1` in order to "smoosh" together the columns. We had used the season and team data to make sure everything was in the same order, but after the concatenation we had duplicates of those columns, so we eliminated them with the `.duplicated()` method.

In [9]:
hitting = pd.concat([hitting_dashboard, hitting_standard, hitting_advanced], axis=1)
hitting = hitting.loc[:,~hitting.columns.duplicated()]

pitching = pd.concat([pitching_dashboard, pitching_standard, pitching_advanced], axis=1)
pitching = pitching.loc[:,~pitching.columns.duplicated()]

fielding = pd.concat([fielding_dashboard, fielding_advanced], axis=1)
fielding = fielding.loc[:,~fielding.columns.duplicated()]
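
Concatenating along `axis=1` relies on every dataframe having its rows in exactly the same order. As an alternative sketch (not what we ran), joining on the `Season` and `Team` keys with `DataFrame.merge` does not depend on row order. The toy frames and their values below are made up for illustration.

```python
import pandas as pd

# Toy frames with made-up values; note the rows are in different orders.
toy_dashboard = pd.DataFrame({
    'Season': [2006, 2006], 'Team': ['ARI', 'ATL'], 'HR': [160, 222],
})
toy_advanced = pd.DataFrame({
    'Season': [2006, 2006], 'Team': ['ATL', 'ARI'], 'OPS': [0.780, 0.755],
})

# Join on the shared keys; validate raises if a (Season, Team) pair repeats.
toy_hitting = toy_dashboard.merge(
    toy_advanced, on=['Season', 'Team'], how='inner', validate='one_to_one'
)
print(toy_hitting)
```

Because the keys are matched explicitly, the result has a single `Season` and `Team` column and no duplicated columns to remove afterwards.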

Once we made the larger hitting, pitching, and fielding dataframes, we wanted to rename our columns so that they would be easier to work with later on. The first step for all of our new dataframes was to make the column names lowercase, which we did with `colname.lower()`. Then we specified name changes for columns that had ambiguous titles, could be confused between hitting and pitching stats, or had problematic characters like `%` and `+`. For overlapping statistics (like `H`), we used names such as `hit_hits` and `pitch_hits` to clarify who those hits belonged to.

In [10]:
hitting_lower = [colname.lower() for colname in hitting.columns]
hitting.columns = hitting_lower;
hitting = hitting.rename(columns = {'hr':'hit_hr', 'r':'runs_scored', 'bb%':'hit_bb_rate', 'k%':'hit_k_rate', 'avg':'bat_avg', 'wrc+':'wrc_plus', 'war':'hit_fwar', 'h':'hit_hits'})

pitching_lower = [colname.lower() for colname in pitching.columns]
pitching.columns = pitching_lower;
pitching = pitching.rename(columns = {'w':'wins', 'l':'losses', 'sv':'saves', 'gs':'games', 'war':'pitch_fwar', 'h':'pitch_hits', 'r':'runs_allowed', 'hr': 'pitch_hr', 'k%':'pitch_k_rate', 'bb%':'pitch_bb_rate', 'bb':'pitch_bb','so':'pitch_so'})

fielding_lower = [colname.lower() for colname in fielding.columns]
fielding.columns = fielding_lower;
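
The renaming step can be illustrated on a toy dataframe. This is a sketch with made-up values, and it uses the vectorized `columns.str.lower()` accessor in place of the list comprehension above. One reason `%` and `+` are awkward is that such names are not valid Python identifiers, which the last line checks.

```python
import pandas as pd

# Toy hitting frame with made-up values and FanGraphs-style column names.
toy = pd.DataFrame({
    'Season': [2006], 'Team': ['ARI'], 'HR': [160], 'BB%': ['8.0%'], 'wRC+': [95],
})

# Lowercase every column name, then rename the ambiguous or awkward ones.
toy.columns = toy.columns.str.lower()
toy = toy.rename(columns={'hr': 'hit_hr', 'bb%': 'hit_bb_rate', 'wrc+': 'wrc_plus'})

print(list(toy.columns))
# After renaming, every column name is a valid Python identifier.
print(all(name.isidentifier() for name in toy.columns))
```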

Finally we were ready to concatenate all of the data. We followed a similar process to when we concatenated all of the hitting/pitching/fielding data separately, but this time we made one big dataframe with all data we could want to access. We did a similar process to eliminate duplicated columns (team and season).

In [11]:
baseball_data = pd.concat([hitting, pitching, fielding], axis=1)
baseball_data = baseball_data.loc[:,~baseball_data.columns.duplicated()]

Now we had to make sure our data was in a format that we could work with. Some of our data was stored as objects (strings) holding percentages, but we want to be able to manipulate those values as floats. Below we converted all of these columns to floats. This applied to every column with `rate` in its name (e.g., `pitch_k_rate`). To do this, we treated the values as strings, removed the `%` sign, converted the result to a float, and then divided by 100 (since we want a decimal representing a rate).

In [12]:
# Assign by column name so the converted columns take on a float dtype
# (in recent pandas, assigning through .loc[:, col] can leave the object dtype in place)
baseball_data['hit_bb_rate'] = baseball_data['hit_bb_rate'].str.rstrip('%').astype('float') / 100.0
baseball_data['hit_k_rate'] = baseball_data['hit_k_rate'].str.rstrip('%').astype('float') / 100.0
baseball_data['pitch_k_rate'] = baseball_data['pitch_k_rate'].str.rstrip('%').astype('float') / 100.0
baseball_data['pitch_bb_rate'] = baseball_data['pitch_bb_rate'].str.rstrip('%').astype('float') / 100.0
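
The same conversion can be sketched as a small helper applied to every column whose name ends in `_rate`. The helper name `percent_to_rate` and the toy values are assumptions for illustration, not part of our notebook.

```python
import pandas as pd

# Toy frame with percent strings, in the format of the FanGraphs rate columns.
toy = pd.DataFrame({
    'hit_bb_rate': ['8.0%', '10.4%'],
    'hit_k_rate': ['15.2%', '16.4%'],
})

def percent_to_rate(col):
    """Hypothetical helper: turn strings like '8.0%' into floats like 0.08."""
    return col.str.rstrip('%').astype('float') / 100.0

# Convert each rate column in turn; plain column assignment updates the dtype.
for name in [c for c in toy.columns if c.endswith('_rate')]:
    toy[name] = percent_to_rate(toy[name])

print(toy.dtypes)
```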

Although we already have data on wins, a team's success is usually judged by win percentage (which is what is used when determining playoff teams), since teams usually play the same number of games (162) but slight variation is possible. To include this, we added a column called `win_pct`, which we calculated from `wins` and `games`. Similarly, we created a run differential column called `run_diff` from `runs_scored` and `runs_allowed`, because it is a quick way to tell whether a team scored more runs than it allowed over the season. Although we want to stick with specific quality measures when answering our research questions, this is a good diagnostic statistic to potentially use as a measure of success.

In [13]:
baseball_data['win_pct'] = baseball_data['wins']/baseball_data['games']
baseball_data['run_diff'] = baseball_data['runs_scored'] - baseball_data['runs_allowed']
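
As a worked example of these two formulas, the sketch below uses a single toy row. The `runs_scored` value matches the 2006 ARI row of our cleaned data, while `wins`, `games`, and `runs_allowed` are back-calculated from that row's `win_pct` and `run_diff`, so treat them as illustrative assumptions.

```python
import pandas as pd

# One toy row; wins, games, and runs_allowed are illustrative back-calculated values.
toy = pd.DataFrame({
    'wins': [76], 'games': [162], 'runs_scored': [773], 'runs_allowed': [788],
})

# Same formulas as in the cell above.
toy['win_pct'] = toy['wins'] / toy['games']
toy['run_diff'] = toy['runs_scored'] - toy['runs_allowed']

print(round(toy.loc[0, 'win_pct'], 6), toy.loc[0, 'run_diff'])
```

A negative `run_diff` paired with a `win_pct` below 0.5 is the pattern we would expect: a team that allows more runs than it scores usually loses more games than it wins.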

In [14]:
baseball_data.to_csv('baseball_data.csv')
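
One detail to be aware of: by default, `to_csv` also writes the row index, which shows up as an extra `Unnamed: 0` column when the file is read back. The sketch below demonstrates this on a toy dataframe using in-memory buffers in place of real files.

```python
import io
import pandas as pd

toy = pd.DataFrame({'season': [2006, 2006], 'team': ['ARI', 'ATL']})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Passing `index=False` (or reading the file back with `index_col=0`) keeps the saved file limited to the actual data columns.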

Here is our cleaned data:

In [15]:
display(baseball_data)

| | season | team | pa | hit_hr | runs_scored | rbi | hit_bb_rate | hit_k_rate | iso | bat_avg | ... | pitch_bb | pitch_so | pitch_k_rate | pitch_bb_rate | whip | fp | drs | uzr | win_pct | run_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | ARI | 6330 | 160 | 773 | 743 | 0.080 | 0.152 | 0.157 | 0.267 | ... | 536 | 1115 | 0.176 | 0.085 | 1.40 | 0.983 | 5 | -27.7 | 0.469136 | -15 |
| 1 | 2006 | ATL | 6284 | 222 | 849 | 818 | 0.084 | 0.186 | 0.184 | 0.270 | ... | 572 | 1049 | 0.165 | 0.090 | 1.46 | 0.984 | -13 | 10.0 | 0.487654 | 44 |
| 2 | 2006 | BAL | 6240 | 164 | 768 | 727 | 0.076 | 0.141 | 0.146 | 0.277 | ... | 613 | 1016 | 0.161 | 0.097 | 1.54 | 0.983 | -13 | 4.8 | 0.432099 | -131 |
| 3 | 2006 | BOS | 6435 | 192 | 820 | 777 | 0.104 | 0.164 | 0.166 | 0.269 | ... | 509 | 1070 | 0.170 | 0.081 | 1.44 | 0.989 | -57 | -21.9 | 0.530864 | -5 |
| 4 | 2006 | CHC | 6147 | 166 | 716 | 677 | 0.064 | 0.151 | 0.154 | 0.268 | ... | 687 | 1250 | 0.196 | 0.108 | 1.45 | 0.982 | -16 | 31.2 | 0.407407 | -118 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 415 | 2019 | STL | 6167 | 210 | 764 | 714 | 0.091 | 0.230 | 0.170 | 0.245 | ... | 545 | 1399 | 0.231 | 0.090 | 1.27 | 0.989 | 91 | 32.8 | 0.561728 | 102 |
| 416 | 2019 | TBR | 6285 | 217 | 769 | 730 | 0.086 | 0.238 | 0.178 | 0.254 | ... | 453 | 1621 | 0.266 | 0.074 | 1.17 | 0.985 | 53 | -2.1 | 0.592593 | 113 |
| 417 | 2019 | TEX | 6204 | 223 | 810 | 765 | 0.086 | 0.254 | 0.183 | 0.248 | ... | 583 | 1379 | 0.217 | 0.092 | 1.46 | 0.982 | -52 | -11.1 | 0.481481 | -68 |
| 418 | 2019 | TOR | 6091 | 247 | 726 | 697 | 0.084 | 0.249 | 0.192 | 0.236 | ... | 604 | 1332 | 0.211 | 0.096 | 1.43 | 0.984 | 0 | -24.9 | 0.413580 | -102 |


### Data Description

**What are the observations (rows) and the attributes (columns)?**

Each row represents one season for one Major League Baseball team. The seasons range from 2006 to 2019. Each of the 30 MLB teams has data for each of these years, making a total of 420 observations.

Each column represents a particular baseball statistic, averaged or summed over all players on a team. For example, the `hit_hr` column represents the total number of home runs hit by a team in a season. The columns include information on hitting, pitching, and fielding, as well as a couple of measurements of overall team performance.
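
The row count described above (30 teams × 14 seasons = 420 observations) can be checked with a small helper. This is a sketch under stated assumptions: the function name `check_complete_panel` is hypothetical, and it is demonstrated here on a toy dataframe with 2 teams and 2 seasons.

```python
import pandas as pd

def check_complete_panel(df, seasons, n_teams):
    """Hypothetical check: one row per (season, team) and n_teams in every season."""
    # No (season, team) pair may appear twice.
    assert not df.duplicated(subset=['season', 'team']).any()
    # Every expected season is present, each with the expected number of teams.
    per_season = df.groupby('season')['team'].nunique()
    assert per_season.index.tolist() == list(seasons)
    assert (per_season == n_teams).all()
    return len(df)

# Toy panel: 2 teams over 2 seasons.
toy = pd.DataFrame({
    'season': [2006, 2006, 2007, 2007],
    'team': ['ARI', 'ATL', 'ARI', 'ATL'],
})
n_rows = check_complete_panel(toy, seasons=range(2006, 2008), n_teams=2)
print(n_rows)
```

If the description above holds, calling the helper on the cleaned data with `seasons=range(2006, 2020)` and `n_teams=30` should return 420.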

**Why was this dataset created?**

We created this dataset to aggregate various team statistics for all Major League Baseball teams from 2006 to 2019. Our dataset contains data not only about one aspect of baseball (i.e., pitching, fielding, or hitting), but combines information about all of these aspects into one dataset. We chose to take data starting in 2006 because it is considered to be the first season in the start of the "Post-Steroid Era" according to Dr. Michael Woltring et al. in [_The Sport Journal_](https://thesportjournal.org/article/examining-perceptions-of-baseballs-eras/). Although there is data from the 2020 season, that season was shortened and there were some rule changes due to the COVID-19 pandemic, so we decided to end our dataset with the 2019 season. 

**Who funded the creation of the dataset?**

The raw data which makes up our dataset comes from [fangraphs.com](fangraphs.com), a baseball statistics and analytics website. FanGraphs is funded by individual membership subscriptions, and collects and maintains their data for fans. Their website provides ways for users to filter data by team, year, and other aspects in order for users to easily compare data of interest. FanGraphs obtains their data from various sources. According to their website, "All major league baseball data including pitch type, velocity, batted ball location, and play-by-play data provided by Sports Info Solutions;" and "Major League and Minor League Baseball data provided by Major League Baseball."

**What processes might have influenced what data was observed and recorded and what was not?**

In general, Major League Baseball, Sports Info Solutions, and by proxy FanGraphs are very thorough in their data collection, as they include all games for all teams in their data. Given how widely available baseball data is in the twenty-first century, the data observed should be complete and unbiased. This means that the information that is missing is largely data we have chosen to omit, were unable to include, or which is unmeasurable. For example, we did not include any statistics that require newer technology to gather (such as Statcast data like exit velocity). This limits what we can look at, but that data is not available across the entire era we are studying, so we decided to focus on statistics that are collected (or calculated) in the more traditional way. The biggest unmeasured influence on the data is park effects for certain teams. For example, the Colorado Rockies play half their games every season at Coors Field, which is at altitude. Therefore, they might score more runs and their pitchers might allow more runs just because of where they play. However, we would still expect to be able to see whether their pitching or hitting was more important using stats like run differential or win percentage. Nevertheless, parks with these extreme effects could definitely be outliers, and our data does not have the information to account for this.

**What preprocessing was done, and how did the data come to be in the form that you are using?**

The preprocessing that was done is described in more detail in our Data Cleaning section. However, here is a summary. We downloaded 8 `.csv` files from [fangraphs.com](fangraphs.com): 3 for hitting, 3 for pitching, and 2 for fielding. Each of these `.csv` files contained 420 observations, one for each Major League Baseball team in each season from 2006 to 2019. We dropped overlapping columns and those not needed for our data analysis, and renamed the remaining columns to make them more indicative of their contents and to eliminate ambiguity. Some of the team naming was inconsistent, so we also renamed some of the team values. We also converted some of the columns from strings to numeric values. Finally, we combined the information from all 8 dataframes.

**If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?**

The data does originate from people. However, the data in our data set is aggregated across entire Major League Baseball teams. Major League Baseball has a lot of data in the public domain and the baseball players who played in the games that generated this data were aware of the data collection. In general, players know that there is no limitation on what baseball data will be used for and are aware that it is generally intended for baseball research to learn about trends in the game and how players and teams compare to each other.

**Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted in a Cornell Google Drive or Cornell Box).**

Our raw source data can be found in this [Cornell Google Drive folder](https://drive.google.com/drive/folders/1uDGY4ISnM3rEx5NPR95Btq6YsGBkI9nJ?usp=sharing). Our cleaned data is named `baseball_data.csv`, and the raw source data from which we obtained the cleaned data is the 8 other `.csv` files.

### Data Limitations
- One limitation is an inability to control for park factors within this data. For example, the Colorado Rockies play in Denver, which is at altitude, and thus ball movement is affected. This means, anecdotally, hitters tend to display more power in Colorado and pitchers tend to struggle more. Data in other research definitely supports this, but it will be interesting to see how that shows up in our analysis. Depending on what statistics we use, it is possible that these effects will balance out within the team data, but in isolated hitting and pitching stats we would see more extremes, since the Rockies play half their games in their own ballpark. Less extreme park effects may also come into play, since some parks are termed "pitcher" parks and others are referred to as "hitter" parks (["Ranking MLB's Most Hitter-Friendly Ballparks, by the Numbers"](https://bleacherreport.com/articles/2022901-ranking-mlbs-most-hitter-friendly-ballparks-by-the-numbers)). This could be something to look into more deeply in our data, just to see how real these terms are when applied to teams like the Milwaukee Brewers. A team like the Brewers, who play in Miller Park, which is considered a good hitter park, might want to focus more on developing a strong offense, whereas a team in a pitcher park might want to develop a strong rotation. Or perhaps, since both sides are affected by the park, it wouldn't matter. This might mean that some teams should take our conclusions with a grain of salt if their home ballpark is better suited for hitters or pitchers. Regardless, it seems that park factors may limit our ability to distinguish some trends across teams.
- We also don't have any data about strength of schedule. This means that we won't be able to account for offense being suppressed because a team had to play against a particularly strong pitching rotation more often than a weak one. To give an example from this year, the San Diego Padres will have to play against the Los Angeles Dodgers, who have one of the best starting rotations in MLB right now, a lot more than they will play the Baltimore Orioles, who don't have a stellar starting rotation at the moment. The same applies to pitchers facing particularly formidable lineups. That being said, we are looking at a lot of teams and MLB does try to create even schedules for teams, so the hope would be that this balances out across our data. Nevertheless, that is an assumption we are making.
- A third limitation ties into the limitation above about strength of schedule, but it has more to do with how MLB schedules are built in terms of the teams that you play (strength of those teams aside). Teams play a plurality of their games within their own division and a majority of games (including that plurality) within their own league ([wikipedia.org/wiki/Major_League_Baseball_schedule](https://en.wikipedia.org/wiki/Major_League_Baseball_schedule)). MLB is divided into two leagues, the American League (AL) and the National League (NL). The leagues are essentially the same, but the AL has a designated hitter (DH) who bats for the pitcher, whereas in the NL the pitcher still bats (for now - this will probably change in the coming seasons). This means that on average we would expect AL teams to produce more offense and NL teams to show more dominant pitching, but we should still be able to get an idea of the relative strength of offense and defense within a team. When an AL team plays an NL team, both teams abide by the same rules (i.e., they both have a DH or neither does) depending on who the home team is, so there isn't any disparity within a game. Although the league effect is noticeable, it shouldn't be so big that we can't say anything about the importance of offense vs. pitching vs. fielding for an MLB team. Also, within a team's own trends over time this won't really matter, with the exception of the Houston Astros, who switched to the AL in 2013 ([wikipedia.org/wiki/Houston_Astros](https://en.wikipedia.org/wiki/Houston_Astros)).

### Exploratory Data Analysis