# _Title of project_
#### by Anna Asch (aca89) and Anna Clemson (aac64)

## Introduction

1. In the post-steroid era, were the most successful teams built with elite hitting, pitching, or fielding? Based on data that represents quality, is it more valuable to score runs or prevent runs? Is a run scored really the same as a run prevented? To go deeper, we can bring in World Series/playoff success: anecdotally, people often say that pitching wins in the playoffs, so does that ring true?
2. Although all of this data is from the same "era," there is no doubt that the game has changed within this era. Does data support the idea that the make-up of the most successful teams has changed in the last 15 years or so? Offense seems to be trending up in terms of power, but so do strikeouts, so what does this mean for teams trying to build for success? (By make-up we mean, for example, has hitting become more important?)

Although there is existing research on these topics, most of it is outdated or relies on run totals and other basic statistics such as batting average, which are widely regarded as poor measures of a player's actual quality. We aim to explore whether, beyond raw run totals, quality offense, quality pitching, or quality fielding is most valuable for a team's success, using quality-oriented measures (e.g., OPS for hitting or FIP for pitching).

Our exploratory analysis reinforced these research questions. The data show the home run and strikeout trends noted in our second question, so it seems worth digging deeper to see whether there really has been a change. We are keeping fielding in the mix for now, but the exploratory analysis suggests it will be harder to draw a strong connection between success and fielding than between success and hitting or pitching, and we don't want to rule it out until we dig a bit deeper. There is a positive correlation, and anecdotally the difficulty may stem from the fact that truly horrific defenders are rare: pretty much everyone in MLB can make a routine play from their position on a regular basis. The average fielding percentage, which is above .980, reinforces this (this is why we are bringing in DRS and UZR, where there is a bit more differentiation).

## Data Description

**What are the observations (rows) and the attributes (columns)?**

Each row represents one season for one Major League Baseball team. The seasons range from 2006 to 2019. Each of the 30 MLB teams has data for each of these years, making a total of 420 observations.

Each column represents a particular baseball statistic, averaged or summed over all players on a team, depending on the statistic. Here is a chart which describes the meaning of each column in our dataset:

| Variable Name | Description |
| --- | --- |
| season | Year |
| team | Name of the team |
| pa | Number of plate appearances |
| hit_hr | Number of home runs hit |
| runs_scored | Number of runs scored |
| rbi | Number of runs batted in |
| hit_bb_rate | Walk rate for hitters |
| hit_k_rate | Strikeout rate for hitters |
| iso | Isolated Power: represents batters' raw power based on extra-base hits |
| bat_avg | Batting average |
| obp | On-base percentage |
| slg | Slugging percentage: Average number of bases earned per at bat |
| woba | Weighted on-base average: a version of obp that takes into account how a batter got on base (values power more) |
| wrc_plus | Weighted runs created plus: takes runs created (estimated offensive contribution) and adjusts for external factors |
| hit_fwar | FanGraphs Wins Above Replacement for hitters: measures a hitter's overall contribution to a team's wins (this includes defense, but excludes pitchers) |
| hit_hits | Number of hits (by batters) |
| ops | On-base percentage plus slugging |
| wins | Team's total wins |
| losses | Team's total losses |
| saves | Number of saves (pitching) |
| games | Number of games played |
| ip | Number of innings pitched |
| era | Earned run average |
| fip | Fielding independent pitching: similar to ERA but uses only the outcomes a pitcher has complete control over (HR, K, BB, HBP) |
| pitch_fwar | FanGraphs Wins Above Replacement for pitchers: measures a pitcher's overall contribution to a team's wins |
| pitch_hits | Hits allowed |
| runs_allowed | Runs allowed |
| er | Number of earned runs allowed |
| pitch_hr | Number of home runs allowed |
| pitch_bb | Number of walks allowed |
| pitch_so | Number of strikeouts by pitchers |
| pitch_k_rate | Strikeout rate for pitchers |
| pitch_bb_rate | Walk rate for pitchers |
| whip | Walks plus hits per inning pitched |
| fp | Fielding percentage |
| drs | Defensive runs saved |
| uzr | Ultimate zone rating: quantifies defensive performance by measuring runs saved, but using a different formula than DRS |
| win_pct | Win percentage (reported as a decimal) |
| run_diff | Run differential: runs scored - runs allowed |

Overall, the columns include information on hitting, pitching, and fielding, in addition to a few measures of overall team performance.
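
Several of the columns above are simple functions of others, which gives a cheap consistency check on the cleaned data. The sketch below uses a tiny made-up frame (the values and team labels are hypothetical, not taken from our dataset) and assumes the standard definitions `ops = obp + slg` and `iso = slg - bat_avg`. Because the source tables are rounded, real rows should only be expected to match within a small tolerance; the `0.002` used here is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that reuse our column names; values are illustrative only.
toy = pd.DataFrame({
    "team": ["AAA", "BBB"],
    "bat_avg": [0.260, 0.250],
    "obp": [0.330, 0.315],
    "slg": [0.420, 0.400],
    "ops": [0.750, 0.716],          # "reported" OPS, second row off by rounding
    "runs_scored": [780, 690],
    "runs_allowed": [720, 750],
})

# Recompute the derived statistics from their components.
toy["ops_check"] = toy["obp"] + toy["slg"]
toy["iso_check"] = toy["slg"] - toy["bat_avg"]
toy["run_diff_check"] = toy["runs_scored"] - toy["runs_allowed"]

# Flag rows where reported and recomputed OPS agree within the tolerance.
toy["ops_ok"] = np.isclose(toy["ops"], toy["ops_check"], atol=0.002)

print(toy[["team", "ops_check", "iso_check", "run_diff_check", "ops_ok"]])
```

Rows where `ops_ok` is `False` would be worth inspecting by hand before running any analysis on them.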

Source for some definitions: [MLB website](https://www.mlb.com/glossary/)

**Why was this dataset created?**

We created this dataset to aggregate various statistics for all Major League Baseball teams from 2006 to 2019. We chose to start in 2006 because it is considered to be the first season of the current "Post-Steroid Era" according to Dr. Michael Woltring et al. in [_The Sport Journal_](https://thesportjournal.org/article/examining-perceptions-of-baseballs-eras/). Although data exists from the 2020 season, teams played fewer games than usual in 2020 and there were some rule changes specifically related to the COVID-19 pandemic, as detailed in [MLB's announcement of the 2020 season](https://www.mlb.com/news/mlb-announces-2020-regular-season). For this reason, our dataset ends with the 2019 season.

Our dataset contains data not only about one aspect of baseball (i.e., pitching, fielding, or hitting), but combines information about all of these aspects into one dataset. This is useful because the intended use of this dataset is to analyze the relative importance of pitching, fielding, and hitting (three different aspects in which a team can invest) to a team's success. The goal is for this data to help answer our research question about how teams should build for success based on the way the game has been played thus far in the "Post-Steroid Era." That being said, we also may want to use the data to look at any trends within this era to see if perhaps teams are changing how they build for success due to macro trends within the game (for example the growth of the home run and changes to pitcher use which is generating more strikeouts).

**Who funded the creation of the dataset?**

The raw data which makes up our dataset comes from [fangraphs.com](fangraphs.com), a baseball statistics and analytics website. FanGraphs is funded by individual membership subscriptions, and collects and maintains their data for baseball fans. Their website provides an easy-to-use interface for users to filter data by team, year, and other aspects, after which users can easily compare statistics of interest. FanGraphs obtains their data from various sources. According to their website, "All major league baseball data including pitch type, velocity, batted ball location, and play-by-play data [is] provided by Sports Info Solutions;" and "Major League and Minor League Baseball data [is] provided by Major League Baseball."

**What processes might have influenced what data was observed and recorded and what was not?**

In general, Major League Baseball, Sports Info Solutions, and by proxy FanGraphs are very thorough in their data collection, as they include all games for all teams in their data. Given how comprehensively baseball data is recorded in the twenty-first century, the data observed should be complete and unbiased. This means that any missing information is largely data that we chose to omit, that we were unable to include, or that is unmeasurable. For example, we did not include any statistics that require tracking technology to gather, like [Statcast](https://www.mlb.com/glossary/statcast) data, which includes measurements such as exit velocity (how fast a ball is hit by a hitter). While this limits the information available in our dataset, that kind of data was not collected across the entire era which we are examining, so our dataset instead focuses on statistics that are collected (or calculated) in the more traditional way.

The biggest influence on the data which is unaccounted for is park effects for certain teams. For example, the Colorado Rockies play half their games every season at Coors Field, which is at altitude (almost a mile above sea level). Therefore, they might have more offense and their pitchers might allow more runs because of the location of their field. However, we expect to still be able to detect whether their pitching or hitting played a more important role using statistics from our dataset like run differential or win percentage. Nevertheless, data generated by games played at parks with these extreme effects may still be outliers, and our dataset, for the most part, does not have information to account for this (the one exception is wRC+, which does try to take park effects into account).

**What preprocessing was done, and how did the data come to be in the form that you are using?**

The preprocessing that was done is described in more detail in our Data Cleaning section. However, here is a summary. We downloaded 8 `.csv` files from [fangraphs.com](fangraphs.com): 3 for hitting, 3 for pitching, and 2 for fielding. Each of these `.csv` files contained 420 observations, one for each Major League Baseball team across the years 2006 - 2019. We dropped overlapping columns and those not needed for our data analysis, and renamed the remaining columns to make them more indicative of their contents and to eliminate ambiguity (e.g., between runs scored and runs allowed). Some of the team naming was inconsistent, so we also renamed some of the observations to restore consistency. We also changed some of the columns to numeric values from strings. Eventually, we combined information from all 8 dataframes together to obtain our final dataset. We also added a couple of summary statistics columns (win percentage and run differential) based on data from the original `.csv` files that we downloaded. You can find details about accessing the raw source data from FanGraphs and our processed dataset below.
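
The steps summarized above can be sketched on two small stand-in frames. This is not our actual cleaning code (that lives in the Data Cleaning section); the raw column names (`Season`, `Team`, `R`, `W`, `L`), the team-name mapping, and all values below are assumptions made for illustration. In particular, computing `win_pct` as wins divided by wins plus losses is one plausible definition, stated here as an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for two of the raw exports (placeholder values).
hitting = pd.DataFrame({
    "Season": [2006, 2006],
    "Team": ["Diamondbacks", "Braves"],   # inconsistent naming
    "R": ["773", "849"],                  # numbers stored as strings
})
pitching = pd.DataFrame({
    "Season": [2006, 2006],
    "Team": ["ARI", "ATL"],
    "W": [76, 79],
    "L": [86, 83],
    "R": [788, 805],
})

# Map inconsistent team labels onto one abbreviation (illustrative mapping).
team_fix = {"Diamondbacks": "ARI", "Braves": "ATL"}
hitting["Team"] = hitting["Team"].replace(team_fix)

# Rename before merging so the two "R" columns cannot be confused.
hitting = hitting.rename(
    columns={"Season": "season", "Team": "team", "R": "runs_scored"})
pitching = pitching.rename(
    columns={"Season": "season", "Team": "team", "W": "wins",
             "L": "losses", "R": "runs_allowed"})

# Convert string columns to numeric values.
hitting["runs_scored"] = pd.to_numeric(hitting["runs_scored"])

# Combine on the (season, team) key; validate raises if a key is duplicated.
merged = hitting.merge(pitching, on=["season", "team"],
                       how="inner", validate="one_to_one")

# Derived summary columns (the win_pct formula is an assumption).
merged["win_pct"] = merged["wins"] / (merged["wins"] + merged["losses"])
merged["run_diff"] = merged["runs_scored"] - merged["runs_allowed"]

print(merged)
```

Using `validate="one_to_one"` is a design choice: with one row per team per season in every file, a duplicated key would signal a naming inconsistency that slipped through the cleaning step.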

**If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?**

The data does originate from people. However, the observations in our dataset are aggregated across entire Major League Baseball teams, so no data is specifically tied to a person. Major League Baseball has a lot of data in the public domain, and the baseball players who played in the games that generated this data were aware of the data collection. In general, players know that there is no limitation on what baseball data will be used for. They are aware that it is generally intended for baseball researchers to learn about trends in the game and compare players and teams to each other (like we're doing).

**Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted in a Cornell Google Drive or Cornell Box).**

Our raw source data can be found in this [Cornell Google Drive folder](https://drive.google.com/drive/folders/1uDGY4ISnM3rEx5NPR95Btq6YsGBkI9nJ?usp=sharing). Our cleaned data is named `baseball_data.csv`, and the raw source data from which we obtained the cleaned data is the 8 other `.csv` files.

## Preregistered Analyses

In this report, we will .... 

## Data Analysis

To start, we import the necessary libraries, such as pandas and numpy, so that we can perform our analyses.

In [41]:
## Load libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [54]:
bb = pd.read_csv('baseball_data.csv', index_col=0)

In [55]:
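# Restrict to the 2006 season and hold out 20% of the teams for evaluation
bb2006 = bb[bb.season == 2006]
X_train, X_test, y_train, y_test = train_test_split(bb2006[['wrc_plus']], bb2006['win_pct'], test_size=0.2)

# Fit on the training rows only so the held-out rows stay unseen
lr2006 = LinearRegression().fit(X_train, y_train)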
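
A train/test split is only informative when the model is fit on the training rows and scored on the held-out rows. The sketch below illustrates that pattern under two substitutions of ours, made to keep it self-contained: synthetic data stands in for `bb2006`, and NumPy's `polyfit` stands in for scikit-learn's `LinearRegression`. The slope and noise levels used to generate the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 30 team-seasons: wRC+ near 100, win_pct near .500.
wrc_plus = rng.normal(100, 8, size=30)
win_pct = 0.5 + 0.004 * (wrc_plus - 100) + rng.normal(0, 0.03, size=30)

# Hold out 20% of the rows (6 of 30), mirroring test_size=0.2.
idx = rng.permutation(30)
test_idx, train_idx = idx[:6], idx[6:]

# Fit slope and intercept on the training rows only.
slope, intercept = np.polyfit(wrc_plus[train_idx], win_pct[train_idx], deg=1)

def r_squared(x, y):
    """Coefficient of determination of the fitted line on (x, y)."""
    resid = y - (slope * x + intercept)
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

train_r2 = r_squared(wrc_plus[train_idx], win_pct[train_idx])
test_r2 = r_squared(wrc_plus[test_idx], win_pct[test_idx])
print("train R^2:", train_r2)
print("test R^2:", test_r2)
```

With only 6 held-out teams per season, the test score will be noisy, so a single-season split like this is better read as a sanity check than as a reliable estimate of out-of-sample fit.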

In [58]:
bb_ind_var = pd.get_dummies(bb, columns=['season'], drop_first=True)
bb_ind_var.head()

Unnamed: 0,team,pa,hit_hr,runs_scored,rbi,hit_bb_rate,hit_k_rate,iso,bat_avg,obp,...,season_2010,season_2011,season_2012,season_2013,season_2014,season_2015,season_2016,season_2017,season_2018,season_2019
0,ARI,6330,160,773,743,0.08,0.152,0.157,0.267,0.331,...,0,0,0,0,0,0,0,0,0,0
1,ATL,6284,222,849,818,0.084,0.186,0.184,0.27,0.337,...,0,0,0,0,0,0,0,0,0,0
2,BAL,6240,164,768,727,0.076,0.141,0.146,0.277,0.339,...,0,0,0,0,0,0,0,0,0,0
3,BOS,6435,192,820,777,0.104,0.164,0.166,0.269,0.351,...,0,0,0,0,0,0,0,0,0,0
4,CHC,6147,166,716,677,0.064,0.151,0.154,0.268,0.319,...,0,0,0,0,0,0,0,0,0,0


In [87]:
# One-hot encode season, dropping the first season (2006) as the baseline.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions.
ohe = OneHotEncoder(drop='first', categories='auto', sparse=False)
oh_data = ohe.fit_transform(bb[['season']])

# Column names for the encoded data: every season except the dropped first one
column_names = [str(colname) for colname in ohe.categories_[0][1:]]

bb_ohe = pd.DataFrame(oh_data, columns=column_names)
bb_ohe = pd.concat([bb.reset_index(drop=True), bb_ohe], axis='columns')

bb_ohe.head()

Unnamed: 0,season,team,pa,hit_hr,runs_scored,rbi,hit_bb_rate,hit_k_rate,iso,bat_avg,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,2006,ARI,6330,160,773,743,0.08,0.152,0.157,0.267,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2006,ATL,6284,222,849,818,0.084,0.186,0.184,0.27,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2006,BAL,6240,164,768,727,0.076,0.141,0.146,0.277,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2006,BOS,6435,192,820,777,0.104,0.164,0.166,0.269,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2006,CHC,6147,166,716,677,0.064,0.151,0.154,0.268,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [89]:
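# Copy the list so that appending to reg_cols later does not alter column_names
reg_cols = list(column_names)
print(reg_cols)
#reg_cols.append('wrc_plus')
# Regress team home runs on the season indicators (2006 is the baseline)
lr = LinearRegression().fit(bb_ohe[reg_cols], bb_ohe['hit_hr'])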

print(lr.score(bb_ohe[reg_cols], bb_ohe['hit_hr']))
print(lr.coef_)

['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
0.3286218162207907
[-14.3        -16.93333333 -11.46666667 -25.76666667 -27.8
 -15.06666667 -24.16666667 -40.         -15.9          7.46666667
  23.96666667   6.63333333  46.33333333]
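
With only season indicators as predictors, the fitted values have a direct reading: the intercept is the mean of `hit_hr` in the dropped baseline season (2006), and each coefficient is that season's mean minus the baseline mean. The 46.33 for 2019 therefore says teams averaged about 46 more home runs in 2019 than in 2006, and the score of about 0.33 says season alone accounts for roughly a third of the variation in team home runs. A minimal sketch with made-up numbers, using NumPy least squares rather than scikit-learn (our substitution), illustrates the identity:

```python
import numpy as np
import pandas as pd

# Made-up home run totals for three seasons, three teams each.
toy = pd.DataFrame({
    "season": [2006] * 3 + [2007] * 3 + [2008] * 3,
    "hit_hr": [160, 170, 180, 150, 155, 160, 190, 200, 210],
})

# Dummy columns with the first season (2006) dropped as the baseline.
dummies = pd.get_dummies(toy["season"], drop_first=True).astype(float)

# Design matrix: an intercept column plus the season dummies.
X = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
y = toy["hit_hr"].to_numpy(dtype=float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Compare the fit with differences in season means.
means = toy.groupby("season")["hit_hr"].mean()
mean_diffs = (means - means.loc[2006]).to_numpy()[1:]

print("intercept:", beta[0], "| baseline mean:", means.loc[2006])
print("coefficients:", beta[1:])
print("mean differences:", mean_diffs)
```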


### Data Limitations

- One limitation is an inability to control for park factors within this data. For example, the Colorado Rockies play in Denver, which is at altitude, and thus ball movement is affected. Anecdotally, this means that hitters tend to display more power in Colorado and pitchers tend to struggle more. Data in other research supports this, but it will be interesting to see how that shows up in our analysis. Depending on what statistics we use, it is possible that these stats will balance out within the team data, but looking at isolated hitting and pitching stats we would see more extremes since the Rockies play half their games in their own ballpark. Less extreme park effects may also come into play since some parks are termed "pitcher" parks and others are referred to as "hitter" parks (["Ranking MLB's Most Hitter-Friendly Ballparks, by the Numbers"](https://bleacherreport.com/articles/2022901-ranking-mlbs-most-hitter-friendly-ballparks-by-the-numbers)). We may want to analyze our data in such a way as to see how justified these terms are when applied to certain teams. A team like the Brewers, who play at Miller Park (considered a good hitter park), might want to focus more on developing a strong offense, whereas a team in a pitcher park might want to develop a strong rotation. Or perhaps, since these effects impact both teams in a game, it wouldn't matter. This might mean that some teams should take our conclusions with a grain of salt if their home ballpark is better suited for hitters or pitchers. Regardless, it seems that park factors may limit our ability to distinguish some trends across teams. However, we will use wRC+ in our analysis, which accounts for external factors like ballpark. This may help us reduce issues with this limitation.

- We also do not have any data about strength of schedule. This means that we will not be able to account for offense being suppressed because a team had to play against a particularly strong pitching rotation more often than a weak one. To give an example from this year, the San Diego Padres will have to play a lot more against the Los Angeles Dodgers, who have one of the best starting rotations in MLB right now, than they will against the Baltimore Orioles, who do not have that great of a starting rotation at the moment. The same would apply to pitchers facing particularly formidable lineups. That being said, we are looking at a lot of teams and MLB does try to craft fair schedules for teams, so we hope that this balances out across our data. Nevertheless, that is an assumption we are making.

- A third limitation is related to the limitation above about strength of schedule, but this one stems from the nature of how MLB schedules are built in terms of the teams that one plays (strength of those teams aside). Teams play a plurality of their games within their own division and a majority of games (including that plurality) within their own league ([wikipedia.org/wiki/Major_League_Baseball_schedule](https://en.wikipedia.org/wiki/Major_League_Baseball_schedule)). MLB is divided into two leagues, the American League (AL) and the National League (NL). The leagues are essentially the same, but in the AL they have a designated hitter (DH) who bats for the pitcher, whereas in the NL the pitcher still bats (for now; this will probably change in the coming seasons). This means that on average we would expect AL teams to probably net more offense, but we should still be able to get an idea of the relative strength of offense and defense within a team. Also, although this effect is noticeable, it should not be so big that we cannot say anything about the importance of offense vs. pitching vs. fielding for an MLB team. When an AL team plays an NL team, they abide by the same rules (i.e., they both have a DH or neither does) depending on who the home team is. That means that there is never any disparity within a game. That being said, we still may see AL teams tend to have more offense and NL teams tend to have more dominant pitching. Also, within the trends of individual teams, this will not matter, with the exception of the Houston Astros, who switched to the AL in 2013 ([wikipedia.org/wiki/Houston_Astros](https://en.wikipedia.org/wiki/Houston_Astros)). Overall, this effect should not be overpowering and we should still be able to tell whether offense or defense is more valuable. Additionally, we will use wRC+ in our analysis, which accounts for external factors like league. This may help us reduce issues with this limitation.

- Another limitation may be our data size. We do have hundreds of data points, which seems reasonable to draw conclusions from, and each data point is the aggregation of around 6000 plate appearances or approximately 25 players' performances from the season (that's the size of the active roster at any given moment in MLB). Nevertheless, we are limited by the fact that MLB only has 30 teams, so within a season we can only get 30 data points, and we wanted to stick with a certain era (in this case the post-steroid era), which spans only so many years.

## Source Code

## Acknowledgements