# Rubric

What is expected:

- Work and Process Shown in a Jupyter Notebook
- Last Markdown Cell as a Formal Writeup
    - Expectation of 1-2 page paper, 400-500 words

Graded on basis of the quality of:

- Statistical Analysis
- Paper
- Presentation

Points Breakdown

- **10**
    - Clear Layout of Work
    - Inclusion of the 1-2 page writeup
    - Formatting with Brief Description of What Each Cell is Doing (commenting)
        - Markdown Cell above each Coding Cell
        - Commenting within Each Cell
- **25** (Writeup - Each of the Following will Receive a Grade of 0-5)
    - Context and Purpose of Writing
    - Content Development
    - Sources and Evidence
    - Explanation of Statistical Analyses
    - Syntax and Mechanics
- **65**
    - (0-10): Definition of Problem/Question of Interest
    - (0-10): Proposal of Solution(s) to Answer the Question
    - (0-25): Implementing Solution(s) to Answer the Question
    - Each of the 5 *Techniques* Above will be Broken Into 0-5 Subscores
    - (0-20): Evaluation of Outcomes, Interpretation of Results, etc.

# Outline of Writeup

- Introduction/Background
    - Why are you interested in this problem?
    - What is the relevant background information for readers to understand your project? (assume readers are not experts in this application field)
    - Is there any prior research on this topic that might be helpful for the audience?
    - Where did the data come from? Is this an experiment or observational study? Who collected the data? Why was the data collected?
    - What are the questions of interest that we hope to answer?
- Methods/Results (experimental design and data collection)
    - How did you obtain this data?
    - Describe the exploratory data analysis methods. What needed to be done to the dataset to make it amenable to analysis?
    - What analyses are most appropriate to answer the questions of interest?
    - Describe the analyses used. Check assumptions!
    - Present relevant graphics and interpret results.
    - Explicitly connect your technical (i.e. statistical, mathematical) results to the research questions.
- Conclusion
    - What are the conclusions? What was learned?
    - How could we extend this research? What future research ideas come to mind based on the results and experience with this analysis?

# The Data

- Team
- League
- Year
- Runs Scored (RS)
- Runs Allowed (RA)
- Wins (W)
- On-Base Percentage (OBP)
- Slugging Percentage (SLG): total bases divided by at-bats
    - total bases: hits weighted by the bases they are worth, i.e. singles (1), doubles (2), triples (3), and home runs (4)
- Batting Average (BA)
- Playoffs (binary)
- RankSeason
- RankPlayoffs
- Games Played (G)
- Opponent On-Base Percentage (OOBP)
- Opponent Slugging Percentage (OSLG)

# Background and Research

Baseball, also known as "America's Pastime", is an extremely popular sport in which team and player statistics have been tracked and studied in considerable depth. When it comes to baseball statistics, Bill James is generally considered one of the pioneers of this field, and even coined the term *Sabermetrics* to describe the empirical analysis of the sport [4](https://en.wikipedia.org/wiki/Sabermetrics). The love for the game and the statistical phenomena subsequently observed have even inspired movies such as Moneyball [5](https://en.wikipedia.org/wiki/Moneyball_(film)), starring Brad Pitt, Jonah Hill, and Philip Seymour Hoffman.

In this paper, we'll attempt to analyze and model important baseball statistics. Specifically, we'll use **generalized linear models** to explore what metrics help predict whether a team made the playoffs or even won the World Series, **multiple linear regression** to explore relationships between commonly measured team statistics, and **simple linear regression** to confirm a widely used result directly from the derivations of *Sabermetrics* known as the **Pythagorean Theorem of Baseball** [3](https://www.baseball-reference.com/bullpen/Pythagorean_Theorem_of_Baseball).

The dataset we'll be using is found on Kaggle, and provides 50 years of commonly tracked team statistics [6](https://www.kaggle.com/datasets/wduckett/moneyball-mlb-stats-19622012).

Making the playoffs and winning the series are great achievements, but a team's success is also frequently measured by:

- **Run Differential** (most telling): subtracting the number of runs allowed from the number of runs scored [1](https://www.samford.edu/sports-analytics/fans/2022/MLB-Winning-Percentage-Breakdown-Which-Statistics-Help-Teams-Win-More-Games)
    - positive: a strong team which outscores its opponents
    - negative: a team that is struggling to keep up with its opponents
- **Winning Percentage**: total wins divided by total games [1](https://www.samford.edu/sports-analytics/fans/2022/MLB-Winning-Percentage-Breakdown-Which-Statistics-Help-Teams-Win-More-Games)
- **On-Base Percentage**: how frequently a batter reaches base per plate appearance [2](https://sarahesult.medium.com/common-mlb-statistics-which-stats-determine-a-teams-win-percentage-a6e0a83aa07c)
- **Slugging Percentage**: total number of bases a team accumulates per at-bat [2](https://sarahesult.medium.com/common-mlb-statistics-which-stats-determine-a-teams-win-percentage-a6e0a83aa07c)
    - total bases: hits weighted by the bases they are worth, i.e. singles (1), doubles (2), triples (3), and home runs (4)
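As a small worked example of the slugging percentage definition above, the sketch below weights each hit type by its bases and divides by at-bats. The function names and the season line are made up for illustration and are not taken from the dataset.

```python
def total_bases(singles, doubles, triples, home_runs):
    """Weight each hit by the number of bases it is worth."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """Total bases divided by at-bats."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return total_bases(singles, doubles, triples, home_runs) / at_bats


# Made-up season line, for illustration only:
# 100 singles, 30 doubles, 5 triples, 20 home runs in 550 at-bats.
example_tb = total_bases(100, 30, 5, 20)
example_slg = slugging_percentage(100, 30, 5, 20, 550)
print(example_tb, round(example_slg, 3))
```

Note that the dataset only provides the final `SLG` value per team and season, so this calculation is shown purely to make the definition concrete.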

As we were exploring the data, we came across some limitations and issues which slightly altered the trajectory of the initial research questions we were planning to pursue.

The metrics Opponent On-Base Percentage (`OOBP`) and Opponent Slugging Percentage (`OSLG`) weren't captured before the 1999 season.

This gave us two ways to proceed:

- drop the rows and use 1999-2012 seasons
- drop the columns and use all seasons

Baseball tactics and training have changed significantly through the years, especially over the 50-year stretch the dataset covers. So, while using all available seasons gives us more observations to train models, we may obtain a more accurate model by focusing on the more recent years.

This gave us some potential research questions.

1. Are more closely related seasons (in time) better indicators of each other?
2. Which features are pertinent for a model? Do we even need `OOBP` and `OSLG`?

There are some slight issues preventing us from using other features or predicting certain events.

1. We don't have data from points throughout the season, so we don't have the capability to predict where a team might finish given their status during the season.
2. Tracking a team from one season to the next can be difficult, not so much because of the volatility of players or coaches, but because of historical changes within franchises (creation, dissolution, name changes, city changes, etc.).

> Due to this, we ultimately ended up dropping the `Team` and `Year` columns. Additionally, we dropped the `League` column to allow for more flexibility in our models as there are an equal number of teams which make the playoffs from each league each season.

An exciting metric discovery we made was finding **The Pythagorean Theorem of Baseball**, which incidentally could be explored through statistics available in our dataset.

The theorem is simply stated as the following:

Let

$R_s$: Runs Scored

$R_a$: Runs Allowed

$W_p$: Winning Percentage

Then, there were two formulas presented:

The more *common*:

$W_p = \frac{R_s^2}{R_s^2 + R_a^2}$

The more *accurate*:

$W_p = \frac{R_s^{1.81}}{R_s^{1.81} + R_a^{1.81}}$

With both a *common* and an *accurate* version available, it was only natural for us to attempt to confirm this result with our data.
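To make the two formulas concrete, here is a minimal sketch that evaluates both exponents for the same team. The function name `pythagorean_wp` and the run totals are hypothetical and chosen only for illustration.

```python
def pythagorean_wp(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage: R_s^k / (R_s^k + R_a^k)."""
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    if runs_scored == 0 and runs_allowed == 0:
        raise ValueError("at least one run total must be positive")
    rs_k = runs_scored ** exponent
    ra_k = runs_allowed ** exponent
    return rs_k / (rs_k + ra_k)


# Hypothetical team: 800 runs scored, 700 runs allowed.
wp_common = pythagorean_wp(800, 700)            # exponent 2
wp_accurate = pythagorean_wp(800, 700, 1.81)    # exponent 1.81
print(round(wp_common, 4), round(wp_accurate, 4))
```

For a team that outscores its opponents, the smaller exponent pulls the expected winning percentage slightly closer to 0.5, which is the only difference between the two versions.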

# Research Questions

With everything presented, our overarching goals are:

1. Create a model to predict whether or not a team made the playoffs.
2. Create a model to predict whether or not a team won the series.
3. Create a model to predict other variables, previously used as features, that are seen as important metrics:
    - Winning Percentage
    - On-Base Percentage
    - Slugging Percentage
4. Create a model which predicts the best exponents in the Pythagorean Theorem of Baseball.
5. Explore the break between seasons which tracked `OOBP` and `OSLG`.

# Long Conclusion

## 1. Introduction and Background


Baseball, also known as "America's Pastime", is an extremely popular sport in which team and player statistics have been tracked and studied in considerable depth. When it comes to baseball statistics, Bill James is generally considered one of the pioneers of this field, and even coined the term *Sabermetrics* to describe the empirical analysis of the sport [4](https://en.wikipedia.org/wiki/Sabermetrics). The love for the game and the statistical phenomena subsequently observed have even inspired movies such as Moneyball [5](https://en.wikipedia.org/wiki/Moneyball_(film)), starring Brad Pitt, Jonah Hill, and Philip Seymour Hoffman.

In this paper, we'll attempt to analyze and model important baseball statistics. Specifically, we'll use **generalized linear models** to explore what metrics help predict whether a team made the playoffs or even won the World Series, **multiple linear regression** to explore relationships between commonly measured team statistics, and **simple linear regression** to confirm a widely accepted result directly from the derivations of *Sabermetrics* known as the **Pythagorean Theorem of Baseball** [3](https://www.baseball-reference.com/bullpen/Pythagorean_Theorem_of_Baseball).

The dataset we'll be using is found on Kaggle, and provides 50 years of commonly tracked team statistics [6](https://www.kaggle.com/datasets/wduckett/moneyball-mlb-stats-19622012).

Making the playoffs and winning the series are great achievements, but a team's success is also frequently measured by:

- **Winning Percentage**: total wins divided by total games [1](https://www.samford.edu/sports-analytics/fans/2022/MLB-Winning-Percentage-Breakdown-Which-Statistics-Help-Teams-Win-More-Games)
- **On-Base Percentage**: how frequently batters reach a base per plate appearance [2](https://sarahesult.medium.com/common-mlb-statistics-which-stats-determine-a-teams-win-percentage-a6e0a83aa07c)
- **Slugging Percentage**: total number of bases a team accumulates per at-bat [2](https://sarahesult.medium.com/common-mlb-statistics-which-stats-determine-a-teams-win-percentage-a6e0a83aa07c)
    - total bases: hits weighted by the bases they are worth, i.e. singles (1), doubles (2), triples (3), and home runs (4)
    
Baseball tactics and training have changed significantly through the years, especially over the 50-year stretch the dataset covers. So, while using all available seasons gives us more observations to train models, we may obtain a more accurate model by focusing on the more recent years.

There are some slight issues preventing us from using other features or predicting certain events.

1. We don't have data from points throughout the season, so we don't have the capability to predict where a team might finish given their status during the season.
2. Tracking a team from one season to the next can be difficult, not so much because of the volatility of players or coaches, but because of historical changes within franchises (creation, dissolution, name changes, city changes, etc.).

An exciting metric discovery we made was finding **The Pythagorean Theorem of Baseball**, which incidentally could be explored through statistics available in our dataset.

The theorem is simply stated as the following:

Let

$R_s$: Runs Scored

$R_a$: Runs Allowed

$W_p$: Winning Percentage

Then, there were two formulas presented:

The more *common*:

$W_p = \frac{R_s^2}{R_s^2 + R_a^2}$

The more *accurate*:

$W_p = \frac{R_s^{1.81}}{R_s^{1.81} + R_a^{1.81}}$

With both a *common* and an *accurate* version available, it is only natural for us to attempt to confirm this result and find the *best* method with our data.

With everything presented, our overarching **research goals** are:

1. Create a model to predict whether or not a team made the playoffs.
2. Create a model to predict whether or not a team won the series.
3. Create a model to predict important metrics, previously used as features:
    - Winning Percentage
    - On-Base Percentage
    - Slugging Percentage
4. Create a model which predicts the best exponents in the Pythagorean Theorem of Baseball.
5. Explore the break between all seasons and the more *recent* seasons which tracked `OOBP` and `OSLG`.


## 2. Methods and Results


### Data Preparation


After exploring the data, the first main concern was the null values.

The first set of null values was associated with `RankSeason` and `RankPlayoffs`. These values were null when a team failed to make the playoffs. We decided to turn both of these columns into booleans: either a team made the playoffs or not, and either a team won the World Series or not.

The next set of null values came from `OOBP` and `OSLG`. These statistics simply weren't tracked prior to the 1999 season. To proceed, we could either drop the columns and use the entirety of the data, or drop the rows for seasons prior to 1999 and use a reduced dataset with more features. Ultimately, we went with both options. This let us test whether `OOBP` and `OSLG` had a significant effect on different response variables, while also letting us explore differences between models built on historic data (labeled *full* in the analysis) and recent data (labeled *recent* in the analysis).

Our last step in making the data amenable was to combine wins (`W`) with games (`G`) into winning percentage (`WP`), and runs scored with runs allowed to create run differential (`RD`). The individual columns were dropped after this to prevent collinearity.
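The preparation steps above can be sketched without any dependencies as follows. This is an illustration, not the notebook's actual code: the helper names, the output column names `MadePlayoffs` and `WonSeries`, and the two sample rows are hypothetical, and treating `RankPlayoffs == 1` as a World Series win is an assumption about how the dataset encodes that column.

```python
def prepare_row(row):
    """Derive the modeling columns from one raw team-season record."""
    return {
        "OBP": row["OBP"],
        "SLG": row["SLG"],
        "BA": row["BA"],
        "OOBP": row["OOBP"],
        "OSLG": row["OSLG"],
        # Winning percentage replaces W and G.
        "WP": row["W"] / row["G"],
        # Run differential replaces RS and RA.
        "RD": row["RS"] - row["RA"],
        # Null rank means the team missed the playoffs.
        "MadePlayoffs": row["RankSeason"] is not None,
        # Assumption: a playoff rank of 1 marks the World Series winner.
        "WonSeries": row["RankPlayoffs"] == 1,
    }


def split_full_and_recent(rows):
    """Return (full, recent): all seasons without OOBP/OSLG, and only
    the seasons where OOBP/OSLG were tracked."""
    prepared = [prepare_row(r) for r in rows]
    full = [
        {k: v for k, v in r.items() if k not in ("OOBP", "OSLG")}
        for r in prepared
    ]
    recent = [
        r for r in prepared
        if r["OOBP"] is not None and r["OSLG"] is not None
    ]
    return full, recent


# Two made-up records using the dataset's column names.
raw_rows = [
    {"Team": "AAA", "League": "AL", "Year": 2005, "RS": 800, "RA": 700,
     "W": 95, "G": 162, "OBP": 0.340, "SLG": 0.430, "BA": 0.270,
     "Playoffs": 1, "RankSeason": 2, "RankPlayoffs": 1,
     "OOBP": 0.320, "OSLG": 0.400},
    {"Team": "BBB", "League": "NL", "Year": 1975, "RS": 650, "RA": 720,
     "W": 72, "G": 162, "OBP": 0.315, "SLG": 0.380, "BA": 0.250,
     "Playoffs": 0, "RankSeason": None, "RankPlayoffs": None,
     "OOBP": None, "OSLG": None},
]

full_data, recent_data = split_full_and_recent(raw_rows)
print(len(full_data), len(recent_data))
```

In a notebook the same steps would more typically be written with a dataframe library; plain dictionaries are used here only to keep the sketch self-contained.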


### Data Modeling


We used logistic regression, a form of generalized linear modeling, to create models to predict playoffs and championship across historic and recent data. We created visuals for each feature against either playoffs or championship to evaluate whether there was any pattern, or strong break, between the value of the feature and the outcome. Candidate models came from backward selection using p-values from individual t-tests, with AIC and MSPE computed for each candidate. To select the best model, we used AIC, MSPE, chi-squared tests, and partial F-tests. To evaluate the best model, we used contingency tables (and accuracy) and predicted vs. observed plots.
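The contingency-table evaluation mentioned above can be illustrated with a short sketch. It assumes a fitted logistic model has already produced a predicted probability for each team-season; the 0.5 cutoff, the function names, and the sample values are hypothetical choices, not results from our analysis.

```python
def contingency_table(observed, predicted_probs, threshold=0.5):
    """Count true/false positives and negatives at a probability cutoff."""
    table = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, prob in zip(observed, predicted_probs):
        predicted = prob >= threshold
        if predicted and actual:
            table["tp"] += 1
        elif predicted and not actual:
            table["fp"] += 1
        elif actual:
            table["fn"] += 1
        else:
            table["tn"] += 1
    return table


def accuracy(table):
    """Share of observations classified correctly."""
    total = sum(table.values())
    return (table["tp"] + table["tn"]) / total


# Made-up outcomes (1 = made playoffs) and model probabilities.
observed = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]

table = contingency_table(observed, probs)
print(table, round(accuracy(table), 3))
```

Because relatively few teams make the playoffs, and far fewer win the series, accuracy alone can look high even for a weak model, which is one reason to read it alongside the full table.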

To examine which predictors were significant for the response variables `WP`, `OBP`, and `SLG`, we used multiple linear regression. We used forward selection, choosing the best fit at each model size by SSE, for each of the response variables across the historic and recent datasets. We created visuals for each feature against each of the response variables to visualize any strong linear correlations, and then used a combination of AIC, BIC, MSPE, $R^2_a$, partial F-tests, and full F-tests for model selection. For each best model, we used diagnostic plots to confirm the assumptions required for linear regression. Every assumption that can be checked with those plots held for all models. The last assumption we tested was the absence of collinearity, using VIF. Although we used a litany of selection methods, two of our models showed signs of collinearity.
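For reference, the sketch below shows one common way to compute AIC, BIC, and $R^2_a$ for a linear model from its SSE. The formulas are the usual Gaussian forms up to an additive constant, so the absolute values may differ from what a statistics package reports, although comparisons between models fit to the same data are unaffected. The function name and the numbers are hypothetical.

```python
import math


def selection_criteria(sse, sst, n, p):
    """AIC, BIC, and adjusted R^2 for a linear model.

    sse: residual sum of squares, sst: total sum of squares,
    n: number of observations, p: number of predictors (no intercept).
    """
    k = p + 1  # coefficients including the intercept
    aic = n * math.log(sse / n) + 2 * k
    bic = n * math.log(sse / n) + math.log(n) * k
    adj_r2 = 1 - (sse / (n - k)) / (sst / (n - 1))
    return {"AIC": aic, "BIC": bic, "adj_R2": adj_r2}


# Made-up comparison: three extra predictors barely reduce SSE.
small = selection_criteria(sse=12.0, sst=40.0, n=100, p=2)
large = selection_criteria(sse=11.9, sst=40.0, n=100, p=5)
print(small)
print(large)
```

In this made-up case all three criteria prefer the smaller model, since the drop in SSE does not offset the penalty for the additional predictors.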

In an attempt to confirm the **Pythagorean Theorem of Baseball**, we rewrote the formula in a form suitable for simple linear regression, applying log-transformations to our data before building the model. The estimated factor fell within our expected range, and its t-test p-value indicated significance. Using diagnostic plots, we confirmed that the model met the assumptions for linear regression.
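One common way to linearize the formula, shown here as an assumption rather than the exact transformation used in the notebook, is to note that $\frac{W_p}{1 - W_p} = \left(\frac{R_s}{R_a}\right)^k$, so that $\log\frac{W_p}{1 - W_p} = k \log\frac{R_s}{R_a}$ and the exponent $k$ is the slope of a simple linear regression. The sketch below fits that slope by ordinary least squares on synthetic data generated with a known exponent; the function name and the data are hypothetical.

```python
import math


def fit_pythagorean_exponent(runs_scored, runs_allowed, win_pct):
    """Least-squares fit of log(WP / (1 - WP)) on log(RS / RA).

    Returns (slope, intercept); the slope estimates the exponent.
    """
    x = [math.log(rs / ra) for rs, ra in zip(runs_scored, runs_allowed)]
    y = [math.log(wp / (1 - wp)) for wp in win_pct]
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Synthetic check: generate winning percentages with exponent 1.8
# and confirm the regression recovers it.
true_k = 1.8
rs = [800, 700, 650, 900]
ra = [700, 750, 650, 600]
wp = [s ** true_k / (s ** true_k + a ** true_k) for s, a in zip(rs, ra)]

slope, intercept = fit_pythagorean_exponent(rs, ra, wp)
print(round(slope, 4), round(intercept, 4))
```

With real team data the points scatter around the line, so the fitted slope comes with a standard error and the intercept should be checked to be near zero.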

Aside from the **Pythagorean Theorem of Baseball**, we tested the historic and recent datasets along with our models. For the generalized linear models, the model predicting champions was very similar between the subsets. Playoffs had different models, but the best model using the recent dataset did not consist of the features only available in that dataset. For the multiple linear regression, the models differed significantly across the datasets and some of the models used features only available in the recent dataset. Overall, there was evidence suggesting that either timeframe or new features could affect the model results.


## 3. Conclusions


We performed an in-depth analysis testing multiple predictors across different parameters, and created decent models confirmed both visually and statistically. Although we didn't have the amount of data needed to perform analyses such as those referenced in Moneyball, even this amount of data can help us glimpse the coveted patterns Sabermetrics was founded on. We could have taken this research further by examining more subsets of years to gain a better understanding of how the data and the game were changing over time. Perhaps modeling each decade separately would reveal more information. Furthermore, a more useful analysis would attempt to model future performance based on a team's past performance or, if the data were available, predict how a team finishes a season given data from points throughout the season.

# Revised Conclusion

## 1. Introduction and Background

In this paper, we take our own swing at baseball analytics, exploring metrics that predict playoff appearances, World Series victories, and overall team performance [1](https://www.samford.edu/sports-analytics/fans/2022/MLB-Winning-Percentage-Breakdown-Which-Statistics-Help-Teams-Win-More-Games) [2](https://sarahesult.medium.com/common-mlb-statistics-which-stats-determine-a-teams-win-percentage-a6e0a83aa07c).

Bill James, a pioneer of baseball analytics, coined the term “Sabermetrics” to describe this overall field [4](https://en.wikipedia.org/wiki/Sabermetrics). One result we'll try to confirm within this paper is the **Pythagorean Theorem of Baseball** [3](https://www.baseball-reference.com/bullpen/Pythagorean_Theorem_of_Baseball).

Our **research goals** are:

1. Predict whether or not a team made the playoffs.
2. Predict whether or not a team won the series.
3. Predict performance metrics:
    - Winning Percentage
    - On-Base Percentage
    - Slugging Percentage
4. Confirm the Pythagorean Theorem of Baseball.
5. Explore the break between seasons tracking `OOBP` and `OSLG`.


## 2. Methods and Results

### Data Preparation

Our dataset spans 50 years of team statistics [6](https://www.kaggle.com/datasets/wduckett/moneyball-mlb-stats-19622012).

**Null Values**

- `RankSeason`, `RankPlayoffs`: change to booleans representing playoff appearances and World Series wins.
- `OOBP`, `OSLG`: weren't tracked prior to the 1999 season; use to perform comparative analysis.

**Data Transformation**

- Combine wins (`W`) and games (`G`) to compute winning percentage.
- Calculate run differential (`RD`) with runs scored (`RS`) and runs allowed (`RA`).
- Drop individual columns to prevent collinearity.

### Data Modeling

**Generalized Linear Models**: predict playoff appearances and World Series wins.

We created visuals to reveal any binary patterns. Candidate models came from backward selection using p-values from individual t-tests, with AIC and MSPE computed for each. To select the best model, we used AIC, MSPE, chi-squared tests, and partial F-tests. To evaluate the best model, we used contingency tables (and accuracy) and predicted vs. observed plots. We tested for collinearity through VIF.

**Multiple Linear Regression**: explore relationships between team stats.

We used forward selection, choosing the best fit at each model size by SSE. We created visuals to reveal any linear correlations, and then used a combination of AIC, BIC, MSPE, $R^2_a$, partial F-tests, and full F-tests for model selection. For each best model, we used diagnostic plots to confirm the assumptions required for linear regression. Every assumption that can be checked with those plots held for all models. We tested for collinearity through VIF. Two of our *best* models showed signs of collinearity.

**Simple Linear Regression**: validate the Pythagorean Theorem of Baseball.

We rewrote the equation in log form, and then applied log-transformations to our data prior to building the model. The resulting estimate was acceptable as confirmation of the theorem. Using diagnostic plots, we confirmed the model itself met the assumptions for linear regression.

**Historic vs. Recent**

There was evidence suggesting that either timeframe or new features could affect the model results.


## 3. Conclusion

In summary, our analysis sheds light on patterns that Sabermetrics was built upon, even with comparatively limited data. The historic vs. recent analysis suggests further subsetting the years could provide information on how the game has changed over the years. Furthermore, an attempt to predict how a team finishes the season or performs the following season would ultimately be more useful.