<img src = '../Images/baseball-field.jpg' width = 1100/>

# A Formula for Success: Predicting How to Win in Baseball 
### Using Regression to Chart a Path to Victory


## I. Overview
Baseball is a sport of numbers and statistics. Because of this, there are all sorts of measurements of what happened in a game, as well as team and specific player performance. Over the years, and made more popular by fantasy sports and books/movies such as <i>Moneyball</i>, statistics have gotten even better at determining the value of players. These advanced statistics are commonly referred to as sabermetrics.

### The Problem

But what is the formula for a team's success? Obviously, it's winning, because a team must win in order to become a championship contender. And obviously, winning is accomplished by scoring more runs than the opponent. But if you were the general manager of a baseball franchise, you would want to go beyond that and find a more precise equation for fielding a consistently winning team. Based on this formula, a general manager should then be able to determine a player's value in terms of contributing to a team's chances of winning.

### Research Questions

Based on the problem above, this project focuses on answering the following questions:
* What features most significantly impact the winning percentage of a given Major League Baseball (MLB) team?
* For those given features, to what extent do they impact a team's winning percentage?
* Based on this, who has been the most valuable player in contributing to a team's winning percentage from 2017-2020?
    
By answering these questions, a general manager can build a roster to maximize its odds of winning. As such, a general manager could target particular players in free agency or through trades. Furthermore, since spending in baseball is limited by what a particular team can afford in its market, as well as by a tax beyond a particular salary threshold, a general manager could use this formula to determine whether a given player is an improvement over the player currently rostered at that position.

### Background Work
The book and movie <i>Moneyball</i> really brought sports analytics into the mainstream. As such, most of the work that can be found on GitHub and through Google uses a specific dataset to replicate the model built in the book. My approach differs in two ways: I use a different dataset drawn from more recent years, and that dataset includes advanced baseball statistics known as sabermetrics.

## II. Data Acquisition
To answer the research questions above, this project requires both team statistics to determine the equation for winning and player statistics to determine who has contributed the most to winning. To acquire this data, the MLB statistic repository Fangraphs.com is used. This data is pulled specifically from the [database search engine found here](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2020&month=0&season1=2020&ind=0). Fangraphs collects data daily and updates with the most recent baseball statistics. As the 2020 baseball season has come to a close, this particular repository of team and individual statistics was most recently updated in October and will not be updated until the 2021 season begins (hopefully).

To acquire the needed data, we need to run multiple requests, then export each of them. For the team stats, click on the "Team Stats" panel, select multiple seasons, split them, hit submit, then export. Do this for the Batting, Pitching, and Fielding tabs. We split the seasons because a given team will vary in its performance (and its roster) from year to year; treating each team-season as its own row adds to our sample size. To build an adequate sample size, we will pull data for each season beginning in 2006. For the individual stats, click on the "Player Stats" panel; because we are viewing individual players' stats over the past few seasons, we will select "Multiple Seasons" from 2017 to 2020.

While the years selected for this sample seem arbitrary, they are not. Baseball is known to have time periods that stylize play, commonly known as eras. From the late 1980s into most of the 2000s, power hitting had a surprising jolt due to the prevalence of steroids and other performance-enhancing drugs. Long-standing home run records were broken multiple times during this period. The 2006 season is when MLB began to test and discipline players for performance-enhancing drugs, and therefore is a good starting point for our dataset.

This process can easily be repeated by others who wish to replicate or customize the data pulled for their own usage. The datasets for team batting, fielding, and pitching also need to be combined; Notebook 1 shows how to do that, while this notebook shows only the resulting merged dataset.
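For readers who do not have Notebook 1 at hand, here is a rough sketch of what combining the three exports might look like. It is not the code from Notebook 1: the miniature dataframes and the choice of `Season` and `Team` as merge keys are assumptions made for illustration. It does show where suffixes such as `BABIP_x` (batting) and `BABIP_y` (pitching) would come from, since pandas appends `_x` and `_y` to overlapping column names by default.

```python
import pandas as pd

# Hypothetical miniature exports; the real FanGraphs CSVs have many more columns.
batting = pd.DataFrame({
    "Season": [2019, 2016],
    "Team": ["Astros", "Cubs"],
    "HR": [288, 199],
    "BABIP": [0.296, 0.302],
})
pitching = pd.DataFrame({
    "Season": [2019, 2016],
    "Team": ["Astros", "Cubs"],
    "ERA": [3.66, 3.15],
    "BABIP": [0.270, 0.255],
})
fielding = pd.DataFrame({
    "Season": [2019, 2016],
    "Team": ["Astros", "Cubs"],
    "RZR": [0.814, 0.853],
})

# Assumed merge keys: one row per team per season.
keys = ["Season", "Team"]

# Overlapping column names get pandas' default suffixes: _x (left) and _y (right).
combined = batting.merge(pitching, on=keys).merge(fielding, on=keys)

# Build an identifier such as "2019_Astros" as the first column.
combined.insert(0, "TeamYear", combined["Season"].astype(str) + "_" + combined["Team"])

print(combined.columns.tolist())
```

Merging on both keys keeps each team-season as a separate row, which matches the split-season export described above.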

### Dataset Details
Here is an overview of each dataset used in this project, starting with the team dataset:
* <b>TeamStats_Combined</b>
    * 145 KB as CSV / 225.1 KB in Pandas
    * 450 rows
    * 64 columns
        * Column datatypes:
            * float64: 32
            * int64: 25
            * object: 7
    * 870 null values

This dataset was altered via data cleaning and feature engineering to become:
* <b>TeamStats_Combined_Cleaned</b>
    * 84.8 KB as CSV / 66.9 KB in Pandas
    * 450 rows
    * 19 columns
        * Column datatypes:
            * float64: 16
            * int64: 1
            * object: 2
    * 0 null values
    
Here is the individual player statistics dataset:
* <b>PlayerStats_Batting</b>
    * 215 KB as CSV / 392.7 KB in Pandas
    * 2185 rows
    * 23 columns
        * Column datatypes:
            * float64: 12
            * int64: 7
            * object: 4
    * 1403 null values

### Supervised Learning Details
Because the goal of this project is to determine the factors which most contribute to winning in baseball, this is a supervised learning project. The target variable is an engineered feature called win percentage (W%). The features that proved most important in the final model were home runs (HR), batting average (AVG), on-base percentage (OBP), home runs allowed per nine innings (HR/9), batting average allowed on balls in play (BABIP_y), ground ball percentage (GB%), and home runs allowed per fly ball (HR/FB). For more information on each of these statistics, I strongly encourage you to view [FanGraphs' glossary of baseball statistics](https://library.fangraphs.com/getting-started/).
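As a minimal sketch of the kind of feature engineering involved, the snippet below derives a win percentage, a per-game home run rate, and a numeric walk rate. The column names `W` (wins) and `G` (team games played) are assumptions, as is the exact recipe; the actual cleaning steps live in Notebook 1. With these assumed inputs, the results line up with the W%, HR, and BB% values shown in the cleaned dataset.

```python
import pandas as pd

# Hypothetical raw rows; "W" and "G" are assumed column names for wins and team games.
raw = pd.DataFrame({
    "TeamYear": ["2019_Astros", "2016_Cubs"],
    "W": [107, 103],
    "G": [162, 162],
    "HR": [288, 199],
    "BB%": ["10.10%", "10.40%"],
})

clean = pd.DataFrame({"TeamYear": raw["TeamYear"]})

# Engineered target: share of games won.
clean["W%"] = raw["W"] / raw["G"]

# Counting stats scaled to a per-game rate so seasons of different lengths are comparable.
clean["HR"] = raw["HR"] / raw["G"]

# Percent strings such as "10.10%" converted to fractions such as 0.101.
clean["BB%"] = raw["BB%"].str.rstrip("%").astype(float) / 100

print(clean.round(6))
```

Scaling counting stats by games played matters here because the 2020 season was much shorter than a standard one.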

## III. Data Exploration
Because the focus is on determining which factors contribute to winning in baseball, almost the entire effort in data exploration was tailored to the team statistics. That will be the focus here as well.

In [16]:
# Import tools
import pandas as pd

# Pull dataframes of raw team stats, cleaned team stats, and player batting stats
df = pd.read_csv(r'~/Github/DATA601_FinalProject/Data/TeamStats_Combined.csv')
df2 = pd.read_csv(r'~/Github/DATA601_FinalProject/Data/TeamStats_Combined_Cleaned.csv')
df3 = pd.read_csv(r'~/Github/DATA601_FinalProject/Data/PlayerStats_Batting.csv')

def analyze(data):
    """Print the shape and total null count of a dataframe, then return its head."""
    print('Here is an overview of the dataset')
    print('The shape of the dataset is:', data.shape)
    print('The total null values in the dataset is:', (data.isnull().sum().sum()))
    print('Here is the head of the dataset:')
    return data.head()

Let's take a look at the original team stats dataset.

In [17]:
analyze(df)

Here is an overview of the dataset
The shape of the dataset is: (450, 64)
The total null values in the dataset is: 870
Here is the head of the dataset:


Unnamed: 0,TeamYear,Season,Team,G_x,PA,HR,R,RBI,SB,BB%,...,HR/9,BABIP_y,LOB%,GB%,HR/FB,EV_y,ERA,FIP,xFIP,WAR_y
0,2019_Astros,2019,Astros,2309,6394,288,920,891,67,10.10%,...,1.42,0.27,76.80%,43.60%,16.70%,88.1,3.66,3.98,3.8,23.7
1,2016_Cubs,2016,Cubs,2332,6335,199,808,767,66,10.40%,...,1.01,0.255,77.50%,46.90%,13.10%,88.3,3.15,3.77,3.74,18.8
2,2011_Yankees,2011,Yankees,2301,6306,222,867,836,147,9.90%,...,0.94,0.297,75.00%,44.20%,9.90%,,3.73,3.87,3.84,17.3
3,2015_Blue Jays,2015,Blue Jays,2337,6231,232,891,852,88,9.10%,...,1.08,0.278,72.70%,43.70%,11.00%,88.1,3.81,4.09,4.14,13.7
4,2011_Red Sox,2011,Red Sox,2269,6414,203,875,842,102,9.00%,...,0.96,0.285,70.80%,42.30%,9.10%,,4.2,4.05,4.13,14.3


Now let's view the cleaned team stats dataset.

In [18]:
analyze(df2)

Here is an overview of the dataset
The shape of the dataset is: (450, 19)
The total null values in the dataset is: 0
Here is the head of the dataset:


Unnamed: 0,TeamYear,Season,Team,W%,HR,BB%,ISO,BABIP_x,AVG,OBP,SLG,wOBA,BIZ,Plays,RZR,HR/9,BABIP_y,GB%,HR/FB
0,2019_Astros,2019,Astros,0.660494,1.777778,0.101,0.221,0.296,0.274,0.352,0.495,0.355,10.864198,8.839506,0.814,1.42,0.27,0.436,0.167
1,2016_Cubs,2016,Cubs,0.635802,1.228395,0.104,0.173,0.302,0.256,0.343,0.429,0.333,12.018519,10.253086,0.853,1.01,0.255,0.469,0.131
2,2011_Yankees,2011,Yankees,0.598765,1.37037,0.099,0.181,0.292,0.263,0.343,0.444,0.345,13.635802,11.320988,0.83,0.94,0.297,0.442,0.099
3,2015_Blue Jays,2015,Blue Jays,0.574074,1.432099,0.091,0.188,0.298,0.269,0.34,0.457,0.344,13.802469,11.462963,0.831,1.08,0.278,0.437,0.11
4,2011_Red Sox,2011,Red Sox,0.555556,1.253086,0.09,0.181,0.314,0.28,0.349,0.461,0.352,13.388889,11.296296,0.844,0.96,0.285,0.423,0.091


Lastly, let's view the player stats dataset.

In [19]:
analyze(df3)

Here is an overview of the dataset
The shape of the dataset is: (2185, 23)
The total null values in the dataset is: 1403
Here is the head of the dataset:


Unnamed: 0,Name,Team,G,PA,HR,R,RBI,SB,BB%,K%,...,OBP,SLG,wOBA,wRC+,EV,BsR,Off,Def,WAR,playerid
0,Mike Trout,Angels,441,1956,134,344,301,58,18.50%,19.90%,...,0.44,0.63,0.436,181.0,90.8,17.5,213.3,-2.0,27.8,10155
1,Mookie Betts,- - -,494,2278,101,412,301,82,12.20%,13.60%,...,0.386,0.538,0.387,141.0,90.5,25.0,139.8,35.3,25.4,13611
2,Anthony Rendon,- - -,481,2080,92,315,349,14,12.40%,13.50%,...,0.399,0.55,0.397,146.0,90.3,7.0,129.4,30.1,22.6,12861
3,Jose Ramirez,Indians,496,2139,108,330,317,85,11.30%,12.40%,...,0.367,0.549,0.381,137.0,88.8,20.7,118.4,18.6,21.3,13510
4,Christian Yelich,- - -,491,2173,110,357,310,72,12.60%,21.40%,...,0.394,0.547,0.394,147.0,92.3,19.9,148.5,-14.0,20.7,11477


Since our target variable is win percentage, let's take a look at that more closely. First, let's view the distribution of win percentage:

<img src = '../Images/WPHistogram.png' width = 650/>

We see that it is approximately normally distributed, which is good. But let's also take a look at it separated by team.

<img src = '../Images/BoxPlotWPTeam.png' width = 900/>

Some interesting trends we see here:
* Most teams have had a wide range of win percentages, indicating they had some good seasons and some bad seasons.
* The Yankees, Dodgers, and Cardinals were more consistent, as they have smaller spreads than most teams.
    * They also were consistently good teams, as they rarely had a win percentage below 0.5. The Dodgers even had an outlier season with a win percentage above 0.7, which is extraordinarily good.
* The Astros, Rays, Tigers, and Orioles have very large spreads, indicating their season win percentage has had more variability.

Let's look at the relationships between all the features from the original dataset:

<img src = '../Images/Heatmap1.png' width = 900/>

That correlation heatmap looks pretty busy, but there are a few key takeaways. We see a number of strong correlations between features here, such as slugging percentage (SLG) with runs (R) and runs batted in (RBI), which makes sense. If you hit the ball well and get on base, you'll score more runs. We see a strong negative correlation between strikeouts per nine innings (K/9) and both balls in zone (BIZ) and plays made (Plays). This also makes sense, as a team that strikes opposing batters out will have to field fewer balls hit by opposing batters.

In terms of correlations with win percentage (W%), we see the strongest positive correlation with offensive wins above replacement (WAR_x). This is expected, because if your players are batting farther above a given level (an average replacement player), you will likely have more wins. We see the strongest negative correlation with earned run average (ERA). This is also expected, because if you allow opposing teams to score fewer runs against you, you have a better chance of outscoring them for a win. This information will likely come in handy when we build our models, although features that are strongly correlated with one another may also present multicollinearity problems.
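To make the heatmap reading concrete, here is a small sketch that ranks features by their correlation with W% and flags highly correlated feature pairs. It uses only the five rows and a few columns shown in the head of the cleaned dataset above, so the numbers it prints are illustrative and not the project's actual correlations. The 0.8 cutoff for flagging pairs is an arbitrary assumption.

```python
import pandas as pd

# The five rows shown in the head of the cleaned dataset (subset of columns).
sample = pd.DataFrame({
    "W%": [0.660494, 0.635802, 0.598765, 0.574074, 0.555556],
    "wOBA": [0.355, 0.333, 0.345, 0.344, 0.352],
    "OBP": [0.352, 0.343, 0.343, 0.340, 0.349],
    "HR/9": [1.42, 1.01, 0.94, 1.08, 0.96],
})

# Pearson correlation matrix over the numeric columns.
corr = sample.select_dtypes("number").corr()

# Correlation of each feature with the target, ordered by absolute strength.
target_corr = corr["W%"].drop("W%")
target_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(target_corr)

# Feature pairs whose absolute correlation exceeds an assumed cutoff hint at multicollinearity.
THRESHOLD = 0.8
features = corr.drop(index="W%", columns="W%")
names = list(features.columns)
high_pairs = [
    (a, b, round(features.loc[a, b], 3))
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(features.loc[a, b]) > THRESHOLD
]
print(high_pairs)
```

Run against the full team dataframe instead of this sample, the same two steps give a text version of what the heatmap shows.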

For more visualization, particularly with other features, please see Notebook 1.

## IV. Data Preparation

As you can see above, our original dataset had many features. To simplify, knowing we would be employing regression models, we used recursive feature elimination (RFE). RFE essentially fits a model with every feature, ranks the features by importance, drops the weakest, and repeats until the desired number of features remains. This allows us, through a process of elimination, to more easily narrow down the features that maximize the strength of a linear regression model while minimizing its cost.
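The snippet below is a minimal sketch of RFE, assuming scikit-learn's `RFE` wrapped around a `LinearRegression` estimator. The data is synthetic and the feature names `f0` through `f5` are hypothetical; the project's actual RFE run is in Notebook 1 and may have been configured differently.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: six candidate features, of which only f0 and f1 drive the target.
rng = np.random.default_rng(0)
n_rows = 200
X = pd.DataFrame(rng.normal(size=(n_rows, 6)), columns=[f"f{i}" for i in range(6)])
y = 3.0 * X["f0"] - 2.0 * X["f1"] + rng.normal(scale=0.1, size=n_rows)

# Drop one feature per round (step=1) until two remain.
selector = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features.
selected = X.columns[selector.support_].tolist()
print(selected)
print(dict(zip(X.columns, selector.ranking_)))
```

With a linear estimator, RFE ranks features by the size of their coefficients, so features measured on very different scales may need standardizing first for the ranking to be meaningful.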

After employing RFE to select the 15 most important features (along with keeping a few identifying features), here is an updated correlation heatmap.

<img src = '../Images/Heatmap2.png' width = 900/>

Using the information here, we began modelling.

## V. Modeling

Two models were used to help build a formula that would predict win percentages: simple linear regression and multiple linear regression. Scikit-learn and statsmodels were used to assist with this.

### Simple Linear Regression
Put simply, a simple linear regression model is summed up in the following equation, where X is our predictor variable and Y is the response or dependent variable:

$$
Y = \beta_{0} + \beta_{1}X + \epsilon
$$

We see that the one feature that is most strongly correlated with win percentage is weighted on-base average (wOBA). Our RFE process also includes it as a significant feature. As such, we focused our simple linear regression model on it. Essentially, we wanted to see how well wOBA can help us predict a team's win percentage. This model used "wOBA" as the predictor/independent variable and "W%" as the response/dependent variable.

In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Let's create a linear regression object
lr = LinearRegression()

# Set variables
X = df2[['wOBA']]
Y = df2['W%']

# Fit the model
lr.fit(X,Y)

# Check intercept and slope
print('The value of the intercept is:', lr.intercept_)
print('The value of the slope is:', lr.coef_)

The value of the intercept is: -0.34397881712886275
The value of the slope is: [2.63737356]


So the final estimated linear regression model is:

<div align="center"><i><b>W%</b> = -0.344 + 2.637(wOBA)</i></div>
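As a quick worked example of reading this equation, the sketch below plugs in the 2019 Astros' wOBA of 0.355 from the cleaned dataset and compares the result with their actual win percentage of 0.660494. The function name `predict_win_pct` is hypothetical, and the coefficients are the rounded values from the equation above.

```python
# Rounded coefficients from the fitted simple linear regression.
INTERCEPT = -0.344
SLOPE = 2.637

def predict_win_pct(woba: float) -> float:
    """Predicted win percentage for a team with the given wOBA."""
    return INTERCEPT + SLOPE * woba

# 2019 Astros, taken from the head of the cleaned dataset.
actual = 0.660494
predicted = predict_win_pct(0.355)
residual = actual - predicted

print(round(predicted, 3))  # 0.592
print(round(residual, 3))   # 0.068
```

The model underestimates this team by roughly 0.07, which is a reminder that wOBA alone leaves much of a team's record unexplained.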

Let's look at this visually through a regression plot, and then view the variance through a residual plot.
<img src = '../Images/SLR_RegPlot.png' width = 550/>
<img src = '../Images/SLR_ResPlot.png' width = 550/>

What we see from above is that win percentage is positively correlated to weighted on base average, which we knew from the correlation heatmap. But we also see some variance, as not all the dots are close to the line. When we view this through the residual plot, we see that there is some randomness in the spread of residuals, which is good.

### Multiple Linear Regression
We can also try to predict win percentage by using more than one variable. We will take five features from our RFE list: 'wOBA', 'OBP', 'SLG', 'BB%', and 'BABIP_y'. The equation for this multiple linear regression is:

$$
Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{p}X_{p} + \epsilon
$$

Here is the model used based on those five predictor variables.

In [24]:
# Create object for predictor variables
Z = df2[['wOBA', 'OBP', 'SLG', 'BB%', 'BABIP_y']]

# Restating response variable
Y = df2['W%']

# Fit the model
lr.fit(Z, Y)

# Check intercept and slopes
print('The value of the intercept is:', lr.intercept_)
print('The values of the coefficients are:', lr.coef_)

The value of the intercept is: 0.4791040239816229
The values of the coefficients are: [ 1.41539577  0.65152783  0.35808903  0.49618659 -2.82097402]


Our estimated multiple linear regression model is:

<div align="center"><i><b>W%</b> = 0.479 + 1.415(wOBA) + 0.652(OBP) + 0.358(SLG) + 0.496(BB%) - 2.821(BABIP_y)</i></div>


Let's look at this visually, by comparing modeled values to actual values.

<img src = '../Images/MLR1.png' width = 600/>

Overall, the curve of fitted values has a different shape than the curve of actual values, so there is room to improve this model.

### Model Comparison & Selection
To determine which model to use, the metrics evaluated were R-squared and mean squared error (MSE).


In [27]:
print('Simple Linear Regression (SLR) Model:')
# Fit the model
lr.fit(X, Y)
# Print R-squared
print('R^2:', lr.score(X, Y))
# Calculate and print MSE
Yhat = lr.predict(X)
MSE = mean_squared_error(Y, Yhat)
print('MSE:', MSE)

print('Multiple Linear Regression (MLR) Model:')
# Fit the model
lr.fit(Z, Y)
# Print R-squared
print('R^2:', lr.score(Z, Y))
# Calculate and print MSE
Yhat = lr.predict(Z)
MSE = mean_squared_error(Y, Yhat)
print('MSE:', MSE)

Simple Linear Regression (SLR) Model:
R^2: 0.2940004271229022
MSE: 0.0036823132736963408
Multiple Linear Regression (MLR) Model:
R^2: 0.510789534394585
MSE: 0.00255159671526258


Because the multiple linear regression model had a higher R-squared and a lower MSE, it was selected for fine-tuning and further development.

## VI. Model Refinement

When actually fitting the MLR model, we began by separating the 2017-2020 seasons from the preceding seasons. The preceding seasons would serve as the training group, and the 2017-2020 seasons would be the test group.
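A season-based split like the one described above can be sketched as follows. The four-row dataframe is a hypothetical miniature built from rows in the cleaned dataset's head, and using `wOBA` as the only predictor is just for brevity.

```python
import pandas as pd

# Hypothetical miniature of the cleaned team table; the real one has 450 rows.
teams = pd.DataFrame({
    "Season": [2011, 2015, 2016, 2019],
    "wOBA": [0.345, 0.344, 0.333, 0.355],
    "W%": [0.598765, 0.574074, 0.635802, 0.660494],
})

# Seasons 2017-2020 (inclusive) form the test group; earlier seasons form the training group.
is_test = teams["Season"].between(2017, 2020)
train, test = teams[~is_test], teams[is_test]

X_train, y_train = train[["wOBA"]], train["W%"]
X_test, y_test = test[["wOBA"]], test["W%"]

print(len(train), len(test))
```

Splitting by season instead of at random means the model is always evaluated on seasons that come after the ones it was trained on.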

To begin, all 15 features were used in the model, and statsmodels was used to view the R-squared value of each model iteration. Here is the result of that first MLR model.

<img src = '../Images/MLR2.png' width = 600/>

Graphically, the distribution plot for the predicted test data values looks similar to the actual test data values; the peak of the curve of the predicted values is slightly higher, while the actual test values' curve peaks slightly sooner (at a lower win percentage). Let's look at the stats for the training group.

<img src = '../Images/OLS2.png' width = 600/>

Not bad: the model as a whole received an R-squared score of 0.540. To further refine the model, we looked at the p-values for each of the features in the training set. The one furthest above 0.05 was subsequently removed (in this case, BB%), and then the model with one fewer feature was run again.

This process was repeated for 9 iterations, until no features had a p-value above 0.05, the p-value of the F-statistic was sufficiently low, and the R-squared was higher than in the other model iterations.

This left us with the following result:

<img src = '../Images/MLR3.png' width = 600/>

Graphically, the distribution plot for the predicted test data values looks similar to the actual test data values, although not as close as the original MLR model. That's okay though, as we didn't want to overfit. The stats for the training group can be seen below.

<img src = '../Images/OLS3.png' width = 600/>

The overall metrics of this model were:
* R^2: 0.5813724554092377
* MSE: 0.0018948020906331916

We see that 58.1% of the variance in win percentage is explained by our model. Moreover, our MSE is low at 0.002, which corresponds to a root mean squared error of about 0.044, or roughly 7 wins over a 162-game season. Both of these metrics indicate a model that shows effectiveness without overfitting. Let's look more closely at how the model did with the test set.

<img src = '../Images/FinalRegPlot.png' width = 600/>
<img src = '../Images/FinalResPlot.png' width = 600/>

In both of these graphs, the line is our predicted win percentage in the test set, while the points are the actual win percentages of the test set. The residuals appear to be scattered randomly around the line, which is a great sign: it suggests the model's prediction errors have no systematic pattern. We can investigate this further with a histogram showing the frequency of the size of error between predicted and actual values.

<img src = '../Images/HistofErr.png' width = 600/>

The graph is close to being normally distributed, which is great. This means that the model's prediction errors were most often very small and were balanced between being too high and too low. So the model does not need to be further adjusted.

### MLR Model Result

Therefore, we will use this equation to determine the win percentage of a team:

<i><b>W%</b> = 0.8092 + (0.1078 * HR) + (1.0229 * AVG) + (1.4072 * OBP) - (0.4279 * HR/9) - (2.0556 * BABIP_y) - (0.8026 * GB%) + (2.4894 * HR/FB)</i>
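To illustrate how this equation is applied, the sketch below evaluates it for the 2019 Astros using the values in the head of the cleaned dataset. Note that HR in the cleaned dataset appears to be home runs per game (1.777778 for a team that hit 288), not a season total. The names `COEFFICIENTS`, `INTERCEPT`, and `predict_win_pct` are hypothetical.

```python
# Coefficients copied from the final equation.
INTERCEPT = 0.8092
COEFFICIENTS = {
    "HR": 0.1078,
    "AVG": 1.0229,
    "OBP": 1.4072,
    "HR/9": -0.4279,
    "BABIP_y": -2.0556,
    "GB%": -0.8026,
    "HR/FB": 2.4894,
}

def predict_win_pct(team: dict) -> float:
    """Intercept plus the coefficient-weighted sum of the team's features."""
    return INTERCEPT + sum(coef * team[name] for name, coef in COEFFICIENTS.items())

# 2019 Astros row from the cleaned dataset (HR is per game).
astros_2019 = {
    "HR": 1.777778,
    "AVG": 0.274,
    "OBP": 0.352,
    "HR/9": 1.42,
    "BABIP_y": 0.27,
    "GB%": 0.436,
    "HR/FB": 0.167,
}

predicted = predict_win_pct(astros_2019)
print(round(predicted, 3))  # 0.68
```

The prediction of about 0.680 is close to the team's actual win percentage of 0.660, a difference of roughly three wins over 162 games.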

### MLR Model Application

Once the model was developed to find an equation for win percentage, we could determine who has been the most valuable player in terms of contributing to wins. The features used in our equation include aspects of batting and aspects of pitching (the features for fielding were removed during model refinement). This makes sense. A team that wins will have its batters scoring more runs and its pitchers preventing the other team's batters from doing so.

However, determining who has been the most valuable player is actually somewhat challenging because of how baseball is played: pitchers generally do not bat (or do not bat often), and batters are very rarely pitchers. Moreover, pitchers do not play most games, limiting their impact over the course of a season. 

As such, we focused our search for the most valuable player on who contributes the most to wins through batting. We will use the batting features of our win percentage equation to create a new feature which scores the batters.

In [29]:
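# Pull batting dataset
Batters = pd.read_csv(r'~/Github/DATA601_FinalProject/Data/PlayerStats_Batting.csv')
# Create MVB (most valuable batter) score from the batting terms of the W% equation
# Note: Batters['HR'] is each player's home run total for 2017-2020, so HR dominates this score
Batters['MVB'] = ((0.1078 * Batters['HR']) + (1.0229 * Batters['AVG']) + (1.4072 * Batters['OBP']))
# Sort players from highest to lowest MVB
Batters = Batters.sort_values(by='MVB', ascending=False)
Batters.head()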

Unnamed: 0,Name,Team,G,PA,HR,R,RBI,SB,BB%,K%,...,SLG,wOBA,wRC+,EV,BsR,Off,Def,WAR,playerid,MVB
0,Mike Trout,Angels,441,1956,134,344,301,58,18.50%,19.90%,...,0.63,0.436,181.0,90.8,17.5,213.3,-2.0,27.8,10155,15.371238
30,Nelson Cruz,- - -,472,1971,133,275,357,2,10.50%,22.90%,...,0.565,0.389,149.0,93.4,-13.9,105.3,-48.6,12.5,2434,15.153428
37,J.D. Martinez,- - -,469,2032,131,316,366,13,10.60%,23.20%,...,0.592,0.399,148.0,91.6,-15.5,105.8,-54.9,11.9,6184,14.96123
8,Nolan Arenado,Rockies,518,2216,124,329,384,8,9.60%,15.40%,...,0.564,0.385,125.0,89.2,2.4,73.1,38.7,18.3,9777,14.195373
26,Eugenio Suarez,Reds,515,2131,124,282,327,10,11.60%,25.60%,...,0.515,0.367,125.0,88.9,-11.1,57.4,5.5,13.4,12552,14.141023


We see that our most valuable batter from 2017 through 2020 is <b>Mike Trout of the Los Angeles Angels</b>.

## VII. Conclusions, Interpretation, and Limitations

In conclusion, we developed a model that accounts for 58.1% of the variance in win percentage with a low mean squared error of 0.002, and every feature it includes is statistically significant. Moreover, it uses only seven features, so its cost to calculate is low.

We should now return to our research questions to answer them:
* What features most significantly impact the winning percentage of a given Major League Baseball (MLB) team?
    * The features which most significantly impact winning percentage are home runs (HR), batting average (AVG), on-base percentage (OBP), home runs allowed per nine innings (HR/9), batting average allowed on balls in play (BABIP_y), ground ball percentage (GB%), and home runs allowed per fly ball (HR/FB).
* For those given features, to what extent do they impact a team's winning percentage?
    * From multiple linear regression models, the following formula was created:
    <div align="center"><i><b>W%</b> = 0.8092 + (0.1078 * HR) + (1.0229 * AVG) + (1.4072 * OBP) - (0.4279 * HR/9) - (2.0556 * BABIP_y) - (0.8026 * GB%) + (2.4894 * HR/FB)</i></div>
* Based on this, who has been the most valuable player in contributing to a team's winning percentage from 2017-2020?
    * It was calculated that Mike Trout contributed the most offensively toward his team's winning percentage.

Based on this information, a general manager can attempt to build a winning roster by signing or trading for players that will improve the team's statistics that impact winning percentage. It also means that conversely, the general manager can determine if particular players are harming the team's odds of winning. The general manager could also use this to find players at specific contract costs that will maximize the team's potential winning percentage at a given salary payroll. 

### Limitations
One limitation comes from baseball itself. We could only craft a model that accounted for 58.1% of the variance in win percentage. There is a lot to baseball that impacts winning and losing. While you want your team to bat well, field well, and pitch well, no team is perfect. Players make mistakes in all phases of the game, and those mistakes can directly affect the result, especially in a close game. Season-level team statistics cannot capture all of this, which limits how much of the variance the model can explain.

Additionally, one of the problems with sports predictions is that the rows are not truly independent of one another. The teams play each other, impacting each other's win percentages. For example, the Yankees will typically play the Orioles 19 times in a standard length season. Each time they play each other, they impact each other's win percentage. Thus, they aren't independent.

Another limitation is that the dataset of team statistics was only 450 rows. This was chosen as a precaution to prevent skewed data from the "steroid era" of baseball. But a larger dataset from multiple decades of seasons would be preferable. Additionally, a few features were removed due to many null values. Perhaps these features could have been important in determining the win percentage, but the decision was made to preserve as many rows/seasons as possible in the dataset.

### Final Thoughts
These notebooks sought to develop a model to predict a Major League Baseball team's winning percentage based on the team's performance in particular areas (features). The model did this fairly well, but I was disappointed that it only covered 58.1% of the variance in win percentage. That being said, it seems to be on par with, or even to outperform, some similar models found through a Google search. There is a model (coded in R) ([link here](https://rstudio-pubs-static.s3.amazonaws.com/466923_475fdd164e7343f88fa8b2df67d3b648.html#model-diagnostics)) that ended with a 0.426 R-squared value, and another multiple linear regression model ([link here](http://rstudio-pubs-static.s3.amazonaws.com/326635_edfcfb859221409eb4fcc8e8d564bb09.html)) that ended with a 0.93 R-squared value. What I noticed about the models with much higher R-squared values is that they approached the problem very differently. They saw that run differential has a high correlation with win percentage, and then used regression to target run differential. This approach was popularized by <i>Moneyball</i>, so most of the work I have found uses it. Another key difference is that I opted to include advanced baseball stats (known commonly as sabermetrics) in my dataset, which is uncommon. Surprisingly, the batting sabermetrics did not make it into the model, but the pitching sabermetrics did.

All that being said, I would be interested in attempting this again through the <i>Moneyball</i> approach to see how my results would differ. Batter up.

<img src = '../Images/camdenyards.jpg' width = 600/>