In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

In [None]:
## Load the data from CSV
## The data contains four columns: team name, year, wins, and payroll
## Only the final two go into the model - I'm not looking for the impact of team names or years
## Dollars are in millions, rounded to the nearest million - example: 183.4 is 183
data = pd.read_csv("<YOUR PATHWAY>/wins_analysis/data/wins_dollars_data.csv")
dollars = np.array(data['Dollars'])
wins = np.array(data['Wins'])

In [None]:
# Mean wins are slightly below 81 because a few games were never played (rainouts and scheduling)
# The difference is too small to have a tangible impact on the model
dollars.mean(), wins.mean()

In [None]:
# Let's look at a plot of the values
# Our x variable, or the independent variable, is going to be dollars
# Consequently, the y variable will be wins, since we are trying to predict wins based on dollars
font_title = {'fontsize': 16}
font_axes = {'fontsize': 14}
plt.scatter(dollars, wins)
plt.xlim([0,dollars.max() + 10])
plt.ylim([0, wins.max() + 5])
plt.title("Payroll Dollars spent versus Total Wins",font_title)
plt.xlabel("Payroll Dollars", font_axes)
plt.ylabel("Total Wins", font_axes)
plt.savefig("<YOUR PATHWAY>/wins_analysis/data/dollars_wins_plot.png")
plt.show()

In [None]:
# The model passes the tests for statistical significance
# The effect is pretty weak though - each $1M corresponds to about 0.10 more wins - yikes
# The R-squared is pretty brutal as well - 0.116 is not good, it means about 88% of the variance in wins is not explained by payroll
dollars_x = sm.add_constant(dollars)
est = sm.OLS(wins,dollars_x)
est_fit = est.fit()
print(est_fit.summary())

### Key Points
1. Model Validation
2. P-values
3. Coefficients
4. R-squared

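### Model Validation

The F-statistic tests whether the model explains more variance than an intercept-only model. An F-statistic is never negative, so being larger than zero is not the test; what matters is its p-value, shown as Prob (F-statistic) in the summary. With a single predictor, the F-test is equivalent to the t-test on x1, whose p-value is 0.000, so the model as a whole is statistically significant. Passing this test means we can continue to analyze the remaining outputs.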
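
To make the link between the F-statistic and the x1 t-statistic concrete, here is a minimal sketch using only NumPy. The data is synthetic and generated purely for illustration (it is not the wins/payroll dataset), and the variable names are hypothetical. With one predictor, the F-statistic should equal the square of the slope's t-statistic.

```python
import numpy as np

# Synthetic data for illustration only; these are NOT the notebook's values.
rng = np.random.default_rng(1)
x = rng.uniform(40, 250, size=200)               # stand-in for payroll in $M
y = 68 + 0.1 * x + rng.normal(0, 11, size=200)   # stand-in for wins
n = len(x)

# Least-squares line; polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

ss_res = np.sum((y - fitted) ** 2)          # residual sum of squares
ss_reg = np.sum((fitted - y.mean()) ** 2)   # regression sum of squares
sxx = np.sum((x - x.mean()) ** 2)

# F-statistic with 1 model degree of freedom and n - 2 residual degrees of freedom.
f_stat = (ss_reg / 1) / (ss_res / (n - 2))

# t-statistic for the slope: estimate divided by its standard error.
se_slope = np.sqrt(ss_res / (n - 2) / sxx)
t_stat = slope / se_slope

print(f_stat, t_stat ** 2)   # the two numbers agree up to rounding error
```

Because the two statistics carry the same information here, the F-test only adds something new once the model has more than one predictor.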

### P-values

There are two terms in this model: const and x1. "const" is the intercept, a constant that does not change regardless of the dollars spent. Its p-value is displayed as 0.000, which means it is smaller than 0.0005 (not exactly zero) and passes the test for statistical significance. "x1" is dollars spent on the payroll; its p-value is also 0.000, so the same conclusion applies.
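
As a rough illustration of why a p-value can print as 0.000 without being zero, the sketch below converts a t-statistic into a two-sided p-value with `scipy.stats.t.sf`. The helper name `two_sided_p_value`, the t-statistic of 5.0, and the 100 degrees of freedom are all hypothetical; they are not taken from the summary table.

```python
from scipy import stats

def two_sided_p_value(t_stat, df):
    """Two-sided p-value for a t-statistic with df residual degrees of freedom."""
    # sf is the survival function (1 - CDF); doubling it covers both tails.
    return 2 * stats.t.sf(abs(t_stat), df)

# Hypothetical inputs, for illustration only.
p = two_sided_p_value(5.0, 100)

print(f"{p:.3f}")   # three decimals, the precision the summary table uses
print(p > 0)        # the underlying value is small but not zero
```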

### Coefficients

The coefficient for the constant is a little strange: according to this model, a team that spent no money would win 68 games. In practice we know this cannot occur because spending no money means either 1) the team did not play due to a lockout, COVID-19, etc. or 2) the team folded and does not exist. The constant is an extrapolation outside the range of the data, so it should not be read literally. The model can still be reasonable for payrolls that actually occur, such as teams spending at least \$40M on payroll.

The x1 coefficient, or dollars spent, is 0.0969 (nice). This means that each additional \$1M of payroll is associated with slightly less than 1/10th of a win. Let's look at an example to add context. If a team spent \$100M on payroll, the model predicts 100 × 0.0969 + 68.08 = 77.77 wins. That seems pretty reasonable given the plot above.
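
The worked example can be reproduced in a few lines of plain Python. This is only a sketch: the two coefficients are the rounded values quoted in the text, and the names `predicted_wins` and `millions_per_win` are hypothetical helpers, not part of the notebook.

```python
# Coefficients as quoted in the text (rounded values from the summary table).
INTERCEPT = 68.08   # "const": predicted wins at $0 payroll (an extrapolation)
SLOPE = 0.0969      # "x1": additional wins per $1M of payroll

def predicted_wins(payroll_millions):
    """Predicted season wins for a payroll given in millions of dollars."""
    return INTERCEPT + SLOPE * payroll_millions

# Payroll (in $M) associated with one additional win under this model.
millions_per_win = 1 / SLOPE

print(round(predicted_wins(100), 2))   # 77.77, matching the worked example
print(round(millions_per_win, 1))      # 10.3
```

Read the other way around, the slope implies roughly \$10.3M of additional payroll per extra win, which is another way to see how weak the effect is.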

### R-squared

R-squared measures the proportion of the variance in the dependent variable (wins) that is explained by the model. Models with a low R-squared are generally regarded as poor predictors because the independent variable (dollars spent) explains little of the variance in the dependent variable. In this model the R-squared is 11.6% (shown as 0.116 above). In non-statistics English, this means that 11.6% of the variance in wins is explained by dollars, and the remaining 88.4% is left unexplained, either due to factors that are not in the model or to random variation.
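
For readers who want to see where the number comes from, here is a minimal sketch that computes R-squared by hand with NumPy. The data is synthetic and exists only to make the block runnable; it is not the wins/payroll dataset, and the variable names are hypothetical.

```python
import numpy as np

# Synthetic data for illustration only; these are NOT the notebook's values.
rng = np.random.default_rng(0)
payroll = rng.uniform(40, 250, size=300)                      # payroll in $M
team_wins = 68 + 0.1 * payroll + rng.normal(0, 11, size=300)  # noisy linear relation

# Least-squares line; polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(payroll, team_wins, 1)
fitted = intercept + slope * payroll

ss_res = np.sum((team_wins - fitted) ** 2)            # variation the line does not explain
ss_tot = np.sum((team_wins - team_wins.mean()) ** 2)  # total variation in wins
r_squared = 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared correlation coefficient.
corr = np.corrcoef(payroll, team_wins)[0, 1]
print(r_squared, corr ** 2)
```

The unexplained share is simply `1 - r_squared`, which is how the 88.4% figure follows from 0.116.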

## Conclusion

Spending and wins have a weak connection. The relationship is statistically significant, but the low R-squared suggests that a team's payroll is a poor predictor of season-long success. An increase in payroll does not necessarily imply that a team will win more games because 88.4% of the variance in wins is not explained by payroll.