# Regression Modeling

This demo covers regression modeling and how we can use it to recover the run values and the four factor model that we've already seen.

## Regression Equation

The fundamental equation for multiple regression is an extension of the usual simple one-variable regression equation.  For $k$ input variables, the regression equation is
\begin{align}
  \text{Observation} & = \text{Linear Model} + \text{Error} \\
      & = \text{Intercept} + \beta_1 \times \text{Input}_1 + \dots + \beta_k \times \text{Input}_k + \text{Error}.
\end{align}

+ The Observation is the actual data observation we make for a particular set of inputs.
+ The Intercept gives a baseline value around which the output will vary as the inputs change.
+ The weights $\beta_i$ give the relative values of the inputs.  The units for the weights are given by
$$
    \text{$\beta_i$ Units} = \frac{\text{1 Observation Unit}}{\text{1 $\text{Input}_i$ Unit}}
$$
+ The Error adds in the random variation that is not modeled and, when combined with the linear model, leads to the observation.
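
To make the pieces of the equation concrete, here is a minimal sketch that simulates observations from a two-input linear model.  The intercept, weights, inputs, and noise level are all made-up values for illustration, not estimates from any dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
intercept = 1.0                   # hypothetical baseline value
betas = np.array([0.5, -0.25])    # hypothetical weights (observation units per input unit)

inputs = rng.integers(0, 10, size=(n, 2)).astype(float)  # made-up inputs
error = rng.normal(0.0, 0.1, size=n)                     # unmodeled random variation

linear_model = intercept + inputs @ betas
observation = linear_model + error

# Raise Input_1 by one unit while holding Input_2 fixed
bumped = inputs.copy()
bumped[:, 0] += 1
change = (intercept + bumped @ betas) - linear_model
print(change)  # every entry equals betas[0]
```

The last step is the units statement in action: one extra unit of an input moves the linear model by exactly that input's weight.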


## Fitting a Regression Model

By fitting a regression model, we find the optimal values of Intercept and $\beta_1, \ldots, \beta_k$.  How do we define optimal?  We minimize the sum of squared errors between the observations and the model:
$$
    \mathrm{minimize}\ \sum_i (\text{Observation}_i - \text{Linear Model}_i)^2
$$
where 
$$
    \text{Linear Model}_i = \text{Intercept} + \beta_1 \times \text{$i$-th Input}_1 + \dots + \beta_k \times \text{$i$-th Input}_k
$$

We'll be using a helper function `multiple_regression` to fit the regression model.
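
The internals of `multiple_regression` aren't shown here, so the following is only a sketch of what least-squares fitting does, using NumPy's `np.linalg.lstsq` on synthetic data.  The function `fit_least_squares` and its `constant` argument are hypothetical stand-ins, not the helper's actual implementation.

```python
import numpy as np

def fit_least_squares(X, y, constant=False):
    """Return the weights minimizing sum((y - X @ beta)**2).

    Hypothetical illustration. If constant=True, a column of ones is
    prepended and the intercept is returned as the first coefficient.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if constant:
        X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    predictions = X @ beta
    errors = y - predictions
    return beta, predictions, errors

# Synthetic data with made-up "true" weights
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_beta = np.array([0.5, -0.3, 1.2])
y = X @ true_beta + rng.normal(scale=0.1, size=200)

beta_hat, predictions, errors = fit_least_squares(X, y)
print(beta_hat)
```

With plenty of data and little noise, the fitted weights should land close to the made-up true weights.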

## Setup

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from datascience_stats import multiple_regression

## 1. Baseball Run Values by Regression

Recall the formula for Linear Weights:
$$
  \text{Runs Above Average} = .46\cdot \mathit{1B} + .80\cdot \mathit{2B} + 1.02\cdot \mathit{3B} + 1.40\cdot \mathit{HR} + .33\cdot (\mathit{BB} + \mathit{HBP}) - .25\cdot \mathit{O}
$$

We directly computed the run values for the events through a simple and elegant computation with the play-by-play data.  But there's nothing that stops us from trying to compute the run values through regression.  LWTS is a linear model, after all.

It turns out that, using season-level data for teams, we can do pretty well estimating the run values.

### Load Data

Similar to what we've seen before, we're going to use the Lahman dataset, cleaned up a bit for ease of use with our helper function.  In particular, some columns have been renamed, some extra columns have been computed, and many have been dropped.

In [None]:
# Load lahman_teams.csv obtained from the Lahman databank. 
# This table is a slight modification of the regular table.
lahman = pd.read_csv("lahman_teams.csv")
lahman.head()

### Our first regression model

Let's build our first regression model.  We need to tell the function `multiple_regression` which is the dependent variable (the observation) and the independent variables (the inputs).

The dependent variable is going to be Runs Above Average and the independent variables will be the events.

In [None]:
dep_vars = 'RAA'
ind_vars = ['O', 'X1B', 'X2B', 'X3B', 'HR', 'BB', 'HBP', 'SB', 'CS']

# compute the regression model
coefs, predictions, errors = multiple_regression(dep_vars, ind_vars, lahman)
coefs

We can compare the regression to the run values we obtained earlier in the semester:

| Event | Run Value |
| ------|---------- |
|  Out  |  -0.287   |
|  1B   |   0.462   |
|  2B   |   0.781   |
|  3B   |   1.085   |
|  HR   |   1.383   |
|  BB   |   0.306   |
|  HBP  |   0.336   |

We find strikingly similar results.  It's hard to argue with the effectiveness of the regression.

#### Stolen Bases

Under the original modeling approach, the run values from FanGraphs for a stolen base and getting caught stealing are given by
$$
    \mathit{SB} = .2,\quad \mathit{CS} = -(2 \times \text{Runs per Out} + 0.075).
$$
The caught stealing value is typically about -.4.  Our findings align pretty well with that.

We could have used additional variables for the regression.  We're a bit limited based on the Lahman dataset so we cannot distinguish between regular walks and intentional walks, or fielder's choice, or reaching base on an error.  Luckily we've got most of the events and the most important ones at that.

In [None]:
# stolen bases
SB = coefs["SB"]
print(f"""
Regression SB value: {SB:.3f}  
FanGraphs SB value:  0.2
""")

# Caught stealing
CS = coefs["CS"]
O = coefs['O']
print(f"""
Regression CS value: {CS:.3f}  
FanGraphs CS value:  approx. -0.4
-(2 x R / O + 0.075): {-(2 * O + 0.075):.3f}
""")

#### Stolen Base Breakeven Probability

The breakeven probability for a stolen base tells us how likely a stolen base needs to be to make it an even proposition in terms of run expectancy.  Research has shown that some poorly constructed regression models can fail to provide a properly calibrated model with respect to the breakeven probability.  Our model is pretty close to what we should expect, which is about 70%.
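
The breakeven probability $p$ comes from setting the expected run change of a steal attempt to zero: $p \cdot \mathit{SB} + (1 - p) \cdot \mathit{CS} = 0$, which gives $p = |\mathit{CS}| / (\mathit{SB} + |\mathit{CS}|)$ when $\mathit{CS}$ is negative.  As a quick sanity check, here is that formula applied to the approximate FanGraphs values quoted earlier (0.2 and -0.4).  The function name is only for illustration.

```python
def breakeven_probability(sb_value, cs_value):
    """Success probability at which a steal attempt has zero expected run change."""
    # Solve p * sb_value + (1 - p) * cs_value = 0 for p
    return -cs_value / (sb_value - cs_value)

p = breakeven_probability(0.2, -0.4)
print(f"{p:.3f}")
```

Those rounded values put the breakeven at two thirds, in the same neighborhood as the roughly 70% we expect.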

In [None]:
np.abs(coefs['CS']) / (coefs['SB'] + np.abs(coefs['CS']))

#### Residuals
We can look at a scatterplot of RAA against the errors from the regression.  It doesn't look egregiously bad, so it looks like we're doing a fair job of capturing run scoring with the events we have used.

In [None]:
plt.plot(lahman['RAA'], errors, '.');

### Are Ks more costly than other outs?

Another variable we could have used is the strikeout.  Presumably a strikeout, which never puts the ball in play, should be less valuable than other outs.  So is there much of a distinction between regular outs and strikeouts?

In [None]:
ind_vars_with_K = ['O_nonK', 'SO', 'X1B', 'X2B', 'X3B', 'HR', 'BB', 'HBP', 'SB', 'CS']

# compute the regression model with strikeouts
coefs_with_K, _, _ = multiple_regression(dep_vars, ind_vars_with_K, lahman)
coefs_with_K

Here is what we computed using the play-by-play data:

| Event | Run Value |
| ----- | --------- |
|  Out  |  -0.287   |
|   K   |  -0.292   |

The evidence is not strong that a generic out and a strikeout are hugely different in value.

### What happens if we only use a year of data?

We used all years since 2000 to build our regression.  What if we want to compute the run values for a single year, say 2016?  Let's slough off the rest of the data and run our regression.

In [None]:
# Isolate to just 2016
lahman_2016 = lahman.loc[lahman['yearID'] == 2016].copy()
# recompute RAA just for 2016
lahman_2016['RAA'] = lahman_2016['R'] - lahman_2016['R'].mean()

# compute the 2016 regression model
coefs_2016, _, _ = multiple_regression(dep_vars, ind_vars, lahman_2016)
coefs_2016

The end result is not good.  We don't know the ground truth but we have a good idea of where things should be and in this case, some of these values are ludicrous.  

+ The value of a double is way off, especially given that it's worth more than a triple. 
+ The values for HBP and BB are out of whack too.  
+ Most alarmingly, the value for CS is near 0.

So what happened?  

Not enough data.  That's pretty much it.  One season of MLB has only 30 observations and we tried to estimate 9 coefficients.  30 data points would possibly be okay if we wanted to measure 1 effect.  But 9 simultaneous effects?  No way.
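
A small simulation gives a sense of how much the sample size matters.  The sketch below uses synthetic data (not the Lahman data): it repeatedly fits a 9-coefficient model to 30 observations and then to 500 observations, and compares how much the estimate of the first coefficient bounces around.  The true weights and the noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 9
true_beta = np.linspace(-0.3, 1.4, k)   # arbitrary "true" weights

def spread_of_first_coef(n_obs, n_trials=500):
    """Std. dev. of the estimated first coefficient across simulated datasets."""
    estimates = []
    for _ in range(n_trials):
        X = rng.normal(size=(n_obs, k))
        y = X @ true_beta + rng.normal(scale=1.0, size=n_obs)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta_hat[0])
    return np.std(estimates)

spread_30 = spread_of_first_coef(30)
spread_500 = spread_of_first_coef(500)
print(f"n = 30:  {spread_30:.3f}")
print(f"n = 500: {spread_500:.3f}")
```

The estimates from 30 observations are far noisier.  With real team totals the inputs are also correlated with one another, which tends to make matters worse.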

The play-by-play method worked for a single season but this regression approach requires multiple years.  This is not great if we want to capture changing run environments.  A potential solution (if we wanted to continue with regression modeling) would be to build a regression using the play-by-play data.  That would be enough data.

### What happens if we only use a single variable?

Let's return to our data for the 2000s but now we'll explore an important phenomenon with regression modeling: _misspecification_.

The underlying mathematical theory for regression basically requires the following:
+ Use all the independent variables that the observation depends on 
+ Assume the error is reasonably well behaved and actually random

If you satisfy those assumptions, the regression model will properly estimate the coefficients of the model.

So far we've seen regression models that do pretty well because we're doing a pretty good job of specifying the model.  Let's see just how the regression could have produced junk results if we did not properly specify the regression model.

**Note**: Because we're explicitly creating a bad model with missing information, it now makes sense to include a constant term in `multiple_regression`.  You can see this most obviously from the scatter plot below.
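
To see why the constant matters with a single input, here is a rough sketch on made-up data: an input whose values sit far from zero (like a season count of doubles) and an output centered near zero (like RAA).  It uses `np.linalg.lstsq` directly rather than `multiple_regression`, and all the numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up input far from zero and an output centered near zero
x = rng.normal(loc=280.0, scale=25.0, size=300)
y = 0.5 * (x - 280.0) + rng.normal(scale=20.0, size=300)

# Without a constant, the fitted line is forced through the origin
slope_no_const, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

# With a constant, prepend a column of ones
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(f"slope without constant: {slope_no_const[0]:.3f}")
print(f"slope with constant:    {slope:.3f}")
```

Forcing the line through the origin flattens the slope toward zero, while the fit with a constant recovers something near the 0.5 used to generate the data.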

In [None]:
dep_vars = 'RAA'
ind_vars = 'X2B'
lahman.plot.scatter(x=ind_vars, y=dep_vars);

In [None]:
coefs, predictions, errors = multiple_regression(dep_vars, ind_vars, lahman, constant=True)
coefs

While it feels like we should have been able to estimate the individual effects of the events, the poor results show that the simultaneous effects of the different events make it so that you definitely need to incorporate all the events to get proper results.

This is a huge part of any statistical study using regression: you need to collect as much information as you can that is likely relevant _and_ properly specify the model.  If you fail to do this, your results can very likely be corrupted and erroneous.
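
The effect of leaving out a relevant input can be reproduced on synthetic data.  In this sketch, two made-up inputs tend to move together and both drive the output.  Dropping the second input pushes part of its effect onto the first.  None of these numbers come from the baseball data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

# Two made-up inputs that tend to move together
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)
y = 0.8 * x1 + 1.4 * x2 + rng.normal(scale=0.5, size=n)

# Correctly specified: both inputs included
full, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)

# Misspecified: x2 left out, so x1 absorbs part of its effect
alone, *_ = np.linalg.lstsq(x1[:, None], y, rcond=None)

print(f"x1 weight, full model: {full[0]:.2f}")
print(f"x1 weight, x1 only:    {alone[0]:.2f}")
```

In the misspecified fit the weight on `x1` drifts toward $0.8 + 1.4 \times 0.7$, because `x1` is standing in for the omitted `x2`.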

## 2. Dean Oliver's Four Factor Model by Regression

Recall Dean Oliver's four factor model for basketball:
\begin{align*}
  \text{Team Performance} & = .4 \cdot Z(\mathit{eFG\%} -  \mathit{eFG\%}_{\text{Opp}}) \\
  & \quad - .25 \cdot Z(\text{Turnover Rate} - \text{Turnover Rate}_{\text{Opp}}) \\
  & \quad + .2 \cdot Z(\mathit{OREB\%} -  \mathit{OREB\%}_{\text{Opp}}) \\
  & \quad  + .15 \cdot Z(\text{FT Rate} - \text{FT Rate}_{\text{Opp}})
\end{align*}

The model tried to explain team performance through four fundamental factors.  Dean Oliver prescribed his own relative importance to the factors as 40% for efficient shooting, 25% for turnovers, 20% for rebounding, and 15% for free throw attempts.   Where did Dean Oliver get those values?  Are they the best?

We don't know where he got those values but we can see what regression says for the relative importance.
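
As a reminder of how the formula is evaluated, here is a minimal sketch that standardizes each differential and applies Dean Oliver's weights.  The four teams and their differentials are made up, and the column names simply mirror the factor names.

```python
import numpy as np
import pandas as pd

def zscore(series):
    """Standardize to mean 0 and standard deviation 1."""
    return (series - series.mean()) / series.std(ddof=0)

# Made-up differentials (team minus opponent) for four hypothetical teams
diffs = pd.DataFrame({
    'eFG': [0.030, 0.010, -0.015, -0.025],
    'Tov': [-0.010, 0.005, 0.000, 0.005],
    'Reb': [0.020, -0.010, 0.010, -0.020],
    'Ftr': [0.015, 0.000, -0.005, -0.010],
}, index=['A', 'B', 'C', 'D'])

# Dean Oliver's weights; turnovers carry a negative sign
weights = pd.Series({'eFG': 0.40, 'Tov': -0.25, 'Reb': 0.20, 'Ftr': 0.15})

z = diffs.apply(zscore)                               # column-wise Z-scores
team_performance = z.mul(weights, axis=1).sum(axis=1)
print(team_performance.sort_values(ascending=False))
```

Team A leads every factor in this toy data, so it comes out on top of the composite.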

### Load Data

We'll use data similar to what we used before, but cleaned up to have just the four factors and other relevant data.

Recall the two values:
\begin{align*}
  \text{Rating Ratio} & = \frac{\text{Off. Rating}}{\text{Def. Rating}} \\
  \text{Log Rating Ratio} & = \log\text{Rating Ratio}
\end{align*}
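
A quick illustration with hypothetical ratings shows how the log rating ratio behaves: an even team sits at exactly 0, and swapping a team's offensive and defensive ratings flips the sign.

```python
import numpy as np

# Hypothetical offensive and defensive ratings for three teams
off_rating = np.array([112.0, 105.0, 108.0])
def_rating = np.array([105.0, 112.0, 108.0])

rating_ratio = off_rating / def_rating
log_rating_ratio = np.log(rating_ratio)
print(log_rating_ratio)
```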

In [None]:
nba_teams_full = pd.read_csv('team_season_ff_data.csv')

nba_teams = nba_teams_full.loc[nba_teams_full.season >= 2000]
nba_teams.head()

### Four Factors and Winning Pct

Let's first look at a model for winning percentage using the four factors.  Since winning percentage is centered around .500, we need to include a constant term to center our model.

In [None]:
dep_vars = 'win_pct'
ind_vars = ['eFG', 'Tov', 'Reb', 'Ftr']

# compute the Four Factor model by regression
coefs, _, _ = multiple_regression(dep_vars, ind_vars, nba_teams, constant=True)
coefs

Now we have the exact coefficients in terms of winning percentage.  So we know that a team that increases its eFG factor by 1 unit will increase its winning percentage by .122 points, or about 10 wins over an 82-game season.
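
The conversion from winning percentage to wins is just a multiplication by the number of games.  A quick check, assuming a full 82-game NBA regular season:

```python
games_per_season = 82   # assumption: a full NBA regular season
efg_coef = 0.122        # change in winning percentage per unit of the eFG factor

extra_wins = efg_coef * games_per_season
print(f"{extra_wins:.1f} wins")
```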

If we want the relative importance, we can rescale the non-intercept coefficients to sum to 100 in absolute value.  These will be the relative percentages, as Dean Oliver used.

In [None]:
factor_coefs = coefs['eFG':]
factor_coefs / factor_coefs.abs().sum() * 100

It turns out we get quite close to Dean Oliver's prescribed values.  But it also turns out that our model suggests lower weights for **Tov**, **Reb**, and **Ftr** in exchange for more importance for **eFG**.

### Four Factors and the log Rating Ratio

We can also look at our ole Pythagorean Expectation pal the log rating ratio.  There is no need for an intercept for the log rating ratio since it's centered very close to 0.

Perhaps not shockingly, we get similar results for the relative importance.  The **eFG** factor again is more relevant according to this regression.

In [None]:
dep_vars = 'log_rtg_rat'
ind_vars = ['eFG', 'Tov', 'Reb', 'Ftr']

# compute the Four Factor model by regression
coefs, _, _ = multiple_regression(dep_vars, ind_vars, nba_teams)

coefs / coefs.abs().sum()

#### As before, what if we only include one variable in the regression?

The resulting coefficients from the misspecified models are all off, and not in a consistent direction.

**Note**: Since we're using the Log Rating Ratio, we don't use an intercept.

In [None]:
dep_vars = 'log_rtg_rat'
ind_vars = 'eFG'

# misspecified regression model
coefs_misspecified, _, _ = multiple_regression(dep_vars, ind_vars, nba_teams)

print(f"""
Misspecified {ind_vars} value: {coefs_misspecified[ind_vars]}
Four Factor {ind_vars} value:  {coefs[ind_vars]}
""")

In [None]:
dep_vars = 'log_rtg_rat'
ind_vars = 'Tov'

# misspecified regression model
coefs_misspecified, _, _ = multiple_regression(dep_vars, ind_vars, nba_teams)

print(f"""
Misspecified {ind_vars} value: {coefs_misspecified[ind_vars]}
Four Factor {ind_vars} value:  {coefs[ind_vars]}
""")

In [None]:
dep_vars = 'log_rtg_rat'
ind_vars = 'Reb'

# misspecified regression model
coefs_misspecified, _, _ = multiple_regression(dep_vars, ind_vars, nba_teams)

print(f"""
Misspecified {ind_vars} value: {coefs_misspecified[ind_vars]}
Four Factor {ind_vars} value:  {coefs[ind_vars]}
""")

In [None]:
dep_vars = 'log_rtg_rat'
ind_vars = 'Ftr'

# misspecified regression model
coefs_misspecified, _, _ = multiple_regression(dep_vars, ind_vars, nba_teams)

print(f"""
Misspecified {ind_vars} value: {coefs_misspecified[ind_vars]}
Four Factor {ind_vars} value:  {coefs[ind_vars]}
""")

In [None]:
all_coefs_misspecified = np.array([0.04023, -0.01672, 0.009373, 0.01910])
all_coefs_misspecified / np.abs(all_coefs_misspecified).sum() * 100

### Four Factors Regression Model by Game

If you recall, the four factor model was also effective for explaining game performance.  Compared to the season level, the performance was quite similar, though the games just had more variation.  The regression should still be more effective.  How does that play out here?

In [None]:
games = pd.read_csv('game_ff_data_2016.csv')
games.head()

For 2016, the weight is just a bit more on eFG.  But it appears generally consistent with season level.

In [None]:
dep_vars = 'log_rtg_rat'
ind_vars = ['eFG', 'Tov', 'Reb', 'Ftr']
coefs, _, _ = multiple_regression(dep_vars, ind_vars, games)
coefs = coefs / coefs.abs().sum()
coefs