# Plus/Minus Ratings

We can approach evaluating a player in two ways:
+ Using data on the events they generate like made shots, assists, etc
+ Using the cumulative scoring while the player is on the court

The basic Plus/Minus calculation is given by:
\begin{align}
    \text{Player Plus/Minus} 
        & = \text{Team Points Scored w/ Player on Court} - \text{Team Points Allowed w/ Player on Court} \\
        & = \text{Net Team Points w/ Player on Court}
\end{align}

In theory, this should be an effective general measure of a player because it directly captures the effect on scoring.  When we rely on player events, we inevitably miss things that affect scoring.  For instance, if a player doesn't register conventional box score stats but still helps overall scoring, Plus/Minus might be able to capture it.

Unfortunately, it doesn't work out like this.  We'll see why and how to try to do better.

## Setup

In [None]:
%run ../../utils/notebook_setup.py

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from datascience_stats import multiple_regression_big

## 1. Stint data

Here we can see the data on all the stints, but in this raw form it isn't really suitable for a regression analysis.  

In [None]:
df = pd.read_csv('nba_stints_2015_full.csv.gz')
print(df.shape)
df.head()

### Stint Data for Regression

Instead, we use encoded data that is actually numeric.  Each player is represented by a 0 or 1.  If a player is on the court during the stint, he will have a 1.  Most of the entries will be 0.

HCA naturally stands for home court advantage and is actually just a column of 1s.  This is like fitting an intercept.


We do this via a big model where each variable corresponds to a player and is 0 if the player was _not_ on the court during the stint and 1 if he was.  This creates a table of 0s and 1s with one row per stint and one column per player, plus one extra column representing the home court advantage.  Each row has only ten 1s among the player columns, one for each player on the court.
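As a rough illustration of the encoding described here, the sketch below builds such a table from a couple of made-up stints. The lineups, the player labels and the `toy_stints` structure are all hypothetical, and the sides are cut down to two players each to keep the output small; the real file may be laid out differently.

```python
import pandas as pd

# Hypothetical toy stints: who was on the court for each side.
# (Real stints have five players per side; two are used here for brevity.)
toy_stints = [
    {"home": ["A", "B"], "away": ["X", "Y"]},
    {"home": ["A", "C"], "away": ["X", "Z"]},
]

# Every player who appears anywhere becomes one column.
all_players = sorted(
    {p for s in toy_stints for side in ("home", "away") for p in s[side]}
)

rows = []
for s in toy_stints:
    on_court = set(s["home"]) | set(s["away"])
    row = {"HCA": 1}  # constant column, plays the role of an intercept
    row.update({p: int(p in on_court) for p in all_players})
    rows.append(row)

encoded = pd.DataFrame(rows, columns=["HCA"] + all_players)
print(encoded)
```

With five players per side, each row of the player columns would sum to 10 instead of 4.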

In [None]:
stints = pd.read_csv('nba_stints_2015_binary.csv.gz')
players = list(stints.loc[:, 'A.J. Price':].columns)

stints.head()

## 2. Plus/Minus

We can build the Plus/Minus for each player by summing the net points over every stint in which the player is on the court.


In [None]:
player_plus_minus = {}
for player in players:
    # compute the plus minus value
    plus_minus_val = (stints['home_netpts'] * stints[player]).sum()
    player_plus_minus[player] = plus_minus_val
    
player_plus_minus = pd.Series(player_plus_minus).sort_values(ascending=False)

If we look at the top players, we see a lot of familiar names.  These are all starters or important players on the best teams in the league that year.

In [None]:
player_plus_minus.head(20)

When we look at the bottom players, we see a lot of young, not very good players who are also on weak teams.

In [None]:
player_plus_minus.tail(20)

### Player Net Rating

We should normalize by the number of possessions.  This puts players with very different amounts of playing time on the same per-100-possessions scale.

In [None]:
player_poss = {}
for player in players:
    # compute the number of possessions for the player
    poss_ct = (stints['net_poss'] * stints[player].abs()).sum()
    player_poss[player] = poss_ct
    
player_poss = pd.Series(player_poss).sort_values(ascending=False)

In [None]:
player_net_rtg = player_plus_minus / player_poss * 100
player_net_rtg.sort_values(ascending=False, inplace=True)

Unfortunately, it looks like some low possession players dominate because they did well in their few opportunities.  

In [None]:
player_net_rtg.head(20)

It is beyond the scope of this demo (because the data isn't quite right) to take this a step further and compute the difference between the net rating when the player is on the court versus when the player is off the court.
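For reference, the on/off idea itself is simple arithmetic. The sketch below uses a small made-up table for a single team (the column names `team_netpts`, `poss` and `player_on` are hypothetical and are not columns of the stint files used here) just to show the calculation.

```python
import pandas as pd

# Hypothetical stints for one team, seen from that team's point of view.
team_stints = pd.DataFrame({
    "team_netpts": [6, -2, 4, -5, -3],
    "poss":        [20, 10, 20, 25, 25],
    "player_on":   [True, True, True, False, False],
})

def net_rating(df):
    """Net points per 100 possessions over the given stints."""
    return 100 * df["team_netpts"].sum() / df["poss"].sum()

on_rtg = net_rating(team_stints[team_stints["player_on"]])
off_rtg = net_rating(team_stints[~team_stints["player_on"]])
on_off = on_rtg - off_rtg

print(f"on: {on_rtg:.1f}, off: {off_rtg:.1f}, on/off: {on_off:.1f}")
```

The on/off number still depends heavily on who the teammates and substitutes are, which is the problem the regression approach tries to address.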

## 3. Regression Plus/Minus

We can think of a plus/minus rating as simultaneous impacts of players on team performance.  If we track performance over stints, where the same 10 players are on the court, we can measure a player's impact using a regression.

The model is:
\begin{align}
    \mathrm{HomeNetRating}_t & = \mathrm{HomeCourtAdv}\ + \\
    & \quad \mathrm{Sum}(\mbox{Home Player $i$'s net rating if player $i$ is on the court during the $t$-th stint})\ - \\
    & \quad \mathrm{Sum}(\mbox{Away Player $i$'s net rating if player $i$ is on the court during the $t$-th stint}).
\end{align}

Using play-by-play data from 2014-15, the stint data is collected into a table.  For each stint, possessions and scoring are tracked as well as the 10 players on the court.  There are about 40k stints over this season.
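To make the model concrete, here is a minimal sketch on synthetic, noise-free data. It assumes one particular way of encoding the minus sign in the model (+1 for a home player, -1 for an away player), uses three players per side, and the ratings are invented; it is not a description of how the stint files or `multiple_regression_big` work internally.

```python
from itertools import combinations

import numpy as np

n_players = 6
true_hca = 3.0
# Invented player ratings; they sum to zero, i.e. they are relative to average.
true_rating = np.array([4.0, 2.0, 1.0, -1.0, -2.0, -4.0])

rows, y = [], []
# Enumerate every 3-vs-3 split of the six players as one "stint".
for home in combinations(range(n_players), 3):
    x = -np.ones(n_players)       # away players get -1
    x[list(home)] = 1.0           # home players get +1
    rows.append(np.concatenate(([1.0], x)))   # leading 1 is the HCA column
    y.append(true_hca + x @ true_rating)      # noise-free home net rating

X = np.array(rows)
y = np.array(y)

# Ordinary least squares; lstsq returns the minimum-norm solution.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
est_hca, est_rating = coef[0], coef[1:]

print("HCA:", np.round(est_hca, 3))
print("ratings:", np.round(est_rating, 3))
```

In this setup, adding the same constant to every player leaves the predictions unchanged, so ratings are only determined relative to one another. The minimum-norm solution picks the version that sums to zero, which is why the invented ratings are recovered exactly here.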

### Adjusted Plus/Minus

We need a more advanced solver for the regression model that can handle this much bigger problem.  This is where `multiple_regression_big` comes in.

We set `home_netrtg` as the dep_var and the players as the independent vars; passing `constant=True` adds the intercept, which plays the role of `HCA`.  Each stint also has a total number of possessions, and results from stints with more possessions should arguably count for more than results from short stints.  This first fit does not use those weights, so every stint is treated equally.

After we compute the regression model, we can see some of the results that come out for the first ten players alphabetically.  These are the _Adjusted Plus/Minus_ or APM ratings.

The result of this regression model is a player rating which should indicate the impact the player had on Net Rating relative to league average.  A positive value obviously indicates a positive impact on Net Rating.  We could in fact use this to construct lineup net ratings above average by summing across a lineup of players.
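The lineup calculation mentioned here can be sketched as follows. The ratings and player labels are made up, and `predicted_home_net_rating` is a hypothetical helper, not part of `datascience_stats`; the only assumption carried over from the notebook is that the fitted result is a pandas Series with an `Intercept` entry.

```python
import pandas as pd

# Hypothetical ratings in net points per 100 possessions.
ratings = pd.Series({
    "Intercept": 2.5,   # home court advantage
    "A": 3.0, "B": 1.0, "C": -0.5,
    "X": 2.0, "Y": -1.0, "Z": -2.0,
})

def predicted_home_net_rating(ratings, home, away):
    """HCA plus the home lineup's ratings minus the away lineup's ratings."""
    return ratings["Intercept"] + ratings[home].sum() - ratings[away].sum()

pred = predicted_home_net_rating(ratings, ["A", "B", "C"], ["X", "Y", "Z"])
print(f"Predicted home net rating: {pred:.1f}")
```

Dropping the intercept and the away term gives a single lineup's rating above average.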

In [None]:
dep_var = 'home_netrtg'
ind_vars = players

# compute the regression for Net Rating
apm = multiple_regression_big(dep_var, ind_vars, stints, constant=True)
apm.head(10)

Let's take a look at the histogram plot.

This is odd... there are some very large values.  This is supposed to be the player's impact on net rating and there are values over 100 in magnitude??

Did we do something wrong?

In [None]:
apm.plot.hist(bins=50);

Let's look at the top ranked players.

Geez, who are some of these guys?  Where's LeBron??

What happened?

In [None]:
apm_HCA = apm['Intercept']
print("Home Court Advantage for Net Rating: {:.2f}".format(apm_HCA))
print()
print("Top 20 by APM\n" + 40*"=")
print(apm[players].sort_values(ascending=False)[:20].to_string())
print()
print("Bottom 20 by APM\n" + 40*"=")
print(apm[players].sort_values(ascending=True)[:20].to_string())


## 4. Fixing the Regression Model

The initial regression results are no good.  There are a few issues:
+ We didn't restrict to a minimum amount of playing time.  Malcolm Lee played something like one stint, which had an absurdly high Net Rating for his team.
+ We're not addressing the issue of _multicollinearity_, which is basically the result of groups of the same players frequently playing together, or players substituting at the same position and thus never playing together.
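A tiny made-up example shows why the second issue is so damaging. If two players always share the court, their columns are identical and the data cannot tell them apart, so wildly different ratings fit equally well. The matrix and values below are invented for illustration.

```python
import numpy as np

# Columns: players P, Q, R.  P and Q are always on the court together.
X = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
], dtype=float)
y = np.array([4.0, 5.0, 1.0, 4.0])

# Three columns, but only two independent directions of information.
rank = np.linalg.matrix_rank(X)
print("rank:", rank)

# Two very different sets of ratings for (P, Q, R) ...
fit_a = np.array([4.0, 0.0, 1.0])
fit_b = np.array([-96.0, 100.0, 1.0])

# ... produce exactly the same predictions, so least squares cannot choose.
print(X @ fit_a)
print(X @ fit_b)
```

Only the sum of P's and Q's ratings is pinned down, which is how ratings over 100 in magnitude can appear without hurting the fit.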


We can try to fix the regression two ways:

#### Weighting
We can use weights so that instead of the squared error of each stint being treated equally, we'll emphasize stints with more possessions.  It's not always obvious there are weights to use but in this case, we should use the possessions as weights.  The more possessions, the more signal there is in the stint.
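One common way to do weighted least squares is to scale each row by the square root of its weight and then solve an ordinary least squares problem. The sketch below does that with NumPy on invented numbers; `weighted_least_squares` is a hypothetical helper and may not match what `multiple_regression_big` does internally.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimize sum_i w_i * (y_i - x_i @ beta)**2 via sqrt-weight scaling."""
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

# One player, three stints: a 1-possession fluke and two long stints.
X = np.ones((3, 1))
y = np.array([200.0, 0.0, 10.0])   # net rating per 100 possessions
w = np.array([1.0, 50.0, 49.0])    # possessions in each stint

unweighted = np.linalg.lstsq(X, y, rcond=None)[0][0]
weighted = weighted_least_squares(X, y, w)[0]

print(f"unweighted: {unweighted:.1f}, weighted: {weighted:.1f}")
```

With a single column of ones, the unweighted fit is the plain mean of the stint ratings, while the weighted fit is the possession-weighted mean, so the one-possession fluke barely moves it.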

#### Penalization
This is more advanced, but we can incorporate penalization when solving for the optimal model values.  The optimization of the model, i.e. minimizing the squared error, is overly aggressive in how it computes its values.  The result is an overfit model that won't generalize well.  

Giving Malcolm Lee a high rating does well at minimizing the squared error, but if we had a chance to observe him play more stints, the high rating would very soon look like a very bad prediction, i.e. it wouldn't generalize to other data.

Penalizing the model's values reduces the overfitting by not allowing it to assign large values unless it really needs to.  If a player is going to be rated highly, there need to be enough observations that the high rating reduces the error by more than the penalty we place on the rating.

### Using Weights

We can incorporate the number of possessions in the stint into the model very easily.

In [None]:
dep_var = 'home_netrtg'
ind_vars = players
weights = 'net_poss'

# compute the weighted regression for Net Rating
apm_weighted = multiple_regression_big(
    dep_var, ind_vars, stints, weights=weights, constant=True)
apm_weighted.head(10)

Looking at the histogram shows the results of the regression already appear to be better.

In [None]:
apm_weighted.plot.hist(bins=50);

Let's look at the ranking coming from the weighted regression.  First we'll create a complete dataframe with all the ratings so far.

In [None]:
player_df = pd.DataFrame({
    'Net Rating': player_net_rtg, 
    'APM': apm[players], 
    'Weighted APM': apm_weighted[players]
})

The weighted regression is vastly superior to the regular regression model.  Comparing between the weighted model and raw net rating, we see that there is quite a bit of difference.  For instance, the weighted model likes DeMarcus Cousins a lot more than his net rating does, quite possibly due to the fact that Cousins played for the Kings, a notoriously trash team that could have tanked his rating.

In [None]:
apm_weighted_HCA = apm_weighted['Intercept']
print("Home Court Advantage for Net Rating: {:.2f}".format(apm_weighted_HCA))
print()
print("Top 20 by Weighted APM\n" + 40*"=")
print(player_df.sort_values('Weighted APM', ascending=False).head(20).to_string())
print()
print("Bottom 20 by Weighted APM\n" + 40*"=")
print(player_df.sort_values('Weighted APM', ascending=True).head(20).to_string())

### Penalizing the Least Squares Fit: Regularized Adjusted Plus Minus (xRAPM)

We just ran into a few issues:
+ Players who we should have dropped due to not having many minutes.  If they have a raw net rating of 200 in 1 possession, the regression will still try to aggressively optimize and give that player a high rating.  We can bucket those players together or force the regression optimizer to not be so aggressive.
+ Lineups do not behave like randomized controlled trials.  Given nine players on the court, we can do a really good job predicting the tenth.  Sometimes two players almost always play together.  Or two players switch for each other.
+ This lack of randomization leads to a condition called _multicollinearity_ and is a huge potential problem in multiple regression problems.  Due to issues that can be derived/explained with Linear Algebra, if multicollinearity is present the regression will likely falter and not be able to distinguish well what is happening.  

Our solution is to use something called _penalization_ or _regularization_.

Instead of just aggressively minimizing the mean square error, we reframe the regression to simultaneously minimize mean square error and penalize aggressive fitting by the optimizer.  If the optimization wants to assign a big rating value to a player, it had better have a lot of evidence behind it, i.e. the reduction in the least squares needs to offset the penalty imposed.

What exactly is the penalty?  We penalize the sum of squares of the coefficients and we introduce a penalty parameter that quantifies the strength of this penalty.  This parameter is our choice but there are methods (beyond the scope of this demo) that can suggest a good value.
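A sum-of-squares penalty like this is usually called ridge regression, and it has a simple closed form. The sketch below is a minimal NumPy version on invented data; `ridge` is a hypothetical helper, it penalizes every coefficient it is given, and it is not necessarily how `multiple_regression_big_with_penalty` is implemented or how it scales its `penalty` argument.

```python
import numpy as np

def ridge(X, y, penalty):
    """Minimize ||y - X @ beta||^2 + penalty * ||beta||^2 (closed form)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

# Player 0 appears in one stint with a fluke rating of 200.
# Player 1 appears in 100 stints, each with a rating of 5.
X = np.zeros((101, 2))
X[0, 0] = 1.0
X[1:, 1] = 1.0
y = np.concatenate(([200.0], np.full(100, 5.0)))

no_penalty = ridge(X, y, penalty=0.0)
with_penalty = ridge(X, y, penalty=10.0)

print("no penalty:  ", np.round(no_penalty, 2))
print("with penalty:", np.round(with_penalty, 2))
```

In this toy case each rating is the sum of that player's stint ratings divided by (number of stints + penalty). The one-stint player is pulled most of the way back toward zero, while the 100-stint player barely moves.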

The result of this is a statistic attributed to Jerry Engelmann called _Regularized Adjusted Plus Minus_ or xRAPM.  It is actually the cousin/basis for ESPN's Real Plus/Minus statistic. 

We use a new function to perform this: `multiple_regression_big_with_penalty`.


In [None]:
from datascience_stats import multiple_regression_big_with_penalty

dep_var = 'home_netrtg'
ind_vars = players

rapm = multiple_regression_big_with_penalty(
    dep_var, ind_vars, stints, constant=True, penalty=800.)
rapm.head(10)

This looks way better.  Now we see the people we expect to see at the top.  There are some interesting names at the top like Kyle Korver or Draymond Green.  I would have expected Draymond to rank high on defense but not overall.

In [None]:
rapm.plot.hist(bins=50);

In [None]:
rapm_HCA = rapm['Intercept']

player_df['RAPM'] = rapm[players]
print("Home Court Advantage for Net Rating: {:.2f}".format(rapm_HCA))
print()
print("Top 20 by RAPM\n" + 40*"=")
print(player_df.sort_values('RAPM', ascending=False).head(20).to_string())
print()
print("Bottom 20 by RAPM\n" + 40*"=")
print(player_df.sort_values('RAPM', ascending=True).head(20).to_string())

### Combining Weighting and Penalization

We can actually combine the two methods to achieve something pretty solid.

A few comments:
+ Note how the penalty parameter is a lot different now.  The weighting already picked up some slack so the penalty parameter doesn't have to do as much.
+ Note how some players look a lot better with the weighting (Kelly Olynyk, Danny Green, George Hill).
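The combined fit can be sketched as weighted ridge regression. The data below is random and `weighted_ridge` is a hypothetical helper, not the library function. The example also shows one possible reason a penalty value can change by orders of magnitude once weights are involved: rescaling the weights rescales the penalty that gives the same answer. Whether that is what happens inside `multiple_regression_big_with_penalty` is an assumption.

```python
import numpy as np

def weighted_ridge(X, y, w, penalty):
    """Minimize sum_i w_i * (y_i - x_i @ beta)**2 + penalty * ||beta||^2."""
    XtW = X.T * w   # same as X.T @ diag(w), without building the diagonal
    return np.linalg.solve(XtW @ X + penalty * np.eye(X.shape[1]), XtW @ y)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(30, 4)).astype(float)   # toy 0/1 design matrix
y = rng.normal(size=30)                              # toy stint outcomes
w = rng.integers(1, 20, size=30).astype(float)       # toy possession counts

# Raw possession weights with penalty 5 ...
raw = weighted_ridge(X, y, w, penalty=5.0)
# ... match weights normalized to sum to 1 with a proportionally smaller penalty.
scaled = weighted_ridge(X, y, w / w.sum(), penalty=5.0 / w.sum())

print(np.round(raw, 4))
print(np.round(scaled, 4))
```

Because of this scaling, penalty values are only comparable between two fits when the weights are on the same scale.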

In [None]:
wrapm = multiple_regression_big_with_penalty(
    dep_var, ind_vars, stints, weights=weights, constant=True, penalty=0.001)

wrapm.plot.hist(bins=50)

wrapm_HCA = wrapm['Intercept']  # intercept of the weighted + penalized fit
player_df['wRAPM'] = wrapm[players]
print("Home Court Advantage for Net Rating: {:.2f}".format(wrapm_HCA))
print()
print("Top 20 by wRAPM\n" + 40*"=")
print(player_df.sort_values('wRAPM', ascending=False).head(20).to_string())
print()
print("Bottom 20 by wRAPM\n" + 40*"=")
print(player_df.sort_values('wRAPM', ascending=True).head(20).to_string())

### Compare to ESPN's RPM

We can compare our results to ESPN's Real Plus/Minus statistic.

Compared against overall RPM from 2014-15, our rating is actually not that bad.  We're overrating players a bit and maybe using more years would help.  RPM actually uses box score data and some biographic data to help stabilize the regression further.  We are working purely with lineup data, so if they are doing things well, that extra data will improve things for them.

Also note that ESPN produces Offensive RPM and Defensive RPM. To do this, we need to break up the stint data into offense and defense performance and have _two_ effects for each player, one for offense and one for defense.

They also convert RPM to Wins, presumably using something like the Pythagorean expectation formula.  Kevin Pelton's WARP statistic does similarly.
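For reference, the Pythagorean expectation turns points scored and allowed into an expected winning percentage. The sketch below is only an illustration of that formula, not ESPN's or Pelton's actual conversion. The exponent is an assumption: values of roughly 14 to 16.5 are commonly quoted for the NBA.

```python
def pythagorean_win_pct(points_for, points_against, exponent=14.0):
    """Expected win fraction: PF^k / (PF^k + PA^k).

    The default exponent is an assumed, commonly quoted NBA value.
    """
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)

# Hypothetical team scoring 105 and allowing 100 points per 100 possessions.
win_pct = pythagorean_win_pct(105.0, 100.0)
expected_wins = 82 * win_pct   # over an 82-game season

print(f"win pct: {win_pct:.3f}, expected wins: {expected_wins:.1f}")
```

A player rating in net points per 100 possessions could be plugged in by shifting an average team's scoring margin, though the details of how ESPN does this are not given here.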

In [None]:
rpm = pd.read_csv('rpm.csv', index_col='RK')
rpm