# Deep learning for tabular data Part 2: Inspecting deep models

Two concerns commonly make people hesitate to adopt deep learning models in production.

The first stems simply from uncertainty about how to apply deep learning techniques to the tabular data that is often found in industry. This concern was addressed in my previous post on this topic. The second concern stems from the conception that deep neural networks (hereafter referred to as DNNs and used broadly to describe any sort of neural network you might wish to use) are a "black box" and not as interpretable as simpler models like linear regression or tree-based algorithms.

My aim in this post is to address the second concern by demonstrating, on a non-trivial data set, that DNNs aren't as opaque as sometimes claimed, just as simpler models are not always as interpretable as we think.

# Preamble

In this post we'll take a look at three different models trained on [MLB's Statcast data](https://baseballsavant.mlb.com/statcast_search) to predict whether a given [plate appearance](https://en.wikipedia.org/wiki/Plate_appearance) resulted in a walk or hit. The data set contains every pitch from every plate appearance in the 2018 season.

To keep this post clean I won't include the modeling code here. If you are curious, the notebook can be found on my GitHub account; just keep in mind that the models are trained to illustrate how to inspect deep learning models and not to be a model solution for this problem.

# Linear Models

As mentioned above, interpretability is an attribute often ascribed to linear regression and [its variants](https://en.wikipedia.org/wiki/Linear_regression#Generalized_linear_models). Recall the mathematical notation for such a model

$$
y = \sigma\left(\sum_{i=1}^{n}{\beta_{i}x_{i}} + \beta_{0}\right)
$$

where $\beta_{i}\in\mathbb{R}$ are the learned model coefficients, $x_{i}$ are the model features and $\sigma$ is a *link function* (strictly, the inverse of what the GLM literature calls the link function) used to map values from $(-\infty, \infty)$ to some domain more relevant to the problem at hand. In linear regression $\sigma$ is simply the identity function; in logistic regression it is the [logistic function](https://en.wikipedia.org/wiki/Logistic_function). The key is that the link function is strictly increasing and therefore [monotonic](https://en.wikipedia.org/wiki/Monotonic_function), i.e. if $a<b$ then $\sigma(a)<\sigma(b)$.

This is a straightforward formula which allows us to make certain inferences about the model predictions.

For example, it is common to infer that $|\beta_{i}| > |\beta_{j}|$ implies $x_{i}$ is in some sense more important than $x_{j}$ for predicting the target variable (given plenty of assumptions, such as the scale of the features, that I don't want to get bogged down in discussing at present).
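
To see why the scale of the features matters for this kind of comparison, here is a minimal NumPy sketch. The features and coefficients are made up for illustration and have nothing to do with the Statcast models. Rescaling a feature changes the magnitude of its coefficient without changing a single prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, beta, beta_0):
    """Logistic regression: sigma(sum_i beta_i * x_i + beta_0)."""
    return sigmoid(X @ beta + beta_0)

# Made-up features and coefficients, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
beta = np.array([0.5, -2.0])
beta_0 = 0.1

# Express the first feature in different units (e.g. feet -> inches) ...
scale = np.array([12.0, 1.0])
X_rescaled = X * scale
# ... and the matching coefficient shrinks by the same factor.
beta_rescaled = beta / scale

p_original = predict(X, beta, beta_0)
p_rescaled = predict(X_rescaled, beta_rescaled, beta_0)

print(np.allclose(p_original, p_rescaled))  # the predictions agree
print(np.abs(beta), np.abs(beta_rescaled))  # the magnitudes do not
```

Comparing $|\beta_{i}|$ across features is therefore only meaningful once the features share a comparable scale.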

Another common inference is described well by Wikipedia: "A fitted linear regression model can be used to identify the relationship between a single predictor variable $x_{j}$ and the response variable $y$ when all the other predictor variables in the model are 'held fixed'."
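
For logistic regression this "held fixed" reading can be made precise. Inverting the logistic function in the formula above gives the log-odds as a linear function of the features:

$$
\log\frac{y}{1-y} = \sum_{i=1}^{n}{\beta_{i}x_{i}} + \beta_{0}
$$

If $y'$ denotes the prediction after increasing $x_{j}$ by one unit while every other feature is held fixed, all terms except the $j$-th cancel:

$$
\log\frac{y'}{1-y'} - \log\frac{y}{1-y} = \beta_{j}
$$

So each coefficient is the change in log-odds per unit change in its feature.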

Both types of inferences have limitations, which are discussed at the end of this post. However, at present I want to discuss their usefulness, mainly as a sanity check (which I've discussed at greater length in my post on [model evaluation](https://dantegates.github.io/2019/01/07/model-evaluation-for-humans.html)).

For example, consider the following output from a logistic regression model I trained on the MLB Statcast data. The model was trained to predict whether a plate appearance resulted in a walk or hit from features of the final pitch of the PA, such as the batter, pitcher, [count](https://en.wikipedia.org/wiki/Count_(baseball)), etc. The output below shows the pitchers (with more than 150 innings pitched) with the lowest learned coefficients. Remember that in this case negative values push the logistic function toward 0, so a smaller coefficient means the pitcher is less likely to give up a walk or hit.

In [10]:
import pandas as pd
d = pd.read_csv('deep-learning-static/linear-pitcher-coefficients.csv')
d[d.IP > 150].sort_values('coef').head(10)[['player_name', 'coef']]

Unnamed: 0,player_name,coef
329,Blake Snell,-0.099976
24,Corey Kluber,-0.086561
327,Aaron Nola,-0.082084
227,Patrick Corbin,-0.072311
361,German Marquez,-0.072309
387,Luis Castillo,-0.070172
48,Max Scherzer,-0.065948
29,James Shields,-0.065294
168,Julio Teheran,-0.064711
61,J.A. Happ,-0.0517


Interestingly, we see the list led by the [2018 American League Cy Young Winner Blake Snell](https://www.mlb.com/news/blake-snell-wins-al-cy-young-award-c300735402) and including [2018 National League Cy Young Candidates Aaron Nola and Max Scherzer](https://www.si.com/mlb/2018/08/29/jacob-degrom-max-scherzer-aaron-nola-nl-cy-young-award) along with other top-performing pitchers of 2018. As a sanity check this gives us confidence that the model is learning from real trends in the data and not from noise. If we saw the list led by players posting [WHIPs](https://en.wikipedia.org/wiki/Walks_plus_hits_per_inning_pitched) of 1.5 or greater, we would have good reason for concern and should take a closer look at our model.

# Deep neural networks

Deep neural networks involve many more calculations. They allow for interactions among the features and apply non-linear transformations all along the way, which makes it more difficult to quantify exactly how a given input variable affects the prediction.

Fortunately the story doesn't stop here. As promised I'll now demonstrate a few of my favorite methods for inspecting DNNs.

## Using a "hybrid" model

If you really want to take a look at what your DNN is learning in a fashion similar to how we treated the coefficients of a linear model above, one approach is to train a "hybrid" model. For example, part of the model can be a DNN that "encodes" certain features into a real-valued vector. This "encoding" can then be used as input, alongside other features, to a linear model. In this case we can interpret the coefficients of the held-out features as we did before.

Let's take a concrete example from the Statcast data. Each plate appearance is made up of a sequence of one or more pitches. This suggests that we could feed data from each pitch of a plate appearance into a recurrent model such as an LSTM. In fact, in preparation for this post, I did exactly that. I fed sequences of 38 features describing each pitch (such as the velocity and movement of the ball, where it crossed the strike zone, the count, whether runners were on base, etc.) into an LSTM that encoded these sequences into a 32-dimensional vector. The vector was then concatenated with 1-dimensional embeddings for both the batter and pitcher to form a 34-dimensional vector. These 34 features were then fed into a standard logistic regression model, so the 1-dimensional embeddings function just like learned linear coefficients (up to the scalar weight the logistic regression assigns them).

We can look at the results just like we did before and once again find Cy Young candidates as well as starting pitchers of the 2018 World Series.
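
To make the last step concrete, here is a minimal NumPy sketch of the logistic regression head only. Everything in it is a stand-in: the LSTM encoding, the embedding tables and the weights are random numbers, and names such as `lstm_encoding` and `pitcher_embedding` are hypothetical rather than taken from the notebook. It shows that swapping one pitcher for another shifts the log-odds by a constant, which is exactly how a per-pitcher coefficient behaves in a linear model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

rng = np.random.default_rng(0)
n_batters, n_pitchers = 4, 3  # toy sizes, not the real roster counts

# Stand-in for the LSTM output: one 32-dimensional encoding per plate appearance.
lstm_encoding = rng.normal(size=(2, 32))

# 1-dimensional embedding tables: a single learned number per player.
batter_embedding = rng.normal(size=(n_batters, 1))
pitcher_embedding = rng.normal(size=(n_pitchers, 1))

# Logistic regression head over the 34 concatenated features (random stand-in weights).
w = rng.normal(size=34) * 0.1
b = 0.0

def predict(encoding, batter_id, pitcher_id):
    features = np.concatenate(
        [encoding, batter_embedding[batter_id], pitcher_embedding[pitcher_id]],
        axis=1,
    )  # shape: (n_plate_appearances, 34)
    return sigmoid(features @ w + b)

batter_id = np.array([0, 1])
p_a = predict(lstm_encoding, batter_id, np.array([0, 0]))
p_b = predict(lstm_encoding, batter_id, np.array([2, 2]))

# Swapping pitcher 0 for pitcher 2 shifts the log-odds by the same amount
# for every plate appearance, regardless of the pitches or the batter.
shift = logit(p_b) - logit(p_a)
expected = w[33] * (pitcher_embedding[2, 0] - pitcher_embedding[0, 0])
print(np.allclose(shift, expected))
```

In this sketch the effective "coefficient" for a pitcher is the product of the embedding value and the weight the head assigns to that input.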

In [14]:
d = pd.read_csv('deep-learning-static/hybrid-pitcher-coefficients.csv')
d[d.IP > 150].sort_values('prob_not_on_base').head(10)[['player_name', 'prob_not_on_base']].rename({'prob_not_on_base': 'coef'}, axis=1)

Unnamed: 0,player_name,coef
129,Blake Snell,-0.371892
230,Jacob deGrom,-0.354052
294,Sean Newcomb,-0.33182
122,Max Scherzer,-0.325935
130,David Price,-0.298555
26,Corey Kluber,-0.295807
249,Mike Minor,-0.259139
161,Jhoulys Chacin,-0.253853
1,Chris Sale,-0.212044
189,James Paxton,-0.205869


For illustration we can also look at the batters sorted by their coefficients and find the list led by [on-base-percentage](https://en.wikipedia.org/wiki/On-base_percentage) and [slugging](http://m.mlb.com/glossary/standard-stats/slugging-percentage) leaders such as Mookie Betts, Juan Soto, Jean Segura, Mike Trout and Christian Yelich.

In [12]:
d = pd.read_csv('deep-learning-static/hybrid-batter-coefficients.csv')
d.sort_values('prob_on_base', ascending=False).head(10)[['batter_mlbname', 'prob_on_base']].rename({'prob_on_base': 'coef'}, axis=1)

Unnamed: 0,batter_mlbname,coef
0,Mookie Betts,0.382955
381,Jeff McNeil,0.358202
92,Jose Altuve,0.324591
387,Juan Soto,0.291924
59,Jean Segura,0.287105
125,Lorenzo Cain,0.278372
207,Miguel Cabrera,0.274602
109,Mike Trout,0.268682
71,Charlie Blackmon,0.266222
126,Christian Yelich,0.25968


Keep in mind there's no reason to restrict this approach to categorical features such as player IDs; we could just as easily have applied it to numeric features as well.

## DNNs all the way down

If the last approach feels a bit like a cop-out to you, it does to me too.

Let's talk about how we can inspect trained models that are "DNNs all the way down."

The final model is nearly identical to the "hybrid" model, but includes the batter and pitcher embeddings as features input to the LSTM. Additionally, since these embeddings can now interact with the other features, we embed the batter and pitcher IDs into 4 and 2 dimensions respectively (with the hybrid model, embedding these features into more than one dimension would have been superfluous).

Intuitively this suggests that one way to inspect the learned embeddings is to plot them.

Since the pitcher IDs are embedded into 2 dimensions, we use the embeddings directly as $x,y$ coordinates in our plot. Now, we should be careful not to try to interpret this plot by asking questions about specific locations of pitchers.

In [1]:
from IPython.display import HTML
HTML(filename='deep-learning-static/pitcher-embeddings.html')

In [2]:
from IPython.display import HTML
HTML(filename='deep-learning-static/batter-embeddings.html')
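
The batter embeddings have 4 dimensions, so they have to be reduced to 2 before they can be drawn as a scatter plot. One common option is a PCA projection, sketched below with NumPy. Treat this as an assumption for illustration and not necessarily how the plot above was produced. The `batter_embeddings` matrix here is random stand-in data rather than weights from the trained model.

```python
import numpy as np

def project_to_2d(embeddings):
    """Project the rows of `embeddings` onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in for an (n_batters, 4) embedding matrix pulled from a trained model.
rng = np.random.default_rng(0)
batter_embeddings = rng.normal(size=(100, 4))

xy = project_to_2d(batter_embeddings)
print(xy.shape)  # (100, 2)
```

The same caution applies to the projected points: the axes and their signs are arbitrary, so relative distances between players say more than absolute positions.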

# Attention

In [3]:
import pandas as pd
pd.read_csv('deep-learning-static/Jean Segura walks.  .csv')

Unnamed: 0,des,attention,pitch_name,zone,balls,strikes,total_pitch_number
0,,1.9e-05,Split Finger,7.0,0.0,0.0,65
1,,9e-06,Knuckle Curve,14.0,0.0,1.0,66
2,,1.4e-05,4-Seam Fastball,2.0,1.0,1.0,67
3,,3.4e-05,4-Seam Fastball,12.0,1.0,2.0,68
4,,6.7e-05,Split Finger,14.0,2.0,2.0,69
5,,0.032536,4-Seam Fastball,6.0,3.0,2.0,70
6,Jean Segura walks.,0.967321,4-Seam Fastball,14.0,3.0,2.0,71


In [36]:
pd.read_csv('deep-learning-static/Mookie Betts homers (22) on a fly ball to left center field.  .csv')

Unnamed: 0,des,attention,pitch_name,zone,balls,strikes,total_pitch_number
0,,0.478748,2-Seam Fastball,14.0,0.0,0.0,1
1,,0.179872,2-Seam Fastball,5.0,1.0,0.0,2
2,Mookie Betts homers (22) on a fly ball to left...,0.341381,2-Seam Fastball,2.0,1.0,1.0,3


In [37]:
pd.read_csv('deep-learning-static/J.  T.   Realmuto grounds out, shortstop Asdrubal Cabrera to first baseman Rhys Hoskins.  .csv')

Unnamed: 0,des,attention,pitch_name,zone,balls,strikes,total_pitch_number
0,,0.063786,Cutter,3.0,0.0,0.0,1
1,,0.029418,2-Seam Fastball,14.0,0.0,1.0,2
2,,0.055858,Cutter,14.0,1.0,1.0,3
3,"J. T. Realmuto grounds out, shortstop Asdru...",0.850938,2-Seam Fastball,5.0,2.0,1.0,4
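
In each table above the `attention` column sums to (approximately) one across the pitches of the plate appearance, which is what softmax-normalized attention weights look like. The sketch below shows one simple way such weights can be computed. It is an illustration under stated assumptions and not the exact layer used in the notebook: the scoring vector, the dot-product scoring and all values are hypothetical.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
n_pitches, hidden_size = 7, 32

# Stand-ins for the LSTM hidden state at each pitch and a learned scoring vector.
hidden_states = rng.normal(size=(n_pitches, hidden_size))
score_vector = rng.normal(size=hidden_size)

scores = hidden_states @ score_vector  # one raw score per pitch
attention = softmax(scores)            # non-negative weights that sum to 1
context = attention @ hidden_states    # attention-weighted summary of the sequence

print(attention.round(3), attention.sum())
```

Under this reading, a weight near 1 on the final pitch means the summary passed to the rest of the network is dominated by that pitch's hidden state.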


# Counterfactuals

# Final comments

## A note on interpretability and evaluation