forked from christophM/interpretable-ml-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
04-interpretable-models.Rmd
767 lines (603 loc) · 57.7 KB
/
04-interpretable-models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
# Definitions {#definitions}
To avoid confusion through ambiguity, here are some definitions of terms used in this book.
- An **Algorithm** is a set of rules that a machine follows to achieve a particular goal [@algorithm]
- A **Machine learning algorithm** is a set of rules that a machine follows to learn how to a achieve a particular goal. The output of a machine learning algorithm is a machine learning model.
- A **(Machine learning) Model** is the outcome of a machine learning algorithm. This can be a set of weights for a linear model or neural network plus the information about the architecture.
- **Dataset**: A table containing the data from which the machine learns.
- **Features**: The features/information used for prediction/classification/clustering. A feature is one column in the dataset.
- **Target**: The thing the machine learns to predict.
- **(machine learning) Task**: The combination of a dataset with features and a target. Depending on the type of the target, the task can be classification, regression, survival analysis, clustering or outlier detection.
- **Prediction**: The machine learning models "guess" what the target's value should be based on the given features.
- **Instance**: One row in the dataset.
# Interpretable Models {#simple}
The most straightforward way to get to interpretable machine learning is to use only a subset of algorithms that create interpretable models.
Very common model types of this group of interpretable models are:
* Linear regression model
* Logistic regression
* Decision trees
In the following chapters we will talk about these models.
Not in detail, only the basics, because there are already a ton of books, videos, tutorials, papers and more material.
We will focus on how to interpret the models.
The chapter covers linear models, logistic regression and decision trees in details and lists some more.
All of the interpretable models types explained in this book are interpretable on a modular level, except for the k-nearest neighbours method.
The following table gives an overview over the interpretable model types and their properties.
A model is linear if the association between features and target is modelled linearly.
A monotonic model ensures that the relationship between a feature and the target outcome is always in the same direction over the whole range of the feature:
an increase in the features value will consistently lead to either an increase or a decrease of the target outcome, but never both for this feature.
Monotonicity is useful for the interpretation of a model, because it makes it easier to understand a relationship.
Some models can automatically include interactions between the features for predicting the outcome.
You can always include interactions into any kind of model by manually creating interaction features.
This can be important for correctly predicting the outcome, but too many or too complex interactions can hurt interpretability.
Some models only handle regression, some only classification and some can manage to do both.
You can use this table to choose a suitable interpretable model for your task.
| Algorithm |Linear |Monotonicity |Interactions | Task type|
|:--------------|:----|:----|:------|:--------|
| Linear models | Yes | Yes | No | Regression |
| Logistic regression | No | Yes | No | Classification |
| Decision trees | No | No | Yes | Class. + Regr. |
| Naive Bayes | Yes | Yes | No | Classification |
| k-nearest neighbours | No | No | No | Class. + Regr.|
| RuleFit| Partially | No | Yes| Class. + Regr.|
## Terminology
- Y is the target outcome.
- X are the features (also called variables, covariables, covariates or inputs).
- $\beta$ are regression weights (also called coefficients).
## Linear Model {#limo}
Linear models have been used since a long time by statisticians, computer scientists and other people with quantitative problems.
Linear models learn linear (and therefore monotonic) relationships between the features and the target.
The linearity of the learned relationship makes the interpretation easy.
Linear models can be used to model the dependency of a regression target $y$ on $p$ features $x$.
The learned relationships are linear and, for a singular instance $i$, can be written as:
$$y_{i} = \beta_{0} + \beta_{1} \cdot x_{i1} + \ldots + \beta_{p} x_{ip} + \epsilon_{i}$$
The i-th instance's outcome is a weighted sum of it's $p$ features.
The $\beta_{j}$ represent the learned feature weights or coefficients.
The $\epsilon_{i}$ is the error we are still making, the difference between the predicted and actual outcome.
Different methods can be used to estimate the optimal weight vector $\hat{\boldsymbol{\beta}}$.
The ordinary least squares method is commonly used to find the weights that minimise the squared difference between the actual and the estimated outcome:
$$\hat{\boldsymbol{\beta}} = \arg\!\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j x_{ij}\right)\right)$$
We won't go into detail about how the optimal weights can be found, but if you are interested you can read Chapter 3.2 of the book "Elements of Statistical Learning" [@Hastie2009] or one of the other zillions of sources about linear regression models.
The biggest advantage of linear regression models is the linearity:
It makes the estimation procedure straightforward and most importantly these linear equations have an easy to understand interpretation on a modular level (i.e. the weights).
That is one of the main reasons why the linear model and all similar models are so widespread in academic fields like medicine, sociology, psychology and many more quantitative research fields.
In this areas it is important to not only predict e.g. the clinical outcome of a patient, but also to quantify the influence of the medication while at the same time accounting for things like sex, age and other features in an interpretable manner.
Linear regression models also come with some assumptions that make them easy to use and interpret but which are often not satisfied in reality.
Assumed is: Linearity, normality, homoscedasticity, independence, fixed features and absence of multicollinearity.
- **Linearity**: Linear regression models force the estimated response to be a linear combination of the features, which is both the greatest strength and biggest limitation. Linearity leads to interpretable models: linear effects are simple to quantify and describe (see also next chapter) and are additive, so it is easy to separate the effects. If you suspect interactions of features or a non-linear association of a feature with the target value, then you can add interaction terms and use techniques like regression splines to estimate non-linear effects.
- **Normality**: The target outcome given the features are assumed to follow a normal distribution. If this assumption is violated, then the estimated confidence intervals of the feature weights are not valid. Any interpretation of the features p-values is not valid.
- **Homoscedasticity** (constant variance): The variance of the error terms $\epsilon_{i}$ is assumed to be constant over the whole feature space. Let's say you want to predict the value of a house given the living area in square meters. You estimate a linear model, which assumes that no matter how big the house, the error terms around the predicted response have the same variance. This assumption is often violated in reality. In the house example it is plausible that the variance of error terms around the predicted price is higher for bigger houses, since also the prices are higher and there is more room for prices to vary.
- **Independence**: Each instance is assumed to be independent from the next one. If you have repeated measurements, like multiple records per patient, the data points are not independent from each other and there are special linear model classes to deal with these cases, like mixed effect models or GEEs.
- **Fixed features**: The input features are seen as 'fixed', carrying no errors or variation, which, of course, is very unrealistic and only makes sense in controlled experimental settings. But not assuming fixed features would mean that you have to fit very complex measurement error models that account for the measurement errors of your input features. And usually you don't want to do that.
- **Absence of multicollinearity**: Basically you don't want features to be highly correlated, because this messes up the estimation of the weights. In a situation where two features are highly correlated (something like correlation > 0.9) it will become problematic to estimate the weights, since the feature effects are additive and it becomes indeterminable to which of the correlated features to attribute the effects.
### Interpretation
The interpretation of a weights in the linear model depends on the type of the corresponding feature:
- Numerical feature: For an increase of the numerical feature $x_{j}$ by one point, the estimated outcome changes by $\beta_{j}$. An example for a numerical feature is the size of a house.
- Binary feature: A feature, that for each instance takes on one of two possible values. An example is the feature "House comes with a garden". One of the values counts as the reference level (in some programming languages coded with 0), like "No garden". A change of the feature $x_{j}$ from the reference level to the other level changes the estimated outcome by $\beta_{j}$
- Categorical feature with multiple levels: A feature with a fixed amount of possible values. An example is the feature "Floor type", with possible levels "carpet", "laminate" and "parquet". One solution to deal with many levels is to one-hot-encode them, meaning each level gets it's own column. From a categorical feature with $l$ levels, you only need $l-1$ columns, otherwise the coding is overparameterised. The interpretation for each level is then according to the binary features. Some language like R allow you to code categorical feature in different ways, see Chapter \@ref(cat.code)
- Intercept $\beta_{0}$: The intercept is the feature weight for the constant feature, which is always 1 for all instances. Most software packages automatically add this feature for estimating the intercept. The interpretation is: Given all numerical features are zero and the categorical features are at the reference level, the estimated outcome of $y_{i}$ is $\beta_{0}$. The interpretation of $\beta_{0}$ is usually not relevant, because instances with all features at zero often don't make any sense, unless the features were standardised (mean of zero, standard deviation of one), where the intercept $\beta_0$ reflects the predicted outcome of an instance where all features are at their mean.
Another important measurement for interpreting linear models is the $R^2$ measurement.
$R^2$ tells you how much of the total variance of your target outcome is explained by the model.
The higher $R^2$ the better your model explains the data.
The formula to calculate $R^2$ is: $R^2 = 1 - SSE/SST$, where SSE is the squared sum of the error terms ($SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$) and SST is the squared sum of the data variance ($SST = \sum_{i=1}^n (y_i - \bar{y})^2$).
The SSE tells you how much variance remains after fitting the linear model, which is measured by looking at the squared differences between the predicted and actual target values.
SST is the total variance of the target around the mean.
So $R^2$ tells you how much of your variance can be explained by the linear model.
$R^2$ ranges between 0 for models that explain nothing and 1 for models that explain all of the variance in your data.
There is a catch, because $R^2$ increases with the number of features in the model, even if they carry no information about the target value at all.
So it is better to use the adjusted R-squared ($\bar{R}^2$), which accounts for the number of features used in the model.
It's calculation is $\bar{R}^2 = R^2 - (1-R^2)\frac{p}{n - p - 1}$, where $p$ is the number of features and $n$ the number of instances.
It isn't helpful to do interpretation on a model with very low $R^2$ or $\bar{R}^2$, because basically the model is not explaining much of the variance, so any interpretation of the weights are not meaningful.
### Interpretation Example
In this example we use the linear model to predict the bike rentals on a day, given weather and calendrical information, see Chapter \@ref(bike-data).
For the interpretation we examine the estimated regression weights.
The features are a mix of numerical and categorical features.
```{r linear_model}
features_of_interest = c('season','holiday', 'workingday', 'weathersit', 'temp', 'hum', 'windspeed', 'days_since_2011')
X = bike.train[features_of_interest]
# colnames(X) = speed_dating_names_matches[colnames(X)]
y = bike.train[,'cnt']
dat = cbind(X, y)
#colnames(dat) = c(colnames(X), target_var_regress)
mod = lm(y ~ ., data = dat, x = TRUE)
lm_summary = summary(mod)$coefficients
pretty_rownames = function(rnames){
rnames = gsub('^`', '', rnames)
rnames = gsub('`$', '', rnames)
rnames = gsub('`', ':', rnames)
rnames
}
lm_summary_print = lm_summary
rownames(lm_summary_print) = pretty_rownames(rownames(lm_summary_print))
kable(lm_summary_print[,c('Estimate', 'Std. Error')], digits = 1)
```
Interpretation of a numerical feature ('Temperature'):
An increase of the temperature by 1 degree Celsius increases the expected number of bikes by `r sprintf('%.1f', lm_summary_print['temp', 'Estimate'])` given all other features stay the same.
Interpretation of a categorical feature ('weathersituation')):
The estimated number of bikes is `r sprintf('%.1f', lm_summary_print['weathersitRAIN/SNOW/STORM', 'Estimate'])` lower when it is rainy, snowing or stormy, compared to good weather, given that all features stay the same.
Also if the weather was misty, the expected number of bike rentals was `r sprintf('%.1f', lm_summary_print['weathersitMISTY', 'Estimate'])` lower, compared to good weather, given all other features stay the same.
As you can see in the interpretation examples, the interpretations always come with the footnote that 'all other features stay the same'.
That's because of the nature of linear models:
The target is a linear combination of the weighted features.
The estimated linear equation spans a hyperplane in the feature/target space (a simple line in the case of a single feature).
The $\beta$'s specify the slope (gradient) of the hyperplane in each direction.
The good side is, that it isolates the interpretation.
If you think of the features as turn-switches that you can turn up or down, it is nice to see what happens when you would just turn the switch for one feature.
On the bad side of things, the interpretation ignores the joint distribution with other features.
Increasing one feature, but not changing others, might create unrealistic or at least unlikely data points.
### Interpretation templates
The interpretation of the features in the linear model can be automated by using following text templates.
**Interpretation of a Numerical Feature**
An increase of $x_{k}$ by one unit increases the expectation for $y$ by $\beta_k$ units, given all other features stay the same.
**Interpretation of a Categorical Feature**
A change from $x_{k}$'s reference level to the other category increases the expectation for $y$ by $\beta_{k}$, given all other features stay the same.
### Visual parameter interpretation
Different visualisations make the linear model outcomes easier and quicker to grasp for humans.
#### Weight plot
The information of the weights table (weight estimates and variance) can be visualised.
Figure \@ref(fig:linear-weights-plot) shows a weight plot of the linear model fitted before.
```{r linear-weights-plot, fig.cap="Each row in the plot represents one feature weight. The weights are displayed as points and the 95\\% confidence intervals around the points with a line. A 95\\% confidence interval means that if the linear model would be estimated 100 times on similar data, in 95 out of 100 times, the confidence interval would cover the true weight, under the linear model assumptions (linearity, normality, homoscedasticity, independence, fixed features, absence of multicollinearity).", fig.align='center'}
coef_plot = function(mod, alpha = 0.05, remove_intercept = TRUE){
lm_summary = summary(mod)$coefficients
rownames(lm_summary) = pretty_rownames(rownames(lm_summary))
df = data.frame(Features = rownames(lm_summary),
Estimate = lm_summary[,'Estimate'],
std_error = lm_summary[,'Std. Error'])
df$lower = df$Estimate - qnorm(alpha/2) * df$std_error
df$upper = df$Estimate + qnorm(alpha/2) * df$std_error
if(remove_intercept){
df = df[!(df$Features == '(Intercept)'),]
}
ggplot(df) +
geom_vline(xintercept=0, linetype=4) +
geom_point(aes(x=Estimate, y=Features)) +
geom_segment(aes(y=Features, yend=Features, x=lower, xend=upper), arrow = arrow(angle=90, ends='both', length = unit(0.1, 'cm'))) +
scale_x_continuous('Weight estimate') +
scale_y_discrete('Feature') +
my_theme()
}
coef_plot(mod)
```
Figure \@ref(fig:linear-weights-plot) makes clear that rainy/snowy/stormy weather has a strong negative effect on the expected number of bikes.
The working day feature's weight is close to zero and the zero is included in the 95% interval, meaning it is not influencing the prediction significantly.
Some confidence intervals are very short and the estimates are close to zero, yet the features were important.
Temperature is such a candidate.
The problem about the weight plot is that the features are measured on different scales.
While for weather situation feature the estimated $\beta$ signifies the difference between good and rainy/storm/snowy weather, for temperature it signifies only an increase of 1 degree Celsius.
You can improve the comparison by scaling the features to mean zero and standard deviation of one before fitting the linear model, to make the estimated weights comparable.
#### Effect Plot
The weights of the linear model can be analysed more meaningfully, when multiplied with the actual feature values.
The weights depend on the scale of the features and will be different if you have a feature measuring some height and you switch from meters to centimetres.
The weight will change, but the actual relationships in your data will not.
Also it is important to know the distribution of your feature in the data, because if you have a very low variance, it means that almost all instances will get a similar contribution from this feature.
The effect plot can help to understand how much the combination of a weight and a feature contributes to the predictions in your data.
Start with the computation of the effects, which is the weight per feature times the feature of an instance:
$\text{effect}_{i,j} = w_{j} \cdot x_{i,j}$.
The resulting effects are visualised with boxplots:
A box in a boxplot contains the effect range for half of your data (25% to 75% effect quantiles).
The line in the box is the median effect, so 50% of the instances have a lower and the other half a higher effect on the prediction than the median value.
The lines reach up to $+/- 1.58 \text{ IQR} / \sqrt{n}$, with IQR being the inter quartile range ($q_{0.75} - q_{0.25}$).
The points are outliers.
The categorical feature effects can be aggregated into one boxplot, compared to the weight plot, where each weight gets a row.
```{r linear-effects, fig.cap="The feature effect plot shows the distribution of the effects (= feature value times feature weight) over the dataset for each feature.", fig.align='center'}
get_reference_dataset = function(dat){
df = lapply(dat, function(feature){
if(class(feature) == 'factor'){
factor(levels(feature)[1], levels = levels(feature))
} else {
0
}
})
data.frame(df)
}
get_effects = function(mod, dat){
X = data.frame(predict(mod, type = 'terms'))
# Nicer colnames
colnames(X) = gsub('^X\\.', '', colnames(X))
colnames(X) = gsub('\\.', ' ', colnames(X))
# predict with type='terms' centers the results, so we have to add the mean again
reference_X = predict(mod, newdata=get_reference_dataset(dat), type='terms')
X_star = data.frame(t(apply(X, 1, function(x){ x - reference_X[1,names(X)]})))
X_star
}
effect_plot = function(mod, dat, feature_names=NULL){
X = get_effects(mod, dat)
if(!missing(feature_names)){
rownames(X) = feature_names
}
X = gather(X)
ggplot(X) +
geom_hline(yintercept=0, linetype=4) +
geom_boxplot(aes(x=key, y=value, group=key)) +
coord_flip() +
scale_y_continuous('Feature effect') +
scale_x_discrete('Feature') +
my_theme()
}
effect_plot(mod, dat)
```
The largest contributions to the expected number of bike rentals come from temperature and the days feature, which captures the trend that the bike rental service became more popular over time. The temperature has a broad contribution distribution.
The day trend feature goes from zero to large positive contribution, because the first day in the dataset (01.01.2011) get's a very low day effect, and the estimated weight with this feature is positive (`r sprintf('%.2f', lm_summary_print['days_since_2011', 'Estimate'])`), so the effect gets higher with every day and is highest for the latest day in the dataset (31.12.2012).
Note that for effects from a feature with a negative weight, the instances with a positive effect are the ones that have a negative feature value, so days with a high negative effect of windspeed on the bike rental count have the highest windspeeds.
### Explaining Single Predictions
How much did each feature of an instance contribute towards the prediction?
This can, again, be answered by bringing together the weights and feature values of this instance and computing the effects.
An interpretation of instance specific effects is only meaningful in comparison with the distribution of each feature's effects.
```{r linear-effects-single, fig.cap="The effect for one instance shows the effect distribution while highlighting the effects of the instance of interest.", fig.align='center'}
i = 6
effects = get_effects(mod, dat)
predictions = predict(mod)
effects_i = gather(effects[i, ])
predictions_mean = mean(predictions)
# For proper indexing, names have to be removed
names(predictions) = NULL
pred_i = predictions[i]
effect_plot(mod, dat) +
geom_point(aes(x=key, y=value), color = 'red', data = effects_i, shape = 4, size=2) +
ggtitle(sprintf('Predicted value for instance: %.0f\nAverage predicted value: %.0f\nActual value: %.0f', pred_i, predictions_mean, y[i]))
```
Let's have a look at the effect realisation for the rental bike count of one instance (= one day).
Some features contribute unusually little or much to the predicted bike count, compared to the overall dataset: Temperature (`r round(X[i, 'temp'],0)` degrees) contributes less towards the predicted value compared to the average and the trend feature "days_since_2011" unusually much, because this instance is from late 2011 (`r X[i, 'days_since_2011']` days).
### Coding Categorical Features {#cat.code}
There are several ways to encode a categorical feature and the choice influences the interpretation of the $\beta$-weights.
Described in Chapter \@ref(limo) is the treatment coding, which is sufficient in most cases.
Using different codings boils down to creating different matrices (=design matrix) from your one column with the categorical feature.
This section presents three different codings, but there are many more.
The example used has six instances and one categorical feature with 3 levels.
For the first two instances, the feature takes on category A, for instances three and four category B and for the last two instances category C.
- **Treatment coding**: The $\beta$ per level is the estimated difference in $y$ compared to the reference level. The intercept of the linear model is the mean of the reference group (given all other features stay the same). The first column of the design matrix is the intercept, which is always 1. Column two is an indicator whether instance $i$ is in category B, column three is an indicator for category C. There is no need for a column for category A, because then the linear equation would be overspecified and no unique solution (= unique $\beta$'s) can be found. Knowing that an instance is neither in category B or C is enough.
$$
\begin{pmatrix}
1 & 0 & 0 \\
1 & 0 & 0 \\
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
\end{pmatrix}
$$
- **Effect coding**: The $\beta$ per level is the estimated $y$-difference from the level to the overall mean (again, given all other features are zero or the reference level). The first column is again used to estimate the intercept. The weight $\beta_{0}$ which is associated with the intercept represents the overall mean and $\beta_{1}$, the weight for column two is the difference between the overall mean and category B. The overall effect of category B is $\beta_{0} + \beta_{1}$. The interpretation for category C is equivalent. For the reference category A, $-(\beta_{1} + \beta_{2})$ is the difference to the overall mean and $\beta_{0} -(\beta_{1} + \beta_{2})$ the overall effect.
$$
\begin{pmatrix}
1 & -1 & -1 \\
1 & -1 & -1 \\
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
\end{pmatrix}
$$
- **Dummy coding**: The $\beta$ per level is the estimated mean of $y$ for each level (given all feature are at value zero or reference level). Note that the intercept was dropped here, so that a unique solution for the linear model weights can be found.
$$
\begin{pmatrix}
1 & 0 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
0 & 0 & 1 \\
\end{pmatrix}
$$
If you want to dive a bit deeper into different encodings of categorical features, checkout [this webpage](http://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) and
[this blog post](http://heidiseibold.github.io/page7/).
### The disadvantages of linear models
Linear models can only represent linear relationships.
Each **non-linearity or interaction has to be hand-crafted** and explicitly given to the model as an input feature.
Linear models are also often **not that good regarding predictive performance**, because the relationships that can be learned are so restricted and usually oversimplifies how complex reality is.
The interpretation of a weight **can be unintuitive** because it depends on all other features.
A feature with high positive correlation with the outcome $y$ and another feature might get a negative weight in the linear model, because, given the other correlated feature, it is negatively correlated with $y$ in the high-dimensional space.
Completely correlated features make it even impossible to find a unique solution for the linear equation.
An example:
You have a model to predict the value of a house and have features like number of rooms and area of the house.
House area and number of rooms are highly correlated: the bigger a house is, the more rooms it has. If you now take both features into a linear model, it might happen, that the area of the house is the better predictor and get's a large positive weight.
The number of rooms might end up getting a negative weight, because either, given that a house has the same size, increasing the number of rooms could make it less valuable or the linear equation becomes less stable, when the correlation is too strong.
### Extending Linear Models
Linear models have been extensively studied and extended to fix some of the shortcomings.
- Lasso, presented in Chapter \@ref(lasso) is a method to "pressure" weights of irrelevant features to get an estimate of zero. Having unimportant features weighted by zero is useful, because having less terms to interpret makes the model more interpretable.
- Generalised linear models (GLM) allow the target outcome to have different distributions. The target outcome is not any longer required to be normally distributed given the features, but GLMs allow you to model for example Poisson distributed count variables. Logistic regression, presented in Chapter \@ref(logistic), is a GLM for categorical outcomes.
- Generalised additive models (GAMs) are GLMs with the additional ability to allow non-linear relationships with features, while maintaining the linear equation structure (sounds paradox, I know, but it works).
You can also apply all sorts of tricks to go around some of the problems:
- Adding interactions: You can define interactions between features and add them as new features before estimating the linear model. The RuleFit algorithm, introduced in Chapter \@ref(rulefit), can add interactions automatically.
- Adding non-linear terms like polynomials to allow non-linear relationships with features.
- Stratifying data by feature and fitting linear models on subsets.
### Sparse linear models {#sparse-linear}
The examples for the linear models that I chose look all nice and tidy, right?
But in reality you might not have just a handful of features, but hundreds or thousands.
And your normal linear models?
Interpretability goes downriver.
You might even get into a situation with more features than instances and you can't fit a standard linear model at all.
The good news is that there are ways to introduce sparsity (= only keeping a few features) into linear models.
#### Lasso {#lasso}
The most automatic and convenient way to introduce sparsity is to use the Lasso method.
Lasso stands for "least absolute shrinkage and selection operator" and when added to a linear model, it performs feature selection and regularisation of the selected feature weights.
Let's review the minimization problem, the $\beta$s optimise:
$$ min_{\boldsymbol{\beta}} \left( \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \boldsymbol{\beta})^2\right)$$
Lasso adds a term to this optimisation problem:
$$ min_{\boldsymbol{\beta}} \left( \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \boldsymbol{\beta})^2 + \lambda ||\boldsymbol{\beta}||_1\right)$$
The term $||\boldsymbol{\beta}||_1$, the L1-norm of the feature vector, leads to a penalisation of large $\boldsymbol{\beta}$-values.
Since the L1-norm is used, many of the weights for the features will get an estimate of 0 and the others are shrunk.
The parameter $\lambda$ controls the strength of the regularising effect and is usually tuned by doing cross-validation.
Especially when $\lambda$ is large, many weights become 0.
#### Other Methods for Sparsity in Linear Models
A big spectrum of methods can be used to reduce the number of features in a linear model.
Methods that include a pre-processing step:
- Hand selected features: You can always use expert knowledge to choose and discard some features. The big drawback is, that it can't be automated and you might not be an expert.
- Use some measures to pre-select features: An example is the correlation coefficient. You only take features into account that exceed some chosen threshold of correlation between the feature and the target. Disadvantage is that it only looks at the features one at a time. Some features might only show correlation after the linear model has accounted for some other features. Those you will miss with this approach.
Step-wise procedures:
- Forward selection: Fit the linear model with one feature. Do that with each feature. Choose the model that works best (for example decided by the highest R squared). Now again, for the remaining features, fit different versions of your model by adding each feature to your chosen model. Pick the one that performs best. Continue until some criterium is reached, like the maximum number of features in the model.
- Backward selection: Same as forward selection, but instead of adding features, start with the model that includes all features and try out which feature you have to remove to get the highest performance increase. Repeat until some stopping criterium is reached.
I recommend using Lasso, because it can be automated, looks at all features at the same time and can be controlled via $\lambda$.
It also works for the logistic regression model for classification, which is the topic of Chapter \@ref(logistic).
## Logistic Regression {#logistic}
Logistic regression is the linear regression model made fit for classification problems.
### What's Wrong with Linear Regression Models for Classification?
The linear regression model works well in regression setups, but fails in the classification case.
Why is that?
In case of two classes, you could label one of the classes with 0 and the other with 1 and use a linear model on it and it would estimate the weights for you.
There are just a few problems with that approach:
- A linear model does not output probabilities, but it treats the classes as numbers (0 and 1) and fits the best hyperplane (if you have one feature, it's a line) that minimizes the distances between the points and the hyperplane. So it simply interpolates between the points, but there is no meaning in it and you cannot interpret it as probabilities.
- Also a linear model will extrapolate the features and give you values below zero and above one, which are not meaningful and should tell you that there might be a more clever approach to classification.
- Since the predicted outcome is not a probability but some linear interpolation between points there is no meaningful threshold at which you can distinguish one class from the other. A good illustration of this issue was given on [Stackoverflow](https://stats.stackexchange.com/questions/22381/why-not-approach-classification-through-regression), which I reproduced in Figure \@ref(fig:linear-class-threshold)
- Linear models don't extend to classification problems with multiple classes. You would have to start labeling the next class with a 2, then 3 and so on. The classes might not have any meaningful order, but the linear model would force a weird structure on the relationship between the features and your class predictions. So for a feature with a positive weight, the higher the value of that feature the more it contributes to the prediction of a class with a higher number, even if classes that happened to get a similiar number are not related at all.
```{r linear-class-threshold, fig.cap="An illustration why linear regression does not work well in a binary classification setting. A linear model is fitted on artificial data for classifying a tumour as malignant (1) or benign (0), dependant on the size of the tumour. Each point is a tumour, the x-axis shows the size of the tumour, the y-axis the malignancy, points are slightly jittered to reduce over-plotting. The lines display the fitted curve from a linear model. In the data setting on the left, we can use 0.5 as a threshold for the predicted outcome of the linear model for separating benign from malignant tumours. After introducing a few more malignant tumour cases, especially with larger tumour sizes, the regression line shifts and a threshold of 0.5 does not separate the classes any longer.", out.width = '80%'}
df = data.frame(x = c(1,2,3,6,7,8,9),
y = c(0,0,0,1,1,1,1),
case = '0.5 threshold ok')
df_extra = data.frame(x=c(df$x, 7, 7, 7, 4, 20, 19),
y=c(df$y, 1,1,1,1, 1,1),
case = '0.5 threshold not ok')
df.lin.log = rbind(df, df_extra)
p1 = ggplot(df.lin.log, aes(x=x,y=y)) +
geom_point(position = position_jitter(width=0, height=0.02)) +
geom_smooth(method='lm', se=FALSE) +
my_theme() +
scale_y_continuous('Tumour class', breaks = c(0, 0.5, 1), labels = c('benign tumour', '0.5', 'malignant tumour'), limits = c(-0.1,1.3)) +
scale_x_continuous('Tumour size') +
facet_grid(. ~ case) +
geom_hline(yintercept=0.5, linetype = 3)
p1
```
### Logistic Regression
A solution for classification is logistic regression.
Instead of fitting a straight line or hyperplane, the logistic regression model uses a non-linear function, the logistic function to squeeze the output of a linear equation between 0 and 1.
The logistic function is defined as:
$$ \text{logistic}(\eta) = \frac{1}{1 + exp(-\eta)}$$
And it looks like shown in Figure \@ref(fig:logistic-function).
```{r, logistic-function, fig.cap="The logistic function. It only outputs numbers between 0 and 1. At input 0 it outputs 0.5.", out.width = '80%'}
logistic = function(x){1 / (1 + exp(-x))}
x = seq(from=-6, to = 6, length.out = 100)
df = data.frame(x = x,
y = logistic(x))
ggplot(df) + geom_line(aes(x=x,y=y)) + my_theme()
```
The step from linear regression models to logistic regression is kind of straightforward. In the linear regression model we modelled the relationship between the outcome and the features with a linear equation:
$$\hat{y}_{i} = \beta_{0} + \beta_{1} \cdot x_{i,1} + \ldots + \beta_{p} x_{i,p} $$
For the classification we prefer probabilities, which are between 0 and 1, so we wrap the right side of the equation into the logistic regression function and like that force the output to only take on values between 0 and 1.
$$P(y_{i}=1) = \frac{1}{1 + exp(-(\beta_{0} + \beta_{1} \cdot x_{i,1} + \ldots + \beta_{p} x_{i,p}))}$$
Let's revisit the tumour size example from Figure \@ref(fig:linear-class-threshold) again.
But now instead of the linear regression model, we use the logistic regression model:
```{r logistic-class-threshold, fig.cap="The logistic regression model successfully finds the correct decision boundary to distinguish between malignant and benign tumours dependent on the size of the tumour in this example. The blue line is the logistic function shifted and squeezed so that it fits the data."}
logistic1 = glm(y ~ x, family = binomial, data = df.lin.log[df.lin.log$case == '0.5 threshold ok',])
logistic2 = glm(y ~ x, family = binomial, data = df.lin.log)
lgrid = data.frame(x = seq(from=0, to=20, length.out=100))
lgrid$y1_pred = predict(logistic1, newdata = lgrid, type='response')
lgrid$y2_pred = predict(logistic2 , newdata = lgrid, type='response')
p1 = ggplot(df.lin.log, aes(x=x,y=y)) +
geom_line(aes(x=x, y=y1_pred), data = lgrid, color='blue', size=1) +
geom_point(position = position_jitter(width=0, height=0.02)) +
my_theme() +
scale_y_continuous('Tumour class', breaks = c(0, 0.5, 1), labels = c('benign tumour', '0.5', 'malignant tumour'), limits = c(-0.1,1.3)) +
scale_x_continuous('Tumour size') +
facet_grid(. ~ case) +
geom_hline(yintercept=0.5, linetype = 3)
p1
```
It works better with logistic regression and we can use 0.5 as a threshold in both cases. Including the additional points does not affect the estimated curve much.
### Interpretation
The interpretation of the logistic regression weights differs from the linear regression case, because in logistic regression the outcome is a probability between 0 and 1, and the weights don't affect the probability linearly, but are squeezed through the logistic function.
That's why we need to reformulate the equation for the interpretation, so that there is only the linear term left on the right side of the formula.
$$log\left(\frac{P(y_{i}=1)}{1 - P(y_{i}=1)}\right) = log\left(\frac{P(y_{i}=1)}{ P(y_{i}=0)}\right) = \beta_{0} + \beta_{1} x_{i,1} + \ldots + \beta_{p} x_{i,p}$$
$\frac{P(y_{i}=1)}{1 - P(y_{i}=1)}$ is also called odds (probability of event divided by probability of no event) and $log\left(\frac{P(y_{i}=1)}{1 - P(y_{i}=1)}\right)$ is called log odds.
So with a logistic regression model we have a linear model for the log odds.
Great!
Doesn't sound helpful!
Well, with a bit of shuffling again, you can find out how the prediction changes, when one of the features $x_j$ is changed by 1 point.
For this we can first apply the $exp()$ function on both sides of the equation:
$$\frac{P(y_{i}=1)}{(1 - P(y_{i}=1))} = odds_i = exp\left(\beta_{0} + \beta_{1} \cdot x_{i,1} + \ldots + \beta_{p} x_{i,p}\right)$$
Then we compare what happens when we increase one of the $x_{i,j}$ by 1.
But instead of looking at the difference, we look at the ratio of the two predictions:
$$ \frac{odds_{i, x_j + 1}}{odds_i}= \frac{exp\left(\beta_{0} + \beta_{1} x_{i,1} + \ldots + \beta_{j} (x_{i,j} + 1) + \ldots+ \beta_{p} x_{i,p}\right)}{exp\left(\beta_{0} + \beta_{1} x_{i,1} + \ldots + \beta_{j} x_{i,j} + \ldots+ \beta_{p} x_{i,p}\right)}$$
Using the rule that $\frac{exp(a)}{exp(b)} = exp(a - b)$ gives us:
$$
\begin{aligned}
\frac{odds_{i, x_j + 1}}{odds_i} & = & exp\left(\beta_{0} + \beta_{1} \cdot x_{i,1} + \ldots + \beta_{j} \cdot (x_{i,j} + 1) + \ldots+ \beta_{i,p}x_{i,p}\right) \\
& & - exp\left(\beta_{0} + \beta_{1} \cdot x_{i,1} + \ldots + \beta_{j} \cdot x_{i,j} + \ldots+ \beta_{p} x_{i,p}\right)
\end{aligned}
$$
And then we can remove a lot of terms from the equation, which is convenient:
$$ \frac{odds_{i, x_j + 1}}{odds_i}= exp\left( \beta_{j} (x_{i,j} + 1) - \beta_{j} x_{i,j} \right) = exp\left(\beta_j\right)$$
And we end up with something simple like $\exp(\beta_j)$.
So a change of $x_j$ by one unit changes the odds ratio (multiplicatively) by a factor of $\exp(\beta_j)$.
We could also interpret it this way:
A change in $x_j$ by one unit changes the log odds ratio by $\beta_j$ units, but most people do the former because thinking about the $log()$ of something is known to be hard on the brain.
Interpreting the odds ratio already needs a bit of getting used to.
For example if you have odds of 2, it means that the probability for $y_i = 1$ is twice as big as $y_i = 0$.
If you have a $\beta_j$ (=odds ratio) of $0.7$, then an increase in the respective $x_j$ by one unit multiplies the odds by $\exp(0.7) \approx 2$ and the odds change to 4.
But usually you don't deal with the odds and only interpret the $\beta$'s as the odds ratios.
Because for actually calculating the odds you would need to set a value for each feature $x_j$, which only makes sense if you want to look at one specific instance of your dataset.
Here are the interpretations for the logistic regression model with different feature types:
- Numerical feature: For an increase of one unit of the feature $x_{j}$, the estimated odds change (multiplicatively) by a factor of $\exp(\beta_{j})$
- Binary categorical feature: One of the two values of the feature is the reference level (in some languages the one that was coded in 0). A change of the feature $x_{j}$ from the reference level to the other level changes the estimated odds (multiplicatively) by a factor of $\exp(\beta_{j})$
- Categorical feature with many levels: One solution to deal with many possible feature values is to one-hot-encode them, meaning each level gets it's own column. From a categorical feature with L levels, you only need L-1 columns, otherwise it is over-parameterised. The interpretation for each level is then according to the binary features.
- Intercept $\beta_{0}$: Given all numerical features are zero and the categorical features are at the reference level, the estimated odds are $\exp(\beta_{0})$. The interpretation of $\beta_{0}$ is usually not relevant.
### Example
We use the logistic regression model to predict cervical cancer given some risk factors.
The data are described in Chapter \@ref(cervical-data)
```{r logistic-example}
neat_cervical_names = c('Intercept', 'Hormonal contraceptives y/n',
'Smokes y/n', 'Num. of pregnancies',
'Num. of diagnosed STDs',
'Intrauterine device y/n')
# Load cervical data
source(sprintf('%s/create-cervical-cancer-data.R', src_dir))
# Fit logistic model for probability of cancer, use few features that are interesting
mod = glm(Biopsy ~ Hormonal.Contraceptives + Smokes + Num.of.pregnancies + STDs..Number.of.diagnosis + IUD,
data = cervical, family = binomial())
# Print table of coef, exp(coef), std, p-value
coef.table = summary(mod)$coefficients[,c('Estimate', 'Std. Error')]
coef.table = cbind(coef.table, 'Odds ratio' = as.vector(exp(coef.table[, c('Estimate')])))
# Interpret one numerical and one factor
rownames(coef.table) = neat_cervical_names
colnames(coef.table)[1] = 'Weight'
kable(coef.table[, c('Weight', 'Odds ratio', 'Std. Error')], digits=2, caption='The results from fitting a logistic regression model on the cervical cancer dataset. Shown are the features used in the model, their estimated weights and according odds ratios and the standard errors of the estimated weights.')
```
Interpretation of a numerical feature ('Num. of diagnosed STDs'):
An increase of the number of diagnosed STDs (sexually transmitted diseases) changes (decreases) the odds for cancer vs. no cancer multiplicatively by `r sprintf('%.2f', coef.table['Num. of diagnosed STDs', 'Odds ratio'])`, given all other features stay the same.
Keep in mind that correlation does not imply causation.
No recommendation here to get STDs.
Interpretation of a categorical feature ('Hormonal contraceptives y/n'):
For women with hormonal contraceptives, the odds for cancer vs. no cancer are by a factor of `r sprintf('%.2f', coef.table['Hormonal contraceptives y/n', 'Odds ratio'])` higher, compared to women without hormonal contraceptives, given all other features stay the same.
Again as in the linear models, the interpretations are always coming with the clause that 'all other features stay the same'.
## Decision Tree {#tree}
Linear regression models and logistic regression fail in situations where the relationship between features and outcome is non-linear or where the features are interacting with each other.
Time to shine for the decision trees!
Before you read about decision trees, have a look at Figure \@ref(fig:tree-artificial) for an illustration.
Tree-based models split the data according to certain cutoff values in the features multiple times.
Splitting means that different subsets of the dataset are created, where each instance belongs to one subset.
The final subsets are called terminal or leaf nodes and the intermediate subsets are called internal nodes or split nodes.
For predicting the outcome in each leaf node, a simple model is fitted with the instances in this subset (for example the subsets average target outcome).
Trees can be used for classification and regression.
There are a lot of tree algorithms with different approaches for how to grow a tree.
They differ in the possible structure of the tree (e.g. number of splits per node), criteria for how to find the splits, when to stop splitting and how to estimate the simple models within the leaf nodes.
Classification and regression trees (CART) is one of the more popular algorithms for tree induction.
We will focus on CART, but the interpretation is similar for most of the tree types.
I recommend the book 'The elements of statistical learning' [@Hastie2009] for a more detailed introduction.
```{r tree-artificial, fig.cap="Decision tree with artificial data. Instances with a value bigger than 3 for feature x1 end up in node 5. All other instances are assigned to node 3 or node 4, depending whether feature x2 values exceed 1."}
set.seed(42)
n = 100
dat_sim = data.frame(feature_x1 = rep(c(3,3,4,4), times = n), feature_x2 = rep(c(1,2,2,2), times = n), y = rep(c(1, 2, 3, 4), times = n))
dat_sim = dat_sim[sample(1:nrow(dat_sim), size = 0.9 * nrow(dat_sim)), ]
dat_sim$y = dat_sim$y + rnorm(nrow(dat_sim), sd = 0.2)
ct = ctree(y ~ feature_x1 + feature_x2, dat_sim)
plot(ct, inner_panel = node_inner(ct, pval = FALSE), type='simple')
```
The following formula describes the relationship between outcome $y$ and the features $x$.
$$\hat{y}_i = \hat{f}(x_i) = \sum_{m = 1}^M c_m I\{x_i \in R_m\}$$
Each instance $x_i$ falls into exactly one leaf node (=subset $R_m$). $I_{\{x_i \in R_m\}}$ is the identity function which returns 1 if $x_i$ is in the subset $R_m$ and else 0.
If $x_i$ falls into a leaf node $R_l$, the predicted outcome is $\hat{y} = c_l$, where $c_l$ is the mean of all the training instances in leaf node $R_l$.
But where do the subsets come from?
This is quite simple:
The algorithm takes a feature and tries which cut-off point minimises the sum of squares of $y$ for a regression tasks or the Gini index of the class distribution of $y$ for classification tasks.
The best cut-off point makes the two resulting subsets as different as possible in terms of the target outcome.
For categorical features the algorithm tries to build subsets by trying different groupings of categories.
After this was done for each feature, the algorithm looks for the feature with the best cut-off and chooses it to split the node into two new nodes.
The algorithm continues doing this recursively in both of the new nodes until a stopping criterium is reached.
Possible criteria are:
A minimum number of instances that have to be in a node before the split or the minimum number of instances that have to be in a terminal node.
### Interpretation
The interpretation is simple:
Starting from the root node you go to the next nodes and the edges tell you which subsets you are looking at.
Once you reach the leaf node, the node tells you the predicted outcome.
All the edges are connected by 'AND'.
Template: If feature x is [smaller/bigger] than threshold c AND ..., then the predicted outcome is $\hat{y}_{\text{leafnode}}$.
### Interpretation Example
Let's have a look again at the bike rental data from Chapter \@ref(bike-data).
We want to predict the number of bike rentals on a given day. The learned tree is visualised in Figure \@ref(fig:tree-example).
```{r tree-example, fig.cap="Regression tree fitted on the bike rental data. The maximally allowed depth for the tree was set to 2. The features picked for the tree splits were the trend feature (days since 2011) and the temperature (temp)"}
# increases readability of tree
x = rpart(y ~ ., data = na.omit(dat), method = 'anova', control = rpart.control(cp = 0, maxdepth = 2))
xp = as.party(x)
plot(as.simpleparty(xp), type='simple')
```
The first split and one of the second splits was done in the trend feature, which tells how many days passed since beginning of the data collection and covers the trend that the bike rental service became more popular over time.
For days that came before the 105th day the predicted number of bike rentals is ca. 1800, between the 106th and 430th day it is around 3900.
For days after the 430th day, depending on the temperature, the prediction is either 4600 (if below 12 degrees) or 6600 (if above 12 degrees).
### Advantages
The tree structure is perfectly suited to **cover interactions** between features in the data.
The data also ends up in **distinct groups**, which are often easier to grasp than points on a hyperplane like in linear regression.
The interpretation is arguably pretty straightforward.
The tree structure also has a **natural visualization**, with it's nodes and edges.
### Disadvantages
**Handling of linear relationships**, that's what trees suck at.
Any linear relationship between an input feature and the outcome has to be approximated by hard splits, which produces a step function.
This is not efficient.
This goes hand in hand with **lack of smoothness**.
Slight changes in the input feature can have a big impact on the predicted outcome, which might not be desirable.
Imagine a tree that predicts the value of a house and the tree splits in the square meters multiple times.
One of the splits is at 100.5 square meters.
Imagine a user of a house prize estimator, that uses your decision tree model: She measures her house, concludes that the house has 99 square meters, types it into some nice web interface and get's a prediction of 200 000 Euro.
The user notices that she forgot to measure a small storeroom with 2 square meters.
The storeroom has a skewed wall, so she is not sure if she can count it fully towards the whole house area or only half of the space.
So she decides to try both 100.0 and 101.0 square meters.
The results: 200 000 Euro and 205 000 Euro, which is quite unintuitive, because there was no change from 99 square meters to 100, but from 100 to 101.
Trees are also quite **unstable**, so a few changes in the training dataset might create a completely different tree.
That's because each split depends on the parent split.
And if a different feature gets selected as the first split feature, the whole tree structure will change.
It does not generate confidence in the model if the structure flips so easily.
## Other Interpretable Models
The list of interpretable models is evergrowing and of unkown size.
It contains simple models like linear models, decision trees and naive Bayes, but also more complex ones that combine or modify non-interpretable machine learning models to make them interpretable.
Especially publications about the latter type of models are currently created with a high frequency and it is hard to keep up with the developments.
We only tease a few additional ones in this chapter, especially the simpler and more established candidates.
### Naive Bayes classifier
The naive Bayes classifier makes use of the Bayes' theorem of conditional probabilities.
For each feature it computes the probability for a class given the value of the feature.
The naive Bayes classifier does so for each feature independently, which is the same as having a strong (=naive) assumption of independence of the features.
Naive Bayes is a conditional probability model and models the probability of a class $C_k$ in the following way:
$$ P(C_k|x) = \frac{1}{Z} P(C_k) \prod_{i=1}^n P(x_i | C_k)$$
The term $Z$ is a scaling parameter that ensures that the probabilities for all classes sum up to 1.
The probability of a class, given the features is the class probability times the probability of each feature given the class, normalized by $Z$.
This formula can be derived by using the Bayes' theorem.
Naive Bayes is an interpretable model, because of the independence assumption.
It is interpretable on the modular level.
For each classification it is very clear for each feature how much it contributes towards a certain class prediction.
### K-Nearest Neighbours
The k-nearest neighbour method can be used for regression and classification and uses the closest neighbours of a data point for prediction.
For classification it assigns the most common class among the closest $k$ neighbours of an instance and for regression it takes the average of the outcome of the neighbours.
The tricky parts are finding the right $k$ and deciding how to measure the distance between instances, which ultimately defines the neighbourhood.
This algorithm is different from the other interpretable models presented in this book, since it is an instance-based learning algorithm.
How is k-nearest neighbours interpretable?
For starters, it has no parameters to learn, so there is no interpretability on a modular level, like in linear models.
Also, it lacks global model interpretability, since the model is inherently local and there are no global weights or structures that are learned explicitly by the k-nearest neighbour method.
Maybe it is interpretable on a local level?
To explain a prediction, you can always retrieve the k-neighbours that were used for the prediction.
If the method is interpretable solely depends on the question if you can 'interpret' a single instances in the dataset.
If the dataset consists of hundreds or thousands of features, then it is not interpretable, I'd argue.
But if you have few features or a way to reduce your instance to the most important features, presenting the k-nearest neighbours can give you good explanations.
### RuleFit {#rulefit}
The RuleFit algorithm [@friedman2008predictive] is a regression and classification approach that uses decision rules in a linear model.
RuleFit consists of two components: The first component produces "rules" and the second component fits a linear model with these rules as input (hence the name "RuleFit").
It enables automatic integration of interactions between features into a linear model, while having the interpretability of a sparse linear model.
**Step 1: Rule generation**
The rules that the algorithm generates have a simple form:
For example: "if $x2 < 3$ and $x5 < 7$ then $1$ else $0$"
The rules are generated from the covariates matrix X.
You can also see the rules simply as new features based on your original features.
The RuleFit paper uses the Boston housing data in an example:
The goal is to predict the median house value in the Boston neighbourhood.
One of the rules (read: features) that is generated by RuleFit is:
"if (number of rooms $> 6.64$) and (concentration of nitric oxide $< 0.67$) then $1$ else $0$"
The interesting part is how those rules are generated:
They are derived from Decision Trees, by basically disassembling them.
Every path in a tree can be turned into a decision rule.
Figure \@ref(fig:rulefit) illustrates the rule generation.
You simply chain the binary decisions that lead to a certain node, and voilà, you have a rule.
It is desirable to generate a lot of diverse and meaningful rules. Gradient boosting is used to fit an ensemble of decision trees (by regressing or classifying $y$ with your original features $X$).
Each resulting tree is turned into multiple rules.
```{r rulefit, fig.cap="4 rules can be generated from a tree with 3 terminal nodes.", out.width="80%"}
knitr::include_graphics("images/rulefit.jpg")
```
Another way to see this step is a black box, that generates a new set of features $X'$ out of your original features $X$.
Those features are binary and can represent quite complex interactions of your original $X$.
The rules are chosen to maximise the prediction task at hand.
**Step 2: Sparse linear model**
You will get A LOT of rules from the first step (and that is what you want).
Since the first step is only a feature transformation function on your original data set you are still not done with fitting a model and also you want to reduce the number of rules.
Lasso or L1 regularised regression is a good choice in this scenario, see also Chapter \@ref(lasso).
Next to the rules, also all numerical features from your original dataset will be used in the Lasso linear model.
Every rule and numerical feature gets a weight ($\beta$).
And thanks to the regularisation, a lot of these weights will be estimated to zero.
The numerical features are added because trees suck at representing simple linear relationships between y and x.
The outcome is a linear model that has linear effects for all of the numerical features and also linear terms for the rules.
The interpretation is the same as with linear models, the only difference is that some features are now binary rules.
### And so many more ...
Many algorithms can produce interpretable models and not all can be listed here.
If you are a researcher or just a big fan and user of a certain interpretable method, that is not listed here, get in touch with me and add the method to this book!