-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.qmd
167 lines (135 loc) · 5.46 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: "Inflation of Graf. (1999) Integrated Brier Score"
author: "[John Zobolas](https://github.com/bblodfon)"
date: last-modified
description: "Proper and improper IBS inflation study"
bibliography: references.bib
format:
html:
date: last-modified
code-block-bg: true
code-copy: true
code-fold: show
code-overflow: wrap
code-block-border-left: true
toc: true
toc-location: left
html-math-method: katex
page-layout: full
execute:
freeze: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
```
# Aim {-}
We investigate the inflation that may occur when using two scoring rules for evaluating survival models.
The scoring rules are the **Integrated Survival Brier Score (ISBS)** [@Graf1999], and the proposed **re-weighted version (RISBS)** [@Sonabend2022].
See [documentation details](https://mlr3proba.mlr-org.com/reference/mlr_measures_surv.graf.html#details) for their respective formulas.
The first (ISBS) is not a proper scoring rule [@Rindt2022], the second (RISBS) is [@Sonabend2022].
# Example inflation {-}
:::{.callout-note}
In this section we investigate an example where the **proper ISBS gets inflated** (i.e. too large value for the score, compared to the improper version) and show how we can avoid such a thing from happening when evaluating model performance.
:::
Load libraries:
```{r, result=FALSE, message=FALSE}
library(GGally)
library(tidyverse)
library(mlr3proba)
```
Let's use a dataset where in a particular train/test resampling the issue occurs:
```{r}
inflated_data = readRDS(file = "inflated_data.rds")
task = inflated_data$task
part = inflated_data$part
task
```
Separate train and test data:
```{r}
task_train = task$clone()$filter(rows = part$train)
task_test = task$clone()$filter(rows = part$test)
```
Kaplan-Meier of the training survival data:
```{r, message=FALSE, cache=TRUE}
autoplot(task_train) +
labs(title = "Kaplan-Meier (train data)",
subtitle = "Time-to-event distribution")
```
Kaplan-Meier of the training censoring data:
```{r, message=FALSE, cache=TRUE}
autoplot(task_train, reverse = TRUE) +
labs(title = "Kaplan-Meier (train data)",
subtitle = "Censoring distribution")
```
Estimates of the censoring distribution $G_{KM}(t)$ (values from the above figure):
```{r}
km_train = task_train$kaplan(reverse = TRUE)
km_tbl = tibble(time = km_train$time, surv = km_train$surv)
tail(km_tbl)
```
:::{.callout-important}
As we can see from the above figures and table, due to having *at least one censored observation at the last time point*, $G_{KM}(t_{max}) = 0$ for $t_{max} = 13019$.
:::
Is there an observation **on the test set** that has died (`status` = $1$) on that last time point (or after)?
```{r}
max_time = max(km_tbl$time) # max time point
test_times = task_test$times()
test_status = task_test$status()
# get the id of the observation in the test data
id = which(test_times >= max_time & test_status == 1)
id
```
Yes there is such observation!
In `mlr3proba` using `proper = TRUE` for the RISBS calculation, this observation will be weighted by $1/0$ according to the formula.
Practically, to avoid division by zero, a small value `eps = 0.001` will be used.
Let's train a simple Cox model on the train set and calculate its predictions on the test set:
```{r}
cox = lrn("surv.coxph")
p = cox$train(task, part$train)$predict(task, part$test)
```
We calculate the ISBS (improper) and RISBS (proper) scores:
```{r}
graf_improper = msr("surv.graf", proper = FALSE, id = "graf.improper")
graf_proper = msr("surv.graf", proper = TRUE, id = "graf.proper")
p$score(graf_improper, task = task, train_set = part$train)
p$score(graf_proper, task = task, train_set = part$train)
```
As we can see there is **huge difference** between the two versions of the score.
We check the *observation-wise* scores (integrated across all time points):
Observation-wise RISBS scores:
```{r}
graf_proper$scores
```
Observation-wise ISBS scores:
```{r}
graf_improper$scores
```
It is **the one observation that we identified earlier** that causes the inflation of the RISBS score - it's pretty much an outlier compared to all other values:
```{r}
graf_proper$scores[id]
```
Same is true for the improper ISBS, value is approximately x10 larger compared to the other observation-wise scores:
```{r}
graf_improper$scores[id]
```
# Solution {-}
By setting `t_max` (time horizon to evaluate the measure up to) to the $95\%$ quantile of the event times, we can solve the inflation problem of the proper RISBS score, since we will divide by a value larger than zero from the above table of $G_{KM}(t)$ values.
The `t_max` time point is:
```{r}
t_max = as.integer(quantile(task_train$unique_event_times(), 0.95))
t_max
```
Integrating up to `t_max`, the proper RISBS score is:
```{r}
graf_proper_tmax = msr("surv.graf", id = "graf.proper", proper = TRUE, t_max = t_max)
p$score(graf_proper_tmax, task = task, train_set = part$train) # ISBS
```
The score for the specific observation that had experienced the event at (or beyond) the latest training time point is now:
```{r}
graf_proper_tmax$scores[id]
```
:::{.callout-tip title="Suggestion when calculating time-integrated scoring rules"}
To avoid the inflation of RISBS and generally have a more robust estimation of both RISBS and ISBS scoring rules, we advise to set the `t_max` argument (time horizon).
This can be either study-driven or based on a meaningful quantile of the distribution of (usually event) times in your dataset (e.g. $80\%$).
:::
# References