---
title: "An introduction to linelist"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{An introduction to linelist}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
library(linelist)
```
# Motivations
Outbreak analytics pipelines often start with *case line lists*: data tables in
which every row is a different case/patient, and columns record variables of
potential epidemiological interest such as dates of events (e.g. symptom onset,
case notification), disease outcome, or patient data (e.g. age, sex,
occupation). Such data are typically held in a `data.frame` (or a `tibble`) and
used in various downstream analyses. While this approach is functional, it often
means that each analysis step must:
1. check that the required inputs are present in the data, with the user
   specifying where they are (e.g. '*This is the column where dates of onset are
   stored.*')
2. validate the required data (e.g. '*Check that the field storing dates of
   onset does contain dates, and not `character` values.*')
The aim of *linelist* is to take care of these prerequisites once and for all
before downstream analyses, thus helping to make data pipelines more robust and
straightforward.
# linelist in a nutshell
## Outline
*linelist* is an R package which implements a basic data representation for case
line lists, alongside accessors and basic methods. It essentially provides three
types of functionality:
1. **tagging**: a *tags* system lets you pre-identify key epidemiological
variables needed in downstream analyses (e.g. dates of case notification,
symptom onset, age, gender, disease outcome)
2. **validation**: functions checking that tagged variables are indeed present
in the `data.frame/tibble`, and that they have the expected type
(e.g. checking that dates are `Date`, `integer` or `numeric`)
3. **secured methods**: generic functions which could lead to the loss of tagged
variables have dedicated methods for *linelist* objects with adapted
behaviours, either updating tags as needed (*e.g.* `rename()`, `names() <-
...`) or issuing warnings/errors when tagged variables are lost (*e.g.*
`select()`, `x[]`, `x[[]]`)
## Should I use *linelist*?
*linelist* is designed to add a robust, foundational layer to your data
pipelines, but it might add unnecessary complexity to your analysis
scripts. Here are a few hints to gauge if you should consider using the package.
**You may have a use for *linelist* if ...**:
* your data changes/updates over time (e.g. new entries, new variables, renamed
variables)
* you build data pipelines entailing multiple layers of data processing and
analysis
* you are looking to build reusable analysis scripts, *i.e.* scripts which will
  work on other datasets with minimal changes
**Conversely, you probably do not need it if ...**:
* you work on historical data, which has likely already been curated/validated
and will no longer change
* you perform some quick, simple analysis of your data, which you will not need
to expand on later
* your analysis scripts are very specific and will not be re-used elsewhere
# Getting started
## Installation
Our stable versions are released periodically on CRAN, and can be installed
using:
```{r, eval=FALSE}
install.packages("linelist")
```
If you prefer using the latest features and bug fixes, you can alternatively
install the development version of *linelist* from [GitHub](https://github.com/)
using the following commands:
```r
# install 'remotes' first if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("epiverse-trace/linelist", build_vignettes = TRUE)
```
Once installed, you can load the package in your R session using:
```{r}
library(linelist)
```
## Key functionalities
A `linelist` object is an instance of a `data.frame` or a `tibble` in which key
epidemiological variables have been *tagged*. The main features of the package
are broken down into the three categories outlined above.
### Tagging system
Tags are key-value pairs mapping a reference epidemiological variable to the
name of a column in a `data.frame` or `tibble`. The tagging system lets you
construct `linelist` objects, modify tags in existing objects, and check and
access existing tags and the corresponding variables.
* `make_linelist()`: to create a `linelist` object by tagging key epi variables in
a `data.frame` or a `tibble`
* `set_tags()`: to add, remove, or modify tags in a `linelist`
* `tags()`: to list variables which have been tagged in a `linelist`
* `tags_names()`: to list all recognized tag names; details on what the tags represent can be found at [`?make_linelist`](https://epiverse-trace.github.io/linelist/reference/make_linelist.html)
* `tags_df()`: to obtain a `data.frame` of all the tagged variables in a `linelist`
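Before tagging anything, it can help to look up which tag names exist and which classes each one accepts. The snippet below is a minimal sketch; it assumes `tags_types()` returns a named list with one entry per tag, so that a single tag can be inspected with `$`.

```r
library(linelist)

# all tag names recognised by the package
tags_names()

# classes accepted for one tag, here `date_onset`
# (assumes tags_types() returns a named list indexed by tag name)
tags_types()$date_onset
```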
### Validation
Basic routines are provided to validate *linelist* objects. More advanced
validation, e.g. looking at the compatibility of dated events, will be
implemented in a separate package.
* `validate_tags()`: check that tagged variables are present in the dataset,
  and that tags match the pre-defined list of tag names
* `validate_types()`: check that tagged variables have an acceptable class,
as defined in `tags_types()`
* `validate_linelist()`: general validation of *linelist* objects, equivalent to
running both `validate_tags()` and `validate_types()`, and checking the class
of the object
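As a rough illustration of the validation step, the sketch below tags a small made-up dataset and validates it, then repeats the exercise with the onset dates stored as `character`. The data and the column names (`case_id`, `onset`, `age_years`) are hypothetical; the example assumes that `make_linelist()` does not check types itself and that `validate_linelist()` signals an error when a tagged column has an unexpected class.

```r
library(linelist)

# hypothetical toy data, for illustration only
cases <- data.frame(
  case_id = 1:3,
  onset = as.Date(c("2024-01-02", "2024-01-05", "2024-01-09")),
  age_years = c(34, 51, 8)
)

x <- make_linelist(
  cases,
  id = "case_id",
  date_onset = "onset",
  age = "age_years"
)

# passes silently when tags and types are as expected
validate_linelist(x)

# same data, but onset dates stored as character
cases$onset <- as.character(cases$onset)
bad <- make_linelist(
  cases,
  id = "case_id",
  date_onset = "onset",
  age = "age_years"
)

# catch the validation error rather than stopping the script
status <- tryCatch(
  {
    validate_linelist(bad)
    "valid"
  },
  error = function(e) "invalid"
)
status
```

Wrapping the call in `tryCatch()` is only one option; in a pipeline it is often preferable to let the error stop execution.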
### Secured methods
These are dedicated S3 methods for existing generics which can be used to
prevent the loss of tagged variables.
* `lost_tags_action()`: to set the behaviour to adopt when tagged variables
would be lost by an operation: issue a warning (default), an error, or ignore
* `get_lost_tags_action()`: to check the current behaviour for lost tagged
variables
* `names<-()`: the 'base R' approach to renaming columns of a `linelist`;
will rename tags as needed to match the new column names
* `x[]` and `x[[]]`: for subsetting columns using 'base R' syntax; will
  behave according to `get_lost_tags_action()` if tagged variables are lost
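The renaming behaviour of `names<-()` can be sketched on a small made-up dataset. The data and column names below are hypothetical; the point is that after a column is renamed, the tag should follow the new column name.

```r
library(linelist)

# hypothetical toy data, for illustration only
cases <- data.frame(case_id = 1:3, age_years = c(34, 51, 8))
x <- make_linelist(cases, id = "case_id", age = "age_years")

# rename a tagged column using base R syntax
names(x)[names(x) == "age_years"] <- "age"

# the `age` tag should now point to the renamed column
tags(x)
```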
# Worked example
## Example dataset
In this example, we use the case line list of the Hagelloch 1861 measles
outbreak, distributed by the *outbreaks* package as `measles_hagelloch_1861`.
```{r}
data(measles_hagelloch_1861, package = "outbreaks")
# overview of the data
head(measles_hagelloch_1861)
```
## Creating a *linelist* object
Let us assume we want to tag the following variables to facilitate downstream
analyses, after having checked their tag name in `?make_linelist`:
* the date of symptom onset, here stored in `date_of_prodrome` (tag: `date_onset`)
* the date of death (tag: `date_death`)
* the age of the patient (tag: `age`)
* the gender of the patient (tag: `gender`)
We first load a few useful packages, and create a `linelist` with the above
information:
```{r}
library(tibble) # data.frame but with nice printing
library(dplyr) # for data handling
library(magrittr) # for the %>% operator
library(linelist) # this package!
x <- measles_hagelloch_1861 %>%
  tibble() %>%
  make_linelist(
    date_onset = "date_of_prodrome",
    date_death = "date_of_death",
    age = "age",
    gender = "gender"
  )
head(x)
```
The printing of the object confirms that the tags have been added. If we want to
double-check which variables have been tagged:
```{r}
tags(x)
```
## Changing tags
Tags can be added / removed / changed using `set_tags()`.
Let us assume we also want to record the outcome: it is currently missing, but
can be built from dates of death (missing date = survived). This can be done by
using `mutate()` on `x` to create the new variable (remember `x` is not only a
`linelist`, but also a regular `tibble` and thus compatible with `dplyr` verbs),
and setting up a new tag using `set_tags()`:
```{r}
x <- x %>%
  mutate(
    inferred_outcome = if_else(is.na(date_of_death), "survived", "died")
  ) %>%
  set_tags(outcome = "inferred_outcome")
x
```
If we wanted to undo the above, i.e. remove the `outcome` tag, we only need to
set it to `NULL`:
```{r}
x <- x %>%
  set_tags(outcome = NULL)
tags(x)
```
## Accessing tagged variables
Now that key variables have been tagged in `x`, we can use these pre-defined
fields in downstream analyses, without having to worry about variable names and
types. We can access tagged variables using any of the following means:
```{r}
# select tagged variables only
x %>%
  select(has_tag(c("date_onset", "date_death")))
# select tagged variables only with renaming on the fly
x %>%
  select(onset = has_tag("date_onset"))
# get all tagged variables in a data.frame
x %>%
  tags_df()
```
## Using safeguards
Because `x` remains a valid `tibble`, we can use any data handling operations
implemented in `dplyr`. However, some of these operations may cause accidental
removal of key tagged variables. *linelist* provides a safeguard mechanism
against this. For instance, let's assume we want to select only some columns of
`x`:
```{r}
x %>%
  select(1:2)
```
The command above issues an informative warning, because `select()` removed
some of the variables that were tagged.
We can also use the `has_tag()` select helper to select columns via their tag.
For example, to retain the first 2 variables, and the `gender` tag:
```{r}
# hybrid selection
x %>%
  select(1:2, has_tag("gender"))
```
As before, we get a warning because other tagged variables are lost in the
operation. This behaviour can be silenced if needed, or changed to issue an
error instead (for stricter pipelines, for instance):
```{r error = TRUE, purl = FALSE}
# hybrid selection - default behaviour (warning)
x %>%
  select(1:2, has_tag("gender"))
# hybrid selection - no warning
lost_tags_action("none")
x %>%
  select(1:2, has_tag("gender"))
# hybrid selection - error due to lost tags
lost_tags_action("error")
x %>%
  select(1:2, has_tag("gender"))
# note that `lost_tags_action` sets the behaviour for any later operation, so we
# need to reset the default
get_lost_tags_action() # check current behaviour
lost_tags_action() # reset default
```
### Changing tag loss action permanently
If you wish to change the `lost_tags_action` in a way that persists across R sessions, you can do so by setting the `LINELIST_LOST_ACTION` environment variable. For example, your `.Renviron` file could contain the following line:
```text
LINELIST_LOST_ACTION="error"
```
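Since `.Renviron` is only read when R starts, the setting takes effect after restarting the session. A quick way to check what value the variable holds is to read it back with base R. The sketch below sets the variable for the current session only, as a stand-in for the `.Renviron` entry; the `"warning"` fallback used when the variable is unset mirrors the default described above and is otherwise an assumption of this example.

```r
# session-only stand-in for the `.Renviron` entry (illustration only)
Sys.setenv(LINELIST_LOST_ACTION = "error")

# read the variable back; "warning" is the fallback when it is unset
current_action <- Sys.getenv("LINELIST_LOST_ACTION", unset = "warning")
current_action

# remove the session-level value again
Sys.unsetenv("LINELIST_LOST_ACTION")
```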