---
title: "Working with cohorts"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Working with cohorts}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = rlang::is_installed("CirceR") & rlang::is_installed("Capr") & rlang::is_installed("duckdb"),
  comment = "#>"
)
library(CDMConnector)
library(dplyr, warn.conflicts = FALSE)
if (Sys.getenv("EUNOMIA_DATA_FOLDER") == "") Sys.setenv("EUNOMIA_DATA_FOLDER" = file.path(tempdir(), "eunomia"))
if (!dir.exists(Sys.getenv("EUNOMIA_DATA_FOLDER"))) dir.create(Sys.getenv("EUNOMIA_DATA_FOLDER"))
if (!eunomia_is_available()) downloadEunomiaData()
```
Cohorts are a fundamental building block for observational health data analysis. A "cohort" is a set of persons satisfying one or more inclusion criteria for a duration of time. If you are familiar with the idea of sets in math, then a cohort can be nicely represented as a set of person-days. In the OMOP Common Data Model we represent cohorts using a table with four columns.
| cohort_definition_id | subject_id | cohort_start_date | cohort_end_date |
|----------------------|------------|-------------------|-----------------|
| 1 | 1000 | 2020-01-01 | 2020-05-01 |
| 1 | 1000 | 2021-06-01 | 2021-07-01 |
| 1 | 2000 | 2020-03-01 | 2020-09-01 |
| 2 | 1000 | 2020-02-01 | 2020-03-01 |
: An example cohort table
A cohort table can contain multiple cohorts and each cohort can have multiple persons. There can even be multiple records for the same person in a single cohort as long as the date ranges do not overlap. In the same way that an element is either in a set or not, a single person-day is either in a cohort or not. For a more comprehensive treatment of cohorts in OHDSI check out the Cohorts chapter in [The Book of OHDSI](https://ohdsi.github.io/TheBookOfOhdsi/Cohorts.html).
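
The two properties described above (no overlapping date ranges for a person within a cohort, and a cohort as a set of person-days) can be checked on a small local table. This is a minimal sketch, assuming a hypothetical tibble `toy_example_cohort` similar to the example table and assuming that both the start and end dates count as days in the cohort; it is not part of CDMConnector.

```{r}
library(dplyr, warn.conflicts = FALSE)

# Hypothetical local cohort table (illustrative values only)
toy_example_cohort <- tibble(
  cohort_definition_id = c(1L, 1L, 1L, 2L),
  subject_id = c(1000L, 1000L, 2000L, 1000L),
  cohort_start_date = as.Date(c("2020-01-01", "2021-06-01", "2020-03-01", "2020-02-01")),
  cohort_end_date = as.Date(c("2020-05-01", "2021-07-01", "2020-09-01", "2020-03-01"))
)

# Within each cohort and person, sort records by start date and flag any
# record that starts on or before the end date of the previous record
toy_overlap_check <- toy_example_cohort %>%
  group_by(cohort_definition_id, subject_id) %>%
  arrange(cohort_start_date, .by_group = TRUE) %>%
  mutate(overlaps_previous = cohort_start_date <= lag(cohort_end_date)) %>%
  ungroup()

toy_any_overlap <- any(toy_overlap_check$overlaps_previous, na.rm = TRUE)
toy_any_overlap

# Count person-days per cohort (assumption: start and end dates are inclusive)
toy_person_days <- toy_example_cohort %>%
  mutate(days = as.integer(cohort_end_date - cohort_start_date) + 1L) %>%
  group_by(cohort_definition_id) %>%
  summarise(person_days = sum(days), .groups = "drop")
toy_person_days
```

Comparing each record only with the previous one is enough to detect whether any overlap exists once the records are sorted by start date.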
## Cohort Generation
The $n \times 4$ cohort table is created through the process of cohort *generation*. To generate a cohort on a specific CDM dataset means that we combine a *cohort definition* with a CDM to produce a cohort table. The standardization provided by the OMOP CDM allows researchers to generate the same cohort definition on any OMOP CDM dataset.
A cohort definition is an expression of the rules governing the inclusion/exclusion of person-days in the cohort. There are three common ways to create cohort definitions for the OMOP CDM.
1. The Atlas cohort builder
2. The Capr R package
3. Custom SQL and/or R code
Atlas is a web application that provides a graphical user interface for creating cohort definitions. To get started with Atlas, check out the free course on [EHDEN Academy](https://academy.ehden.eu/course/index.php) and the demo at <https://atlas-demo.ohdsi.org/>.
Capr is an R package that provides a code-based interface for creating cohort definitions. The options available in Capr exactly match the options available in Atlas and the resulting cohort tables should be identical.
There are times when more customization is needed and it is possible to use bespoke SQL or dplyr code to build a cohort. CDMConnector provides the `generate_concept_cohort_set` function for quickly building simple cohorts that can then be a starting point for further subsetting.
Atlas cohorts are represented using JSON text files. To "generate" one or more Atlas cohorts on a cdm object, first use the `read_cohort_set` function to read a folder of Atlas cohort JSON files into R. Then create the cohort table with `generate_cohort_set`. The folder can optionally contain a CSV file called "CohortsToCreate.csv" that specifies the cohort IDs and names to use. If this file doesn't exist, IDs will be assigned automatically using alphabetical order of the filenames.
```{r}
path_to_cohort_json_files <- system.file("cohorts1", package = "CDMConnector")
list.files(path_to_cohort_json_files)
readr::read_csv(file.path(path_to_cohort_json_files, "CohortsToCreate.csv"),
                show_col_types = FALSE)
```
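
When no "CohortsToCreate.csv" file is present, IDs follow the alphabetical order of the filenames. The sketch below only illustrates that ordering in base R. The filenames are hypothetical, and this is not the code that `read_cohort_set` itself runs.

```{r}
# Hypothetical JSON filenames in a cohort folder (not shipped with CDMConnector)
toy_json_files <- c("gibleed_male.json", "acetaminophen.json", "gibleed_all.json")

# Sort the filenames alphabetically and number them in that order
toy_sorted_files <- sort(toy_json_files)
toy_cohort_ids <- data.frame(
  cohort_definition_id = seq_along(toy_sorted_files),
  cohort_name = tools::file_path_sans_ext(toy_sorted_files)
)
toy_cohort_ids
```

If stable IDs matter for a study, supplying the CSV file avoids IDs shifting when a new JSON file is added to the folder.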
### Atlas cohort definitions
First we need to create our CDM object. Note that we will need to specify a `write_schema` when creating the object. Cohort tables will go into the CDM's `write_schema`.
```{r}
library(CDMConnector)
path_to_cohort_json_files <- system.file("example_cohorts",
                                         package = "CDMConnector")
list.files(path_to_cohort_json_files)
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir("GiBleed"))
cdm <- cdm_from_con(con, cdm_name = "eunomia", cdm_schema = "main", write_schema = "main")
cohort_details <- read_cohort_set(path_to_cohort_json_files) |>
  mutate(cohort_name = snakecase::to_snake_case(cohort_name))
cohort_details
cdm <- generate_cohort_set(
  cdm = cdm,
  cohort_set = cohort_details,
  name = "study_cohorts"
)
cdm$study_cohorts
```
The generated cohort has associated metadata tables. We can access these with utility functions.
- `cohort_count` contains the person and record counts for each cohort in the cohort set
- `settings` table contains the cohort id and cohort name
- `attrition` table contains the attrition information (persons and records dropped at each sequential inclusion rule)
```{r}
cohort_count(cdm$study_cohorts)
cohort_set(cdm$study_cohorts)
attrition(cdm$study_cohorts)
```
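
To make the difference between record counts and person counts concrete, here is a minimal sketch that computes both with dplyr on a hypothetical local tibble, `toy_count_cohort`. The output column names `number_records` and `number_subjects` are assumed for illustration. On a real cdm the utility functions above return these counts for you.

```{r}
library(dplyr, warn.conflicts = FALSE)

# Hypothetical cohort rows held in a local tibble (not a database table)
toy_count_cohort <- tibble(
  cohort_definition_id = c(1L, 1L, 1L, 2L),
  subject_id = c(10L, 10L, 20L, 10L),
  cohort_start_date = as.Date(c("2020-01-01", "2020-06-01", "2020-03-01", "2020-02-01")),
  cohort_end_date = as.Date(c("2020-02-01", "2020-07-01", "2020-04-01", "2020-03-01"))
)

# Records are rows; subjects are distinct persons within each cohort
toy_counts <- toy_count_cohort %>%
  group_by(cohort_definition_id) %>%
  summarise(
    number_records = n(),
    number_subjects = n_distinct(subject_id),
    .groups = "drop"
  )
toy_counts
```

A person with two non-overlapping entries in the same cohort contributes two records but only one subject.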
Note that this cohort table is still in the database, so it can be quite large. We can also join it to other CDM tables or subset the entire cdm to just the persons in the cohort.
```{r, eval=FALSE}
cdm_gibleed <- cdm %>%
  cdm_subset_cohort(cohort_table = "study_cohorts")
```
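
Joining a cohort table to another CDM table uses ordinary dplyr verbs, with `subject_id` matched to `person_id`. The sketch below shows the pattern on hypothetical local tibbles (`toy_study_cohort` and `toy_person`) rather than on database tables. The age calculation uses `format()`, which runs locally and may not translate to SQL on a database-backed table.

```{r}
library(dplyr, warn.conflicts = FALSE)

# Hypothetical local stand-ins for a cohort table and the person table
toy_study_cohort <- tibble(
  cohort_definition_id = 1L,
  subject_id = c(1L, 2L, 3L),
  cohort_start_date = as.Date(c("2010-05-01", "2012-03-15", "2015-11-30")),
  cohort_end_date = as.Date(c("2010-06-01", "2012-04-15", "2015-12-30"))
)
toy_person <- tibble(
  person_id = 1:4,
  gender_concept_id = c(8507L, 8532L, 8507L, 8532L),
  year_of_birth = c(1970L, 1985L, 1990L, 2000L)
)

# Attach person attributes to each cohort record and derive an approximate age
toy_cohort_with_person <- toy_study_cohort %>%
  inner_join(toy_person, by = c("subject_id" = "person_id")) %>%
  mutate(age_at_start = as.integer(format(cohort_start_date, "%Y")) - year_of_birth)
toy_cohort_with_person
```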
### Capr cohort definitions
Capr allows us to use R code to create the same cohorts that can be created in Atlas. This is helpful when you need to create a large number of similar cohort definitions. Below we create two cohort definitions: one with a single entry criterion and one that additionally restricts entry to males.
`generate_cohort_set` will accept a named list of Capr cohort definitions.
```{r}
library(Capr)
gibleed_concept_set <- cs(192671, name = "gibleed")
gibleed_definition <- cohort(
  entry = conditionOccurrence(gibleed_concept_set)
)
gibleed_male_definition <- cohort(
  entry = conditionOccurrence(gibleed_concept_set, male())
)
# create a named list of Capr cohort definitions
cohort_details <- list(gibleed = gibleed_definition,
                       gibleed_male = gibleed_male_definition)
# generate cohorts
cdm <- generate_cohort_set(
  cdm,
  cohort_set = cohort_details,
  name = "gibleed" # name for the cohort table in the cdm
)
cdm$gibleed
```
We should get the exact same result from Capr and Atlas if the definitions are equivalent.
Learn more about Capr at the package website <https://ohdsi.github.io/Capr/>.
```{r}
DBI::dbDisconnect(con, shutdown = TRUE)
```
### Subset a cohort
Suppose you have a generated cohort and you would like to create a new cohort that is a subset of the first. This can be done using dplyr together with the `new_generated_cohort_set` function.
First we will generate an example cohort set and then create a new cohort based on filtering the Atlas cohort.
```{r}
library(CDMConnector)
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdm_from_con(con, cdm_schema = "main", write_schema = "main")
cohort_set <- read_cohort_set(system.file("cohorts3", package = "CDMConnector"))
cdm <- generate_cohort_set(cdm, cohort_set, name = "cohort")
cdm$cohort
cohort_count(cdm$cohort)
```
As an example we will keep only people in the cohort whose cohort duration is at least 4 weeks (28 days).
Using dplyr we can write this query and save the result in a new table in the cdm.
```{r}
library(dplyr)
cdm$cohort_subset <- cdm$cohort %>%
  # only keep persons who are in the cohort at least 28 days
  filter(!!datediff("cohort_start_date", "cohort_end_date") >= 28) %>%
  # optionally you can modify the cohort_id
  mutate(cohort_definition_id = 100 + cohort_definition_id) %>%
  compute(name = "cohort_subset", temporary = FALSE, overwrite = TRUE) %>%
  new_generated_cohort_set()
cohort_count(cdm$cohort_subset)
```
In this case we can see that cohorts 1 and 5 were dropped completely and some patients were dropped from cohorts 2, 3, and 4.
Let's confirm that everyone in cohorts 1 and 5 was in the cohort for less than 28 days.
```{r}
days_in_cohort <- cdm$cohort %>%
  filter(cohort_definition_id %in% c(1, 5)) %>%
  mutate(days_in_cohort = !!datediff("cohort_start_date", "cohort_end_date")) %>%
  count(cohort_definition_id, days_in_cohort) %>%
  collect()
days_in_cohort
```
We have confirmed that everyone in cohorts 1 and 5 was in the cohort for less than 10 days.
Now suppose we would like to create a new cohort table with three different versions of the cohorts in the original cohort table. We will keep persons who are in the cohort for at least 2 weeks, 3 weeks, and 4 weeks. We can simply write some custom dplyr to create the table and then call `new_generated_cohort_set` just like in the previous example.
```{r}
cdm$cohort_subset <- cdm$cohort %>%
  filter(!!datediff("cohort_start_date", "cohort_end_date") >= 14) %>%
  mutate(cohort_definition_id = 10 + cohort_definition_id) %>%
  union_all(
    cdm$cohort %>%
      filter(!!datediff("cohort_start_date", "cohort_end_date") >= 21) %>%
      mutate(cohort_definition_id = 100 + cohort_definition_id)
  ) %>%
  union_all(
    cdm$cohort %>%
      filter(!!datediff("cohort_start_date", "cohort_end_date") >= 28) %>%
      mutate(cohort_definition_id = 1000 + cohort_definition_id)
  ) %>%
  compute(name = "cohort_subset", temporary = FALSE, overwrite = TRUE) %>%
  new_generated_cohort_set() # this function creates the cohort object and metadata
cdm$cohort_subset %>%
  mutate(days_in_cohort = !!datediff("cohort_start_date", "cohort_end_date")) %>%
  group_by(cohort_definition_id) %>%
  summarize(mean_days_in_cohort = mean(days_in_cohort, na.rm = TRUE)) %>%
  collect() %>%
  arrange(mean_days_in_cohort)
```
This is an example of creating new cohorts from existing cohorts using CDMConnector. There is a lot of flexibility with this approach. Next we will look at completely custom cohort creation which is quite similar.
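
When the same filter is repeated with different thresholds, the pattern can be wrapped in a small function. The following is a minimal sketch on a hypothetical local tibble, `toy_base_cohort`. The helper `subset_by_duration` is invented for illustration and uses plain date subtraction instead of `datediff`, so it applies to local data frames only.

```{r}
library(dplyr, warn.conflicts = FALSE)

# Hypothetical local cohort with durations of 10, 15, 22, and 40 days
toy_base_cohort <- tibble(
  cohort_definition_id = c(1L, 1L, 2L, 2L),
  subject_id = c(1L, 2L, 3L, 4L),
  cohort_start_date = as.Date("2020-01-01"),
  cohort_end_date = as.Date("2020-01-01") + c(10, 15, 22, 40)
)

# Hypothetical helper: keep records lasting at least `min_days` and shift the cohort ID
subset_by_duration <- function(cohort, min_days, id_offset) {
  cohort %>%
    filter(as.integer(cohort_end_date - cohort_start_date) >= min_days) %>%
    mutate(cohort_definition_id = id_offset + cohort_definition_id)
}

# Apply the helper once per threshold and stack the results
toy_subsets <- bind_rows(
  Map(
    function(min_days, id_offset) subset_by_duration(toy_base_cohort, min_days, id_offset),
    c(14, 21, 28),
    c(10L, 100L, 1000L)
  )
)
count(toy_subsets, cohort_definition_id)
```

Using a distinct ID offset per threshold keeps the derived cohorts distinguishable inside a single table.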
### Custom Cohort Creation
Sometimes you may want to create cohorts that cannot be easily expressed using Atlas or Capr. In these situations you can implement cohort creation using SQL or R. See the chapter in [The Book of OHDSI](https://ohdsi.github.io/TheBookOfOhdsi/Cohorts.html#implementing-the-cohort-using-sql) for details on using SQL to create cohorts. CDMConnector provides a helper function to build simple cohorts from a list of OMOP concepts. `generate_concept_cohort_set` accepts a named list of concept sets and will create cohorts based on those concept sets. While this function does not allow for inclusion/exclusion criteria in the initial definition, additional criteria can be applied "manually" after the initial generation.
```{r}
library(dplyr, warn.conflicts = FALSE)
cdm <- generate_concept_cohort_set(
  cdm,
  concept_set = list(gibleed = 192671),
  name = "gibleed2", # name of the cohort table
  limit = "all", # use all occurrences of the concept instead of just the first
  end = 10 # set explicit cohort end date 10 days after start
)
cdm$gibleed2 <- cdm$gibleed2 %>%
  semi_join(
    filter(cdm$person, gender_concept_id == 8507),
    by = c("subject_id" = "person_id")
  ) %>%
  record_cohort_attrition(reason = "Male")
attrition(cdm$gibleed2)
```
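
The logic of the steps above can be mimicked on local data to see what each step contributes. This is a rough sketch on hypothetical tibbles (`toy_conditions`, `toy_persons`). It does not call CDMConnector, and the hand-built `toy_attrition` table only imitates the idea of an attrition record; the reason labels are made up.

```{r}
library(dplyr, warn.conflicts = FALSE)

# Hypothetical condition records and persons held in local tibbles
toy_conditions <- tibble(
  person_id = c(1L, 1L, 2L, 3L),
  condition_concept_id = c(192671L, 192671L, 192671L, 4000L),
  condition_start_date = as.Date(c("2020-01-01", "2020-06-01", "2020-02-01", "2020-03-01"))
)
toy_persons <- tibble(person_id = 1:3, gender_concept_id = c(8507L, 8532L, 8507L))

# Every matching occurrence becomes a record ending 10 days after it starts
toy_gibleed <- toy_conditions %>%
  filter(condition_concept_id == 192671L) %>%
  transmute(
    cohort_definition_id = 1L,
    subject_id = person_id,
    cohort_start_date = condition_start_date,
    cohort_end_date = condition_start_date + 10
  )

# Apply the extra criterion by hand: keep only persons recorded as male
toy_gibleed_male <- toy_gibleed %>%
  semi_join(
    filter(toy_persons, gender_concept_id == 8507L),
    by = c("subject_id" = "person_id")
  )

# Keep a before/after tally of records and subjects
toy_attrition <- tibble(
  reason = c("Initial records", "Male"),
  number_records = c(nrow(toy_gibleed), nrow(toy_gibleed_male)),
  number_subjects = c(
    n_distinct(toy_gibleed$subject_id),
    n_distinct(toy_gibleed_male$subject_id)
  )
)
toy_attrition
```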
We could visualise attrition using a package like visR.
```{r, fig.width= 7, fig.height=10}
library(visR)
gibleed2_attrition <- CDMConnector::attrition(cdm$gibleed2) %>%
  dplyr::select(Criteria = "reason", `Remaining N` = "number_subjects")
class(gibleed2_attrition) <- c("attrition", class(gibleed2_attrition))
visr(gibleed2_attrition)
```
In the above example we built a cohort table from a concept set. The cohort essentially captures patient-time based on the presence or absence of OMOP standard concept IDs. We then manually applied an inclusion criterion and recorded a new attrition record in the cohort. To learn more about this approach to building cohorts check out the [PatientProfiles](https://darwin-eu-dev.github.io/PatientProfiles/) R package.
You can also create a generated cohort set using any method you choose. As long as the table is in the CDM database and has the four required columns it can be added to the CDM object as a generated cohort set.
Suppose for example our cohort table is
```{r}
cohort <- dplyr::tibble(
  cohort_definition_id = 1L,
  subject_id = 1L,
  cohort_start_date = as.Date("1999-01-01"),
  cohort_end_date = as.Date("2001-01-01")
)
cohort
```
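
Before inserting a hand-built table into the database, it can help to confirm that it has the four required columns. The helper below, `check_cohort_columns`, is a hypothetical base R sketch and not part of CDMConnector or omopgenerics, which perform their own validation.

```{r}
# Hypothetical helper: check a local data frame for the four required cohort
# columns and for start dates that fall on or before the end dates
check_cohort_columns <- function(x) {
  required <- c("cohort_definition_id", "subject_id",
                "cohort_start_date", "cohort_end_date")
  missing_cols <- setdiff(required, names(x))
  if (length(missing_cols) > 0) {
    stop("Missing cohort columns: ", paste(missing_cols, collapse = ", "))
  }
  if (any(x$cohort_start_date > x$cohort_end_date, na.rm = TRUE)) {
    stop("Some records start after they end")
  }
  invisible(TRUE)
}

toy_valid_cohort <- data.frame(
  cohort_definition_id = 1L,
  subject_id = 1L,
  cohort_start_date = as.Date("1999-01-01"),
  cohort_end_date = as.Date("2001-01-01")
)
toy_check_ok <- check_cohort_columns(toy_valid_cohort)

# Dropping columns triggers the error; capture the message instead of stopping
toy_check_msg <- tryCatch(
  check_cohort_columns(toy_valid_cohort["subject_id"]),
  error = function(e) conditionMessage(e)
)
toy_check_msg
```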
First make sure the table is in the database, create a dplyr table reference to it, and add it to the CDM object.
```{r}
library(omopgenerics)
cdm <- insertTable(cdm = cdm, name = "cohort", table = cohort, overwrite = TRUE)
cdm$cohort
```
To make this a true generated cohort object use the `newCohortTable` function.
```{r}
cdm$cohort <- newCohortTable(cdm$cohort)
```
We can see that this cohort now has the class "cohort_table" as well as the various metadata tables.
```{r}
cohort_count(cdm$cohort)
cohort_set(cdm$cohort)
attrition(cdm$cohort)
```
If you would like to override the attribute tables then pass additional dataframes to `newCohortTable`.
```{r}
cdm <- insertTable(cdm = cdm, name = "cohort2", table = cohort, overwrite = TRUE)
cdm$cohort2 <- newCohortTable(cdm$cohort2)
settings(cdm$cohort2)
cohort_set <- data.frame(cohort_definition_id = 1L,
                         cohort_name = "made_up_cohort")
cdm$cohort2 <- newCohortTable(cdm$cohort2, cohortSetRef = cohort_set)
settings(cdm$cohort2)
```
```{r}
DBI::dbDisconnect(con, shutdown = TRUE)
```
Cohort building is a fundamental building block of observational health analysis and CDMConnector supports different ways of creating cohorts. As long as your cohort table has the required structure and columns you can add it to the cdm with the `new_generated_cohort_set` function and use it in any downstream OHDSI analytic packages.
<div style="margin-bottom:3cm;"></div>