---
title: "naaccr"
output:
  github_document:
    html_preview: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```
## Summary
The `naaccr` R package enables researchers to easily read and begin analyzing
cancer incidence records stored in the
[North American Association of Central Cancer Registries](https://www.naaccr.org/)
(NAACCR) file format.
## Usage
`naaccr` focuses on two tasks: arranging the records and preparing the fields
for analysis.
### Records
The `naaccr_record` class defines objects which store cancer incidence records.
It inherits from `data.frame` and currently only ensures that a dataset has a
standard set of columns. Despite its singular-sounding name, a `naaccr_record`
object can contain multiple records, one per row.

The `read_naaccr` function creates a `naaccr_record` object from a
NAACCR-formatted file.
```{r showRecords}
record_file <- system.file(
  "extdata/synthetic-naaccr-18-abstract.txt",
  package = "naaccr"
)
record_lines <- readLines(record_file)
## Marital status and race fields
cat(substr(record_lines[1:5], 206, 216), sep = "\n")
```
```{r readNaaccr}
library(naaccr)
records <- read_naaccr(record_file, version = 18)
records[1:5, c("maritalStatusAtDx", "race1", "race2", "race3")]
```
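NAACCR flat files are fixed-width, which is why `substr` can pull fields straight out of the raw lines above. As a rough illustration of that idea only (not the package's implementation), the sketch below slices toy lines using a small layout table. The `layout` table, the column positions, the sample lines, and the `parse_fixed_width` helper are all hypothetical.

```r
# Hypothetical layout: field name plus start and end column.
# These positions are made up for the toy lines below; they are
# not the real NAACCR column positions.
layout <- data.frame(
  name  = c("maritalStatusAtDx", "race1", "race2"),
  start = c(1, 2, 4),
  end   = c(1, 3, 5),
  stringsAsFactors = FALSE
)

# Two made-up fixed-width lines, five characters each.
toy_lines <- c("10102", "20188")

# Cut every line at the positions given in the layout and return
# one character column per field.
parse_fixed_width <- function(lines, layout) {
  fields <- Map(
    function(start, end) substr(lines, start, end),
    layout$start,
    layout$end
  )
  names(fields) <- layout$name
  as.data.frame(fields, stringsAsFactors = FALSE)
}

parsed <- parse_fixed_width(toy_lines, layout)
parsed
```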
By default, `read_naaccr` reads all fields defined in a format. For example,
the NAACCR 18 format used above has `r nrow(naaccr_format_18)` fields. An
analysis would rarely need even 100 of them. Specifying which fields to keep
with the `keep_fields` argument reduces read time and memory use.
```{r readKeepColumns}
dim(records)
records_slim <- read_naaccr(
  input = record_file,
  version = 18,
  keep_fields = c("ageAtDiagnosis", "countyAtDx", "primarySite")
)
dim(records_slim)
```
As with most classes, one can create a new `naaccr_record` object with the
function of the same name. The result will include the given columns.
```{r naaccrRecord}
nr <- naaccr_record(
  primarySite = "C010",
  dateOfBirth = "19450521"
)
nr[, c("primarySite", "dateOfBirth")]
```
The `as.naaccr_record` function transforms an existing data frame. Any existing
columns must use NAACCR's XML names.
```{r asNaaccrRecord}
prefab <- data.frame(
  ageAtDiagnosis = c(1, 120, 999),
  race1 = c("01", "02", "88")
)
converted <- as.naaccr_record(prefab)
converted[, c("ageAtDiagnosis", "race1")]
```
### Code translation
The NAACCR format uses similar coding schemes for many fields, and the `naaccr`
package includes functions to help translate them.

`naaccr_boolean` translates "yes/no" fields. By default, it assumes `"0"` stands
for "no", and `"1"` stands for "yes."
```{r naaccrBoolean}
naaccr_boolean(c("0", "1", "2"))
```
Some fields use `"1"` for `FALSE` and `"2"` for `TRUE`. Use the `false_value`
parameter to work with these.
```{r falseValue}
naaccr_boolean(c("0", "1", "2"), false_value = "1")
```
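To make the role of `false_value` concrete, here is a minimal base R sketch of this kind of translation. It is not the package's code. The helper name `code_to_logical` is hypothetical, and the sketch assumes the "yes" code is always one greater than the "no" code, as in the two schemes described above (`"0"`/`"1"` and `"1"`/`"2"`). Any other code becomes `NA`.

```r
# Hypothetical helper, for illustration only.
# Assumption: the "yes" code is the "no" code plus one.
code_to_logical <- function(x, false_value = "0") {
  true_value <- as.character(as.integer(false_value) + 1L)
  result <- rep(NA, length(x))           # anything unrecognized stays NA
  result[x %in% false_value] <- FALSE
  result[x %in% true_value] <- TRUE
  result
}

default_scheme <- code_to_logical(c("0", "1", "2"))
shifted_scheme <- code_to_logical(c("0", "1", "2"), false_value = "1")
```

Under these assumptions, `default_scheme` is `FALSE TRUE NA` and `shifted_scheme` is `NA FALSE TRUE`.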
#### Categorical fields
The `naaccr_factor` function translates values using a specific field's category
codes.
```{r naaccrFactor}
naaccr_factor(c("01", "31", "65"), "primaryPayerAtDx")
```
Some fields have multiple codes explaining why an actual value isn't known.
By default, all of these codes are converted to `NA`, so the missingness
propagates through computations in R. Because the reasons can be useful,
`naaccr_factor` and `naaccr_record` both have a `keep_unknown` parameter to
retain them.
```{r keepUnknown}
naaccr_factor(c("1", "9"), field = "sex")
naaccr_factor(c("1", "9"), field = "sex", keep_unknown = TRUE)
naaccr_record(sex = c("1", "9"), race1 = c("01", "99"), keep_unknown = TRUE)
```
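The effect of `keep_unknown` can be pictured with a plain `factor` call and a lookup table. The sketch below is only an illustration in base R. The `code_labels` table, its labels, the `is_unknown` column, and the `codes_to_factor` helper are hypothetical and are not taken from the package's data files.

```r
# Hypothetical lookup table: which codes exist, how to label them,
# and which ones mean "value not known".
code_labels <- data.frame(
  code       = c("1", "2", "9"),
  label      = c("male", "female", "unknown"),
  is_unknown = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Translate codes to a factor. Unless keep_unknown is TRUE, the
# "unknown" codes are dropped from the levels and so become NA.
codes_to_factor <- function(x, code_labels, keep_unknown = FALSE) {
  if (!keep_unknown) {
    code_labels <- code_labels[!code_labels$is_unknown, ]
  }
  factor(x, levels = code_labels$code, labels = code_labels$label)
}

dropped <- codes_to_factor(c("1", "9"), code_labels)
kept <- codes_to_factor(c("1", "9"), code_labels, keep_unknown = TRUE)
```

In this sketch, `dropped` holds `male` and `NA`, while `kept` holds `male` and `unknown`.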
#### Numeric with special missing
Some fields contain primarily continuous or count data but also use special
codes, sometimes called "sentinel values." The `split_sentineled` function
splits these fields in two: one part for the numeric values and one for the
special codes.
```{r naaccrSentineled}
rnp <- split_sentineled(c(10, 20, 90, 95, 99, NA), "regionalNodesPositive")
rnp
```
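The general idea of separating sentinel codes from real measurements can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation. The `split_codes` helper, the output column names, and the sentinel codes and labels are all hypothetical.

```r
# Hypothetical sentinel codes and labels, for illustration only.
toy_sentinels <- c("98" = "not examined", "99" = "unknown")

# Split a coded numeric vector into a numeric column (NA wherever
# a sentinel code appears) and a factor column naming the sentinel
# (NA wherever a real value appears).
split_codes <- function(x, sentinels) {
  codes <- as.character(x)
  is_sentinel <- codes %in% names(sentinels)
  value <- x
  value[is_sentinel] <- NA
  flag <- factor(unname(sentinels[codes]), levels = unname(sentinels))
  data.frame(value = value, flag = flag)
}

toy_split <- split_codes(c(10, 20, 98, 99, NA), toy_sentinels)
toy_split
```

Keeping the two parts side by side lets numeric summaries such as `mean(toy_split$value, na.rm = TRUE)` ignore the special codes while the reasons for missingness remain available in `flag`.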
## Building
```{r needForBuild}
library(devtools)
deps <- packageDescription("naaccr", fields = c("Depends", "Imports", "Suggests"))
deps <- Filter(function(x) any(!is.na(x)), deps)
dep_names <- lapply(deps, function(x) devtools::parse_deps(x)[["name"]])
dep_names <- sort(unlist(dep_names))
dep_list <- paste0("- `", dep_names, "`", collapse = "\n")
```
To build the `naaccr` package, you'll need the following R packages:

`r dep_list`

To document, build, and test the package, run the `build.R` script with the
package's root as the working directory.
## Project files
This project fills two roles:

1. Creating a package to work with NAACCR data in R.
2. Collecting, in plain-text and machine-readable formats, the data needed to
   process NAACCR files.
```
naaccr/
├ R/                   # R files to create the package objects
├ data-raw/            # Plain-text data files and scripts for processing them
│ ├ code-labels/       # Mappings of codes to understandable labels
│ ├ sentinel-labels/   # Mappings of sentinel values to understandable labels
│ └ record-formats/    # Tables defining each NAACCR file format
├ external/            # Downloaded files and scripts to create files in `data-raw`
├ inst/
│ └ extdata/           # Data files for examples in the documentation
└ tests/               # Tests and data using the `testthat` package
```
Files in `external` only need to be updated or run when NAACCR publishes a new
or revised format. In that case, refer to the comments in the `.R` scripts in
that directory for where to download the new files.

Think of these scripts as handy tools for generating `data-raw` files.
Some cleaning of their output may be required.

To run `create-record-format-files.R`, you'll need to create an account for the
[SEER API](https://api.seer.cancer.gov/) from the National Cancer Institute's
Surveillance, Epidemiology, and End Results (SEER) program.
Store the API key as an environment variable named `SEER_API_KEY`.
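
Before running the script, it can help to confirm that R actually sees the key. The snippet below is a small sketch using base R's `Sys.getenv`. The `get_seer_api_key` helper is hypothetical and is not part of this project's scripts.

```r
# Hypothetical helper: read the key from the environment and fail
# early with a clear message if it has not been set.
get_seer_api_key <- function() {
  key <- Sys.getenv("SEER_API_KEY", unset = "")
  if (!nzchar(key)) {
    stop("SEER_API_KEY is not set. Define it before running the script.")
  }
  key
}
```

One common way to set the variable for every R session is a line such as `SEER_API_KEY=your-key-here` in the user's `.Renviron` file, which R reads at startup.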