S3 class that allows different flavors of missing in numeric vectors.
One can divide measures into two groups: qualitative and quantitative. However, record formats often mix the two. Some of the values are simply interpreted as is: a 2
is a 2. Some of the values are codes which represent qualities instead of numbers: an 8
means the measure's not applicable. These are sometimes called "sentinel values." And, of course, some values are just plain missing.
When handling these data in R, a common idiom is to split the column in twain: a numeric vector for the quantitative and a factor for the qualitative. This is the simplest solution and will often work fine. But it does something risky: it separates linked data. The user must remember to keep them together, and usually does this with clever variable or column names.
Clever is bad. Code with my_data[, paste0(vars, c("_num", "_flag"))]
is hard to read. Code with get
is hard to follow.
The sentinel
package offers the sentineled
class to bundle numeric and categorical missing values into a single object.
library(sentinel)
x <- sentineled(
c(10, 20, 98, 99, NA),
sentinels = c(98, 99),
labels = c("refused", "not recorded")
)
x
## [1] 10 20 <refused> <not recorded>
## [5] NA
## sentinel values: "" "refused" "not recorded"
The numbers are numbers, the categories are categorical, and the unknowns are just unknown.
A sentineled
object is a vector. When subsetting, a it will remain a sentineled
object with the same possible sentinel values.
x[1]
## [1] 10
## sentinel values: "" "refused" "not recorded"
x[1:2]
## [1] 10 20
## sentinel values: "" "refused" "not recorded"
x[[3]]
## [1] <refused>
## sentinel values: "" "refused" "not recorded"
x[x < 15]
## [1] 10 <refused> <not recorded> NA
## sentinel values: "" "refused" "not recorded"
A sentineled
vector can be used in arithmetic, with all non-missing values acting like normal numeric values. If possible, a sentineled
object with the appropriate sentinel values will be the result.
mean(x, na.rm = TRUE)
## [1] 15
x / 100
## [1] 0.1 0.2 <refused> <not recorded>
## [5] NA
## sentinel values: "" "refused" "not recorded"
It can even be a column in a data.frame
.
data.frame(
element = c("argon", "boron", "chlorine"),
mass = sentineled(c(3, "x", 8), "x", "scale malfunction")
)
## element mass
## 1 argon 3
## 2 boron <scale malfunction>
## 3 chlorine 8
The sentinel codes are treated as missing, but the different categories of missing are stored as a factor vector in the "sentinels"
attribute of the object. Use the sentinels
function to access them.
sentinels(x)
## [1] refused not recorded <NA>
## Levels: refused not recorded
x[sentinels(x) != "refused"]
## [1] 10 20 <not recorded> NA
## sentinel values: "" "refused" "not recorded"
Notice that, for the non-missing values in x
, their respective sentinel codes are blanks (""
).
as.character(sentinels(x))
## [1] "" "" "refused" "not recorded"
## [5] NA
It's recommended to use explanatory sentinel levels for all expected types of missing. That way, if a value is shown as just plain NA
, it's a sign something went wrong in the analysis.