
covidcast has issue "update" data that re-reports the previous version of observations #1362

@brookslogan

Description


Actual Behavior:

The covidcast issue data includes rows where all entries besides issue and lag appear to be the same as in the preceding version of that row.

Reprex with commentary:

[2022-02-14: Note that this snippet's original code is no longer supported: as_of and issues cannot be used together. I've updated it with something that is hopefully equivalent, but it gives 10566 update rows rather than 10465, perhaps due to other patches to the version data?]

library("tibble")
library("dplyr")

analysis.date = as.Date("2021-11-09")

geo.values = c("ak", "al")

## jhu.case.updates.original.code =
##  delphi.epidata::covidcast("jhu-csse", "confirmed_incidence_num",
##                            "day", "state",
##                            delphi.epidata::epirange(12340101,34560101), geo.values,
##                            as_of = format(analysis.date-1L, "%Y%m%d"), # try to make this very reproducible
##                            issues = delphi.epidata::epirange(12340101,34560101)) %>%
##  delphi.epidata::fetch_tbl()
jhu.case.updates =
  epidatr::covidcast("jhu-csse", "confirmed_incidence_num",
                     "state", "day",
                     geo.values, epidatr::epirange(12340101, 34560101),
                     issues = epidatr::epirange(12340101, format(analysis.date-1L, "%Y%m%d"))) %>%
  epidatr::fetch_tbl()


## See if we have any issue data that repeats the same `value` for an
## observation as the preceding issue (or issues) by performing an RLE of the
## value across issues for each observation (geo_value x time_value)
jhu.case.updates %>%
  group_by(geo_value, time_value) %>%
  arrange(issue, .by_group=TRUE) %>%
  summarize(value.rle.tbl = {
    value.rle = rle(value)
    tibble(length = value.rle[["lengths"]], value = value.rle[["values"]])
  }, .groups="drop") %>%
  mutate(
    value.run.length = value.rle.tbl[["length"]],
    value.run.value = value.rle.tbl[["value"]],
    value.rle.tbl = NULL
  ) %>%
  arrange(-value.run.length) %>%
  print()

## We see that there are updates that don't update the value, but maybe
## something else (besides `issue`&`lag`) could have been updated. Let's check
## on the other columns.
## - Examine the observation with the longest run in more detail.
jhu.case.updates %>%
  filter(geo_value=="ak", time_value==as.Date("2020-04-23")) %>%
  arrange(issue) %>%
  select(issue, lag, value, missing_value, stderr, missing_stderr, sample_size, missing_sample_size) %>%
  count(across(-c(issue, lag))) %>%
  print(n=100L)
##   There are two versions of the observation factoring in these additional
##   columns. However, since one of these only appears 1x and the other 56x,
##   there must be at least 54 instances of re-reporting the same row (with
##   different `issue`&`lag`)

## Still, to avoid the complications above regarding the other columns and the
## row ordering, let's try to directly detect re-reporting of "entire
## observations":
rereporting =
  jhu.case.updates %>%
  select(-lag) %>%
  group_by(geo_value, time_value) %>%
  arrange(issue, .by_group=TRUE) %>%
  ## "lag the row" by `lead`ing the `issue`
  mutate(issue = lead(issue)) %>% filter(!is.na(issue)) %>% ungroup() %>%
  ## find the re-reporting:
  inner_join(., jhu.case.updates %>% select(-lag), by=names(.))

rereporting %>%
  nrow()
## 8212 re-reported rows in this extract

jhu.case.updates %>%
  nrow()
## 10465 total rows in this extract

## So, at least for ak&al cases, the amount of re-reporting appears substantial.
## If duplicate rows continue to be steadily re-reported, we would expect "very"
## quadratic growth in the size of the data set; with sparser re-reporting and
## sparse revisions, we would expect "linear-ish" growth.
rereporting %>%
  count(issue) %>%
  arrange(issue) %>%
  print(n=100L)
## The last issue re-reporting an observation appears to be 2020-10-30 for these
## states, so it looks like there isn't the "very" quadratic growth, at least
## for the current signal, geo_type, and geo_values.
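
## As a rough illustration of the growth claim above (hypothetical numbers,
## just to show the arithmetic): if every new issue re-reported every earlier
## observation for a geo_value, then after N daily issues the archive would
## hold about N*(N+1)/2 rows for that geo_value, versus about N rows (plus
## actual revisions) with changes-only reporting.
n.issues = 600
sum(seq_len(n.issues)) # == n.issues*(n.issues+1)/2 == 180300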

Expected behavior

I soft-expected an issue-query to return only data with changes to entries (other than to issue itself and lag), for two reasons:

  • ?covidcast::covidcast_signal describes issue-queries as "[f]etch[ing] only data that was published or updated ("issued") on these [issues]", which might leave this impression (see the sketch after this list).
  • storage-efficiency-wise, this would make sense; query-efficiency-wise, I'm not sure, but I'd guess it would probably help, especially if there is an index over (source, signal, geo_value, time_value, issue).
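
To make the first point concrete, here is a rough sketch of roughly what I soft-expected an issue-query to return, approximated client-side from the reprex objects above (so the names `jhu.case.updates` and `rereporting` are assumed): keep only rows that differ from the preceding version, by dropping the detected re-reports.

expected.issue.data =
  jhu.case.updates %>%
  anti_join(rereporting, by = names(rereporting))

expected.issue.data %>%
  nrow()
## with the original extract's counts, 10465 - 8212 = 2253 rows remain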

Context

I am working on cmu-delphi/epiprocess#23 to build some utilities for working with data version history from delphi-epidata or elsewhere. A natural example was to use version history from something in delphi-epidata using an issue-query, which unearthed this surprise.

The re-reporting isn't really a problem for me right now. The utilities I am working on should be built to accept "update" or snapshot (as-of-query) data that contains duplicates, and to eventually compact the data to remove them. Removing re-reporting from covidcast issue data wouldn't ensure that users never input such re-reported data from other data providers.
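
For concreteness, a rough sketch of the kind of compaction step I have in mind, reusing the compare-each-version-to-the-preceding-one idea from the reprex above. The helper name compact_updates and its interface are hypothetical; it assumes a long-format tibble keyed by `geo_value`, `time_value`, and `issue`, with a `lag` column to ignore, like the `fetch_tbl()` output above:

library("dplyr")

compact_updates = function(updates,
                           key_cols = c("geo_value", "time_value"),
                           version_col = "issue",
                           ignore_cols = "lag") {
  value_cols = setdiff(names(updates), c(key_cols, version_col, ignore_cols))
  ## TRUE where an entry differs from the preceding version's entry
  ## (NA vs. non-NA counts as a difference; NA vs. NA counts as equal)
  changed_from_prev = function(x) {
    prev = dplyr::lag(x)
    xor(is.na(x), is.na(prev)) | (!is.na(x) & !is.na(prev) & x != prev)
  }
  updates %>%
    group_by(across(all_of(key_cols))) %>%
    arrange(.data[[version_col]], .by_group = TRUE) %>%
    ## keep each observation's first version plus any version that changes at
    ## least one value column relative to the immediately preceding version
    filter(row_number() == 1L |
             if_any(all_of(value_cols), changed_from_prev)) %>%
    ungroup()
}

## e.g., compact_updates(jhu.case.updates) should drop the re-reported rows
## detected in `rereporting` above.

(This is essentially the same consecutive-version comparison as the `lead(issue)` + `inner_join` check in the reprex, just phrased as a filter that keeps the changed rows instead of collecting the repeated ones.)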

Labels

data quality (Missing data, weird data, broken data)
