`covidcast` has `issue` "update" data that re-reports the previous version of observations

**Actual behavior**



The covidcast issue data includes rows where all entries besides `issue` and `lag` appear to be the same as in the preceding version of that row.

Reprex with commentary:

[2022-02-14: Note that the original code in this snippet is no longer supported: `as_of` and `issues` cannot be used together. I've replaced it with something that is hopefully equivalent, but it gives 10566 update rows rather than 10465, perhaps due to other version data patches.]

```r
library("tibble")
library("dplyr")

analysis.date = as.Date("2021-11-09")

geo.values = c("ak", "al")

## jhu.case.updates.original.code =
##  delphi.epidata::covidcast("jhu-csse", "confirmed_incidence_num",
##                            "day", "state",
##                            delphi.epidata::epirange(12340101,34560101), geo.values,
##                            as_of = format(analysis.date-1L, "%Y%m%d"), # try to make this very reproducible
##                            issues = delphi.epidata::epirange(12340101,34560101)) %>%
##  delphi.epidata::fetch_tbl()
jhu.case.updates =
  epidatr::covidcast("jhu-csse", "confirmed_incidence_num",
                             "state", "day",
                            geo.values, epidatr::epirange(12340101,34560101),
                            issues = epidatr::epirange(12340101,format(analysis.date-1L, "%Y%m%d"))) %>%
  epidatr::fetch_tbl()


## See if we have any issue data that repeats the same `value` for an
## observation as the preceding issue (or issues) by performing an RLE of the
## value across issues for each observation (geo_value x time_value)
jhu.case.updates %>%
  group_by(geo_value, time_value) %>%
  arrange(issue, .by_group=TRUE) %>%
  summarize(value.rle.tbl = {
    value.rle = rle(value)
    ## `rle()` returns components named `lengths` and `values`
    tibble(length = value.rle[["lengths"]], value = value.rle[["values"]])
  }, .groups="drop") %>%
  mutate(
    value.run.length = value.rle.tbl[["length"]],
    value.run.value = value.rle.tbl[["value"]],
    value.rle.tbl = NULL
  ) %>%
  arrange(-value.run.length) %>%
  print()

## We see that there are updates that don't update the value, but maybe
## something else (besides `issue`&`lag`) could have been updated. Let's check
## on the other columns.
## - Examine the observation with the longest run in more detail.
jhu.case.updates %>%
  filter(geo_value=="ak", time_value==as.Date("2020-04-23")) %>%
  arrange(issue) %>%
  select(issue, lag, value, missing_value, stderr, missing_stderr, sample_size, missing_sample_size) %>%
  count(across(-c(issue, lag))) %>%
  print(n=100L)
##   There are two versions of the observation when these additional columns
##   are taken into account. One appears only 1x and the other 56x; the
##   singleton can split the 56 copies into at most two runs, each with one
##   genuine first report, so there must be at least 54 instances of
##   re-reporting the same row (with different `issue`&`lag`).

## Still, to avoid the complications above regarding the other columns and the
## row ordering, let's try to directly detect re-reporting of "entire
## observations":
rereporting =
  jhu.case.updates %>%
  select(-lag) %>%
  group_by(geo_value, time_value) %>%
  arrange(issue, .by_group=TRUE) %>%
  ## "lag the row" by `lead`ing the `issue`
  mutate(issue = lead(issue)) %>% filter(!is.na(issue)) %>% ungroup() %>%
  ## find the re-reporting:
  inner_join(., jhu.case.updates %>% select(-lag), by=names(.))

rereporting %>%
  nrow()
## 8212 re-reported rows in this extract

jhu.case.updates %>%
  nrow()
## 10465 total rows in this extract

## So, at least for ak&al cases, the amount of re-reporting appears substantial.
## If duplicate rows continue to be steadily re-reported, we will expect "very"
## quadratic growth in the size of the data set; with sparser re-reporting and
## sparse revisions, we would expect "linear-ish" growth.
rereporting %>%
  count(issue) %>%
  arrange(issue) %>%
  print(n=100L)
## The last issue re-reporting an observation appears to be 2020-10-30 for these
## states, so it looks like there isn't the "very" quadratic growth, at least
## for the current signal, geo_type, and geo_values.
```
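The growth remark at the end of the reprex can be made concrete with a quick count. This is only a sketch under simplifying assumptions that are mine, not taken from the data: one location, one new `time_value` per daily issue, and either every existing observation re-reported in every issue (dense) or each observation issued once and then revised a fixed number of times (sparse). The function names `rows.dense` and `rows.sparse` are hypothetical.

```r
## Hypothetical helpers; assume one location and one new time_value per issue.

## Dense re-reporting: issue number i contains all i observations seen so far,
## so the total is 1 + 2 + ... + n.issues = n.issues * (n.issues + 1) / 2.
rows.dense = function(n.issues) {
  n.issues * (n.issues + 1) / 2
}

## Sparse revisions: each observation is issued once and then revised at most
## `n.revisions` times, so the total is bounded by a linear function of
## n.issues.
rows.sparse = function(n.issues, n.revisions = 2) {
  n.issues * (1 + n.revisions)
}

## Compare the two regimes after roughly a month, a year, and two years.
n.issues = c(30, 365, 730)
growth = data.frame(
  n.issues = n.issues,
  dense = rows.dense(n.issues),
  sparse = rows.sparse(n.issues)
)
print(growth)
```

Under these assumptions the dense count is quadratic in the number of issues, while the sparse bound stays linear, which is the distinction the comments above draw.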

**Expected behavior**



I soft-expected an issue-query to return only data with changes to entries (other than to `issue` itself and `lag`), for two reasons:
- `?covidcast::covidcast_signal` describes issue-queries as "[f]etch[ing] only data that was published or updated ("issued") on these [issues]", which might leave this impression.
- In terms of storage efficiency, this would make sense; in terms of query efficiency, I'm not sure, but I'd guess that it would help, especially if there is an index over `source,signal,geo_value,time_value,issue`.

**Context**


I am working on cmu-delphi/epitools#23 to build some utilities for working with data version history from delphi-epidata or elsewhere.  A natural example was to use version history from something in delphi-epidata using an issue-query, which unearthed this surprise.

The re-reporting isn't really a problem for me right now.  The utilities I am working on should be built to accept "update" or snapshot (as-of-query) data that contains duplicate data, and eventually to compact the data to remove these duplicates.  Removing re-reporting in covidcast issue data wouldn't ensure that users wouldn't input such re-reported data from other data providers.
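As a rough illustration of the compaction step mentioned above, here is a minimal sketch that reuses the "lead the `issue`" idea from the reprex, but with an `anti_join` to drop the re-reported rows instead of an `inner_join` to find them. The helper name `compact.updates` and the toy data are hypothetical; this is not the epitools implementation, and it assumes the same column names as the covidcast extract (`geo_value`, `time_value`, `issue`, `lag`).

```r
library("tibble")
library("dplyr")

## Hypothetical helper: drop rows that match the preceding issue of the same
## observation in every column other than `issue` and `lag`.
compact.updates = function(updates) {
  ## Shift each row forward to the next issue of its observation; a real row
  ## that equals a shifted row carries no new information.
  shifted =
    updates %>%
    select(-lag) %>%
    group_by(geo_value, time_value) %>%
    arrange(issue, .by_group=TRUE) %>%
    mutate(issue = lead(issue)) %>%
    ungroup() %>%
    filter(!is.na(issue))
  ## dplyr joins match NA to NA by default, so missing stderr values etc. do
  ## not prevent a re-report from being detected.
  updates %>%
    anti_join(shifted, by=names(shifted))
}

## Made-up data: two re-reports and one genuine revision for "ak".
toy.updates =
  tribble(
    ~geo_value, ~time_value,  ~issue,       ~lag, ~value, ~stderr,
    "ak",       "2020-04-23", "2020-04-24", 1L,   10,     NA_real_,
    "ak",       "2020-04-23", "2020-04-25", 2L,   10,     NA_real_,
    "ak",       "2020-04-23", "2020-04-26", 3L,   12,     NA_real_,
    "ak",       "2020-04-23", "2020-04-27", 4L,   12,     NA_real_,
    "al",       "2020-04-23", "2020-04-24", 1L,   7,      NA_real_
  ) %>%
  mutate(across(c(time_value, issue), as.Date))

compacted = compact.updates(toy.updates)
print(compacted)
```

Because each row is compared only with its immediate predecessor, a value that changes and later changes back is kept as a genuine update rather than being treated as a duplicate.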
