This repository has been archived by the owner on Dec 5, 2022. It is now read-only.

Death data is wrong on 6/26 for the San Francisco Bay Area counties #363

Open
kengo-sony opened this issue Jul 31, 2020 · 14 comments

Comments

@kengo-sony

  • San Francisco County
    "2020-06-25": {
      "cases": 3297,
      "deaths": 48,
      "tested": 129617,
      "hospitalized_current": 47,
      "icu_current": 17,
      "growthFactor": 1.01
    },
    "2020-06-26": {
      "cases": 3400,
      "deaths": 4800,
      "tested": 132575,
      "hospitalized_current": 45,
      "icu_current": 18,
      "growthFactor": 1.03
    },
    "2020-06-27": {
      "cases": 3468,
      "deaths": 49,
      "tested": 135170,
      "hospitalized_current": 54,
      "icu_current": 19,
      "growthFactor": 1.02
    },

  • Alameda County
    {
      "cases": 5382,
      "deaths": 128,
      "tested": 0,
      "hospitalized": 0,
      "recovered": 0,
      "icu": 0,
      "growthFactor": 1.02,
      "date": "2020-06-25"
    },
    {
      "cases": 5493,
      "deaths": 13000,
      "tested": 0,
      "hospitalized": 0,
      "recovered": 0,
      "icu": 0,
      "growthFactor": 1.02,
      "date": "2020-06-26"
    },
    {
      "cases": 5493,
      "deaths": 130,
      "tested": 0,
      "hospitalized": 0,
      "recovered": 0,
      "icu": 0,
      "growthFactor": 1,
      "date": "2020-06-27"
    },

Other counties also show implausibly large death counts, on 6/26 only.
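A one-day spike like this can be caught mechanically. The following is a minimal sketch, not part of the scraper: `findOneDaySpikes` and its `factor` threshold are hypothetical names chosen for illustration, and the input is assumed to have the same date-keyed shape as the San Francisco excerpt above.

```javascript
// Hypothetical helper (not from the scraper): flag dates whose value is
// far above BOTH the previous and the next day's value.
// `factor` is an arbitrary illustrative threshold.
function findOneDaySpikes(series, field, factor = 10) {
  const dates = Object.keys(series).sort(); // ISO dates sort chronologically
  const spikes = [];
  for (let i = 1; i < dates.length - 1; i++) {
    const prev = series[dates[i - 1]][field];
    const curr = series[dates[i]][field];
    const next = series[dates[i + 1]][field];
    // Skip days where any of the three values is missing.
    if ([prev, curr, next].some((v) => typeof v !== 'number')) continue;
    if (curr > factor * Math.max(prev, 1) && curr > factor * Math.max(next, 1)) {
      spikes.push({ date: dates[i], value: curr, prev, next });
    }
  }
  return spikes;
}

// Values copied from the San Francisco County excerpt above.
const sanFrancisco = {
  '2020-06-25': { cases: 3297, deaths: 48 },
  '2020-06-26': { cases: 3400, deaths: 4800 },
  '2020-06-27': { cases: 3468, deaths: 49 },
};

const sfDeathSpikes = findOneDaySpikes(sanFrancisco, 'deaths');
const sfCaseSpikes = findOneDaySpikes(sanFrancisco, 'cases');
```

With these inputs only the 2020-06-26 deaths value is flagged; the cases series passes because it grows by a few percent per day.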

@jzohrab
Contributor

jzohrab commented Aug 1, 2020

Well that's total garbage. Thanks @kengo-sony for the issue!

A request: if you're using timeseries-byLocation.json, in the future when reporting data issues, please also include the dateSources section of the file for the location in question ... it helps me determine where to look. :-) In this case, that section says "2020-04-15..2020-07-29": "us-covidtracking", so the us-covidtracking source is the one causing the trouble.

Cheers, looking into it! jz

@jzohrab
Contributor

jzohrab commented Aug 1, 2020

Correction, it was actually "2020-03-21..2020-07-31": "us-ca-mercury-news".

@jzohrab
Contributor

jzohrab commented Aug 1, 2020

Should be fixed in #366. I'll launch it to prod and the data should be regenerated in at most a few days.

@jzohrab jzohrab self-assigned this Aug 1, 2020
@jzohrab jzohrab added the needs-verification Waiting for verification on an item that we believe to be fixed. label Aug 1, 2020
@jzohrab
Contributor

jzohrab commented Aug 1, 2020

Thanks again @kengo-sony ! jz

@kengoy

kengoy commented Aug 2, 2020

Thanks for your quick fix @jzohrab ! 6/26 data looks good now.

Yes, I will make sure to include the dateSources section when reporting a data issue.

Thanks again!

@kengoy

kengoy commented Aug 2, 2020

Sorry again, @jzohrab.

I see another issue in Alameda County. It could be a side effect of the fix.
"cases" suddenly becomes very small starting on 6/24; it looks like a daily case count instead of a cumulative count, and it goes back to a cumulative count on 8/1.

  "2020-06-22": {
    "cases": 5007,
    "deaths": 120,
    "growthFactor": 1.04
  },
  "2020-06-23": {
    "cases": 5140,
    "deaths": 120,
    "growthFactor": 1.03
  },
  "2020-06-24": {
    "cases": 5,
    "deaths": 122,
    "growthFactor": 0
  },
  "2020-06-25": {
    "cases": 5,
    "deaths": 128,
    "growthFactor": 1
  },

...

  "2020-07-31": {
    "cases": 11,
    "deaths": 182,
    "growthFactor": 1
  },
  "2020-08-01": {
    "cases": 11131,
    "deaths": 182,
    "growthFactor": 1011.91
  }
},
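The growthFactor values in this excerpt are consistent with dividing each day's cumulative "cases" by the previous day's and rounding to two decimals, which would explain the jump to 1011.91 on 8/1. This is an inference from the numbers above, not confirmed from the scraper's code; the `growthFactor` function below is a hypothetical sketch of that assumption.

```javascript
// Assumption (inferred from the excerpt, not taken from the scraper):
// growthFactor = today's cumulative cases / yesterday's, rounded to 2 decimals.
function growthFactor(previousCases, currentCases) {
  if (!previousCases) return undefined; // missing or zero baseline: no ratio
  return Math.round((currentCases / previousCases) * 100) / 100;
}

// Alameda County values copied from the excerpt above.
const alamedaGrowth = {
  '2020-06-23': growthFactor(5007, 5140), // 1.03
  '2020-06-24': growthFactor(5140, 5),    // 0
  '2020-08-01': growthFactor(11, 11131),  // 1011.91
};
```

Under that assumption, a growthFactor near 0 or in the hundreds is itself a signal that the underlying "cases" value switched scale between two consecutive days.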

Here is the data source.

"dateSources": {
  "2020-01-24..2020-02-29": "jhu-usa",
  "2020-03-01..2020-03-20": {
    "jhu-usa": [
      "deaths"
    ],
    "nyt": [
      "cases"
    ]
  },
  "2020-03-21..2020-07-31": "us-ca-mercury-news",
  "2020-08-01": "jhu-usa"
},
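To find which source supplied a given day, the range keys in dateSources can be matched against the date. The sketch below assumes only the shape shown above: keys are either a single ISO date or `start..end`, and values are either a source name or an object mapping source names to fields. `sourcesForDate` is a hypothetical helper, not part of the project.

```javascript
// Hypothetical helper: return the dateSources entry covering `date`.
// ISO dates (YYYY-MM-DD) compare correctly as plain strings.
function sourcesForDate(dateSources, date) {
  for (const [range, source] of Object.entries(dateSources)) {
    // A key without ".." is a single day, so the end defaults to the start.
    const [start, end = start] = range.split('..');
    if (date >= start && date <= end) return source;
  }
  return undefined; // no source recorded for this date
}

// Copied from the Alameda County dateSources section above.
const alamedaDateSources = {
  '2020-01-24..2020-02-29': 'jhu-usa',
  '2020-03-01..2020-03-20': { 'jhu-usa': ['deaths'], nyt: ['cases'] },
  '2020-03-21..2020-07-31': 'us-ca-mercury-news',
  '2020-08-01': 'jhu-usa',
};

const sourceOnJune24 = sourcesForDate(alamedaDateSources, '2020-06-24');
const sourceOnAug1 = sourcesForDate(alamedaDateSources, '2020-08-01');
```

Here the 6/24 values resolve to us-ca-mercury-news and the 8/1 value to jhu-usa, which lines up with the dates where "cases" changes scale.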

@1ec5
Contributor

1ec5 commented Aug 3, 2020

I’m seeing something similar with case counts, except it doesn’t go back to being a cumulative number: #370. I’m also seeing more deaths than cases for the following California counties:

  • Alameda County
  • Los Angeles County
  • Riverside County
  • San Bernardino County

and more recoveries than cases for the following counties:

  • San Bernardino County
  • Ventura County

and no cases for the following counties that have had cases:

  • San Joaquin County
  • San Mateo County
  • Shasta County
  • Ventura County
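The three symptoms listed above (more deaths than cases, more recoveries than cases, and zero cases after earlier nonzero counts) can be checked per location. This is a rough sketch assuming cumulative counts in a date-keyed object like timeseries-byLocation.json; `findInconsistencies` and its message strings are hypothetical.

```javascript
// Hypothetical checker for the three symptoms described above.
// Assumes cumulative counts keyed by ISO date.
function findInconsistencies(series) {
  const problems = [];
  let seenCases = false; // true once any earlier day reported cases > 0
  for (const date of Object.keys(series).sort()) {
    const { cases, deaths, recovered } = series[date];
    if (typeof cases !== 'number') continue; // nothing to compare against
    if (typeof deaths === 'number' && deaths > cases) {
      problems.push({ date, problem: 'deaths exceed cases' });
    }
    if (typeof recovered === 'number' && recovered > cases) {
      problems.push({ date, problem: 'recoveries exceed cases' });
    }
    if (cases === 0 && seenCases) {
      problems.push({ date, problem: 'zero cases after earlier nonzero count' });
    }
    if (cases > 0) seenCases = true;
  }
  return problems;
}

// Alameda County values reported earlier in this thread.
const alamedaSample = {
  '2020-06-23': { cases: 5140, deaths: 120 },
  '2020-06-24': { cases: 5, deaths: 122 },
  '2020-06-25': { cases: 5, deaths: 128 },
};

const alamedaProblems = findInconsistencies(alamedaSample);
```

For the Alameda sample this reports "deaths exceed cases" on 6/24 and 6/25 and nothing on 6/23.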

@1ec5
Contributor

1ec5 commented Aug 5, 2020

#371 fixed some but not all of the issues in #363 (comment).

@TomGoBravo

Tested looks good in the "California County Coronavirus Reporting" Google Spreadsheet maintained by Harriet Rowan, but the data I'm fetching from https://coronadatascraper.com/timeseries.csv.zip is still broken for Contra Costa County. Do you think this is due to caching or to remaining parsing issues?

@1ec5
Contributor

1ec5 commented Aug 6, 2020

Here’s what timeseries-byLocation.json says for Contra Costa County in August:

      "2020-08-01": {
        "cases": 7806,
        "deaths": 121,
        "hospitalized_current": 106,
        "tested": 135408,
        "growthFactor": 1.02
      },
      "2020-08-02": {
        "cases": 7966,
        "deaths": 125,
        "hospitalized_current": 107,
        "tested": 136325,
        "growthFactor": 1.02
      },
      "2020-08-03": {
        "cases": 8033,
        "deaths": 127,
        "hospitalized_current": 100,
        "tested": 136801,
        "growthFactor": 1.01
      },
      "2020-08-04": {
        "cases": 8176,
        "deaths": 131,
        "hospitalized_current": 101,
        "tested": 137460,
        "growthFactor": 1.02
      },
      "2020-08-05": {}

137,460 matches what the spreadsheet shows for August 4 in Contra Costa County. The empty object for August 5 might be because the spreadsheet already shows data for some counties on August 5. The scraper only avoids returning a result if no county has reported data on a certain date:

if (counties.length === 0) {
  throw new Error(`Timeseries does not contain a sample for ${filterDate}`)
}

@TomGoBravo

Apologies for what may have been a false alarm. I agree that cases for Contra Costa County now look good.

@1ec5
Contributor

1ec5 commented Aug 6, 2020

The COVID Atlas site still shows 15,500 deaths in Santa Clara County and similarly catastrophic spikes across the Bay Area on June 26, as originally reported above:

[Image: santa-clara]

One solution is to stand up alternative scrapers that will be preferred over the Mercury News source, such as #375 for Santa Clara County, #378 for Alameda County, and #379 for Marin County.

@1ec5
Contributor

1ec5 commented Aug 10, 2020

As a followup to #363 (comment), San Mateo County and possibly others are showing an explicit 0 cases on recent days for which there’s no data, instead of undefined:

    "2020-08-05": {
      "cases": 5758,
      "deaths": 120,
      "hospitalized_current": 60,
      "tested": 107268,
      "icu_current": 15,
      "growthFactor": 1
    },
    "2020-08-06": {
      "cases": 0,
      "deaths": 0
    },
    "2020-08-07": {
      "cases": 0,
      "deaths": 0
    },
    "2020-08-08": {
      "cases": 0,
      "deaths": 0
    },
    "2020-08-09": {
      "cases": 0,
      "deaths": 0
    }
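Until the source is fixed, a consumer could blank out such trailing placeholders so they read as "no data" rather than as a real zero. The sketch below is one possible workaround, not the project's behavior: `blankTrailingZeroPlaceholders` and `isAllZero` are hypothetical, and replacing a placeholder with an empty object mirrors how the Contra Costa County excerpt earlier in this thread represents a day without data. It only acts when the last non-placeholder day has cases > 0, since a cumulative count cannot legitimately fall back to zero.

```javascript
// Hypothetical helpers: treat a trailing run of all-zero entries as "no data".
function isAllZero(entry) {
  const values = Object.values(entry);
  return values.length > 0 && values.every((v) => v === 0);
}

function blankTrailingZeroPlaceholders(series) {
  const dates = Object.keys(series).sort();
  let last = dates.length - 1;
  // Walk backwards over the trailing run of all-zero entries.
  while (last >= 0 && isAllZero(series[dates[last]])) last--;
  // If no earlier day has a positive count, the zeros may be genuine.
  if (last < 0 || !(series[dates[last]].cases > 0)) return series;
  const cleaned = { ...series };
  for (const date of dates.slice(last + 1)) cleaned[date] = {};
  return cleaned;
}

// San Mateo County values copied from the excerpt above (abridged fields).
const sanMateo = {
  '2020-08-05': { cases: 5758, deaths: 120 },
  '2020-08-06': { cases: 0, deaths: 0 },
  '2020-08-07': { cases: 0, deaths: 0 },
};

const sanMateoCleaned = blankTrailingZeroPlaceholders(sanMateo);
```

The function returns a new object and leaves the input untouched, so the raw values stay available for comparison.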

@jzohrab
Contributor

jzohrab commented Aug 10, 2020

Yep, I don't know why some are coded that way; it's incorrect. Thanks for catching it.
