This repository has been archived by the owner on Dec 5, 2022. It is now read-only.

Number of US states are missing deaths/tested #546

Open
zbraniecki opened this issue Mar 27, 2020 · 20 comments
Assignees
Labels
needs-verification Waiting for verification on an item that we believe to be fixed.

Comments

@zbraniecki

States with reported deaths that are not in today's data:

  • New York
  • New Jersey
  • Texas
  • Georgia
  • Colorado
  • Tennessee
  • Wisconsin
  • Maryland
  • Missouri
  • Arizona
  • Oklahoma
  • Kansas
  • Rhode Island
  • Maine
  • New Hampshire
  • Delaware
  • New Mexico
  • Montana
  • West Virginia
  • Alaska

Compared to https://coronavirus.1point3acres.com/en

@zbraniecki
Author

I think the tested counts are missing as well.

In the report.json for NY I see:

country:"USA"
url:"https://covidtracking.com/api/states"
type:"json"
curators:Array[1]
aggregate:"state"
priority:-0.5
timeseries:false
headless:false
certValidation:true
state:"NY"
deaths:385
tested:122104
cases:37258
ssl:true
rating:0.49019607843137253

but in byLocation.json, I see:

      "2020-3-26": {
        "cases": 37258,
        "growthFactor": 1.209243452013891
      }

Same for Texas

@zbraniecki zbraniecki changed the title Number of US states are missing deaths Number of US states are missing deaths/tested Mar 27, 2020
@zbraniecki
Author

They're also reported with aggregate: "county". I think the state entries are deduped against the county entries because, for all those states, both a county-level and a state-level source are filed under the same name.

@zbraniecki
Author

@lazd sorry to poke you, but I think this is pretty severe and I'd like to make sure it doesn't escape your attention before the next update. Can you mark it with appropriate labels?

@lazd
Contributor

lazd commented Mar 27, 2020

No worries @zbraniecki. As part of covidatlas/coronadatascraper#410, I will try to get tested data back via COVIDTracking. That said, I don't think we can get deaths unless we have it somewhere.

@zbraniecki
Author

COVIDTracking has deaths for those states:

Will that also fix it?

@zbraniecki
Author

And covidatlas/coronadatascraper#410 seems to be about combining data from two sources. My suspicion is that this bug is about two sources for two different things (county vs. state) ending up conflated as two sources of the same thing and the county one wins.

Here's what's in report.json for "NY, USA":

      "NY, USA": [
        {
          "country": "USA",
          "url": "https://covidtracking.com/api/states",
          "type": "json",
          "curators": [
            {
              "name": "The COVID Tracking Project",
              "url": "https://covidtracking.com/",
              "twitter": "@COVID19Tracking",
              "github": "COVID19Tracking"
            }
          ],
          "aggregate": "state",
          "priority": -0.5,
          "timeseries": false,
          "headless": false,
          "certValidation": true,
          "state": "NY",
          "deaths": 385,
          "tested": 122104,
          "cases": 37258,
          "ssl": true,
          "rating": 0.49019607843137253
        },
        {
          "state": "NY",
          "country": "USA",
          "type": "table",
          "aggregate": "county",
          "timeseries": false,
          "headless": false,
          "certValidation": true,
          "priority": 0,
          "url": "https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases",
          "cases": 37258,
          "ssl": true,
          "rating": 0.3137254901960784
        }
      ],

I think this is a different thing from covidatlas/coronadatascraper#410. Mainly, those counties should not be conflated with states, and "NY, USA" should be a state and collect state data.

@lazd
Contributor

lazd commented Mar 28, 2020

No, those counties' cases are rolled up into state totals, which is exactly what we want to do. However, testing numbers aren't being reported on a per-county basis, so they're not getting rolled up.

So what we want to do is take our rolled up case numbers and take COVIDTracking's testing numbers, which is what covidatlas/coronadatascraper#410 is about.
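A minimal sketch of that idea, not the project's actual code: sum each field over the county rows, and where no county reports a field, take it from the state-level source. The function names and the two-county split are made up for illustration; only the NY totals (37258 cases, 385 deaths, 122104 tested) come from the report.json excerpt in this thread.

```python
def roll_up(county_rows, field):
    """Sum `field` over counties; return None if no county reports it."""
    values = [row[field] for row in county_rows if row.get(field) is not None]
    return sum(values) if values else None


def state_record(county_rows, state_level, fields=("cases", "deaths", "tested")):
    """Prefer rolled-up county numbers; fall back to the state-level source."""
    record = {}
    for field in fields:
        rolled = roll_up(county_rows, field)
        record[field] = rolled if rolled is not None else state_level.get(field)
    return record


# Hypothetical county split (illustration only); it sums to the NY case total.
counties = [
    {"county": "County A", "cases": 30000},
    {"county": "County B", "cases": 7258},
]
# State-level values as shown for NY in report.json (COVIDTracking source).
covidtracking_ny = {"cases": 37258, "deaths": 385, "tested": 122104}

ny = state_record(counties, covidtracking_ny)
print(ny)
```

With this shape, cases come from the county rollup while deaths and tested come from the state-level source, instead of being dropped.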

@mvanmidd

Just as another data point, I'm still seeing deaths == nan for all of NY state and city.

# omitted: load `df` from `timeseries.csv`, parse dates, drop lat/lon/url columns
>>> usall = df[df["country"] == "USA"]
>>> usall[usall.deaths.notnull() & (usall.deaths > 0)].state.unique()
array(['WA', 'CA', 'MA', 'GA', 'FL', 'NJ', 'OR', 'IL', 'PA', 'IA', 'NC',
       'SC', 'IN', 'KY', 'NV', 'OH', 'WI', 'CT', 'HI', 'OK', 'UT', 'KS',
       'LA', 'MO', 'VT', 'AR', 'ID', 'ME', 'MI', 'MS', 'NM', 'ND', 'SD',
       'CO', nan, 'VA', 'DC', 'AL', 'PR', 'GU', 'AK', 'MN'], dtype=object)

(Personally I'm less interested in tested, except to the extent that it's caused by the same underlying issue... reports on number tested have been inconsistent across most aggregators; cases+deaths have been more reliable).

Also, thanks for putting this dataset together; I've been lurking for a while and am impressed with the work y'all're putting in. Unfortunately I just migrated the daily updates I send to friends and family on cases+deaths (in the states we live in) to use this timeseries data; bad timing I guess :)

Good luck with the fix, and thanks again!

@lazd
Contributor

lazd commented Mar 30, 2020

@mvanmidd we don't have a source for deaths in NY on a per-county basis: https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases

NYC only notes deaths for the entire city: https://www1.nyc.gov/assets/doh/downloads/pdf/imm/covid-19-daily-data-summary.pdf

We can pull deaths for NYC from the daily update PDF, but we're out of luck for the rest of the New York counties. Until we implement covidatlas/coronadatascraper#410, we won't be pulling deaths for NY state either, unfortunately.

@mvanmidd

Gotcha, thanks for the update. covidatlas/coronadatascraper#410 seems like a big one, good luck! y'all are going to have a fully featured generic/configurable ETL framework pretty soon :)

In all seriousness, I think the auditable data aggregation is the biggest strength of this project... there's plenty of frontend work going on elsewhere (e.g. the explosion of "babby's first plotly visualizations," including my own), and on the backend, lots of data sources that are either incomplete or opaque. Keep up the good work!

@cristipp

cristipp commented Apr 7, 2020

For US state/county data, how about NYT repo: https://github.com/nytimes/covid-19-data ?

@jzohrab
Contributor

jzohrab commented Apr 7, 2020

@cristipp - good find. I haven't looked at the actual data, but the README is encouraging.

@cristipp

cristipp commented Apr 7, 2020

Rows: 1884
Columns: date, state, fips, cases, deaths
[
    {date: '2020-03-01', state: 'New York', fips: '36', cases: '1', deaths: '0'},
    {date: '2020-03-02', state: 'New York', fips: '36', cases: '1', deaths: '0'},
    {date: '2020-03-03', state: 'New York', fips: '36', cases: '2', deaths: '0'},
    {date: '2020-03-04', state: 'New York', fips: '36', cases: '11', deaths: '0'},
    {date: '2020-03-05', state: 'New York', fips: '36', cases: '22', deaths: '0'},
    {date: '2020-03-06', state: 'New York', fips: '36', cases: '44', deaths: '0'},
    {date: '2020-03-07', state: 'New York', fips: '36', cases: '89', deaths: '0'},
    {date: '2020-03-08', state: 'New York', fips: '36', cases: '106', deaths: '0'},
    {date: '2020-03-09', state: 'New York', fips: '36', cases: '142', deaths: '0'},
     ...
]

And for county level:

[
    {date: '2020-03-28', county: 'Albany', state: 'New York', fips: '36001', cases: '195', deaths: '1'},
    {date: '2020-03-29', county: 'Albany', state: 'New York', fips: '36001', cases: '205', deaths: '1'},
    {date: '2020-03-30', county: 'Albany', state: 'New York', fips: '36001', cases: '217', deaths: '1'},
    {date: '2020-03-31', county: 'Albany', state: 'New York', fips: '36001', cases: '226', deaths: '1'},
    {date: '2020-04-01', county: 'Albany', state: 'New York', fips: '36001', cases: '240', deaths: '2'},
    {date: '2020-04-02', county: 'Albany', state: 'New York', fips: '36001', cases: '253', deaths: '2'},
    {date: '2020-04-03', county: 'Albany', state: 'New York', fips: '36001', cases: '267', deaths: '4'},
    {date: '2020-04-04', county: 'Albany', state: 'New York', fips: '36001', cases: '293', deaths: '6'},
    {date: '2020-04-05', county: 'Albany', state: 'New York', fips: '36001', cases: '305', deaths: '8'},
    {date: '2020-04-01', county: 'Allegany', state: 'New York', fips: '36003', cases: '10', deaths: '1'},
   ...
]

They also have NYC as a separate entry [empty fips]:

    {date: '2020-03-14', county: 'New York City', state: 'New York', fips: '', cases: '269', deaths: '1'},
    {date: '2020-03-15', county: 'New York City', state: 'New York', fips: '', cases: '330', deaths: '5'},
    {date: '2020-03-16', county: 'New York City', state: 'New York', fips: '', cases: '464', deaths: '7'},
    {date: '2020-03-17', county: 'New York City', state: 'New York', fips: '', cases: '645', deaths: '10'},
    {date: '2020-03-18', county: 'New York City', state: 'New York', fips: '', cases: '1339', deaths: '20'},
    {date: '2020-03-19', county: 'New York City', state: 'New York', fips: '', cases: '2468', deaths: '22'},
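For anyone consuming these rows, note that every value arrives as a string and that the New York City entry has an empty fips. A small sketch of normalizing that, assuming a CSV header of date,county,state,fips,cases,deaths (inferred from the keys shown above); `parse_rows` is a hypothetical helper, and the two sample rows are copied from the listings in this comment.

```python
import csv
import io

# Two rows taken from the listings above; header layout is an assumption.
SAMPLE = """date,county,state,fips,cases,deaths
2020-04-05,Albany,New York,36001,305,8
2020-03-19,New York City,New York,,2468,22
"""


def parse_rows(text):
    """Convert counts to int and map an empty fips to None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "date": raw["date"],
            "county": raw["county"],
            "state": raw["state"],
            "fips": raw["fips"] or None,  # '' for the combined NYC entry
            "cases": int(raw["cases"]),
            "deaths": int(raw["deaths"]),
        })
    return rows


rows = parse_rows(SAMPLE)
for row in rows:
    print(row["county"], row["fips"], row["cases"], row["deaths"])
```

Keeping fips as a string preserves any leading zeros, and the None marker makes the combined NYC entry easy to special-case later.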

@jzohrab
Contributor

jzohrab commented Apr 7, 2020

That looks encouraging, @lazd what do you think?
@cristipp, do you feel you could take a shot at writing a scraper for this data?

@cristipp

cristipp commented Apr 7, 2020

I'd be happy to. Though I see it already appearing in the https://coronadatascraper.com/#crosscheck for many [all?] counties. Perhaps you don't have it for state-level data?

@cristipp

cristipp commented Apr 7, 2020

Oh, I found NY at state level too: https://coronadatascraper.com/#crosscheck:iso2:US-NY-iso1:US. It appears the scraper prefers the arcgis dataset for some reason.

FWIW, looks like most recent data + deaths + tested for iso2:US-NY is coming from https://covidtracking.com, see https://coronadatascraper.com/#crosscheck:iso2:US-NY-iso1:US.

| Source | Cases | Deaths | Tested | Recovered |
|---|---|---|---|---|
| https://health.data.ny.gov/api/views/xdss-u53e/rows.csv?accessType=DOWNLOAD | 130689 | - | 320811 | - |
| https://covidtracking.com/api/states | 130689 | 4758 | 320811 | - |
| https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv | 122911 | 3483 | - | - |
| https://github.com/CSSEGISandData/COVID-19 | 131815 | 4389 | - | - |

@jzohrab
Contributor

jzohrab commented Apr 25, 2020

I believe that @hyperknot has closed out this issue by upping the priority of the covidtracking scraper. @cristipp, what's your feeling?

@cristipp

cristipp commented Apr 26, 2020

Good to have the state level data fixed. We're still lacking county level data for NY fatalities [0].

The county level fatalities can be pulled from NYT [1] or USAFacts [2], with the quirk that NYT sums the 5 counties of NYC together. Also note these sources don't report 'tested', which CoronaDataScraper does.

  • We have a NY scraper using NYT county data (cases + deaths, nothing else) in coronadatascraper#876, which adds county-level fatalities data from NYT. Alas, that PR does not fill in data for the NYC counties [Kings, Queens, New York, Bronx, Richmond, see https://en.wikipedia.org/wiki/Boroughs_of_New_York_City] because of the aforementioned quirk, which makes it less useful than I'd like.

  • There is a general question of whether CoronaDataScraper wants to fill in missing data from other meta-aggregators as a post-processing step, on a cell-by-cell basis. Think of an extra field 'fallback: true', which enables the fallback behavior on a source-by-source basis, in priority order. Then we could add nyt, usafacts and whatnot. Would that be something of interest?

    [0] coronadatascraper: [
        {key: 'US-NY', date: '2020-04-22', cases: 257216, tested: 669982, deaths: 15302},
        {key: 'US-NY-Queens County', date: '2020-04-22', cases: 43713, tested: 88388, deaths: undefined},
        {key: 'US-NY-Kings County', date: '2020-04-22', cases: 38481, tested: 81787, deaths: undefined},
        {key: 'US-NY-Nassau County', date: '2020-04-22', cases: 31555, tested: 74571, deaths: undefined},
        {key: 'US-NY-Bronx County', date: '2020-04-22', cases: 30868, tested: 65304, deaths: undefined},
        {key: 'US-NY-Suffolk County', date: '2020-04-22', cases: 28854, tested: 71268, deaths: undefined},
        {key: 'US-NY-Westchester County', date: '2020-04-22', cases: 25276, tested: 76564, deaths: undefined},
        {key: 'US-NY-New York County', date: '2020-04-22', cases: 19025, tested: 49687, deaths: undefined},
        {key: 'US-NY-Richmond County', date: '2020-04-22', cases: 10345, tested: 26289, deaths: undefined},
        {key: 'US-NY-Rockland County', date: '2020-04-22', cases: 9699, tested: 23150, deaths: undefined},
        ... 53 more
    ]
    [1] nyt: [
        {key: 'US-NY', date: '2020-04-22', cases: 257246, deaths: 15302},
        {key: 'US-NY-New York City', date: '2020-04-22', cases: 142442, deaths: 10614},
        {key: 'US-NY-Nassau', date: '2020-04-22', cases: 31555, deaths: 1764},
        {key: 'US-NY-Suffolk', date: '2020-04-22', cases: 28854, deaths: 959},
        {key: 'US-NY-Westchester', date: '2020-04-22', cases: 25275, deaths: 932},
        {key: 'US-NY-Rockland', date: '2020-04-22', cases: 9699, deaths: 309},
        {key: 'US-NY-Orange', date: '2020-04-22', cases: 6705, deaths: 183},
        {key: 'US-NY-Dutchess', date: '2020-04-22', cases: 2391, deaths: 57},
        {key: 'US-NY-Erie', date: '2020-04-22', cases: 2233, deaths: 174},
        {key: 'US-NY-Monroe', date: '2020-04-22', cases: 1112, deaths: 72},
        ... 50 more
    ]
    [2] usafacts: [
        {key: 'US-NY-Queens County', date: '2020-04-22', cases: 43713, deaths: 3432},
        {key: 'US-NY-Kings County', date: '2020-04-22', cases: 38481, deaths: 3458},
        {key: 'US-NY-Nassau County', date: '2020-04-22', cases: 31555, deaths: 1431},
        {key: 'US-NY-Bronx County', date: '2020-04-22', cases: 31130, deaths: 2258},
        {key: 'US-NY-Suffolk County', date: '2020-04-22', cases: 28854, deaths: 926},
        {key: 'US-NY-Westchester County', date: '2020-04-22', cases: 25276, deaths: 838},
        {key: 'US-NY-New York County', date: '2020-04-22', cases: 19025, deaths: 1337},
        {key: 'US-NY-Richmond County', date: '2020-04-22', cases: 10405, deaths: 492},
        {key: 'US-NY-Rockland County', date: '2020-04-22', cases: 9699, deaths: 334},
        {key: 'US-NY-Orange County', date: '2020-04-22', cases: 6690, deaths: 185},
        ... 54 more
    ]
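To make the NYC quirk concrete, here is a rough sketch that sums the five borough counties from the usafacts rows in [2] so the result can be set beside NYT's single 'US-NY-New York City' row in [1]. The helper name `sum_nyc` and the key parsing are assumptions for illustration; the numbers are copied from the listings above.

```python
# The five counties that NYT reports together as "New York City".
NYC_COUNTIES = {
    "Kings County", "Queens County", "New York County",
    "Bronx County", "Richmond County",
}

# Rows copied from [2] above (2020-04-22); Nassau is included as a non-NYC row.
usafacts = [
    {"key": "US-NY-Queens County", "cases": 43713, "deaths": 3432},
    {"key": "US-NY-Kings County", "cases": 38481, "deaths": 3458},
    {"key": "US-NY-Nassau County", "cases": 31555, "deaths": 1431},
    {"key": "US-NY-Bronx County", "cases": 31130, "deaths": 2258},
    {"key": "US-NY-New York County", "cases": 19025, "deaths": 1337},
    {"key": "US-NY-Richmond County", "cases": 10405, "deaths": 492},
]

# NYT's combined row from [1] above.
nyt_nyc = {"key": "US-NY-New York City", "cases": 142442, "deaths": 10614}


def sum_nyc(rows):
    """Add up cases and deaths over the five NYC counties only."""
    totals = {"cases": 0, "deaths": 0}
    for row in rows:
        county = row["key"].split("-", 2)[2]  # 'US-NY-Kings County' -> 'Kings County'
        if county in NYC_COUNTIES:
            totals["cases"] += row["cases"]
            totals["deaths"] += row["deaths"]
    return totals


combined = sum_nyc(usafacts)
print(combined, nyt_nyc)
```

On these figures the usafacts boroughs sum to 142754 cases and 10977 deaths, against 142442 and 10614 in the NYT row, so the two sources are close but not identical. Going the other way, splitting the NYT total back into five counties, is not possible from the NYT data alone.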

@jzohrab
Contributor

jzohrab commented Aug 9, 2020

Hi @cristipp - getting back to this one after a long delay!

The reports from Li at https://covidatlas.com/data merge data sources by priority. If a lower-priority source supplies a data point that no higher-priority source has, that value is preserved, and we also report the source that was finally selected for each data point (see timeseries-byLocation.json). I believe we're doing what you've suggested.
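As a rough illustration of that behavior (a sketch under stated assumptions, not Li's actual implementation): walk the sources from highest to lowest priority, keep the first non-missing value for each field, and record which source supplied it. The function name, the source names, the priority values, and the "higher number wins" convention are all assumptions; the Queens County numbers are from the 2020-04-22 listings earlier in this thread.

```python
def merge_by_priority(sources, fields=("cases", "deaths", "tested")):
    """Return (merged values, source chosen per field)."""
    merged, chosen = {}, {}
    # Assumption: a larger `priority` number means the source is preferred.
    ordered = sorted(sources, key=lambda s: s["priority"], reverse=True)
    for source in ordered:
        for field in fields:
            value = source["data"].get(field)
            if field not in merged and value is not None:
                merged[field] = value
                chosen[field] = source["name"]
    return merged, chosen


# Queens County, 2020-04-22. Priorities are hypothetical.
sources = [
    {"name": "usafacts", "priority": 0,
     "data": {"cases": 43713, "deaths": 3432}},
    {"name": "ny-state", "priority": 1,
     "data": {"cases": 43713, "tested": 88388, "deaths": None}},
]

merged, chosen = merge_by_priority(sources)
print(merged)
print(chosen)
```

Here cases and tested come from the higher-priority source, while deaths, which that source lacks, is preserved from the lower-priority one.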

I believe this issue can be closed -- thoughts?

@jzohrab jzohrab closed this as completed Aug 9, 2020
@jzohrab jzohrab reopened this Aug 9, 2020
@jzohrab jzohrab transferred this issue from covidatlas/coronadatascraper Aug 9, 2020
@jzohrab
Contributor

jzohrab commented Aug 9, 2020

Hi @cristipp and @zbraniecki - getting back to this one after a long delay!

The reports from Li at https://covidatlas.com/data merge data sources by priority. If a lower-priority source supplies a data point that no higher-priority source has, that value is preserved, and we also report the source that was finally selected for each data point (see timeseries-byLocation.json). I believe we're doing what you've suggested.

I believe this issue can be closed -- thoughts?

@jzohrab jzohrab self-assigned this Aug 9, 2020
@jzohrab jzohrab added the needs-verification Waiting for verification on an item that we believe to be fixed. label Aug 9, 2020

5 participants