This repository has been archived by the owner on Dec 5, 2022. It is now read-only.

Future data provided for California in timeseries.csv #360

Open
jingjtang opened this issue Jul 30, 2020 · 10 comments
Assignees
Labels
needs-verification Waiting for verification on an item that we believe to be fixed.

Comments

@jingjtang

Dear friends in the Corona Data Scraper group, thank you so much for providing such a source. I am using your data (mostly timeseries.zip) for COVID-19 related research. I found future data for California in the file, which confuses me. For example, today is 07-30, but there are case numbers for California on 07-31. Is there a mismatch between the cases/deaths/tested values and the dates?

@jzohrab
Contributor

jzohrab commented Jul 30, 2020

Hi there @jingjtang, thanks for the issue! We recently converted to a new report, and this is a new bug. I'm not sure yet where it comes from, but I'm going to try to solve it now, as a few people have noted it. Thank you! jz

@jzohrab self-assigned this Jul 30, 2020
@jzohrab
Contributor

jzohrab commented Jul 30, 2020

Hm, I just downloaded timeseries-byLocation.json from https://covidatlas.com/data (which links to a file on s3, https://liproduction-reportsbucket-bhk8fnhv1s76.s3-us-west-1.amazonaws.com/v1/latest/timeseries-byLocation.json), and its last date appears to be

      "2020-07-13": {
        "cases": 317167,
        "deaths": 7017,
        "tested": 3920501,
        "hospitalized": 3591,
        "recovered": 66625,
        "icu": 532,
        "growthFactor": 1
      }

timeseries.csv linked on that same page does have the date you mentioned though:

 MacBook-Air:Downloads jeff$ grep iso1:us#iso2:us-ca, timeseries.csv | tail -n 2
iso1:us#iso2:us-ca,california-us,"California, US",state,,,California,United States,37.25,-119.61,39512223,,America/Los_Angeles,485300,8901,117835,,5257026,4048,,,632,,2020-07-30
iso1:us#iso2:us-ca,california-us,"California, US",state,,,California,United States,37.25,-119.61,39512223,,America/Los_Angeles,485300,8901,117835,,5257026,4048,,,632,,2020-07-31

There are a few things to diagnose here; checking.
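For anyone who wants to repeat this check without grep, here is a minimal Python sketch. It assumes, based only on the two rows above, that the date is the last column of timeseries.csv; the function name future_rows is made up for illustration, and the sample rows are shortened versions of the rows shown above.

```python
import csv
import io
from datetime import date


def future_rows(csv_text, today):
    """Return (location_id, date) pairs for rows dated after `today`.

    Assumption: the ISO date is the last column, as in the rows quoted above.
    """
    found = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        try:
            row_date = date.fromisoformat(row[-1])
        except ValueError:
            continue  # header line or a row without a parseable date
        if row_date > today:
            found.append((row[0], row_date.isoformat()))
    return found


# Shortened sample rows (most columns omitted for readability).
sample = (
    'iso1:us#iso2:us-ca,california-us,"California, US",state,2020-07-30\n'
    'iso1:us#iso2:us-ca,california-us,"California, US",state,2020-07-31\n'
)
print(future_rows(sample, date(2020, 7, 30)))
```

With "today" set to 2020-07-30, only the 2020-07-31 row is reported.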

@jzohrab
Contributor

jzohrab commented Jul 30, 2020

baseData.json, which is the base data source for all reports, has future data:

      "2020-07-31": {
        "cases": 485300,
        "deaths": 8901,
        "tested": 5257026,
        "hospitalized": 4048,
        "recovered": 117835,
        "icu": 632,
        "growthFactor": 1
      }

The dateSources in that report show

      "2020-07-30..2020-07-31": "us-ca-mercury-news"

Checking that source to see what's up.
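To see which individual days a dateSources entry covers, the range key can be expanded. This is only a sketch: it assumes the "start..end" key is an inclusive date range and that a key without ".." is a single date, neither of which is confirmed here. The helper name expand_date_range is hypothetical.

```python
from datetime import date, timedelta


def expand_date_range(key):
    """Expand 'YYYY-MM-DD..YYYY-MM-DD' (assumed inclusive) into ISO dates.

    A key without '..' is treated as a single date (an assumption).
    """
    start_text, _, end_text = key.partition("..")
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text) if end_text else start
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


date_sources = {"2020-07-30..2020-07-31": "us-ca-mercury-news"}
by_day = {
    day: source
    for key, source in date_sources.items()
    for day in expand_date_range(key)
}
print(by_day)
```

Under those assumptions, both 2020-07-30 and 2020-07-31 map to us-ca-mercury-news.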

@jzohrab
Contributor

jzohrab commented Jul 30, 2020

The Mercury News scraper (us-ca-mercury-news) has the following source: https://docs.google.com/spreadsheets/d/1CwZA4RPNf_hUrwzNLyGGNHRlh1cwl8vDHwIoae51Hac/gviz/tq?tqx=out:csv&sheet=timeseries

But the latest date in that source is currently 07-29:

"County","Date","URL","Created","Last updated","Submitted By","Cases Total","Cases New Reported","Cases New Calculated","Cases Percent","Deaths Total","Deaths New","Tests Total","Tests New","Testing Turnaround","Pending Tests Current","Negative Tests Total","Negative Tests New","Positive Tests Total","Positive Tests New","Inconclusive Tests Total","Inconclusive Tests New","Recovered Total","Recovered New","Hospital Confirmed Total","Hospital Confirmed New","Hospital Confirmed Current","ICU Total","ICU New","ICU Current","Ventilator Total","Ventilator New","Ventilator Current","Symptomatic No Hospital Total","Symptomatic No Hospital New","Asymptomatic Total","Asymptomatic New","Hospital Suspected Total","Hospital Suspected New","Hospital Suspected Current","7-day new case average","7-day new death average","county population","14-day new cases","14-day new deaths","new daily testing (total or pos+neg)","total tests (total or calculated pos+neg)","7-day positivity rate","test rate 7-day average"
"Alameda","2020-07-29","http://www.acphd.org/2019-ncov.aspx","7/29/2020 12:44:12","7/29/2020 16:13:12","HR","10,773","","140","","181","0","","","","","","","","","","","","","","","","","","","","","","","","","","","","","161","1.428571429","1,685,886","13.6130201","27","0","0","","0.0000000"
"Alpine","2020-07-29","http://alpinecountyca.gov/Index.aspx?NID=516","7/29/2020 12:44:12","7/22/2020 18:15:35","","2","","0","","0","0","","","","","","","","","","","2","0","","","","","","","","","","","","","","","","","0","0","1,117","8.952551477","0","0","0","","0.0000000"
"Amador","2020-07-29","https://www.amadorgov.org/services/covid-19/-fsiteid-1","7/29/2020 12:44:12","7/28/2020 18:12:22","","89","","0","2.26%","0","0","3932","0","","","","","","","","","60","0","13","","4","","","","","","","","","","","","","","3","0","38,531","10.64078275","0","0","0","8.63%","1.0307100"
"Butte","2020-07-29","https://infogram.com/1pe66wmyjnmvkrhm66x9362kp3al60r57ex","7/29/2020 12:44:12","7/29/2020 16:46:04","EW","883","","17","4.97%","7","0","17755","0","","","16889","0","883","17","","","709","17","","","5","","","","","","","","","","","","","","29","0.2857142857","217,769","20.48041732","3","0","0","13.05%","0.9938447"
"Calaveras","2020-07-29","https://covid19.calaverasgov.us/","7/29/2020 12:44:12","7/28/2020 18:33:55","","108","","0","","1","0","","","","","","","","","","","80","0","","","1","","","","","","","","","","","","","","2","0","44,289","7.451060083","1","0","0","-0.34%","1.6740693"
"Colusa","2020-07-29","http://www.countyofcolusa.org/771/COVID19","7/29/2020 12:44:12","7/29/2020 16:51:13","EW","304","","12","","3","1","","","","","1768","4","304","12","","","214","5","","","4","","","","","","","","","","","","","","9","0.1428571429","22,593","72.58885496","3","0.7081839508","304","24.60%","1.5934139"
"Contra Costa","2020-07-29","https://cchealth.org/coronavirus/","7/29/2020 12:44:12","7/29/2020 12:45:36","HR","7,714","","410","5.74%","109","1","134,411","4,031","","","","","","","","","","","","","105","","","","","","","","","","","","","","216","1","1,160,099","22.18776156","17","3.474703452","4031","9.10%","2.0471159"
"Del Norte","2020-07-29","https://dnco.maps.arcgis.com/apps/opsdashboard/index.html#/3dd5de4df5194963853f7f40e38a3a01","7/29/2020 12:44:12","7/29/2020 18:09:50","EW","88","","0","2.54%","0","0","3466","-525","","","","","","","","","","","2","","0","","","","","","","","","","","","","","1","0","27,558","9.797517962","0","-19.05072937","-525","11.25%","0.4147098"
"El Dorado","2020-07-29","https://www.edcgov.us/Government/hhsa/Pages/EDCCOVID-19-Cases.aspx","7/29/2020 12:44:12","7/29/2020 18:09:31","EW","589","24","10","3.34%","1","0","17644","131","","","17055","121","589","10","","","386","16","","","1","","","1","","","","","","","","","","","15","0","193,098","11.70390165","1","0.6784119981","131","8.85%","0.8611467"

Both of these return no records: grep 2020-07-30 data.csv and grep 2020-07-31 data.csv.
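The same check can be done by asking for the latest value in the sheet's "Date" column. A minimal sketch follows, using the "County" and "Date" headers from the CSV above; the sample is cut down to three columns, and the function name latest_date is made up for illustration.

```python
import csv
import io


def latest_date(csv_text, column="Date"):
    """Return the most recent value in `column`, or None if it is empty.

    ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    dates = [row[column] for row in reader if row.get(column)]
    return max(dates) if dates else None


# Three of the sheet's columns, with values taken from the rows above.
sample = (
    '"County","Date","Cases Total"\n'
    '"Alameda","2020-07-29","10,773"\n'
    '"Alpine","2020-07-29","2"\n'
)
print(latest_date(sample))
```

For this sample the result is 2020-07-29, which matches the empty grep results for 07-30 and 07-31.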

I recently updated merc news, so will check the old implementation to see if it messes up the dates.

@jzohrab
Contributor

jzohrab commented Jul 30, 2020

The old code mishandled the dates. For example, running scrape against the current data from the site logs the following: "2020-07-30 is newer than last sample 2020-07-29. Using last sample anyway." So the Mercury News data is recorded under the date 2020-07-30, even though the source only has data up to 2020-07-29.
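To make the mislabeling concrete, here is a small Python sketch. It is not the project's code: the function names are hypothetical, and the "old" behaviour is only inferred from the log message quoted above. The sample value is Alameda's case total from the sheet.

```python
def choose_sample_date(samples, requested):
    """Use the requested date if the source has it, else the latest sample."""
    if requested in samples:
        return requested
    return max(samples)  # ISO date strings sort chronologically


def record_old(samples, requested):
    """Inferred old behaviour: store the last sample under the requested date."""
    chosen = choose_sample_date(samples, requested)
    return requested, samples[chosen]


def record_fixed(samples, requested):
    """Store the sample under the date that actually appears in the source."""
    chosen = choose_sample_date(samples, requested)
    return chosen, samples[chosen]


samples = {"2020-07-29": {"cases": 10773}}
print(record_old(samples, "2020-07-30"))
print(record_fixed(samples, "2020-07-30"))
```

The first call labels 07-29 data as 2020-07-30; the second keeps the 2020-07-29 label from the source.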

That still doesn't explain 2020-07-31 showing up; still looking.

@jzohrab
Contributor

jzohrab commented Jul 30, 2020

The output of npm run gen-reports only contains dates up to 07-30, so nothing is being forward-dated there.

It currently is Friday July 31 in a few areas of the world -- Tokyo, for example -- but honestly I'd be surprised if our main running timezone was ahead of us that much! Will check prod log.

@jzohrab
Contributor

jzohrab commented Jul 30, 2020

Checking prod data first. For example, below is the check for 2020-07-29:

[Screenshot: prod data check for 2020-07-29]

The table has data for 2020-07-29, 2020-07-30, and 2020-07-31 as well. Not good. I re-checked the logs but couldn't see anything obvious.

@jzohrab
Contributor

jzohrab commented Jul 30, 2020

The 2020-07-31 data was updated at 2020-07-30T12:38:47.347Z in DynamoDB. That is still 07-30, though, so I can't see why another date would be recorded.
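As a quick sanity check on the timezone idea, the sketch below lists which whole-hour UTC offsets would already be on 2020-07-31 at that update timestamp. It only does the date arithmetic; it assumes nothing about how the scraper actually derives its date, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# The update timestamp reported above, in UTC.
updated = datetime(2020, 7, 30, 12, 38, 47, 347000, tzinfo=timezone.utc)


def offsets_with_local_date(instant, target_date, offsets=range(-12, 15)):
    """Return whole-hour UTC offsets whose local date equals `target_date`."""
    hits = []
    for hours in offsets:
        local = instant.astimezone(timezone(timedelta(hours=hours)))
        if local.date().isoformat() == target_date:
            hits.append(hours)
    return hits


print(offsets_with_local_date(updated, "2020-07-31"))  # → [12, 13, 14]
```

Only offsets of UTC+12 and later were on 07-31 at that instant; Tokyo (UTC+9) was still at 21:38 on 07-30.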

@jzohrab
Contributor

jzohrab commented Jul 30, 2020

I'm not sure what is happening in the code that is causing this, which doesn't fill me with confidence! The only thought I have here is that the lambda doing the scraping is running in a different timezone, and so assigning a different date. I can't see how it's in such an advanced timezone. Unfortunately our logging is inadequate at the moment, so I can't see how this was set to the future date.

Regardless, a fix that I implemented recently should result in the data having the actual date specified in the data files. I'll keep this issue open until we see the change in effect.

@jingjtang - I'll assign this to you as well to do the check in a couple of days. I'll check too if I can, though I'm spread thin these days. I'll try clearing out the 07-31 data points for mercury-news, though that's a slow operation. :-)

Thanks again @jingjtang for the issue.

@jzohrab added the question and needs-verification labels, and removed the question label, Jul 30, 2020
@jzohrab
Contributor

jzohrab commented Aug 1, 2020

I've also pushed #365 to staging and prod, which had the same forward-dating bug. I believe that this will fix the issue. It may take a couple of days for us to know.
