
articles/data #59

Closed
utterances-bot opened this issue May 13, 2020 · 35 comments

Comments

@utterances-bot

Data • COVID-19 Data Hub

https://covid19datahub.io/articles/data.html


Hello, very good job! We will use it soon for pedagogic and research purposes. Thanks a lot.

Unfortunately, I think the French data are not up to date: many restrictions have ended or become less strict in France over the past 3 days (schools, transport, ...) as of 13/05/2020.

Please have a look at: http://www.leparisien.fr/societe/coronavirus-dernier-jour-de-confinement-en-france-suivez-notre-direct-10-05-2020-8313930.php

Best regards,
take care,
Henri

@eguidotti
Collaborator

Hello, many thanks for your feedback!

For policy measures, we use the data from the "Oxford Covid-19 Government Response Tracker". Would you mind reporting the issue to them?
https://github.com/OxCGRT/covid-policy-tracker

Thanks!

@aman091291

Hi Team,

Thanks for these data sources; they are really useful.

I have a doubt regarding one of the sources:

https://storage.covid19datahub.io/data-2.csv

The number of tests is 0 for all states, and for Alabama the tests column seems correct until 31st January 2021 but contains incorrect data from 1st February.
Id: 50a7f84e for Alabama

Could you please look into this?

eguidotti pushed a commit that referenced this issue Apr 13, 2021
@eguidotti
Collaborator

Hi @aman091291, thank you for your message!

We are using the data from the U.S. Department of Health & Human Services. It seems that access to the data had changed and we were retrieving only the first 1000 rows. I fixed this, and tests and hospitalizations should now be back and up to date for all states (please allow about 1 hour for the workflow to complete).

Please note that we stopped updating the pre-processed data. If you are still using the pre-processed files, please switch to the raw data.
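For example, a minimal sketch of pointing code at the raw per-level files instead (assuming the https://storage.covid19datahub.io/level/&lt;n&gt;.csv.gz URL pattern that appears later in this thread; the pandas call is shown only as a comment):

```python
def raw_url(level: int) -> str:
    """Build the raw-data URL for administrative area level 1, 2, or 3."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    return f"https://storage.covid19datahub.io/level/{level}.csv.gz"

# Typical usage (network access required):
# import pandas as pd
# df = pd.read_csv(raw_url(2))  # state-level raw data
```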

Thanks!

@aman091291

Hi @eguidotti
Thanks for the update.

I can see that the data is now updated in the raw file, but I am currently using the covid19dh package in Python with the statement below, which is still not picking up the correct data:

df = covid19("USA", level = 2, start = date(2020,1,1), verbose = False)

Could you please check whether the Python package refers to the correct raw data file?

Thanks!

@eguidotti
Collaborator

Are you using covid19dh version 2.3.0?

@aman091291

@eguidotti I am using version 1.0.0, as version 2.3.0 returns a tuple when loading the data.

df = covid19("USA", level = 2, start = date(2020,1,1), verbose = False)

Could you suggest a way to load the data in case we need to use version 2.3.0?

@eguidotti
Collaborator

Yes, the tuple contains both the data and the data sources. Please have a look at the description here: https://pypi.org/project/covid19dh/ The following should work in v2.3.0:

from covid19dh import covid19
from datetime import date

df, src = covid19("USA", level = 2, start = date(2020,1,1), verbose = False)

Hope this helps!

@MalteKurz

Hi @eguidotti,
thanks for putting together this data-collection and tools.
The vintage data is pretty interesting; however, I had problems with a few zip containers.

  1. The following vintage zips seem to be missing
Date URL
2020-06-23 https://storage.covid19datahub.io/2020-06-23.zip
2020-06-24 https://storage.covid19datahub.io/2020-06-24.zip
2020-08-02 https://storage.covid19datahub.io/2020-08-02.zip
2020-08-16 https://storage.covid19datahub.io/2020-08-16.zip
2020-12-30 https://storage.covid19datahub.io/2020-12-30.zip
2021-01-13 https://storage.covid19datahub.io/2021-01-13.zip
2021-02-23 https://storage.covid19datahub.io/2021-02-23.zip
2021-02-26 https://storage.covid19datahub.io/2021-02-26.zip
2021-03-24 https://storage.covid19datahub.io/2021-03-24.zip
  2. In the following zip containers, the file src.csv is missing (in some of these containers, other files seem to be missing as well), which causes problems reading the vintage via the Python API
Date URL
2020-08-26 https://storage.covid19datahub.io/2020-08-26.zip
2020-08-30 https://storage.covid19datahub.io/2020-08-30.zip
2020-11-14 https://storage.covid19datahub.io/2020-11-14.zip
2020-11-16 https://storage.covid19datahub.io/2020-11-16.zip
2020-11-23 https://storage.covid19datahub.io/2020-11-23.zip

Would be great if you could take a look and thanks again for the great tools!
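The checks above can be scripted; a minimal sketch that only builds the candidate vintage URLs for a date range (an actual existence check would issue an HTTP HEAD request per URL, omitted here):

```python
from datetime import date, timedelta

def vintage_urls(start: date, end: date) -> list:
    """One vintage-zip URL per day in the closed range [start, end]."""
    days = (end - start).days + 1
    return [
        f"https://storage.covid19datahub.io/{(start + timedelta(days=i)).isoformat()}.zip"
        for i in range(days)
    ]
```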

@eguidotti
Collaborator

Hi @MalteKurz, thanks for your feedback and for taking the time to double-check the vintage data.

Unfortunately, some vintage containers are missing or incomplete due to technical errors that happened in the past. We do our best to promptly fix the pipeline in case of issues, but it is not always possible to spot a problem and fix it within 24 hours. In those cases, the vintage data for that day are lost. We don't want to fill them in retroactively, as the vintage data are guaranteed to be frozen snapshots of the data taken on that day.

Coverage should be around 99% of the days since we started, but unfortunately not 100%.


Hello, thank you for your effort! It is very helpful for my essay. I want to confirm one thing: if I want to download the latest data for the US, should I download the raw data for administrative area level 2?

Collaborator

Hello, thanks for your message! Yes, you can download the raw data for US level 2 (state-level data). Level 1 is nationwide data, while level 3 is at the county level. If you need to ensure reproducibility for your work, you should instead download the latest available vintage file. This is a frozen snapshot of the data that will not change in the future.

@Zhihui123-123

Zhihui123-123 commented Jun 4, 2021 via email

@Zhihui123-123

Zhihui123-123 commented Jun 7, 2021 via email

@eguidotti
Collaborator

Hi @Zhihui123-123, I'm afraid it's not possible or, at least, I'm not aware of data providers for the number of vaccines since December 14, 2020. Have you seen these data somewhere else?

@Zhihui123-123

Zhihui123-123 commented Jun 8, 2021 via email

@yasin-simsek

Hi,
I appreciate your efforts. This is a great source for many researchers. There are some missing vintage datasets. I could not download the following dates' vintage datasets:

  1. 2021-11-20
  2. 2021-11-21
  3. 2021-11-22

Is there any chance you could fix this problem?

Also, I think your vintage datasets after 2021-11-14 have a one-day lag. That is, for example, the vintage data set 2021-11-30 contains 2021-11-29 data but not 2021-11-30. I think this is a matter of choice and nothing problematic here. Can you confirm this?

Best,

@eguidotti
Collaborator

Hi @yasin-simsek, thanks for your message.

I could not download the following dates' vintage datasets:

Unfortunately, there are a (very) few vintage datasets that are missing due to technical issues, and it is not possible to backfill them. Please see #59 (comment)

vintage data set 2021-11-30 contains 2021-11-29 data but not 2021-11-30

Yes, this is due to reporting delays. After 14 November 2021, the date of the vintage dataset is the date when the snapshot was taken. Typically, the data available on day T include the counts up to day T-1, due to natural delays in reporting the data to the local authorities and to different time zones worldwide. I have updated the documentation on the website to make this point clearer. Thanks!

Before 14 November 2021, the vintage data were generated with a delay of 48 hours to make sure all observations were complete (so that the dataset at time T contains the counts up to time T). In other words, the vintage datasets before 14 November 2021 are affected by a look-ahead bias of 2 days. This is no longer the case after 14 November 2021. Apologies for the confusion. Please let me know if you still have any doubts.
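The two conventions can be sketched as a small helper (my reading of the explanation above; the "snapshot date" is the date in the vintage file name):

```python
from datetime import date, timedelta

CUTOVER = date(2021, 11, 14)  # the timestamp convention changed on this date

def latest_expected_observation(snapshot: date) -> date:
    """Most recent data date expected inside a vintage snapshot.

    From 14 November 2021 on, the snapshot taken on day T contains counts
    up to T-1 (natural reporting delay). Before that, snapshots were
    generated with a 48-hour delay, so the dataset dated T contains the
    counts up to T itself.
    """
    if snapshot >= CUTOVER:
        return snapshot - timedelta(days=1)
    return snapshot
```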

@SugarRayLua

Hi, @eguidotti, thanks to you and the Team for a great dataset!

This is the first time I'm using the dataset. I thought I followed the instructions correctly: I downloaded the "LatestData" .gzip, unzipped it to a .csv file, and then read it into R with data.table. The structure of the data indicated there were approximately 2.6 million records. I then tried to construct an epicurve based on the date_confirmed variable, but unexpectedly found that the last confirmed date was in June 2020. Is that correct? My impression was that the project has been collecting public data on diagnosed cases from the start of the pandemic until now (2022). Should I be using another variable in the dataset, or try downloading another dataset file?

I appreciate any assistance you can provide.

Thank you :-)

@eguidotti
Collaborator

Hi @SugarRayLua, thanks for giving the dataset a try! Which data file are you using? As of today (28 April 2022) I see that:

  • the file at level 1 contains about 186,709 observations
  • the file at level 2 contains about 593,602 observations
  • the file at level 3 contains about 8,585,829 observations

Anyway, I suggest using the COVID19 package directly to read the data in R. See here

You can install the package:

install.packages("COVID19")

And import the data:

library(COVID19)
x <- covid19(level = 1) # -> national level. Use 2 or 3 for finer-grained data

My impression was that the data project has been collecting public data points on diagnosed cases from the start of the Pandemic until now (2022)

That's correct. Typing:

max(x$date)

I get "2022-04-28"

@SugarRayLua

Thank you very much, @eguidotti. I did get a more complete range of data points when I loaded the dataset with the COVID19 package as you suggested. It is also helpful that the COVID19 package converts the date columns (which seem to be strings in the raw dataset) to date format, which means one can feed the dataset directly into the incidence or incidence2 R packages.

The only thing I preferred about the raw dataset is that it gave me data in line-listing format; it seems that the COVID19 package instead gives me summary data for each day. Is that correct?

@eguidotti
Collaborator

The only thing I preferred about the raw dataset is that it gave me data in line-listing format; it seems that the COVID19 package instead gives me summary data for each day. Is that correct?

No, that's not correct. The two formats should be the same (except for type conversion upon import). Can you provide the link to the "raw data" you are using? I don't understand which file it is.

@SugarRayLua

@eguidotti, from this website where I’m posting the comments:
https://covid19datahub.io/articles/data.html
(I previously downloaded the "all in one" gzip file.)
I can try downloading that all-in-one gzip file again, but I remembered it giving me line listings and including symptom onset dates in addition to confirmed dates.
Thanks!

@SugarRayLua

(The COVID19 package isn't giving me that granular information. One caveat that might be key: I wasn't able to download level = 3 data [memory error?] with the COVID19 package, though I could read the raw data file directly with data.table. Perhaps the level = 3 data contains the line-listing information.)

@eguidotti
Collaborator

There is no "line listing" information in the database, nor symptom dates. There is only one row per location and date, with the corresponding variables. I can't quite understand which file you are using (?)

With the following code, you can read the level 3 data in R:

library(data.table)
x <- fread("https://storage.covid19datahub.io/level/3.csv.gz")

Maybe you are trying to unzip and read the file https://storage.covid19datahub.io/latest.db.gz? That is a SQLite file and must be read with SQLite; it is not a CSV file that can be read with data.table.
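A sketch of inspecting that file in Python instead (the table names inside latest.db are not documented in this thread, so this just lists whatever tables the file contains; the gzip step assumes latest.db.gz has already been downloaded):

```python
import sqlite3

def list_tables(db_path: str) -> list:
    """Return the names of all tables in a SQLite database file."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]

# Typical usage after decompressing the download:
# import gzip, shutil
# with gzip.open("latest.db.gz", "rb") as src, open("latest.db", "wb") as dst:
#     shutil.copyfileobj(src, dst)
# print(list_tables("latest.db"))
```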

Hope this helps!

@SugarRayLua

Thanks, @eguidotti, for all your help, and sorry for the confusion. I'll look into this more tonight and get back to you. I was also reviewing a Swiss COVID-19 dataset, so I may have inadvertently mislabeled the datasets I was reading.

@SugarRayLua

@eguidotti, I figured it out: I had confused the COVID-19 Data Hub with the data from the Open COVID-19 Data Working Group, which is described at the following site:

https://github.com/beoutbreakprepared/nCoV2019

Their group has line-list data from the COVID-19 pandemic available for download, and their download file is labeled "latestdata.csv", which I confused with the dataset I downloaded from this COVID-19 Data Hub site. It was actually the Open COVID-19 Data Working Group site that didn't seem to have updated data, and to whom I should have directed my initial question. As you can see from the structure of that database, the Open COVID-19 Data Working Group has variables which store symptom onset:

'data.frame': 2676311 obs. of 33 variables:
$ ID : Factor w/ 2676311 levels 000-1-1,000-1-10,..: 1 2 3 4 5 6 7 8 9 10 ...
$ age : Factor w/ 286 levels ,0,0-1,0-10,0-18,..: 1 250 219 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 3 levels ,female,male: 3 3 2 1 1 1 1 1 1 1 ...
$ city : Factor w/ 8531 levels ,$16.00,$21.00,..: 6894 7850 1 8315 5755 8235 6874 2524 5010 5010 ...
$ province : Factor w/ 1246 levels ,ABANCAY,ACOMAYO,..: 460 1154 1 448 521 521 521 521 521 521 ...
$ country : Factor w/ 147 levels ,Afghanistan,Albania,..: 30 66 121 30 30 30 30 30 30 30 ...
$ latitude : num 22.37 45.3 1.35 34.63 27.51 ...
$ longitude : num 114.1 11.7 103.8 113.5 113.9 ...
$ geo_resolution : Factor w/ 7 levels ,JPCP,admin0,admin1,..: 7 7 3 5 5 5 5 5 5 5 ...
$ date_onset_symptoms : Factor w/ 171 levels ,01.01.2020,01.01.2020-12.01.2020,..: 1 1 1 1 1 1 1 1 1 1 ...
$ date_admission_hospital : Factor w/ 163 levels ,01.01.2020,01.02.2020,..: 1 1 1 1 1 1 1 1 1 1 ...
$ date_confirmation : Factor w/ 179 levels ,01.02.2020,01.03.2020,..: 86 127 86 153 86 86 86 86 94 94 ...
$ symptoms : Factor w/ 449 levels ,37.1 ° C, mild coughing,..: 1 1 1 1 1 1 1 1 1 1 ...
$ lives_in_Wuhan : Factor w/ 3 levels ,no,yes: 1 1 1 1 1 1 1 1 1 1 ...
$ travel_history_dates : Factor w/ 215 levels ,- 01.03.2020,..: 160 1 1 1 1 1 1 1 1 1 ...
$ travel_history_location : Factor w/ 773 levels ,;;Belgium,;;Cape Verde,..: 144 1 1 1 1 1 1 1 1 1 ...
$ reported_market_exposure: Factor w/ 6 levels ,contact with a positive case,..: 1 1 1 1 1 1 1 1 1 1 ...
$ additional_information : Factor w/ 26720 levels ,"Federal Areas",..: 21816 22911 21835 1 1 1 1 1 1 1 ...
$ chronic_disease_binary : Factor w/ 2 levels False,True: 1 1 1 1 1 1 1 1 1 1 ...
$ chronic_disease : Factor w/ 84 levels ,"thought to have had other pre-existing conditions",..: 1 1 1 1 1 1 1 1 1 1 ...
$ source : Factor w/ 12227 levels ,"@jmcapitanich : "Chaco tiene 33 casos confirmados de coronavirus"",..: 11707 10260 10184 1672 174 174 174 174 175 175 ...
$ sequence_available : Factor w/ 6 levels ,02.03.2020,10.03.2020,..: 1 1 1 1 1 1 1 1 1 1 ...
$ outcome : Factor w/ 35 levels ,Alive,Critical condition,..: 19 21 23 1 1 1 1 1 1 1 ...
$ date_death_or_discharge : Factor w/ 145 levels ,01.02.2020,01.03.2020,..: 1 102 77 1 1 1 1 1 1 1 ...
$ notes_for_discussion : Factor w/ 204 levels ,A Burundian who arrived in Rwanda on 15 March 2020 from Dubai. UAE in transit to Bujumbura Burundi was detected by airport medic"| truncated,..: 1 1 1 1 1 1 1 1 1 1 ...
$ location : Factor w/ 343 levels ,Abu Dhabi,Aegean Bay, Oriental Plaza,..: 272 327 1 1 1 1 1 1 1 1 ...
$ admin3 : Factor w/ 410 levels ,Baihe County,..: 1 1 1 1 1 1 1 1 1 1 ...
$ admin2 : Factor w/ 2086 levels ,Aa en Hunze,Aalsmeer,..: 1 1 1 2064 1434 2032 1646 647 1253 1253 ...
$ admin1 : Factor w/ 617 levels ,Aargau,Abruzzo,..: 226 577 1 219 247 247 247 247 247 247 ...
$ country_new : Factor w/ 146 levels ,Afghanistan,Albania,..: 31 67 121 31 31 31 31 31 31 31 ...
$ admin_id : num 8029 8954 200 10091 7060 ...
$ data_moderator_initials : Factor w/ 13 levels ,-,DSC,FS,FS, PA,..: 1 1 1 1 1 1 1 1 1 1 ...
$ travel_history_binary : Factor w/ 3 levels ,False,True: 1 1 1 1 1 1 1 1 1 1 ...

I apologize for confusing the two datasets.

Have a good day and upcoming weekend :- )

@eguidotti
Collaborator

Thanks for the update @SugarRayLua! Best, Emanuele


Are there any sources for shapefiles of the admin level 2 (county) boundaries?

@eguidotti
Collaborator

You can download shapefiles from GADM and merge them with this database via the key_gadm variable. A short description of key_gadm is provided in the online documentation. More details can be found in the paper: https://www.nature.com/articles/s41597-022-01245-1 Hope this helps!
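A sketch of that join with plain pandas (the shapefile attribute name GID_2 is an assumption for a level-2 GADM layer; in practice one would read the shapefile with geopandas, whose GeoDataFrame behaves like a DataFrame here):

```python
import pandas as pd

def merge_with_gadm(covid: pd.DataFrame, shapes: pd.DataFrame,
                    gadm_col: str = "GID_2") -> pd.DataFrame:
    """Attach shapefile attributes to COVID-19 Data Hub rows via the GADM key."""
    return covid.merge(shapes, left_on="key_gadm", right_on=gadm_col, how="left")

# Toy example with a hypothetical GADM identifier:
# covid  = pd.DataFrame({"key_gadm": ["USA.1.1_1"], "confirmed": [10]})
# shapes = pd.DataFrame({"GID_2": ["USA.1.1_1"]})
# merge_with_gadm(covid, shapes)
```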

@SugarRayLua

SugarRayLua commented Aug 23, 2022 via email


Azad77 commented Sep 4, 2022

Thank you for this great work. How can we download Iraqi COVID-19 data at the governorate level (level 2)?

@eguidotti
Collaborator

There are no data for Iraq at level 2 in the database. Do you know where to find them? I would be interested in adding them.

@doctorprog55

Thanks for the data; I use them in my research.
In the file https://storage.covid19datahub.io/level/2.csv.zip
the values in the "population" column for the Russian regions "Stavropol Krai", "Chechen Republic", "Tatarstan Republic", "Saint Petersburg", and "Moscow Oblast" are overestimated by a factor of 10.

@eguidotti
Collaborator

@doctorprog55 Thank you for your message! This is fixed now.
