# How to use the Data Collector module

This notebook gives a simple demonstration of the Data Collector features.

In [1]:
import COVID19Py
import pandas as pd
from pydemic.data_collector import AvailableCountryData, CountryDataCollector
from pydemic.data_collector import export_updated_full_dataset_from_jhu, get_updated_full_dataset_from_jhu

from tqdm import tqdm

You can check which countries are available from the data collector. The list can be fetched over the internet, which provides up-to-date data, or read offline, in which case the data may lag several days behind.

In [2]:
available_countries = AvailableCountryData(use_internet_connection=True)

available_countries.list_of_available_country_names()

Processing available countries DataFrame: 100%|██████████| 253/253 [00:00<00:00, 702.83it/s]


['Afghanistan',
 'Albania',
 'Algeria',
 'Andorra',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Brazil',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Diamond Princess',
 'Cuba',
 'Cyprus',
 'Czechia',
 'Denmark',
 'Djibouti',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Guatemala',
 'Guinea',
 'Guyana',
 'Haiti',
 'Holy See',
 'Honduras',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iraq'

In [3]:
available_countries = AvailableCountryData(use_internet_connection=False)

available_countries.list_of_available_country_names()

Processing available countries DataFrame: 100%|██████████| 249/249 [00:00<00:00, 892.85it/s]


['Afghanistan',
 'Albania',
 'Algeria',
 'Andorra',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Brazil',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Diamond Princess',
 'Cuba',
 'Cyprus',
 'Czechia',
 'Denmark',
 'Djibouti',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Guatemala',
 'Guinea',
 'Guyana',
 'Haiti',
 'Holy See',
 'Honduras',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iraq'
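
The progress bars above report 253 rows processed online and 249 offline, so the two sources are not identical. To see which country names differ, the two lists can be compared as sets. This is a minimal sketch in plain Python: the helper `compare_country_lists` and the two short lists are illustrative stand-ins, not part of pydemic, and in practice the inputs would be the two results of `list_of_available_country_names()`.

```python
def compare_country_lists(online_names, offline_names):
    """Return (only_online, only_offline) as sorted lists of country names."""
    online, offline = set(online_names), set(offline_names)
    return sorted(online - offline), sorted(offline - online)


# Illustrative stand-ins for the two results of list_of_available_country_names()
online_names = ["Country A", "Country B", "Country C"]
offline_names = ["Country A", "Country B"]

only_online, only_offline = compare_country_lists(online_names, offline_names)
print("Only in the online list:", only_online)
print("Only in the offline copy:", only_offline)
```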

Likewise, you can create a data collector for a given country, using either the online resources or the offline copy:

In [4]:
brazil_data = CountryDataCollector("Brazil", use_online_resources=True)

Processing available countries DataFrame: 100%|██████████| 253/253 [00:00<00:00, 386.60it/s]
Processing available countries DataFrame: 100%|██████████| 253/253 [00:00<00:00, 749.40it/s]


In [5]:
brazil_data = CountryDataCollector("Brazil", use_online_resources=False)

Processing available countries DataFrame: 100%|██████████| 249/249 [00:00<00:00, 867.27it/s]
Processing available countries DataFrame: 100%|██████████| 249/249 [00:00<00:00, 1046.32it/s]
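
If a script should prefer online data but still run without a network, the two modes can be combined with a simple fallback. The sketch below assumes that the online path raises an `OSError` subclass (such as `ConnectionError`) when the network is unavailable; this notebook does not confirm how pydemic behaves in that case. The names `load_with_fallback` and `fake_loader` are hypothetical.

```python
def load_with_fallback(loader):
    """Try the online source first; fall back to the offline copy on OSError.

    `loader` is any callable taking a single boolean `online` argument.
    Returns a (result, used_online) tuple.
    """
    try:
        return loader(True), True
    except OSError:  # ConnectionError and TimeoutError are OSError subclasses
        return loader(False), False


def fake_loader(online):
    # Hypothetical stand-in that simulates a machine without network access.
    if online:
        raise ConnectionError("network unreachable")
    return "offline data"


data, used_online = load_with_fallback(fake_loader)
print(data, used_online)
```

With the collector above, the call might look like `load_with_fallback(lambda online: CountryDataCollector("Brazil", use_online_resources=online))`, provided the constructor does raise such an error when offline.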


Then, you can simply retrieve a `pandas.DataFrame` with cases and deaths time series for that country:

In [6]:
brazil_data.get_time_series_data_frame



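| | date | confirmed | deaths |
|---|---|---|---|
| 0 | 2020-01-22 | 0 | 0 |
| 1 | 2020-01-23 | 0 | 0 |
| 2 | 2020-01-24 | 0 | 0 |
| 3 | 2020-01-25 | 0 | 0 |
| 4 | 2020-01-26 | 0 | 0 |
| ... | ... | ... | ... |
| 61 | 2020-03-23 | 1924 | 34 |
| 62 | 2020-03-24 | 2247 | 46 |
| 63 | 2020-03-25 | 2554 | 59 |
| 64 | 2020-03-26 | 2985 | 77 |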
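
A common next step is to derive daily new cases from this time series. The sketch below assumes that `confirmed` and `deaths` are cumulative counts, as the steadily increasing values suggest. It rebuilds the last four rows shown above by hand so that it runs without pydemic; the `new_confirmed` and `new_deaths` column names are illustrative.

```python
import pandas as pd

# Last four rows of the Brazil time series shown above, rebuilt by hand.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-03-23", "2020-03-24", "2020-03-25", "2020-03-26"]
        ),
        "confirmed": [1924, 2247, 2554, 2985],
        "deaths": [34, 46, 59, 77],
    }
)

# Differences between consecutive days; the first row has no predecessor
# in this excerpt, so its difference is NaN.
df = df.sort_values("date").reset_index(drop=True)
df["new_confirmed"] = df["confirmed"].diff()
df["new_deaths"] = df["deaths"].diff()

print(df)
```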


Finally, you can get the full, up-to-date time series dataset from JHU, thanks to the excellent [COVID19Py](https://github.com/Kamaropoulos/covid19py) package:

In [7]:
df_full_data = get_updated_full_dataset_from_jhu()

Processing available countries DataFrame: 100%|██████████| 253/253 [00:00<00:00, 977.82it/s]
Processing country code XK: 100%|██████████| 175/175 [03:12<00:00,  1.10s/it]


In [8]:
df_full_data

| | country | province | day | date | confirmed | deaths |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | | 0 | 2020-01-22 | 0 | 0 |
| 1 | Afghanistan | | 1 | 2020-01-23 | 0 | 0 |
| 2 | Afghanistan | | 2 | 2020-01-24 | 0 | 0 |
| 3 | Afghanistan | | 3 | 2020-01-25 | 0 | 0 |
| 4 | Afghanistan | | 4 | 2020-01-26 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 16946 | Zimbabwe | | 62 | 2020-03-24 | 3 | 1 |
| 16947 | Zimbabwe | | 63 | 2020-03-25 | 3 | 1 |
| 16948 | Zimbabwe | | 64 | 2020-03-26 | 3 | 1 |
| 16949 | Zimbabwe | | 65 | 2020-03-27 | 5 | 1 |
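
Because the full dataset has a `province` column, a country may span several rows per date. One way to get per-country totals is to group by country and date and sum the counts. The sketch below uses a small hand-built frame with the same columns: the two Zimbabwe rows are taken from the output above, while the Australia rows and their province split are invented for illustration.

```python
import pandas as pd

# Small sample with the same columns as df_full_data.
# Zimbabwe rows come from the output above; Australia rows are invented.
df_sample = pd.DataFrame(
    {
        "country": ["Australia"] * 4 + ["Zimbabwe"] * 2,
        "province": ["New South Wales", "Victoria"] * 2 + [None, None],
        "day": [64, 64, 65, 65, 64, 65],
        "date": ["2020-03-26", "2020-03-26", "2020-03-27", "2020-03-27",
                 "2020-03-26", "2020-03-27"],
        "confirmed": [10, 5, 12, 7, 3, 5],
        "deaths": [1, 0, 1, 0, 1, 1],
    }
)

# Sum over provinces to get one row per country and date.
# Selecting columns with a list ([[...]]) is the supported pandas form.
totals = (
    df_sample.groupby(["country", "date"], as_index=False)[["confirmed", "deaths"]]
    .sum()
)

print(totals)
```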


It is also possible to export the same dataset to a file:

In [9]:
export_updated_full_dataset_from_jhu()

Processing available countries DataFrame: 100%|██████████| 253/253 [00:00<00:00, 708.36it/s]
Processing country code XK: 100%|██████████| 175/175 [03:13<00:00,  1.10s/it]
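
The name and format of the file written by `export_updated_full_dataset_from_jhu()` are not shown in this notebook. As a rough sketch, assuming a CSV export, the round trip below writes a small sample frame with plain pandas and reads it back. Passing `index=False` avoids an extra `Unnamed: 0` column on reload, and `parse_dates` restores the `date` column as datetimes. The file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the full dataset output shown earlier.
df = pd.DataFrame(
    {
        "country": ["Zimbabwe", "Zimbabwe"],
        "date": pd.to_datetime(["2020-03-26", "2020-03-27"]),
        "confirmed": [3, 5],
        "deaths": [1, 1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this example.
    path = os.path.join(tmp_dir, "full_dataset_sample.csv")
    df.to_csv(path, index=False)  # do not write the row index
    df_reloaded = pd.read_csv(path, parse_dates=["date"])

print(df_reloaded.dtypes)
```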
