# Usage: datasets
Here, we review the raw and cleaned datasets. The `Scenario` class performs data cleaning internally using the `JHUData` class and others, but it is important to review the features and data types before analysing them.

### Preparation
Prepare the packages.

In [None]:
# Standard users
# !pip install covsirphy

In [None]:
# Developers (Note: this notebook is in example directory)
import os
os.chdir("../")

In [None]:
from pprint import pprint

In [None]:
import covsirphy as cs
cs.__version__

### Dataset preparation
Download the datasets to "input" directory and load them.

If the "input" directory already contains the datasets, the `DataLoader` instance will load the local files. If the datasets have been updated on the remote servers, `DataLoader` will download them again and update the local files automatically before loading them.

In [None]:
# Create DataLoader instance
data_loader = cs.DataLoader("input")

In [None]:
# (Main) The number of cases (JHU style)
jhu_data = data_loader.jhu()
# (Main) Population in each country
population_data = data_loader.population()
# (Main) Government Response Tracker (OxCGRT)
oxcgrt_data = data_loader.oxcgrt()
# Linelist of case reports
linelist = data_loader.linelist()
# The number of tests
pcr_data = data_loader.pcr()
# The number of vaccinations
vaccine_data = data_loader.vaccine()
# Population pyramid
pyramid_data = data_loader.pyramid()
# Japan-specific dataset
japan_data = data_loader.japan()

### The number of cases (JHU style)
The main dataset is the number of cases, saved as `jhu_data`, an instance of the `JHUData` class. It includes "Confirmed", "Infected", "Recovered" and "Fatal". "Infected" is calculated as "Confirmed - Recovered - Fatal".

In [None]:
type(jhu_data)
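
The relationship between the four variables can be checked with a small pandas sketch. The records below are hypothetical values made up for illustration, not data from `jhu_data`; only the column names and the formula come from the text above.

```python
import pandas as pd

# Hypothetical cumulative records (illustration only, not real data)
records = pd.DataFrame(
    {
        "Confirmed": [100, 150, 210],
        "Recovered": [10, 40, 90],
        "Fatal": [1, 3, 5],
    },
    index=pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-03"]),
)
# Infected = Confirmed - Recovered - Fatal
records["Infected"] = records["Confirmed"] - records["Recovered"] - records["Fatal"]
print(records)
```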

The dataset is retrieved from [COVID-19 Data Hub](https://covid19datahub.io/) and [Data folder of CovsirPhy project](https://github.com/lisphilar/covid19-sir/tree/master/data). Descriptions of these projects are shown below.

In [None]:
# Description/citation
print(jhu_data.citation)

In [None]:
# Detailed citation list of COVID-19 Data Hub
# print(data_loader.covid19dh_citation)

In [None]:
# Raw data
jhu_data.raw.tail()

In [None]:
# Cleaned data
jhu_data.cleaned().tail()

In [None]:
jhu_data.cleaned().info()

We can calculate the total number of cases in all countries with the `JHUData.total()` method.

In [None]:
# Calculate total values
total_df = jhu_data.total()
total_df.tail()

In [None]:
# Plot the total values
cs.line_plot(total_df[["Infected", "Fatal", "Recovered"]], "Total number of cases over time")

In [None]:
# Statistics of rate values in all countries
total_df.loc[:, total_df.columns.str.contains("per")].describe().T

We can create a subset for a country using the `JHUData.records()` method.

In [None]:
# Subset for a country
df, _ = jhu_data.records("Japan")
df.tail()
# We can use ISO3 code etc.
# df, _ = jhu_data.records("JPN")
# df.tail()

A province ("prefecture" for Japan) name can also be specified.

In [None]:
df, _ = jhu_data.records("Japan", province="Tokyo")
df.tail()

In [None]:
# Countries we can select
pprint(jhu_data.countries(), compact=True)

`JHUData.records()` automatically complements the records if necessary and if `auto_complement=True` (default). Each country can have no complement, one complement or multiple complements, depending on the records and their preprocessing analysis.

We can show the specific kind of complements that were applied to the records of each country with `JHUData.show_complement()` method. The possible kinds of complement for each country are the following:  

1. "Monotonic_confirmed/fatal/recovered" (monotonic increasing complement)  
Force the variable to be monotonically increasing.

2. "Full_recovered" (full complement of recovered data)  
Estimate the number of recovered cases using the value of estimated average recovery period.

3. "Partial_recovered" (partial complement of recovered data)  
When recovered values are not updated for some days, extrapolate the values.
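
As a rough illustration of the first kind of complement, a cumulative series can be forced to be non-decreasing by taking its running maximum. This is a simplified stand-in written for this page, not necessarily the algorithm CovsirPhy applies internally; the function name `force_monotonic` and the sample values are hypothetical.

```python
import pandas as pd


def force_monotonic(series: pd.Series) -> pd.Series:
    """Replace each value with the running maximum so the series never decreases."""
    return series.cummax()


# Hypothetical cumulative counts with two reporting corrections (drops)
confirmed = pd.Series([10, 12, 11, 15, 14, 20])
fixed = force_monotonic(confirmed)
print(fixed.tolist())
```

A running maximum keeps every original increase and only overwrites values that fall below an earlier record, which is why it is a convenient first approximation.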

In [None]:
# Show the details of complement for all countries
jhu_data.show_complement().tail()

In [None]:
# For selected country
# jhu_data.show_complement(country="Japan")
# For selected province
# jhu_data.show_complement(country="Japan", province="Tokyo")
# For selected countries
# jhu_data.show_complement(country=["Greece", "Japan"])

Note for recovery period:  
With the global case records, we estimate the average recovery period using `JHUData.calculate_recovery_period()`.  

Currently, we calculate the difference between confirmed cases and fatal cases and try to match it to a recovered cases value at a future date. We apply this method to every country that has valid recovery data and average the partial recovery periods to obtain a single (average) recovery period. During the calculations, we ignore time intervals that lead to very short (<7 days) or very long (>90 days) partial recovery periods, if these occur with high frequency (>50%) in the records. To extract an approximation of the average recovery period, this analysis has to assume that the compartments are temporarily invariable.

As an alternative, we tried to use linelist data to get a precise value of the recovery period (the average of recovery date minus confirmation date for cases), but the number of records was too small.

In [None]:
recovery_period = jhu_data.calculate_recovery_period()
print(f"Average recovery period: {recovery_period} [days]")
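
The matching idea in the note above can be sketched for a single country as follows. This is a simplified illustration under stated assumptions, not the implementation of `JHUData.calculate_recovery_period()`: the function name `estimate_recovery_period` is hypothetical, the records are synthetic, "Recovered" is assumed to be non-decreasing with one row per day, and out-of-range periods are simply dropped rather than handled with the frequency rule described above.

```python
import numpy as np
import pandas as pd


def estimate_recovery_period(df, lower=7, upper=90):
    """Average number of days until Recovered reaches (Confirmed - Fatal).

    df needs one row per consecutive day with cumulative "Confirmed",
    "Fatal" and "Recovered" columns; "Recovered" must be non-decreasing.
    """
    target = (df["Confirmed"] - df["Fatal"]).to_numpy()
    recovered = df["Recovered"].to_numpy()
    # First row position where Recovered is at least the target value
    positions = np.searchsorted(recovered, target, side="left")
    periods = positions - np.arange(len(df))
    # Keep matches found inside the records and within [lower, upper] days
    valid = (positions < len(df)) & (periods >= lower) & (periods <= upper)
    if not valid.any():
        return None
    return float(periods[valid].mean())


# Synthetic records: every case recovers exactly 14 days after confirmation
days = pd.date_range("2020-03-01", periods=60, freq="D")
confirmed = np.arange(1, 61) * 10
recovered = np.concatenate([np.zeros(14, dtype=int), confirmed[:-14]])
records = pd.DataFrame(
    {"Confirmed": confirmed, "Fatal": 0, "Recovered": recovered}, index=days
)
period = estimate_recovery_period(records)
print(f"Estimated recovery period: {period} [days]")
```

For several countries, the same function could be applied to each country and the results averaged, which corresponds to averaging the partial recovery periods.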

### Linelist of case reports
The number of cases is important, but a linelist of case reports is helpful for understanding the situation in more depth. Linelist data was saved as `linelist`, an instance of the `LinelistData` class. This dataset is from [Open COVID-19 Data Working Group](https://github.com/beoutbreakprepared/nCoV2019).

In [None]:
type(linelist)

In [None]:
# Citation
print(linelist.citation)

In [None]:
# Raw dataset
linelist.raw.tail()

In [None]:
# Cleaned dataset
linelist.cleaned().tail()

In [None]:
# Subset for specified area
linelist.subset("Japan", province="Tokyo").tail()

In [None]:
# Subset for outcome ("Recovered" or "Fatal")
linelist.closed(outcome="Recovered").tail()

We can calculate the recovery period as the median value of the period from confirmation to recovery.

In [None]:
# Recovery period (integer) [days]
linelist.recovery_period()
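
The median calculation can be reproduced by hand with pandas. The case records and the column names `Confirmation_date` and `Recovered_date` below are hypothetical and may differ from the columns of the cleaned linelist; the sketch only shows the arithmetic.

```python
import pandas as pd

# Hypothetical closed cases (illustration only)
cases = pd.DataFrame(
    {
        "Confirmation_date": pd.to_datetime(
            ["2020-03-01", "2020-03-03", "2020-03-05", "2020-03-10", "2020-03-12"]
        ),
        "Recovered_date": pd.to_datetime(
            ["2020-03-15", "2020-03-20", "2020-03-18", "2020-04-01", "2020-03-30"]
        ),
    }
)
# Days from confirmation to recovery for each case
durations = (cases["Recovered_date"] - cases["Confirmation_date"]).dt.days
# Median as an integer number of days
median_period = int(durations.median())
print(f"Median recovery period: {median_period} [days]")
```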

### Population in each country
Population values are necessary to calculate the number of susceptible people. Susceptible is a variable of SIR-derived models. This dataset was saved as `population_data`, an instance of `PopulationData` class.

In [None]:
type(population_data)
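
To show why the population value matters, here is a minimal sketch that derives the number of susceptible people from a total population and cumulative confirmed cases. The numbers are hypothetical, and the sketch assumes that everyone who has not been confirmed is still susceptible.

```python
# Hypothetical values for illustration only
population = 1_000_000
confirmed = [100, 250, 600]

# Assumption: Susceptible = Population - Confirmed
susceptible = [population - value for value in confirmed]
print(susceptible)
```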

In [None]:
# Description/citation
print(population_data.citation)

In [None]:
# Raw data (the same as jhu_data)
# population_data.raw.tail()

In [None]:
# Cleaned data
population_data.cleaned().tail()

We can get the population values with `PopulationData.value()`.

In [None]:
# In a country
population_data.value("Japan", province=None)
# In a country with ISO3 code
# population_data.value("JPN", province=None)
# In a province (prefecture)
# population_data.value("Japan", province="Tokyo")

We can update the population values.

In [None]:
# Before
population_before = population_data.value("Japan", province="Tokyo")
print(f"Before: {population_before}")
# Register population value of Tokyo in Japan
# https://www.metro.tokyo.lg.jp/tosei/hodohappyo/press/2020/06/11/07.html
population_data.update(14_002_973, "Japan", province="Tokyo")
population_after = population_data.value("Japan", province="Tokyo")
print(f" After: {population_after}")

### Government Response Tracker (OxCGRT)
Government responses are tracked with [Oxford Covid-19 Government Response Tracker (OxCGRT)](https://github.com/OxCGRT/covid-policy-tracker). Because government responses and people's activities change the parameter values of SIR-derived models, this dataset is significant when we try to forecast the number of cases.  
With `DataLoader` class, the dataset was retrieved via [COVID-19 Data Hub](https://covid19datahub.io/) and saved as `oxcgrt_data`, an instance of `OxCGRTData` class.

In [None]:
type(oxcgrt_data)

In [None]:
# Description/citation
print(oxcgrt_data.citation)

In [None]:
# Raw data (the same as jhu_data)
# oxcgrt_data.raw.tail()

In [None]:
# Cleaned data
oxcgrt_data.cleaned().tail()

In [None]:
# Subset for a country
oxcgrt_data.subset("Japan").tail()
# We can use ISO3 codes
# oxcgrt_data.subset("JPN").tail()

### The number of tests
The number of tests is also key information to understand the situation.
This dataset was saved as `pcr_data`, an instance of `PCRData` class.

In [None]:
type(pcr_data)

In [None]:
# Description/citation
print(pcr_data.citation)

In [None]:
# Raw data (the same as jhu_data)
# pcr_data.raw.tail()

In [None]:
# Cleaned data
pcr_data.cleaned().tail()

In [None]:
# Subset for a country
pcr_data.subset("Japan").tail()
# We can use ISO3 codes
# pcr_data.subset("JPN").tail()

Under the assumption that all tests were PCR tests, we can calculate the positive rate of PCR tests as "the number of confirmed cases per the number of tests".

In [None]:
# Positive rate in Japan
_ = pcr_data.positive_rate("Japan")
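
The ratio can be sketched with pandas as follows. The records are hypothetical, and computing the rate from daily increments of the cumulative values is an assumption made for this illustration; `PCRData.positive_rate()` may apply additional processing.

```python
import pandas as pd

# Hypothetical cumulative records (illustration only)
records = pd.DataFrame(
    {"Tests": [1000, 2000, 4000], "Confirmed": [50, 120, 200]},
    index=pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-03"]),
)
# Daily new values from the cumulative values (the first day has no difference)
daily = records.diff().dropna()
# Positive rate [%] = daily confirmed cases per daily tests
daily["Positive_rate"] = daily["Confirmed"] * 100 / daily["Tests"]
print(daily)
```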

### The number of vaccinations
The number of vaccinations is a key factor to end the outbreak as soon as possible. This dataset was saved as `vaccine_data`, an instance of `VaccineData` class.

In [None]:
# The number of vaccinations
type(vaccine_data)

In [None]:
# Description/citation
print(vaccine_data.citation)

In [None]:
# Raw data
# vaccine_data.raw.tail()

In [None]:
# Cleaned data
vaccine_data.cleaned().tail()

In [None]:
# Registered countries
vaccine_data.countries()

In [None]:
# Subset for a country
vaccine_data.subset("United Kingdom").tail()
# We can use ISO3 codes
# vaccine_data.subset("GBR").tail()

### Population pyramid
With the population pyramid, we can divide the population into sub-groups. This is useful when we analyse the meaning of parameters. For example, the number of days on which people go out differs between the sub-groups.
This dataset was saved as `pyramid_data`, an instance of `PopulationPyramidData` class.

In [None]:
# Population pyramid
type(pyramid_data)

In [None]:
# Description/citation
print(pyramid_data.citation)

In [None]:
# The subset will be retrieved from the server when requested
pyramid_data.subset("Japan").tail()

### Japan-specific dataset
This includes the number of confirmed/infected/fatal/recovered/tests/moderate/severe cases at country/prefecture level and metadata of each prefecture.
This dataset was saved as `japan_data`, an instance of `JapanData` class.

In [None]:
# Japan-specific dataset
type(japan_data)

In [None]:
# Description/citation
print(japan_data.citation)

In [None]:
# Cleaned dataset
japan_data.cleaned().tail()

In [None]:
# Metadata
japan_data.meta().tail()