# Predictor submission


The COVID-19 crisis is proving to be one of the world’s most critical challenges — a challenge bigger than any one government or organization can tackle on its own. Right now, countries around the world are not equipped to implement health and safety interventions and policies that effectively protect both their citizens and economies.
 
In order to fight this pandemic, we need access to localized, data-driven planning systems and the latest in artificial intelligence (AI) to help decision-makers develop and implement robust Intervention Plans (IPs) that successfully reduce infection cases and minimize economic impact.

**Intervention Plan (IP)**: A plan of action or schedule for setting and resetting various intervention policies at various strengths or stringency.

**Predictor Model**: Given a time sequence of IPs in effect, and other data such as a time sequence of the number of cases, a predictor model estimates the number of cases in the future.

## Intervention Plan

An intervention plan consists of a set of [containment and closure policies](https://github.com/OxCGRT/covid-policy-tracker/blob/master/documentation/codebook.md#containment-and-closure-policies), as well as [health system policies](https://github.com/OxCGRT/covid-policy-tracker/blob/master/documentation/codebook.md#health-system-policies). Check out the links to understand what these policies correspond to and how they are coded.

For instance, the **C1_School closing** policy, which records closings of schools and universities, is coded as follows:

| Code      | Meaning     |
| :-------- | :---------- |
|  0        | no measures |
|  1        | recommend closing |
|  2        | require closing (only some levels or categories, e.g. just high school, or just public schools) |
|  3        | require closing all levels |
| Blank     | no data |
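
The coding above can be turned into a small lookup. The following is a minimal sketch; the names `C1_SCHOOL_CLOSING_LABELS` and `decode_c1` are hypothetical and only illustrate how a numeric code, or a blank, maps back to its meaning.

```python
import math
from typing import Optional

# Code-to-meaning mapping copied from the table above
C1_SCHOOL_CLOSING_LABELS = {
    0: "no measures",
    1: "recommend closing",
    2: "require closing (only some levels or categories)",
    3: "require closing all levels",
}


def decode_c1(code: Optional[float]) -> str:
    """Return the label for one C1 code; None or NaN (a blank cell) means 'no data'."""
    if code is None or math.isnan(code):
        return "no data"
    return C1_SCHOOL_CLOSING_LABELS[int(code)]


# Codes are stored as floats in the data set, so 3.0 is looked up as 3
print([decode_c1(c) for c in (0.0, 3.0, float("nan"))])
```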

Intervention plans are recorded daily for each country and sometimes for regions. For this competition, the following policies are considered:

In [1]:
IP_COLUMNS = ['C1_School closing',
              'C2_Workplace closing',
              'C3_Cancel public events',
              'C4_Restrictions on gatherings',
              'C5_Close public transport',
              'C6_Stay at home requirements',
              'C7_Restrictions on internal movement',
              'C8_International travel controls',
              'H1_Public information campaigns',
              'H2_Testing policy',
              'H3_Contact tracing']

## Data
The University of Oxford's Blavatnik School of Government is [tracking coronavirus government responses](https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker). It has assembled a [data set](https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv) containing historical data since January 1st, 2020 for the number of cases and IPs for most countries in the world.

In [9]:
import pandas as pd

In [10]:
DATA_URL = 'https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv'
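df = pd.read_csv(DATA_URL,
                 parse_dates=['Date'],
                 encoding="ISO-8859-1",
                 # Skip malformed rows; on_bad_lines replaces error_bad_lines,
                 # which was deprecated in pandas 1.3 and removed in pandas 2.0
                 on_bad_lines="skip")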

In [11]:
df.sample(3)

Unnamed: 0,CountryName,CountryCode,RegionName,RegionCode,Date,C1_School closing,C1_Flag,C2_Workplace closing,C2_Flag,C3_Cancel public events,...,StringencyIndex,StringencyIndexForDisplay,StringencyLegacyIndex,StringencyLegacyIndexForDisplay,GovernmentResponseIndex,GovernmentResponseIndexForDisplay,ContainmentHealthIndex,ContainmentHealthIndexForDisplay,EconomicSupportIndex,EconomicSupportIndexForDisplay
57477,United States,USA,Vermont,US_VT,2020-02-16,0.0,,0.0,,0.0,...,11.11,11.11,16.67,16.67,14.1,14.1,16.67,16.67,0.0,0.0
36932,Rwanda,RWA,,,2020-09-04,,,,,,...,,70.37,,71.43,,71.79,,75.76,,50.0
53640,United States,USA,New Mexico,US_NM,2020-01-05,0.0,,0.0,,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Listing the number of cases and IPs

In [22]:
CASES_COLUMNS = ["CountryName", "RegionName", "Date", "ConfirmedCases"]

In [23]:
df[CASES_COLUMNS + IP_COLUMNS].sample(3)

Unnamed: 0,CountryName,RegionName,Date,ConfirmedCases,C1_School closing,C2_Workplace closing,C3_Cancel public events,C4_Restrictions on gatherings,C5_Close public transport,C6_Stay at home requirements,C7_Restrictions on internal movement,C8_International travel controls,H1_Public information campaigns,H2_Testing policy,H3_Contact tracing
38617,El Salvador,,2020-06-10,3191.0,3.0,2.0,2.0,4.0,2.0,3.0,2.0,4.0,2.0,3.0,1.0
43213,Timor-Leste,,2020-07-22,24.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0
55800,United States,South Carolina,2020-05-20,9175.0,3.0,2.0,2.0,4.0,1.0,1.0,1.0,3.0,2.0,2.0,2.0


### Computing the daily change in cases
The daily change in confirmed cases can be computed like this:

In [24]:
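# Country-level rows have a blank RegionName, and groupby drops NaN keys by default,
# so fill the blanks to keep those rows in their own group
df["DailyChangeConfirmedCases"] = df.groupby(["CountryName", df["RegionName"].fillna("")]).ConfirmedCases.diff().fillna(0)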
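
One detail worth checking when computing this difference is how country-level rows are grouped. Their `RegionName` is blank (NaN), and pandas `groupby` leaves out rows whose key is NaN unless told otherwise. The sketch below uses a small made-up frame (the numbers are illustrative, not real data) and fills blank region names with an empty string so that every row belongs to a group.

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts: one country-level series and one regional series
toy = pd.DataFrame({
    "CountryName": ["Aruba", "Aruba", "Aruba", "United States", "United States"],
    "RegionName": [np.nan, np.nan, np.nan, "California", "California"],
    "ConfirmedCases": [10.0, 12.0, 17.0, 100.0, 130.0],
})

# Group on the country plus a NaN-free copy of the region name,
# then take the day-over-day difference within each group
region_key = toy["RegionName"].fillna("")
toy["DailyChangeConfirmedCases"] = (
    toy.groupby(["CountryName", region_key])
       .ConfirmedCases.diff()
       .fillna(0)  # the first day of each group has no previous day
)
print(toy)
```

With this grouping the Aruba rows get daily changes of 0, 2 and 5 rather than being treated as ungrouped.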

### Listing the latest historical daily new cases for a given country and region
For instance, for country **United States**, region **California**, the latest available changes in confirmed cases are:

In [26]:
country = "United States"
region = "California"
country_region_df = df[(df.CountryName == country) & (df.RegionName == region)]
country_region_df[["CountryName", "RegionName", "Date", "ConfirmedCases", "DailyChangeConfirmedCases"]].tail(7)

Unnamed: 0,CountryName,RegionName,Date,ConfirmedCases,DailyChangeConfirmedCases
46798,United States,California,2020-09-03,726018.0,4737.0
46799,United States,California,2020-09-04,730662.0,4644.0
46800,United States,California,2020-09-05,735314.0,4652.0
46801,United States,California,2020-09-06,738856.0,3542.0
46802,United States,California,2020-09-07,740965.0,2109.0
46803,United States,California,2020-09-08,744344.0,3379.0
46804,United States,California,2020-09-09,,0.0


## Predictor input
The goal of a predictor is to predict the expected number of daily cases for countries and regions for a list of days, assuming the given daily IPs are in place:

In [28]:
EXAMPLE_INPUT_FILE = "data/input/20200801_20200804_npis.csv"
prediction_input_df = pd.read_csv(EXAMPLE_INPUT_FILE,
                                  parse_dates=['Date'],
                                  encoding="ISO-8859-1")
prediction_input_df.head()

Unnamed: 0,CountryName,RegionName,Date,C1_School closing,C2_Workplace closing,C3_Cancel public events,C4_Restrictions on gatherings,C5_Close public transport,C6_Stay at home requirements,C7_Restrictions on internal movement,C8_International travel controls,H1_Public information campaigns,H2_Testing policy,H3_Contact tracing
0,Aruba,,2020-08-01,0.0,1.0,0.0,0.0,0.0,1.0,1.0,3.0,2.0,2.0,1.0
1,Aruba,,2020-08-02,0.0,1.0,0.0,0.0,0.0,1.0,1.0,3.0,2.0,2.0,1.0
2,Aruba,,2020-08-03,0.0,1.0,0.0,0.0,0.0,1.0,1.0,3.0,2.0,2.0,1.0
3,Aruba,,2020-08-04,0.0,1.0,0.0,4.0,0.0,1.0,1.0,3.0,2.0,2.0,1.0
4,Afghanistan,,2020-08-01,3.0,3.0,2.0,4.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0
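
Before predicting, it can help to sanity-check the input file. The sketch below is one possible check, not part of the competition tooling: `find_input_problems` is a hypothetical helper, and the rule that every country or region should have one row per requested day is an assumption about the input, not something the file format guarantees.

```python
import pandas as pd


def find_input_problems(ip_df, start_date, end_date, ip_columns):
    """Return a list of human-readable problems found in an IP input frame."""
    required = ["CountryName", "RegionName", "Date"] + list(ip_columns)
    missing = [c for c in required if c not in ip_df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    start, end = pd.Timestamp(start_date), pd.Timestamp(end_date)
    expected_days = len(pd.date_range(start, end))
    problems = []
    # Blank RegionName marks a country-level row; fill it so groupby keeps those rows
    keys = [ip_df["CountryName"], ip_df["RegionName"].fillna("")]
    for (country, region), group in ip_df.groupby(keys):
        in_range = (group["Date"] >= start) & (group["Date"] <= end)
        n_days = group.loc[in_range, "Date"].nunique()
        if n_days != expected_days:
            problems.append(
                f"{country}/{region or '-'}: {n_days} of {expected_days} days present")
    return problems


# Made-up example: Afghanistan is missing 2020-08-04
demo = pd.DataFrame({
    "CountryName": ["Aruba"] * 4 + ["Afghanistan"] * 3,
    "RegionName": [None] * 7,
    "Date": pd.to_datetime(["2020-08-01", "2020-08-02", "2020-08-03", "2020-08-04",
                            "2020-08-01", "2020-08-02", "2020-08-03"]),
    "C1_School closing": [0.0] * 4 + [3.0] * 3,
})
problems = find_input_problems(demo, "2020-08-01", "2020-08-04", ["C1_School closing"])
print(problems)
```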


## Predictor expected output
The output produced by the predictor should look like this:

In [29]:
EXAMPLE_OUTPUT_FILE = "data/output/20200801_20200804_predictions.csv"
prediction_output_df = pd.read_csv(EXAMPLE_OUTPUT_FILE,
                                   parse_dates=['Date'],
                                   encoding="ISO-8859-1")
prediction_output_df.head()

Unnamed: 0,CountryName,RegionName,Date,PredictedDailyNewCases
0,Aruba,,2020-08-01,0.647746
1,Aruba,,2020-08-02,0.792283
2,Aruba,,2020-08-03,0.0
3,Aruba,,2020-08-04,0.0
4,Afghanistan,,2020-08-01,72.84468
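
A predictions file can be checked against this layout before it is submitted. The following is a minimal sketch: `find_output_problems` is a hypothetical helper, and the rules it applies (no missing values, no negative case counts, one row per country, region and date) are assumptions about what a sensible output looks like rather than documented requirements.

```python
import pandas as pd

# Column names taken from the example output above
OUTPUT_COLUMNS = ["CountryName", "RegionName", "Date", "PredictedDailyNewCases"]


def find_output_problems(pred_df):
    """Return a list of human-readable problems found in a predictions frame."""
    missing = [c for c in OUTPUT_COLUMNS if c not in pred_df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    preds = pred_df["PredictedDailyNewCases"]
    if preds.isna().any():
        problems.append("some predictions are missing")
    if (preds < 0).any():
        problems.append("some predictions are negative")
    # Treat a blank RegionName as an ordinary key when looking for duplicates
    keyed = pred_df.assign(RegionName=pred_df["RegionName"].fillna(""))
    if keyed.duplicated(["CountryName", "RegionName", "Date"]).any():
        problems.append("some country/region/date rows are duplicated")
    return problems


# Made-up predictions with one negative value
demo_preds = pd.DataFrame({
    "CountryName": ["Aruba", "Aruba"],
    "RegionName": [None, None],
    "Date": pd.to_datetime(["2020-08-01", "2020-08-02"]),
    "PredictedDailyNewCases": [0.6, -1.0],
})
output_problems = find_output_problems(demo_preds)
print(output_problems)
```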


## Train a model
Train a predictor model that can produce the output file given the input file.

Additional data sources may be used.

Predictors do not have to predict for all regions: they can skip regions and return rows in the CSV file only for the regions they want to predict. Note that a predictor submission can consist of multiple models, for example models specializing in different regions, that are accessed through the same call. A predictor must return a prediction in less than 30 seconds per region.
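
The three points above (skipping regions, several models behind one call, and the per-region time limit) can be sketched together. Everything in this example is hypothetical: the `run_models` name, the `(CountryName, RegionName)` key with an empty string for country-level entries, and the idea that a model is a function from a number of days to a list of predictions. The timing check only reports an overrun after the fact; it does not interrupt a slow model.

```python
import time
from typing import Callable, Dict, List, Tuple

GeoKey = Tuple[str, str]              # (CountryName, RegionName), "" for country level
Model = Callable[[int], List[float]]  # hypothetical: n_days -> daily predictions


def run_models(geos: List[GeoKey], models: Dict[GeoKey, Model],
               n_days: int, time_limit_s: float = 30.0) -> Dict[GeoKey, List[float]]:
    """Call the model registered for each geo, skipping geos that have none."""
    predictions: Dict[GeoKey, List[float]] = {}
    for geo in geos:
        model = models.get(geo)
        if model is None:
            continue  # no row is produced for this region
        started = time.perf_counter()
        preds = model(n_days)
        elapsed = time.perf_counter() - started
        if elapsed >= time_limit_s:
            raise TimeoutError(f"{geo} took {elapsed:.1f}s, limit is {time_limit_s}s")
        predictions[geo] = preds
    return predictions


# Two made-up specialist models; Afghanistan has no model and is left out
models = {
    ("United States", "California"): lambda n: [4000.0] * n,
    ("Aruba", ""): lambda n: [1.0] * n,
}
geos = [("Aruba", ""), ("Afghanistan", ""), ("United States", "California")]
result = run_models(geos, models, n_days=4)
print(sorted(result))
```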


In [6]:
# WRITE YOUR CODE HERE

In [33]:
def predict(start_date: str, end_date: str, path_to_input_file: str):
    # WRITE YOUR CODE HERE
    # Save output to path_to_output_file
    # TODO explain convention for path_to_output_file
    pass
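
To make the expected input and output flow concrete, here is a placeholder predictor. It is only a sketch of the file handling, not a model: it writes the same constant for every requested row. The name `predict_constant` is hypothetical, and the output location is passed as an explicit `path_to_output_file` argument, which is an assumption made for this example because the naming convention is not spelled out above.

```python
import os
import tempfile

import pandas as pd


def predict_constant(start_date: str, end_date: str,
                     path_to_input_file: str, path_to_output_file: str,
                     constant: float = 0.0) -> pd.DataFrame:
    """Placeholder predictor: writes the same value for every requested row."""
    ip_df = pd.read_csv(path_to_input_file, parse_dates=["Date"], encoding="ISO-8859-1")
    start, end = pd.Timestamp(start_date), pd.Timestamp(end_date)
    in_range = (ip_df["Date"] >= start) & (ip_df["Date"] <= end)
    out_df = ip_df.loc[in_range, ["CountryName", "RegionName", "Date"]].copy()
    out_df["PredictedDailyNewCases"] = constant  # replace with a trained model's output
    out_dir = os.path.dirname(path_to_output_file)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    out_df.to_csv(path_to_output_file, index=False)
    return out_df


# Tiny made-up input file, written to a temporary directory for the demo
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "npis.csv")
    out_path = os.path.join(tmp, "output", "predictions.csv")
    pd.DataFrame({
        "CountryName": ["Aruba"] * 3,
        "RegionName": [None] * 3,
        "Date": ["2020-08-01", "2020-08-02", "2020-08-03"],
        "C1_School closing": [0.0, 0.0, 0.0],
    }).to_csv(in_path, index=False)
    predict_constant("2020-08-01", "2020-08-02", in_path, out_path)
    written_df = pd.read_csv(out_path, parse_dates=["Date"])

print(written_df)
```

Only the two days inside the requested range are written, even though the input file covers three.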

## Make predictions

In [32]:
start_date = "2020-08-01"
end_date = "2020-08-04"
predict(start_date, end_date, EXAMPLE_INPUT_FILE)

## Display predictions

In [35]:
# convention for path_to_output_file
EXAMPLE_OUTPUT_FILE = "data/output/20200801_20200804_predictions.csv"
prediction_output_df = pd.read_csv(EXAMPLE_OUTPUT_FILE,
                                   parse_dates=['Date'],
                                   encoding="ISO-8859-1")
prediction_output_df.head()

Unnamed: 0,CountryName,RegionName,Date,PredictedDailyNewCases
0,Aruba,,2020-08-01,0.647746
1,Aruba,,2020-08-02,0.792283
2,Aruba,,2020-08-03,0.0
3,Aruba,,2020-08-04,0.0
4,Afghanistan,,2020-08-01,72.84468


## Quantitative evaluation