# Example Predictor: Linear Rollout Predictor

This example contains basic functionality for training and evaluating a linear predictor that rolls out predictions day-by-day.

First, a training data set is created from historical case and npi data.

Second, a linear model is trained to predict future cases from prior case data along with prior and future npi data.
The model is an off-the-shelf sklearn Lasso model, that uses a positive weight constraint to enforce the assumption that increased npis has a negative correlation with future cases.

Third, a sample evaluation set is created, and the predictor is applied to this evaluation set to produce prediction results in the correct format.

## Training

In [1]:
import pickle
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

### Copy the data locally

In [2]:
from AlphanumericsTeam.data.util import get_aug_oxford_df, filter_df_regions

# Has 6 additional columns 
# 'New Cases' 
# 'GeoID' 
# 'Holidays' 
# 'pop_2020' 
# 'area_km2' 
# 'density_perkm2'
df = get_aug_oxford_df() 
df = filter_df_regions(df)
assert df.CountryName.unique().size == 180
assert df.RegionName.unique().size == 56 + 1 


In [3]:
# For testing, restrict training data to that before a hypothetical predictor submission date
HYPOTHETICAL_SUBMISSION_DATE = np.datetime64("today")
TRAINING_START_DATE = np.datetime64("2020-04-01" )
df = df[(TRAINING_START_DATE<=df.Date) & (df.Date<= HYPOTHETICAL_SUBMISSION_DATE)]
df

Unnamed: 0,CountryName,CountryCode,RegionName,RegionCode,Jurisdiction,Date,C1_School closing,C1_Flag,C2_Workplace closing,C2_Flag,...,ContainmentHealthIndex,ContainmentHealthIndexForDisplay,EconomicSupportIndex,EconomicSupportIndexForDisplay,NewCases,GeoID,Holidays,pop_2020,area_km2,density_perkm2
92,Aruba,ABW,,,NAT_TOTAL,2020-04-01,3.0,1.0,3.0,1.0,...,65.38,65.38,87.5,87.5,0.0,Aruba__,0.0,106766.0,180.0,593.14
93,Aruba,ABW,,,NAT_TOTAL,2020-04-02,3.0,1.0,3.0,1.0,...,65.38,65.38,87.5,87.5,5.0,Aruba__,0.0,106766.0,180.0,593.14
94,Aruba,ABW,,,NAT_TOTAL,2020-04-03,3.0,1.0,3.0,1.0,...,65.38,65.38,87.5,87.5,2.0,Aruba__,0.0,106766.0,180.0,593.14
95,Aruba,ABW,,,NAT_TOTAL,2020-04-04,3.0,1.0,3.0,1.0,...,65.38,65.38,87.5,87.5,2.0,Aruba__,1.0,106766.0,180.0,593.14
96,Aruba,ABW,,,NAT_TOTAL,2020-04-05,3.0,1.0,3.0,1.0,...,65.38,65.38,87.5,87.5,0.0,Aruba__,1.0,106766.0,180.0,593.14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183263,Zimbabwe,ZWE,,,NAT_TOTAL,2020-12-14,,,,,...,,60.26,,25.0,112.0,Zimbabwe__,0.0,14862924.0,390757.0,38
183264,Zimbabwe,ZWE,,,NAT_TOTAL,2020-12-15,,,,,...,,60.26,,25.0,164.0,Zimbabwe__,0.0,14862924.0,390757.0,38
183265,Zimbabwe,ZWE,,,NAT_TOTAL,2020-12-16,,,,,...,,60.26,,25.0,227.0,Zimbabwe__,0.0,14862924.0,390757.0,38
183266,Zimbabwe,ZWE,,,NAT_TOTAL,2020-12-17,,,,,...,,60.26,,25.0,0.0,Zimbabwe__,0.0,14862924.0,390757.0,38


In [4]:
# Keep only columns of interest
id_cols = ['CountryName',
           'RegionName',
           'GeoID',
           'Date']
cases_col = ['NewCases']
npi_cols = ['C1_School closing',
            'C2_Workplace closing',
            'C3_Cancel public events',
            'C4_Restrictions on gatherings',
            'C5_Close public transport',
            'C6_Stay at home requirements',
            'C7_Restrictions on internal movement',
            'C8_International travel controls',
            'H1_Public information campaigns',
            'H2_Testing policy',
            'H3_Contact tracing',
            'H6_Facial Coverings']
new_features = ['Holidays'
                ]
df = df[id_cols + cases_col + npi_cols + new_features] 

In [5]:
df[df["NewCases"]<=0] = 1
df["LogNewCases"] = np.log(df["NewCases"])
df["LogNewCases"]

92        0.000000
93        1.609438
94        0.693147
95        0.693147
96        0.000000
            ...   
183263    4.718499
183264    5.099866
183265    5.424950
183266    0.000000
183267    0.000000
Name: LogNewCases, Length: 124372, dtype: float64

In [6]:
# Fill any missing case values by interpolation and setting NaNs to 0
df.update(df.groupby('GeoID').NewCases.apply(
    lambda group: group.interpolate()).fillna(0))

# Fill any missing NPIs by assuming they are the same as previous day
for npi_col in npi_cols:
    df.update(df.groupby('GeoID')[npi_col].ffill().fillna(0))

In [7]:
df.columns

Index(['CountryName', 'RegionName', 'GeoID', 'Date', 'NewCases',
       'C1_School closing', 'C2_Workplace closing', 'C3_Cancel public events',
       'C4_Restrictions on gatherings', 'C5_Close public transport',
       'C6_Stay at home requirements', 'C7_Restrictions on internal movement',
       'C8_International travel controls', 'H1_Public information campaigns',
       'H2_Testing policy', 'H3_Contact tracing', 'H6_Facial Coverings',
       'Holidays', 'LogNewCases'],
      dtype='object')

In [8]:
from AlphanumericsTeam.predictors.tools.exp_fit import get_exp_fit
nb_lookback_days = 3
gdf = df[df.GeoID == "United States__Alabama"] 
all_case_data = np.squeeze(np.array(gdf[cases_col]) )
print( all_case_data.shape, np.min(all_case_data), np.max(all_case_data))

Rt, A, Lambda ,ExpFit  = get_exp_fit(all_case_data, nb_lookback_days, 1)
#print(gdf)
print(Rt[10:20], np.min(Rt), np.max(Rt))
print(A[10:20], np.min(A), np.max(A))
print(Lambda[10:20], np.min(Lambda), np.max(Lambda))
print(ExpFit[10:20], np.min(ExpFit), np.max(ExpFit))

(524,) 39.0 5348.0
[0.71856793 0.82959948 1.42535805 0.93140827 0.72987004 1.14470294
 0.96123702 0.58810259 1.20914239 1.43708642] 0.16289259247507343 7.161148740394319
[131.28692325 160.02489866 463.41910882 215.12249345 118.98443104
 274.90134225 178.30872225  54.28523318 218.93096483 328.86443528] 1.0 27947.0704608495
[-0.33049503 -0.18681225  0.35442305 -0.07105757 -0.31488878  0.13514516
 -0.03953426 -0.53085387  0.18991134  0.36261775] -1.8146642372437034 1.9686704067062166
[ 94.33857261 132.75657242 660.53815711 200.3668697   86.84317185
 314.68037549 171.39694478  31.92528648 264.7187098  472.60661542] 1.0 200133.1284284237


In [9]:

# Set number of past days to use to make predictions
nb_lookback_days = 7

# Create training data across all countries for predicting one day ahead
X_cols = cases_col + npi_cols
y_col = cases_col
X_samples = []
y_samples = []
geo_ids = df.GeoID.unique()
for g in geo_ids:
    gdf = df[df.GeoID == g]
    all_case_data = np.array(gdf[cases_col])
    all_npi_data = np.array(gdf[npi_cols])
    all_feat_data = np.array(gdf[new_features])

    # Create one sample for each day where we have enough data
    # Each sample consists of cases and npis for previous nb_lookback_days
    nb_total_days = len(gdf)
    for d in range(nb_lookback_days, nb_total_days - 1):
        X_cases = all_case_data[d-nb_lookback_days:d]

        # Take negative of npis to support positive
        # weight constraint in Lasso.
        X_npis = -all_npi_data[d - nb_lookback_days:d]

        X_feats = all_feat_data[d - nb_lookback_days:d]

        # Flatten all input data so it fits Lasso input format.
        X_sample = np.concatenate([X_cases.flatten(),
                                   X_npis.flatten(),
                                   X_feats.flatten()])
                                   
        y_sample = all_case_data[d]
        X_samples.append(X_sample)
        y_samples.append(y_sample)

X_samples = np.array(X_samples)
y_samples = np.array(y_samples).flatten()

In [10]:
print(X_samples.shape)
print(y_samples.shape)

(122482, 98)
(122482,)


In [11]:
# Helpful function to compute mae
def mae(pred, true):
    return np.mean(np.abs(pred - true))

In [12]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_samples,
                                                    y_samples,
                                                    test_size=0.2,
                                                    random_state=301)

In [13]:
# Create and train Lasso model.
# Set positive=True to enforce assumption that cases are positively correlated
# with future cases and npis are negatively correlated.
model = Lasso(alpha=0.1,
              precompute=True,
              max_iter=10000,
              positive=True,
              selection='random')
# Fit model
model.fit(X_train, y_train)

Lasso(alpha=0.1, max_iter=10000, positive=True, precompute=True,
      selection='random')

In [14]:
# Evaluate model
train_preds = model.predict(X_train)
train_preds = np.maximum(train_preds, 0) # Don't predict negative cases
print('Train MAE:', mae(train_preds, y_train))

test_preds = model.predict(X_test)
test_preds = np.maximum(test_preds, 0) # Don't predict negative cases
print('Test MAE:', mae(test_preds, y_test))

Train MAE: 377.3451003985817
Test MAE: 384.88543678674023


In [23]:
# Inspect the learned feature coefficients for the model
# to see what features it's paying attention to.

# Give names to the features
x_col_names = []
for d in range(-nb_lookback_days, 0):
    x_col_names.append('Day ' + str(d) + ' ' + cases_col[0])
for d in range(-nb_lookback_days, 1):
    for col_name in npi_cols:
        x_col_names.append('Day ' + str(d) + ' ' + col_name)
for d in range(-nb_lookback_days, 1):
    for col_name in new_features:
        x_col_names.append('Day ' + str(d) + ' ' + col_name)

# View non-zero coefficients
for (col, coeff) in zip(x_col_names, list(model.coef_)):
    if coeff != 0.:
        print(col, coeff)
print('Intercept', model.intercept_)

Day -7 NewCases 0.07639677043473975
Day -6 NewCases 0.06038552960529926
Day -5 NewCases 0.2803740504202994
Day -4 NewCases 0.05254730085470343
Day -3 NewCases 0.04797148303872078
Day -2 NewCases 0.4135238168103157
Day -1 NewCases 0.06505683335297956
Day -7 C6_Stay at home requirements 2.887834461571283
Day -7 C8_International travel controls 17.16082414522693
Day -6 C6_Stay at home requirements 9.51519817202433
Day 0 C1_School closing 68.2749113413561
Day 0 C3_Cancel public events 8.606317269862253e-05
Day 0 C6_Stay at home requirements 0.0003024128042793258
Day 0 H1_Public information campaigns 0.0021905117228077473
Day 0 H6_Facial Coverings 0.0002800765552415766
Day -7 Holidays 24.377009551176883
Day -7 density_perkm2 0.0002007922417092906
Day -6 Holidays 9.850256928272572
Day -6 density_perkm2 0.0008647302861244309
Day -5 Holidays 230.51354384817876
Day -5 density_perkm2 0.00015959388981122872
Intercept 217.43187563476658


In [24]:
# Save model to file
if not os.path.exists('models'):
    os.mkdir('models')
with open('models/model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

## Evaluation

Now that the predictor has been trained and saved, this section contains the functionality for evaluating it on sample evaluation data.

In [None]:
# Reload the module to get the latest changes
import predict
from importlib import reload
reload(predict)
from predict import predict_df

In [None]:
%%time
preds_df = predict_df("2020-08-01", "2020-08-31", path_to_ips_file="../../validation/data/2020-09-30_historical_ip.csv", verbose=True)

In [None]:
# Check the predictions
preds_df.head()

# Validation
This is how the predictor is going to be called during the competition.  
!!! PLEASE DO NOT CHANGE THE API !!!

In [None]:
!python predict.py -s 2020-08-01 -e 2020-08-04 -ip ../../validation/data/2020-09-30_historical_ip.csv -o predictions/2020-08-01_2020-08-04.csv

In [None]:
!head predictions/2020-08-01_2020-08-04.csv

# Test cases
We can generate a prediction file. Let's validate a few cases...

In [None]:
import os
from covid_xprize.validation.predictor_validation import validate_submission

def validate(start_date, end_date, ip_file, output_file):
    # First, delete any potential old file
    try:
        os.remove(output_file)
    except OSError:
        pass
    
    # Then generate the prediction, calling the official API
    !python predict.py -s {start_date} -e {end_date} -ip {ip_file} -o {output_file}
    
    # And validate it
    errors = validate_submission(start_date, end_date, ip_file, output_file)
    if errors:
        for error in errors:
            print(error)
    else:
        print("All good!")

## 4 days, no gap
- All countries and regions
- Official number of cases is known up to start_date
- Intervention Plans are the official ones

In [None]:
validate(start_date="2020-08-01",
         end_date="2020-08-04",
         ip_file="../../validation/data/2020-09-30_historical_ip.csv",
         output_file="predictions/val_4_days.csv")

## 1 month in the future
- 2 countries only
- there's a gap between date of last known number of cases and start_date
- For future dates, Intervention Plans contains scenarios for which predictions are requested to answer the question: what will happen if we apply these plans?

In [None]:
%%time
validate(start_date="2021-01-01",
         end_date="2021-01-31",
         ip_file="../../validation/data/future_ip.csv",
         output_file="predictions/val_1_month_future.csv")

## 180 days, from a future date, all countries and regions
- Prediction start date is 1 week from now. (i.e. assuming submission date is 1 week from now)  
- Prediction end date is 6 months after start date.  
- Prediction is requested for all available countries and regions.  
- Intervention plan scenario: freeze last known intervention plans for each country and region.  

As the number of cases is not known yet between today and start date, but the model relies on them, the model has to predict them in order to use them.  
This test is the most demanding test. It should take less than 1 hour to generate the prediction file.

### Generate the scenario

In [None]:
from datetime import datetime, timedelta

start_date = datetime.now() + timedelta(days=7)
start_date_str = start_date.strftime('%Y-%m-%d')
end_date = start_date + timedelta(days=180)
end_date_str = end_date.strftime('%Y-%m-%d')
print(f"Start date: {start_date_str}")
print(f"End date: {end_date_str}")

In [None]:
from covid_xprize.validation.scenario_generator import get_raw_data, generate_scenario, NPI_COLUMNS
DATA_FILE = 'data/OxCGRT_latest.csv'
latest_df = get_raw_data(DATA_FILE, latest=True)
scenario_df = generate_scenario(start_date_str, end_date_str, latest_df, countries=None, scenario="Freeze")
scenario_file = "predictions/180_days_future_scenario.csv"
scenario_df.to_csv(scenario_file, index=False)
print(f"Saved scenario to {scenario_file}")

### Check it

In [None]:
%%time
validate(start_date=start_date_str,
         end_date=end_date_str,
         ip_file=scenario_file,
         output_file="predictions/val_6_month_future.csv")