# Scenario 1: Calibrating models and estimating causal effects

In this scenario, we are interested in determining the effects of masking and social distancing on Covid-19 infections using simulated data for four different countries: Afghanistan, Colombia, France, and the United Kingdom. The simulations use contact matrices and populations subdivided into age groups for each of the countries. In these simulations, each country implements interventions at the same time, and we are interested in understanding how the effects of the interventions differ between countries. The data are generated from an SEIRD model. 

In these questions, we provide the contact matrices and population data for each country, as well as the outputs of the simulated SEIRD model. We ask you to calibrate a model to estimate R0, and to estimate the causal effects of interventions.

In the data provided, ‘UK_population.csv’ and ‘UK_contact_matrix.csv’ contain data on the population counts and the contact matrix for each age group in the UK. ‘UK_compartments.csv’ has compartments S, E, ICase, IMild, R, and D. All patients in IMild transition to R, whereas all patients in ICase transition to the hospital (IHospital, not currently included in this output). From there, some patients recover and others transition to D. ‘UK_infections.csv’ has total infections for each day. There are analogous files for the other countries.

In each country:
1. From t = 0 to t = 40 days, there are no interventions in place, and Covid-19 spreads unabated.
2. At t = 40 days, people begin wearing masks, decreasing the spread of Covid-19.
3. At t = 100 days, mask fatigue sets in. People mask less than they did from t = 40 to t = 100, but still more than they did from t = 0 to t = 40. Cases increase again.

During this same time frame, each country begins implementing a social distancing policy at t = 60 days. Due to the policy, contact decreases to 30% of its original levels for the interactions of people over the age of 65 with others over the age of 65 (i.e., the 3x3 matrix on the bottom right of the contact matrix) and it decreases to 60% of its original levels otherwise.
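The scaling described above can be sketched as follows. This is a minimal illustration, assuming the contact matrix is a square NumPy array whose last three rows and columns correspond to the over-65 age groups; the function name `apply_social_distancing` and the 5x5 example matrix are hypothetical and not part of the provided data.

```python
import numpy as np

def apply_social_distancing(contact: np.ndarray,
                            n_elderly_groups: int = 3,
                            elderly_factor: float = 0.3,
                            other_factor: float = 0.6) -> np.ndarray:
    """Return a scaled copy of a square contact matrix.

    Assumption: the last `n_elderly_groups` rows/columns are the over-65 groups.
    """
    # Scale every entry to 60% of its original level
    scaled = contact * other_factor
    # Overwrite the bottom-right block (over-65 with over-65) with 30% of the original
    k = n_elderly_groups
    scaled[-k:, -k:] = contact[-k:, -k:] * elderly_factor
    return scaled

# Hypothetical 5x5 matrix of ones, used only to make the scaling visible
example = np.ones((5, 5))
distanced = apply_social_distancing(example)
```

The function returns a new array and leaves the original matrix untouched, so the pre-intervention matrix remains available for the intervals before t = 60.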

Because this model is stochastic (in particular, the number of people transitioning between compartments is randomly drawn from a binomial distribution), we run each simulation ten times for each country. The dt parameter is set to 0.2 so that there are 5 time steps per day and 1000 time steps in each 200-day simulation.
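The step arithmetic and the binomial draw can be illustrated with a short sketch. The conversion from a rate to a per-step probability, `1 - exp(-rate * dt)`, is an assumption made for illustration; the scenario text does not state how the simulator computes transition probabilities. The function name `binomial_transition` and the rate value are likewise hypothetical.

```python
import numpy as np

dt = 0.2                               # step size in days, from the scenario
n_days = 200
steps_per_day = round(1 / dt)          # 5 time steps per day
n_steps = n_days * steps_per_day       # 1000 time steps per simulation

def binomial_transition(rng: np.random.Generator,
                        n_in_compartment: int,
                        rate: float,
                        dt: float) -> int:
    """Draw how many people leave a compartment during one time step.

    Assumption: the per-step probability is 1 - exp(-rate * dt).
    """
    p = 1.0 - np.exp(-rate * dt)
    return int(rng.binomial(n_in_compartment, p))

rng = np.random.default_rng(0)
# Hypothetical values: 1000 people in a compartment, exit rate 0.25 per day
moved = binomial_transition(rng, 1000, 0.25, dt)
```

Because each draw is random, repeated runs with different seeds give different trajectories, which is why the ten replicates per country are needed to characterize the spread of outcomes.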

Using the data provided, which includes contact matrices and population counts for all four countries, along with the output from running 10 simulations for each country:

### (1) Calibrate models to estimate R0 for each country in each of the four time intervals: [0,40), [40,60), [60,100), [100, 200].
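As a starting point for interval-wise calibration, each time point can be assigned to one of the four intervals. The sketch below is one possible helper, not part of the original notebook; the names `BREAKPOINTS`, `LABELS`, and `interval_label` are hypothetical.

```python
import numpy as np

# Intervention times (in days) that separate the four intervals
BREAKPOINTS = np.array([40, 60, 100])
LABELS = ["[0,40)", "[40,60)", "[60,100)", "[100,200]"]

def interval_label(t: float) -> str:
    """Return the label of the interval that contains time t (in days)."""
    # side="right" places a time equal to a breakpoint in the later interval
    idx = np.searchsorted(BREAKPOINTS, t, side="right")
    return LABELS[int(idx)]
```

For example, `interval_label(39.8)` falls in `[0,40)`, while `interval_label(40)` falls in `[40,60)`, matching the half-open intervals above.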

### (2) Estimate the causal effects of masking alone, the combined effect of masking and social distancing, and the effect of social distancing alone on infections. Include uncertainty in the estimated effects.

### (3) (Optional) In each country, what is the maximum value R0 can be over the last time interval ([100, 200]), to ensure that there are no more than 50 infections at t = 200 days with 90% confidence? Are these different across countries?
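One way to read the "with 90% confidence" criterion is as a check on sampled outcomes: at least 90% of simulated infection counts at t = 200 must be at or below the threshold. The sketch below encodes that reading; the interpretation itself is an assumption, and the function name `meets_threshold` and the sample values are hypothetical.

```python
import numpy as np

def meets_threshold(samples, threshold: float = 50, confidence: float = 0.9) -> bool:
    """Check whether enough sampled outcomes stay at or below the threshold.

    Assumption: "confidence" is the empirical fraction of samples
    that satisfy the constraint.
    """
    samples = np.asarray(samples, dtype=float)
    fraction_ok = np.mean(samples <= threshold)
    return bool(fraction_ok >= confidence)

# Hypothetical infection counts at t = 200 from ten sampled trajectories
mostly_low = [5, 8, 12, 15, 20, 22, 30, 35, 41, 48]
too_many_high = [5, 8, 12, 15, 20, 22, 30, 70, 85, 90]
```

With such a check in hand, the maximum admissible R0 could be searched for by repeating the simulation over candidate R0 values and keeping the largest one for which the check passes.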

### (4) (Optional) Can R0 be changed in one of the three preceding intervals ([0,40), [40,60), [60,100)), without changing it in the fourth, to ensure that infections at t = 200 days stay below ¾ of their simulated value at t = 200 with 95% confidence?

### Load dependencies

In [1]:
import os
import pyciemss
import torch
import pandas as pd
import numpy as np
import requests
from io import StringIO
from typing import Dict, List

import pyciemss.visuals.plots as plots
import pyciemss.visuals.vega as vega
import pyciemss.visuals.trajectories as trajectories

# Country 1: Let's start with France

### Collect data

In [2]:
DATA_PATH = "https://raw.githubusercontent.com/ciemss/program-milestones/epi-scenario-1/18-month-milestone/hackathon/epi/Scenario%201%20Supplemental/"

COUNTRY = "France" # "UK", "Afghanistan", "Colombia"
# Build URLs by plain concatenation: DATA_PATH already ends in "/", and os.path.join
# is meant for filesystem paths (it would insert "\" on Windows)
infection_data = DATA_PATH + COUNTRY + "_infections.csv"
death_data = DATA_PATH + COUNTRY + "_compartments.csv"
population_data = DATA_PATH + COUNTRY + "_population.csv"
contact_matrix = DATA_PATH + COUNTRY + "_contact_matrix.csv"

### Define functions to read and manipulate data

In [5]:
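def read_data(data_location: str) -> pd.DataFrame:
    """Download a CSV file from a URL and return it as a DataFrame."""

    # Fetch the contents of the CSV file from the URL
    response = requests.get(data_location)
    response.raise_for_status()  # fail early on a bad URL instead of parsing an error page
    csv_content = StringIO(response.text)
    df = pd.read_csv(csv_content)

    return df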

def add_over_age_groups(df: pd.DataFrame, input_comp: str, output_comp: str):
    """Sum one compartment over all age groups, using the first replicate only."""

    # Drop the unnamed index column written by the CSV export, if present
    df = df.drop(columns=['Unnamed: 0'], errors='ignore')
    replicate_1_df = df[df['replicate'] == 1] # only using first run
    compartment_df = replicate_1_df[replicate_1_df["compartment"] == input_comp] # only use chosen compartment
    # Aggregate data by age: sum over age groups at each time point
    result = compartment_df.groupby('t')['y'].sum().reset_index()
    result = result.rename(columns={'t': 'Timestamp', 'y': output_comp})
    result = result.set_index('Timestamp')

    return result
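
The helper above keeps only the first replicate. Since ten replicates are available per country, a variant that summarizes all of them may be useful for gauging simulation-to-simulation variability. The sketch below is one possible approach, not part of the original notebook; it assumes the same column names (`replicate`, `compartment`, `t`, `y`), and the function name `summarize_replicates` and the toy data are hypothetical.

```python
import pandas as pd

def summarize_replicates(df: pd.DataFrame, input_comp: str) -> pd.DataFrame:
    """Sum a compartment over age groups per replicate, then summarize across replicates."""
    comp = df[df["compartment"] == input_comp]
    # Rows: time points; columns: replicates; values: totals over age groups
    totals = comp.groupby(["replicate", "t"])["y"].sum().unstack("replicate")
    return pd.DataFrame({
        "mean": totals.mean(axis=1),
        "lower": totals.quantile(0.05, axis=1),  # 5th percentile across replicates
        "upper": totals.quantile(0.95, axis=1),  # 95th percentile across replicates
    })

# Toy data: two replicates, two time points, two age groups (values are made up)
toy = pd.DataFrame({
    "replicate": [1, 1, 1, 1, 2, 2, 2, 2],
    "t": [0, 0, 1, 1, 0, 0, 1, 1],
    "age_group": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "compartment": ["D"] * 8,
    "y": [1, 2, 3, 4, 3, 4, 5, 6],
})
summary = summarize_replicates(toy, "D")
```

The result is indexed by `t`, with one row per time point, so it can be joined to or plotted alongside the single-replicate series used for calibration.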

### Produce dataset for calibration, calculate total population

In [6]:
# Aggregate case data by age
case_by_age_df = read_data(infection_data)
case_df = add_over_age_groups(case_by_age_df, "infections", "cases")

# Aggregate death data by age
death_by_age_df = read_data(death_data)
death_df = add_over_age_groups(death_by_age_df, "D", "deaths")

# Merge data into one DataFrame
death_column = death_df["deaths"]
data_df = pd.concat([case_df, death_column], axis=1)

data_df.to_csv(COUNTRY + "_case_death_data.csv", index=True)

# Calculate the total population
pop_data_by_age = read_data(population_data)
total_population = pop_data_by_age["n"].sum()
print(COUNTRY + " total population:", total_population)

France total population: 65273512
