## Covid19 Projection
The purpose of this notebook is to run COVID-19 case projections at the state level (for example, Indian or US states) or at the country level. The outcome is a projection of the total confirmed cases for the target state or country.

This solution first tries to understand the approximate time to peak for the target entity (state/country), expected case rates and higher/lower bounds. It determines these parameters from countries that have exhibited similar trends in the past. 

Subsequently, it runs a simulation with the optimized parameters to generate day-by-day case projections. The simulation assumes two waves of infection surges, each following a Gaussian distribution, and applies them when generating the case projections. It also incorporates factors such as transmission probability, testing efficiency, intervention impact, and attrition.

### When to use this notebook?

This notebook runs locally. Our simulation and optimization loops are not computationally heavy. If your local resources are not adequate, we recommend using the version of the notebook that enables training on Amazon SageMaker, ``covid19_simulator_sagemaker``.

### Install and import libraries

Uncomment to install missing libraries with the correct versions.

In [None]:
#!pip install -r requirements.txt

Import libraries used in the notebook. Additional libraries will be imported as needed in the Python files in the repository.

In [None]:
import sys
sys.path.insert(1, 'src')
import pandas as pd
import numpy as np
import random
import urllib.request
import os
# fixed_seed = 39
# random.seed(fixed_seed)
# np.random.seed(fixed_seed)

import warnings
warnings.filterwarnings("ignore")



The `config` module imported below contains all the configuration parameters for the public datasets as well as disease-specific parameters extracted from scientific journals. For details, refer to the references cited in `readme.md` for our codebase.

In [None]:
import config

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from simulation_orchestrator import run

### Update local data files from public resources

Before running our use cases for various states and countries, we update our local data files with the most up-to-date versions from public data sources.

In [None]:
# Function to refresh the local data file with the latest version from the web
def download_latest_data (url, local_file):
    with urllib.request.urlopen(url) as response, open(local_file, 'wb') as out_file:
        data = response.read() # a `bytes` object
        out_file.write(data)

# Mapping of online vs offline file locations to refresh
online_offline_data = list()
# Confirmed cases data maintained by Johns Hopkins University
online_offline_data.append((config.confirmed_cases_global_online, 
                            os.path.join(config.base_data_dir, config.confirmed_cases_global_offline)))
# Recovered cases data maintained by Johns Hopkins University
online_offline_data.append((config.recovered_cases_global_online, 
                            os.path.join(config.base_data_dir, config.recovered_cases_global_offline)))
# Deceased cases data maintained by Johns Hopkins University
online_offline_data.append((config.deceased_cases_global_online, 
                            os.path.join(config.base_data_dir, config.deceased_cases_global_offline)))
# Indian states specific cases maintained by COVID19INDIA (www.covid19india.org)
online_offline_data.append((config.india_states_cases_online, 
                            os.path.join(config.base_data_dir, config.india_states_cases_offline)))
# USA population data from census
online_offline_data.append((config.usa_populations_data_online, 
                            os.path.join(config.base_data_dir, config.usa_populations_data_offline)))


# Refresh the local data files with the latest versions from respective web sources
for path_pair in online_offline_data:
    try:
        download_latest_data (path_pair[0], path_pair[1])
        print ('Downloaded latest data from: {}'.format(path_pair[0]))
    except Exception as e:
        print ('Error while downloading {}: {}'.format(path_pair[0], e.__class__))


## Parse data files

Import Python code that will parse through the public data sets and extract historical data relevant for our prediction and simulation environments.

In [None]:
import state_data_loader
import country_data_loader

Uncomment and run the following cell to load the latest USA data, parsed separately for each state. Leave it commented out to use the files already stored locally.

In [None]:
# country_code = 'USA'
# state_data_loader.load_us_all(country_code)

## Examples

We will now go through a few examples of how we analyze various intervention scores for a given state and country.


### Case: California and Texas, USA

In [None]:
country_code = 'USA'
states_names = ['California', 'Texas'] # A list, so that multiple states can be loaded when needed.
states = ['CA', 'TX'] # State codes, in the same order as states_names.

# Transform and write the country/state specific data for further processing
#country_data_loader.load()
state_data_loader.load(country_code, states, latest=True)

Extract population from census data.

In [None]:
state_population = state_data_loader.load_us_population(states_names)
for idx in range(len(states)):
    print('The population of {} for age 18 and older is {}.'.format(states[idx], state_population[idx]))

Plot the historical trends.

In [None]:
input_data_file = 'data/input/Cases_{}_{}.csv'.format(country_code, states[0])
print(input_data_file)

In [None]:
df = pd.read_csv(input_data_file)

In [None]:
df.head(10)

In [None]:
import matplotlib.pyplot as plt
# Confirmed cases as a percentage of the state's population aged 18 and older
df['Percent_Confirmed'] = 100 * df['Confirmed'] / state_population[0]
df.plot(y='Percent_Confirmed')

#### Configure options for prediction

Next, we need to decide whether to feed in an incidence-rate pattern based on historical patterns in other states or countries. We recommend this option when there are states or countries in which disease exposure happened earlier than in the subject state or country. Otherwise, the incidence-rate pattern is based on the seasonality of similar infectious diseases such as flu. The annual pattern for seasonal flu has up to three dominant peaks (CITATION). Since most countries have already observed the first peak, we use the available historical COVID-19 data for the first peak and then derive the second and third peaks by scaling the magnitude and duration of the first peak using Gaussian curves.
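
The scaling idea in the paragraph above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names `gaussian_wave` and `later_wave_from_first`, the peak days, and the scale factors are all hypothetical values chosen for the example.

```python
import numpy as np

def gaussian_wave(days, peak_day, peak_rate, width_days):
    """Daily case rate shaped as a Gaussian bump centered on peak_day."""
    return peak_rate * np.exp(-0.5 * ((days - peak_day) / width_days) ** 2)

def later_wave_from_first(days, first_peak, peak_day, magnitude_scale, duration_scale):
    """Build a later wave by scaling the first wave's magnitude and duration."""
    return gaussian_wave(
        days,
        peak_day=peak_day,
        peak_rate=first_peak["peak_rate"] * magnitude_scale,
        width_days=first_peak["width_days"] * duration_scale,
    )

days = np.arange(0, 365)

# Hypothetical first-peak parameters; in practice these would come from observed data.
first_peak = {"peak_day": 60, "peak_rate": 1000.0, "width_days": 14.0}

wave1 = gaussian_wave(days, **first_peak)
# Assumed scale factors: later waves are lower and wider than the first one.
wave2 = later_wave_from_first(days, first_peak, peak_day=200,
                              magnitude_scale=0.5, duration_scale=1.5)
wave3 = later_wave_from_first(days, first_peak, peak_day=320,
                              magnitude_scale=0.25, duration_scale=1.5)

# The combined daily rate is the sum of the individual waves.
daily_rate = wave1 + wave2 + wave3
```

Scaling both the height and the width keeps the later waves tied to what was actually observed for the first peak, so only the scale factors and peak timing need to be chosen or learned.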

Let's find out whether there are other states or countries similar to our use case.

In [None]:
# True, uses other state/country data. False, uses the infection case rate of the target state/country only
#config.enable_case_rate_adjustment = False
enable_case_rate_adjustment = {'CA':False, 'TX':True}

The `use_default_wave1_weeks_range` flag can be set to False for most countries and states, except for European countries where cases peaked and declined in a relatively short timespan. Setting it to True ensures that a default range (e.g. 1 to 5) is used instead of the weeks-to-peak values measured from other countries.


In [None]:
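# True, uses a default weeks-to-peak range (e.g. 1 to 5). False, uses the weeks-to-peak values measured from other countries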
#config.use_default_wave1_weeks_range = True
use_default_wave1_weeks_range = {'CA':True, 'TX':True}

### Important parameters:
#### config.enable_case_rate_adjustment
- Setting this flag to True adjusts the infection case rate based on observations from countries with best matching case rate change pattern
- Setting this flag to False uses the infection case rate of the target state/country only

#### config.use_default_wave1_weeks_range
- Setting this flag to True will ensure that a default range (e.g. 1 to 5) is used instead of the weeks-to-peak values measured from other countries
- This could be set to False for most countries / states, except for the European countries where the cases peaked and declined in relatively short timespan
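
To make the effect of `use_default_wave1_weeks_range` concrete, here is a small sketch of the choice it describes. The helper `choose_wave1_weeks_range`, its arguments, and the fallback to the default range when no measurements are available are assumptions made for illustration; the repository may implement this differently.

```python
def choose_wave1_weeks_range(use_default, measured_weeks_to_peak, default_range=(1, 5)):
    """Pick the (min, max) weeks-to-peak range searched for the first wave.

    use_default: mirrors the use_default_wave1_weeks_range flag.
    measured_weeks_to_peak: weeks-to-peak values measured from other countries.
    """
    # Assumption: fall back to the default range when nothing was measured.
    if use_default or not measured_weeks_to_peak:
        return default_range
    return (min(measured_weeks_to_peak), max(measured_weeks_to_peak))

# Hypothetical measurements from reference countries
measured = [6, 8, 10]
default_choice = choose_wave1_weeks_range(True, measured)
measured_choice = choose_wave1_weeks_range(False, measured)
```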

##### learn_params
- Whether to learn the optimal simulation parameters from the latest data, or to use the ones learned during the last optimization
- It's recommended to learn the parameters after fetching the latest confirmed-cases data

##### training_days
- Number of days of COVID-19 confirmed cases to use for learning / optimizing the simulation parameters

##### test_days
- Number of most recent days to set aside for testing
- Set this parameter so that the simulation learns from a period when the COVID-19 case rate was increasing.
- Example 1: Suppose that for a selected region COVID-19 cases started to rise 100 days ago, reached their peak in 30 days, and declined to zero daily cases in another 30 days. Here, we can choose test_days = 75, as that allows the simulation parameters (fitment_days) to be learned from the growth phase of the infection. If we instead choose test_days = 5, we would try to learn the simulation parameters from a flat line and would get unexpected results.
- Example 2: Suppose that for a selected region COVID-19 cases started to rise 60 days ago and are still rising. Here, we can choose test_days = 5, as that allows the simulation parameters (fitment_days) to be learned from the growth phase of the infection.

##### projection_days
- Number of days to project confirmed COVID19 cases for, including the test_days.
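
The relationship between `training_days`, `test_days`, and `projection_days` can be sketched with simple index arithmetic. This sketch assumes that the training window is the block of days immediately before the test window; that placement, and the helper name `split_windows`, are assumptions for illustration rather than a description of the repository's code.

```python
def split_windows(n_days, training_days, test_days, projection_days):
    """Return index ranges (start, end), end exclusive, over a series of n_days observations."""
    test_start = n_days - test_days
    train_start = test_start - training_days
    if train_start < 0:
        raise ValueError("Not enough history for the requested training and test windows")
    # projection_days includes the test window, so only the remainder lies in the future.
    future_days = projection_days - test_days
    if future_days < 0:
        raise ValueError("projection_days must be at least test_days")
    return {
        "training": (train_start, test_start),
        "test": (test_start, n_days),
        "future_days": future_days,
    }

# Example 1 from the list above: cases started rising 100 days ago, peaked after 30 days.
example1 = split_windows(n_days=100, training_days=14, test_days=75, projection_days=120)

# Same history with test_days = 5: the training window lands in the flat tail.
flat_tail = split_windows(n_days=100, training_days=14, test_days=5, projection_days=120)
```

With `test_days = 75`, the training window falls inside the first 30 days, which is the growth phase. With `test_days = 5`, it falls after the decline, which is the flat-line situation the example warns about.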

##### intv_inf_pctg
- Assumed influence of various interventions to reduce the spread of COVID19.
- Should be between 0 (no influence) and 1 (maximum influence)

#### country_code, state, state_population, actual_testing_capacity
- Target location specific parameters
- When projecting for a country, the state and state_population parameters represent the target country

These parameters should be configured based on the target location and the current day before running the projections.

Additional configurable parameters can be managed in `src/config.py`.
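
Because this notebook keeps each parameter in a dictionary keyed by state code, a quick consistency check before a long run can catch a missing entry or an out-of-range value. The helper below is a hypothetical sketch, not part of the repository, and the example values are deliberately wrong to show what it reports.

```python
def validate_state_params(states, **params):
    """Return a list of problems found in per-state parameter dicts (empty if none)."""
    problems = []
    # Every per-state dictionary should have an entry for every state.
    for name, per_state in params.items():
        missing = [s for s in states if s not in per_state]
        if missing:
            problems.append("{} has no entry for: {}".format(name, ", ".join(missing)))
    # intv_inf_pctg should lie between 0 (no influence) and 1 (maximum influence).
    for state, value in params.get("intv_inf_pctg", {}).items():
        if not 0 <= value <= 1:
            problems.append(
                "intv_inf_pctg for {} should be between 0 and 1, got {}".format(state, value))
    return problems

# Deliberately inconsistent example values
states = ['CA', 'TX']
problems = validate_state_params(
    states,
    training_days={'CA': 14, 'TX': 14},
    test_days={'CA': 15},
    intv_inf_pctg={'CA': 0.8, 'TX': 1.2},
)
for problem in problems:
    print(problem)
```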

In [None]:
# learn_params = True
# training_days = 14
# test_days = 15
# projection_days = 120
# intv_inf_pctg = 0.8

learn_params = {'CA':True, 'TX':True}
training_days = {'CA':14, 'TX':14}
test_days = {'CA':15, 'TX':30}
projection_days = {'CA':30, 'TX':30}
intv_inf_pctg = {'CA':0.8, 'TX':0.8}

We extract the estimated population from the census data for the number of individuals 18 and older in 2019.

In [None]:
# state_population = state_data_loader.load_us_population(states)
# for idx in range(len(states)):
#     print('The population of {} for age 18 and older is {}.'.format(states[idx], state_population[idx]))

Set the actual testing capacity for each state. It is passed to the simulation as `actual_testing_capacity`.

In [None]:
#actual_testing_capacity = 500000
actual_testing_capacity = {'CA':500000, 'TX':500000}

Set `country_level_projection` to True to project for a whole country, or to False to project for individual states.

In [None]:
country_level_projection = False

#### Run simulation-based prediction

We are now ready to run the simulation. It will take several minutes to run for ... days.

In [None]:
for idx, state in enumerate(states):
    print('\nProjection for: {}'.format(state) + ' * ' * 25)
    # Apply the per-state flags before each run
    config.enable_case_rate_adjustment = enable_case_rate_adjustment[state]
    config.use_default_wave1_weeks_range = use_default_wave1_weeks_range[state]
    run(country_code, state, state_population[idx], actual_testing_capacity[state],
        training_days[state], test_days[state], projection_days[state],
        learn_params[state], country_level_projection,
        intervention_influence_pctg=intv_inf_pctg[state])
