# Process COVID Data

This Jupyter notebook processes COVID-19 case data from Johns Hopkins University (JHU) for use in several models that predict the spread of COVID-19.

In [1]:
# Import Libraries
import pandas as pd
import src.processing.processing as process # Python file used to process raw data
import importlib
# Reload the module so edits to processing.py take effect without restarting the kernel
importlib.reload(process)
import pprint

Let's read the COVID-19 data downloaded from the Johns Hopkins CSSE GitHub repository. This dataset contains the number of confirmed cases, deaths, and recoveries for every country with at least one COVID-19 case, starting January 22nd, 2020.

In [2]:
# Data File Paths
paths = ['data/raw/time_series_covid19_confirmed_global.csv', 
         'data/raw/time_series_covid19_deaths_global.csv',
         'data/raw/time_series_covid19_recovered_global.csv']

# Import JHU Data
covid_confirmed, covid_deaths, covid_recovered = process.import_data(paths[0], paths[1], paths[2])

We are now ready to process this data for our models. Let's begin with the SIR model.

### SIR

The SIR model is an epidemic model that splits a population into three categories: susceptible, infected, and recovered. We'll use this model to predict the percentage of people infected with, or recovered from, COVID-19 over a given time period.
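For reference, the standard SIR dynamics on population fractions can be sketched as below. This is a minimal forward-Euler simulation, not the model code used by this project; the function name `simulate_sir` and the values chosen for `beta` (transmission rate) and `gamma` (recovery rate) are illustrative assumptions.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Integrate dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I.

    All compartments are fractions of the population, so S + I + R stays at 1.
    Returns one (s, i, r) tuple per day, including day 0.
    """
    s, i, r = s0, i0, r0
    steps_per_day = int(round(1 / dt))
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical parameters, chosen only for illustration
history = simulate_sir(s0=0.999, i0=0.001, r0=0.0, beta=0.3, gamma=0.1, days=60)
s_end, i_end, r_end = history[-1]
print(f"Day 60: S={s_end:.3f}, I={i_end:.3f}, R={r_end:.3f}")
```

Working in fractions rather than head counts matches the per-country percentages prepared in this section.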

We need to import country population data to process the data for the SIR model.

In [3]:
# Import World Population
pop_path = 'data/raw/country_pop.csv'
country_pop = process.pop(pop_path)

Let's make three pandas DataFrames containing the fraction of susceptible, infected, and recovered people in each country. For this project, we express each count as a share of the country's population so that countries are easier to compare.

In [4]:
# This function takes some time to run
s_countries, i_countries, r_countries = process.process_sir(covid_confirmed, covid_deaths, covid_recovered, country_pop)
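The body of `process_sir` lives in `src/processing/processing.py` and is not shown in this notebook. As a rough sketch of how such fractions could be derived, the hypothetical helper `sir_fractions` below assumes that deaths are counted with the recovered ("removed") group and that S = 1 - I - R; `process_sir` may define the compartments differently. The toy inputs mimic the JHU column layout.

```python
import pandas as pd


def sir_fractions(confirmed, deaths, recovered, population):
    """Return (S, I, R) DataFrames of population fractions, one row per country.

    Assumed definitions (not taken from process_sir):
      removed  = recovered + deaths
      infected = confirmed - removed
      S        = 1 - I - R
    """
    def by_country(df):
        # JHU files split some countries by province; sum them into one row
        return (df.drop(columns=['Province/State', 'Lat', 'Long'])
                  .groupby('Country/Region').sum())

    c, d, r = by_country(confirmed), by_country(deaths), by_country(recovered)
    removed = r + d
    infected = c - removed
    # Divide each country's row by that country's population
    i_frac = infected.div(population, axis=0)
    r_frac = removed.div(population, axis=0)
    s_frac = 1 - i_frac - r_frac
    return s_frac, i_frac, r_frac


# Toy data for two made-up countries
meta = {'Province/State': [None, None], 'Country/Region': ['A', 'B'],
        'Lat': [0.0, 1.0], 'Long': [0.0, 1.0]}
toy_confirmed = pd.DataFrame({**meta, '1/22/20': [0, 10], '1/23/20': [100, 20]})
toy_deaths = pd.DataFrame({**meta, '1/22/20': [0, 0], '1/23/20': [10, 0]})
toy_recovered = pd.DataFrame({**meta, '1/22/20': [0, 5], '1/23/20': [40, 10]})
toy_population = pd.Series({'A': 1000, 'B': 2000})

s_toy, i_toy, r_toy = sir_fractions(toy_confirmed, toy_deaths, toy_recovered, toy_population)
print(i_toy)
```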

In [5]:
s_countries

Unnamed: 0,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,...,8/22/20,8/23/20,8/24/20,8/25/20,8/26/20,8/27/20,8/28/20,8/29/20,8/30/20,8/31/20
Afghanistan,1,1,1,1,1,1,1,1,1,1,...,0.99827,0.998264,0.998258,0.998256,0.998239,0.998238,0.998238,0.998238,0.998236,0.998236
Albania,1,1,1,1,1,1,1,1,1,1,...,0.995586,0.99548,0.995388,0.995292,0.995197,0.995087,0.995,0.994936,0.994858,0.994784
Algeria,1,1,1,1,1,1,1,1,1,1,...,0.998373,0.998357,0.998343,0.998329,0.998313,0.998298,0.998282,0.998267,0.998253,0.998238
Andorra,1,1,1,1,1,1,1,1,1,1,...,0.974465,0.974465,0.974244,0.974244,0.973546,0.973546,0.973093,0.973093,0.973093,0.972342
Angola,1,1,1,1,1,1,1,1,1,1,...,0.999907,0.999906,0.999903,0.999898,0.999896,0.999883,0.99989,0.999887,0.999885,0.999883
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,1,1,1,1,1,1,1,1,1,1,...,0.993499,0.993265,0.993168,0.992783,0.99251,0.992337,0.992189,0.992023,0.991871,0.991697
Western Sahara,1,1,1,1,1,1,1,1,1,1,...,0.999968,0.999968,0.999968,0.999968,0.999968,0.999968,0.999968,0.999968,0.999968,0.999968
Yemen,1,1,1,1,1,1,1,1,1,1,...,0.999882,0.999881,0.999881,0.99988,0.99988,0.99988,0.999879,0.999879,0.999878,0.999877
Zambia,1,1,1,1,1,1,1,1,1,1,...,0.998855,0.998841,0.998823,0.998805,0.998784,0.998764,0.998749,0.998734,0.998707,0.998702


In [6]:
i_countries

Unnamed: 0,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,...,8/22/20,8/23/20,8/24/20,8/25/20,8/26/20,8/27/20,8/28/20,8/29/20,8/30/20,8/31/20
Afghanistan,0,0,0,0,0,0,0,0,0,0,...,0.000219686,0.000216603,0.000213341,0.000211491,0.000197029,0.000197337,0.00019726,0.000197234,0.000197054,0.000197131
Albania,0,0,0,0,0,0,0,0,0,0,...,0.00133644,0.00133609,0.00136841,0.00137953,0.00140072,0.00139899,0.0013903,0.00138439,0.0013764,0.00139516
Algeria,0,0,0,0,0,0,0,0,0,0,...,0.000245604,0.000248181,0.00025183,0.000255068,0.000256961,0.000259606,0.000261887,0.000263916,0.00026606,0.000267725
Andorra,0,0,0,0,0,0,0,0,0,0,...,0.00151427,0.00151427,0.00168252,0.00168252,0.00196726,0.00196726,0.00218728,0.00218728,0.00218728,0.00278263
Angola,0,0,0,0,0,0,0,0,0,0,...,3.73027e-05,3.82459e-05,3.78808e-05,3.66333e-05,3.80938e-05,2.96657e-05,4.068e-05,4.26881e-05,4.42399e-05,4.48788e-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,0,0,0,0,0,0,0,0,0,0,...,0.00158634,0.00149508,0.00157803,0.00139702,0.00133268,0.00138805,0.00149157,0.00150799,0.00159006,0.00164609
Western Sahara,0,0,0,0,0,0,0,0,0,0,...,1.67412e-06,1.67412e-06,1.67412e-06,1.67412e-06,1.67412e-06,1.67412e-06,1.67412e-06,1.67412e-06,1.67412e-06,1.67412e-06
Yemen,0,0,0,0,0,0,0,0,0,0,...,9.89071e-06,9.11957e-06,9.08604e-06,9.25368e-06,9.1531e-06,9.1531e-06,9.1531e-06,9.05251e-06,8.9184e-06,8.75076e-06
Zambia,0,0,0,0,0,0,0,0,0,0,...,3.31811e-05,4.67799e-05,3.59009e-05,3.28003e-05,2.18125e-05,2.60553e-05,2.99718e-05,2.86119e-05,1.54483e-05,1.84944e-05


In [7]:
r_countries

Unnamed: 0,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,...,8/22/20,8/23/20,8/24/20,8/25/20,8/26/20,8/27/20,8/28/20,8/29/20,8/30/20,8/31/20
Afghanistan,0,0,0,0,0,0,0,0,0,0,...,0.00075526,0.000759524,0.000764199,0.00076646,0.000782027,0.000782129,0.000782489,0.000782592,0.00078326,0.00078326
Albania,0,0,0,0,0,0,0,0,0,0,...,0.00153902,0.00159219,0.00162172,0.00166412,0.0017013,0.00175725,0.00180485,0.00183995,0.00188304,0.00191049
Algeria,0,0,0,0,0,0,0,0,0,0,...,0.00069093,0.000697292,0.00070272,0.000707919,0.000714943,0.000721351,0.000727896,0.000734487,0.000740667,0.000746938
Andorra,0,0,0,0,0,0,0,0,0,0,...,0.0120106,0.0120106,0.0120365,0.0120365,0.0122436,0.0122436,0.0123601,0.0123601,0.0123601,0.0124377
Angola,0,0,0,0,0,0,0,0,0,0,...,2.76271e-05,2.78097e-05,2.97265e-05,3.283e-05,3.28604e-05,4.38139e-05,3.45035e-05,3.49294e-05,3.55988e-05,3.58726e-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,0,0,0,0,0,0,0,0,0,0,...,0.00245743,0.00262005,0.00262705,0.00290983,0.00307857,0.00313745,0.00315955,0.00323441,0.00326964,0.00332852
Western Sahara,0,0,0,0,0,0,0,0,0,0,...,1.5067e-05,1.5067e-05,1.5067e-05,1.5067e-05,1.5067e-05,1.5067e-05,1.5067e-05,1.5067e-05,1.5067e-05,1.5067e-05
Yemen,0,0,0,0,0,0,0,0,0,0,...,5.40469e-05,5.49521e-05,5.51533e-05,5.52539e-05,5.55556e-05,5.56562e-05,5.59915e-05,5.61926e-05,5.65615e-05,5.68967e-05
Zambia,0,0,0,0,0,0,0,0,0,0,...,0.000555974,0.000556028,0.000570497,0.00058105,0.000596988,0.000604984,0.00061075,0.0006188,0.000638655,0.000639525


We have finished processing the data for the SIR model. We can move on to the LSTM model.

### LSTM

The LSTM neural network model uses time series data for training and making predictions. We'll use this model to predict the number of global COVID-19 cases over a given time period.

Let's create a time series of global cases starting from January 22nd by summing the case counts from every country for each date.

In [8]:
# Convert JHU Data Into Global Time Series
# The first four columns are metadata; summing the remaining date columns gives one global total per date
global_time_series = covid_confirmed.iloc[:, 4:].sum()
global_time_series

1/22/20         555
1/23/20         654
1/24/20         941
1/25/20        1434
1/26/20        2118
             ...   
8/27/20    24451968
8/28/20    24733727
8/29/20    24995735
8/30/20    25221988
8/31/20    25484046
Length: 223, dtype: int64
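An LSTM is usually trained on fixed-length windows of a series rather than on the raw series itself. The notebook does not show how the downstream model builds those windows, so the following is only a sketch of one common approach; the helper name `make_windows` and the window length of 3 are assumptions.

```python
import numpy as np


def make_windows(values, window):
    """Split a 1-D sequence into (X, y) pairs for next-step prediction.

    X[k] holds `window` consecutive values and y[k] is the value right after them.
    """
    values = np.asarray(values, dtype=float)
    X = np.stack([values[k:k + window] for k in range(len(values) - window)])
    y = values[window:]
    # LSTM layers usually expect input shaped (samples, timesteps, features)
    return X[..., np.newaxis], y


# First five global case counts from the series above
X, y = make_windows([555, 654, 941, 1434, 2118], window=3)
print(X.shape, y)
```

In practice the counts would normally be scaled before training, since raw values in the tens of millions make optimization harder.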

We are done processing the data for the LSTM model. We can now move on to the Gaussian model.

### Gaussian

The Gaussian error function can be used to model the number of COVID-19 deaths in a country over a period of time. For this project, we'll use this model to predict the number of deaths in the US.
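One common way to write such a curve is D(t) = (p / 2) · (1 + erf(α (t − β))), where D is cumulative deaths. The sketch below evaluates that form with the standard library's `math.erf`; the parameterization and the parameter values are assumptions for illustration, not values fitted by this project.

```python
import math


def cumulative_deaths(t, p, alpha, beta):
    """Error-function curve: D(t) = p/2 * (1 + erf(alpha * (t - beta))).

    p     : final death toll the curve levels off at
    alpha : growth rate (controls how steep the rise is)
    beta  : day on which daily deaths peak (the inflection point)
    """
    return 0.5 * p * (1.0 + math.erf(alpha * (t - beta)))


# Hypothetical parameters, for illustration only
p, alpha, beta = 1000.0, 0.05, 60.0
curve = [cumulative_deaths(t, p, alpha, beta) for t in range(0, 181, 30)]
print([round(d, 1) for d in curve])
```

Fitting the model then amounts to choosing `p`, `alpha`, and `beta` so that the curve tracks the observed cumulative death series.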

Let's create a time series of US COVID-19 deaths starting from January 22nd by summing the deaths across all US rows for each date.

In [9]:
# Convert JHU Data Into US Deaths Time Series
# Select the US rows, then sum the date columns (the first four columns are metadata)
us_rows = covid_deaths[covid_deaths['Country/Region'] == 'US']
us_deaths_time_series = us_rows.iloc[:, 4:].sum()

We have finished processing the data for the Gaussian model.

# Save Data to Processed Folder

We can save all of the processed data to the processed folder by running the cell below. 

Note: the processed data can already be found in the processed folder. Run this cell only if that data is corrupted or missing.

In [10]:
covid_confirmed.to_csv('data/processed/General/covid_confirmed.csv')
covid_deaths.to_csv('data/processed/General/covid_deaths.csv')
covid_recovered.to_csv('data/processed/General/covid_recovered.csv')

s_countries.to_csv('data/processed/SIR/s_countries.csv')
i_countries.to_csv('data/processed/SIR/i_countries.csv')
r_countries.to_csv('data/processed/SIR/r_countries.csv')

global_time_series.to_csv('data/processed/LSTM/global_time_series.csv')
us_deaths_time_series.to_csv('data/processed/Gaussian/us_deaths_time_series.csv')
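`to_csv` does not create missing folders, so the cell above fails if `data/processed/...` does not exist yet. A small wrapper along the following lines could guard against that. `save_csv` is a hypothetical helper, not part of this project, and the example writes to a temporary directory so it can be run safely.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(obj, path):
    """Write a DataFrame or Series to CSV, creating parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    obj.to_csv(path)
    return path


with tempfile.TemporaryDirectory() as tmp:
    series = pd.Series([555, 654, 941],
                       index=['1/22/20', '1/23/20', '1/24/20'], name='cases')
    out = save_csv(series, Path(tmp) / 'processed' / 'LSTM' / 'global_time_series.csv')
    # index_col=0 restores the date labels as the index instead of an "Unnamed: 0" column
    restored = pd.read_csv(out, index_col=0)['cases']

print(restored)
```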