In [1]:
import pandas as pd

## Preprocessing 

Here we'll process the dataset provided by Johns Hopkins University in the public Google Sheet: https://docs.google.com/spreadsheets/d/1avGWWl1J19O_Zm0NGTGy2E-fOG05i4ljRfjl87P7FiA/edit?ts=5e5e9222#gid=0

The document has been exported as a CSV and is available in `data/COVID-19.csv`.

The processed dataset will be saved to `data/COVID-19-Cleaned.csv`.

Also, when doing analysis it'll be useful to reference the populations of countries. This is important when trying to model logistic growth. We can get this data as a CSV export from the World Bank at the following link: http://api.worldbank.org/countries/all/indicators/SP.POP.TOTL?format=csv

We'll download the World Bank dataset automatically and save it to `data/populations.csv`.

TODO: It would be good to read the document from Google Sheets automatically so that this can be re-run without a manual export step.
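
One possible way to address the TODO is sketched below. It assumes the sheet stays publicly shared and that Google's `export?format=csv` URL pattern works for it; `sheet_csv_url` is a hypothetical helper, not part of this notebook. The sheet ID and `gid` are taken from the link above.

```python
# Hypothetical helper: build the CSV export URL for one tab of a Google Sheet.
def sheet_csv_url(sheet_id, gid=0):
    return f'https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}'

# Sheet ID and gid copied from the link above.
SHEET_ID = '1avGWWl1J19O_Zm0NGTGy2E-fOG05i4ljRfjl87P7FiA'
csv_url = sheet_csv_url(SHEET_ID, gid=0)
print(csv_url)

# To fetch directly (needs network access and a publicly shared sheet):
# import pandas as pd
# data = pd.read_csv(csv_url)
```

If this works for the sheet, the manual export to `data/COVID-19.csv` could be skipped.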


### Covid 19 Data

In [2]:
data = pd.read_csv('data/COVID-19.csv')

data.head()

Unnamed: 0,Country/Region,Province/State,Lat,Long,Case_Type,Date,Cases,Difference,Last_Update_Date
0,Afghanistan,,33.0,65.0,Confirmed,2020-01-22 0:00:00,0,0,2020-03-11 13:39:34
1,Afghanistan,,33.0,65.0,Confirmed,2020-01-23 0:00:00,0,0,2020-03-11 13:39:34
2,Afghanistan,,33.0,65.0,Confirmed,2020-01-24 0:00:00,0,0,2020-03-11 13:39:34
3,Afghanistan,,33.0,65.0,Confirmed,2020-01-25 0:00:00,0,0,2020-03-11 13:39:34
4,Afghanistan,,33.0,65.0,Confirmed,2020-01-26 0:00:00,0,0,2020-03-11 13:39:34


This dataset is represented in a "tall" or "melted" format. We'll convert it into a "wide" format along `Case_Type`. 

For the purpose of analysis we will also normalize the timestamps to a YYYY-MM-DD date format, omit `Province/State`, `Lat`, `Long`, and `Last_Update_Date`, and standardize the column names to lower case for easier manipulation in further analysis.

#### Drop Unneeded Columns

In [3]:
clean_data = data.drop(['Province/State', 'Lat', 'Long', 'Last_Update_Date'], axis=1)

clean_data.head()

Unnamed: 0,Country/Region,Case_Type,Date,Cases,Difference
0,Afghanistan,Confirmed,2020-01-22 0:00:00,0,0
1,Afghanistan,Confirmed,2020-01-23 0:00:00,0,0
2,Afghanistan,Confirmed,2020-01-24 0:00:00,0,0
3,Afghanistan,Confirmed,2020-01-25 0:00:00,0,0
4,Afghanistan,Confirmed,2020-01-26 0:00:00,0,0


#### Rename Columns

In [4]:
clean_data = clean_data.rename(columns={
        'Country/Region': 'region',
        'Case_Type': 'case_type',
        'Date': 'date',
        'Cases': 'cumulative',
        'Difference': 'cases'
    })

clean_data.head()

Unnamed: 0,region,case_type,date,cumulative,cases
0,Afghanistan,Confirmed,2020-01-22 0:00:00,0,0
1,Afghanistan,Confirmed,2020-01-23 0:00:00,0,0
2,Afghanistan,Confirmed,2020-01-24 0:00:00,0,0
3,Afghanistan,Confirmed,2020-01-25 0:00:00,0,0
4,Afghanistan,Confirmed,2020-01-26 0:00:00,0,0


#### Lower-case `case_type` values

In [5]:
# Use the vectorized string accessor, which also tolerates missing values
clean_data['case_type'] = clean_data['case_type'].str.lower()

clean_data.head()

Unnamed: 0,region,case_type,date,cumulative,cases
0,Afghanistan,confirmed,2020-01-22 0:00:00,0,0
1,Afghanistan,confirmed,2020-01-23 0:00:00,0,0
2,Afghanistan,confirmed,2020-01-24 0:00:00,0,0
3,Afghanistan,confirmed,2020-01-25 0:00:00,0,0
4,Afghanistan,confirmed,2020-01-26 0:00:00,0,0


#### Widen Dataset by `case_type`

Since each row of our original data is uniquely identified by (region, province, date) for a given case type, we must sum over provinces within each (region, date) pair to get the total count for a region of interest.

In [6]:
clean_data = clean_data.pivot_table(
    index=['region', 'date'], 
    columns='case_type', 
    values=['cumulative', 'cases'],
    aggfunc='sum'
).reset_index()

clean_data.head()

Unnamed: 0_level_0,region,date,cumulative,cumulative,cumulative,cumulative,cases,cases,cases,cases
case_type,Unnamed: 1_level_1,Unnamed: 2_level_1,active,confirmed,deaths,recovered,active,confirmed,deaths,recovered
0,Afghanistan,2020-01-22 0:00:00,0,0,0,0,0,0,0,0
1,Afghanistan,2020-01-23 0:00:00,0,0,0,0,0,0,0,0
2,Afghanistan,2020-01-24 0:00:00,0,0,0,0,0,0,0,0
3,Afghanistan,2020-01-25 0:00:00,0,0,0,0,0,0,0,0
4,Afghanistan,2020-01-26 0:00:00,0,0,0,0,0,0,0,0
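
To see why the sum is needed, here is a minimal sketch on made-up data: one region reported as two provinces on the same date. The frame `tall` and its numbers are invented for illustration and only one value column is pivoted, so the result has plain (not multi-level) column names.

```python
import pandas as pd

# Toy data (made up): region 'A' is split across provinces 'P1' and 'P2'.
tall = pd.DataFrame({
    'region':     ['A', 'A', 'A', 'A'],
    'province':   ['P1', 'P2', 'P1', 'P2'],
    'date':       ['2020-01-22'] * 4,
    'case_type':  ['confirmed', 'confirmed', 'deaths', 'deaths'],
    'cumulative': [3, 4, 1, 0],
})

# Pivot to one row per (region, date); provinces are summed together.
wide = tall.pivot_table(
    index=['region', 'date'],
    columns='case_type',
    values='cumulative',
    aggfunc='sum'
).reset_index()

print(wide)
```

The two province rows collapse into a single row for region `A`, with `confirmed` equal to 3 + 4 and `deaths` equal to 1 + 0.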


#### Flatten Column Names

Pandas created a multi-index since we used multiple columns for our values when running the pivot operation. This will make further analysis a bit tedious. We'll flatten out the column names so that the dataset can be referenced by single column names. 

In [7]:
def process_column_name(column_tuple):
    if not column_tuple[1]:
        new_name = column_tuple[0]
    else:
        new_name = '_'.join(column_tuple)
    
    return new_name

clean_data.columns = [process_column_name(t) for t in clean_data.columns.values]

clean_data.head()

Unnamed: 0,region,date,cumulative_active,cumulative_confirmed,cumulative_deaths,cumulative_recovered,cases_active,cases_confirmed,cases_deaths,cases_recovered
0,Afghanistan,2020-01-22 0:00:00,0,0,0,0,0,0,0,0
1,Afghanistan,2020-01-23 0:00:00,0,0,0,0,0,0,0,0
2,Afghanistan,2020-01-24 0:00:00,0,0,0,0,0,0,0,0
3,Afghanistan,2020-01-25 0:00:00,0,0,0,0,0,0,0,0
4,Afghanistan,2020-01-26 0:00:00,0,0,0,0,0,0,0,0
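
As a quick sanity check of the flattening logic, we can run it on a few hand-written column tuples. The function is repeated here only so the snippet runs on its own; `example_columns` is an illustrative list, not the real column index.

```python
def process_column_name(column_tuple):
    # Same logic as the cell above: keep the first level when the
    # second level is empty, otherwise join both levels with '_'.
    if not column_tuple[1]:
        return column_tuple[0]
    return '_'.join(column_tuple)

# Illustrative tuples in the shape pandas uses for two-level columns.
example_columns = [
    ('region', ''),
    ('date', ''),
    ('cumulative', 'confirmed'),
    ('cases', 'deaths'),
]

flattened = [process_column_name(t) for t in example_columns]
print(flattened)
```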


#### Standardize Dates to YYYY-MM-DD Format

In [8]:
clean_data['date'] = pd.to_datetime(clean_data['date']).dt.date

clean_data.head()

Unnamed: 0,region,date,cumulative_active,cumulative_confirmed,cumulative_deaths,cumulative_recovered,cases_active,cases_confirmed,cases_deaths,cases_recovered
0,Afghanistan,2020-01-22,0,0,0,0,0,0,0,0
1,Afghanistan,2020-01-23,0,0,0,0,0,0,0,0
2,Afghanistan,2020-01-24,0,0,0,0,0,0,0,0
3,Afghanistan,2020-01-25,0,0,0,0,0,0,0,0
4,Afghanistan,2020-01-26,0,0,0,0,0,0,0,0


#### Save Dataset

Let's save this dataset so it can be used in downstream analysis.

In [9]:
clean_data.to_csv('data/COVID-19-Cleaned.csv', index=False)

Assuming all went well, we should see a file named `COVID-19-Cleaned.csv` in our `data` directory.

In [10]:
!ls data

COVID-19-Cleaned.csv COVID-19.csv         populations.csv
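
One thing to keep in mind downstream: CSV stores dates as plain text, so the `date` column has to be parsed again on load. The sketch below uses a small in-memory stand-in for `data/COVID-19-Cleaned.csv` (the two rows and the reduced set of columns are illustrative) so that it runs without the file.

```python
import io
import pandas as pd

# Stand-in for data/COVID-19-Cleaned.csv with a reduced set of columns.
csv_text = (
    'region,date,cumulative_confirmed,cases_confirmed\n'
    'Afghanistan,2020-01-22,0,0\n'
    'Afghanistan,2020-01-23,0,0\n'
)

# parse_dates converts the text column back into datetime values.
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
print(reloaded.dtypes)
```

With the real file, the same call would take the path `'data/COVID-19-Cleaned.csv'` in place of the `StringIO` object.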


## World Bank Data

We'll download the CSV from the World Bank URL mentioned at the top of the document. We'll perform some basic cleanup (standardizing column names) and extract global populations for the most recent year with data (2018).

There'll be some mis-matches between country names in our two datasets, so we'll also try to match up country names across datasets.

In [11]:
url = 'http://api.worldbank.org/countries/all/indicators/SP.POP.TOTL?format=csv'

world_bank_population_data = pd.read_csv(url)

world_bank_population_data.head()

Unnamed: 0,"﻿""Country Name""",Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,Unnamed: 62
0,Arab World,ARB,92197750.0,94724510.0,97334440.0,100034200.0,102832800.0,105736400.0,108758600.0,111899400.0,...,363158700.0,371443500.0,379705700.0,387907700.0,396028300.0,404024400.0,411899000.0,419790600.0,,
1,Caribbean small states,CSS,4194710.0,4274060.0,4353628.0,4432217.0,4508198.0,4580374.0,4648367.0,4712526.0,...,7022387.0,7072665.0,7123332.0,7173435.0,7222212.0,7269386.0,7314990.0,7358965.0,,
2,Central Europe and the Baltics,CEB,91401760.0,92232740.0,93009500.0,93840020.0,94715800.0,95440990.0,96146340.0,97043270.0,...,104174000.0,103935300.0,103713700.0,103496200.0,103257800.0,102994300.0,102738900.0,102511900.0,,
3,Early-demographic dividend,EAR,980085300.0,1003279000.0,1027290000.0,1052060000.0,1077621000.0,1103955000.0,1131050000.0,1158974000.0,...,2951856000.0,2994853000.0,3037663000.0,3080325000.0,3122842000.0,3165142000.0,3207189000.0,3249141000.0,,
4,East Asia & Pacific,EAS,1040958000.0,1044545000.0,1059019000.0,1084796000.0,1110214000.0,1136691000.0,1166600000.0,1195270000.0,...,2221673000.0,2236819000.0,2252047000.0,2267482000.0,2282856000.0,2298514000.0,2314202000.0,2328221000.0,,


We'll create an initial 2-column dataset with country name and the most-recent population data from 2018. 

In [12]:
clean_pop_data = pd.DataFrame({
        'region': world_bank_population_data.iloc[:, 0],
        'population': world_bank_population_data['2018']
    })

clean_pop_data

Unnamed: 0,population,region
0,4.197906e+08,Arab World
1,7.358965e+06,Caribbean small states
2,1.025119e+08,Central Europe and the Baltics
3,3.249141e+09,Early-demographic dividend
4,2.328221e+09,East Asia & Pacific
5,2.081652e+09,East Asia & Pacific (excluding high income)
6,2.056064e+09,East Asia & Pacific (IDA & IBRD countries)
7,3.417832e+08,Euro area
8,9.187936e+08,Europe & Central Asia
9,4.177973e+08,Europe & Central Asia (excluding high income)


#### Matching Country Names

Since the two datasets may use different names for the same region, it'll be good to match these up and resolve any mismatches.

The primary set of region names will be from the COVID-19 dataset. We'll find any missing countries from the World Bank dataset and repair errors if possible.

In [13]:
covid_regions = set(clean_data['region'].unique())
pop_regions = set(clean_pop_data['region'].unique())

These are the regions in the COVID-19 dataset that do not appear in the World Bank dataset:

In [14]:
covid_regions - pop_regions

{'Brunei',
 'Egypt',
 'French Guiana',
 'Hong Kong',
 'Iran',
 'Macau',
 'Mainland China',
 'Martinique',
 'Others',
 'Palestine',
 'Russia',
 'Saint Barthelemy',
 'Slovakia',
 'South Korea',
 'St. Martin',
 'Taiwan',
 'UK',
 'US',
 'Vatican City'}

These are the regions in the World Bank dataset that do not appear in our COVID-19 dataset:

In [15]:
pop_regions - covid_regions

{'American Samoa',
 'Angola',
 'Antigua and Barbuda',
 'Arab World',
 'Aruba',
 'Bahamas, The',
 'Barbados',
 'Belize',
 'Benin',
 'Bermuda',
 'Bolivia',
 'Botswana',
 'British Virgin Islands',
 'Brunei Darussalam',
 'Burundi',
 'Cabo Verde',
 'Caribbean small states',
 'Cayman Islands',
 'Central African Republic',
 'Central Europe and the Baltics',
 'Chad',
 'China',
 'Comoros',
 'Congo, Dem. Rep.',
 'Congo, Rep.',
 "Cote d'Ivoire",
 'Cuba',
 'Curacao',
 'Djibouti',
 'Dominica',
 'Early-demographic dividend',
 'East Asia & Pacific',
 'East Asia & Pacific (IDA & IBRD countries)',
 'East Asia & Pacific (excluding high income)',
 'Egypt, Arab Rep.',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Eswatini',
 'Ethiopia',
 'Euro area',
 'Europe & Central Asia',
 'Europe & Central Asia (IDA & IBRD countries)',
 'Europe & Central Asia (excluding high income)',
 'European Union',
 'Fiji',
 'Fragile and conflict affected situations',
 'French Polynesia',
 'Gabon',
 'Gambia, The',
 'Ghana'

From manual inspection, we can see mismatches occur in the COVID-19 dataset due to the use of acronyms (US/UK instead of United States/United Kingdom), short-form representations, and some omissions (Taiwan, Palestine, Vatican City and others don't appear in the World Bank Dataset). 
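
Manual inspection can be partly assisted by a simple heuristic: for each unmatched COVID-19 name, list the World Bank names that contain it as a substring. This is only a sketch; `suggest_matches` is a hypothetical helper, and it will miss acronyms such as US/UK and stems such as Slovakia/Slovak Republic, so the results still need to be checked by hand.

```python
def suggest_matches(missing, candidates):
    """Map each missing name to the candidate names that contain it."""
    suggestions = {}
    for name in sorted(missing):
        # Case-sensitive substring test, e.g. 'Egypt' in 'Egypt, Arab Rep.'
        hits = sorted(c for c in candidates if name in c)
        if hits:
            suggestions[name] = hits
    return suggestions

# Small samples taken from the two set differences shown above.
missing = {'Brunei', 'Egypt', 'Taiwan'}
candidates = {'Brunei Darussalam', 'Egypt, Arab Rep.', 'Fiji'}

suggestions = suggest_matches(missing, candidates)
print(suggestions)
```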

We will fix any mismatches we can. We'll drop regions from the World Bank dataset that do not appear in the COVID-19 dataset. Just be aware that some COVID-19 regions will have no population data when joining the two datasets.

We'll translate the names so they are standardized to the COVID-19 dataset. 

In [16]:
translation_table = {
    'United States': 'US',
    'United Kingdom': 'UK',
    'Brunei Darussalam': 'Brunei',
    'Egypt, Arab Rep.': 'Egypt',
    'Hong Kong SAR, China': 'Hong Kong',
    'Iran, Islamic Rep.': 'Iran',
    'Macao SAR, China': 'Macau',
    'China': 'Mainland China',
    'Russian Federation': 'Russia',
    'Slovak Republic': 'Slovakia',
    'Korea, Rep.': 'South Korea',
    'St. Martin (French part)': 'St. Martin'
}

def rename_region(name):
    # Return the COVID-19 dataset's name if we have a translation,
    # otherwise keep the World Bank name unchanged
    return translation_table.get(name, name)

In [17]:
clean_pop_data['region'] = clean_pop_data['region'].apply(rename_region)

#### Regions in COVID-19 Dataset Omitted from World Bank Dataset

These are the regions in the COVID-19 dataset that still have no match in our World Bank dataset after renaming:

In [18]:
pop_regions2 = set(clean_pop_data['region'].unique())

covid_regions - pop_regions2

{'French Guiana',
 'Martinique',
 'Others',
 'Palestine',
 'Saint Barthelemy',
 'Taiwan',
 'Vatican City'}

#### Drop Regions from World Bank Dataset

Let's drop any regions in the World Bank dataset that do not appear in the COVID-19 dataset.

In [19]:
# Keep only the rows whose region also appears in the COVID-19 dataset
clean_pop_data = clean_pop_data.loc[
    clean_pop_data['region'].isin(covid_regions)
]

#### Save Population Dataset 

In [20]:
clean_pop_data.to_csv('data/populations.csv', index=False, columns=['region', 'population'])

Assuming all went well, we should see a file named `populations.csv` in our `data` directory.

In [21]:
!ls data

COVID-19-Cleaned.csv COVID-19.csv         populations.csv
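
Since some COVID-19 regions have no population row, a downstream join needs to account for missing values. Here is a minimal sketch on made-up stand-ins for the two cleaned datasets (the frames `covid` and `populations` and their numbers are invented). It uses a left join with `indicator=True` to flag regions without a population.

```python
import pandas as pd

# Toy stand-ins (made-up numbers) for the two cleaned datasets.
covid = pd.DataFrame({
    'region': ['US', 'Taiwan'],
    'cumulative_confirmed': [100, 10],
})
populations = pd.DataFrame({
    'region': ['US'],
    'population': [1000.0],
})

# A left join keeps every COVID-19 row; '_merge' records where each row matched.
joined = covid.merge(populations, on='region', how='left', indicator=True)

# Regions with no population row are 'left_only' and have a NaN population.
missing_population = joined.loc[joined['_merge'] == 'left_only', 'region'].tolist()
print(missing_population)
```

Any per-capita calculation on the joined data would yield NaN for these regions unless they are filtered out or filled in from another source.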
