In [1]:
import pandas as pd

## Preprocessing 

Here we'll process the dataset provided by Johns Hopkins University in the public Google Sheet: https://docs.google.com/spreadsheets/d/1avGWWl1J19O_Zm0NGTGy2E-fOG05i4ljRfjl87P7FiA/edit?ts=5e5e9222#gid=0

The document has been exported as a CSV. 

TODO: It would be good to read the document directly from Google Sheets so that this notebook can be re-run without a manual export step.
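
One possible way to address this TODO, sketched under the assumption that the sheet stays publicly readable: Google Sheets can serve a single tab as CSV through an `export?format=csv` URL, which `pd.read_csv` can fetch directly. The helper names `sheet_csv_url` and `load_sheet` are hypothetical; the sheet ID and `gid` are copied from the link above.

```python
import pandas as pd

# Sheet ID and tab id (gid) copied from the URL above.
SHEET_ID = '1avGWWl1J19O_Zm0NGTGy2E-fOG05i4ljRfjl87P7FiA'
GID = 0


def sheet_csv_url(sheet_id, gid=0):
    """Build the CSV export URL for one tab of a Google Sheet (hypothetical helper)."""
    return (
        f'https://docs.google.com/spreadsheets/d/{sheet_id}'
        f'/export?format=csv&gid={gid}'
    )


def load_sheet(sheet_id=SHEET_ID, gid=GID):
    """Download the tab as CSV; assumes the sheet is publicly readable."""
    return pd.read_csv(sheet_csv_url(sheet_id, gid))
```

Calling `load_sheet()` would then stand in for the `pd.read_csv('data/COVID-19.csv')` call. It is not executed here because it needs network access and the sheet may have moved or changed since the export.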

In [2]:
data = pd.read_csv('data/COVID-19.csv')

data.head()

Unnamed: 0,Country/Region,Province/State,Lat,Long,Case_Type,Date,Cases,Difference,Last_Update_Date
0,Afghanistan,,33.0,65.0,Confirmed,2020-01-22 0:00:00,0,0,2020-03-11 13:39:34
1,Afghanistan,,33.0,65.0,Confirmed,2020-01-23 0:00:00,0,0,2020-03-11 13:39:34
2,Afghanistan,,33.0,65.0,Confirmed,2020-01-24 0:00:00,0,0,2020-03-11 13:39:34
3,Afghanistan,,33.0,65.0,Confirmed,2020-01-25 0:00:00,0,0,2020-03-11 13:39:34
4,Afghanistan,,33.0,65.0,Confirmed,2020-01-26 0:00:00,0,0,2020-03-11 13:39:34


This dataset is represented in a "tall" or "melted" format. We'll convert it into a "wide" format along `Case_Type`. 
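
To make the tall versus wide distinction concrete, here is a minimal sketch on made-up numbers (not taken from the real dataset). Each distinct value of `case_type` becomes its own column.

```python
import pandas as pd

# Toy data, not from the real dataset: one row per (date, case_type)
tall = pd.DataFrame({
    'date': ['2020-01-22', '2020-01-22', '2020-01-23', '2020-01-23'],
    'case_type': ['confirmed', 'deaths', 'confirmed', 'deaths'],
    'cases': [5, 1, 8, 2],
})

# Wide format: one row per date, one column per case_type
wide = tall.pivot(index='date', columns='case_type', values='cases')
print(wide)
```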

For analysis, we will also convert the timestamps in `Date` to plain YYYY-MM-DD dates, omit the `Province/State`, `Lat`, `Long`, and `Last_Update_Date` columns, and standardize the column names to lower case so they are easier to work with later.

#### Drop Unneeded Columns

In [3]:
clean_data = data.drop(['Province/State', 'Lat', 'Long', 'Last_Update_Date'], axis=1)

clean_data.head()

Unnamed: 0,Country/Region,Case_Type,Date,Cases,Difference
0,Afghanistan,Confirmed,2020-01-22 0:00:00,0,0
1,Afghanistan,Confirmed,2020-01-23 0:00:00,0,0
2,Afghanistan,Confirmed,2020-01-24 0:00:00,0,0
3,Afghanistan,Confirmed,2020-01-25 0:00:00,0,0
4,Afghanistan,Confirmed,2020-01-26 0:00:00,0,0


#### Rename Columns

In [4]:
clean_data = clean_data.rename(columns={
        'Country/Region': 'region',
        'Case_Type': 'case_type',
        'Date': 'date',
        'Cases': 'cumulative',
        'Difference': 'cases'
    })

clean_data.head()

Unnamed: 0,region,case_type,date,cumulative,cases
0,Afghanistan,Confirmed,2020-01-22 0:00:00,0,0
1,Afghanistan,Confirmed,2020-01-23 0:00:00,0,0
2,Afghanistan,Confirmed,2020-01-24 0:00:00,0,0
3,Afghanistan,Confirmed,2020-01-25 0:00:00,0,0
4,Afghanistan,Confirmed,2020-01-26 0:00:00,0,0


#### Lower-case `case_type` values

In [5]:
clean_data['case_type'] = clean_data['case_type'].str.lower()

clean_data.head()

Unnamed: 0,region,case_type,date,cumulative,cases
0,Afghanistan,confirmed,2020-01-22 0:00:00,0,0
1,Afghanistan,confirmed,2020-01-23 0:00:00,0,0
2,Afghanistan,confirmed,2020-01-24 0:00:00,0,0
3,Afghanistan,confirmed,2020-01-25 0:00:00,0,0
4,Afghanistan,confirmed,2020-01-26 0:00:00,0,0


#### Widen Dataset by `case_type`

In [6]:
clean_data = clean_data.pivot_table(
    index=['region', 'date'], 
    columns='case_type', 
    values=['cumulative', 'cases']
).reset_index()

clean_data.head()

Unnamed: 0_level_0,region,date,cumulative,cumulative,cumulative,cumulative,cases,cases,cases,cases
case_type,Unnamed: 1_level_1,Unnamed: 2_level_1,active,confirmed,deaths,recovered,active,confirmed,deaths,recovered
0,Afghanistan,2020-01-22 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Afghanistan,2020-01-23 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Afghanistan,2020-01-24 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Afghanistan,2020-01-25 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Afghanistan,2020-01-26 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
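
One thing worth checking at this step: `pivot_table` aggregates rows that share the same index and column keys, and its default `aggfunc` is the mean. Since `Province/State` was dropped earlier, a region reported across several provinces may contribute several rows per date and case type, in which case the pivoted values would be per-province averages rather than regional totals. The toy frame below (made-up numbers, not from the dataset) illustrates the difference. Whether `aggfunc='sum'` is the right choice here depends on how the source rows are structured.

```python
import pandas as pd

# Made-up rows: two provinces of the same region reporting on the same date
toy = pd.DataFrame({
    'region': ['X', 'X'],
    'date': ['2020-01-22', '2020-01-22'],
    'case_type': ['confirmed', 'confirmed'],
    'cases': [10, 30],
})

# Default aggregation (mean) averages the two provinces
mean_wide = toy.pivot_table(
    index=['region', 'date'], columns='case_type', values='cases'
)

# An explicit sum adds them up into a regional total
sum_wide = toy.pivot_table(
    index=['region', 'date'], columns='case_type', values='cases', aggfunc='sum'
)

print(mean_wide['confirmed'].iloc[0])  # 20.0
print(sum_wide['confirmed'].iloc[0])   # 40
```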


#### Flatten Column Names

Pandas created a `MultiIndex` for the columns because we passed more than one column as `values` in the pivot operation. This makes further analysis a bit tedious, so we'll flatten the column names so that each column can be referenced by a single name such as `cases_confirmed`.

In [7]:
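def process_column_name(column_tuple):
    """Join a (level_0, level_1) column tuple with '_', skipping an empty second level."""
    if not column_tuple[1]:
        # Index columns such as ('region', '') keep their original name
        return column_tuple[0]
    return '_'.join(column_tuple)

# Replace the two-level column index with flat names
clean_data.columns = [process_column_name(t) for t in clean_data.columns.values]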

clean_data.head()

Unnamed: 0,region,date,cumulative_active,cumulative_confirmed,cumulative_deaths,cumulative_recovered,cases_active,cases_confirmed,cases_deaths,cases_recovered
0,Afghanistan,2020-01-22 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Afghanistan,2020-01-23 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Afghanistan,2020-01-24 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Afghanistan,2020-01-25 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Afghanistan,2020-01-26 0:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Standardize Dates to YYYY-MM-DD Format

In [8]:
clean_data['date'] = pd.to_datetime(clean_data['date']).dt.date

clean_data.head()

Unnamed: 0,region,date,cumulative_active,cumulative_confirmed,cumulative_deaths,cumulative_recovered,cases_active,cases_confirmed,cases_deaths,cases_recovered
0,Afghanistan,2020-01-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Afghanistan,2020-01-23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Afghanistan,2020-01-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Afghanistan,2020-01-25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Afghanistan,2020-01-26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
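
A small sketch of what this conversion does, using sample strings shaped like the raw `date` column: `pd.to_datetime` parses the timestamps, and `.dt.date` keeps only the calendar date as Python `date` objects, so the column no longer has a datetime dtype. If a datetime dtype turns out to be more convenient downstream, `.dt.normalize()` is an alternative that sets the time component to midnight instead. That preference is an assumption about later analysis, not something this notebook requires.

```python
import datetime

import pandas as pd

# Sample values shaped like the raw `date` column
raw = pd.Series(['2020-01-22 0:00:00', '2020-01-23 0:00:00'])
parsed = pd.to_datetime(raw)

as_dates = parsed.dt.date            # Python date objects
as_midnight = parsed.dt.normalize()  # still datetime64, time set to 00:00:00

print(as_dates.iloc[0])
print(as_midnight.iloc[0])
```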


#### Save Dataset

Let's save this dataset so it can be used in downstream analysis.

In [9]:
clean_data.to_csv('data/COVID-19-Cleaned.csv', index=False)

Assuming all went well, we should see a file named `COVID-19-Cleaned.csv` in our `data` directory.

In [10]:
!ls data

COVID-19             COVID-19-Cleaned.csv COVID-19.csv

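
Listing the directory confirms that the file exists, but not that its contents survived the round trip. A complementary check, sketched here on a stand-in frame with made-up values and a temporary directory so it does not touch `data/`, is to read the CSV back and compare columns and row count.

```python
import os
import tempfile

import pandas as pd

# Stand-in frame using a few of the cleaned column names; values are made up
frame = pd.DataFrame({
    'region': ['X'],
    'date': ['2020-01-22'],
    'cases_confirmed': [3.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'COVID-19-Cleaned.csv')
    frame.to_csv(path, index=False)

    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(path)

# With index=False, no extra index column should appear after reloading
print(list(reloaded.columns) == list(frame.columns))
print(len(reloaded) == len(frame))
```

The same comparison could be run against `clean_data` and `data/COVID-19-Cleaned.csv` in this notebook.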