# Preparing the COVID-19 Dataset

## This first notebook loads the data from Kaggle and prepares it for an Exploratory Data Analysis

First, let's import all the data and make sure that there are no missing values and that everything is merged and aligned. I want to keep this as simple and robust as possible, so that every new batch of data that comes in will be ready for analysis.

### Raw Data Sources
* Main = covid_19_data.csv
* Confirmed cases = time_series_covid_19_confirmed.csv
* Death cases = time_series_covid_19_deaths.csv
* Recovered cases = time_series_covid_19_recovered.csv

Per case report:
* Line list = COVID19_line_list_data.csv
* Open line list = COVID19_open_line_list.csv

### Changes
* 19-03-2020: Start of project

In [1]:
import os
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')
%matplotlib inline

### Loading files into DataFrames

In [2]:
# Base URL of the raw CSV files in the project repository
BASE_URL = 'https://raw.githubusercontent.com/dylanye/Kaggle-COVID19-EIM-Functions/master/data/01_raw/'
# Main table; the first CSV column is used as the index
df = pd.read_csv(BASE_URL + 'covid_19_data.csv', index_col=0)
# Wide time series: one row per location, one column per date
confirmed_df = pd.read_csv(BASE_URL + 'time_series_covid_19_confirmed.csv')
deaths_df = pd.read_csv(BASE_URL + 'time_series_covid_19_deaths.csv')
recovered_df = pd.read_csv(BASE_URL + 'time_series_covid_19_recovered.csv')

In [13]:
# Parse the two date columns from strings into datetime64 values
df['Last Update'] = pd.to_datetime(df['Last Update'])
df['ObservationDate'] = pd.to_datetime(df['ObservationDate'])

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6722 entries, 1 to 6722
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   ObservationDate  6722 non-null   datetime64[ns]
 1   Province/State   3956 non-null   object        
 2   Country/Region   6722 non-null   object        
 3   Last Update      6722 non-null   datetime64[ns]
 4   Confirmed        6722 non-null   float64       
 5   Deaths           6722 non-null   float64       
 6   Recovered        6722 non-null   float64       
dtypes: datetime64[ns](2), float64(3), object(2)
memory usage: 420.1+ KB
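
The `df.info()` output above lists 3956 non-null values in `Province/State` out of 6722 rows, so that column has gaps. Here is a minimal sketch of how the missing values could be counted and filled. It uses a small made-up frame in place of `df`. The helper name `report_missing` and the placeholder label `'Unknown'` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def report_missing(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, keeping only columns with gaps."""
    counts = frame.isna().sum()
    return counts[counts > 0]

# Toy stand-in for the main table (values are illustrative only)
sample = pd.DataFrame({
    'Province/State': ['Hubei', None, None],
    'Country/Region': ['Mainland China', 'Japan', 'Thailand'],
    'Confirmed': [444.0, 2.0, 3.0],
})

missing = report_missing(sample)

# Fill only the column that is expected to have gaps; 'Unknown' is an assumed label
filled = sample.fillna({'Province/State': 'Unknown'})
```

Filling a single named column, rather than calling `fillna` on the whole frame, keeps unexpected gaps in the numeric columns visible instead of silently masking them.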


In [51]:
confirmed_df.head()

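| | Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 3/5/20 | 3/6/20 | 3/7/20 | 3/8/20 | 3/9/20 | 3/10/20 | 3/11/20 | 3/12/20 | 3/13/20 | 3/14/20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | Thailand | 15.0 | 101.0 | 2 | 3 | 5 | 7 | 8 | 8 | ... | 47 | 48 | 50 | 50 | 50 | 53 | 59 | 70 | 75 | 82 |
| 1 | | Japan | 36.0 | 138.0 | 2 | 1 | 2 | 2 | 4 | 4 | ... | 360 | 420 | 461 | 502 | 511 | 581 | 639 | 639 | 701 | 773 |
| 2 | | Singapore | 1.2833 | 103.8333 | 0 | 1 | 3 | 3 | 4 | 5 | ... | 117 | 130 | 138 | 150 | 150 | 160 | 178 | 178 | 200 | 212 |
| 3 | | Nepal | 28.1667 | 84.25 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | | Malaysia | 2.5 | 112.5 | 0 | 0 | 0 | 3 | 4 | 4 | ... | 50 | 83 | 93 | 99 | 117 | 129 | 149 | 149 | 197 | 238 |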


In [36]:
# Column labels: four metadata columns followed by one column per date
cols = confirmed_df.keys()

### Get all the dates since the outbreak 

In [49]:
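# Keep only the date columns (position 4 onward); assumes all three files share the same date labels
confirmed = confirmed_df.loc[:, cols[4]:cols[-1]]
deaths = deaths_df.loc[:, cols[4]:cols[-1]]
recovered = recovered_df.loc[:, cols[4]:cols[-1]]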
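
The positional slice above relies on the first four columns being metadata. As a hedged alternative, the sketch below names the metadata columns explicitly and reshapes a wide table into long format (one row per location and date), which tends to be easier to merge and align. The constant `ID_COLS`, the helper `to_long`, and the toy frame are assumptions for illustration. The toy values are copied from the first two date columns of the `confirmed_df.head()` output.

```python
import pandas as pd

# Metadata columns of the wide time-series files, as seen in the head() output
ID_COLS = ['Province/State', 'Country/Region', 'Lat', 'Long']

def to_long(wide: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape a wide time-series frame into one row per location and date."""
    long_df = wide.melt(id_vars=ID_COLS, var_name='Date', value_name=value_name)
    # Column labels such as '1/22/20' are month/day/two-digit-year strings
    long_df['Date'] = pd.to_datetime(long_df['Date'], format='%m/%d/%y')
    return long_df

# Toy wide frame with two locations and two dates
toy_wide = pd.DataFrame({
    'Province/State': [None, None],
    'Country/Region': ['Thailand', 'Japan'],
    'Lat': [15.0, 36.0],
    'Long': [101.0, 138.0],
    '1/22/20': [2, 2],
    '1/23/20': [3, 1],
})

toy_long = to_long(toy_wide, 'Confirmed')
```

Selecting by name means a new date column appended to the source file is picked up automatically, and a reordered metadata column does not shift the slice.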

### Extract worldwide totals for each date since the outbreak

In [43]:
dates = confirmed.keys()
world_cases = []
total_deaths = []
mortality_rate = []
total_recovered = []

# For each date, sum over all locations to get worldwide totals
for i in dates:
    confirmed_sum = confirmed[i].sum()
    death_sum = deaths[i].sum()
    recovered_sum = recovered[i].sum()
    world_cases.append(confirmed_sum)
    total_deaths.append(death_sum)
    # Mortality rate here means cumulative deaths divided by cumulative confirmed cases
    mortality_rate.append(death_sum/confirmed_sum)
    total_recovered.append(recovered_sum)
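
The per-date loop above can also be written without an explicit loop, since pandas can sum every date column at once. The following is a sketch of that equivalent under stated assumptions, not the notebook's own code. The `toy_` frames are made up for illustration: the confirmed counts come from the first two date columns of the `head()` output, and the death counts are invented.

```python
import pandas as pd

# Toy date-only frames: one row per location, one column per date
toy_confirmed = pd.DataFrame({'1/22/20': [2, 2, 0], '1/23/20': [3, 1, 1]})
toy_deaths = pd.DataFrame({'1/22/20': [0, 0, 0], '1/23/20': [1, 0, 0]})  # invented values

# Column-wise sums give one worldwide total per date
toy_world_cases = toy_confirmed.sum(axis=0)
toy_total_deaths = toy_deaths.sum(axis=0)

# Element-wise division aligns on the date labels
toy_mortality_rate = toy_total_deaths / toy_world_cases

# Column vector of shape (n_dates, 1), matching the reshape(-1, 1) layout
toy_world_cases_col = toy_world_cases.to_numpy().reshape(-1, 1)
```

Because the division aligns on column labels, a date missing from one of the frames would show up as `NaN` in the result rather than being paired with the wrong date.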

In [67]:
# Convert each list to a column vector of shape (n_dates, 1)
world_cases = np.array(world_cases).reshape(-1, 1)
total_deaths = np.array(total_deaths).reshape(-1, 1)
total_recovered = np.array(total_recovered).reshape(-1, 1)
mortality_rate = np.array(mortality_rate).reshape(-1, 1)