# COVID-19 Data and Pandas Time Series

## Objectives
- Download and clean a real-world dataset
- Use the datetime datatype to resample a time series
- Plot different timeframes of a time series
- Use the datetime datatype to group data by days of the week

In [None]:
# Load libraries
import pandas as pd
import matplotlib.pyplot as plt

## Locate the Data
- This dataset is from the Center for Systems Science and Engineering at Johns Hopkins University.
- This page is a table of links to CSV files; take a look: <https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports>
- We begin by harvesting the CSV file names listed on this page and cleaning the data.

In [None]:
# Get the contents of the HTML table from the GitHub Page
csv_files = pd.read_html('https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports')[0]
csv_files.head()

In [None]:
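# Remove rows that don't contain a filename
# (endswith matches the literal extension; str.contains('.csv') would treat '.' as a regex wildcard)
csv_files = csv_files[csv_files['Name'].str.endswith('.csv', na=False)]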
csv_files.head()
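
As a small aside on the filter above: `str.contains()` interprets its pattern as a regular expression by default, so a dot matches any character. The sketch below uses made-up file names (they are assumptions, not the actual repository listing) to compare three ways of matching the extension.

```python
import pandas as pd

# Hypothetical file names, for illustration only
names = pd.Series(['01-22-2020.csv', 'README.md', 'my_csv_notes.txt'])

# Default behaviour: the pattern is a regex, so '.' matches any character
regex_mask = names.str.contains('.csv')

# Literal substring match: the dot must be an actual dot
literal_mask = names.str.contains('.csv', regex=False)

# Strictest option: the name must end with the extension
suffix_mask = names.str.endswith('.csv')

print(pd.DataFrame({'name': names, 'regex': regex_mask,
                    'literal': literal_mask, 'suffix': suffix_mask}))
```

Only the regex version accepts `my_csv_notes.txt`, because `_csv` satisfies the pattern "any character followed by csv".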

In [None]:
# Select the name column
csv_files = list(csv_files['Name'])
csv_files

## Download the data
- We now loop through the file names and download one CSV file for each day of collected data.
- With each iteration, we append the CSV filename to the end of a base URL to get a URL that points to that file.
- We use `read_csv()` and the file URL to download each CSV file and append the resulting dataframe to the `covid_days` list.
- `dtype` explicitly defines the datatype of each listed column for pandas.
  - `Int64` (capital "I") is pandas' nullable integer type; unlike NumPy's `int64`, it can hold missing values without converting the column to floats.

In [None]:
base_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'

In [None]:
covid_days = []
for day in csv_files:
    # Status update
    print('Downloading ' + day)
    covid_days.append(pd.read_csv(base_url + day,
                                  dtype={'Confirmed':'Int64', 'Deaths': 'Int64',
                                          'Recovered': 'Int64', 'FIPS': 'Int64', 'Active': 'Int64'}))
print('Done downloading.')
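
To see what the `Int64` dtype changes, here is a minimal sketch that reads a tiny in-memory CSV instead of a real daily report. The CSV text and its values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row report in which one death count is missing
csv_text = "Country_Region,Confirmed,Deaths\nA,10,\nB,3,1\n"

# Without dtype, a column with a missing value falls back to float64
default_df = pd.read_csv(io.StringIO(csv_text))

# With the nullable Int64 dtype, the column stays integer and the gap becomes <NA>
nullable_df = pd.read_csv(io.StringIO(csv_text),
                          dtype={'Confirmed': 'Int64', 'Deaths': 'Int64'})

print(default_df.dtypes)
print(nullable_df.dtypes)
print(nullable_df)
```

Keeping counts as integers avoids values such as `1.0` deaths in later output, while missing entries are still skipped by aggregations like `sum()`.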

## Clean the data - create a new column indicating the day
- Each day has its own file, named according to the date.
- We want to combine all the data into one dataframe.
- To keep track of which day each row belongs to, we add a new column called `Day` to each dataframe in our list and populate every entry in that column with the date from the filename.
- `enumerate()` provides each item together with its index (sometimes we need both in a loop).

In [None]:
for i, day in enumerate(csv_files):
    # Drop the '.csv' extension; str.strip('.csv') would remove a set of characters, not a suffix
    covid_days[i]['Day'] = day[:-len('.csv')]
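
The labelling step can be tried on a small scale without downloading anything. In this sketch, `file_names` and `frames` are hypothetical stand-ins for `csv_files` and `covid_days`, and the last lines show why `str.strip()` is a risky way to remove a file extension in general.

```python
import pandas as pd

# Hypothetical stand-ins for csv_files and covid_days
file_names = ['01-22-2020.csv', '01-23-2020.csv']
frames = [pd.DataFrame({'Confirmed': [1, 2]}),
          pd.DataFrame({'Confirmed': [3]})]

for i, name in enumerate(file_names):
    # Slice off the '.csv' suffix and broadcast the date to every row
    frames[i]['Day'] = name[:-len('.csv')]

print(frames[0])

# str.strip() treats its argument as a set of characters to remove from both ends.
# It happens to work for names that start and end with digits, but not in general:
stripped_date = '01-22-2020.csv'.strip('.csv')   # '01-22-2020'
stripped_word = 'sales.csv'.strip('.csv')        # 'ale', not 'sales'
print(stripped_date, stripped_word)
```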

In [None]:
# The first day in our dataset
covid_days[0]

## Clean the data - combine all days into one dataframe
- `concat()` combines multiple dataframes

In [None]:
covid_data = pd.concat(covid_days, axis=0)
# Write the combined raw data to a CSV file for later use
# (the `data_raw` directory must already exist)
covid_data.to_csv('data_raw/covid-days.csv')

In [None]:
# Let's take a peek
covid_data.head()
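
When the dataframes do not share exactly the same columns, `concat()` keeps the union of all columns and fills the gaps with missing values. The two tiny frames below are invented to illustrate this. `ignore_index=True` is an optional extra that is not used in the cell above; it renumbers the rows instead of repeating each file's own index.

```python
import pandas as pd

# Hypothetical early and late reports with differently named country columns
early = pd.DataFrame({'Country/Region': ['A'], 'Confirmed': [1]})
late = pd.DataFrame({'Country_Region': ['A'], 'Confirmed': [5]})

# Stack the rows; columns are aligned by name and missing cells become NaN
combined = pd.concat([early, late], axis=0, ignore_index=True)
print(combined)
```

Each row has a value in only one of the two country columns.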

## Get to know our data

In [None]:
# How many entries do we have?
covid_data.shape

In [None]:
# What columns are in our dataset?
covid_data.columns

## Clean the data - repeat columns
- The dataset's maintainers added and renamed columns as time went on.
- As a result, combining older data with newer data produced several columns that contain the same information under different names.
- So long as these columns don't have overlapping entries, we can safely combine them.

In [None]:
# Are `Country_Region` and `Country/Region` safe to combine? Yes: no row has both filled in
covid_data[covid_data['Country_Region'].notnull() & covid_data['Country/Region'].notnull()]

In [None]:
# Merge entries from `Country/Region` into `Country_Region`
covid_data['Country_Region'] = covid_data['Country_Region'].where(covid_data['Country_Region'].notnull(), covid_data['Country/Region'])

In [None]:
covid_data.head()

- We repeat this process for other similar columns

In [None]:
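# Check whether any row has both `Last Update` and `Last_Update` filled in
covid_data[covid_data['Last Update'].notnull() & covid_data['Last_Update'].notnull()]
# Merge entries from `Last Update` into `Last_Update`
covid_data['Last_Update'] = covid_data['Last_Update'].where(covid_data['Last_Update'].notnull(), covid_data['Last Update'])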
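
The check-then-merge pattern used for these duplicate columns can be sketched on a toy dataframe. The values are invented, and `fillna()` is shown only as an equivalent alternative to `where()`, not as what the notebook uses.

```python
import pandas as pd

# Hypothetical frame with the same information split across two columns
df = pd.DataFrame({
    'Country_Region': [None, 'Italy', None],
    'Country/Region': ['US', None, None],
})

# Step 1: look for rows where both columns are filled in (none expected)
overlap = df[df['Country_Region'].notnull() & df['Country/Region'].notnull()]

# Step 2: keep Country_Region where present, otherwise take Country/Region
merged_where = df['Country_Region'].where(df['Country_Region'].notnull(),
                                          df['Country/Region'])

# Alternative spelling of the same idea
merged_fillna = df['Country_Region'].fillna(df['Country/Region'])

print(len(overlap))
print(merged_where)
```

Rows that are empty in both columns stay missing after the merge.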

## Clean the data - finishing touches
- We remove the columns we don't need for this exercise.
- We convert the `Day` column to pandas' `datetime` type.
  - This is a very powerful datatype for time series.

In [None]:
# Get rid of unnecessary columns by selecting the ones we want
# (.copy() avoids a SettingWithCopyWarning when we modify `Day` below)
covid_data = covid_data[['Country_Region', 'Day', 'Confirmed', 'Deaths', 'Recovered', 'Active']].copy()

In [None]:
# Convert `Day` to datetime (the file names follow the MM-DD-YYYY pattern)
covid_data['Day'] = pd.to_datetime(covid_data['Day'], format='%m-%d-%Y')

In [None]:
# Let's have a look
covid_data

## Datetime and time series
- Datetime data types have a lot of handy features.
- We can find the length of a time span by subtracting two datetime objects.

In [None]:
covid_data['Day'].max() - covid_data['Day'].min()
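
Subtracting two datetime values yields a `Timedelta`. A minimal sketch with two made-up dates, written in the same MM-DD-YYYY style as the file names:

```python
import pandas as pd

# Hypothetical first and last days of a dataset
days = pd.to_datetime(pd.Series(['01-22-2020', '03-10-2020']), format='%m-%d-%Y')

# The difference between two Timestamps is a Timedelta
span = days.max() - days.min()

print(span)        # full Timedelta
print(span.days)   # whole days as a plain integer
```

The `.days` attribute is convenient when the span is needed as a number, for example to compute an average per day.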

- Let's narrow the data again and look at confirmed cases in three countries that have seen heavy media coverage

In [None]:
three_countries = covid_data[covid_data['Country_Region'].isin(['Mainland China', 'US', 'Italy'])]
three_countries

- `pivot_table` (like in Excel) is like `pivot`, but it accepts an `aggfunc` parameter that defines how we combine values.
  - Notice that each country has multiple entries for each date. We will sum the values for each date in the `Confirmed` column.

In [None]:
confirmed = three_countries.pivot_table(index='Day', columns='Country_Region', values='Confirmed', aggfunc='sum')
confirmed
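
The difference between `pivot` and `pivot_table` shows up as soon as a date and country pair occurs more than once. The toy data below is invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data: the US has two entries on the first day
long_df = pd.DataFrame({
    'Day': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-01', '2020-03-02']),
    'Country_Region': ['US', 'US', 'Italy', 'US'],
    'Confirmed': [10, 5, 7, 20],
})

# pivot() cannot reshape when an index/column pair is duplicated
try:
    long_df.pivot(index='Day', columns='Country_Region', values='Confirmed')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() resolves duplicates with the aggregation we choose
wide = long_df.pivot_table(index='Day', columns='Country_Region',
                           values='Confirmed', aggfunc='sum')
print(pivot_failed)
print(wide)
```

Date and country combinations with no rows at all, such as Italy on the second day here, appear as `NaN` in the result.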

- By setting the column (Day) with the datetime datatype as the index, time series become very convenient. Let's plot a few.

In [None]:
# Drop rows with NaN values
time_series = confirmed.dropna(axis='rows')
# Plot US confirmed cases
time_series['US'].plot();

In [None]:
# Plot all three countries' confirmed cases
time_series.plot();

- Now datetime comes into its own ...

In [None]:
# Plot the time series up to and including March 9th
time_series.loc[:'2020-03-09'].plot();
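
With a datetime index, rows can be selected using plain date strings, and both ends of the slice are included. A small sketch on an invented five-day series:

```python
import pandas as pd

# Hypothetical daily series from March 7th to March 11th
idx = pd.date_range('2020-03-07', periods=5, freq='D')
ts = pd.Series([1, 2, 3, 4, 5], index=idx)

# Everything up to and including March 9th
up_to_9th = ts.loc[:'2020-03-09']

# A closed window: both endpoints are part of the result
window = ts.loc['2020-03-08':'2020-03-10']

print(up_to_9th)
print(window)
```

This differs from ordinary positional slicing in Python, where the end of a slice is excluded.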

- This shows total cases. Let's look at new cases.
- `diff()` computes the difference between each row and the previous one, turning cumulative totals into day-to-day changes.

In [None]:
new_cases = time_series.diff()
new_cases
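
What `diff()` does is easiest to see on a short series. The cumulative counts here are invented.

```python
import pandas as pd

# Hypothetical cumulative case counts over four days
cumulative = pd.Series([100, 130, 130, 190],
                       index=pd.date_range('2020-03-01', periods=4, freq='D'))

# Each value minus the previous one; the first row has no predecessor
daily_new = cumulative.diff()
print(daily_new)
```

The first entry is `NaN` because there is nothing to subtract from it, and a day on which the total does not change shows up as `0`.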

In [None]:
new_cases['2020-02-01':'2020-03-10'].plot();

- We can change the granularity of our step.
- Let's plot by week instead of day.
- `resample()` lets us change the step size and define how we combine values.
  - In this case, we look at the max new cases per week.

In [None]:
new_cases['2020-02-01':'2020-03-10'].resample("W").max().plot();
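
To make the weekly bins concrete, this sketch resamples ten invented daily values. It assumes the default meaning of the `"W"` frequency, in which weeks end on Sunday and each bin is labelled with its last day.

```python
import pandas as pd

# Hypothetical daily values starting on Monday, March 2nd, 2020
daily = pd.Series([3, 8, 5, 2, 9, 4, 1, 7, 6, 10],
                  index=pd.date_range('2020-03-02', periods=10, freq='D'))

# Largest daily value in each week
weekly_max = daily.resample('W').max()

# The same bins with a different aggregation
weekly_sum = daily.resample('W').sum()

print(weekly_max)
print(weekly_sum)
```

The second bin covers only three days of data, so partial weeks at the edges of a time window should be read with care.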

- We can use datetime to group data by days of the week

In [None]:
# Get a subset of the data and treat the dates as a normal column instead of an index
new_cases = new_cases['2020-02-01':'2020-03-10'].reset_index()
new_cases.head()

In [None]:
# Combine the country columns
new_cases = new_cases.melt(id_vars='Day')
# Fix data types
new_cases['value'] = new_cases['value'].astype('Int64')
# Split, apply, combine:
# Group by day of the week then the country,
# finding averages of new cases for each category
week_days = new_cases.groupby([new_cases['Day'].dt.weekday, 'Country_Region'])['value'].mean()
week_days
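
The melt-and-group steps can be followed on a small invented table. In this sketch `var_name` is passed explicitly so the example does not depend on the name of the column index. Weekday numbers from `.dt.weekday` run from 0 (Monday) to 6 (Sunday).

```python
import pandas as pd

# Hypothetical wide table: eight days starting on Monday, March 2nd, 2020
wide = pd.DataFrame(
    {'Italy': [1, 2, 3, 4, 5, 6, 7, 8],
     'US': [10, 20, 30, 40, 50, 60, 70, 80]},
    index=pd.date_range('2020-03-02', periods=8, freq='D', name='Day'),
)

# Wide to long: one row per (day, country) pair
long_df = wide.reset_index().melt(id_vars='Day', var_name='Country_Region')

# Split by weekday and country, then average each group
by_weekday = long_df.groupby(
    [long_df['Day'].dt.weekday, 'Country_Region'])['value'].mean()

print(by_weekday.head())
```

Monday is the only weekday that occurs twice in this sample, so its means (4.5 for Italy, 45.0 for the US) average two values, while the other days each hold a single value.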

In [None]:
# Final plot of mean number of new cases by day and country
week_days.plot(kind='bar', figsize=(10, 5));

## Objectives
- Download and clean a real-world dataset
- Use the datetime datatype to resample a time series
- Plot different timeframes of a time series
- Use the datetime datatype to group data by days of the week