## World and Country Covid-19 Data

This notebook shows Covid-19 case data from Our World in Data, which offers several datasets in different formats.

The data used here comes from JHU (Johns Hopkins University) and has one record for each country and date. The stable URL is    
<http://covid.ourworldindata.org/data/jhu/full_data.csv>


### Download the Covid Dataset (As Needed) 

Run this cell only if you do not have up-to-date data or want a different dataset.  

In [1]:
# Use wget to download the data file.
# Unfortunately, GitHub doesn't send the HTTP Last-Modified header,
# so wget always downloads the file, even if it's identical to the local copy.
data_url = "http://covid.ourworldindata.org/data/jhu/full_data.csv"
! [ -d data ] || mkdir data
# -N use timestamps for conditional get, -nv non-verbose, -t retries
! cd data && wget -nv -N -t 5  $data_url

Last-modified header missing -- time-stamps turned off.
2021-05-27 11:51:42 URL:http://covid.ourworldindata.org/data/jhu/full_data.csv [5787261] -> "full_data.csv" [1]
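
Because the server sends no Last-Modified header, `wget -N` cannot skip an unchanged file. One possible workaround is to check the age of the local copy before running the download cell. This is only a sketch: the helper name `needs_download` and the 12-hour threshold are assumptions for illustration, and it relies on the file's modification time reflecting when it was downloaded.

```python
import time
from pathlib import Path

def needs_download(path, max_age_hours=12.0):
    """Return True if `path` is missing or older than `max_age_hours`.

    Hypothetical helper, not part of the original notebook.
    """
    p = Path(path)
    if not p.exists():
        return True
    # Age of the local file in seconds, based on its modification time
    age_seconds = time.time() - p.stat().st_mtime
    return age_seconds > max_age_hours * 3600

if needs_download("data/full_data.csv"):
    print("Local copy is missing or stale; run the download cell.")
else:
    print("Local copy is recent; the download can be skipped.")
```
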


### Read the Data and Describe It (Required in order to run other cells)

Use the date field as the DataFrame index, but also keep the 'date' column.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read the data and show basic info

covid = pd.read_csv('data/full_data.csv', 
                     parse_dates=['date'])
# use the 'date' as index
covid.index = pd.to_datetime(covid['date'])

# describe the data
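def asdate(timestamp):
    """Return only the date part of a timestamp such as 2020-02-24T00:00:00."""
    s = str(timestamp)
    # find() returns -1 when there is no "T"; index() would raise ValueError
    k = s.find("T")
    return s[0:k] if k > 0 else s
print(f"Dataset has {len(covid):,d} records")
print("Start date  ", asdate(covid.index.values[0]))
print("End date    ", asdate(covid.index.values[-1]))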
# Show columns in DataFrame, for development
#print("")
#covid.info()

Dataset has 88,543 records
Start date   2020-02-24
End date     2021-05-25
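
The dates printed above are those of the first and last rows. If the file is ordered by location and then by date (an assumption about the file layout, not something checked here), those rows belong to the first and last countries, so they need not span the whole dataset. The sketch below uses a small synthetic frame with made-up country names to show an alternative: `set_index(..., drop=False)` keeps 'date' as a column, and `index.min()` / `index.max()` give the overall date range.

```python
import pandas as pd

# Synthetic stand-in for full_data.csv, sorted by location and then date
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-24", "2020-02-25",
                            "2020-01-22", "2020-01-23"]),
    "location": ["Country A", "Country A", "Country B", "Country B"],
    "new_cases": [1, 0, 0, 2],
})

# drop=False keeps 'date' as a regular column as well as the index
df = df.set_index("date", drop=False)

first_row_date = df.index[0]                   # date of the first row only
start, end = df.index.min(), df.index.max()    # range over all rows
print("First row  ", first_row_date.strftime("%Y-%m-%d"))
print("Start date ", start.strftime("%Y-%m-%d"))
print("End date   ", end.strftime("%Y-%m-%d"))
```
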


### Create Plots for a Single Country

Specify the country name exactly as it appears in the "location" column of the dataset,
and the number of days to show in the plot.

The daily values are noisy.  The JHU dataset contains `weekly_` attributes that sum the last 7 days of data. Compute and show moving
averages from those (instead of directly computing moving averages from daily values).
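
To see why dividing a weekly sum by 7 gives a moving average, here is a small check on synthetic data. It assumes a `weekly_` column is a trailing 7-day sum of the daily values, as described above; the series and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts for 30 days
rng = np.random.default_rng(0)
daily = pd.Series(rng.integers(0, 50, size=30),
                  index=pd.date_range("2021-01-01", periods=30, freq="D"))

# Stand-in for a `weekly_` column: sum over the last 7 days
weekly = daily.rolling(window=7).sum()

avg_from_weekly = weekly / 7                     # approach used in this notebook
avg_direct = daily.rolling(window=7).mean()      # moving average of daily values

# Compare the two on the days where a full 7-day window is available
same = np.allclose(avg_from_weekly.dropna(), avg_direct.dropna())
print("Weekly sum / 7 equals 7-day moving average:", same)
```
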

In [3]:
# Specify the name of the country to show data for.
# For USA use:
# country = 'United States'
country = 'Australia'

# Number of most recent days to show in plot
ndays = 150

cdata = covid[covid['location']==country][-ndays:]
# 'new_cases', 'new_deaths' are very noisy
# 'weekly_cases', 'weekly_deaths' are smoother
plt.figure(figsize=[10,6])   # [width, height] in inches
plt.title("Daily Covid Cases in "+country)
plt.bar(cdata.index, cdata['new_cases'], color='gray')
plt.plot(cdata['weekly_cases']/7, color='blue')
plt.grid(axis='y')
plt.tight_layout()   # adjust padding so title and labels fit in the figure
plt.show()


In [4]:
# Plot of daily deaths
plt.figure(figsize=[10,6])   # [width, height] in inches
plt.title("Covid Deaths in "+country)
plt.bar(cdata.index, cdata['new_deaths'], color='gray')
plt.plot(cdata['weekly_deaths']/7, color='red')
plt.grid(axis='y')
plt.tight_layout()   # adjust padding so title and labels fit in the figure
plt.show()

## Vaccinations

Download the `vaccinations.csv` file as needed:

In [4]:
vaccine_url = "http://covid.ourworldindata.org/data/vaccinations/vaccinations.csv"
! [ -d data ] || mkdir data
# -N use timestamps for conditional get, -nv non-verbose, -t retries
! cd data && wget -nv -N -t 5  $vaccine_url

Last-modified header missing -- time-stamps turned off.
2021-05-27 12:06:02 URL:http://covid.ourworldindata.org/data/vaccinations/vaccinations.csv [1363789] -> "vaccinations.csv" [1]


Read the data and describe it.  There is one line for each country and date, resulting in thousands of records.

In [9]:
# Read the data and describe it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read the data and show basic info

vacc_all = pd.read_csv('data/vaccinations.csv', 
                     parse_dates=['date'])

# describe the data
print(f"Dataset has {len(vacc_all):,d} records")
print("")
vacc_all.info()

Dataset has 22,196 records

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22196 entries, 0 to 22195
Data columns (total 12 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   location                             22196 non-null  object        
 1   iso_code                             22196 non-null  object        
 2   date                                 22196 non-null  datetime64[ns]
 3   total_vaccinations                   13374 non-null  float64       
 4   people_vaccinated                    12615 non-null  float64       
 5   people_fully_vaccinated              9900 non-null   float64       
 6   daily_vaccinations_raw               11396 non-null  float64       
 7   daily_vaccinations                   21970 non-null  float64       
 8   total_vaccinations_per_hundred       13374 non-null  float64       
 9   people_vaccinated_per_hundred        12615 non-null  fl

Select data for one country, and use the date as index.   
For some countries (Thailand, for example) there is not much data before 1 April 2021, so I also select a date range starting from that date.
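
Selecting a date range works by passing date strings to `.loc` on a DataFrame whose index holds timestamps. The following sketch uses a tiny synthetic frame (the values are made up) to show that both ends of such a slice are inclusive and that an open-ended slice keeps everything from the start date onward.

```python
import pandas as pd

# Six consecutive days around the start of April
dates = pd.date_range("2021-03-29", periods=6, freq="D")
demo = pd.DataFrame({"people_vaccinated": [10, 20, 30, 40, 50, 60]},
                    index=dates)

# Open-ended slice: every row from 1 April onward
recent = demo.loc["20210401":]

# Closed slice: both endpoints are included
first_two_days = demo.loc["2021-04-01":"2021-04-02"]

print(recent.index[0].date(), len(recent), len(first_two_days))
```
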

In [13]:
country = 'Hong Kong'
# .copy() gives an independent DataFrame, so the column assignments below
# do not trigger a SettingWithCopyWarning
vacc_country = vacc_all[vacc_all['location']==country].copy()
vacc_country.index = vacc_country['date']

# Convert units to millions of people so y-axis is easier to read
for column in ['people_vaccinated', 'people_fully_vaccinated']:
    vacc_country[column] = vacc_country[column]/1000000

# use of string to select timestamps from Pandas User Guide (online)
# section titled "Selection by label"
vacc = vacc_country.loc['20210401':]

vacc.plot.line(x='date', 
               y=['people_vaccinated', 'people_fully_vaccinated'],
               ylabel='People Vaccinated', 
               title=f'Vaccinations in {country} vs time',
               grid=True)


<AxesSubplot:title={'center':'Vaccinations in Hong Kong vs time'}, xlabel='date', ylabel='People Vaccinated'>

In [14]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is set directly on a slice
vacc = vacc.assign(
    vacc_per_million=vacc['daily_vaccinations_per_million'].rolling(window=7, min_periods=1).mean())
plt.figure(figsize=[10,6])
plt.title("Daily vaccinations (bar) and per million people")
plt.bar(vacc['date'], vacc['daily_vaccinations'], color='gray')
plt.plot(vacc['vacc_per_million'], color='blue')
plt.grid(axis='y')
plt.tight_layout()
