## Thailand Covid-19 Data

This notebook shows Covid-19 case data from the Thailand Department of Disease Control. They changed the URL in 2021 (so much for compatibility). If the download
fails, they may have changed it again, so check their web site.

The README for this repository contains a description of the data and URLs for other data.  

### Download the Covid Dataset (As Needed) 

Run this cell only if you do not have up-to-date data or want a different dataset.  
The Thai data is updated once per day.
The file is saved in a subdirectory named `data`.


In [3]:
# Use wget (standard Unix/Linux util) to download data file
data_url = "https://covid19.th-stat.com/json/covid19v2/getTimeline.json"
# Name of local file
data_file = data_url.split("/")[-1]
# create the data directory if it does not exist yet
! mkdir -p data
# -N use timestamps for conditional get, -nv non-verbose, -t retries
! cd data && wget -nv -N -t 5 $data_url
# use a sensible name (not "getTimeline.json")
! cd data && cp $data_file timeline.json

2021-06-28 12:21:05 URL:https://covid19.th-stat.com/json/covid19v2/getTimeline.json [84455/84455] -> "getTimeline.json" [1]
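
Since the data is updated once per day, one way to decide whether the download cell needs to be run is to check the age of the local file. The following is only a sketch: the helper name `needs_download` and the 24-hour threshold are assumptions for illustration, not part of the original notebook.

```python
import os
import time

def needs_download(path, max_age_hours=24):
    """Return True if the file is missing or older than max_age_hours.

    Hypothetical helper: the threshold is an assumption based on the
    data being updated once per day.
    """
    if not os.path.exists(path):
        return True
    # age is measured from the file's last-modified time
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > max_age_hours * 3600

# timeline.json is the copy made by the download cell
print(needs_download("data/timeline.json"))
```

This checks `timeline.json` rather than `getTimeline.json` because `wget -N` may set the downloaded file's timestamp from the server, while the copy gets the time the cell was run.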


### Create DataFrame from downloaded dataset

This code creates a DataFrame and prints some info about it.

Since the current Covid data is most interesting, select a recent subset of the data. Set `ndays` to the number of most recent days to display.

In [7]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates
import json
# filename = name of Covid data file in JSON format
# data/timeline.json = large file of all data since 1/1/2020
# data/latest.json = smaller file for development
filename = "data/timeline.json"

# How many days of most recent data to display?
ndays = 91

# the useful Covid data is in the named element 'Data'
# so create a DataFrame from only that element.
with open(filename, 'r') as f:
    all_data = json.load(f)

covid = pd.DataFrame.from_records(all_data['Data'])

# keep only recent data; copy the slice so that adding columns
# below does not trigger a SettingWithCopyWarning
if 0 < ndays < len(covid):
    covid = covid[-ndays:].copy()

# convert string date to Timestamp object
covid['Date'] = pd.to_datetime(covid['Date'])
# convert Timestamp to python date, save it as a new column
covid['date'] = covid['Date'].transform(pd.Timestamp.date)

# describe the data
print(f"Dataset has {len(covid):,d} records")
# %Y-%m-%d is portable; %F is not supported on every platform
print(f"Start date  {covid['Date'].min():%Y-%m-%d}")
print(f"End date    {covid['Date'].max():%Y-%m-%d}")
print()
covid.info()

Dataset has 91 records
Start date  2021-03-22
End date    2021-06-27

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 436 to 526
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Date             91 non-null     datetime64[ns]
 1   NewConfirmed     91 non-null     int64         
 2   NewRecovered     91 non-null     int64         
 3   NewHospitalized  91 non-null     int64         
 4   NewDeaths        91 non-null     int64         
 5   Confirmed        91 non-null     int64         
 6   Recovered        91 non-null     int64         
 7   Hospitalized     91 non-null     int64         
 8   Deaths           91 non-null     int64         
 9   date             91 non-null     object        
dtypes: datetime64[ns](1), int64(8), object(1)
memory usage: 7.2+ KB


In [8]:
covid.head(3)

| | Date | NewConfirmed | NewRecovered | NewHospitalized | NewDeaths | Confirmed | Recovered | Hospitalized | Deaths | date |
|---|---|---|---|---|---|---|---|---|---|---|
| 436 | 2021-03-22 | 73 | 65 | 7 | 1 | 27876 | 26663 | 1122 | 91 | 2021-03-22 |
| 437 | 2021-03-23 | 401 | 103 | 297 | 1 | 28277 | 26766 | 1419 | 92 | 2021-03-23 |
| 438 | 2021-03-24 | 69 | 107 | -38 | 0 | 28346 | 26873 | 1381 | 92 | 2021-03-24 |


In [9]:
covid.tail(3)

| | Date | NewConfirmed | NewRecovered | NewHospitalized | NewDeaths | Confirmed | Recovered | Hospitalized | Deaths | date |
|---|---|---|---|---|---|---|---|---|---|---|
| 524 | 2021-06-25 | 3644 | 1751 | 1849 | 44 | 236291 | 193106 | 41366 | 1819 | 2021-06-25 |
| 525 | 2021-06-26 | 4161 | 3569 | 541 | 51 | 240452 | 196675 | 41907 | 1870 | 2021-06-26 |
| 526 | 2021-06-27 | 3995 | 2253 | 1700 | 42 | 244447 | 198928 | 43607 | 1912 | 2021-06-27 |


### Dates

The 'Date' column in the DataFrame (created by reading a file) is a string using the American mm/dd/yyyy format.  We convert it to a Pandas Timestamp using `pd.to_datetime()`.  Pandas correctly infers the date format (as shown in the head and tail output).
 
Optional named parameters that control how the date is parsed:

- `dayfirst=False` (the default), which is correct for this data.
- `format="%m/%d/%Y"` (a `strftime` format string), which does not seem to be needed here.

We then use the `Series.transform` method to transform Timestamp objects
to Python `datetime.date` objects, which make more sense here.
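
If Pandas ever guesses the format wrong, the format can be given explicitly. The sketch below uses made-up sample strings in mm/dd/yyyy form (not the real dataset) and uses the `.dt.date` accessor as an alternative way to get Python `date` objects.

```python
import pandas as pd

# Made-up sample strings in mm/dd/yyyy form, for illustration only
raw = pd.Series(["03/22/2021", "12/01/2021"])

# Explicit format: %m = month, %d = day, %Y = four-digit year
parsed = pd.to_datetime(raw, format="%m/%d/%Y")

# Convert each Timestamp to a plain datetime.date object
as_dates = parsed.dt.date

print(parsed.iloc[1])     # second string parsed as December 1, not January 12
print(as_dates.iloc[0])
```

Passing `format` removes any doubt about month/day order for values such as 12/01/2021.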

**Date as Index Variable**    
For a time series it's better to use the dates or timestamps as the index (instead of an arbitrary integer).  See README for more explanation.  I didn't do that here.
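
For reference, a minimal sketch of what a date index could look like. The three rows are copied from the tail output above; the variable names `sample`, `by_date`, and `recent` are only for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# Three rows taken from the tail of the dataset shown above
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2021-06-25", "2021-06-26", "2021-06-27"]),
    "NewConfirmed": [3644, 4161, 3995],
})

# Use the Date column as the index instead of an integer
by_date = sample.set_index("Date")

# With a DatetimeIndex, rows can be selected by date strings;
# both endpoints of the slice are included
recent = by_date.loc["2021-06-26":"2021-06-27"]
print(recent["NewConfirmed"].sum())
```

Selecting by date label is one of the main conveniences of a date index; the rest of this notebook keeps the integer index.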

### Line Plots of New Cases, Recovered Cases, and Hospitalization 

In [10]:
covid.plot.line(x='Date', 
                y=['NewConfirmed','NewHospitalized','NewRecovered'],
                ylabel="Daily Cases", title="Daily New Cases")

<AxesSubplot:title={'center':'Daily New Cases'}, xlabel='Date', ylabel='Daily Cases'>

### Daily Cases and Deaths in Separate Plots (subplot)

Show daily cases and deaths in separate plots, since their magnitudes differ greatly.

This works on my computer using jupyter core 4.7.0, jupyter-notebook 5.2.2, ipython 7.15.1, and matplotlib 3.3.3.
But Google Colab raises an error about illegal date values on x-axis.  

In [11]:
def ma(column_name: str, days=5):
    """Compute the moving average for data in a given column of covid data.
    
    Returns:
    A series containing the moving average.
    """
    return covid[column_name].rolling(window=days, min_periods=1).mean()

# tick frequency
xticks = matplotlib.dates.WeekdayLocator(interval=1)

fig, (plt1, plt2) = plt.subplots(nrows=2, sharex=True, figsize=[10,8])
# Remove xaxis_date() to fix error when run in Google Colab
#plt1.xaxis_date()
#covid['AvgConfirmed'] = ma('NewConfirmed')
#covid.plot.line(ax=plt1, y='AvgConfirmed', color='g')
covid.plot.bar(ax=plt1, x='date', y='NewConfirmed', legend=False, color='blue')
plt1.grid(True, axis='y')
plt1.set_title("Daily Confirmed Cases")

covid.plot.bar(ax=plt2, x='date', y='NewDeaths', legend=False, color='gray')
plt2.set_title("Daily Deaths")
plt2.xaxis.set_major_locator(xticks)
##plt2.xaxis.set_major_formatter(date_format)  # dates are wrong

### Moving Averages

Compute a 5-day simple moving average of new cases (NewConfirmed) and new deaths. To avoid "NA" values for the first few days, allow a smaller window to be used when there isn't enough data. The code uses the `Series.rolling()` method.
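
To see what `min_periods` changes, here is a small worked example on made-up numbers (not the Covid data). It assumes only that Pandas is installed.

```python
import pandas as pd

# Made-up daily values, for illustration only
daily = pd.Series([10, 20, 30, 40, 50, 60])

# Default: a full window of 5 values is required, so early rows are NaN
strict = daily.rolling(window=5).mean()

# min_periods=1: average whatever values are available so far
relaxed = daily.rolling(window=5, min_periods=1).mean()

print(strict.isna().sum())   # number of NaN rows in the strict version
print(relaxed.tolist())
```

The relaxed version has no missing values, but its first few entries average fewer than five days, so they are noisier than the rest.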

Originally I used a 7-day moving average, but it smooths the rapidly changing trend too much.

Plot the moving average along with unsmoothed daily case data.


In [12]:
covid['MovingAverage'] = covid['NewConfirmed'].rolling(window=5, min_periods=1).mean()
use_pandas_plot = False

if use_pandas_plot:
    # This uses the Pandas interface to Matplotlib, but only one plot shows.
    ax = covid.plot(x='Date', y='NewConfirmed', ylabel="New Cases", 
                    title='Daily New Covid Cases', kind='bar', color='gray')
    covid.plot(x='Date', y='MovingAverage', kind='line', color='blue', ax=ax)
else:
    # Use matplotlib directly
    plt.figure(figsize=[10,6])   # [width,height] in inches
    plt.title("Daily New Covid Cases")
    plt.bar(covid['Date'], covid['NewConfirmed'], color='gray')
    plt.plot(covid['Date'], covid['MovingAverage'], color='blue')
    plt.grid(True, axis='y')
    plt.tight_layout()   # adjust padding so titles and labels fit in the figure
    plt.show()


In [13]:
column = 'NewDeaths'
title = "Daily New Covid Deaths"
covid['MovingAverage'] = covid[column].rolling(window=5, min_periods=1).mean()

plt.figure(figsize=[10,6])   # [width,height] in inches
plt.title(title)
plt.bar(covid['Date'], covid[column], color='gray')
plt.plot(covid['Date'], covid['MovingAverage'], color='red')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

### Hospitalization using Multiple Y-Scales

Plot the total cases in hospital on the same graph with a plot of new hospitalizations.

What is not clear: *do hospital cases include the number of people in make-shift field hospitals?*  I kind of doubt it.


In [15]:
title = "Total in Hospital (left) and New Hospital Cases (right)"
ax = covid.plot(figsize=[10,6],
           x='Date', 
           y=['Hospitalized','NewHospitalized'], 
           ylabel='Total in Hospitals',
           color=["blue","green"],
           style=['-', '--'],    # poorly documented codes for line styles
           secondary_y=['NewHospitalized'],
           title=title
           )
# kludgy way of specifying label for right y-axis
ax.right_ax.set_ylabel("Net New Hospitalizations (net of discharged)");

This uses the Pandas interface to plot multiple series, but it's kludgy.  
Using Matplotlib directly provides more control and a more consistent programming interface.

This StackOverflow post shows a cleaner way to do it in Pandas:
<https://stackoverflow.com/questions/14178194/python-pandas-plotting-options-for-multiple-lines>

Even more kludgy: you can specify both line color and line style together:
```python
style = ['b-', 'g--']
```
This means the first series is plotted as a solid blue line and the second as a dashed green line.
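
For comparison, here is a rough sketch of the same two-scale plot using Matplotlib directly with `Axes.twinx()`. The helper name `plot_two_scales` is hypothetical, and the three sample rows are copied from the tail output earlier in this notebook; in practice you would pass the full `covid` DataFrame.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_two_scales(df, left_col, right_col, x_col="Date"):
    """Plot two columns against the same x-axis with separate y-scales.

    Hypothetical helper for illustration; returns the figure and both axes.
    """
    fig, ax_left = plt.subplots(figsize=(10, 6))
    ax_right = ax_left.twinx()   # second y-axis sharing the same x-axis
    ax_left.plot(df[x_col], df[left_col], color="blue", linestyle="-")
    ax_right.plot(df[x_col], df[right_col], color="green", linestyle="--")
    ax_left.set_ylabel(left_col)
    ax_right.set_ylabel(right_col)
    return fig, ax_left, ax_right

# Three rows taken from the tail of the dataset shown earlier
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2021-06-25", "2021-06-26", "2021-06-27"]),
    "Hospitalized": [41366, 41907, 43607],
    "NewHospitalized": [1849, 541, 1700],
})
fig, ax_left, ax_right = plot_two_scales(sample, "Hospitalized", "NewHospitalized")
ax_left.set_title("Total in Hospital (left) and New Hospital Cases (right)")
```

Each axis is an ordinary `Axes` object here, so labels, colors, and line styles are set the same way on both sides.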