#### Numerical thought experiments on the COVID-19 data from Johns Hopkins University

As the COVID-19 virus (formerly known as 2019-nCoV) spreads, more and more data is becoming available. As often seems to happen, the reporting in the Uniform Legacy Media is being filtered through two or three layers of press releases and poorly understood quotes, so I am doing some data exploration of my own on the published data. These notes are a record of that exploration.

To start with, I have forked the [Johns Hopkins University data repository](https://github.com/CSSEGISandData/COVID-19) on GitHub into a public repository in my own GitHub account (get a link), where I'm also going to keep these notes.

We start by importing Python packages for use below.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import collections
import seaborn as sbs
from math import log, log2, log10

`dcu_root` is the path to the root of the _Daily Case Updates_ provided by JHU.

In [2]:
dcu_root = "./daily_case_updates/"

Individual data sets in this repository are named by date (M-D-Y) and update time in UTC. I'm giving the data sets individual names with ISO-8601 dates because it's far more convenient, especially because I don't have to think about the ambiguous date format.


In [3]:
dcu_20200212T1020Z = dcu_root + "02-12-2020_1020.csv"
dcu_20200213T2115Z = dcu_root + "02-13-2020_2115.csv"
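
The renaming above is done by hand. As a minimal sketch, assuming every file follows the `MM-DD-YYYY_HHMM.csv` pattern seen in the two names above, the same ISO-8601 label can be derived from the file name. The helper names `dcu_timestamp` and `iso_basic` are hypothetical and not part of the JHU repository.

```python
from datetime import datetime, timezone
from pathlib import Path


def dcu_timestamp(path):
    """Parse a daily-update file name such as '02-12-2020_1020.csv'.

    Assumes the MM-DD-YYYY_HHMM pattern and that the time is UTC.
    """
    stem = Path(path).stem  # e.g. "02-12-2020_1020"
    stamp = datetime.strptime(stem, "%m-%d-%Y_%H%M")
    return stamp.replace(tzinfo=timezone.utc)


def iso_basic(stamp):
    """Format a timestamp in the compact ISO-8601 style used in these notes."""
    return stamp.strftime("%Y%m%dT%H%MZ")


label = iso_basic(dcu_timestamp("./daily_case_updates/02-13-2020_2115.csv"))
print(label)  # 20200213T2115Z
```

A label built this way sorts chronologically as plain text, which the M-D-Y file names do not once the data spans more than one year.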

Load the data in a Pandas data frame, creatively named `df`.

In [4]:
df = pd.read_csv(dcu_20200213T2115Z)

In [5]:
df.head()


Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Hubei,Mainland China,2020-02-14 00:13:23,51986,1426,4131
1,Guangdong,Mainland China,2020-02-14 01:23:02,1261,2,332
2,Henan,Mainland China,2020-02-14 01:13:05,1184,11,313
3,Zhejiang,Mainland China,2020-02-14 01:13:05,1155,0,367
4,Hunan,Mainland China,2020-02-14 01:23:02,988,2,352


In [6]:
mainland_china = df[df["Country/Region"] == "Mainland China"]

In [7]:
mainland_china.head()


Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Hubei,Mainland China,2020-02-14 00:13:23,51986,1426,4131
1,Guangdong,Mainland China,2020-02-14 01:23:02,1261,2,332
2,Henan,Mainland China,2020-02-14 01:13:05,1184,11,313
3,Zhejiang,Mainland China,2020-02-14 01:13:05,1155,0,367
4,Hunan,Mainland China,2020-02-14 01:23:02,988,2,352


In [8]:
everywhere_else = df[df["Country/Region"] != "Mainland China"]

In [9]:
everywhere_else.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
18,Diamond Princess cruise ship,Others,2020-02-14 00:13:23,218,0,0
30,,Singapore,2020-02-13 14:33:02,58,0,15
31,Hong Kong,Hong Kong,2020-02-13 14:53:02,53,1,1
32,,Thailand,2020-02-13 17:53:03,33,0,12
33,,Japan,2020-02-13 12:23:05,28,1,9


In [10]:
china_deaths = mainland_china["Deaths"]

In [11]:
sum(china_deaths)/sum(mainland_china.Confirmed)

0.023307905577920146

In [12]:
sum(everywhere_else.Deaths)/sum(everywhere_else.Confirmed)

0.0051635111876075735

In [13]:
len(everywhere_else.Confirmed)

43

In [14]:
sum(mainland_china.Confirmed)

63841

In [15]:
sum(df.Confirmed)

64422
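
As a sanity check, the totals and ratios printed above can be cross-checked against each other with plain arithmetic. The sketch below uses only the numbers printed by the cells above. The two death counts are back-computed from the printed ratios rather than read from the data, so treat them as inferred values.

```python
# Values copied from the cell outputs above.
confirmed_china = 63841
confirmed_total = 64422
ratio_china = 0.023307905577920146
ratio_elsewhere = 0.0051635111876075735

# Confirmed cases outside mainland China follow from the two totals.
confirmed_elsewhere = confirmed_total - confirmed_china

# Inferred, not read from the data: deaths implied by each printed ratio.
deaths_china = round(ratio_china * confirmed_china)
deaths_elsewhere = round(ratio_elsewhere * confirmed_elsewhere)

print(confirmed_elsewhere, deaths_china, deaths_elsewhere)  # 581 1488 3
```

Note that the 43 printed by `len(everywhere_else.Confirmed)` is the number of rows (regions) outside mainland China, not a case count. With so few deaths outside mainland China, the second ratio is very sensitive to a single additional death.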

So, now the question that has been on everyone's mind is whether the Chinese Government is understating the number of cases or the mortality. There are two good reasons to suspect that the number of confirmed cases is understated:
1. It appears that there are many cases of infection by COVID-19 that aren't being reported. There are several reasons for that:
    - A mild case of COVID-19 is pretty much indistinguishable from a common cold. (In fact, coronaviruses account for many cases of the "common cold"; "common cold" describes a syndrome that can be caused by many different viruses.)
    - If you have a "cold" and visit a doctor, and it's confirmed to be COVID-19, you may be confined to a makeshift quarantine hospital, which is an incentive not to get tested.
    - The Chinese government is asking people with only mild symptoms not to overload the already-stressed medical facilities.
2. Simulations based on cases _outside_ China (reported by [Gardner and others](https://systems.jhu.edu/research/public-health/ncov-model-2/)) and including the best observed estimates of $R_0$ suggest that the number of real infections is considerably larger — perhaps as much as 10 times larger — than the reported number of confirmed infections.
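
To see what such an undercount would do to the fatality ratio computed earlier, here is a rough sketch. It assumes that deaths are reported accurately and that real infections are $k$ times the confirmed count, so the ratio among all infections becomes $D / (kC)$, the naive ratio divided by $k$. The function name `adjust_ratio` is hypothetical, and the factors below are illustrative; only the upper value of 10 comes from the estimate cited above.

```python
def adjust_ratio(naive_ratio, undercount_factor):
    """Scale a deaths/confirmed ratio for an assumed undercount of infections.

    Assumes deaths are counted accurately and that true infections equal
    undercount_factor times the confirmed cases.
    """
    if undercount_factor <= 0:
        raise ValueError("undercount_factor must be positive")
    return naive_ratio / undercount_factor


# Naive mainland China ratio printed earlier in these notes.
naive_china = 0.023307905577920146

# Illustrative factors; 10 is the "as much as 10 times" upper figure.
for k in (1, 2, 5, 10):
    print(f"k = {k:>2}: {adjust_ratio(naive_china, k):.4%}")
```

This is a thought experiment rather than an estimate: it ignores deaths that go unreported and confirmed cases that have not yet resolved, either of which would push the ratio back up.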

Now, Gardner's paper mentions several reasons why their estimate might be high or low, so let's look at all three cases.