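# The Coronavirus Pandemic

> Write some Python code to extract the four key figures that summarize the current state of the coronavirus pandemic, and describe them the way [Wikipedia](https://en.wikipedia.org/wiki/COVID-19_pandemic) does.
>
> Tags: programming, data acquisition and loading, summarization and exploration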

郭耀仁 <yaojenkuo@datainpoint.com>

In [1]:
# Import the packages this project needs
import datetime
import pandas as pd

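## TL;DR

We define a function `get_latest_daily_report()` that loads the most recent daily report from the Johns Hopkins University [COVID-19 Data Repository](https://github.com/CSSEGISandData/COVID-19) into a DataFrame. We then summarize four key figures from that DataFrame and use f-strings and the `format()` method to print the global status of the pandemic as a sentence in the style of the Wikipedia page.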

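## Four key figures on the current state of the coronavirus pandemic

The coronavirus pandemic is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The outbreak was first identified in Wuhan, China, in December 2019, and the disease is colloquially known as "Wuhan pneumonia". The World Health Organization declared it a pandemic on 11 March 2020. [Wikipedia](https://en.wikipedia.org/wiki/COVID-19_pandemic) describes the current situation with this passage: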

> As of 30 August 2020, more than **25 million** cases of COVID‑19 have been reported in more than **188** countries and territories, resulting in more than **843,000** deaths; more than **16.4 million** people have recovered.

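Four key figures describing the current state of the pandemic stand out in this passage:

- Total confirmed cases worldwide
- Number of countries with confirmed cases
- Total deaths worldwide
- Total recoveries worldwide

In this project, we write some Python code to extract these four key figures and describe them the way Wikipedia does.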

## Data source

The data comes from the daily reports folder `/csse_covid_19_data/csse_covid_19_daily_reports` in the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19). Starting from 2020-01-22, the folder contains one CSV file per day recording the global situation on that date.

<img src="img/daily_report_folder.png">

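## Loading the latest daily report

The CSV files are named with the `strftime` format `%m-%d-%Y`, commonly written `mm-dd-yyyy`. To load the most recent daily report, we could use today's date on the local machine as the file name. Because of the data source's update schedule and time-zone differences, however, a file for today's date often does not exist yet. We can therefore write a short program with the following logic:

1. Try to load the file named after today's date.
2. If loading succeeds, the program is done.
3. If loading fails with an error, subtract one day from the date and try again, repeating until a load succeeds.

This program needs the standard-library module `datetime`, the third-party package `pandas`, the `try...except...` statement, and the `while` statement. `datetime` lets us get today's date on the local machine, do date arithmetic, and control how a date is formatted as text.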

In [2]:
latest_date = datetime.date.today()
latest_date_fmt = latest_date.strftime('%m-%d-%Y')
print(latest_date_fmt)

09-01-2020
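
The date arithmetic that the retry logic relies on can be sketched with a fixed date, so the result does not depend on when the code runs. The date 2020-09-01 is taken from the output above and the variable names are illustrative only. The `strptime()` step, which turns a file-name string back into a date, is an addition that the rest of this project does not use.

```python
import datetime

fmt = '%m-%d-%Y'
# A fixed date instead of datetime.date.today(), so the example is reproducible
some_day = datetime.date(2020, 9, 1)
one_day = datetime.timedelta(days=1)

# Subtracting a timedelta from a date gives another date
previous_day = some_day - one_day
previous_day_fmt = previous_day.strftime(fmt)
print(previous_day_fmt)  # 08-31-2020

# strptime() goes the other way: from a file-name string back to a date
parsed_day = datetime.datetime.strptime(previous_day_fmt, fmt).date()
print(parsed_day == previous_day)  # True
```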


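The third-party package `pandas` reads a CSV file into a `DataFrame`, which is convenient for summarizing and analysis. Because of update timing and time-zone differences, the most recent daily report is usually dated yesterday or the day before, so using today's date as the file name will usually raise an error. We can use `try...except...` to catch that error.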

In [3]:
csv_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/{}.csv".format(latest_date_fmt)
try:
    daily_report = pd.read_csv(csv_url)
except Exception:  # a bare except would also swallow KeyboardInterrupt
    print("No daily report for {} yet.".format(latest_date_fmt))

No daily report for 09-01-2020 yet.


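Finally we add a `while` loop: whenever an error is caught, subtract one day from the date and try loading again. If another error is caught, keep subtracting one day until the load succeeds.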

In [4]:
latest_date = datetime.date.today()
day_delta = datetime.timedelta(days=1)
fmt = '%m-%d-%Y'
while True:
    # Build the file name and URL for the date currently being tried
    latest_date_fmt = latest_date.strftime(fmt)
    csv_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/{}.csv".format(latest_date_fmt)
    try:
        daily_report = pd.read_csv(csv_url)
        print("Loaded the daily report for {}.".format(latest_date_fmt))
        break
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("No daily report for {} yet.".format(latest_date_fmt))
        latest_date -= day_delta  # step back one day and try again

No daily report for 09-01-2020 yet.
Loaded the daily report for 08-31-2020.


## Wrapping the loading logic in a function

Wrapping the code above in a function lets us return both the latest daily report and its file date.

In [5]:
def get_latest_daily_report():
    """
    This function returns the latest global daily report from https://github.com/CSSEGISandData/COVID-19 and its file date.
    """
    latest_date = datetime.date.today()
    day_delta = datetime.timedelta(days=1)
    fmt = '%m-%d-%Y'
    while True:
        # Build the file name and URL for the date currently being tried
        latest_date_fmt = latest_date.strftime(fmt)
        csv_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/{}.csv".format(latest_date_fmt)
        try:
            daily_report = pd.read_csv(csv_url)
            print("Loaded the daily report for {}.".format(latest_date_fmt))
            break
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            print("No daily report for {} yet.".format(latest_date_fmt))
            latest_date -= day_delta  # step back one day and try again
    return latest_date, daily_report

In [6]:
latest_date, daily_report = get_latest_daily_report()

No daily report for 09-01-2020 yet.
Loaded the daily report for 08-31-2020.
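
The `while True` loop in `get_latest_daily_report()` keeps stepping back one day until a load succeeds, so it has no upper bound on the number of attempts. A minimal sketch of a bounded variant follows. The names `find_latest`, `loader`, and `max_days_back` are hypothetical and not part of the original project. `loader` stands in for a call such as `pd.read_csv`, so the logic can be exercised without network access.

```python
import datetime

def find_latest(loader, start_date, max_days_back=7, fmt='%m-%d-%Y'):
    """Try start_date, then earlier dates, and return (date, loaded_object).

    loader is any callable that takes the formatted date string and either
    returns the loaded data or raises an exception (hypothetical interface).
    """
    one_day = datetime.timedelta(days=1)
    current = start_date
    for _ in range(max_days_back + 1):
        try:
            return current, loader(current.strftime(fmt))
        except Exception:
            current -= one_day  # step back one day and try again
    raise LookupError(
        "no report found within {} days of {}".format(max_days_back, start_date)
    )

# A stand-in loader for illustration: only the 08-31-2020 "file" exists
def fake_loader(date_fmt):
    available = {'08-31-2020': 'report for 08-31-2020'}
    return available[date_fmt]  # raises KeyError for missing dates

found_date, found_report = find_latest(fake_loader, datetime.date(2020, 9, 1))
print(found_date, found_report)
```

With the real data, `loader` could be a small function that formats the URL template shown above with the date string and passes it to `pd.read_csv`. Giving up after `max_days_back` attempts turns a long run of failed requests into an explicit error.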


In [7]:
daily_report.shape # (rows, columns) of the daily report

(3954, 14)

In [8]:
daily_report.head() # first five rows of the daily report

| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key | Incidence_Rate | Case-Fatality_Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | | | Afghanistan | 2020-09-01 04:28:31 | 33.93911 | 67.709953 | 38165 | 1402 | 29089 | 7674.0 | Afghanistan | 98.039112 | 3.673523 |
| 1 | | | | Albania | 2020-09-01 04:28:31 | 41.1533 | 20.1683 | 9513 | 284 | 5214 | 4015.0 | Albania | 330.565015 | 2.985388 |
| 2 | | | | Algeria | 2020-09-01 04:28:31 | 28.0339 | 1.6596 | 44494 | 1510 | 31244 | 11740.0 | Algeria | 101.46623 | 3.393716 |
| 3 | | | | Andorra | 2020-09-01 04:28:31 | 42.5063 | 1.5218 | 1176 | 53 | 908 | 215.0 | Andorra | 1522.034556 | 4.506803 |
| 4 | | | | Angola | 2020-09-01 04:28:31 | -11.2027 | 17.8739 | 2654 | 108 | 1071 | 1475.0 | Angola | 8.075149 | 4.069329 |


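## Summarizing the four key figures

Once the latest daily report is loaded, we can call `sum()` on the `Confirmed`, `Deaths`, and `Recovered` columns of the `DataFrame` to get the global totals of confirmed cases, deaths, and recoveries.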

In [9]:
global_confirmed = daily_report['Confirmed'].sum()
global_deaths = daily_report['Deaths'].sum()
global_recovered = daily_report['Recovered'].sum()
print(global_confirmed)
print(global_deaths)
print(global_recovered)

25484767
850535
16819592


Finally, calling `nunique()` on the `Country_Region` column gives the number of countries with confirmed cases.

In [10]:
n_countries_reported = daily_report['Country_Region'].nunique()
print(n_countries_reported)

188
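
The report has 3,954 rows but only 188 distinct `Country_Region` values, because one country can span many rows (the `Province_State` and `Admin2` columns subdivide it). That is why `sum()` runs over all rows while the country count uses `nunique()` rather than a row count. The toy table below is made-up data with the same column names, used only to illustrate both points. It also shows one possible refinement that the original code does not apply: counting only regions with at least one confirmed case.

```python
import pandas as pd

# Made-up rows that mimic the structure of a daily report
toy_report = pd.DataFrame({
    'Province_State': [None, 'New York', 'Texas', None],
    'Country_Region': ['Andorra', 'US', 'US', 'Nowhereland'],
    'Confirmed': [10, 200, 300, 0],
    'Deaths': [1, 20, 30, 0],
    'Recovered': [5, 100, 150, 0],
})

n_rows = toy_report.shape[0]
toy_confirmed = toy_report['Confirmed'].sum()
toy_regions = toy_report['Country_Region'].nunique()
print(n_rows)         # 4 rows
print(toy_confirmed)  # 510
print(toy_regions)    # 3 distinct regions, although there are 4 rows

# Possible refinement: count only regions that report at least one case
has_cases = toy_report['Confirmed'] > 0
toy_regions_with_cases = toy_report.loc[has_cases, 'Country_Region'].nunique()
print(toy_regions_with_cases)  # 2
```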


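## Describing it the way [Wikipedia](https://en.wikipedia.org/wiki/COVID-19_pandemic) does

Finally, we can embed the values of our objects into text: the date of the latest daily report and the four key figures go into the [Wikipedia](https://en.wikipedia.org/wiki/COVID-19_pandemic) sentence. First, use `{}` to leave five placeholders in the sentence.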

In [11]:
"""As of {}, more than {} cases of COVID‑19 have been reported in more than {} countries and territories, resulting in more than {} deaths; more than {} people have recovered."""

'As of {}, more than {} cases of COVID‑19 have been reported in more than {} countries and territories, resulting in more than {} deaths; more than {} people have recovered.'

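Next, we fill these five `{}` placeholders with the report date and the four key figures computed above. In Python 3.6+ there are two mainstream ways to do this: f-strings and the `format()` method. The f-string version comes first.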

In [12]:
pandemic_status = f"""As of {latest_date}, more than {global_confirmed} cases of COVID‑19 have been reported in more than {n_countries_reported} countries and territories, resulting in more than {global_deaths} deaths; more than {global_recovered} people have recovered."""
print(pandemic_status)

As of 2020-08-31, more than 25484767 cases of COVID‑19 have been reported in more than 188 countries and territories, resulting in more than 850535 deaths; more than 16819592 people have recovered.


Then the `format()` version.

In [13]:
pandemic_status = """As of {}, more than {} cases of COVID‑19 have been reported in more than {} countries and territories, resulting in more than {} deaths; more than {} people have recovered."""
print(pandemic_status.format(latest_date, global_confirmed, n_countries_reported, global_deaths, global_recovered))

As of 2020-08-31, more than 25484767 cases of COVID‑19 have been reported in more than 188 countries and territories, resulting in more than 850535 deaths; more than 16819592 people have recovered.
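
Positional `{}` placeholders depend on passing the arguments in exactly the right order. As an alternative sketch, `format()` also accepts named fields, which makes a five-value template harder to mix up. The field names and the shortened template are illustrative, and the numbers are the ones printed above.

```python
import datetime

template = ("As of {date}, more than {confirmed} cases have been reported "
            "in more than {countries} countries and territories, "
            "resulting in more than {deaths} deaths; "
            "more than {recovered} people have recovered.")

figures = {
    'date': datetime.date(2020, 8, 31),
    'confirmed': 25484767,
    'countries': 188,
    'deaths': 850535,
    'recovered': 16819592,
}
# ** unpacks the dict into keyword arguments matching the field names
status = template.format(**figures)
print(status)
```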


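## Adjusting the display format

Both f-strings and the `format()` method support format specifications, which change how an object is printed without changing what it stores. This lets us show the date with an abbreviated month name and add thousands separators to large numbers.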

In [14]:
pandemic_status = f"""As of {latest_date:%d %b %Y}, more than {global_confirmed:,} cases of COVID‑19 have been reported in more than {n_countries_reported} countries and territories, resulting in more than {global_deaths:,} deaths; more than {global_recovered:,} people have recovered."""
print(pandemic_status)

As of 31 Aug 2020, more than 25,484,767 cases of COVID‑19 have been reported in more than 188 countries and territories, resulting in more than 850,535 deaths; more than 16,819,592 people have recovered.


In [15]:
pandemic_status = """As of {:%d %b %Y}, more than {:,} cases of COVID‑19 have been reported in more than {} countries and territories, resulting in more than {:,} deaths; more than {:,} people have recovered."""
print(pandemic_status.format(latest_date, global_confirmed, n_countries_reported, global_deaths, global_recovered))

As of 31 Aug 2020, more than 25,484,767 cases of COVID‑19 have been reported in more than 188 countries and territories, resulting in more than 850,535 deaths; more than 16,819,592 people have recovered.
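
The Wikipedia sentence quoted earlier uses rounded figures ("more than 25 million", "more than 843,000"), whereas the output above prints exact counts. One possible way to get closer to that wording is sketched below. The helper `round_down_for_prose` and its rounding thresholds are assumptions made for illustration, not part of the original project and not a description of how Wikipedia derives its numbers.

```python
def round_down_for_prose(n):
    """Round a count down to a short, prose-friendly figure."""
    if n >= 1_000_000:
        tenths_of_million = n // 100_000           # e.g. 25484767 -> 254
        return "{:g} million".format(tenths_of_million / 10)
    if n >= 1_000:
        return "{:,}".format(n // 1_000 * 1_000)   # e.g. 850535 -> 850,000
    return str(n)

print(round_down_for_prose(25484767))  # 25.4 million
print(round_down_for_prose(850535))    # 850,000
print(round_down_for_prose(16819592))  # 16.8 million
print(round_down_for_prose(188))       # 188
```

The returned strings can be dropped into the same f-string or `format()` template in place of the raw integers.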
