# Global new coronavirus progress analysis

**Last Update : 2020-01-30 21:30**

This article visualizes the progress of the novel coronavirus (2019-nCoV) that is spreading worldwide. It uses the data that JHU CSSE updates continuously. The analysis was carried out for personal research purposes and may contain errors. I sincerely hope that the virus disappears completely from every country as soon as possible. Thank you.

- Data Source  
[Dashboard](https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6)  
[Novel Coronavirus (2019-nCoV) Cases, provided by JHU CSSE](https://docs.google.com/spreadsheets/d/1yZv9w9zRKwrGTaR-YzmAqMefw4wMlaXocejdxZaTs6w/htmlview?usp=sharing&sle=true#)


- Github Repo  
[novel_coronavirus](https://github.com/WooilJeong/novel_coronavirus)

In [3]:
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load `Novel Coronavirus (2019-nCoV) Cases, provided by JHU CSSE` Dataset

In [16]:
scope = [
'https://spreadsheets.google.com/feeds',
'https://www.googleapis.com/auth/drive',
]

json_file_name = 'gspread-266617-7512230df225.json'
credentials = ServiceAccountCredentials.from_json_keyfile_name(json_file_name, scope)

gc = gspread.authorize(credentials)
# spreadsheet_url = 'https://docs.google.com/spreadsheets/d/1yZv9w9zRKwrGTaR-YzmAqMefw4wMlaXocejdxZaTs6w/htmlview?usp=sharing&sle=true#'
# new sheet
spreadsheet_url = 'https://drive.google.com/file/d/1ZJCEHKijIMVSJvH74C-LIUJb7BAHqj6b/view?ths=true'
doc = gc.open_by_url(spreadsheet_url)

NoValidUrlKeyFound: 

## Current Dataset Status

In [17]:
sheet_list = doc.worksheets()
sheet_nm = []
for i in sheet_list:
    sheet_nm.append(i.title)
print('sheets number :', len(sheet_list))

APIError: {'code': 400, 'message': 'This operation is not supported for this document', 'status': 'FAILED_PRECONDITION'}

## Convert to pandas DataFrame

In [6]:
df_list = []
for i in sheet_nm:
    
    data = doc.worksheet(i).get_all_values()
    globals()[i] = pd.DataFrame(data[1:], columns=data[0])
    
    df_list.append(globals()[i])

In [7]:
for df in df_list:
    print(set(df.columns))

{'New Google Sheet Link (support comments): https://drive.google.com/file/d/1ZJCEHKijIMVSJvH74C-LIUJb7BAHqj6b/view?usp=sharing'}
{'Recovered', 'Deaths', 'Last Update', 'Province/State', 'Country/Region', 'Confirmed'}
{'Recovered', 'Deaths', 'Last Update', 'Province/State', 'Country/Region', 'Confirmed'}
{'Recovered', 'Deaths', 'Last Update', 'Province/State', 'Country/Region', 'Confirmed'}
{'', 'Quick note: Starting from this tab, our map is updating (almost) in real time (China data - at least once per hour; non China data - several times per day). This table is planning to be updated twice a day. The discrepancy between the map and this sheet is expected. Sorry for any confusion and inconvenience.', 'Recovered', 'Deaths', 'Last Update', 'Province/State', 'Country/Region', 'Confirmed'}
{'Recovered', 'Deaths', 'Last Update', 'Province/State', 'Country/Region', 'Confirmed'}
{'Recovered', 'Deaths', 'Last Update', 'Province/State', 'Country/Region', 'Confirmed'}
{'Recovered', 'Deaths', 'L

## Only use datasets of the same type for analysis

- Only the sheets recorded after 12 PM on January 23 are used.  
- These are the sheets that have the following columns in common:

```
'Province/State',
'Country/Region',
'Last Update',
'Confirmed',
'Deaths',
'Recovered'
```

## Exclude remaining datasets

In [8]:
rm_items = ['Jan22_12am','Jan22_12pm','Jan23_12pm']
for i in rm_items:
    sheet_nm.remove(i)

## Common column list

In [9]:
col_list=[
            'Province/State',
            'Country/Region',
            'Last Update',
            'Confirmed',
            'Deaths',
            'Recovered'
         ]
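
The manual exclusion list could also be replaced by a check on each sheet's columns. The following is a minimal sketch under assumptions: `frames` is a hypothetical dict of sheet name to DataFrame (the notebook itself stores each sheet in `globals()`), the sheet names are placeholders, and a sheet is kept only when its columns match the common set exactly.

```python
import pandas as pd

# The common columns listed above
COMMON_COLUMNS = {
    'Province/State', 'Country/Region', 'Last Update',
    'Confirmed', 'Deaths', 'Recovered',
}

# Hypothetical stand-ins for the downloaded sheets (empty frames, columns only)
frames = {
    'sheet_a': pd.DataFrame(columns=sorted(COMMON_COLUMNS)),
    'sheet_b': pd.DataFrame(columns=['note']),
    'sheet_c': pd.DataFrame(columns=sorted(COMMON_COLUMNS) + ['']),
}

# Keep a sheet only if its column set equals the common set
kept = [name for name, frame in frames.items()
        if set(frame.columns) == COMMON_COLUMNS]
print(kept)
```

Matching the set exactly also drops sheets that carry extra note columns, which is a stricter rule than checking for a subset.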

## Check Outliers

If a sheet's 'Last Update' column holds more than one distinct value, so that it does not match the time in the sheet title, every value in that column is replaced with the time from the sheet title.

In [12]:
for i in sheet_nm:
    if len(globals()[i]['Last Update'].unique())>1:        
        print(i)
        print(">>>", globals()[i]['Last Update'].unique(),"\n")

In [None]:
Feb01_11pm['Last Update']='2/1/2020 23:00'
Feb01_6pm['Last Update']='2/1/2020 18:00'

Jan31_2pm['Last Update']='1/31/2020 14:00'
Jan28_11pm['Last Update']='1/28/2020 23:00'
Jan25_10pm['Last Update']='1/25/2020 10:00 PM'
Jan24_12pm['Last Update']='1/24/2020 12:00 PM'

In [None]:
for i in sheet_nm:
    if len(globals()[i]['Last Update'].unique())>1:        
        print(i)
        print(">>>", globals()[i]['Last Update'].unique(),"\n")
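
The replacement timestamps above are typed by hand, but they could also be derived from the sheet titles. This is a sketch, assuming titles always follow the `Feb01_11pm` pattern seen above, that the year is 2020, and that the default English locale is active for month and AM/PM names. `title_to_last_update` is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime

def title_to_last_update(title, year=2020):
    """Convert a sheet title such as 'Feb01_11pm' to 'M/D/YYYY HH:MM'."""
    # %b = abbreviated month, %d = day, %I = 12-hour clock, %p = am/pm
    parsed = datetime.strptime(f"{year} {title}", "%Y %b%d_%I%p")
    return f"{parsed.month}/{parsed.day}/{parsed.year} {parsed:%H:%M}"

print(title_to_last_update('Feb01_11pm'))
```

The result uses the 24-hour form, which the standardization step later in the notebook can parse with `"%m/%d/%Y %H:%M"`.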

## Data integration

In [None]:
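df = pd.DataFrame()
for i in sheet_nm:
    try:
        # Keep only the common columns, then append the sheet to the combined frame
        globals()[i] = globals()[i][col_list]
        df = pd.concat([df, globals()[i]])
        # Report success only after the sheet has actually been appended
        print('Complete :', i)
    except KeyError:
        # The sheet is missing, or lacks at least one of the common columns
        print('Failed :', i)

df = pd.DataFrame(df, columns=col_list)
df.index = range(len(df))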

## Pre-processing

### Check dates and times in different formats

In [None]:
set(df['Last Update'])

### Standardize the dates and times

In [None]:
import datetime

date_list = []
for i in df['Last Update']:
    # 12-hour timestamps carry an AM/PM suffix; all others use the 24-hour clock
    if 'AM' in i or 'PM' in i:
        a = datetime.datetime.strptime(i, "%m/%d/%Y %I:%M %p")
    else:
        a = datetime.datetime.strptime(i, "%m/%d/%Y %H:%M")
    # Store every timestamp in one common format
    date_list.append(a.strftime("%Y-%m-%d %H:%M"))

df['Last Update'] = date_list

In [None]:
# Replace empty strings with zeros (and empty province names with 'None')
df['Confirmed'] = df["Confirmed"].apply(lambda x: 0 if x=="" else x)
df['Deaths'] = df["Deaths"].apply(lambda x: 0 if x=="" else x)
df['Recovered'] = df["Recovered"].apply(lambda x: 0 if x=="" else x)
df['Province/State'] = df["Province/State"].apply(lambda x: 'None' if x=="" else x)

# Data type conversion
df['Last Update'] = pd.to_datetime(df['Last Update'])
df['Confirmed'] = pd.to_numeric(df['Confirmed'])
df['Deaths'] = pd.to_numeric(df['Deaths'])
df['Recovered'] = pd.to_numeric(df['Recovered'])

# Feature Engineering
df['D/C'] = (df['Deaths']/df['Confirmed'])*100
df['R/C'] = (df['Recovered']/df['Confirmed'])*100

df = df.fillna(0)
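
The D/C and R/C columns divide by 'Confirmed', so a row with zero confirmed cases produces NaN (0/0) or inf (x/0), and `fillna(0)` only covers the NaN case. A minimal sketch of a guarded version follows, using made-up toy numbers rather than the JHU CSSE data. `rate_percent` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values only)
toy = pd.DataFrame({
    'Confirmed': [100, 0, 50],
    'Deaths':    [2, 0, 0],
    'Recovered': [10, 0, 5],
})

def rate_percent(numerator, denominator):
    """Return numerator / denominator * 100, using 0 where the denominator is 0."""
    safe = denominator.replace(0, np.nan)  # avoid dividing by zero
    return (numerator / safe * 100).fillna(0)

toy['D/C'] = rate_percent(toy['Deaths'], toy['Confirmed'])
toy['R/C'] = rate_percent(toy['Recovered'], toy['Confirmed'])
print(toy)
```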

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum()

## Save Dataset

In [None]:
import os

# Create the output directory; do nothing if it already exists
os.makedirs('Data', exist_ok=True)

In [None]:
df.to_csv('Data/Dataset.csv',index=False,encoding='utf-8')

### Create country-specific data sets

In [None]:
# # List of Country/Region
# country_list = list(set(df['Country/Region']))
# country_list

In [None]:
# for i in country_list:
#     globals()[i.replace(' ','_')] = df[df['Country/Region']==i]
#     globals()[i.replace(' ','_')] = globals()[i.replace(' ','_')].sort_values('Last Update', ascending=True)
#     globals()[i.replace(' ','_')].index = range(len(globals()[i.replace(' ','_')]))    
#     print(i.replace(' ','_'))
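
The commented-out loop above creates one global variable per country. As an alternative sketch, the same split could be held in a dictionary built with `groupby`. The toy rows are made up for illustration, and `by_country` is a hypothetical name.

```python
import pandas as pd

# Toy rows in the same shape as the combined dataset (illustrative values only)
toy = pd.DataFrame({
    'Country/Region': ['Japan', 'South Korea', 'Japan'],
    'Last Update': pd.to_datetime(
        ['2020-01-25 12:00', '2020-01-24 12:00', '2020-01-24 12:00']),
    'Confirmed': [2, 2, 1],
})

# One time-ordered DataFrame per country, keyed by the country name
by_country = {
    name: group.sort_values('Last Update').reset_index(drop=True)
    for name, group in toy.groupby('Country/Region')
}
print(by_country['Japan'])
```

A dictionary keeps names such as 'South Korea' usable as keys without replacing spaces with underscores.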

## Visualization

- `plot_confirmed()`: shows the trend in the number of confirmed cases.  
- `plot_deaths_recovered()`: shows the trend in the number of deaths and recovered patients.  
- `plot_dc_rc()`: shows the death rate and the recovery rate (%).
```
d/c means Deaths per Confirmed
r/c means Recovered per Confirmed
```

In [None]:
# class corona():
    
#     def __init__(self, data_nm):
        
#         # set dataset
#         self.data_nm = data_nm
#         self.dataset = globals()[self.data_nm].copy()
        
#         # set date index
#         self.dataset.index = self.dataset['Last Update']
#         self.dataset.index = self.dataset.index.astype("category")
#         self.objects = list(self.dataset.index)
#         self.y_pos = np.arange(len(self.objects))        
    
#     def plot_confirmed(self):

#         plt.figure(figsize=(15,5))
#         plt.title(self.data_nm+'(Confirmed)', size='25', weight='bold')
#         plt.plot(self.y_pos, self.dataset['Confirmed'], color='dodgerblue', linewidth=3, marker='o')
#         plt.xticks(self.y_pos, self.objects, rotation=45)
#         plt.xlabel('Reported Time', size='20')
#         plt.ylabel('Count', size='20')
#         plt.show()

#     def plot_deaths_recovered(self):
        
#         plt.figure(figsize=(15,5))
#         plt.title(self.data_nm+'(Deaths, Recovered)', size='25', weight='bold')
#         plt.plot(self.y_pos, self.dataset['Deaths'], color='tomato', label='Deaths', linewidth=3, marker='o')
#         plt.plot(self.y_pos, self.dataset['Recovered'], color='orange', label='Recovered', linewidth=3, marker='o')
#         plt.legend(loc='upper left')
#         plt.xticks(self.y_pos, self.objects, rotation=45)
#         plt.xlabel('Reported Time', size='20')
#         plt.ylabel('Count', size='20')
#         plt.show()
        
#     def plot_dc_rc(self):
#         plt.figure(figsize=(15,5))
#         plt.title(self.data_nm+'(D/C, R/C)', size='25', weight='bold')
#         plt.plot(self.y_pos, self.dataset['D/C'], color='tomato', label='D/C', linewidth=3, marker='o')
#         plt.plot(self.y_pos, self.dataset['R/C'], color='orange', label='R/C', linewidth=3, marker='o')
#         plt.ylim(0, 100)
#         plt.legend(loc='upper left')
#         plt.xticks(self.y_pos, self.objects, rotation=45)
#         plt.xlabel('Reported Time', size='20')
#         plt.ylabel('Indicator values(%)', size='20')
#         plt.show()
        
#     def plot_world_confirmed(self):
#         plt.figure(figsize=(20,10))
#         plt.title('World Confirmed Trend('+self.data_nm+')', size='25', weight='bold')
#         for i in self.dataset.columns[1:]:
#             plt.plot(self.y_pos, self.dataset[i], label=i, linewidth=3, marker='o')
#         plt.legend(loc='upper left')
#         plt.xticks(self.y_pos, self.objects, rotation=45)
#         plt.xlabel('Reported Time', size='20')
#         plt.ylabel('Confirmed Count', size='20')
#         plt.show()

## World Status Except Some Countries

Countries that report province- or state-level data are covered in later sections.

In [None]:
# world_list=[
    
#  'South Korea',
#  'Japan',
#  'Thailand',
#  'Singapore',
#  'France',
#  'Finland',
#  'Vietnam',
#  'Malaysia',
#  'Cambodia',
#  'Germany',
#  'Ivory Coast',
#  'Sri Lanka',
#  'United Arab Emirates',
#  'Macau',
#  'Nepal',
#  'Taiwan',
#  'Hong Kong'

# # 'Mainland China',  
# # 'US'
# # 'Australia',  
# # 'Canada',
# ]

In [None]:
# for i in world_list:
#     c = corona(i.replace(' ','_'))
#     c.plot_confirmed()
#     c.plot_deaths_recovered()
#     c.plot_dc_rc()

## Mainland China

In [None]:
# china_list = list(set(Mainland_China['Province/State']))
# for i in china_list:
#     globals()[i.replace(' ','_')] = Mainland_China[Mainland_China['Province/State']==i]
#     globals()[i.replace(' ','_')] = globals()[i.replace(' ','_')].sort_values('Last Update', ascending=True)
#     globals()[i.replace(' ','_')].index = range(len(globals()[i.replace(' ','_')]))    

In [None]:
# for i in china_list:
#     c = corona(i.replace(' ','_'))
#     c.plot_confirmed()
#     c.plot_deaths_recovered()
#     c.plot_dc_rc()

## US

In [None]:
# us_list = list(set(US['Province/State']))
# for i in us_list:
#     globals()[i.replace(' ','_')] = US[US['Province/State']==i]
#     globals()[i.replace(' ','_')] = globals()[i.replace(' ','_')].sort_values('Last Update', ascending=True)
#     globals()[i.replace(' ','_')].index = range(len(globals()[i.replace(' ','_')]))    
#     print(i.replace(' ','_'))

In [None]:
# for i in us_list:
#     c = corona(i.replace(' ','_'))
#     c.plot_confirmed()
#     c.plot_deaths_recovered()
#     c.plot_dc_rc()

## Australia

In [None]:
# australia_list = list(set(Australia['Province/State']))
# for i in australia_list:
#     globals()[i.replace(' ','_')] = Australia[Australia['Province/State']==i]
#     globals()[i.replace(' ','_')] = globals()[i.replace(' ','_')].sort_values('Last Update', ascending=True)
#     globals()[i.replace(' ','_')].index = range(len(globals()[i.replace(' ','_')]))    
#     print(i.replace(' ','_'))

In [None]:
# for i in australia_list:
#     c = corona(i.replace(' ','_'))
#     c.plot_confirmed()
#     c.plot_deaths_recovered()
#     c.plot_dc_rc()

## Canada

- The data reported for Canada appear to be very inaccurate.

In [None]:
# canada_list = list(set(Canada['Province/State']))
# for i in canada_list:
#     globals()[i.replace(' ','_')] = Canada[Canada['Province/State']==i]
#     globals()[i.replace(' ','_')] = globals()[i.replace(' ','_')].sort_values('Last Update', ascending=True)
#     globals()[i.replace(' ','_')].index = range(len(globals()[i.replace(' ','_')]))    
#     print(i.replace(' ','_'))

In [None]:
# for i in canada_list:
#     c = corona(i.replace(' ','_'))
#     c.plot_confirmed()
#     c.plot_deaths_recovered()
#     c.plot_dc_rc()

## Comparison of the Number of Confirmed Cases by Country

In [None]:
# time_list = list(set(df['Last Update']))
# time_list = pd.DataFrame(time_list, columns=['Last Update']).sort_values('Last Update')
# time_list.index = range(len(time_list))

# total_list=world_list+china_list+us_list+australia_list+canada_list

In [None]:
# for i in total_list:
#     i=i.replace(' ','_')
#     time_list=pd.merge(time_list, globals()[i][['Last Update', 'Confirmed']], how='left', on='Last Update')
# time_list.columns=['Last Update']+total_list
# time_list=time_list.fillna(0)
# time_list.head()
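
The commented-out merge loop above builds a table with one row per report time and one 'Confirmed' column per region. A comparable table could be sketched with `pivot_table`. This assumes a long-format frame like the combined dataset, uses made-up toy numbers, and groups by 'Country/Region' only, whereas the original loop also includes provinces.

```python
import pandas as pd

# Toy rows (illustrative values only)
toy = pd.DataFrame({
    'Last Update': pd.to_datetime(
        ['2020-01-24 12:00', '2020-01-24 12:00', '2020-01-25 12:00']),
    'Country/Region': ['Japan', 'Thailand', 'Japan'],
    'Confirmed': [1, 4, 2],
})

# Rows: report times; columns: regions; missing reports are filled with 0
wide = (
    toy.pivot_table(index='Last Update', columns='Country/Region',
                    values='Confirmed', aggfunc='sum')
       .fillna(0)
       .sort_index()
)
print(wide)
```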

In [None]:
# df_china = time_list[['Last Update']+china_list]
# china_list2 = china_list.copy()
# china_list2.remove('Hubei')
# df_china2 = time_list[['Last Update']+china_list2]
# df_world = time_list[['Last Update']+world_list]
# df_us = time_list[['Last Update']+us_list]
# df_australia = time_list[['Last Update']+australia_list]
# df_canada = time_list[['Last Update']+canada_list]

## China

In [None]:
# c = corona('df_china')
# c.plot_world_confirmed()

## China without Hubei

In [None]:
# c = corona('df_china2')
# c.plot_world_confirmed()

## Most countries

In [None]:
# c = corona('df_world')
# c.plot_world_confirmed()

## US

In [None]:
# c = corona('df_us')
# c.plot_world_confirmed()

## Australia

In [None]:
# c = corona('df_australia')
# c.plot_world_confirmed()

## Canada

In [None]:
# c = corona('df_canada')
# c.plot_world_confirmed()