NCOV 2019
===

The outbreak of the novel coronavirus (abbreviated as NCoV) in late 2019 is an important event that affects data science widely and deeply. From basic data collection and washing, through analysis and model fitting, to case estimation, almost all the knowledge and techniques of data science are being used to explore what is happening and when the epidemic could be brought under control.

Steps
---
The main goals of this talk are how to get the open data, how to do the data-washing work, and how to make simple visualizations of the data.
1. Open the [JHU Main Page](https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html?fbclid=IwAR2mWEw0X_B5jbR0Fm23t2TVJGzVqUY6ok98DzrGLMrMXCR_c5joZV5AdNU#/bda7594740fd40299423467b48e9ecf6), open Google Drive, and save the data (a Google Sheet) to Google Drive; the data is updated every day.<br>
**Note**. 
  - After WHO named the disease "COVID-19", the updated JHU data can be retrieved from the [JHU dashboard](https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6) and from [github](https://github.com/CSSEGISandData/COVID-19).
  - Since 2020/02/17, some "NaN" values in the JHU time-series data have been replaced by 0. 
- Data structure:
  - create a sub-directory named `t`; enter it and create another sub-folder named `tmp` inside it.
      - download the JHU data and put it next to folder `t`, in the same parent folder, so that the file structure is as follows:
        ```
        COVID-19-master/
           csse_covid_19_data/
           ...        
        t/
           NCov-1.ipynb
           ...
           tmp/
        ```   
  - ToDo.
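
The folder layout above can also be created from Python instead of by hand. The following is a minimal sketch; the helper name `make_workspace` is hypothetical, and the demo runs inside a temporary folder so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def make_workspace(root="."):
    """Create t/ and t/tmp/ under root; existing folders are left untouched."""
    tmp = Path(root) / "t" / "tmp"
    tmp.mkdir(parents=True, exist_ok=True)
    return tmp


# demo in a throw-away folder (hypothetical location, removed afterwards)
with tempfile.TemporaryDirectory() as root:
    tmp_dir = make_workspace(root)
    print(tmp_dir.is_dir())
```

For the lecture itself, calling `make_workspace()` from the folder that holds `COVID-19-master/` should give the layout shown above.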

In [None]:
# list the data folders available
!ls ../COVID-19-master/*/

In [None]:
# after 2020/02/19
!ls ../COVID-19-master/csse_covid_19_data/csse_covid_19_time_series

Data Manipulation Tool, Pandas
---
Put simply, time series data consists of data points attached to sequential time stamps; the NCoV-2019 data, recorded day by day, is a classic case. `Pandas` was created by `Wes McKinney` to provide an efficient and flexible tool for working with data sets, including time-series data.
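
As a small illustration of what "data points attached to sequential time stamps" means in Pandas, here is a sketch with a daily series. The numbers are made up for the example and are not taken from the JHU data.

```python
import pandas as pd

# hypothetical daily counts, not real data
dates = pd.date_range("2020-01-22", periods=4, freq="D")
cases = pd.Series([1, 3, 6, 10], index=dates, name="cases")

# the index is a DatetimeIndex, so date-based selection works
first_two_days = cases.loc["2020-01-22":"2020-01-23"]

# day-over-day increase; the first day has no predecessor and becomes NaN
daily_new = cases.diff()

print(first_two_days)
print(daily_new)
```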


Up to now, the data collected are full of uncertainty and noise; thus, we only survey the data from Johns Hopkins University, which we regard as the most reliable set. The data has been cleaned so that it is more readable and easier to process further. In the sub-folder *time_series*, the data has also been divided into three categories, Confirmed, Deaths, and Recovered, named for what they represent.

In [None]:
import pandas as pd

In [None]:
pd.__version__

In [None]:
# extract the confirmed data for instance:
csvfile="../COVID-19-master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"

df=pd.read_csv(csvfile,index_col='Province/State')

CSV Format
---
The data is in plain-text format, and the name says what it looks like: `Comma Separated Values` (also known as CSV):
<img src="../imgs/csv.png" width=90% />
1. the first line describes what the data are: the features are separated by commas;
- from the second line on, each case is recorded line by line.
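
To see the two parts of a CSV file without Pandas, the sketch below reads a few made-up lines with Python's standard `csv` module. The column names mimic the JHU file, but the rows and numbers are invented for illustration.

```python
import csv
import io

# made-up rows that mimic the layout of the JHU file; the numbers are not real
text = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
Anhui,Mainland China,31.8,117.2,1,9
,Thailand,15.0,101.0,2,3
"""

reader = csv.reader(io.StringIO(text))
header = next(reader)   # first line: the feature names
rows = list(reader)     # remaining lines: one record per line

print(header)
print(rows)
```

Note that the second record has an empty `Province/State` field. The `csv` module reads it as an empty string, whereas `pd.read_csv` turns such empty fields into `NaN` by default.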

In [None]:
# the last five rows of the data
df.tail()

In [None]:
# summary of the DataFrame
df.info()

In [None]:
df.describe()

**{*Note*}, Before 2020/02/19** 
   
   In brief, the first three features are location data, and the rest are the numbers of occurrences; obviously, the numbers should be in integer format, and `NaN` means that no occurrence was reported. 

**After** 
   
   The "NaN" data has been corrected. 

As usual, let us clean the data so that it is reasonable:

In [None]:
df[df.index.isnull()]

ToDo's
---
1. Obviously, the cases that occurred on the Diamond Princess are a big problem: while the ship is docked, all of its cases are counted as Japanese occurrences, Japan being the host of the cruise ship; but it is unclear whether travelers should be delisted from Japan after they return to their own countries.
- 

In [None]:
# retrieve the dates from the dataframe
dates=list(df.columns[3:])

In [None]:
# some data is null, (with NaN):
df.iloc[:2,3:]

In [None]:
# there are some NaN in index, i.e. `Province/State`
df.tail(2)

## Data Washing
1. replace NaN values with 0 (no longer required);
- convert the numbers of occurrences from float to integer (no longer required);
- replace the `Province/State` index with a `Date` index.
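
Although the first two steps are no longer required for the current files, they can be sketched on a toy table as follows. The data is hypothetical; `fillna` and `astype` are standard Pandas methods.

```python
import numpy as np
import pandas as pd

# hypothetical counts with a missing value, not real data
toy = pd.DataFrame(
    {"1/22/20": [1.0, np.nan], "1/23/20": [9.0, 3.0]},
    index=["Anhui", "Thailand"],
)

# step 1: treat a missing report as zero occurrences
filled = toy.fillna(0)

# step 2: counts are whole numbers, so convert float to integer
counts = filled.astype(int)

print(counts)
print(counts.dtypes)
```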

In [None]:
import numpy as np

In [None]:
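# preview: use Country/Region wherever the Province/State index is NaN;
# assign() returns a new DataFrame, so df itself is left unchanged
ps=np.where(df.index.isnull(),df['Country/Region'],df.index)
df.assign(PS=ps).set_index('PS')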

In [None]:
# Replace the "NaN" entries in the index
def index_fillna(df_,f='Country/Region'):
    """
    replace the NaN in the index by the value in the f-feature
    """
    # duplicate DataFrame, so the input is not modified
    df1=df_.copy()
    # make a new feature: the f-feature where the index is NaN, else the index
    df1['new']=np.where(df1.index.isnull(),df1[f],df1.index)

    # reset index with non-NaN values
    df1.index=df1['new'].values
    # delete the new-feature
    df1=df1.drop('new', axis=1)
    df1.index.name='Province/State'
    return df1

In [None]:
# correct "NaN" index
df_s=index_fillna(df)
df_s

## Transpose the data

**Replace the `Province/State` index with a `Date` index**

Generally, time-series data in Pandas uses the index to represent the datetime. 

Thus, transpose the data: the index (cities) becomes the columns, and conversely the columns (dates) become the index.
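
On a toy table, the idea looks like the sketch below. The counts are hypothetical, and the date labels are assumed to follow the month/day/two-digit-year pattern (such as `1/22/20`) used for the column names in the JHU time-series files.

```python
import pandas as pd

# hypothetical counts, not real data; rows are regions, columns are dates
wide = pd.DataFrame(
    {"1/22/20": [1, 14], "1/23/20": [9, 22]},
    index=["Anhui", "Beijing"],
)

# swap rows and columns: dates become the index, regions become columns
ts = wide.transpose()

# the date labels are still strings; parse them as month/day/two-digit year
ts.index = pd.to_datetime(ts.index, format="%m/%d/%y")
ts.index.name = "Date"

print(ts)
```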


In [None]:
# Cities (provinces/states), taken from the corrected index
cities=list(df_s.index)

In [None]:
# transpose, then drop the first three rows (location features, not dates)
df_f=df_s.transpose().iloc[3:,:]

In [None]:
df_f.index

Although the index looks like dates, pandas loaded it with `object` dtype rather than datetime; correct this:

In [None]:
df_f.index = pd.to_datetime(df_f.index)

In [None]:
# rename the Index's name
df_f.index.name='Date'

In [None]:
df_f.index

Visualization
---
Python provides many versatile visualization packages, from static plots to animation; even interactive functionality is supported.

In [None]:
# set the display style with matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [None]:
plt.figure(figsize=(12,6))
plt.title("NCOVID-19",size=16)
for r in ['Anhui','Beijing']:
   plt.plot(df_f[r],label=r)
plt.legend()
plt.xticks(rotation=45);

Simpler...
---

In [None]:
df_f[['Anhui','Beijing']].plot(figsize=[12,6])

Practices and Exercises
---
1. Get the latest NCoV data. 