# Analysis of Novel Coronavirus 2019 

### Content
+ Introduction: COVID-19
+ Data description
+ Formulation of research question
+ Data preparation: cleaning and shaping

## 1. Introduction: COVID-19

COVID-19 (coronavirus disease 2019) is a respiratory illness caused by a novel coronavirus, initially referred to as 2019-nCoV, that was first detected in Wuhan, China. Early on, many of the patients in the Wuhan outbreak reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating that person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people.

There is an ongoing investigation to determine more about this outbreak. This is a rapidly evolving situation and information will be updated as it becomes available. The latest situation summary updates are available on CDC’s web page for COVID-19.

Source: https://www.cdc.gov/library/researchguides/2019NovelCoronavirus.html

## 2. Data description

This dataset contains daily information on the number of confirmed cases, deaths, and recoveries from the 2019 novel coronavirus. Note that this is time-series data, so the number of cases reported for any given day is the cumulative number up to that day.

The data is available from 22 Jan, 2020.
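
Because the counts are cumulative, daily increments have to be derived before day-to-day changes can be compared. The sketch below shows one way to do this with a per-region `diff()`. The toy frame, its region labels `A` and `B`, and the `NewConfirmed` column name are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy cumulative counts for two hypothetical regions (illustrative values only).
toy = pd.DataFrame({
    "ObservationDate": ["01/22/2020", "01/23/2020", "01/24/2020"] * 2,
    "Province/State": ["A"] * 3 + ["B"] * 3,
    "Confirmed": [1, 4, 9, 0, 2, 2],
})
toy["ObservationDate"] = pd.to_datetime(toy["ObservationDate"], format="%m/%d/%Y")
toy = toy.sort_values(["Province/State", "ObservationDate"])

# diff() within each region turns cumulative totals into daily increments.
# The first day of a region has no predecessor, so its cumulative value is kept.
toy["NewConfirmed"] = (
    toy.groupby("Province/State")["Confirmed"].diff().fillna(toy["Confirmed"])
)
print(toy)
```

Sorting by region and date first matters: `diff()` subtracts the previous row within each group, so rows must be in chronological order.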

The detailed description of the variables in this dataset is below:
+ SNo - Serial number
+ ObservationDate - Date of the observation in MM/DD/YYYY
+ Province/State - Province or state of the observation (empty when not available)
+ Country/Region - Country of observation
+ Last Update - Time in UTC at which the row was updated for the given province or country (the format is not standardised, so clean it before use)
+ Confirmed - Cumulative number of confirmed cases till that date
+ Deaths - Cumulative number of deaths till that date
+ Recovered - Cumulative number of recovered cases till that date
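
The two date columns need different handling: `ObservationDate` follows a single MM/DD/YYYY layout, while `Last Update` mixes layouts. A minimal sketch of one possible cleaning step follows. The `sample` frame is an assumption built from two timestamp styles that appear in the dataset, and parsing value by value is just one option (it is simple but slow on large frames).

```python
import pandas as pd

# Two timestamp styles that both occur in the Last Update column.
sample = pd.DataFrame({
    "ObservationDate": ["01/22/2020", "09/23/2020"],
    "Last Update": ["1/22/2020 17:00", "2020-09-24 04:23:38"],
})

# ObservationDate has one fixed layout, so an explicit format is safe.
sample["ObservationDate"] = pd.to_datetime(
    sample["ObservationDate"], format="%m/%d/%Y"
)

# Last Update mixes layouts; parsing each value on its own lets pandas
# infer the layout per value instead of assuming one for the whole column.
sample["Last Update"] = sample["Last Update"].apply(pd.to_datetime)

print(sample)
```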

## 3. Formulation of research question

Research questions for the project are below:
1. Analyze in which periods of time the coronavirus spread the most
2. Analyze the provinces and countries by the levels of infection
3. Analyze the ratio of the number of deaths to recovered cases
4. Analyze the daily change of confirmed cases by regions
5. Analyze the relation between variables in the dataset 
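
As an illustration of question 3, the ratio of deaths to recovered cases could be computed along the following lines. This is only a sketch: the toy values, the region labels, and the `DeathsPerRecovered` column name are assumptions. The one detail worth noting is the handling of regions with zero recoveries, where the ratio is undefined.

```python
import pandas as pd

# Hypothetical per-region totals (illustrative values only).
toy = pd.DataFrame({
    "Country/Region": ["X", "Y", "Z"],
    "Deaths": [10, 5, 0],
    "Recovered": [100, 0, 50],
})

# Mask zero recoveries as NaN so the ratio is NaN instead of infinity.
recovered = toy["Recovered"].where(toy["Recovered"] != 0)
toy["DeathsPerRecovered"] = toy["Deaths"] / recovered

print(toy)
```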

## 4. Data preparation: cleaning and shaping

### Exploring dataset

In [1]:
# import modules
import numpy as np
import pandas as pd

In [2]:
# import dataset
cov = pd.read_csv('covid_19_data.csv')

In [3]:
# number of rows, columns, data types, memory usage information
cov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116805 entries, 0 to 116804
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   SNo              116805 non-null  int64  
 1   ObservationDate  116805 non-null  object 
 2   Province/State   81452 non-null   object 
 3   Country/Region   116805 non-null  object 
 4   Last Update      116805 non-null  object 
 5   Confirmed        116805 non-null  float64
 6   Deaths           116805 non-null  float64
 7   Recovered        116805 non-null  float64
dtypes: float64(3), int64(1), object(4)
memory usage: 7.1+ MB


From this output we learn the following:
1. There are 116805 records and 8 columns.
2. Only the Province/State column has missing values: 81452 of its 116805 entries are non-null, so 35353 are missing.
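
The missing-value counts can also be computed directly rather than read off the `info()` output. A minimal sketch on a toy frame (the rows and labels are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two of the four have no province or state.
toy = pd.DataFrame({
    "Province/State": ["P1", np.nan, "P2", np.nan],
    "Country/Region": ["C1", "C2", "C1", "C3"],
    "Confirmed": [1.0, 2.0, 14.0, 2.0],
})

# isna() marks missing cells; summing counts them per column,
# and the mean gives the share of missing values per column.
missing_per_column = toy.isna().sum()
missing_share = toy.isna().mean()

print(missing_per_column)
print(missing_share)
```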

In [4]:
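# drop every row that contains a missing value; only Province/State has any,
# so this removes all rows recorded without a province or state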
cov = cov.dropna()
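
Dropping these rows also discards every observation reported without a province or state. If those rows are needed later, one alternative is to fill the gap with a placeholder instead. The sketch below contrasts the two options on a toy frame. The placeholder `"Unknown"` and all toy values are assumptions, and this is not the approach taken in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second has no province or state.
toy = pd.DataFrame({
    "Province/State": ["P1", np.nan],
    "Country/Region": ["C1", "C2"],
    "Confirmed": [444.0, 2.0],
})

# Option used in this notebook: drop rows with any missing value.
dropped = toy.dropna()

# Alternative: keep the rows and mark the missing province explicitly.
filled = toy.fillna({"Province/State": "Unknown"})

print(len(dropped), len(filled))
```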

In [5]:
# show dataset
cov

| | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| 2 | 3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| 3 | 4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 01/22/2020 | Gansu | Mainland China | 1/22/2020 17:00 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 116800 | 116801 | 09/23/2020 | Zaporizhia Oblast | Ukraine | 2020-09-24 04:23:38 | 3149.0 | 49.0 | 1158.0 |
| 116801 | 116802 | 09/23/2020 | Zeeland | Netherlands | 2020-09-24 04:23:38 | 1270.0 | 72.0 | 0.0 |
| 116802 | 116803 | 09/23/2020 | Zhejiang | Mainland China | 2020-09-24 04:23:38 | 1282.0 | 1.0 | 1272.0 |
| 116803 | 116804 | 09/23/2020 | Zhytomyr Oblast | Ukraine | 2020-09-24 04:23:38 | 5191.0 | 92.0 | 2853.0 |


In [6]:
# describe numerical columns
cov.describe()

| | SNo | Confirmed | Deaths | Recovered |
|---|---|---|---|---|
| count | 81452.0 | 81452.0 | 81452.0 | 81452.0 |
| mean | 63020.428056 | 19503.8 | 802.709105 | 11143.04 |
| std | 32304.171786 | 60445.57 | 2780.366795 | 70524.53 |
| min | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 37698.75 | 315.0 | 3.0 | 0.0 |
| 50% | 64318.5 | 2622.0 | 52.0 | 349.0 |
| 75% | 90562.25 | 10649.5 | 406.0 | 3655.25 |
| max | 116805.0 | 1242770.0 | 42072.0 | 2670256.0 |


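From the description above we can obtain:
- the number of non-null records in each column (`count`), which is 81452 after dropping rows with missing values; this is a row count, not the total number of cases;
- the mean value of each column;
- the min and max values.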
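
Since the counts are cumulative, overall totals are better read from the most recent observation date than from `describe()`. A minimal sketch on toy data follows. The values and region labels are assumptions, and it also assumes every region reports on the latest date.

```python
import pandas as pd

# Hypothetical cumulative counts for two regions over two days.
toy = pd.DataFrame({
    "ObservationDate": ["01/22/2020", "01/22/2020", "01/23/2020", "01/23/2020"],
    "Province/State": ["A", "B", "A", "B"],
    "Confirmed": [1, 0, 4, 2],
    "Deaths": [0, 0, 1, 0],
    "Recovered": [0, 0, 1, 1],
})
toy["ObservationDate"] = pd.to_datetime(toy["ObservationDate"], format="%m/%d/%Y")

# Keep only the rows of the latest date, then add up the regions.
latest = toy[toy["ObservationDate"] == toy["ObservationDate"].max()]
totals = latest[["Confirmed", "Deaths", "Recovered"]].sum()

print(totals)
```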

### Data cleaning and shaping

In [14]:
# check whether any NaN values remain after dropna()
df = pd.DataFrame(cov)
# rows that contain at least one NaN
print(df[df.isnull().any(axis=1)])
# number of such rows
print(df.isnull().any(axis=1).sum())

Empty DataFrame
Columns: [SNo, ObservationDate, Province/State, Country/Region, Last Update, Confirmed, Deaths, Recovered]
Index: []
0


We listed the rows that still contain missing values and counted them. The listing is empty and the count is 0, so no NaN values remain in the dataset.

In [15]:
# check if there are any duplicates
duplicateRows = df[df.duplicated()]
print(duplicateRows)
# count the duplicate rows; calling sum() on the rows would add up column values instead
print(len(duplicateRows))

Empty DataFrame
Columns: [SNo, ObservationDate, Province/State, Country/Region, Last Update, Confirmed, Deaths, Recovered]
Index: []
0


No duplicate rows were found in the dataset.
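
One caveat: `SNo` is a serial number and differs on every row, so a full-row comparison cannot flag two records as duplicates even if everything else matches. A stricter check compares only the columns that identify an observation. The sketch below assumes date, province, and country form such a key, and the toy rows are hypothetical.

```python
import pandas as pd

# Rows 0 and 1 describe the same observation but carry different serial numbers.
toy = pd.DataFrame({
    "SNo": [1, 2, 3],
    "ObservationDate": ["01/22/2020", "01/22/2020", "01/23/2020"],
    "Province/State": ["A", "A", "A"],
    "Country/Region": ["X", "X", "X"],
    "Confirmed": [1, 1, 4],
})

# Full-row comparison: SNo makes every row unique.
full_row_dupes = int(toy.duplicated().sum())

# Key-based comparison: ignore SNo and compare the identifying columns only.
key = ["ObservationDate", "Province/State", "Country/Region"]
key_dupes = int(toy.duplicated(subset=key).sum())

print(full_row_dupes, key_dupes)
```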

In [10]:
# counts of confirmed cases, deaths, and recoveries must be whole numbers
df['Confirmed'] = df['Confirmed'].astype(int)
df['Deaths'] = df['Deaths'].astype(int)
df['Recovered'] = df['Recovered'].astype(int)
df

| | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1 | 0 | 0 |
| 1 | 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14 | 0 | 0 |
| 2 | 3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6 | 0 | 0 |
| 3 | 4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1 | 0 | 0 |
| 4 | 5 | 01/22/2020 | Gansu | Mainland China | 1/22/2020 17:00 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 116800 | 116801 | 09/23/2020 | Zaporizhia Oblast | Ukraine | 2020-09-24 04:23:38 | 3149 | 49 | 1158 |
| 116801 | 116802 | 09/23/2020 | Zeeland | Netherlands | 2020-09-24 04:23:38 | 1270 | 72 | 0 |
| 116802 | 116803 | 09/23/2020 | Zhejiang | Mainland China | 2020-09-24 04:23:38 | 1282 | 1 | 1272 |
| 116803 | 116804 | 09/23/2020 | Zhytomyr Oblast | Ukraine | 2020-09-24 04:23:38 | 5191 | 92 | 2853 |
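
Casting with `astype(int)` truncates any fractional part silently and raises an error on NaN, so it can be worth confirming that the float columns really hold whole numbers before converting. A minimal sketch of such a guard, on toy values that are assumptions for illustration:

```python
import pandas as pd

# Hypothetical float counts, as they look right after loading the CSV.
toy = pd.DataFrame({
    "Confirmed": [1.0, 14.0],
    "Deaths": [0.0, 0.0],
    "Recovered": [0.0, 2.0],
})
count_cols = ["Confirmed", "Deaths", "Recovered"]

# A value is whole if its remainder modulo 1 is zero; NaN fails this test.
is_whole = bool((toy[count_cols] % 1 == 0).all().all())

# Convert only when nothing would be truncated.
if is_whole:
    toy = toy.astype({col: "int64" for col in count_cols})

print(toy.dtypes)
```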
