<h1>Novel Coronavirus 2019-nCoV Data Preprocessing</h1>


# Introduction

This Notebook performs only data preprocessing on the daily-updated coronavirus dataset.

We will perform the following operations:
* Check missing data; perform missing data imputation, if needed;
* Check last update of the daily data;
* Check multiple country names; fix the multiple country names, where needed;
* Check multiple province/state;
* Check whether a name appears both as a Country/Region and as a Province/State;
* Deep-dive into US states, cities, counties and unidentified places;
* Export curated data.

## Load packages & data

In [1]:
import numpy as np
import pandas as pd
import datetime as dt
data_df = pd.read_csv("..//input//novel-corona-virus-2019-dataset//covid_19_data.csv")

## Glimpse the data

In [2]:
print(f"Data: rows: {data_df.shape[0]}, cols: {data_df.shape[1]}")
print(f"Data columns: {list(data_df.columns)}")

print(f"Days: {data_df.ObservationDate.nunique()} ({data_df.ObservationDate.min()} : {data_df.ObservationDate.max()})")
print(f"Country/Region: {data_df['Country/Region'].nunique()}")
print(f"Province/State: {data_df['Province/State'].nunique()}")
print(f"Confirmed all: {sum(data_df.groupby(['Province/State'])['Confirmed'].max())}")
print(f"Recovered all: {sum(data_df.loc[~data_df.Recovered.isna()].groupby(['Province/State'])['Recovered'].max())}")
print(f"Deaths all: {sum(data_df.loc[~data_df.Deaths.isna()].groupby(['Province/State'])['Deaths'].max())}")

print(f"Diagnosis: days since last update: {(dt.datetime.now() - dt.datetime.strptime(data_df.ObservationDate.max(), '%m/%d/%y')).days} ")

Data: rows: 4935, cols: 8
Data columns: ['SNo', 'ObservationDate', 'Province/State', 'Country/Region', 'Last Update', 'Confirmed', 'Deaths', 'Recovered']
Days: 50 (01/22/2020 : 03/11/20)
Country/Region: 128
Province/State: 251
Confirmed all: 87141
Recovered all: 62073
Deaths all: 3289
Diagnosis: days since last update: 3 


Comment: the dataset had not been updated for a few days at the time this Notebook was last run.
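
The output above shows two date layouts in `ObservationDate` (`01/22/2020` and `03/11/20`), so `min()` and `max()` on the raw strings are lexical comparisons rather than date comparisons. The following is a minimal sketch, assuming these are the only two layouts in the column; the `sample` series is toy data, not the real file.

```python
import pandas as pd

# Toy sample mimicking the two date layouts seen in the output above.
sample = pd.Series(["01/22/2020", "02/15/2020", "03/11/20"])

# Try the four-digit-year layout first; rows that do not match become NaT.
parsed = pd.to_datetime(sample, format="%m/%d/%Y", errors="coerce")

# Fill the remaining rows using the two-digit-year layout.
parsed = parsed.fillna(pd.to_datetime(sample, format="%m/%d/%y", errors="coerce"))

print(parsed.min().date(), parsed.max().date())
```

Any value still `NaT` after both passes would point to a third layout that this sketch does not handle.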

In [3]:
data_df.head()

| | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1 | 0 | 0 |
| 1 | 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14 | 0 | 0 |
| 2 | 3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6 | 0 | 0 |
| 3 | 4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1 | 0 | 0 |
| 4 | 5 | 01/22/2020 | Gansu | Mainland China | 1/22/2020 17:00 | 0 | 0 | 0 |


In [4]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 8 columns):
SNo                4935 non-null int64
ObservationDate    4935 non-null object
Province/State     3120 non-null object
Country/Region     4935 non-null object
Last Update        4935 non-null object
Confirmed          4935 non-null int64
Deaths             4935 non-null int64
Recovered          4935 non-null int64
dtypes: int64(4), object(4)
memory usage: 308.6+ KB


There are no missing data other than in `Province/State`, which makes sense, since for some Countries/Regions only Country/Region-level data is available.
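
One consequence of the missing `Province/State` values is worth keeping in mind: by default, pandas `groupby` drops rows whose key is missing, so totals grouped on `Province/State` alone skip the country-level rows. The sketch below illustrates this on toy data. It assumes counts are cumulative, so that the per-location maximum is the latest figure, and it uses the country name as a stand-in location label, which is an illustrative choice rather than something the dataset prescribes.

```python
import pandas as pd

# Toy data: one province with two daily rows, one country without province info.
toy = pd.DataFrame({
    "Province/State": ["Hubei", "Hubei", None, None],
    "Country/Region": ["Mainland China", "Mainland China", "Italy", "Italy"],
    "Confirmed": [10, 25, 3, 7],
})

# Grouping on Province/State alone silently drops the rows with a missing key.
by_province_only = toy.groupby("Province/State")["Confirmed"].max().sum()

# Fall back to the country name when no province is given, then group on both keys.
location = toy["Province/State"].fillna(toy["Country/Region"])
by_location = toy.groupby([toy["Country/Region"], location])["Confirmed"].max().sum()

print(by_province_only, by_location)
```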

## Check duplicate country names

In [5]:
country_sorted = list(data_df['Country/Region'].unique())
country_sorted.sort()
print(country_sorted)

[' Azerbaijan', "('St. Martin',)", 'Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Belarus', 'Belgium', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Cambodia', 'Cameroon', 'Canada', 'Channel Islands', 'Chile', 'Colombia', 'Congo (Kinshasa)', 'Costa Rica', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Dominican Republic', 'Ecuador', 'Egypt', 'Estonia', 'Faroe Islands', 'Finland', 'France', 'French Guiana', 'Georgia', 'Germany', 'Gibraltar', 'Greece', 'Holy See', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan', 'Kuwait', 'Latvia', 'Lebanon', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macau', 'Mainland China', 'Malaysia', 'Maldives', 'Malta', 'Martinique', 'Mexico', 'Moldova', 'Monaco', 'Mongolia', 'Morocco', 'Nepal', 'Netherlands', 'Ne

<font color='red'>Comment</font>: we can observe that a few countries appear under more than one name, as follows:

* ` Azerbaijan` & `Azerbaijan`;
* `Holy See` & `Vatican City`;
* `Ireland` & `Republic of Ireland`;
* `St. Martin` & `('St. Martin',)`.

For `UK` & `North Ireland` we will need a clarification, since in principle `North Ireland` is part of the `UK`.

## Fix duplicated country names

In [6]:
data_df.loc[data_df['Country/Region']=='Holy See', 'Country/Region'] = 'Vatican City'
data_df.loc[data_df['Country/Region']==' Azerbaijan', 'Country/Region'] = 'Azerbaijan'
data_df.loc[data_df['Country/Region']=='Republic of Ireland', 'Country/Region'] = 'Ireland'
data_df.loc[data_df['Country/Region']=="('St. Martin',)", 'Country/Region'] = 'St. Martin'
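
As an alternative to one assignment per name, the same fixes can be expressed as a lookup table. This is a minimal sketch on toy data: `country_fixes` is a hypothetical name, and the leading-space variant is handled by stripping whitespace instead of by an explicit entry.

```python
import pandas as pd

# Toy column reproducing the variants found above.
toy = pd.DataFrame({"Country/Region": [
    " Azerbaijan", "Azerbaijan", "Holy See",
    "Republic of Ireland", "('St. Martin',)", "Italy",
]})

# Hypothetical mapping from variant spelling to the chosen canonical name.
country_fixes = {
    "Holy See": "Vatican City",
    "Republic of Ireland": "Ireland",
    "('St. Martin',)": "St. Martin",
}

# Strip surrounding whitespace first, then replace exact matches from the mapping.
cleaned = toy["Country/Region"].str.strip().replace(country_fixes)
print(sorted(cleaned.unique()))
```

Keeping the mapping in one dictionary makes it easier to extend when new variants show up in later daily updates.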

## Check duplicate Province/State names

In [7]:
province_sorted = list(data_df.loc[~data_df['Province/State'].isna(), 'Province/State'].unique())
province_sorted.sort()
print(province_sorted)

[' Montreal, QC', ' Norfolk County, MA', 'Alameda County, CA', 'Alaska', 'Alberta', 'Anhui', 'Arizona', 'Arkansas', 'Ashland, NE', 'Bavaria', 'Beijing', 'Bennington County, VT', 'Bergen County, NJ', 'Berkeley, CA', 'Berkshire County, MA', 'Boston, MA', 'British Columbia', 'Broward County, FL', 'Calgary, Alberta', 'California', 'Carver County, MN', 'Channel Islands', 'Charleston County, SC', 'Charlotte County, FL', 'Chatham County, NC', 'Cherokee County, GA', 'Chicago', 'Chicago, IL', 'Chongqing', 'Clark County, NV', 'Clark County, WA', 'Cobb County, GA', 'Collin County, TX', 'Colorado', 'Connecticut', 'Contra Costa County, CA', 'Cook County, IL', 'Cruise Ship', 'Davidson County, TN', 'Davis County, UT', 'Delaware', 'Delaware County, PA', 'Denmark', 'Denver County, CO', 'Diamond Princess cruise ship', 'District of Columbia', 'Douglas County, CO', 'Douglas County, NE', 'Douglas County, OR', 'Edmonton, Alberta', 'El Paso County, CO', 'Fairfax County, VA', 'Fairfield County, CT', 'Faroe Is

<font color='red'>Comment</font>: we can observe that a few provinces have duplicate names. In addition, US data is reported at both county and state level, and China appears with both provinces and independent territories. Here are just a few examples:

* ' Norfolk County, MA' & 'Norfolk County, MA' - duplicate county name;
* 'Providence County, RI' & 'Providence, RI' - duplicate US county name;
* 'France' - country name;
* 'Washington' & 'Washington, D.C.' & 'District of Columbia' - duplicate state name?
* 'Clark County, WA' & 'Washington' (state)?
* 'New York', 'New York City, NY', 'New York County, NY' - possible duplicates for NYC?
* 'King County, WA', 'Kittitas County, WA' but also 'Washington' (state)?
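
Names that differ only by surrounding whitespace, such as the Norfolk County pair, can be found mechanically. The following is a small sketch on toy data; the variable names are illustrative.

```python
import pandas as pd

# Toy values including a leading-space and a trailing-space variant.
toy_provinces = pd.Series([
    " Norfolk County, MA", "Norfolk County, MA", "Jackson County, OR ", "Alaska",
])

# Group the raw spellings by their whitespace-stripped form.
stripped = toy_provinces.str.strip()
variants = toy_provinces.groupby(stripped).unique()

# Keep only the names that occur under more than one raw spelling.
whitespace_dupes = {name: list(raw) for name, raw in variants.items() if len(raw) > 1}
print(whitespace_dupes)
```

An entry such as `'Jackson County, OR '` has stray whitespace but only one spelling, so it is not reported here; applying `str.strip()` to the whole column would clean both cases.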

There are multiple attributions of `None` or `Unassigned Location`: `Unassigned Location (From Diamond Princess)`, `Unassigned Location, VT`, `Unassigned Location, WA`, `Unknown Location, MA`.

There are multiple mentions of `from Diamond Princess`. Let's list them as well:


In [8]:
diamond_list = list(data_df.loc[data_df['Province/State'].str.contains("Diamond", na=False), 'Province/State'].unique())
diamond_list.sort()
print(diamond_list)

['Diamond Princess cruise ship', 'From Diamond Princess', 'Lackland, TX (From Diamond Princess)', 'Omaha, NE (From Diamond Princess)', 'Travis, CA (From Diamond Princess)', 'Unassigned Location (From Diamond Princess)']
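
One possible way to handle these entries is to move the cruise-ship origin into a separate flag and keep the plain location name. This is a sketch under that assumption, on toy data; the names `from_dp` and `base` are hypothetical and the original Notebook does not perform this step.

```python
import pandas as pd

# Toy values taken from the list above, plus one entry without the suffix.
toy = pd.Series([
    "Lackland, TX (From Diamond Princess)",
    "Omaha, NE (From Diamond Princess)",
    "Travis, CA",
    "From Diamond Princess",
])

# Flag every entry that mentions the ship as the origin of the cases.
from_dp = toy.str.contains("From Diamond Princess", regex=False)

# Remove only the parenthesised suffix, leaving the location itself.
base = toy.str.replace(r"\s*\(From Diamond Princess\)\s*$", "", regex=True)

print(pd.DataFrame({"base": base, "from_dp": from_dp}))
```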


## Check Country/Region & Province/State intersection


We now check whether a territory is marked both as a Country/Region and as a Province/State.

In [9]:
province_ = list(data_df.loc[~data_df['Province/State'].isna(), 'Province/State'].unique())
country_ = list(data_df['Country/Region'].unique())

common_province_country = set(province_) & set(country_)
print(common_province_country)

{'Taiwan', 'Hong Kong', 'Faroe Islands', 'Denmark', 'Georgia', 'France', 'Channel Islands', 'Saint Barthelemy', 'UK', 'Gibraltar', 'Macau'}


Let's now check which country is recorded for each province in this common list.

In [10]:
for province in list(common_province_country):
    country_list = list(data_df.loc[data_df['Province/State']==province, 'Country/Region'].unique())
    print(province, country_list)

Taiwan ['Taiwan']
Hong Kong ['Hong Kong', 'Mainland China']
Faroe Islands ['Denmark']
Denmark ['Denmark']
Georgia ['US']
France ['France']
Channel Islands ['UK']
Saint Barthelemy ['France']
UK ['UK']
Gibraltar ['UK']
Macau ['Macau', 'Mainland China']


The provinces and countries list should be interpreted as follows:


* Macau and Hong Kong appear both as independent countries and as part of Mainland China; this is inconsistent.
* France & Saint Barthelemy appear as provinces of France, but Saint Barthelemy also appears as an independent country. This should probably be fixed by recording Saint Barthelemy as a Province/State of France wherever it appears as a Country/Region.
* UK, Gibraltar & Channel Islands appear both as countries and as part of the UK. This should be corrected by recording Gibraltar & Channel Islands as part of the UK wherever they appear as a Country/Region;
* Faroe Islands appears both as a country and as part of Denmark. It should be recorded only as a Province/State;
* Georgia is both a US state and an independent country. This is not an error.
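
The corrections proposed for Saint Barthelemy, Gibraltar, Channel Islands and Faroe Islands could be applied along the following lines. This is only a sketch on toy data: `parent_country` is a hypothetical mapping, and Macau and Hong Kong are left out because no fix is specified for them above.

```python
import pandas as pd

# Toy rows: territories listed sometimes as a country, sometimes as a province.
toy = pd.DataFrame({
    "Province/State": [None, "Saint Barthelemy", None, "Gibraltar", None],
    "Country/Region": ["Saint Barthelemy", "France", "Gibraltar", "UK", "Faroe Islands"],
})

# Hypothetical mapping from territory name to the country it should sit under.
parent_country = {
    "Saint Barthelemy": "France",
    "Gibraltar": "UK",
    "Channel Islands": "UK",
    "Faroe Islands": "Denmark",
}

# Rows where the territory is recorded as a Country/Region.
is_territory = toy["Country/Region"].isin(list(parent_country))

# Move the territory name to Province/State, then set the parent country.
toy.loc[is_territory, "Province/State"] = toy.loc[is_territory, "Country/Region"]
toy.loc[is_territory, "Country/Region"] = toy.loc[is_territory, "Country/Region"].map(parent_country)

print(toy)
```

After such a change the same territory and date may occur in more than one row, so a deduplication or aggregation step would probably be needed before computing totals.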

## Check US states & counties

For the US we have data at both state level and county level.   
This can be misleading when building statistics, since we do not know, for example, whether the figures for Washington (state) also include the data from King County, WA (the county in Washington state that contains Seattle).  


Let's first check the list of US counties.

In [11]:
counties_us = list(data_df.loc[(~data_df['Province/State'].isna()) & \
                               data_df['Province/State'].str.contains("County,", na=False) &\
                               (data_df['Country/Region']=='US'), 'Province/State'].unique())
counties_us.sort()
print(counties_us)

[' Norfolk County, MA', 'Alameda County, CA', 'Bennington County, VT', 'Bergen County, NJ', 'Berkshire County, MA', 'Broward County, FL', 'Carver County, MN', 'Charleston County, SC', 'Charlotte County, FL', 'Chatham County, NC', 'Cherokee County, GA', 'Clark County, NV', 'Clark County, WA', 'Cobb County, GA', 'Collin County, TX', 'Contra Costa County, CA', 'Cook County, IL', 'Davidson County, TN', 'Davis County, UT', 'Delaware County, PA', 'Denver County, CO', 'Douglas County, CO', 'Douglas County, NE', 'Douglas County, OR', 'El Paso County, CO', 'Fairfax County, VA', 'Fairfield County, CT', 'Fayette County, KY', 'Floyd County, GA', 'Fort Bend County, TX', 'Fresno County, CA', 'Fulton County, GA', 'Grafton County, NH', 'Grant County, WA', 'Harford County, MD', 'Harris County, TX', 'Harrison County, KY', 'Hendricks County, IN', 'Honolulu County, HI', 'Hudson County, NJ', 'Humboldt County, CA', 'Jackson County, OR ', 'Jefferson County, KY', 'Jefferson County, WA', 'Johnson County, IA', 

Let's now also check the list of locations that are neither counties nor state names.

In [12]:
cities_places_us = list(data_df.loc[(~data_df['Province/State'].isna()) & \
                               (~data_df['Province/State'].str.contains("County,", na=False)) &\
                               (data_df['Province/State'].str.contains(",", na=False)) &\
                               (data_df['Country/Region']=='US'), 'Province/State'].unique())
cities_places_us.sort()
print(cities_places_us)

['Ashland, NE', 'Berkeley, CA', 'Boston, MA', 'Chicago, IL', 'Hillsborough, FL', 'Jefferson Parish, LA', 'Lackland, TX', 'Lackland, TX (From Diamond Princess)', 'Los Angeles, CA', 'Madison, WI', 'New York City, NY', 'Omaha, NE (From Diamond Princess)', 'Orange, CA', 'Portland, OR', 'Providence, RI', 'San Antonio, TX', 'San Benito, CA', 'San Mateo, CA', 'Santa Clara, CA', 'Sarasota, FL', 'Seattle, WA', 'Tempe, AZ', 'Travis, CA', 'Travis, CA (From Diamond Princess)', 'Umatilla, OR', 'Unassigned Location, VT', 'Unassigned Location, WA', 'Unknown Location, MA', 'Washington, D.C.']


A few entries are not actual places, as follows: `Lackland, TX (From Diamond Princess)`, `Omaha, NE (From Diamond Princess)`, `Unassigned Location, VT`, `Unassigned Location, WA`, `Unknown Location, MA`.
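
To relate county and city entries to their state, the two-letter state code after the comma can be extracted. This is a minimal sketch on toy data, assuming the `Place, ST` pattern seen in the lists above; the column names `place` and `state_code` are illustrative.

```python
import pandas as pd

# Toy values covering counties, a Diamond Princess entry, D.C. and a plain state.
toy = pd.Series([
    "King County, WA", " Norfolk County, MA", "Jackson County, OR ",
    "Lackland, TX (From Diamond Princess)", "Washington, D.C.", "California",
])

# Capture the text before the first comma and the two capital letters after it.
pattern = r"^(?P<place>.+?),\s*(?P<state_code>[A-Z]{2})\b"
parts = toy.str.strip().str.extract(pattern)

print(parts)
```

Entries that do not follow the pattern, such as `Washington, D.C.` or plain state names, yield missing values and would need separate handling.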

Let's now check the state names.

In [13]:
states_us = list(data_df.loc[(~data_df['Province/State'].isna()) & \
                               (~data_df['Province/State'].str.contains("County,", na=False)) &\
                               (~data_df['Province/State'].str.contains(",", na=False)) &\
                               (data_df['Country/Region']=='US'), 'Province/State'].unique())
states_us.sort()
print(states_us)
print(len(states_us))

['Alaska', 'Arizona', 'Arkansas', 'California', 'Chicago', 'Colorado', 'Connecticut', 'Delaware', 'Diamond Princess cruise ship', 'District of Columbia', 'Florida', 'Georgia', 'Grand Princess', 'Grand Princess Cruise Ship', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Unassigned Location (From Diamond Princess)', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
55


There are a few items here that are not states: `Chicago` (a city in Illinois), `Diamond Princess cruise ship`, `Grand Princess` and `Grand Princess Cruise Ship` (a cruise ship listed under two names), and `Unassigned Location (From Diamond Princess)`.
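
Some of these non-state labels can be flagged with a simple keyword match. The sketch below uses toy data, and the keyword list is an assumption based only on the names printed above; it will not catch cities such as `Chicago`, which would need an explicit reference list of state names.

```python
import pandas as pd

# Toy subset of the labels printed above.
labels = pd.Series([
    "Alaska", "Chicago", "Diamond Princess cruise ship", "Grand Princess",
    "Grand Princess Cruise Ship", "Unassigned Location (From Diamond Princess)",
    "Washington",
])

# Case-insensitive match on keywords that mark ships or unassigned locations.
not_a_place = labels.str.contains(
    r"princess|cruise ship|unassigned|unknown", case=False, regex=True
)

print(labels[not_a_place].tolist())
```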

# Export the data

We will export the curated data.

In [14]:
data_df.to_csv("covid_19_data.csv", index=False)