<a id="top"/>


### Back to [Home](FinalProjectReport.ipynb)


# COVID-19 DATA PROCESSING

>The content of this section is:
>1. [Choosing and Obtaining data](#obtain) 
>1. [Data processing](#proc) 
>1. [Dataset's stats](#info)

>In this section we perform the initial preprocessing of the data, bringing it into a shape suited to further analysis. We will show:
* where we obtain our data
* how we convert the data to a form that is easier to use

>Here we import the modules needed to download and manipulate the data.

In [1]:
#!pip install geopy
import pandas as pd
from geopy.geocoders import Nominatim
import numpy as np
import pickle
import json
import csv
from collections import Counter, defaultdict, deque
from datetime import datetime, timedelta
import urllib.request  # the submodule must be imported explicitly before calling urllib.request.urlopen
geolocator = Nominatim(user_agent="covid19")

<a id="obtain"/>

# COVID-19 Data

### Choosing the dataset and obtaining it
> We included COVID-19 data in order to have a complete and up-to-date overview of the disease while the major economic events of the 2020 financial crisis unfolded. The purpose is to allow users to correlate the spread of the virus with the financial disparity and draw their own conclusions regarding market sentiment.

> We obtained the data directly from the European Centre for Disease Prevention and Control, an agency of the European Union ([link](https://www.ecdc.europa.eu/en)). The downloadable data file is updated daily and contains the latest available public data on COVID-19. Each row contains the number of new cases and deaths reported on one day for one country. [**Folder**](https://opendata.ecdc.europa.eu/covid19/casedistribution/)

In [2]:
with urllib.request.urlopen("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv") as url:
    europe_data = pd.read_csv(url)

In [3]:
europe_data.head(1)

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
0,14/05/2020,14,5,2020,259,3,Afghanistan,AF,AFG,37172386.0,Asia


### Get latitude and longitude of the countries

In [4]:
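def get_lat_lon_countries(countries: list = None) -> dict:
    """Return {country: (lat, lon)}, read from a local pickle cache when it exists."""
    try:
        with open('countries_lat_lon.pickle', 'rb') as handle:
            d = pickle.load(handle)
    except FileNotFoundError:
        d = dict()
        problems = []  # countries that could not be geocoded
        for c in countries or []:
            try:
                print(c)
                loc = geolocator.geocode(c.replace("_", " "))
                d[c] = (loc.latitude, loc.longitude)
            except Exception:
                # geocode() returns None for unknown names and may raise on network errors
                problems.append(c)
        if problems:
            print("Not geocoded:", ", ".join(problems))
        with open('countries_lat_lon.pickle', 'wb') as handle:
            pickle.dump(d, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return d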
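
The function above follows a cache-or-compute pattern: the coordinates are read from a pickle file if one exists, and otherwise looked up once and saved. The sketch below isolates that pattern so it can be run offline. `fake_geocode`, `cached_lat_lon`, the coordinates and the temporary file name are all hypothetical stand-ins, not part of the notebook or of geopy.

```python
import os
import pickle
import tempfile

def fake_geocode(name):
    """Hypothetical stand-in for the geocoder; returns None for unknown names."""
    known = {"Italy": (42.6, 12.7), "French Polynesia": (-16.0, -146.0)}
    return known.get(name)

def cached_lat_lon(countries, cache_path):
    """Load coordinates from cache_path if it exists, otherwise look them up and save them."""
    try:
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        coords = {}
        for c in countries:
            loc = fake_geocode(c.replace("_", " "))
            if loc is not None:  # skip names that could not be resolved
                coords[c] = loc
        with open(cache_path, "wb") as handle:
            pickle.dump(coords, handle, protocol=pickle.HIGHEST_PROTOCOL)
        return coords

cache_file = os.path.join(tempfile.mkdtemp(), "coords.pickle")
first = cached_lat_lon(["Italy", "French_Polynesia", "Atlantis"], cache_file)
second = cached_lat_lon([], cache_file)  # served from the cache, the argument is ignored
print(first == second, sorted(first))
```

One consequence of this design is that once the cache file exists the `countries` argument is no longer consulted, so the file has to be deleted to pick up new countries.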

In [5]:
COUNTRIES = sorted(europe_data.countriesAndTerritories.unique())

In [6]:
loc = geolocator.geocode("French_Polynesia".replace("_", " "))
"French_Polynesia coord:", (loc.latitude, loc.longitude) 

('French_Polynesia coord:', (-16.03442485, -146.0490931059517))

In [7]:
countries_lat_lon = get_lat_lon_countries(COUNTRIES)
COUNTRIES = sorted(countries_lat_lon.keys())
# Remove the countries for which no latitude and longitude were retrieved
europe_data = europe_data[europe_data.countriesAndTerritories.isin(COUNTRIES)]

In [8]:
np.uint64(500)

500

<a id="proc"/>

# Data Processing

In [9]:
df = europe_data.rename(
    columns={ 
        "dateRep": "date",
        "countriesAndTerritories": "country", 
        "geoId":"countryCode", 
        "cases":"confirmed",
        "popData2018": "pop",
        "continentExp": "continent"
    }
)
df = df[["date","country","countryCode", "confirmed", "deaths", "pop", "continent"]]
df = df.dropna()

df_ori = df.copy()

Adding information about latitude and longitude:

In [10]:
get_lat = lambda country: countries_lat_lon[country][0]
get_lon = lambda country: countries_lat_lon[country][1]
df["lat"] = df.country.apply(get_lat)
df["lon"] = df.country.apply(get_lon)

Setting suitable types:

In [11]:
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
df["pop"] = abs(df["pop"]).astype('int64')
df["confirmed"] = abs(df["confirmed"]).astype('int64')
df["deaths"] = abs(df["deaths"]).astype('int64')
df["lat"] = df["lat"].astype(str)
df["lon"] = df["lon"].astype(str)
df = df.sort_values(by=["country", "date"]).reset_index(drop=True)
df.dtypes

date           datetime64[ns]
country                object
countryCode            object
confirmed               int64
deaths                  int64
pop                     int64
continent              object
lat                    object
lon                    object
dtype: object
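
Parsing `date` matters because the raw `dd/mm/YYYY` strings do not sort chronologically. A minimal sketch with three sample strings in the same format (the values are chosen for illustration):

```python
from datetime import datetime

raw = ["14/05/2020", "31/12/2019", "01/01/2020"]

# Compared as strings, the dates are ordered character by character, day first.
print(min(raw), max(raw))

# Parsed with the same format string used above, they are ordered chronologically.
parsed = [datetime.strptime(s, "%d/%m/%Y") for s in raw]
print(min(parsed).date(), max(parsed).date())
```

The string comparison reports `01/01/2020` as the minimum and `31/12/2019` as the maximum, while the parsed dates run from 2019-12-31 to 2020-05-14.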

Changing country codes for Greece and the United Kingdom:

In [12]:
df.loc[df['countryCode'] == 'EL', 'countryCode'] = 'GR'
df.loc[df['countryCode'] == 'UK', 'countryCode'] = 'GB'
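
The two replacements above map the codes found in the source file (`EL`, `UK`) to the ISO 3166-1 alpha-2 codes (`GR`, `GB`). A minimal sketch of the same idea as a lookup table; `EU_TO_ISO` and `to_iso` are hypothetical names introduced here for illustration:

```python
# Codes found in the source file mapped to ISO 3166-1 alpha-2 (illustrative helper)
EU_TO_ISO = {"EL": "GR", "UK": "GB"}

def to_iso(code):
    """Return the ISO code, leaving codes without an override unchanged."""
    return EU_TO_ISO.get(code, code)

codes = ["EL", "UK", "IT", "ZW"]
iso_codes = [to_iso(c) for c in codes]
print(iso_codes)
```

Keeping the overrides in one dictionary makes it easy to add further exceptions later; on a DataFrame column the same dictionary could be passed to `Series.replace`.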

In [13]:
COUNTRIES = sorted(df.country.unique())
print("Used countries (%d):" % len(COUNTRIES), ", ".join(COUNTRIES))

Used countries (200): Afghanistan, Albania, Algeria, Andorra, Angola, Antigua_and_Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bermuda, Bhutan, Bolivia, Bosnia_and_Herzegovina, Botswana, Brazil, British_Virgin_Islands, Brunei_Darussalam, Bulgaria, Burkina_Faso, Burundi, Cambodia, Cameroon, Canada, Cape_Verde, Cayman_Islands, Central_African_Republic, Chad, Chile, China, Colombia, Congo, Costa_Rica, Cote_dIvoire, Croatia, Cuba, Curaçao, Cyprus, Czechia, Democratic_Republic_of_the_Congo, Denmark, Djibouti, Dominica, Dominican_Republic, Ecuador, Egypt, El_Salvador, Equatorial_Guinea, Estonia, Eswatini, Ethiopia, Faroe_Islands, Fiji, Finland, France, French_Polynesia, Gabon, Gambia, Georgia, Germany, Ghana, Gibraltar, Greece, Greenland, Grenada, Guam, Guatemala, Guernsey, Guinea, Guinea_Bissau, Guyana, Haiti, Holy_See, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Ireland, Isle_of_Ma

In [14]:
df.tail(2)

Unnamed: 0,date,country,countryCode,confirmed,deaths,pop,continent,lat,lon
16589,2020-05-13,Zimbabwe,ZW,0,0,14439018,Africa,-18.4554963,29.7468414
16590,2020-05-14,Zimbabwe,ZW,0,0,14439018,Africa,-18.4554963,29.7468414


### Filling missing dates

We get the most recent date for which every country has data:

In [15]:
MAX_DATE = min(df.groupby("country").max()['date'])
print("Most updated date: %s" % MAX_DATE.date())

Most updated date: 2020-05-13
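
The common date is the minimum, over all countries, of each country's latest report. A minimal sketch with three hypothetical countries `A`, `B` and `C`:

```python
from datetime import date

# Latest report date per country (hypothetical values)
last_report = {
    "A": date(2020, 5, 14),
    "B": date(2020, 5, 13),
    "C": date(2020, 5, 14),
}

# The latest date that every country has reached is the earliest of the latest dates.
common_max = min(last_report.values())
print(common_max)
```

Here `B` lags one day behind, so the common date is 2020-05-13 even though the other two countries already report for 2020-05-14.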


In [16]:
df = df[df.date <= MAX_DATE]

In [17]:
def get_all_dates_list(min_date, max_date):
    temp = min_date
    dates = []
    while temp <= max_date:
        dates.append(temp)
        temp += timedelta(days=1)
    return dates

The following function returns the dataset for ONE country with every date in `range_dates`. Dates missing from the original data are added as new rows, filled according to this policy:
 - `country, countryCode, pop, continent, lat, lon` take their values from the first row without nulls (of the original)
 - `confirmed` and `deaths` are set to `0`.

In [18]:
def get_country_with_all_dates(country, df, range_dates):
    temp_df = df[df.country == country]
    temp_df = temp_df.set_index("date").reindex(range_dates)
    record = temp_df.dropna().to_dict('records')[0] # get first record in which there are no null values
    record["confirmed"] = 0
    record["deaths"] = 0
    temp_df = temp_df.fillna(value=record)
    return temp_df
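
The filling policy can be checked on a toy frame. This is a minimal sketch, assuming pandas is available; the country `X`, its values and the use of `pd.date_range` (instead of the list built by `get_all_dates_list`) are choices made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-03"]),
    "country": ["X", "X"],
    "pop": [1000, 1000],
    "confirmed": [2, 5],
    "deaths": [0, 1],
})
full_range = pd.date_range("2020-01-01", "2020-01-04", freq="D")

# Reindexing on the full range adds all-NaN rows for the missing dates.
filled = toy.set_index("date").reindex(full_range)

# Static fields come from the first complete row; counts default to 0.
defaults = filled.dropna().to_dict("records")[0]
defaults["confirmed"] = 0
defaults["deaths"] = 0
filled = filled.fillna(value=defaults)
print(filled)
```

The two missing days (2020-01-02 and 2020-01-04) receive the country's static fields and zero counts, while the reported days keep their original values.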

In [19]:
get_country_with_all_dates("Northern_Mariana_Islands", df, get_all_dates_list(min(df.date), MAX_DATE))

date,country,countryCode,confirmed,deaths,pop,continent,lat,lon
2019-12-31,Northern_Mariana_Islands,MP,0.0,0.0,56882.0,Oceania,14.149020499999999,145.21345248318923
2020-01-01,Northern_Mariana_Islands,MP,0.0,0.0,56882.0,Oceania,14.149020499999999,145.21345248318923
2020-01-02,Northern_Mariana_Islands,MP,0.0,0.0,56882.0,Oceania,14.149020499999999,145.21345248318923
2020-01-03,Northern_Mariana_Islands,MP,0.0,0.0,56882.0,Oceania,14.149020499999999,145.21345248318923
2020-01-04,Northern_Mariana_Islands,MP,0.0,0.0,56882.0,Oceania,14.149020499999999,145.21345248318923
...,...,...,...,...,...,...,...,...
2020-05-09,Northern_Mariana_Islands,MP,0.0,0.0,56882.0,Oceania,14.149020499999999,145.21345248318923
2020-05-10,Northern_Mariana_Islands,MP,1.0,0.0,56882.0,Oceania,14.149020499999999,145.21345248318923
2020-05-11,Northern_Mariana_Islands,MP,0.0,0.0,56882.0,Oceania,14.149020499999999,145.21345248318923
2020-05-12,Northern_Mariana_Islands,MP,3.0,0.0,56882.0,Oceania,14.149020499999999,145.21345248318923


In [20]:
def get_all_countries_with_all_dates(dataframe_to_normalize):
    df = dataframe_to_normalize.copy()
    countries = sorted(df.country.unique())
    range_dates = get_all_dates_list(min(df.date), max(df.date))
    # DataFrame.append was removed in pandas 2.0; concatenate the per-country frames instead
    new_df = pd.concat([get_country_with_all_dates(country, df, range_dates) for country in countries])
    new_df = new_df.reset_index()
    new_df = new_df.rename(columns={"index": "date"})
    return new_df

In [21]:
print("This is the shape BEFORE filling the new date range:", df.shape)

This is the shape BEFORE filling the new date range: (16392, 9)


In [22]:
df = get_all_countries_with_all_dates(df)

In [23]:
print("This is the shape AFTER filling the new date range:", df.shape)

This is the shape AFTER filling the new date range: (27000, 9)


### Creation of the cumulative and normalized columns
Firstly, we create the cumulative columns for confirmed cases and deaths: `confirmed_cum` and `deaths_cum`, respectively.

The cumulative value is calculated as follows: given a value `V2` on a date `D2`, the cumulative value on `D2` is `C2 = C1 + V2` (where `C1` is the cumulative value on the previous date `D1`). In other words, `C2` represents the sum of all the values `V` up to and including the date `D2`.

In [24]:
df[["confirmed_cum", "deaths_cum"]] = df.sort_values(by="date").groupby('country')[["confirmed", "deaths"]].cumsum()
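
The recurrence `C2 = C1 + V2` can be illustrated without pandas. A minimal sketch on four hypothetical daily values for a single country:

```python
from itertools import accumulate

daily = [0, 1, 0, 3]  # V for four consecutive dates (hypothetical values)
cumulative = list(accumulate(daily))
print(cumulative)

# Each cumulative value equals the previous cumulative value plus the day's value.
recurrence_holds = all(
    cumulative[i] == cumulative[i - 1] + daily[i] for i in range(1, len(daily))
)
print(recurrence_holds)
```

In the notebook the same running sum is computed per country, which is why the rows are grouped by `country` before `cumsum` is applied.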

Secondly, we create the normalized columns for `"confirmed_cum", "deaths_cum", "confirmed", "deaths"`. Each column is divided by the country's population and then multiplied by 1M.
For example: `"confirmed_cum_norm"` expresses the `Total confirmed cases / 1M population`

In [25]:
for col in ["confirmed_cum", "deaths_cum", "confirmed", "deaths"]:
    df[col+"_norm"] = df[col]/(df["pop"]/1000000)
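
As a worked check of the normalization, the sketch below applies the same formula to a single pair of numbers: 37 cumulative confirmed cases in a population of 14,439,018 (the Zimbabwe figures that appear in this notebook's output). The helper name `per_million` is hypothetical.

```python
def per_million(value, population):
    """Scale a count to a rate per one million inhabitants."""
    return value / (population / 1_000_000)

rate = per_million(37, 14_439_018)
print(round(rate, 6))
```

The result is about 2.5625 cases per million, matching the `confirmed_cum_norm` value shown for that country.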

In [26]:
df.tail()

Unnamed: 0,date,country,countryCode,confirmed,deaths,pop,continent,lat,lon,confirmed_cum,deaths_cum,confirmed_cum_norm,deaths_cum_norm,confirmed_norm,deaths_norm
26995,2020-05-09,Zimbabwe,ZW,1.0,0.0,14439018.0,Africa,-18.4554963,29.7468414,35.0,4.0,2.423988,0.277027,0.069257,0.0
26996,2020-05-10,Zimbabwe,ZW,1.0,0.0,14439018.0,Africa,-18.4554963,29.7468414,36.0,4.0,2.493244,0.277027,0.069257,0.0
26997,2020-05-11,Zimbabwe,ZW,0.0,0.0,14439018.0,Africa,-18.4554963,29.7468414,36.0,4.0,2.493244,0.277027,0.0,0.0
26998,2020-05-12,Zimbabwe,ZW,1.0,0.0,14439018.0,Africa,-18.4554963,29.7468414,37.0,4.0,2.562501,0.277027,0.069257,0.0
26999,2020-05-13,Zimbabwe,ZW,0.0,0.0,14439018.0,Africa,-18.4554963,29.7468414,37.0,4.0,2.562501,0.277027,0.0,0.0


### Merge the dataset with the GDP one

In [27]:
df_gdp = pd.read_csv("gdp_csv_processed.csv", index_col=0)
df_gdp.head(2)

Unnamed: 0,countryCode3,gdp_year,gdp,countryCode
0,MCO,2018,185741.279992,MC
1,LIE,2016,165028.245029,LI


In [28]:
temp = pd.merge(df, df_gdp.reset_index().set_index('countryCode'), how='left', left_on='countryCode', right_index=True)

In [29]:
print("These are the countries without GDP:", ", ".join(temp[temp.isnull().any(axis=1)].country.unique()))

These are the countries without GDP: British_Virgin_Islands, Gibraltar, Guernsey, Holy_See, Jersey, Montserrat, Sint_Maarten, Taiwan


Since they are really small states, they should not influence the analysis we want to conduct. We fill the missing values with -1 (which lies outside their natural domain).

In [30]:
df = temp.fillna(-1)
df

Unnamed: 0,date,country,countryCode,confirmed,deaths,pop,continent,lat,lon,confirmed_cum,deaths_cum,confirmed_cum_norm,deaths_cum_norm,confirmed_norm,deaths_norm,index,countryCode3,gdp_year,gdp
0,2019-12-31,Afghanistan,AF,0.0,0.0,37172386.0,Asia,33.7680065,66.2385139,0.0,0.0,0.000000,0.000000,0.000000,0.0,204.0,AFG,2018.0,520.896603
1,2020-01-01,Afghanistan,AF,0.0,0.0,37172386.0,Asia,33.7680065,66.2385139,0.0,0.0,0.000000,0.000000,0.000000,0.0,204.0,AFG,2018.0,520.896603
2,2020-01-02,Afghanistan,AF,0.0,0.0,37172386.0,Asia,33.7680065,66.2385139,0.0,0.0,0.000000,0.000000,0.000000,0.0,204.0,AFG,2018.0,520.896603
3,2020-01-03,Afghanistan,AF,0.0,0.0,37172386.0,Asia,33.7680065,66.2385139,0.0,0.0,0.000000,0.000000,0.000000,0.0,204.0,AFG,2018.0,520.896603
4,2020-01-04,Afghanistan,AF,0.0,0.0,37172386.0,Asia,33.7680065,66.2385139,0.0,0.0,0.000000,0.000000,0.000000,0.0,204.0,AFG,2018.0,520.896603
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26995,2020-05-09,Zimbabwe,ZW,1.0,0.0,14439018.0,Africa,-18.4554963,29.7468414,35.0,4.0,2.423988,0.277027,0.069257,0.0,158.0,ZWE,2018.0,2146.996385
26996,2020-05-10,Zimbabwe,ZW,1.0,0.0,14439018.0,Africa,-18.4554963,29.7468414,36.0,4.0,2.493244,0.277027,0.069257,0.0,158.0,ZWE,2018.0,2146.996385
26997,2020-05-11,Zimbabwe,ZW,0.0,0.0,14439018.0,Africa,-18.4554963,29.7468414,36.0,4.0,2.493244,0.277027,0.000000,0.0,158.0,ZWE,2018.0,2146.996385
26998,2020-05-12,Zimbabwe,ZW,1.0,0.0,14439018.0,Africa,-18.4554963,29.7468414,37.0,4.0,2.562501,0.277027,0.069257,0.0,158.0,ZWE,2018.0,2146.996385
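
The left merge and the sentinel fill can be sketched on two tiny frames. This is an illustration under assumptions, not the notebook's method: the values are hypothetical, and `indicator=True` plus a column-restricted `fillna` are swapped in to show one way of finding unmatched rows and limiting the sentinel to the GDP column.

```python
import pandas as pd

cases = pd.DataFrame({"countryCode": ["AF", "TW"], "confirmed_cum": [10, 5]})
gdp = pd.DataFrame({"countryCode": ["AF"], "gdp": [520.9]})

# indicator=True adds a "_merge" column saying where each row was found.
merged = cases.merge(gdp, on="countryCode", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "countryCode"].tolist()
print(missing)

# Fill the sentinel only in the GDP column, leaving other columns untouched.
merged["gdp"] = merged["gdp"].fillna(-1)
print(merged.drop(columns="_merge"))
```

Restricting the fill to the GDP column avoids writing -1 into unrelated columns that might contain nulls for other reasons.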


<a id="info"/>

# Dataset's stats


In [31]:
get_number_unique_values = lambda df, col: len(df[col].unique())
def print_info(df, main_feature):
    print("Number of %s:" % main_feature, get_number_unique_values(df, main_feature))
    print("\nNumber of days:", get_number_unique_values(df, "date"))
    print("\nFrom:", min(df.date), "to", max(df.date))
    features = df.columns.to_list()
    print("\nWe have %d features:" % len(features), features)
    print("\nThe total number of (rows, cols) is:", df.shape)
    print("\nIn memory occupies: ~%d MB\n" % (df.memory_usage(index=True).sum() / (2**20)))  # 2**20 bytes per MB
    print(df.head(1))

### Original Dataset

In [32]:
print_info(df_ori, "country")

Number of country: 200

Number of days: 136

From: 01/01/2020 to 31/12/2019

We have 7 features: ['date', 'country', 'countryCode', 'confirmed', 'deaths', 'pop', 'continent']

The total number of (rows, cols) is: (16591, 7)

In memory occupies: ~1 MB

         date      country countryCode  confirmed  deaths         pop  \
0  14/05/2020  Afghanistan          AF        259       3  37172386.0   

  continent  
0      Asia  


### Processed Dataset

In [33]:
print_info(df, "country")

Number of country: 200

Number of days: 135

From: 2019-12-31 00:00:00 to 2020-05-13 00:00:00

We have 19 features: ['date', 'country', 'countryCode', 'confirmed', 'deaths', 'pop', 'continent', 'lat', 'lon', 'confirmed_cum', 'deaths_cum', 'confirmed_cum_norm', 'deaths_cum_norm', 'confirmed_norm', 'deaths_norm', 'index', 'countryCode3', 'gdp_year', 'gdp']

The total number of (rows, cols) is: (27000, 19)

In memory occupies: ~3 MB

        date      country countryCode  confirmed  deaths         pop  \
0 2019-12-31  Afghanistan          AF        0.0     0.0  37172386.0   

  continent         lat         lon  confirmed_cum  deaths_cum  \
0      Asia  33.7680065  66.2385139            0.0         0.0   

   confirmed_cum_norm  deaths_cum_norm  confirmed_norm  deaths_norm  index  \
0                 0.0              0.0             0.0          0.0  204.0   

  countryCode3  gdp_year         gdp  
0          AFG    2018.0  520.896603  


In [34]:
df.describe()

Unnamed: 0,confirmed,deaths,pop,confirmed_cum,deaths_cum,confirmed_cum_norm,deaths_cum_norm,confirmed_norm,deaths_norm,index,gdp_year,gdp
count,27000.0,27000.0,27000.0,27000.0,27000.0,27000.0,27000.0,27000.0,27000.0,27000.0,27000.0,27000.0
mean,156.745926,10.802852,37678080.0,4390.272,290.69937,249.856462,11.149589,7.872003,0.376141,98.98,1936.885,17816.659888
std,1308.312539,98.98053,141686700.0,38938.83,2688.046641,999.324263,66.304672,45.938483,2.962948,63.547928,395.581569,26574.353602
min,0.0,0.0,1000.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0
25%,0.0,0.0,1298724.0,0.0,0.0,0.0,0.0,0.0,0.0,43.75,2018.0,2023.631192
50%,0.0,0.0,7355830.0,1.0,0.0,0.035603,0.0,0.0,0.0,96.5,2018.0,6804.513274
75%,6.0,0.0,26718740.0,152.0,3.0,44.346577,0.402619,0.876084,0.0,154.25,2018.0,23260.593459
max,48529.0,4928.0,1392730000.0,1369964.0,82387.0,19416.900992,1213.556312,4000.0,177.593607,210.0,2018.0,185741.279992


In [35]:
df.head(1)

Unnamed: 0,date,country,countryCode,confirmed,deaths,pop,continent,lat,lon,confirmed_cum,deaths_cum,confirmed_cum_norm,deaths_cum_norm,confirmed_norm,deaths_norm,index,countryCode3,gdp_year,gdp
0,2019-12-31,Afghanistan,AF,0.0,0.0,37172386.0,Asia,33.7680065,66.2385139,0.0,0.0,0.0,0.0,0.0,0.0,204.0,AFG,2018.0,520.896603


In [36]:
df.to_json("datacovid.json",orient='index')
df.to_csv("datacovid.csv", index=False)

In [37]:
get_unix_timestamp_str = lambda x: str(int(x.timestamp()*1000))
get_unix_timestamp_str(datetime.today())

'1589488488466'
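
The helper above returns milliseconds since the Unix epoch as a string. For a naive standard-library `datetime`, `timestamp()` interprets the value in the machine's local time zone, so the result can differ between machines. A minimal sketch with an explicitly UTC-aware datetime; the function name `to_unix_ms` is hypothetical:

```python
from datetime import datetime, timezone

def to_unix_ms(dt):
    """Milliseconds since the Unix epoch, as a string."""
    return str(int(dt.timestamp() * 1000))

# Attaching a time zone removes the dependence on the local machine settings.
aware = datetime(2020, 5, 13, tzinfo=timezone.utc)
print(to_unix_ms(aware))
```

Midnight UTC on 2020-05-13 corresponds to `1589328000000`.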

In [38]:
temp_df = df.set_index(["country", "date"]).to_dict(orient='index')
dict_country_date = defaultdict(dict)  # {country: {date: row}}, keeping every date per country
for (country, date), row in temp_df.items():
    dict_country_date[country][str(date.date())] = row

In [39]:
temp_df = df.copy()
temp_df["countryCodeGB"] = temp_df["countryCode"].apply(lambda x: x if x != 'GB' else 'UK')
temp_df = temp_df.set_index(["country", "date"]).to_dict(orient='index')
dict_country_date_unix = defaultdict(dict)  # {country: {unix_ms: row}}, keeping every date per country
for (country, date), row in temp_df.items():
    dict_country_date_unix[country][get_unix_timestamp_str(date)] = row

In [40]:
temp_df = df[['date', 'country', 'countryCode', 'confirmed_cum', 'deaths_cum']]
temp_df = temp_df.rename(columns={"confirmed_cum": "value", "deaths_cum": "deaths", 'country': 'name','countryCode': 'id'}).set_index('date')
datacovid_for_heatmap = defaultdict(list)
for date, row in temp_df.iterrows():
    datacovid_for_heatmap[get_unix_timestamp_str(date)].append(row.to_dict())

In [41]:
with open('datacovid_country_date.json', 'w') as fp:
    json.dump(dict_country_date, fp)
with open('datacovid_country_date_unix.json', 'w') as fp:
    json.dump(dict_country_date_unix, fp)
with open('datacovid_for_heatmap.json', 'w') as fp:
    json.dump(datacovid_for_heatmap, fp)
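
A consumer of `datacovid_country_date.json` would presumably look rows up by country and then by date. The sketch below shows that access pattern on a hypothetical in-memory miniature of the intended `{country: {date: row}}` layout, with only two of the row fields included.

```python
import json

# Hypothetical miniature of the {country: {date: row}} layout
payload = json.dumps({
    "Zimbabwe": {
        "2020-05-12": {"confirmed": 1.0, "confirmed_cum": 37.0},
        "2020-05-13": {"confirmed": 0.0, "confirmed_cum": 37.0},
    }
})

data = json.loads(payload)
latest_date = max(data["Zimbabwe"])  # ISO-formatted dates sort correctly as strings
latest_total = data["Zimbabwe"][latest_date]["confirmed_cum"]
print(latest_date, latest_total)
```

Because the date keys are ISO-formatted strings, the most recent entry can be found with a plain `max` over the keys.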

#### [Back to the top](#top)

#### Back to [Home](FinalProjectReport.ipynb)