# Analysis of the worldwide distribution of COVID-19 cases

![](https://d.newsweek.com/en/full/1571542/coronavirus-covid19-virus-stock-getty.jpg)

I've used a data set provided by the European Union (ECDC), updated on 2 May 2020.
The data are available at this link: https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide
The file is updated daily and contains the latest available public data on COVID-19. Each row contains the number of new cases and deaths reported per day and per country.

The following variables are provided:

- dateRep: date of reporting
- day: day of reporting
- month: month of reporting
- year: year of reporting
- cases: number of confirmed cases
- deaths: number of deaths
- countriesAndTerritories: name of the country or territory
- geoId: ISO country code with 2 characters
- countryterritoryCode: ISO country code with 3 characters
- popData2018: population of each country as of 2018
- continentExp: name of the continent

The goal is to get an overview of how the pandemic is distributed around the world.

### Prepare Workspace

In [270]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

In [271]:
# import data set
df = pd.read_csv("covid_19.csv",encoding='ISO-8859-1')

### Summarize data

In [272]:
# Look at dimension of data set and types of each attribute
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14450 entries, 0 to 14449
Data columns (total 11 columns):
dateRep                    14450 non-null object
day                        14450 non-null int64
month                      14450 non-null int64
year                       14450 non-null int64
cases                      14450 non-null int64
deaths                     14450 non-null int64
countriesAndTerritories    14450 non-null object
geoId                      14401 non-null object
countryterritoryCode       14282 non-null object
popData2018                14304 non-null float64
continentExp               14450 non-null object
dtypes: float64(1), int64(5), object(5)
memory usage: 1.2+ MB


In [273]:
# Summarize attribute distributions of the data frame
df.describe(include='all')

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
count,14450,14450.0,14450.0,14450.0,14450.0,14450.0,14450,14401,14282,14304.0,14450
unique,124,,,,,,209,208,204,,6
top,02/05/2020,,,,,,Italy,NL,BRA,,Europe
freq,208,,,,,,124,124,124,,4873
mean,,16.377509,3.101038,2019.995363,228.899654,16.500415,,,,54941230.0,
std,,9.000564,1.270264,0.067937,1596.065098,124.43305,,,,183123200.0,
min,,1.0,1.0,2019.0,-1430.0,0.0,,,,1000.0,
25%,,9.0,2.0,2020.0,0.0,0.0,,,,2789533.0,
50%,,17.0,3.0,2020.0,1.0,0.0,,,,9942334.0,
75%,,24.0,4.0,2020.0,30.0,1.0,,,,37172390.0,
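
The `min` row above shows a negative value for `cases` (-1430), which suggests that some daily entries are corrections to earlier reports rather than new counts. A minimal sketch of how such rows could be listed, using a small hypothetical frame (`toy`, with made-up values) in place of `df`:

```python
import pandas as pd

# Hypothetical toy data standing in for df; the values are illustrative only
toy = pd.DataFrame({
    "countriesAndTerritories": ["A", "A", "B", "B"],
    "cases": [10, -3, 5, 0],
    "deaths": [1, 0, 0, 0],
})

# Keep the rows where a daily count is negative (probably retrospective corrections)
negative_rows = toy[(toy["cases"] < 0) | (toy["deaths"] < 0)]
print(negative_rows)
```

Whether to keep, clip or drop such rows is a judgment call; in this notebook they are left in, so they lower the summed totals slightly.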


In [274]:
# Take a peek at the first rows of the data
df.head(10)

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
0,02/05/2020,2,5,2020,164,4,Afghanistan,AF,AFG,37172386.0,Asia
1,01/05/2020,1,5,2020,222,4,Afghanistan,AF,AFG,37172386.0,Asia
2,30/04/2020,30,4,2020,122,0,Afghanistan,AF,AFG,37172386.0,Asia
3,29/04/2020,29,4,2020,124,3,Afghanistan,AF,AFG,37172386.0,Asia
4,28/04/2020,28,4,2020,172,0,Afghanistan,AF,AFG,37172386.0,Asia
5,27/04/2020,27,4,2020,68,10,Afghanistan,AF,AFG,37172386.0,Asia
6,26/04/2020,26,4,2020,112,4,Afghanistan,AF,AFG,37172386.0,Asia
7,25/04/2020,25,4,2020,70,1,Afghanistan,AF,AFG,37172386.0,Asia
8,24/04/2020,24,4,2020,105,2,Afghanistan,AF,AFG,37172386.0,Asia
9,23/04/2020,23,4,2020,84,4,Afghanistan,AF,AFG,37172386.0,Asia


### Handling variables

In [275]:
# Drop columns not used
df_new = df.drop(['day','month','year'], axis=1)

In [276]:
# Rename some features for a practical use
df_new = df_new.rename(columns={"dateRep":"date","countriesAndTerritories":"country","popData2018":"pop",
                                "continentExp":"continent","countryterritoryCode":"ccode"}) 

In [277]:
# format date
df_new['date'] = pd.to_datetime(df_new['date'], format='%d/%m/%Y')
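
Passing an explicit `format` matters here because the dates are day-first: without it, a string such as `02/05/2020` could be read as 5 February instead of 2 May. A small sketch with two date strings taken from the sample above (the `dates` series is only an illustration):

```python
import pandas as pd

# Two day-first date strings, in the same layout as the dateRep column
dates = pd.Series(["02/05/2020", "30/04/2020"])

# The explicit format forces day/month/year parsing
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.month.tolist())  # [5, 4]
```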

In [278]:
# peek a sample
df_new.sample(10)

Unnamed: 0,date,cases,deaths,country,geoId,ccode,pop,continent
4171,2020-04-26,37,0,El_Salvador,SV,SLV,6420744.0,America
13138,2020-04-12,33,3,Thailand,TH,THA,69428524.0,Asia
11045,2020-04-14,333,12,Romania,RO,ROU,19473936.0,Europe
11808,2020-03-17,1,0,Seychelles,SC,SYC,96762.0,Africa
10741,2020-01-10,0,0,Philippines,PH,PHL,106651922.0,Asia
1623,2020-03-23,0,0,Benin,BJ,BEN,11485048.0,Africa
1596,2020-04-19,0,0,Benin,BJ,BEN,11485048.0,Africa
11738,2020-04-02,160,0,Serbia,RS,SRB,6982084.0,Europe
6876,2020-02-29,4,0,Israel,IL,ISR,8883800.0,Asia
12412,2020-03-23,3646,394,Spain,ES,ESP,46723749.0,Europe


### Grouping data per country

In [279]:
# group by country and sum cases and deaths (a list of columns needs double brackets)
df_country = df_new.groupby(['country','pop','continent','ccode'], as_index=False)[['cases', 'deaths']].sum()

In [280]:
# drop rows with missing values
df_country = df_country.dropna()
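
As far as I know, `groupby` excludes rows whose grouping key is missing by default, so countries without a `pop` or `ccode` value are probably already absent before `dropna()` runs. A minimal sketch of that behaviour on hypothetical data (the frame `toy` and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: country Y has no country code
toy = pd.DataFrame({
    "country": ["X", "X", "Y"],
    "ccode": ["XXX", "XXX", np.nan],
    "cases": [1, 2, 5],
})

# Rows with a missing grouping key are left out of the result by default
grouped = toy.groupby(["country", "ccode"], as_index=False)[["cases"]].sum()
print(grouped)
```

If those territories should be kept, one option is to fill the missing keys with a placeholder before grouping.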

In [281]:
# new columns: population in millions, cases per million and deaths per million inhabitants
df_= df_country.copy()
df_['pop(ml)'] = round((df_['pop']/10**6),2)
df_['cases x pop(ml)'] = round((df_['cases']/df_['pop(ml)']),2)
df_['deaths x pop(ml)'] = round((df_['deaths']/df_['pop(ml)']),2)

In [282]:
df_.sample(5)

Unnamed: 0,country,pop,continent,ccode,cases,deaths,pop(ml),cases x pop(ml),deaths x pop(ml)
18,Belize,383071.0,America,BLZ,18,2,0.38,47.37,5.26
184,Togo,7889094.0,Africa,TGO,123,9,7.89,15.59,1.14
199,Vietnam,95540395.0,Asia,VNM,270,0,95.54,2.83,0.0
51,Denmark,5797446.0,Europe,DNK,9311,460,5.8,1605.34,79.31
164,Sierra_Leone,7650154.0,Africa,SLE,136,7,7.65,17.78,0.92
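
To make the per-million columns concrete, here is the same arithmetic applied by hand to the Denmark row shown above. The variable names are only for illustration, and the last line is an alternative I'm adding for comparison, not something the notebook does:

```python
# Values taken from the Denmark row in the sample above
cases, deaths, pop = 9311, 460, 5797446.0

pop_ml = round(pop / 10**6, 2)           # population in millions, rounded as in the notebook
cases_per_ml = round(cases / pop_ml, 2)  # cases per million inhabitants
deaths_per_ml = round(deaths / pop_ml, 2)
print(pop_ml, cases_per_ml, deaths_per_ml)  # 5.8 1605.34 79.31

# Alternative (not used in the notebook): divide by the unrounded population
cases_per_ml_exact = round(cases / (pop / 10**6), 2)
```

Because the population is rounded to two decimals before the division, the rate differs slightly from what the unrounded population would give; the effect is larger for small countries.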


### Top countries with the most cases and deaths per population

In [283]:
# select countries with population > 1 million
df_ = df_[(df_['pop(ml)'] > 1)]

In [284]:
# ranking countries with cases x population
df_c = df_.sort_values(['cases x pop(ml)'], ascending = False).reset_index(drop=True)
print('Top 15 countries with the most cases per population (ml)')
df_c.drop(columns = ['deaths', 'deaths x pop(ml)','pop','continent','ccode']).head(15).style.background_gradient(cmap='cool')


Top 15 countries with the most cases per population (ml)


Unnamed: 0,country,cases,pop(ml),cases x pop(ml)
0,Qatar,14096,2.78,5070.5
1,Spain,215216,46.72,4606.51
2,Ireland,20833,4.85,4295.46
3,Belgium,49032,11.42,4293.52
4,Switzerland,29622,8.52,3476.76
5,Italy,207428,60.43,3432.53
6,United_States_of_America,1103781,327.17,3373.72
7,Singapore,17101,5.64,3032.09
8,United_Kingdom,177454,66.49,2668.88
9,Portugal,25351,10.28,2466.05


In [285]:
# ranking countries with deaths per population
df_d = df_.sort_values(['deaths x pop(ml)'], ascending = False).reset_index(drop=True)
print('Top 15 countries with the most deaths per population (ml)')
df_d.drop(columns = ['cases', 'cases x pop(ml)','pop','continent','ccode']).head(15).style.background_gradient(cmap='Reds')

Top 15 countries with the most deaths per population (ml)


Unnamed: 0,country,deaths,pop(ml),deaths x pop(ml)
0,Belgium,7703,11.42,674.52
1,Spain,24824,46.72,531.34
2,Italy,28236,60.43,467.25
3,United_Kingdom,27510,66.49,413.75
4,France,24594,66.99,367.13
5,Netherlands,4893,17.23,283.98
6,Ireland,1265,4.85,260.82
7,Sweden,2653,10.18,260.61
8,United_States_of_America,65068,327.17,198.88
9,Switzerland,1434,8.52,168.31


### Grouping data per continent

In [286]:
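# group by continent
# NOTE: 'pop' is summed over every daily row, so each country's population is counted once per reporting day
df_continent = df_new.groupby(['continent'], as_index=False)[['cases', 'deaths','pop']].sum()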

In [287]:
# dropping NA rows
df_continent = df_continent.dropna()

In [288]:
# new columns: population in millions, cases per million and deaths per million inhabitants
df_cont= df_continent.copy()
df_cont['pop(ml)'] = round((df_cont['pop']/10**6),2)
df_cont['cases x pop(ml)'] = round((df_cont['cases']/df_cont['pop(ml)']),2)
df_cont['deaths x pop(ml)'] = round((df_cont['deaths']/df_cont['pop(ml)']),2)

### Continent ranking by cases and deaths per population

In [289]:
# select continents with population > 1 million
df_cont = df_cont[(df_cont['pop(ml)'] > 1)]

In [290]:
# ranking continent with cases x population
df_ca = df_cont.sort_values(['cases x pop(ml)'], ascending = False).reset_index(drop=True)
print('Continent ranking with the most cases per population (ml)')
df_ca.drop(columns = ['deaths', 'deaths x pop(ml)','pop']).style.background_gradient(cmap='summer')

Continent ranking with the most cases per population (ml)


Unnamed: 0,continent,cases,pop(ml),cases x pop(ml)
0,Europe,1338546,84646.4,15.81
1,America,1390414,103574.0,13.42
2,Oceania,8163,4103.88,1.99
3,Asia,529022,510412.0,1.04
4,Africa,40759,83142.1,0.49


In [291]:
# ranking continent with deaths per population
df_de = df_cont.sort_values(['deaths x pop(ml)'], ascending = False).reset_index(drop=True)
print('Continent ranking with the most deaths per population (ml)')
df_de.drop(columns = ['cases', 'cases x pop(ml)','pop']).style.background_gradient(cmap='winter')

Continent ranking with the most deaths per population (ml)


Unnamed: 0,continent,deaths,pop(ml),deaths x pop(ml)
0,Europe,137047,84646.4,1.62
1,America,80684,103574.0,0.78
2,Asia,18883,510412.0,0.04
3,Oceania,120,4103.88,0.03
4,Africa,1690,83142.1,0.02
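
The `pop(ml)` values in the two continent tables are much larger than real continental populations (Europe appears as roughly 84,646 million). This happens when `pop` is summed over every daily row, so each country's population is counted once per reporting day and the per-million rates come out too low. A possible correction, sketched on hypothetical data (the frame `toy` and the column name `cases_per_million` are my own), is to reduce to one row per country first and only then sum per continent:

```python
import pandas as pd

# Hypothetical daily rows: two countries on one continent, two days each
toy = pd.DataFrame({
    "country":   ["A", "A", "B", "B"],
    "continent": ["Z", "Z", "Z", "Z"],
    "pop":       [1_000_000.0, 1_000_000.0, 3_000_000.0, 3_000_000.0],
    "cases":     [10, 30, 20, 40],
})

# Step 1: one row per country (population taken once, cases summed over days)
per_country = toy.groupby(["continent", "country"], as_index=False).agg(
    pop=("pop", "first"), cases=("cases", "sum")
)

# Step 2: sum populations and cases over the countries of each continent
per_continent = per_country.groupby("continent", as_index=False)[["pop", "cases"]].sum()
per_continent["cases_per_million"] = per_continent["cases"] / (per_continent["pop"] / 10**6)
print(per_continent)
```

In this notebook, `df_country` already holds one row per country, so grouping that frame by `continent` should give the same effect. The ranking between continents may or may not change, but the absolute rates would.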


### Distribution of cases and deaths in the world

To visualize the worldwide distribution, I've used choropleth maps built with Plotly Express (`px.choropleth`).

In [292]:
fig = px.choropleth(df_c, locations="ccode",
                    color="cases x pop(ml)",
                    hover_name="country",
                    color_continuous_scale=px.colors.sequential.Plotly3)

layout = go.Layout(
    title=go.layout.Title(
        text="Covid-19 cases per population (million)",
        x=0.5
    ),
    font=dict(size=14),
    width = 750,
    height = 350,
    margin=dict(l=0,r=0,b=0,t=30)
)

fig.update_layout(layout)

fig.show()

In [293]:
fig = px.choropleth(df_d, locations="ccode",
                    color="deaths x pop(ml)",
                    hover_name="country",
                    color_continuous_scale=px.colors.sequential.Agsunset)

layout = go.Layout(
    title=go.layout.Title(
        text="Covid-19 deaths per population (million)",
        x=0.5
    ),
    font=dict(size=14),
    width = 750,
    height = 350,
    margin=dict(l=0,r=0,b=0,t=30)
)

fig.update_layout(layout)

fig.show()

### Grouping data per country and date

To visualize the time series, I've grouped the data by country and date and then built a pivot table to get the right format for plotting.

In [294]:
# group by country and date (a list of columns needs double brackets)
ts_country = df_new.groupby(['country','date'], as_index=False)[['cases','deaths']].sum()

In [295]:
# dropping NA rows
ts_country = ts_country.dropna()

In [296]:
ts_country.sample(5)

Unnamed: 0,country,date,cases,deaths
7448,Kosovo,2020-04-03,13,0
13652,United_Arab_Emirates,2020-02-07,0,0
10711,Philippines,2020-03-23,0,0
11101,Romania,2020-03-16,26,0
2085,Brunei_Darussalam,2020-04-15,0,0


In [297]:
# create pivot table for cases
covid_c = ts_country.pivot(index='date', columns='country', values='cases')

In [298]:
# select countries to visualize time series
covid_cases = covid_c[['Italy','Spain','United_Kingdom','Germany','France','United_States_of_America','Belgium',
                       'Switzerland','Netherlands','Sweden']]

In [299]:
# cumulative time series 
covid_cases.sort_index().cumsum().iplot(title = 'Time series of cumulative cases per country')
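
The reshaping used for these plots can be sketched on a tiny hypothetical data set (the frame `toy` and its values are made up): `pivot` turns the long table into one column per country indexed by date, and `cumsum` then accumulates the daily counts down each column.

```python
import pandas as pd

# Hypothetical long-format data: one row per country and date
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02"]),
    "cases": [1, 2, 10, 20],
})

# Wide format: dates as the index, one column per country
wide = toy.pivot(index="date", columns="country", values="cases")

# Running total per country, in chronological order
cumulative = wide.sort_index().cumsum()
print(cumulative)
```

If a country has no row for some date, `pivot` leaves a missing value in that cell; filling those with 0 before `cumsum` is one way to avoid gaps in the cumulative curve.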

In [300]:
# time series per day
covid_cases.iplot(title = 'Time series of cases per day and per country')

In [301]:
# create pivot table for deaths
covid_d = ts_country.pivot(index='date', columns='country', values='deaths')

In [302]:
# select countries to visualize time series
covid_deaths = covid_d[['Italy','Spain','United_Kingdom','Germany','France','United_States_of_America','Belgium',
                        'Switzerland', 'Netherlands','Sweden']]

In [303]:
# cumulative time series
covid_deaths.sort_index().cumsum().iplot(title = 'Time series of cumulative deaths per country')

In [304]:
# time series per day
covid_deaths.iplot(title = 'Time series of deaths per day and per country')

In [305]:
# group by continent and date (a list of columns needs double brackets)
ts_continent = df_new.groupby(['continent','date'], as_index=False)[['cases','deaths']].sum()

In [306]:
# dropping NA rows
ts_continent = ts_continent.dropna()

In [307]:
ts_continent.sample(10)

Unnamed: 0,continent,date,cases,deaths
501,Oceania,2020-01-05,0,0
105,Africa,2020-04-14,769,46
128,America,2020-01-04,0,0
187,America,2020-03-03,19,4
50,Africa,2020-02-19,0,0
116,Africa,2020-04-25,1854,48
38,Africa,2020-02-07,0,0
463,Europe,2020-03-31,29043,2894
544,Oceania,2020-02-17,0,0
103,Africa,2020-04-12,714,52


In [308]:
# create pivot table for cases
covid_C = ts_continent.pivot(index='date', columns='continent', values='cases')

In [309]:
# select continents to visualize time series
covid_C_cases = covid_C[['Africa','America','Asia','Europe','Oceania']]

In [310]:
# cumulative time series 
covid_C_cases.sort_index().cumsum().iplot(title = 'Time series of cumulative cases per continent')

In [311]:
# time series per day
covid_C_cases.iplot(title = 'Time series of cases per day and per continent')

In [312]:
# create pivot table for deaths
covid_C_deaths = ts_continent.pivot(index='date', columns='continent', values='deaths')

In [313]:
# select continents to visualize time series
covid_D = covid_C_deaths[['Africa','America','Asia','Europe','Oceania']]

In [314]:
# cumulative time series
covid_D.sort_index().cumsum().iplot(title = 'Time series of cumulative deaths per continent')

In [315]:
# time series per day
covid_D.iplot(title = 'Time series of deaths per day and per continent')