# Death Cause by Country

*Table of Contents*

1.   Initialize the project
2.   Import the datasets
3.   Transforming the Death Cause Reason by Country Dataset
4.   The World Map Background
5.   Selections (add_selection & transform_filter)
6.   The First Viz
7.   The Second Viz


By **Erik Salsborn** & **Anders Lundkvist**



# 1. Initialize the Project

In [1]:
import altair as alt
import pandas as pd
import io
import numpy as np
import sys
import seaborn as sns
from vega_datasets import data

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')


# 2. Import the datasets


In [2]:
# Death Cause Reason by Country
df_dcr = pd.read_csv("https://dl.dropbox.com/s/ag3ox1ye3xognle/DeathCauseReasonbyCountry_LaddaUpp-4.csv?dl=0")
# OLD https://dl.dropbox.com/s/k959wjskwnjsn3d/DeathCauseReasonbyCountry.csv?dl=0
# CLEANED https://dl.dropbox.com/s/ag3ox1ye3xognle/DeathCauseReasonbyCountry_LaddaUpp-4.csv?dl=0


# Country Coordinates
df_ccw = pd.read_csv('https://dl.dropbox.com/s/k8weq4ybd23mi78/country-coordinates-world-updatedLADDAUPPNM-3.csv?dl=0')
# OLD https://dl.dropbox.com/s/p6bumbl5he1pblb/country-coordinates-world.csv?dl=0
# CLEANED https://dl.dropbox.com/s/k8weq4ybd23mi78/country-coordinates-world-updatedLADDAUPPNM-3.csv?dl=0


# Population by Country
df_pbc = pd.read_csv('https://dl.dropbox.com/s/uzlir3ovrymqya9/population_by_country_2020.csv?dl=0')


# Continent by Country
df_cbc = pd.read_csv('https://dl.dropbox.com/s/q4dkgkur131xofj/countryContinent-LaddaUPP-4.csv?dl=0', encoding="ISO-8859-1")
# OLD https://dl.dropbox.com/s/7i3c2gl2l8lqxj3/countryContinent.csv?dl=0
# CLEANED https://dl.dropbox.com/s/q4dkgkur131xofj/countryContinent-LaddaUPP-4.csv?dl=0

## Let's have a look at all the datasets

1. Country Coordinates:

In [3]:
df_ccw.head()

Unnamed: 0,latitude,longitude,Country
0,33.93911,67.709953,Afghanistan
1,41.153332,20.168331,Albania
2,28.033886,1.659626,Algeria
3,-14.270972,-170.132217,American Samoa
4,42.546245,1.601554,Andorra


2. Population by Country:

In [4]:
df_pbc.head()

Unnamed: 0,Country (or dependency),Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,China,1440297825,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
1,India,1382345085,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
2,United States,331341050,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
3,Indonesia,274021604,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
4,Pakistan,221612785,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %


3. Continent by Country:

In [5]:
df_cbc.head()

Unnamed: 0,country,code_2,code_3,country_code,iso_3166_2,continent,sub_region,region_code,sub_region_code
0,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142.0,34.0
1,Åland Islands,AX,ALA,248,ISO 3166-2:AX,Europe,Northern Europe,150.0,154.0
2,Albania,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,150.0,39.0
3,Algeria,DZ,DZA,12,ISO 3166-2:DZ,Africa,Northern Africa,2.0,15.0
4,American Samoa,AS,ASM,16,ISO 3166-2:AS,Oceania,Polynesia,9.0,61.0


And finally, let's have a look at 4. Death Cause Reason by Country:

In [6]:
df_dcr.head()

Unnamed: 0,Country Name,Covid-19 Deaths,Cardiovascular diseases,Respiratory diseases,Kidney diseases,Neonatal disorders,Meningitis,Malaria,Interpersonal violence,HIV/AIDS,...,Neoplasms,"Fire, heat",Drowning,Drug use disorders,Road injuries,Environmental heat and cold exposure,Self-harm,Conflict and terrorism,Diabetes,Unnamed: 32
0,Afghanistan,2201.0,61995,7082,5637,23701,1563,530,5015,318,...,21247,485,1687,406,8254,59,1613,24295,4817,
1,Albania,1181.0,12904,815,329,161,13,0,57,2,...,4705,18,36,29,243,4,152,0,175,
2,Algeria,2762.0,97931,7528,8201,8756,292,0,459,264,...,23816,782,526,526,11051,40,1515,13,5328,
3,Andorra,84.0,169,39,16,0,0,0,0,3,...,230,0,0,0,8,0,8,0,9,
4,Angola,33.0,25724,3934,2464,18189,2520,10784,974,16802,...,12791,513,793,80,9253,114,1928,16,4033,


# 3. Transforming the Death Cause Reason by Country Dataset

To facilitate the visualization of the number of deaths attributed to each cause in every country, we aim to reformat the Death Cause Reason dataset such that each row represents a country, the cause of death, and the corresponding number of deaths. This transformation can be achieved using the "melt" operation.

In [7]:
df_dcr_melted = pd.melt(df_dcr,id_vars=['Country Name'],
var_name='Cause',
value_name='Number of Cases')

# Drop any NaN's and 0s
df_dcr_melted = df_dcr_melted.dropna()
df_dcr_melted = df_dcr_melted[df_dcr_melted['Number of Cases'] != 0.0]

Now, let's have a look at the transformed Death Cause Reason by Country dataset.

In [8]:
df_dcr_melted.head()

Unnamed: 0,Country Name,Cause,Number of Cases
0,Afghanistan,Covid-19 Deaths,2201.0
1,Albania,Covid-19 Deaths,1181.0
2,Algeria,Covid-19 Deaths,2762.0
3,Andorra,Covid-19 Deaths,84.0
4,Angola,Covid-19 Deaths,33.0
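To make the reshaping step concrete, here is a minimal sketch of the same melt and clean-up on a tiny made-up table. The country names and counts are hypothetical and only stand in for the wide layout of df_dcr.

```python
import pandas as pd

# Hypothetical two-country sample in the same wide layout as df_dcr
wide = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "Malaria": [10, 0],
    "Diabetes": [5, 7],
})

# One row per (country, cause) pair instead of one column per cause
tidy = pd.melt(wide, id_vars=["Country Name"],
               var_name="Cause",
               value_name="Number of Cases")

# Same clean-up as in the notebook: drop missing values and zero counts
tidy = tidy.dropna()
tidy = tidy[tidy["Number of Cases"] != 0]

print(tidy)
```

The zero count for Country B and Malaria disappears, so the long table holds three rows instead of four.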


Now, as every country has a different population, we aim to display the number of cases per capita. Therefore, we need to create a new column: Cases per Capita. Let's revisit the population dataset once again.

In [9]:
df_pbc.head()

Unnamed: 0,Country (or dependency),Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,China,1440297825,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
1,India,1382345085,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
2,United States,331341050,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
3,Indonesia,274021604,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
4,Pakistan,221612785,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %


Let's merge the Population (2020) column of each country into the Death Cause Reason by Country dataset.

In [10]:
# Rename the column 'Country (or dependency)' to 'Country Name' to match the df_dcr_melted
df_pbc = df_pbc.rename(columns={'Country (or dependency)': 'Country Name'})

df_dcr_melted = pd.merge(df_dcr_melted, df_pbc[['Country Name', 'Population (2020)']], on='Country Name')

df_dcr_melted.head()

Unnamed: 0,Country Name,Cause,Number of Cases,Population (2020)
0,Afghanistan,Covid-19 Deaths,2201.0,39074280
1,Afghanistan,Cardiovascular diseases,61995.0,39074280
2,Afghanistan,Respiratory diseases,7082.0,39074280
3,Afghanistan,Kidney diseases,5637.0,39074280
4,Afghanistan,Neonatal disorders,23701.0,39074280
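Because pd.merge defaults to an inner join, any country whose name is spelled differently in the two datasets is dropped without a warning. A hedged way to check for this, sketched here on made-up frames (the names and numbers are hypothetical), is a left join with indicator=True:

```python
import pandas as pd

# Hypothetical stand-ins for df_dcr_melted and df_pbc
deaths = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "Number of Cases": [10.0, 20.0, 30.0],
})
population = pd.DataFrame({
    "Country Name": ["Country A", "Country B"],
    "Population (2020)": [1000, 2000],
})

# A left join keeps every death record; the _merge column tells us
# which rows found a matching population entry
checked = deaths.merge(population, on="Country Name", how="left", indicator=True)

# Countries present in the death data but missing from the population data
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "Country Name"].unique())
print(unmatched)
```

Running the same check on the real datasets would show which country names, if any, need to be harmonized before the inner merge.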


Excellent! With the new column "Population (2020)" added to the Death Cause Reason by Country dataset, our next step is to create another column called "Cases Per Capita."

In [11]:
df_dcr_melted['Cases Per Capita'] = df_dcr_melted['Number of Cases'] / df_dcr_melted['Population (2020)']
df_dcr_melted.head()

Unnamed: 0,Country Name,Cause,Number of Cases,Population (2020),Cases Per Capita
0,Afghanistan,Covid-19 Deaths,2201.0,39074280,5.6e-05
1,Afghanistan,Cardiovascular diseases,61995.0,39074280,0.001587
2,Afghanistan,Respiratory diseases,7082.0,39074280,0.000181
3,Afghanistan,Kidney diseases,5637.0,39074280,0.000144
4,Afghanistan,Neonatal disorders,23701.0,39074280,0.000607
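Per-capita values this small are hard to read in tooltips and legends. One optional variation, shown as a sketch and not part of the original notebook, is to scale the ratio to cases per 100,000 inhabitants. The column name "Cases per 100k" is an assumption; the numbers are the first Afghanistan row shown above.

```python
import pandas as pd

# First row of the merged dataset shown above
sample = pd.DataFrame({
    "Country Name": ["Afghanistan"],
    "Cause": ["Covid-19 Deaths"],
    "Number of Cases": [2201.0],
    "Population (2020)": [39074280],
})

# Same ratio as in the notebook
sample["Cases Per Capita"] = sample["Number of Cases"] / sample["Population (2020)"]

# Hypothetical extra column: the same ratio expressed per 100,000 people
sample["Cases per 100k"] = sample["Cases Per Capita"] * 100_000

print(sample[["Country Name", "Cases Per Capita", "Cases per 100k"]])
```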


Fantastic! Now that all the data is ready for visualization, let's set up the interactive selections before drawing the world map, where each country will be represented by a dot.

# 4. Creation of our Selectors

As our prototype specifies, we want a few different selections and filters. There are three types:

1.   Being able to select a dot/country in the geoChart
2.   Dropdown-menu for filtering by Cause of Death
3.   Being able to select a continent in the legend

Let's start by creating the first one, which is fairly straightforward. All we need to do is create a selection_single with the Altair library and base the selection on the country, which in our case is the Country Name column. (In Altair 5, selection_single is deprecated in favor of selection_point.)

In [13]:
selector = alt.selection_single( fields=['Country Name'], empty='all')

The second selector filters by cause of death using a dropdown menu. It is similar to our first selector, but it needs an actual dropdown menu. Therefore, we first create a binding_select, which provides the dropdown, and then bind it to a selector called dropSelect.

In [None]:
input_dropdown = alt.binding_select(
    # Each option must match a value of the 'Cause' column exactly,
    # including any leading or trailing spaces in the strings below
    options=[
        'Covid-19 Deaths', 'Cardiovascular diseases', 'Respiratory diseases ',
        'Kidney diseases', 'Neonatal disorders ', 'Meningitis ', 'Malaria ',
        'Interpersonal violence', 'HIV/AIDS', 'Tuberculosis', 'Maternal disorders',
        'Lower respiratory infections', 'Alcohol use disorders', 'Diarrheal diseases',
        'Poisoning', 'Nutritional deficiencies', ' Alzheimers disease',
        'Parkinsons disease', ' Acute hepatitis', 'Digestive diseases',
        ' Cirrhosis and other chronic liver diseases', 'Protein-energy malnutrition',
        'Neoplasms', "Fire, heat", 'Drowning', 'Drug use disorders', 'Road injuries',
        'Environmental heat and cold exposure', 'Self-harm', ' Conflict and terrorism',
        'Diabetes ',
    ],
    name="Death cause: ")

dropSelect = alt.selection_single(fields=['Cause'], init={'Cause':'Covid-19 Deaths'},
                                 bind=input_dropdown)
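A hand-typed option list only works if every string matches a value in the Cause column character for character. As a sketch of one alternative, the options can be derived from the data itself. The sample values are hypothetical, and stripping the whitespace is an assumption about what is wanted: the same clean-up would then have to be applied to the Cause column used by the charts.

```python
import pandas as pd

# Hypothetical sample of the melted data, with stray spaces in some causes
melted = pd.DataFrame({
    "Cause": ["Malaria ", " Acute hepatitis", "Diabetes ", "Malaria "],
})

# Normalize the labels once, so the data and the dropdown agree
melted["Cause"] = melted["Cause"].str.strip()

# Unique, sorted labels to use as dropdown options
cause_options = sorted(melted["Cause"].unique())
print(cause_options)
```

The resulting list could be passed as options=cause_options to alt.binding_select in place of the literal list.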


Our last selection is a legend containing all continents, which can be clicked to filter by continent. Passing 'legend' as the bind parameter attaches the selection to the chart's legend.

In [26]:
selection = alt.selection_single(fields=['continent'], bind='legend')

Now all our selectors are ready to be applied to our charts.

# 5. The Geographical Chart

The geographical chart shows one dot per country. Each dot is placed at the country's actual world coordinates, sized by the Cases Per Capita column, and filtered by the selected cause of death.

1.   We create a mark_circle chart.
2.   Longitude and latitude come from each country's actual coordinates.
3.   Size depends on Cases Per Capita.
4.   Color depends on the continent the country belongs to.
5.   A tooltip displays the Country Name, Cases Per Capita, and Number of Cases.

In [27]:
geoChart = alt.Chart(df_dcr_melted
).transform_filter(
  dropSelect  # keep only the cause chosen in the dropdown

).transform_lookup( # look up each country's coordinates in df_ccw
    lookup='Country Name',
    from_=alt.LookupData(df_ccw,'Country', ['longitude', 'latitude'])

).transform_lookup( # look up each country's continent in df_cbc
    lookup='Country Name',
    from_=alt.LookupData(df_cbc,'country', ['continent'])

).transform_filter(
  selection  # filter by continent only after the lookup has added that field

).mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.Size('Cases Per Capita:Q', title='Cases Per Capita', legend=alt.Legend(orient='right')),
    color= alt.Color('continent:N', title='Continents', legend=alt.Legend(orient='right')),
    opacity = alt.condition(selector, alt.value(1), alt.value(0.17)),
    tooltip=['Country Name:N', 'Cases Per Capita:Q', 'Number of Cases:Q']

).add_selection(
  dropSelect

).add_selection(
  selector

).add_selection(
  selection
)
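The two transform_lookup steps behave much like left joins: every death record is kept, and the looked-up fields are attached where a matching key exists. A rough pandas equivalent, sketched on a small sample, is shown here. The coordinates are taken from the tables above; "Atlantis" and the per-capita values are made up to illustrate an unmatched country.

```python
import pandas as pd

# Hypothetical death records; "Atlantis" has no match in the other tables
deaths = pd.DataFrame({
    "Country Name": ["Albania", "Algeria", "Atlantis"],
    "Cases Per Capita": [0.0004, 0.00006, 0.0001],
})
coords = pd.DataFrame({
    "Country": ["Albania", "Algeria"],
    "latitude": [41.153332, 28.033886],
    "longitude": [20.168331, 1.659626],
})
continents = pd.DataFrame({
    "country": ["Albania", "Algeria"],
    "continent": ["Europe", "Africa"],
})

# Left joins mirror the lookups: unmatched rows stay, with missing values
located = (
    deaths
    .merge(coords, left_on="Country Name", right_on="Country", how="left")
    .merge(continents, left_on="Country Name", right_on="country", how="left")
    .drop(columns=["Country", "country"])
)

# Countries that would have no position on the map
missing = located.loc[located["latitude"].isna(), "Country Name"].tolist()
print(missing)
```

A check like this on the real data would reveal countries that cannot be placed on the map because their names differ between datasets.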


Let's have a look at what we just did.

In [21]:
geoChart

Looks good, but we are missing a world background. Let's add that.

To ensure coherence in our geographical plot, the dots symbolizing each country on the map must be positioned atop a world map. Let's generate a TopoJSON feature collection that encapsulates information about countries across the globe.

In [28]:
# Source of the background
source = alt.topo_feature(data.world_110m.url, 'countries')

# Making the background
world_background = alt.Chart(source).mark_geoshape(
    fill='white',
    stroke='lightgrey'
).properties(
    width=900,
    height=500
).project('naturalEarth1')

world_background

Now let's combine the geoChart and the world map.

In [29]:
world_background + geoChart

Now let's add the bar charts below the map.

Let's begin with the top-list bar chart. We use mark_bar with Cases Per Capita on the x-axis and Country Name on the y-axis. The tooltip shows the country name, the continent, Cases Per Capita, and the total number of cases.

We also apply the same filters as before, and a window transform keeps only the countries ranked below 17 by Cases Per Capita.


In [32]:
barChart_toplist = alt.Chart(df_dcr_melted
).properties(
    width=200

).transform_filter(
    dropSelect  # keep only the cause chosen in the dropdown

).transform_lookup( # look up each country's continent in df_cbc
    lookup='Country Name',
    from_=alt.LookupData(df_cbc,'country', ['continent'])

).transform_filter(
    selection  # filter by continent only after the lookup has added that field

).mark_bar().encode(
    x = alt.X('Cases Per Capita:Q', title='Cases Per Capita'),
    y = alt.Y('Country Name:N', title='Country' , sort='-x'),
    tooltip=['Country Name:N', 'continent:N', 'Cases Per Capita:Q', 'Number of Cases:Q'],
    color = 'continent:N' ,
    opacity = alt.condition(selector, alt.value(1), alt.value(0.17))

).transform_window( # rank the remaining countries by Cases Per Capita
    rank='rank(Cases Per Capita)',
    sort=[alt.SortField('Cases Per Capita', order='descending')]

).transform_filter( # keep ranks 1 to 16
    (alt.datum.rank < 17)

).properties(
    title="Toplist"

).add_selection(
  dropSelect

).add_selection(
  selector
)


Let's have a look at our bar chart.

In [34]:
alt.vconcat(world_background + geoChart , barChart_toplist)
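The window transform and rank filter behind the top list can be mirrored in pandas, which is a convenient way to sanity-check which countries should appear. This is a sketch on twenty made-up countries, assuming that rank(method="min") is a fair stand-in for the chart's rank operation, where tied values share a rank.

```python
import pandas as pd

# Twenty hypothetical countries with strictly decreasing per-capita values
sample = pd.DataFrame({
    "Country Name": [f"Country {i}" for i in range(1, 21)],
    "Cases Per Capita": [i / 1000 for i in range(20, 0, -1)],
})

# Rank in descending order of Cases Per Capita (highest value gets rank 1)
sample["rank"] = sample["Cases Per Capita"].rank(method="min", ascending=False)

# Same cut-off as the chart: keep ranks below 17, i.e. ranks 1 to 16
toplist = sample[sample["rank"] < 17].sort_values("Cases Per Capita", ascending=False)

print(toplist[["Country Name", "Cases Per Capita", "rank"]])
```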

Looks good. Finally, let's add the last element of our first visualization, the top-causes bar chart.

As with the previous bar chart, we use mark_bar.

In [35]:

#----------- The Top-Causes Bar Chart ------
barChart_commoncauses = alt.Chart(df_dcr_melted
).properties(
    width=200

).mark_bar(

).transform_filter(
    selector

).transform_aggregate( # Sum Cases Per Capita for each cause (for the selected country)
    totalIncidents='sum(Cases Per Capita):Q',
    groupby=["Cause"]

).encode(
    x = alt.X('totalIncidents:Q', title='Cases Per Capita'),
    y = alt.Y('Cause:N', title='Death Cause', sort='-x'),
    tooltip=['Cause:N','totalIncidents:Q'],
    # Bars are hidden (opacity 0) while no country is selected; the aggregate drops
    # 'Country Name', so once a country is selected the condition is false and bars show
    opacity=alt.condition(selector, alt.value(0), alt.value(1))

).transform_window(
    rank='rank(totalIncidents)',
    sort=[alt.SortField('totalIncidents', order='descending')]

).transform_filter(
    (alt.datum.rank < 17)

).properties(
    title= 'Most Common Causes of Death For Selected Country'
)


# ----------- Plotting Visualization One --------
# world_background + geoPlot | barChart | barChart_commoncauses
alt.vconcat(world_background + geoChart , barChart_toplist |  barChart_commoncauses)
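The filter and aggregate steps of the top-causes chart correspond to a filter followed by a group-by sum. A pandas sketch of the same idea follows. The Afghanistan values come from the table above, while the Albania values are made up for illustration.

```python
import pandas as pd

melted = pd.DataFrame({
    "Country Name": ["Afghanistan"] * 3 + ["Albania"] * 2,
    "Cause": [
        "Cardiovascular diseases", "Neonatal disorders", "Respiratory diseases",
        "Cardiovascular diseases", "Respiratory diseases",
    ],
    # Albania values are hypothetical
    "Cases Per Capita": [0.001587, 0.000607, 0.000181, 0.0045, 0.0003],
})

# Equivalent of transform_filter(selector) with Afghanistan selected
selected = melted[melted["Country Name"] == "Afghanistan"]

# Equivalent of transform_aggregate: sum per cause, then sort descending
top_causes = (
    selected.groupby("Cause", as_index=False)["Cases Per Capita"].sum()
    .rename(columns={"Cases Per Capita": "totalIncidents"})
    .sort_values("totalIncidents", ascending=False)
)

print(top_causes)
```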


In [15]:
# ----------- The Second Viz ----------------
############################################

# Consists of one chart:
      # The Scatter Chart

# ----------- The Scatter Chart ----------------
scatterChart = alt.Chart(df_dcr_melted

).transform_filter(
  dropSelect

).transform_lookup( # lookup Country from the other dataset and get the population of each country
    lookup='Country Name',
    from_=alt.LookupData(df_pbc,'Country Name', ['Population (2020)'])

).transform_lookup( # lookup Country from the other dataset and get the continent of each country
    lookup='Country Name',
    from_=alt.LookupData(df_cbc,'country', ['continent'])

).mark_point(size=70).encode(
    x = alt.X('Population (2020):Q', axis=alt.Axis(title='Population')),
    y = alt.Y('Number of Cases:Q', axis=alt.Axis(title='Number of Cases')),
    tooltip = ['Country Name:N', 'continent:N', 'Population (2020):Q','Number of Cases:Q'],
    color = alt.Color('continent:N', title='Continents'),
    opacity = alt.condition(selection, alt.value(1), alt.value(0.17))

).properties(
    width = 760,
    height = 560
)


# ----------- Plotting Visualization Two --------
scatter_plots = scatterChart.add_selection(dropSelect).add_selection(selection).interactive() # + scatter_plot.transform_regression('Population (2020)', 'Number of Cases').mark_line()
scatter_plots
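The commented-out transform_regression call would overlay a fitted line on the scatter chart. As a rough sketch of what a straight-line fit computes, the same least-squares line can be obtained with NumPy. The population and case numbers here are hypothetical.

```python
import numpy as np

# Hypothetical populations and case counts for four countries
population = np.array([1.0e6, 2.0e6, 4.0e6, 8.0e6])
cases = np.array([120.0, 210.0, 390.0, 810.0])

# Least-squares straight line: cases ~ slope * population + intercept
slope, intercept = np.polyfit(population, cases, deg=1)

# Fitted values at the observed populations
predicted = slope * population + intercept

print(slope, intercept)
```

A positive slope means that, for the selected cause, more populous countries tend to report more cases, which is the pattern the scatter chart is meant to reveal.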