![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Finteresting-problems&branch=main&subPath=notebooks/gapminder-ufm-tfr.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Exploring Gapminder Data

[Watch on YouTube](https://www.youtube.com/watch?v=o8JEHJaDg4o&list=PL-j7ku2URmjZYtWzMCS4AqFS5SXPXRHwf)

Gapminder is a non-profit website that uses large global data sets to promote understanding of the world.

In addition to a [quiz](http://forms.gapminder.org/s3/test-2018) that makes the point "almost nobody knows the basic global facts", there is a photo project called "Dollar Street" that helps dispel quality-of-life stereotypes, and a book called "[Factfulness: Ten Reasons We're Wrong About the World – and Why Things Are Better Than You Think](https://www.goodreads.com/book/show/34890015-factfulness)".

One of the famous data visualizations from [Gapminder.org](https://www.gapminder.org) is a bubble chart of [life expectancy versus income over time](https://www.gapminder.org/tools/#$chart-type=bubbles).

Here is a version of that visualization in Python.

In [None]:
import plotly.express as px
import warnings
warnings.filterwarnings("ignore", module="plotly.express") # suppress warnings

gapminder_df = px.data.gapminder()
px.scatter(gapminder_df, x='gdpPercap', y='lifeExp', 
           animation_frame='year', size='pop', size_max=100, 
           color='continent', hover_name='country', 
           log_x=True, range_x=[100,100000], range_y=[25, 90])

We can look at changes in life expectancy over the years as a line graph.

In [None]:
px.line(gapminder_df, x='year', y='lifeExp', color='country', title='Life Expectancy Over Time')

Click or double-click on countries in the legend to hide or show particular lines.

We can quickly see some events that significantly affected life expectancy, such as [Rwanda in 1992](https://en.wikipedia.org/wiki/Rwandan_genocide) or [Cambodia in 1977](https://en.wikipedia.org/wiki/Cambodian_genocide).

## GDP

Using Gapminder data we can display GDP per capita for a particular year to compare countries.

In [None]:
import pandas as pd

gdpppc_excel_link = 'https://github.com/Gapminder-Indicators/gdppc_cppp/blob/master/gdppc_cppp-by-gapminder.xlsx?raw=true'
gdpppc = pd.read_excel(gdpppc_excel_link, sheet_name='countries_and_territories')
gdpppc.dropna(axis=0, how='any', inplace=True) # drop rows (countries) that have any missing values
px.bar(gdpppc.sort_values(2018, ascending=False), x='geo.name', y=2018, title='GDP Per Capita (2018)')

We can zoom in and mouse over the graph to see more information.

There are, of course, many other interesting data sets on the site, such as adult literacy rate.

In [None]:
alr_spreadsheet_key = '12O0Bo85Dd-9bNq6p5KwXduPET1cRETP-mKy3ZK4q_xo' # from the URL
alr_spreadsheet_gid = '0' # the first sheet
alr_csv = 'https://docs.google.com/spreadsheets/d/'+alr_spreadsheet_key+'/export?gid='+alr_spreadsheet_gid+'&format=csv'
alr = pd.read_csv(alr_csv)
alr
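The export link above is assembled from a spreadsheet key and a sheet `gid`. As a minimal sketch, the same pattern can be wrapped in a small helper; the function name `google_sheet_csv_url` is hypothetical and not part of pandas or any Google library, and it assumes the spreadsheet is shared publicly.

```python
from urllib.parse import urlencode

def google_sheet_csv_url(spreadsheet_key, gid='0'):
    """Build the CSV export URL for one sheet of a publicly shared Google spreadsheet.

    Hypothetical helper that follows the same URL pattern used in the cell above.
    """
    query = urlencode({'gid': gid, 'format': 'csv'})
    return f'https://docs.google.com/spreadsheets/d/{spreadsheet_key}/export?{query}'

# Same key as the adult literacy rate spreadsheet above, first sheet (gid 0)
alr_url = google_sheet_csv_url('12O0Bo85Dd-9bNq6p5KwXduPET1cRETP-mKy3ZK4q_xo')
print(alr_url)
```

The resulting string can be passed straight to `pd.read_csv`, which keeps the key and `gid` visible in one place when several spreadsheets are loaded.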

Unfortunately there are quite a few gaps in that data set, but it is difficult to collect these types of data on a large scale.

# Question

It would be informative to compare data sets to see if there are correlations. Do we see a [correlation](https://www.ncbi.nlm.nih.gov/books/NBK233807/) between child (under five) mortality rate and children born per woman (total fertility rate)?

## Under Five Mortality

### Retrieve Data

We'll start by getting the Gapminder data for child mortality rate, called **under five mortality**, and storing it in a dataframe called `ufm`.

In [None]:
ufm_spreadsheet_key = '1KqOcaDdM1rWQD8TnAEZpYDDQRJqlxqU7t_KT55pgd4U'
ufm_spreadsheet_gid = '1535646753' # the sheet's gid, from the URL
ufm = pd.read_csv('https://docs.google.com/spreadsheets/d/'+ufm_spreadsheet_key+'/export?gid='+ufm_spreadsheet_gid+'&format=csv')
ufm

### Adjust Dataframe

To make these data easier to graph, we'll rename the column `Under five mortality` to just `Country`, and then set `Country` as the dataframe index.

In [None]:
ufm.rename(columns={'Under five mortality':'Country'}, inplace=True)
ufm.set_index('Country', inplace=True)
ufm

### Clean Data

Now let's drop any rows that don't have data.

In [None]:
ufm.dropna(axis=0, how='all', inplace=True)
ufm
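The choice of `how='all'` matters here: it removes only rows where every value is missing, while `how='any'` removes rows with even one gap. A small sketch with made-up values (the dataframe `toy` and its labels are invented for illustration) shows the difference.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame(
    {'1990': [10.0, np.nan, np.nan], '2000': [8.0, 7.0, np.nan]},
    index=['A', 'B', 'C'],
)

dropped_all = toy.dropna(axis=0, how='all')  # removes only C, which has no values at all
dropped_any = toy.dropna(axis=0, how='any')  # also removes B, which has one missing value

print(dropped_all.index.tolist())
print(dropped_any.index.tolist())
```

Keeping partially filled rows is usually what we want for a time series, since a country with a few missing years can still be plotted.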


### Graph Data

To graph these data we'll first transpose the dataframe (using `transpose()`) so that years are on the x-axis and each country becomes its own line.

In [None]:
px.line(ufm.transpose(), title='Under Five Mortality Over Time').update_layout(xaxis_title='Year', yaxis_title='Under Five Mortality')
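To see what transposing does, here is a minimal sketch on a made-up two-country dataframe (the names and values are invented). After `transpose()`, the years form the index and each country is a column, which is the wide layout the line plot above draws as one line per column.

```python
import pandas as pd

# Made-up values, for illustration only: one row per country, one column per year
toy = pd.DataFrame(
    {'1990': [10.0, 20.0], '2000': [8.0, 15.0]},
    index=pd.Index(['A', 'B'], name='Country'),
)

toy_t = toy.transpose()  # now one row per year, one column per country
print(toy_t)
```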

Apart from some disconcerting spikes, there seems to have been a downward trend in child mortality over time.

We can zoom in on sections of the graph to look closer, and select individual countries by clicking or double-clicking on the names in the legend.

### Graph a Subset of the Data

We can also generate a graph with just certain countries that we may be interested in.

In [None]:
ufm_subset = ufm[ufm.index.isin(['Canada', 'United States', 'Mexico'])]
px.line(ufm_subset.transpose(), title='Under Five Mortality Over Time').update_layout(xaxis_title='Year', yaxis_title='Under Five Mortality')

### Averages by Region

To look at averages per continent, we'll need a data set that maps countries to regions.

In [None]:
geonames = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--geo_entity_domain/master/ddf--entities--geo--country.csv')
geonames

The two columns we are interested in are the name of the country and `world_6region` (although we could instead use `world_4region` if we prefer).

In [None]:
geonames_filtered = geonames[['name', 'world_6region']].rename(columns={'name':'Country', 'world_6region':'Continent'})
geonames_filtered

### Merge Region Data and Calculate Mean

Let's merge that into our `ufm` dataframe and calculate the average (mean) by continent.

In [None]:
ufm_continent = ufm.merge(geonames_filtered, on='Country')
ufm_continent_mean = ufm_continent.groupby('Continent').mean(numeric_only=True)
ufm_continent_mean
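One thing to keep in mind is that `merge` defaults to an inner join, so any country whose name does not appear in both dataframes is silently left out of the averages. The sketch below uses made-up countries and values to show how `how='left'` with `indicator=True` can list the unmatched names before averaging; the toy dataframes are assumptions for illustration, not the Gapminder data.

```python
import pandas as pd

# Made-up values and country names, for illustration only
toy_ufm = pd.DataFrame(
    {'1990': [10.0, 20.0, 30.0], '2000': [8.0, 16.0, 24.0]},
    index=pd.Index(['A', 'B', 'C'], name='Country'),
)
toy_regions = pd.DataFrame({'Country': ['A', 'B', 'D'], 'Continent': ['north', 'north', 'south']})

# how='left' keeps every row of toy_ufm; indicator=True adds a _merge column
# recording whether each row found a match in toy_regions
checked = toy_ufm.reset_index().merge(toy_regions, on='Country', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country'].tolist()
print('No region found for:', unmatched)

# The default inner join drops C, so only A and B contribute to the mean
toy_mean = (
    toy_ufm.reset_index()
    .merge(toy_regions, on='Country')
    .groupby('Continent')
    .mean(numeric_only=True)
)
print(toy_mean)
```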

### Graph Mean by Continent

In [None]:
px.line(ufm_continent_mean.transpose(), title='Under Five Mortality Over Time').update_layout(xaxis_title='Year', yaxis_title='Under Five Mortality')

That makes it simpler to see trends, and shows a decrease in child mortality over the last 50 years.

## Total Fertility Rate

Now let's do the same process with the number of children born per woman, the **total fertility rate**.

### Retrieve and Process Data

In [None]:
tfr_spreadsheet_key = '1oq3r8W7ajenKFgoAYoOf2MXeTWWNPpudR-Fo5m2-o30'
#tfr_spreadsheet_key = '1yhCv2YRWk5DqsyLN2g-AJBA66EYClPj5vvUBnfslf-I' # with projections
tfr_spreadsheet_gid = '0' # the first sheet
tfr = pd.read_csv('https://docs.google.com/spreadsheets/d/'+tfr_spreadsheet_key+'/export?gid='+tfr_spreadsheet_gid+'&format=csv')
tfr.rename(columns={'Total fertility rate':'Country'}, inplace=True)
tfr.set_index('Country', inplace=True)
tfr.dropna(axis=0, how='all', inplace=True)
tfr

### Generate Graph

In [None]:
px.line(tfr.transpose(), title='Total Fertility Rate Over Time').update_layout(xaxis_title='Year', yaxis_title='Total Fertility Rate')

### Mean by Continent

Again, it looks like a general decline recently but there are a lot of lines on that graph. Let's find the mean by continent.

In [None]:
tfr_continent = tfr.merge(geonames_filtered, on='Country')
tfr_continent_mean = tfr_continent.groupby('Continent').mean(numeric_only=True)
px.line(tfr_continent_mean.transpose(), title='Total Fertility Rate Over Time').update_layout(xaxis_title='Year', yaxis_title='Total Fertility Rate')

## Correlations?

### Correlations by Continent

Observing the declines in *under five mortality* and *total fertility rate*, let's see if they look like they are correlated.

In [None]:
px.line(ufm_continent_mean.transpose(), title='Under Five Mortality Over Time').update_layout(xaxis_title='Year', yaxis_title='Under Five Mortality').show()
px.line(tfr_continent_mean.transpose(), title='Total Fertility Rate Over Time').update_layout(xaxis_title='Year', yaxis_title='Total Fertility Rate').show()

Total fertility rate seems to lag behind child mortality rate, but there does appear to be a correlation.
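One rough way to check for a lag is to shift one series by a few years and see which shift gives the highest correlation. The sketch below does this on made-up numbers, where the second series is deliberately built as a two-year-delayed copy of the first; the series names and values are assumptions for illustration and say nothing about the real data.

```python
import pandas as pd

years = list(range(1950, 1960))
# Made-up values, for illustration only
mortality = pd.Series([10, 9, 9, 8, 6, 5, 5, 4, 2, 1], index=years, dtype=float)
fertility = mortality.shift(2)  # an artificial series that echoes mortality two years later

# Correlate fertility with mortality delayed by 0, 1, 2 and 3 years;
# corr() ignores the missing values that shift() introduces
lag_correlations = {lag: mortality.shift(lag).corr(fertility) for lag in range(0, 4)}
best_lag = max(lag_correlations, key=lag_correlations.get)
print(lag_correlations)
print('Best lag:', best_lag)
```

With real data the peak would be less clear-cut, and shifting shortens the overlapping period, so this is only a starting point for exploration.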

### Correlations by Country

We may also want to look at this for a particular country or list of countries.

In [None]:
countries = ['Canada']

px.line(ufm.loc[countries].transpose(), title='Under Five Mortality Over Time').update_layout(xaxis_title='Year', yaxis_title='Under Five Mortality').show()
px.line(tfr.loc[countries].transpose(), title='Total Fertility Rate Over Time').update_layout(xaxis_title='Year', yaxis_title='Total Fertility Rate').show()

## Calculating Correlations

To find how closely correlated these two data sets are, we'll calculate the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) for each country. A value of 0 means no linear correlation, 1 is a perfect positive correlation, and -1 is a perfect negative correlation.

In [None]:
correlations_dictionary = {}
missing_data = []
for country in ufm.index.tolist():
    try:
        correlations_dictionary[country] = ufm.loc[country].corr(tfr.loc[country], method='pearson')
    except Exception:  # e.g. KeyError when the country is not in tfr
        missing_data.append(country)
print('Missing data for: ', missing_data)
correlations = pd.DataFrame.from_dict(correlations_dictionary, orient='index', columns=['Correlation'])
correlations
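For reference, the Pearson coefficient computed above is the sum of the products of each pair's deviations from the mean, divided by the square root of the product of the summed squared deviations. A minimal sketch on made-up numbers compares that formula with the pandas result; the values of `x` and `y` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x - x.mean()  # deviations of x from its mean
dy = y - y.mean()  # deviations of y from its mean
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y, method='pearson')
print(r_manual, r_pandas)
```

Note that `Series.corr` leaves out any pair where either value is missing, so countries with many gaps have their coefficient computed from fewer years.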

In [None]:
print('Average correlation:', correlations['Correlation'].mean())

In [None]:
print('Median correlation:', correlations['Correlation'].median())

In [None]:
print('Countries with the highest correlation:')
correlations.sort_values('Correlation', ascending=False).head(10)

In [None]:
fig = px.bar(correlations, title='Correlations Between Under Five Mortality and Total Fertility Rate')
fig.update_layout(xaxis_title='Country', yaxis_title='Correlation', showlegend=False)
fig.show()

# Conclusion

Based on these data about child (under five) mortality and total fertility rate (children per woman), it looks like there is probably a correlation. Of course we remember that [correlation does not imply causation](https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation), but there is potential for further study.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)