###### Content under Creative Commons Attribution license CC-BY 4.0, code under BSD 3-Clause License © 2017 L.A. Barba, N.C. Clementi

In [None]:
import numpy
import pandas
from matplotlib import pyplot
%matplotlib inline

#Import rcParams to set font styles
from matplotlib import rcParams

#Set font style and size 
rcParams['font.family'] = 'serif'
rcParams['font.size'] = 16

## Load and inspect the data

We found a website called [The Python Graph Gallery](https://python-graph-gallery.com), which has a lot of data visualization examples. 
Among them is a [Gapminder Animation](https://python-graph-gallery.com/341-python-gapminder-animation/), an animated GIF of bubble charts in the style of Hans Rosling. 
We're not going to repeat the same example, but we do get some ideas from it and re-use their data set. 
The data file is hosted on their website, and we can read it directly from there into a `pandas` dataframe.

In [None]:
# Read a dataset for life expectancy from a CSV file hosted online
url = 'https://python-graph-gallery.com/wp-content/uploads/gapminderData.csv'
life_expect = pandas.read_csv(url)

The first step is always to take a peek at the data. 
Using the `shape` attribute of the dataframe, we find out how many rows and columns it has. In this case, the dataframe is too big to print in full, so to save space we'll print a small portion of `life_expect`.
You can use a slice to do this, or you can use the [`DataFrame.head()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method, which by default returns the first 5 rows.

In [None]:
life_expect.shape

In [None]:
life_expect.head()

You can see that the columns hold six types of data: the country, the year, the population, the continent, the life expectancy, and the per-capita gross domestic product (GDP). 
Rows are indexed from 0, and the columns each have a **label** (also called an index). Using labels to access data is one of the most powerful features of `pandas`.
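
As a small illustration of label-based access, here is a sketch on a toy dataframe. The values are made up for illustration and are not taken from the Gapminder file; only the column names mirror those of `life_expect`.

```python
import pandas

# Toy dataframe with made-up values; only the column names mirror life_expect
toy = pandas.DataFrame({'country': ['A', 'A', 'B'],
                        'year': [1952, 1957, 1952],
                        'lifeExp': [30.0, 31.5, 60.2]})

# A column label selects a whole column, returned as a pandas Series
life_column = toy['lifeExp']

# .loc selects a single entry by row label and column label together
value = toy.loc[1, 'lifeExp']

# A list of labels selects several columns, returned as a smaller dataframe
subset = toy[['country', 'year']]
```

The same expressions should work on `life_expect` itself, since it uses the default integer row labels.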

In the first five rows, we see that the country repeats (Afghanistan), while the year jumps by five. We guess that the data is arranged in blocks of rows for each country.

We can get a useful summary of the dataframe with the [`DataFrame.info()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) method. It tells us the number of rows and the number of columns (matching the output of the `shape` attribute). Then, for each column, it tells us the number of rows that are populated (have non-null entries) and the type of the entries. Finally, it gives a breakdown of the types of data and an estimate of the memory used by the dataframe.

In [None]:
life_expect.info()

The dataframe has 1704 rows, and every column has 1704 non-null entries, so there is no missing data. Let's find out how many entries of the same year appear in the data. 
In [Lesson 1](http://go.gwu.edu/engcomp2lesson1) of this module, you already learned how to extract a column from a dataframe and use the [`Series.value_counts()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method. Let's apply that to answer our question.

In [None]:
life_expect['year'].value_counts()

Each year appears exactly 142 times in the dataframe, so the entries for a given year presumably correspond to 142 different countries. It is also clear that we have data every five years, starting in 1952 and ending in 2007. We think we have a pretty clear picture of what is contained in this data set. What next?
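
The guess that every year holds one row per country can be checked directly. The sketch below uses a toy dataframe with made-up values (not the Gapminder data); the same checks could be run on `life_expect`.

```python
import pandas

# Toy stand-in for life_expect, with made-up values
toy = pandas.DataFrame({'country': ['A', 'A', 'B', 'B'],
                        'year': [1952, 1957, 1952, 1957],
                        'lifeExp': [30.0, 31.5, 60.2, 61.0]})

# Number of distinct countries in the data
n_countries = toy['country'].nunique()

# Number of rows recorded for each year
rows_per_year = toy['year'].value_counts()

# True if every year has exactly as many rows as there are countries
one_row_per_country = bool((rows_per_year == n_countries).all())

# True if no (country, year) pair appears more than once
no_repeats = not toy.duplicated(subset=['country', 'year']).any()
```

If both flags are `True`, each country contributes exactly one row to each year.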

## Grouping data for analysis

We have a dataframe with a `country` column, where countries repeat in blocks of rows, and a `year` column, where sets of 12 years (increasing by 5) repeat for every country. Tabular data commonly has this interleaved structure, and data analysis often involves grouping the data in various ways to transform it, compute statistics, and visualize it.

With the life expectancy data, it's natural to want to analyze it by year (and look at geographical differences), and by country (and look at historical differences). 

In [Lesson 2](http://go.gwu.edu/engcomp2lesson2) of this module, we already learned how useful it was to group the beer data by style, and calculate means within each style. Let's get better acquainted with the powerful `groupby()` method for dataframes. First, grouping by the values in the `year` column:

In [None]:
by_year = life_expect.groupby('year')

In [None]:
type(by_year)

Notice that the type of the new variable `by_year` is different: it's a _GroupBy_ object, which—without making a copy of the data—is able to apply operations on each of the groups.
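
To see what "applying operations on each of the groups" means, here is a minimal sketch on a toy dataframe. The values are made up for illustration; only the column names follow `life_expect`.

```python
import pandas

# Toy dataframe with made-up values
toy = pandas.DataFrame({'year': [1952, 1952, 1957, 1957],
                        'country': ['A', 'B', 'A', 'B'],
                        'lifeExp': [30.0, 60.0, 32.0, 62.0]})

# Grouping does not compute anything yet; it only records how rows are split
by_year_toy = toy.groupby('year')

# Aggregations are evaluated group by group, one result per year
mean_by_year = by_year_toy['lifeExp'].mean()

# size() counts the rows that fall in each group
group_sizes = by_year_toy.size()
```

The results are indexed by the group keys, here the two years.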

The [`GroupBy.first()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.first.html) method, for example, returns the first entry of each column within each group; with no missing values, as in our data, that is the first row of the group. Applied to our grouping `by_year`, it shows the list of years (as a label), with the first country that appears in each year-group.

In [None]:
by_year.first()
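
One detail worth knowing: by default, `first()` takes the first non-null entry of each column, so with missing data it can combine values from different rows. The sketch below shows this on a toy dataframe with made-up values, and contrasts it with `head(1)`, which keeps the first row of each group as is.

```python
import numpy
import pandas

# Toy dataframe with made-up values and one missing entry
toy = pandas.DataFrame({'year': [1952, 1952, 1957, 1957],
                        'country': ['A', 'B', 'A', 'B'],
                        'lifeExp': [numpy.nan, 60.0, 32.0, 62.0]})
grouped = toy.groupby('year')

# first(): first non-null entry per column, indexed by the group key
first_entries = grouped.first()

# head(1): the literal first row of each group, keeping the original row labels
first_rows = grouped.head(1)
```

In the 1952 group, `first_entries` pairs country `'A'` with the life expectancy of country `'B'`, while `first_rows` keeps the missing value. Since `life_expect` has no missing data, the two approaches agree there.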

All the year-groups have the same first country, Afghanistan, so what we see is the population, life expectancy and per-capita income in Afghanistan for all the available years.
Let's save that into a new dataframe, and make line plots of the population and the life expectancy over the years.

In [None]:
Afghanistan = by_year.first()

In [None]:
Afghanistan['pop'].plot(figsize=(8,4),
                       title='Population of Afghanistan');

In [None]:
Afghanistan['lifeExp'].plot(figsize=(8,4),
                       title='Life expectancy of Afghanistan');

In [None]:
Afghanistan.describe()

In [None]:
by_country = life_expect.groupby('country')

In [None]:
by_country.first()

The first year for all groups-by-country is 1952. Let's save that first group into a new dataframe, and keep playing with it.

In [None]:
year1952 = by_country.first()

In [None]:
type(year1952)

In [None]:
year1952.head()

In [None]:
year1952['pop'].min()

In [None]:
populations = year1952['pop'].values

In [None]:
year1952.plot.scatter(figsize=(12,8), 
                       x='gdpPercap', y='lifeExp', s=populations/60000, 
                       title='Life expectancy in the year 1952',
                       edgecolors="white")
pyplot.xscale('log');

Matplotlib [colormaps](https://matplotlib.org/examples/color/colormaps_reference.html) offer several options for _qualitative_ data, using discrete colors mapped to a sequence of numbers. We'd like to use the `Accent` colormap to code countries by continent. But we need a numeric code to assign to each continent, so it can be mapped to a color.

In [None]:
pandas.Categorical(year1952['continent'])

In [None]:
colors = pandas.Categorical(year1952['continent']).codes
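
To make the mapping from labels to numeric codes concrete, here is a small sketch on a toy series of continent names (an illustrative list, not the column from `year1952`).

```python
import pandas

# Toy series of continent labels
continents = pandas.Series(['Asia', 'Europe', 'Africa', 'Asia'])

cat = pandas.Categorical(continents)

# The distinct labels, sorted; each one gets an integer code by position
categories = list(cat.categories)

# One integer code per entry of the original series
codes = cat.codes.tolist()
```

Here `'Africa'` gets code 0, `'Asia'` code 1 and `'Europe'` code 2, so repeated labels share a code and can share a color.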

In [None]:
year1952.plot.scatter(figsize=(12,8), 
                         x='gdpPercap', y='lifeExp', s=populations/60000, 
                         c=colors, cmap='Accent',
                         title='Life expectancy in the year 1952',
                         logx=True,
                         ylim = (25,85),
                         edgecolors="white",
                         alpha=0.6);

In [None]:
fig = pyplot.figure(figsize=(12,8))
axis = fig.add_subplot(1,1,1)

axis.spines["top"].set_visible(False)       
axis.spines["right"].set_visible(False)    
axis.spines["left"].set_visible(False) 

axis.set_title('Life expectancy in the years 1952–2007, across 142 countries')

for key, group in by_country:
    axis.plot(group['year'], group['lifeExp'], alpha=0.4)

Something catastrophic happened to one country in 1977, and to another country in 1992.
Let's investigate.


In [None]:
type(by_year.get_group(1977))

In [None]:
type(by_year['lifeExp'].get_group(1977))

We can find the minimum value of the life expectancy at the specific years of interest.

In [None]:
min_lifeExp1977 = by_year['lifeExp'].get_group(1977).min()
min_lifeExp1977

In [None]:
min_lifeExp1992 = by_year['lifeExp'].get_group(1992).min()
min_lifeExp1992

Those values of life expectancy are just terrible. We'd like to know, of course, what countries experienced the dramatic drops in life expectancy.

In [None]:
life_expect[life_expect['lifeExp'] == min_lifeExp1977].index[0]

In [None]:
life_expect['country'][221]

In [None]:
life_expect[life_expect['country'] == 'Cambodia']
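
The lookup above compares floating-point values for equality and then types in the row label by hand. A possible alternative, sketched here on a toy dataframe with made-up values, is `idxmin()`, which returns the row label of the minimum directly.

```python
import pandas

# Toy dataframe with made-up values
toy = pandas.DataFrame({'country': ['A', 'A', 'B', 'B'],
                        'year': [1972, 1977, 1972, 1977],
                        'lifeExp': [40.0, 31.0, 35.0, 56.5]})

# Restrict to one year, then ask for the row label of the smallest value
in_1977 = toy[toy['year'] == 1977]
row_label = in_1977['lifeExp'].idxmin()
country = toy.loc[row_label, 'country']

# The same idea for every year at once: one row label per year-group
idx_per_year = toy.groupby('year')['lifeExp'].idxmin()
lowest_per_year = toy.loc[idx_per_year, ['year', 'country', 'lifeExp']]
```

If several rows tie for the minimum, `idxmin()` reports only the first one.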

We searched online to learn what was happening in Cambodia to cause such a drop in life expectancy in the 1970s. Indeed, Cambodia experienced a _mortality crisis_ due to several factors that combined into a perfect storm: war, ethnic cleansing and migration, collapse of the health system, and cruel famine [1].
It's hard for a country to keep vital statistics under such circumstances, and certainly there are uncertainties in the data for Cambodia in the 1970s.
However, various sources report a life expectancy there in 1977 that was _under 20 years_.
See, for example, the World Bank's interactive web page on [Cambodia](https://data.worldbank.org/country/cambodia).

There is something strange with the data from The Python Graph Gallery. Is it wrong?
Maybe they are giving us _average_ life expectancy in a five-year period.
Let's look at the other dip in life expectancy, in 1992.

In [None]:
life_expect[life_expect['lifeExp'] == min_lifeExp1992].index[0]

In [None]:
life_expect['country'][1292]

In [None]:
life_expect[life_expect['country'] == 'Rwanda']

The World Bank's interactive web page on [Rwanda](https://data.worldbank.org/country/rwanda) gives a life expectancy of 28.1 in 1992, and even lower in 1993, at 27.6 years. 
This doesn't match the value from the data set we sourced from The Python Graph Gallery, which gives 23.6. Since this value is _lower_ than the minimum value given by the World Bank, we conclude that the discrepancy is not caused by 5-year averaging.
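
The reasoning rests on a simple fact: an average can never fall below the smallest value being averaged. The sketch below spells this out using only the numbers quoted in the text. It assumes, as the text states, that 27.6 is the lowest annual World Bank value; values for other years are not reproduced here.

```python
# Annual World Bank values for Rwanda, as quoted in the text
world_bank = {1992: 28.1, 1993: 27.6}

# 1992 value in the data set from The Python Graph Gallery, as quoted in the text
gallery_1992 = 23.6

lowest_annual = min(world_bank.values())
mean_annual = sum(world_bank.values()) / len(world_bank)

# A mean lies between the smallest and largest values averaged, so averaging
# could only explain the gallery value if it were at least lowest_annual
averaging_could_explain = gallery_1992 >= lowest_annual
print(lowest_annual, mean_annual, averaging_could_explain)
```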

In [None]:
# Find the smallest population reported in each year
for y in life_expect.year.unique():
    frame = life_expect[life_expect.year == y]
    minpop = frame['pop'].min()
    print(y, minpop)
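
The loop over years can also be written with `groupby()`, in the spirit of the earlier sections. This is a sketch on a toy dataframe with made-up values; it compares the loop with the grouped version.

```python
import pandas

# Toy dataframe with made-up values
toy = pandas.DataFrame({'year': [1952, 1952, 1957, 1957],
                        'country': ['A', 'B', 'A', 'B'],
                        'pop': [100, 2500, 120, 2700]})

# Loop version: filter the rows of each year, then take the minimum
min_pop_loop = {}
for y in toy['year'].unique():
    frame = toy[toy['year'] == y]
    min_pop_loop[int(y)] = int(frame['pop'].min())

# Grouped version: one minimum per year-group, returned as a Series
min_pop_grouped = toy.groupby('year')['pop'].min()
```

Both give the same numbers; the grouped version returns them indexed by year, ready for plotting or further selection.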

In [None]:
url = 'http://docs.google.com/spreadsheet/pub?key=phAwcNAVuyj2tPLxKvvnNPA&output=xlsx'
life_expect2 = pandas.read_excel(url)

In [None]:
# drop the columns for years 1800 to 1949
dropyears = list(range(1800,1950))
life_expect2 = life_expect2.drop(dropyears, axis=1)

In [None]:
life_expect2.shape

In [None]:
lifeExp2_clean = life_expect2.dropna()

In [None]:
lifeExp2_clean.shape
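
To see what the two cleaning steps do to a wide table (one column per year), here is a sketch on a toy dataframe. The values and the `country` column name are made up for illustration and do not describe the actual spreadsheet.

```python
import numpy
import pandas

# Toy wide table with made-up values: one row per country, one column per year
wide = pandas.DataFrame({'country': ['A', 'B', 'C'],
                         1949: [30.0, 40.0, 50.0],
                         1950: [31.0, numpy.nan, 51.0],
                         1951: [32.0, 42.0, 52.0]})

# Drop unwanted year columns by label; axis=1 means "columns"
trimmed = wide.drop([1949], axis=1)

# Drop every row that still has at least one missing entry
clean = trimmed.dropna()
```

Comparing `trimmed.shape` with `clean.shape` shows how many rows were lost to missing data, which is what the two `shape` cells above report for the real table.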

## References

1. US National Research Council Roundtable on the Demography of Forced Migration; H.E. Reed, C.B. Keely, editors.  Forced Migration & Mortality (2001), National Academies Press, Washington DC; Chapter 5: The Demographic Analysis of Mortality Crises: The Case of Cambodia, 1970-1979, Patrick Heuveline. Available at: https://www.ncbi.nlm.nih.gov/books/NBK223346/

In [None]:
# Execute this cell to load the notebook's style sheet, then ignore it
from IPython.display import HTML
css_file = '../../style/custom.css'
HTML(open(css_file, "r").read())