# Healy chapter 3 figures (2023-08-15)

_by A. Maurits van der Veen_  

_Modification history:_  
_2022-08-28 - Initial version_  
_2022-09-02 - Clean-up_  
_2023-08-15 - Minor updates_  

This notebook provides Python code that parallels the R code in Healy's Data Visualization book. 

It uses the `plotnine` module, which replicates most of R's ggplot.


## 0. General preparation

The code relies on several Python modules, which may need to be installed first. To do so, uncomment and run the next code cell.

- `matplotlib` is the underlying plotting library
- `plotnine` is the main module replicating ggplot
- `mizani` provides axis label formatting (it should be installed automatically along with plotnine)
- `pyreadr` reads R-format datasets
- `scikit-misc` is needed for the `loess` smoothing method used in several figures


In [None]:
# !pip install matplotlib plotnine pyreadr scikit-misc


In [None]:
import math
import numpy as np

import pandas as pd
# import geopandas as gpd  # Not necessary until chapter 7

import matplotlib.pyplot as plt
%matplotlib inline

from plotnine import *  # alternative: import plotnine as p9 and always use prefix
from mizani.formatters import currency_format

import pyreadr


## Chapter 3 - Make a plot

The headings below correspond to chapter sections in the book. If a heading number is skipped (as is the case for 3.1 and 3.2 here), that is because there are no figures in those sections.

In [None]:
# Retrieve & load gapminder data

localfolder = '/Users/yourname/Downloads/'  # Change to local path

remotefolder = 'https://github.com/jennybc/gapminder/blob/main/data/'

targetfile = 'gapminder.rdata'
pyreadr.download_file(remotefolder + targetfile + '?raw=true', 
                      localfolder + targetfile)
newdata = pyreadr.read_r(localfolder + targetfile)
gapminder = newdata['gapminder']
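
The `?raw=true` suffix in the cell above matters: without it, GitHub serves the HTML page for the file rather than the file itself. As a minimal sketch of how the two locations are put together, the helper names below (`github_raw_url`, `local_target`) are hypothetical and not part of `pyreadr`; only the folder URL and file name come from the cell above.

```python
from pathlib import Path


def github_raw_url(folder_url, filename):
    """Join a GitHub 'blob' folder URL and a file name, requesting the raw file."""
    return folder_url.rstrip('/') + '/' + filename + '?raw=true'


def local_target(folder, filename):
    """Build the local path the downloaded file should be written to."""
    return Path(folder) / filename


remotefolder = 'https://github.com/jennybc/gapminder/blob/main/data/'
targetfile = 'gapminder.rdata'

url = github_raw_url(remotefolder, targetfile)
target = local_target('/Users/yourname/Downloads', targetfile)  # placeholder folder

print(url)
print(target.name)
```

Using `pathlib` for the local path avoids having to remember the trailing slash on `localfolder`.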


In [None]:
gapminder.head()

### 3.3 Mappings link data to things you see

In [None]:
# Figure 3.3 -- don't panic, this is supposed to be an empty chart!

p = ggplot(data = gapminder,
           mapping = aes(x = 'gdpPercap', y = 'lifeExp'))
p

### 3.4 Build your plots layer by layer

In [None]:
# Figure 3.4

p + geom_point()

In [None]:
# Figure 3.5 -- note that plotnine uses a different default method

p + geom_smooth()

In [None]:
# Figure 3.6, but now specifying the method we're interested in
# Note: this requires the package scikit-misc

p + geom_point() + geom_smooth(method = 'loess')


In [None]:
# Figure 3.7, produced using Healy's code from figure 3.6 
# (again, because R's ggplot and plotnine use a different default method)

p + geom_point() + geom_smooth()

In [None]:
# Figure 3.8 (slightly different method)

p + geom_point() + geom_smooth(method = 'glm') + scale_x_log10()

In [None]:
# Figure 3.9 (slightly different method): addition of dollar signs to x axis

# Note that the code for the dollar signs is slightly different
# Ending a line with \ (after the plus sign) lets the expression continue on the next line

p + geom_point() +\
    geom_smooth(method = 'glm') +\
    scale_x_log10(labels = currency_format(digits=0, big_mark=','))
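
To make the effect of the `labels` argument concrete, here is a plain-Python stand-in for the kind of text such a formatter produces. The function `dollar_label` is a hypothetical illustration, not mizani's implementation, and the real formatter may differ in details such as rounding or negative values.

```python
def dollar_label(value, digits=0, big_mark=','):
    """Format one number as a dollar amount with a thousands separator."""
    text = f'{value:,.{digits}f}'          # ',' asks Python for thousands separators
    return '$' + text.replace(',', big_mark)


# Typical break points on a log10 axis
breaks = [1000, 10000, 100000]
labels = [dollar_label(b) for b in breaks]
print(labels)
```

The scale calls the formatter on its break points, so only the tick labels change; the data and the fitted line stay the same.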

### 3.5 Mapping aesthetics vs. setting them

In [None]:
# Attempt at figure 3.10 -- note: this causes an error in plotnine (which is fine!)
# The problem is that 'purple' is not a variable in the dataframe

# In ggplot, in contrast, although the code is not correct, 
# a default interpretation ensures that something does get plotted

p = ggplot(data = gapminder,
           mapping = aes(x = 'gdpPercap', y = 'lifeExp', color = 'purple'))

p + geom_point() +\
    geom_smooth(method = 'loess') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=','))


In [None]:
# Figure 3.11

p = ggplot(data = gapminder,
           mapping = aes(x = 'gdpPercap', y = 'lifeExp'))

p + geom_point(color='purple') + \
    geom_smooth(method = 'loess') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=','))
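
The difference between mapping and setting can be sketched with a toy model. This is not how plotnine works internally, and the function `resolve_color` and the three-row dataset are made up for illustration. A mapped aesthetic is looked up in the data, so it must name something that exists there; a set aesthetic is a constant applied to every row.

```python
def resolve_color(columns, mapped=None, fixed=None):
    """Return one color entry per row of a column-oriented toy dataset.

    mapped: a column name, looked up in the data (roughly what aes() does)
    fixed:  a constant, repeated for every row (like geom_point(color=...))
    """
    n_rows = len(next(iter(columns.values())))
    if mapped is not None:
        if mapped not in columns:
            raise ValueError(f"'{mapped}' is not a column in the data")
        return list(columns[mapped])
    return [fixed] * n_rows


toy = {'continent': ['Asia', 'Europe', 'Africa']}  # made-up example rows

mapped_colors = resolve_color(toy, mapped='continent')  # varies by row
set_colors = resolve_color(toy, fixed='purple')         # same for all rows

try:
    resolve_color(toy, mapped='purple')                 # the figure 3.10 mistake
    error_message = ''
except ValueError as err:
    error_message = str(err)

print(mapped_colors, set_colors, error_message)
```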

In [None]:
# Figure 3.12 -- Note that the size parameter in plotnine is calibrated differently!!

p + geom_point(alpha=0.3) + \
    geom_smooth(color='orange', se=False, size=8, method='lm') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=','))

In [None]:
# Figure 3.12 again, with a different size parameter

p + geom_point(alpha=0.3) + \
    geom_smooth(color='orange', se=False, size=2, method='lm') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=','))
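
In ggplot2, scale transformations are applied before statistics are computed, so with `scale_x_log10()` a linear smoother is fitted against log10 of x rather than x itself; the sketch below assumes plotnine follows the same order. It fits an ordinary least-squares line by hand on made-up numbers (not gapminder values), and `fit_line` is a hypothetical helper.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up illustrative data: income rises tenfold at each step
gdp = [100, 1000, 10000, 100000]
life = [45, 55, 65, 75]

log_gdp = [math.log10(g) for g in gdp]       # what the log scale does to x
slope, intercept = fit_line(log_gdp, life)   # fit on the transformed values

print(round(slope, 3), round(intercept, 3))
```

The slope is then read as the change in life expectancy per tenfold increase in GDP per capita, which is why the line looks straight on the log axis.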

In [None]:
# An attempt at figure 3.13
# This fails in plotnine versions that do not handle subtitles (newer releases may accept them)
# matplotlib does allow them, so if we want it is easy to add them at the end using plt
# (see below)

try:
    p + geom_point(alpha=0.3) + \
        geom_smooth(se=False, size=2, method='lm') + \
        scale_x_log10(labels = currency_format(digits=0, big_mark=',')) + \
        labs(x = 'GDP per capita',
             y = 'Life expectancy in years',
             title = 'Economic growth and life expectancy',
             subtitle = 'Data points are country-years',
             caption = 'Source: Gapminder')
except Exception as e:
    print(e)

In [None]:
# Figure 3.13 -- Instead of a subtitle, simply add a second line to the title
# (by inserting '\n')

p + geom_point(alpha=0.3) + \
    geom_smooth(se=False, size=2, method='lm') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=',')) + \
    labs(x = 'GDP per capita',
         y = 'Life expectancy in years',
         title = 'Economic growth and life expectancy\nData points are country-years',
         caption = 'Source: Gapminder')


In [None]:
# Figure 3.13 -- Extract into matplotlib,
# then add title (suptitle) and subtitle (title), and adjust fontsize of the latter
# This can easily be adjusted further
# Optimal positioning may take some playing around with y values

p_object = \
p + geom_point(alpha=0.3) + \
    geom_smooth(se=False, size=2, method='lm') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=',')) + \
    labs(x = 'GDP per capita',
         y = 'Life expectancy in years',
         caption = 'Source: Gapminder')

fig = p_object.draw() # get the matplotlib figure object
ax = fig.axes[0] # get the first matplotlib Axes object (there may be more than one if faceted)

fig.suptitle("Economic growth and life expectancy", y=1)
ax.set_title('Data points are country-years', y=0.95, fontsize=10)

fig

Healy drops the labels from the next few figures, but there is no reason to.

In [None]:
# Figure 3.14

p2 = ggplot(data = gapminder,
            mapping = aes(x = 'gdpPercap', y = 'lifeExp', 
                          color = 'continent'))

p2 + geom_point(alpha=0.3) + \
    geom_smooth(method='loess') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=',')) + \
    labs(x = 'GDP per capita',
         y = 'Life expectancy in years',
         title = 'Economic growth and life expectancy\nData points are country-years',
         caption = 'Source: Gapminder')


In [None]:
# Figure 3.15 -- adding a "fill" aesthetic to fill the error ribbon
# Note the difference in legend compared to figure 3.14

p3 = ggplot(data = gapminder,
            mapping = aes(x = 'gdpPercap', y = 'lifeExp', 
                          color = 'continent', fill = 'continent'))

p3 + geom_point(alpha=0.3) + \
    geom_smooth(method='loess') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=',')) + \
    labs(x = 'GDP per capita',
         y = 'Life expectancy in years',
         title = 'Economic growth and life expectancy\nData points are country-years',
         caption = 'Source: Gapminder')


In [None]:
# Figure 3.16 -- commenting out the labels in order to show the difference, and removing continent fill
# Note the difference in legend compared to figures 3.14 & 3.15 (back to dots only)

p + geom_point(mapping = aes(color = 'continent')) + \
    geom_smooth(method='loess') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=',')) # + \
#     labs(x = 'GDP per capita',
#          y = 'Life expectancy in years',
#          title = 'Economic growth and life expectancy\nData points are country-years',
#          caption = 'Source: Gapminder')


In plotnine it is less straightforward to define variable transformations on the fly than it is in R's ggplot.  
Here we define a logged variable ahead of time.

In [None]:
gapminder['logpop'] = np.log(gapminder['pop'])  # natural log, computed on the whole column at once

In [None]:
# Figure 3.17 -- note that plotnine uses a different default color scale

p + geom_point(mapping = aes(color = 'logpop')) + \
    geom_smooth(method = 'loess') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=',')) # + \
#     labs(x = 'GDP per capita',
#          y = 'Life expectancy in years',
#          title = 'Economic growth and life expectancy\nData points are country-years',
#          caption = 'Source: Gapminder')


### 3.7 Save your work


In [None]:
# Save a figure as a variable, rather than displaying it

myfigure = \
p + geom_point(mapping = aes(color = 'logpop')) + \
    geom_smooth(method='loess') + \
    scale_x_log10(labels = currency_format(digits=0, big_mark=',')) + \
    labs(x = 'GDP per capita',
         y = 'Life expectancy in years',
         title = 'Economic growth and life expectancy\nData points are country-years',
         caption = 'Source: Gapminder')


In [None]:
# To set the size of a figure, use the theme option

myfigure + theme(figure_size = (8, 3))

In [None]:
# To save a figure, simply call the 'save' method.
# The file extension determines the format in which it is saved

myfigure.save(filename = 'myfigure.png')

In [None]:
myfigure.save(filename = 'myfigure.pdf')

In [None]:
# You can of course specify the full pathname
import os

figurefolder = localfolder + 'Figures/'
# Create the folder and any necessary parent folders;
# exist_ok=True avoids an error when the cell is run again
os.makedirs(figurefolder, exist_ok=True)

In [None]:
# It is also possible to adjust figure size at this point

myfigure.save(filename = figurefolder + 'myfigure.jpg', 
              width=8, height=10, units='in')
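
As a hedged alternative to building paths by string concatenation, the sketch below uses `pathlib`. The helper `figure_path` is hypothetical, and a temporary directory stands in for `localfolder` so the example can run anywhere.

```python
import tempfile
from pathlib import Path


def figure_path(base_folder, name, extension):
    """Return <base_folder>/Figures/<name>.<extension>, creating the folder if needed."""
    folder = Path(base_folder) / 'Figures'
    folder.mkdir(parents=True, exist_ok=True)   # safe to call repeatedly
    return folder / f'{name}.{extension}'


with tempfile.TemporaryDirectory() as tmp:      # stand-in for localfolder
    first = figure_path(tmp, 'myfigure', 'png')
    second = figure_path(tmp, 'myfigure', 'pdf')  # folder already exists: no error
    folder_exists = first.parent.is_dir()
    suffixes = [first.suffix, second.suffix]

print(folder_exists, suffixes)
```

The resulting path, converted with `str(...)`, could then be passed as the `filename` argument of `myfigure.save`.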

### This completes the figures for chapter 3