### GESIS Fall Seminar in Computational Social Science 2022
### Introduction to Computational Social Science with Python
# Day 5-2: Plotting Data with Matplotlib and Seaborn

## Overview

* Basic plotting in Python
* Pyplot vs the object-oriented approach
* Customising plots and figures
* Attractive plots with Seaborn

### PSA: The focus of this notebook is on learning the basic commands for plotting in Python. As such, many of the figures in this notebook do not strictly follow many of the guidelines from the previous notebook. Think about how you would improve the figures yourself!

## Basic plotting in Python
* `matplotlib` is the core library for plotting figures in Python.
* Lots of libraries build on matplotlib for domain-specific plotting.
* Comprehensive documentation and tutorials [here](https://matplotlib.org/).

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
# Plot a simple figure
plt.figure() # Create a figure

plt.plot([1,2,4,8,9,13], [0,5,6,7,10,5]) # plot data

plt.show() # show figure

In [None]:
# Add some commands to label the plot

plt.figure()
plt.plot([1,2,4,8,9,13], [0,5,6,7,10,5])

plt.title('This is a title') # add a title and labels
plt.xlabel('This is a x axis label')
plt.ylabel('This is a y axis label')
plt.text(6, 4, 'This is text at point (6,4)')

plt.show()

In [None]:
# Plotting labelled data

data = {'age': [10, 13, 16, 18, 21, 25],
        'height': [140, 150, 160, 167, 168, 168]}

plt.figure()
plt.scatter('age', 'height', data=data) # refer to the values by their keys in `data`
plt.show()

# Equivalent plot
plt.figure()
plt.scatter(data['age'], data['height']) # pass the values directly
plt.show()

In [None]:
# More complex plot

plt.figure(figsize = (10,5)) # specify figure size

plt.plot([1,2,4,8,9,13], [0,5,6,7,10,5], label = 'line1') # add a label

plt.plot([2,3,4,7,10,12], [1,4,5,9,4,2], label = 'line2') # add more plots to same figure
plt.scatter([2, 5, 9], [9, 5, 7], label = 'scatter')
plt.hist([1,1,3,4,5,6,3,3,5,3,7,7,2,12,1,4,9,8,7,9,8,10], label = 'hist')

plt.text(6, 4, 'This is text at point (6,4)') # add text

plt.title('This is a title') # add title and labels
plt.xlabel('This is a x axis label')
plt.ylabel('This is a y axis label')

plt.xlim(1,20) # change axis limits
plt.ylim(-1,11)

plt.xscale('log') # change x axis scale
plt.yticks([0,5,10], [0,'five',10]) # change y ticks

plt.legend() # show the legend

plt.savefig('demofig.png') # png for fixed resolution image
plt.savefig('demofig.pdf') # pdf for vector graphics (high quality, greater compatibility)
plt.savefig('demofig.svg') # svg for vector graphics (high quality, more natural image format, smaller filesize)

plt.show()
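
Two optional arguments of `plt.savefig()` are often useful when exporting figures: `dpi` and `bbox_inches`. The sketch below is only an illustration; the file name `demofig_tight.png` and the choice of the temporary directory are assumptions, not part of the course material.

```python
import os
import tempfile
import matplotlib.pyplot as plt

plt.figure(figsize=(4, 3))
plt.plot([1, 2, 4, 8, 9, 13], [0, 5, 6, 7, 10, 5])
plt.xlabel('x')
plt.ylabel('y')

# Hypothetical output location: a file in the system's temporary directory
out_path = os.path.join(tempfile.gettempdir(), 'demofig_tight.png')

# dpi sets the resolution of raster formats such as png;
# bbox_inches='tight' trims surplus whitespace around the plot
plt.savefig(out_path, dpi=300, bbox_inches='tight')

plt.show()
```

Call `plt.savefig()` before `plt.show()`, since some backends discard the figure once it has been shown.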

### Frequently used pyplot commands

| Command | Meaning | Argument(s) |
| -------- | ------- | ----- |
| [`plt.figure()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html) | Create a figure | see docs |
| [`plt.show()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html) | Display figure | see docs |
| [`plt.title()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html)  | Set a title for Axes | string |
| [`plt.text()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.text.html)     | Add text to Axes | (x, y, string) |
| [`plt.legend()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html) | Create legend | see docs |
| [`plt.colorbar()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.colorbar.html)     | Create colorbar | see docs |
| [`plt.xlabel()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlabel.html), [`plt.ylabel()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylabel.html)       | Label x/y axis | string |
| [`plt.xlim()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlim.html), [`plt.ylim()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylim.html)     | Set x/y axis limits | (min, max) |
| [`plt.xscale()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xscale.html), [`plt.yscale()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.yscale.html)      | Set x/y axis scale | {"linear", "log", "symlog", "logit", ...} |
| [`plt.xticks()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xticks.html), [`plt.yticks()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.yticks.html)      | Set x/y axis tick locations + labels |  array  of floats, array of strings |
| [`plt.savefig()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html) | Save figure | string filename |

### Key elements in matplotlib
* `figure`: The overall figure that contains one or more plots.
* `axes`: An individual plot within a figure (this is the set of x, y (sometimes z) axes together)

![Figure anatomy](figs/figanatomy.png "Figure anatomy")

See the example below for how to handle subplots with `plt.subplot()`:

In [None]:
plt.figure(figsize = (10,5)) # specify figure size

plt.subplot(1, 2, 1) # Create subplots with 1 row, 2 columns, and plot on axes 1

plt.plot([1,2,4,8,9,13], [0,5,6,7,10,5]) # Usual plotting

plt.title('This is a title for axes 1') # separate labels for axes 1
plt.xlabel('This is a x axis label for axes 1')
plt.ylabel('This is a y axis label for axes 1')

plt.subplot(1, 2, 2)  # address subplots with 1 row, 2 columns, and plot on axes 2

plt.scatter([2,3,4,7,10,12], [1,4,5,9,4,2]) # Usual plotting
plt.plot([1,2,5,8,9,14], [2,4,2,7,8,5])

plt.title('This is a title for axes 2') # separate labels for axes 2
plt.xlabel('This is a x axis label for axes 2')
plt.ylabel('This is a y axis label for axes 2')

plt.suptitle('This is a super title') # add a 'super title'
plt.show()
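
Behind these pyplot calls, matplotlib keeps track of a "current" figure and "current" axes, and every `plt.` command acts on them. A small sketch of how to inspect them with `plt.gcf()` and `plt.gca()` (the variable names are illustrative):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 3))
plt.subplot(1, 2, 1)
plt.plot([1, 2, 3], [1, 4, 9])
plt.subplot(1, 2, 2)
plt.plot([1, 2, 3], [9, 4, 1])

fig = plt.gcf()         # the current Figure
current_ax = plt.gca()  # the current Axes (the subplot addressed last)
n_axes = len(fig.axes)  # a Figure keeps a list of all the Axes it contains

print(type(fig).__name__, n_axes)
plt.show()
```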

## 🏋️‍♀️ PRACTICE

In [None]:
# Q1: Read the file pageviews_2022.h5. Use pyplot to plot the data of Kanye West and Taylor Swift in a single plot.

df = pd.read_hdf('data/pageviews_2022.h5')
df['date'] = pd.to_datetime(df['date'])
plt.figure(figsize=(10,6))
plt.plot(df['date'], df[['Taylor Swift', 'Kanye West']], label=['Taylor Swift', 'Kanye West'])
plt.legend()
plt.ylabel('Wikipedia Page Views')
plt.xlabel('Date')
plt.show()

In [None]:
# Q2: Create a horizontal barplot (plt.barh) of the data from Europe & Central Asia countries in broadband2020.csv
# Default commands will show the data, but probably in an uninterpretable way.
# Consider how to improve this (e.g. with fig dimensions, font sizes, ordering, etc.)

df = pd.read_csv('data/broadband2020.csv', index_col=0)
df = df[df['Region']=='Europe & Central Asia']

s = df.sort_values('Broadband').dropna()

plt.figure(figsize=(5, 12))
plt.barh(y=s['Country Name'], width=s['Broadband'])
plt.margins(y=0.01)
plt.title('Broadband Internet Penetration in Europe & Central Asia')
plt.xlabel('Fixed Broadband subscriptions per 100 people')
plt.xlim(0, 100)
plt.show()

In [None]:
# Q3: Read the file GDPxLife.csv
# Create a figure with 2x2 subplots
# Plot 1 should include a histogram of life expectancy
# Plot 2 should include a scatter plot of GDP per capita vs life expectancy
# Plot 3 should include a 'hist2d' plot of life expectancy vs GDP per capita
# Plot 4 should include a histogram of GDP per capita

gdplifex = pd.read_csv('data/GDPxLife.csv')

plt.figure()
plt.subplot(2,2,1)
plt.hist(gdplifex['Life expectancy'], bins=20)
plt.title('Life expectancy histogram')
plt.xlabel('Life Expectancy (years)')

plt.subplot(2,2,2)
plt.title('GDP per capita vs Life Expectancy')
plt.scatter(gdplifex['GDP per capita'], gdplifex['Life expectancy'])
# plt.xscale('log')
plt.xlabel('GDP per capita (US$)')
plt.ylabel('Life Expectancy (years)')

plt.subplot(2,2,3)
plt.title('GDP per capita vs Life Expectancy')
plt.hist2d(gdplifex['Life expectancy'], gdplifex['GDP per capita'], bins=20)
# plt.yscale('log')
plt.xlabel('Life Expectancy (years)')
plt.ylabel('GDP per capita (US$)')

plt.subplot(2,2,4)
plt.title('GDP per capita histogram')
plt.hist(gdplifex['GDP per capita'], bins=20)
# plt.xscale('log')
plt.xlabel('GDP per capita (US$)')

plt.suptitle('Distributions of GDP per capita and Life Expectancy')
plt.tight_layout()
plt.show()

## Pyplot vs the object-oriented approach
* Pyplot is useful for basic figures and handles most things automatically.
* Matplotlib can also be used in an _object oriented_ way.
* Figures and axes are objects that can be explicitly handled, rather than relying on pyplot magic.
* The object-oriented approach allows for much more complex and customisable figures.
* It is important to know the difference, especially when reading documentation or Stack Overflow answers, which often mix both styles.
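
To make the contrast concrete, here is a minimal side-by-side sketch of the same plot written in both styles. The data values are the toy numbers used throughout this notebook.

```python
import matplotlib.pyplot as plt

x = [1, 2, 4, 8, 9, 13]
y = [0, 5, 6, 7, 10, 5]

# Pyplot style: functions act on the implicit "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title('Pyplot style')
plt.xlabel('x')
pyplot_ax = plt.gca()  # grab the implicit Axes, only to compare below

# Object-oriented style: the same steps as methods on an explicit Axes
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Object-oriented style')
ax.set_xlabel('x')

plt.show()
```

Note the naming pattern: most pyplot functions such as `plt.title()` have an Axes method counterpart prefixed with `set_`, such as `ax.set_title()`.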

In [None]:
# plt.subplots() creates a figure and axes object
fig, ax = plt.subplots() # default nrows=1, ncols=1

ax.plot([1,2,4,8,9,13], [0,5,6,7,10,5]) # we plot our data on the axes

ax.set_title('This is a title') # add a title and labels
ax.set_xlabel('This is a x axis label')
ax.set_ylabel('This is a y axis label')
ax.text(6, 4, 'This is text at point (6,4)')

plt.show() # and show the figure

In [None]:
# Axes methods allow us to extract information from the plot too

print(ax.get_title())
print(ax.get_xlabel())

### Anatomy of a figure

![Figure anatomy](figs/anatomy.png "Figure anatomy")

In [None]:
# Let's try 2 subplots

fig, axs = plt.subplots(1, 2, figsize=(10,5)) # create a figure and sets of axes in 1x2 arrangement

axs[0].plot([1,2,4,8,9,13], [0,5,6,7,10,5]) # plot data on axs[0]
axs[0].set_title('This is a title for axes 1') # add a title and labels to 'axs[0]'
axs[0].set_xlabel('This is a x axis label')
axs[0].set_ylabel('This is a y axis label')
axs[0].text(6, 4, 'This is text at point (6,4)')


axs[1].scatter([2,3,4,7,10,12], [1,4,5,9,4,2]) # plot data on axs[1]
axs[1].plot([1,2,5,8,9,14], [2,4,2,7,8,5])
axs[1].set_title('This is a title for axes 2') # separate labels for 'axs[1]'
axs[1].set_xlabel('This is a x axis label for axes 2')
axs[1].set_ylabel('This is a y axis label for axes 2')

fig.suptitle('This is the figure suptitle') # add some figure labels
fig.supxlabel('This is the figure supxlabel')
fig.supylabel('This is the figure supylabel')

fig.tight_layout() # tidy the layout

plt.show()

In [None]:
# Let's try a 2x2 plot

# Generate some mathematical data
xrange = np.arange(-10, 10, 0.1)
sinx = np.sin(xrange)
cosx = np.cos(xrange)
sin2x = np.sin(xrange)**2
cos2x = np.cos(xrange)**2

# create a figure and sets of axes in 2x2 arrangement with shared x and y axes
fig, axs = plt.subplots(2, 2, figsize=(10,5), sharex='col', sharey='row')

axs[0, 0].plot(xrange, sinx) # plot data on axs[0, 0]
axs[0, 0].set_title(r'$sin(x)$') 

axs[0, 1].plot(xrange, sin2x) # plot data on axs[0, 1]
axs[0, 1].set_title(r'$sin^2(x)$') 

axs[1, 0].plot(xrange, cosx) # plot data on axs[1, 0]
axs[1, 0].set_title(r'$cos(x)$') 

axs[1, 1].plot(xrange, cos2x) # plot data on axs[1, 1]
axs[1, 1].set_title(r'$cos^2(x)$') 


fig.suptitle('Trigonometric functions') # add figure titles
fig.supxlabel(r'$x$')
fig.supylabel(r'$f(x)$')

fig.tight_layout() # tidy the layout


plt.show()
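
When there are many subplots, indexing each Axes by hand becomes repetitive. One possible alternative, sketched here with an illustrative choice of functions, is to loop over `axs.flat`:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-10, 10, 0.1)
# Illustrative selection of functions, one per subplot
functions = {'sin': np.sin, 'cos': np.cos, 'tanh': np.tanh, 'arctan': np.arctan}

fig, axs = plt.subplots(2, 2, figsize=(10, 5), sharex=True)

# axs is a 2x2 NumPy array of Axes; axs.flat iterates over it row by row
for ax, (name, func) in zip(axs.flat, functions.items()):
    ax.plot(x, func(x))
    ax.set_title(name)

fig.tight_layout()
plt.show()
```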

## 🏋️‍♀️ PRACTICE

In [None]:
# Q4: Reproduce the figure below using object oriented matplotlib

songstreams = pd.read_csv('data/songs_data.csv')

plt.figure()
plt.hist(songstreams['streams'], bins=20)
plt.title('Distribution of Top 200 Spotify Streamed Songs')
plt.xlabel('Number of Streams')
plt.ylabel('Count')
plt.show()


In [None]:
# Q5: Reproduce the figure below using pyplot

monkeypox = pd.read_csv('data/owid-monkeypox-data.csv')
monkeypox['date'] = pd.to_datetime(monkeypox['date'])

fig, ax = plt.subplots(figsize=(9,5))
for country in monkeypox.columns[1:]:
    ax.plot(monkeypox['date'], monkeypox[country], label=country)
    
ax.set_yscale('log')
ax.set_ylim(0.1, 1000)
ax.legend(loc=4)
ax.set_xlabel('Date')
ax.set_ylabel('Number of New Cases (7 Day Rolling Average)')
ax.set_title('Monkeypox Cases in DE, UK, US')

plt.show()

In [None]:
# Q6: Use object oriented matplotlib to reproduce the figure in the image below
# Hint: you may need to use the hist 'orientation' argument

gdplifex = pd.read_csv('data/GDPxLife.csv')

![gdplifex](figs/gdplifex.svg "gdplifex")

## Customising plots and figures
* We can specify many plot features such as color, size, shape, opacity, fonts to make our figures shine.
* We can also encode more dimensions of information in these features.

In [None]:
# We can set fixed values for various features 

plt.figure()
plt.scatter([2,3,4,7,10,12], [1,4,5,9,4,2], s=100, c='r', marker='+')
plt.plot([1,2,5,8,9,14], [2,4,2,7,8,5], ls=':', lw=5, marker='o', ms=10)
plt.hist([1,1,3,4,5,6,3,3,5,3,7,7,2,12,1,4,9,8,7,9,8,10], alpha=0.4)
plt.show()

In [None]:
# We can also encode information in the color, size, shape, opacity, etc... of points

data = {'age': [10, 13, 16, 18, 21, 25],
        'height': [140, 150, 160, 167, 168, 168],
        'weight': [32, 40, 80, 120, 102, 152],
        'score':[0, 6, 21, 9, 15, 25]}


# Encode 'score' as color and 'weight' as point size
plt.figure()
plt.scatter(data['age'], data['height'], c=data['score'], s=data['weight']) # custom color and size
plt.colorbar() # add colorbar
plt.show()
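
In the object-oriented style, the colorbar has to be told which plotted object it describes. A minimal sketch, assuming the same toy data as above:

```python
import matplotlib.pyplot as plt

data = {'age': [10, 13, 16, 18, 21, 25],
        'height': [140, 150, 160, 167, 168, 168],
        'weight': [32, 40, 80, 120, 102, 152],
        'score': [0, 6, 21, 9, 15, 25]}

fig, ax = plt.subplots()

# scatter returns the drawn collection, which carries the color mapping
points = ax.scatter(data['age'], data['height'], c=data['score'], s=data['weight'])

# The colorbar is created from that collection and attached to our Axes
cbar = fig.colorbar(points, ax=ax, label='score')

ax.set_xlabel('age')
ax.set_ylabel('height')
plt.show()
```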

### Frequently used plotting arguments



| Argument | Meaning | Value | [`plt.plot`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) | [`plt.scatter`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) | [`plt.hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) | [`plt.bar`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html) |
| -------- | ------- | ----- | ---------- | ------------- | ---------- | --------- |
| `alpha`  | opacity | float 0-1 | ✅ | ✅ | ✅ | ✅ | 
| `c`      | color (use `color` for `plt.hist` and `plt.bar`) | float / str / rgb / more... | ✅ | ✅ | ❌ | ❌ |
| `ls`     | linestyle | {'-', '--', '-.', ...} | ✅ | ❌ | ✅ | ✅ |
| `lw`     | linewidth | float | ✅ | ❌ | ✅ | ✅ |
| `marker` | marker style        | {'o', 'v', '^', ...} | ✅ | ✅ | ❌ | ❌ |
| `ms`     | marker size | float | ✅ | ❌ | ❌ | ❌ |
| `s`      | point size | float | ❌ | ✅ | ❌ | ❌ |
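
The short argument names in the table are aliases for longer, more readable names. A brief sketch showing that both spellings give the same result for a line plot:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Short aliases ...
short, = ax.plot([1, 2, 3], [1, 2, 3], c='r', ls='--', lw=2, marker='o', ms=8)

# ... and the equivalent long names
long_, = ax.plot([1, 2, 3], [3, 2, 1], color='r', linestyle='--', linewidth=2,
                 marker='o', markersize=8)

plt.show()
```

The long names are usually easier to read in shared code, while the aliases are convenient for quick exploration.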

We can also set a fixed theme for all plots in a document with `plt.style.use()`, e.g.:

In [None]:
plt.style.use('ggplot') # try bmh, fivethirtyeight, many more...

plt.figure(figsize = (10,5)) # specify figure size

plt.subplot(1, 2, 1) # Create subplots with 1 row, 2 columns, and plot on axes 1
plt.plot([1,2,4,8,9,13], [0,5,6,7,10,5])
plt.title('This is a title for axes 1') # separate labels for axes 1
plt.xlabel('This is a x axis label for axes 1')
plt.ylabel('This is a y axis label for axes 1')

plt.subplot(1, 2, 2)  # address subplots with 1 row, 2 columns, and plot on axes 2
plt.scatter([2,3,4,7,10,12], [1,4,5,9,4,2])
plt.plot([1,2,5,8,9,14], [2,4,2,7,8,5])
plt.title('This is a title for axes 2') # separate labels for axes 2
plt.xlabel('This is a x axis label for axes 2')
plt.ylabel('This is a y axis label for axes 2')

plt.suptitle('This is a super title') # add a 'super title'
plt.tight_layout()
plt.show()
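
`plt.style.use()` changes the style for every subsequent plot. If a style should apply to a single figure only, one option is the `plt.style.context()` context manager, sketched here (the variable names are illustrative):

```python
import matplotlib.pyplot as plt

print(plt.style.available[:5])  # a few of the installed style names

background_before = plt.rcParams['axes.facecolor']

# The style applies only inside the with-block
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 4, 8, 9, 13], [0, 5, 6, 7, 10, 5])
    ax.set_title('Styled only inside the with-block')
    plt.show()

# Outside the block the previous settings are restored
background_after = plt.rcParams['axes.facecolor']
```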

## 🏋️‍♀️ PRACTICE

In [None]:
# Q7: Reproduce as closely as you can the figure in the image below
# Data is from climatechange.csv



![Global temperature](figs/globaltemp.svg "Global temperature")

In [None]:
# Q8: Read the data in worldbankdata.h5.
# The dataframe contains yearly measures (1980-2020) of 6 population level measures:
# 'Access to electricity (% of population)',
# 'CO2 emissions (metric tons per capita)',
# 'Urban population (% of total population)',
# 'Population growth (annual %)',
# 'Renewable energy consumption (% of total final energy consumption)',
# 'Energy use (kg of oil equivalent per capita)'
#
# Plot whatever elements you see fit on a single Axes.
# Clearly not all information can be visualised in a single plot, so make choices about the
# year(s), country(s), indicator(s) you find interesting.


## Attractive plots with Seaborn
* Seaborn is a library that builds on matplotlib by automating the appearance and creation of important plot elements.
* Seaborn also tightly integrates with pandas.
* Some functionality in seaborn is similar to what you may have seen in ggplot2 for R.
* Again, lots of good documentation, examples, and tutorials [here](https://seaborn.pydata.org/).

Seaborn possesses both **figure**-level functions and **axes**-level functions.
* Figure-level functions automate a lot of the plotting, and can generate plots across several axes.
* Axes-level functions are like-for-like replacements for matplotlib functions like `hist`, `scatter`, `plot`.

![Figure axes](figs/figaxes.png "Figure axes")

In [None]:
# Let's try a few axes-level functions, used just like the matplotlib variants

import seaborn as sns
plt.style.use('default')

songstreams = pd.read_csv('data/songs_data.csv')

# Pyplot style
plt.figure()
sns.histplot(songstreams['streams']) # seaborn's version of plt.hist()
plt.show()

In [None]:
# Read and clean data
monkeypox = pd.read_csv('data/owid-monkeypox-data.csv')
monkeypox['date'] = pd.to_datetime(monkeypox['date'])
monkeypox = monkeypox.set_index('date')

# Object-oriented style plot

fig, ax = plt.subplots(figsize=(10,5))
sns.lineplot(data=monkeypox, ax=ax) # seaborn's version of plt.plot(), drawn on our Axes
plt.show()

In [None]:
# Seaborn handles labelled data quite neatly.
# Seaborn likes using long-form data from pandas, so let's reshape
monkeypox_long = pd.melt(monkeypox.reset_index(), value_vars=['Germany', 'United Kingdom', 'United States'],
                        id_vars='date', var_name='Country', value_name='Rolling case number')

display(monkeypox_long)

# Object-oriented style plot
fig, ax = plt.subplots(figsize=(10,5))
sns.lineplot(data=monkeypox_long, x='date', y='Rolling case number', hue='Country', ax=ax) # plotting labelled data
ax.set_title('We can set titles, labels, etc just like in matplotlib')
plt.show()
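
To see what the wide-to-long reshaping does, here is a tiny sketch with made-up numbers (not taken from the monkeypox dataset):

```python
import pandas as pd

# Wide form: one column per country (illustrative values)
wide = pd.DataFrame({'date': pd.to_datetime(['2022-07-01', '2022-07-02']),
                     'Germany': [10, 12],
                     'United States': [20, 25]})

# Long form: one row per (date, country) observation
long = pd.melt(wide, id_vars='date', var_name='Country', value_name='Cases')

print(long)
```

Each country column becomes a block of rows, with the former column name stored in `Country`, which is the shape that arguments such as `hue` expect.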

In [None]:
# Now let's look at some figure-level functions

# Load the GDP / life expectancy data
gdplifex = pd.read_csv('data/GDPxLife.csv')

# figure level functions create the whole figure and any necessary axes
sns.jointplot(data=gdplifex, x='GDP per capita', y='Life expectancy')
plt.show()

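Many of Seaborn's figure-level functions (e.g. `relplot`, `displot`, `catplot`, `lmplot`) return a `FacetGrid` object, which bundles the matplotlib Figure object, Axes objects, and some Seaborn-specific methods. Others return related grid objects: `jointplot` returns a `JointGrid` and `pairplot` returns a `PairGrid`.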
This is important to understand when integrating Seaborn with object-oriented matplotlib.

In [None]:
# Load a dataset on taxi journeys (lots of interesting columns!)
taxis = sns.load_dataset("taxis")
display(taxis)

In [None]:
# Figure-level function in object-oriented style

# Long-form, labelled data: let's vary the data by color (hue) and by column (col)
g = sns.relplot(data=taxis, x='distance', y='total', hue='color', col='payment',
            s=5, alpha=0.3, palette={'green':'g', 'yellow':'y'})

g.set_titles(col_template="{col_name} payment") # Use Seaborn to set Axes titles across columns
g.set_axis_labels("Journey distance (miles)", "Journey total price ($)") # Use Seaborn to set axis labels

g.fig.suptitle("My super title", y=1.05) # Access the matplotlib fig object in the FacetGrid and add suptitle
g.legend.set(title="Taxi Color") # Access the matplotlib legend object in the FacetGrid and set title

g.axes[0, 0].text(15, 125, 'Text demo') # Access the first row first col matplotlib Axes and add some text

plt.show()

In [None]:
# lmplot is a cool function that automatically plots a linear regression fit line with your data

g = sns.lmplot(data=taxis, x='distance', y='total', hue='color', col='payment',
               palette={'green':'g', 'yellow':'y'}, scatter_kws={"s": 2, "alpha":0.3})

g.set_titles(col_template="{col_name} payment") # Use Seaborn to set Axes titles across columns
g.set_axis_labels("Journey distance (miles)", "Journey total price ($)") # Use Seaborn to set axis labels

g.fig.suptitle("My super title", y=1.05) # Access the matplotlib fig object in the FacetGrid and add suptitle
g.legend.set(title="Taxi Color") # Access the matplotlib legend object in the FacetGrid and set title


plt.show()

## 🏋️‍♀️ PRACTICE

In [None]:
# Q9: Use a **figure-level** plot from seaborn to plot data of choice from the NY taxis dataset
# Feel free to transform the data beforehand (e.g., creating a new column, aggregating)
# (relplot, displot, catplot, lmplot, pairplot, jointplot...)
