# Data Visualization in Python

Can Şerif Mekik

PhD Candidate <br/>
Department of Cognitive Science <br/>
Rensselaer Polytechnic Institute

March 24, 2022

<table align="left">
<tr>
<td><img src=https://github.com/cmekik/CDSI_DViP/blob/main/CDSI_Fac.of.Sc_logo.png?raw=1 alt="CDSI Logo" width="300"/></td>
<td><img src=https://github.com/cmekik/CDSI_DViP/blob/main/mcgill_ccr_approval_croppedforblock_0.png?raw=1 alt="CCR Approved Logo" width="300"/></td>
</tr>
</table>

## Introductory Remarks

This workshop assumes minimal working knowledge of Python. 

We will learn the basics of using the `matplotlib` library for producing beautiful data visualizations.

Matplotlib is the standard plotting package in Python. It is very flexible, capable of creating basic 2D plots, 3D plots, and even animations.

Working knowledge of the Pandas package is an asset, although it is not required. 

We will use `pandas` to prepare our data for plotting.

This workshop is heavily inspired by Ben Root's [Anatomy of Matplotlib](https://github.com/matplotlib/AnatomyOfMatplotlib/tree/master/).

### Useful Resources

The [Matplotlib Cheatsheet](https://matplotlib.org/cheatsheets/cheatsheets.pdf) is an excellent two-page summary of essential `matplotlib` features.

The [Official Matplotlib Docs](https://matplotlib.org/stable/index.html) are the single best resource for information short of the source code. They contain tutorials, reference documentation, and more.

The [Example Gallery](https://matplotlib.org/stable/gallery/index.html) is particularly helpful when you have something specific in mind.

### Contents

1. Setup
2. Basic Concepts
3. Essential Plotting Methods
4. Controlling Plot Appearance
5. More Advanced Plotting: Grouping, Multiple Subplots & Faceting  
6. Conclusion

## Setup

We will create and adjust various graphs using the matplotlib library.

To follow the workshop on your own machine, you should have Anaconda already installed.

https://www.anaconda.com/products/individual

This will automatically include the necessary dependencies.

Our data set is a subset of Semra Sevi's Canadian Federal Elections dataset.

You can find copies of the dataset and code at the following address.

https://github.com/cmekik/CDSI_DViP

### Getting Ready to Code

`Jupyter` is a Python tool for rich interactive coding that ships with Anaconda.

In fact, this presentation is itself a `Jupyter` notebook!

To get set, create a new folder in which you will work and copy the materials into it.

Then launch your machine's console, navigate to your folder, activate your conda environment, and run the following.

```jupyter notebook```

This should launch Jupyter notebook in your browser. When it does, you can open the notebook.

#### Installing and Importing pandas

`pandas` comes pre-packaged in Anaconda.

You can always install it using the following pip command: ```pip install pandas```

If you have conda, but not pandas, you can also do: ```conda install pandas```

In [None]:
import pandas as pd

df = pd.read_csv(
    "cleaned.csv",
    dtype={ # clean up datatypes a bit
        "province": "category",
        "riding": "category",
        "birth_year": "Int64",
        "gender": "category",
        "censuscategory": "category",
        "party_major_group": "category",
        "elected": "category",
    })

# Inspect the variable metadata
df.info()

#### Installing and Importing matplotlib

`matplotlib` comes pre-packaged in Anaconda.

You can always install it using the following pip command: ```pip install matplotlib```

If you have conda, but not matplotlib, you can also do: ```conda install matplotlib```

When used with jupyter notebooks, matplotlib exhibits some specialized behavior. 

To get it to behave more like it would in a script, we run the following code snippet.

In [None]:
import matplotlib as mpl
mpl.use("nbagg") # must be run before any other mpl functions

The standard way to start using matplotlib is to run the following snippet.

In [None]:
import matplotlib.pyplot as plt

## Basic Concepts

Matplotlib figures are complex objects, so it is important to have a general understanding of their structure.

We'll start with that.

### Figure Creation vs. Display

Before getting into theory, let's take a quick look at the basic mechanics of creating and displaying figures.

In [None]:
# Create some data
X, Y = [i for i in range(20)], [i ** 2 for i in range(20)]

plt.plot(X, Y) # Create a line plot of the data on 'current' figure
plt.show() # Show the result

This quick example hides a lot of detail. 

It's not uncommon to write code like this when trying to quickly understand what's going on with a dataset.

You should retain a few things from this example:
- We didn't have to explicitly create or configure a new figure because matplotlib always tracks a 'current' figure (similar to MATLAB). This is very convenient, but it is not always the best option.
- To display or render a figure that we have created we **must** call `plt.show()` or something similar.
- Matplotlib figures are interactive: you can zoom, pan, save, etc.
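
The 'current' figure mentioned above can be inspected directly. Here is a minimal sketch, assuming only pyplot's `plt.gcf()` and `plt.gca()` helpers; the data is made up for illustration.

```python
import matplotlib.pyplot as plt

# Draws on the current figure, creating a figure and an Axes if none exist yet
plt.plot([0, 1, 2], [0, 1, 4])

fig = plt.gcf()  # "get current figure"
ax = plt.gca()   # "get current Axes"

# The Axes that plt.plot() drew on belongs to the current figure
ax_belongs_to_fig = ax in fig.axes
print(ax_belongs_to_fig)  # True

plt.close(fig)  # clean up so that later cells start from a fresh figure
```

Because this state is implicit, it is easy to lose track of which figure a call will draw on once several figures are open.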

### Anatomy of a Figure

To get a sense for the different components of a matplotlib figure, take a look at the image below. 

<img src="https://github.com/cmekik/CDSI_DViP/blob/main/anatomy_of_a_figure.png?raw=1" alt="Anatomy of a Figure" width="500"/>

This image is provided in the matplotlib docs, and its source code can be found [here](https://matplotlib.org/stable/gallery/showcase/anatomy.html).

Taking a step back, here are the main components of a figure:
- The `Figure` object, which contains all figure components
- `Axes`/`Subplots` which house the individual axes of each subplot 
- `XAxis`, `YAxis` in each subplot, which house data about individual axes (tick marks etc.)
- Other artists, such as titles, legends, lines, and markers
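
To make this hierarchy concrete, the sketch below creates a figure with two subplots and inspects its components. It is only an illustration; the variable names are arbitrary.

```python
import matplotlib.pyplot as plt

fig, axs = plt.subplots(ncols=2)  # one Figure holding two Axes

print(type(fig).__name__)  # Figure
print(len(fig.axes))       # 2

first = axs[0]
# Each Axes owns one XAxis and one YAxis object
print(type(first.xaxis).__name__, type(first.yaxis).__name__)  # XAxis YAxis

plt.close(fig)
```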

### Explicitly Initializing and Closing Figures

Instead of using plotting functions from pyplot, like `plt.plot()`, it is better to create and work with `Axes` objects.

This style is more explicit and, ultimately, more flexible.

In [None]:
# Create some data
X, Y = [i for i in range(20)], [i ** 2 for i in range(20)]

fig, ax = plt.subplots() # Create a new figure with 1 subplot
# This function returns a tuple. The first element is the new figure.
# The second element is an `Axes` object if only one subplot is requested,
# Otherwise it is an array of `Axes` objects.

ax.set( # Add some labels
    title="Plot of $y = x^2$",
    xlabel="$x$",
    ylabel="$y$")

ax.plot(X, Y) # Create a line plot of the data on the Axes we just created

fig.show() # Show the result

You can save the figure we just created as follows.

In [None]:
fig.savefig("example.png")

If you create and save a lot of figures programmatically, you might have to explicitly close each figure to save memory, as below.

This is an unfortunate quirk of matplotlib. There are other workarounds that may be more or less suitable depending on your situation, but they are a bit more advanced and not covered in further detail here.

Notice that when you run `plt.close()` on `fig`, the figure loses interactivity.

In [None]:
plt.close(fig)
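
When figures are generated in a loop, the create-save-close pattern might look like the sketch below. The in-memory `io.BytesIO` buffers are a stand-in chosen for this illustration so that nothing is written to disk; passing a file name to `fig.savefig()` works the same way.

```python
import io
import matplotlib.pyplot as plt

buffers = []
for power in (1, 2, 3):
    fig, ax = plt.subplots()
    ax.plot(range(10), [x ** power for x in range(10)])
    ax.set(title=f"Plot of $y = x^{power}$")

    buf = io.BytesIO()              # illustrative target; a file name also works
    fig.savefig(buf, format="png")  # render the figure as PNG bytes
    buffers.append(buf)

    plt.close(fig)  # release the figure before creating the next one

print(len(buffers))  # 3
```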

## Essential Plotting Methods

Let's survey how to create some common plot types. 

We'll focus on typical statistical plots. See the [plot types](https://matplotlib.org/stable/plot_types/index.html) page for a more complete listing (e.g., quivers). 

### `ax.hist()`

[Reference Documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.hist.html)

Plots a histogram.

Great for getting a sense of the distribution of a continuous variable.

In [None]:
fig, ax = plt.subplots()
ax.hist(df.percent_votes.dropna(), bins="auto")
plt.show()

#### Exercise 1

Create a histogram of the margins of victory.

Try different integer values for the `bins` parameter. What do you observe?

In [None]:
# Setup

margin = df.margin.dropna()

In [None]:
# Your code goes here


### `ax.boxplot()`

[Reference Documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.boxplot.html?highlight=boxplot#matplotlib.axes.Axes.boxplot)

Creates boxplots.

Classic plot for succinctly reporting non-parametric descriptives about a variable.

Plots, for each variable, the median and interquartile range. 

Points lying more than 1.5 × IQR below Q1 or above Q3 are marked as outliers.
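
The outlier rule can be checked by hand. The sketch below uses made-up values (not the election data) and pandas quantiles to compute the two cut-offs, assuming the default whisker setting of 1.5 × IQR.

```python
import pandas as pd

# Made-up sample with one obviously extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # anything below this is drawn as an outlier
upper_fence = q3 + 1.5 * iqr  # anything above this is drawn as an outlier

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(lower_fence, upper_fence)  # -3.5 14.5
print(outliers.tolist())         # [30]
```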

In [None]:
# If we wanted, we could produce a boxplot for only one variable.
# This code demonstrates how to put multiple boxplots on the same axes.

data = (df[["percent_votes", "margin"]]
    .rename(columns={"percent_votes": "Vote Share (%)", "margin":"Margin"})
    .dropna())

fig, ax = plt.subplots()

ax.set(
    title="Boxplots for Vote Share and Margin",
    ylabel="Percentage")

ax.boxplot(
    [data[s] for s in data], 
    labels=data.columns)

fig.show()

#### Exercise 2

Create a boxplot for the number of votes. Add a title and labels.

In [None]:
# Exercise setup

votes = df.votes.dropna() # Vote variable

In [None]:
# Your code goes here


### `ax.violinplot()`

[Reference Documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.violinplot.html?highlight=violinplot#matplotlib.axes.Axes.violinplot)

Violin plots combine the information contained in histograms and boxplots in one neat plot.

In [None]:
# Setup

data = (
    df[["percent_votes", "margin"]]
    .rename(columns={"percent_votes": "Vote Share (%)", "margin":"Margin"})
    .dropna())

In [None]:
# Plot

# If we wanted, we could produce a violin plot for only one variable.
# This code demonstrates how to put multiple violin plots on the same axes.

fig, ax = plt.subplots()

ax.set(
    title="Violin plots for Vote Share and Margin",
    ylabel="Percentage",
    xticks=[1, 2],
    xticklabels=data.columns)

ax.violinplot(
    [data[s] for s in data],
    showmedians=True,
    quantiles=[[0.25, 0.75], [0.25, 0.75]])

fig.show()

#### Exercise 3

Create a violin plot for the number of votes. Include marks for the median as well as the 33rd and 66th percentiles (Q33 and Q66).

In [None]:
# Setup

votes = df.votes.dropna() # votes variable

In [None]:
# Your code goes here


### `ax.bar()`

[Reference Documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html)

Bar charts, optionally with error bars.

Ideal for plotting statistics for data grouped by categories.

Bars are vertical by default, but you can construct horizontal bars using [`ax.barh()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.barh.html).
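
For reference, a horizontal version might look like the following sketch. The group names and statistics are invented for illustration; note that `ax.barh()` takes `y` and `width` where `ax.bar()` takes `x` and `height`, and that error bars are passed as `xerr`.

```python
import matplotlib.pyplot as plt

# Hypothetical summary statistics, made up for this example
groups = ["A", "B", "C"]
means = [12.0, 18.5, 9.3]
sems = [1.1, 0.8, 1.5]

fig, ax = plt.subplots()
ax.set(title="Horizontal Bars", xlabel="Mean", ylabel="Group")

bars = ax.barh(
    y=groups,
    width=means,
    xerr=sems,  # horizontal error bars
    capsize=3,
    color="0.7",
    edgecolor="k")

# In the notebook, call fig.show() afterwards to display the figure.
```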

In [None]:
# Setup

gs = df.groupby("gender").percent_votes
pct_votes_x_gender = (
    pd.concat(
        [gs.mean().rename("mean"), gs.sem().rename("sem")], 
        axis=1)
    .reset_index())

print(pct_votes_x_gender)

In [None]:
# Plot

fig, ax = plt.subplots()

ax.set(
    title="Mean Vote Share by Gender Since 1990",
    xlabel="Gender",
    ylabel="Vote Share (%)")

ax.bar(
    x="gender", 
    height="mean", 
    yerr="sem", 
    data=pct_votes_x_gender, 
    linewidth=1,
    capsize=3,
    color="0.7",
    edgecolor="k"
    )

fig.show()

#### Exercise 4

Create a bar graph showing the mean vote share by incumbency status with error bars.

In [None]:
# Setup

df["incumbency"] = (df
     .sort_values(["id", "year"])
     .groupby("id")
     .elected
     .shift()
     .cat.rename_categories({"Elected": "Incumbent", "Not elected": "Nonincumbent"}))


gs = df.groupby("incumbency").percent_votes
pct_votes_x_incumbency = ( # Use this as your data
    pd.concat(
        [gs.mean().rename("mean"), gs.sem().rename("sem")], 
        axis=1)
    .reset_index())

print(pct_votes_x_incumbency)

In [None]:
# Your code goes here


### `ax.errorbar()`

Another graph for presenting estimates by category is [`ax.errorbar()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.errorbar.html).

Here is the same information as in the initial bar plot, using errorbar instead.

In [None]:
# Setup

gs = df.groupby("gender").percent_votes
pct_votes_x_gender = (
    pd.concat(
        [gs.mean().rename("mean"), gs.sem().rename("sem")], 
        axis=1)
    .reset_index())

print(pct_votes_x_gender)

fig, ax = plt.subplots(1)

In [None]:
# Plot

ax.set(
    xlim=[-.5, 2.5],
    title="Mean Vote Share by Gender Since 1990",
    xlabel="Gender",
    ylabel="Vote Share (%)")

ax.errorbar(
    x="gender", 
    y="mean", 
    yerr="sem", 
    data=pct_votes_x_gender, 
    fmt="o",
    linewidth=1,
    capsize=4,
    color="k",
    markersize=5
    )

fig.show()

#### Exercise 5

Create an error bar graph showing the mean vote share by incumbency status.

In [None]:
# Exercise Setup

df["incumbency"] = (df
     .sort_values(["id", "year"])
     .groupby("id")
     .elected
     .shift()
     .cat.rename_categories({"Elected": "Incumbent", "Not elected": "Nonincumbent"}))


gs = df.groupby("incumbency").percent_votes
pct_votes_x_incumbency = ( # Use this as your data
    pd.concat(
        [gs.mean().rename("mean"), gs.sem().rename("sem")], 
        axis=1)
    .reset_index())

print(pct_votes_x_incumbency)

In [None]:
# Your code goes here


### `ax.plot()`

[Reference Documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html)

Plot a function.

Ideal for situations where $y$ and $x$ are ordered or continuous variables and $y$ is a function of $x$ (i.e., there is exactly one $y$ value for each $x$ value).

Can plot markers, lines or both.
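
Whether markers, lines, or both are drawn can be controlled with a short format string passed after the data. The sketch below uses toy data; the format strings `"o"`, `"-"`, and `"r--s"` combine an optional color, line style, and marker.

```python
import matplotlib.pyplot as plt

x = list(range(10))

fig, ax = plt.subplots()

# "o": circle markers only, no connecting line
(markers_only,) = ax.plot(x, [v for v in x], "o", label="markers only")
# "-": solid line only, no markers
(line_only,) = ax.plot(x, [v + 2 for v in x], "-", label="line only")
# "r--s": red, dashed line, square markers
(both,) = ax.plot(x, [v + 4 for v in x], "r--s", label="line and markers")

ax.legend()

# In the notebook, call fig.show() afterwards to display the figure.
```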

In [None]:
# Setup

# Calculate percentage of women candidates and MPs in each election

candidates = pd.DataFrame({
              "num": df.groupby("year").id.nunique(),
          "num_MPs": df[df.elected == "Elected"].copy().groupby("year").id.nunique(),
        "num_women": df[df.gender == "F"].copy().groupby("year").id.nunique(),
    "num_women_MPs": df[(df.gender == "F") & (df.elected == "Elected")].copy().groupby("year").id.nunique(),
})

candidates["pct_w_cand"] = 100 * candidates.num_women / candidates.num
candidates["pct_w_MPs"] = 100 * candidates.num_women_MPs / candidates.num_MPs
candidates = candidates.reset_index()

print(candidates)

In [None]:
# Plot

# Note that we can put multiple plots on the same Axes object.
# Also note that we can pass column names as strings, together with a pandas object via `data`.
# Alternatively, we could directly pass the data.

fig, ax = plt.subplots()

ax.set(
    title="Canadian Women Politicians after 1990",
    xlabel="Year",
    ylabel="Percent")

ax.plot("year", "pct_w_cand", "C6--s", data=candidates, label="Candidates")
ax.plot("year", "pct_w_MPs", "C4-", data=candidates, label="MPs")

ax.legend()

fig.show()

#### Exercise 6

Plot mean candidate age over time. Include labels.

In [None]:
# Setup

data = df.copy()
data["age"] = data.age.astype("float")
data = data.groupby("year").age.mean().reset_index()
print(data)

In [None]:
# Your code goes here


### `ax.fill_between()`

[Reference Documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.fill_between.html)

Fills the area between two sets of $y$ values (second $y$ value defaults to 0). 

Good for presenting uncertainties in continuous estimates.

In [None]:
# Setup

gs = df.groupby("year").percent_votes
data = (
    pd.concat(
        [gs.mean().rename("mean"), gs.sem().rename("sem")],
        axis=1)
    .reset_index())

data["ul"] = data["mean"] + 1.96 * data["sem"]
data["ll"] = data["mean"] - 1.96 * data["sem"]

print(data)

In [None]:
# Plot

fig, ax = plt.subplots(1)

ax.set(
    title="Average Vote Share Per Candidate Since 1990",
    xlabel="Year",
    ylabel="Vote Share (%)")

ax.plot("year", "mean", "-", data=data)
ax.fill_between(
    x="year", 
    y1="ul", 
    y2="ll", 
    data=data,
    alpha=.1)

fig.show()

#### Exercise 7

Produce a similar graph for the margin of victory.

In [None]:
# Setup

gs = df.groupby("year").margin
data = (
    pd.concat(
        [gs.mean().rename("mean"), gs.sem().rename("sem")],
        axis=1)
    .reset_index())

data["ul"] = data["mean"] + 1.96 * data["sem"]
data["ll"] = data["mean"] - 1.96 * data["sem"]

print(data)

In [None]:
# Your code goes here



### `ax.scatter()`

[Reference Documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)

Creates a scatter plot.

Ideal when your data may be viewed as points in a two-dimensional space.

You can map values to marker colors and sizes as well, allowing you to represent higher dimensional data.

In [None]:
# Setup

data = df.sort_values(["id", "year"])
data["percent_votes_lag"] = data.groupby("id").percent_votes.shift()
print(data[["id", "year", "percent_votes", "percent_votes_lag"]])

In [None]:
# Plot

fig, ax = plt.subplots(1)

ax.set(
    title="Relation Between Previous and Current Vote Share",
    xlabel="Previous Vote Share (%)",
    ylabel="Current Vote Share (%)")

ax.scatter(
    x="percent_votes_lag", 
    y="percent_votes",
    s=2,
    alpha=.3,
    color="gray",
    data=data[["percent_votes_lag", "percent_votes"]].dropna())

plt.show()

Let's try mapping a variable to dot color.

In [None]:
# Calculate the number of votes cast in each riding
data = df[df.year == 2021].copy()
data["num_votes"] = data.groupby(["year", "province", "riding"]).votes.transform("sum")


# Plot the margin against the number of votes in riding and number of votes received.
fig, ax = plt.subplots(1)

ax.set(
    title="Margins by Number of Voters and Votes Received",
    xlabel="Number of Voters",
    ylabel="Votes Received")
ax.tick_params(axis="x", labelrotation = 45)

sc = ax.scatter(
    x="num_votes", 
    y="votes", 
    s=3, 
    c="margin", 
    data=data,
    cmap="coolwarm",
    vmin=-100,
    vmax=+100)

fig.colorbar(sc)

plt.show()

#### Exercise 8

Create a graph similar to the one above, but mapping `percent_votes` to point size instead of mapping `margin` to point color. 

Can you adjust the point size so that it looks nice?

In [None]:
# Setup

# Calculate the number of votes cast in each riding
data = df[df.year == 2021].copy()
data["num_votes"] = data.groupby(["year", "province", "riding"]).votes.transform("sum")
data.percent_votes = data.percent_votes / 3

In [None]:
# Your code goes here


## Controlling Plot Appearance

We've already seen how to control several aspects of plot appearance.

Let's go into some more detail.

Parts of this section are taken directly from Ben Root's [Anatomy of Matplotlib](https://github.com/matplotlib/AnatomyOfMatplotlib/tree/master/).

### `ax.set(...)` vs. `ax.set_<property>(...)`

So far we have used the `ax.set(...)` command to control many aspects of plot appearance.

This method is great if you quickly want to make adjustments. 

So far we have adjusted titles, labels, and ticks. But you can do more with `ax.set()`, like adjusting plotting limits.

See [the documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set.html?highlight=set#matplotlib.axes.Axes.set) for a full list of options.
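
As an illustration of adjusting plotting limits, the following sketch zooms in on part of a curve with `xlim` and `ylim`. The data and limit values are arbitrary.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(20), [i ** 2 for i in range(20)])

ax.set(
    title="Zoomed View of $y = x^2$",
    xlim=(1, 10),   # show only 1 <= x <= 10
    ylim=(1, 100))  # show only 1 <= y <= 100

print(ax.get_xlim())
print(ax.get_ylim())

# In the notebook, call fig.show() afterwards to display the figure.
```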

`ax.set()` can be limiting if you want to control details.

For instance suppose we want to control the size and font of the title of a graph.

In this situation, we can't use `ax.set()`; instead, we need to use `ax.set_title(...)`.

In [None]:
fig, ax = plt.subplots()
ax.set_title("My Graph", fontfamily="serif", size=50)
fig.show()

### `fig.set()`

[Reference Documentation](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.set)

You can set some top-level figure parameters using `fig.set()`.

You mainly use this method to adjust the size, aspect ratio, and resolution of a figure.
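
A minimal sketch of such an adjustment is shown below. The specific values are arbitrary; width and height are given in inches, and `dpi` is the resolution in dots per inch.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# A wide 2:1 figure: 8 by 4 inches at 150 dots per inch
fig.set(figwidth=8, figheight=4, dpi=150)

width, height = fig.get_size_inches()
print(width, height)

# In the notebook, call fig.show() afterwards to display the figure.
```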

### Robust Customization

[RcParams and Style Customization Tutorial](https://matplotlib.org/stable/tutorials/introductory/customizing.html)

You can imagine that styling each figure individually can get old. Luckily there are a few solutions for this.

First, it is possible to adjust all defaults provided by matplotlib by playing with the `rcParams` ('rc' for 'runtime configuration').

Better yet, matplotlib gives you the option to choose styles, which will automatically update your graphs to have a consistent look and feel. The simplest way to do this is to use `plt.style.use(...)`. For example: `plt.style.use("ggplot")`.

You can also create your own styles.

In [None]:
# Here is how you list available styles

print(plt.style.available)
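
Both approaches can also be applied temporarily, which is handy for experimenting. The sketch below assumes the context managers `plt.rc_context()` and `plt.style.context()`; assigning to `plt.rcParams[...]` directly or calling `plt.style.use(...)` would instead change the defaults for the rest of the session.

```python
import matplotlib.pyplot as plt

x, y = [0, 1, 2, 3], [0, 1, 4, 9]

# Temporarily override an individual default via rcParams
with plt.rc_context({"lines.linewidth": 3}):
    fig1, ax1 = plt.subplots()
    (thick_line,) = ax1.plot(x, y)

# Temporarily apply a complete style
with plt.style.context("ggplot"):
    fig2, ax2 = plt.subplots()
    ax2.plot(x, y)

print(thick_line.get_linewidth())  # 3.0

# In the notebook, call fig1.show() and fig2.show() to compare the results.
```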

### Naming Colors

[Reference Documentation](http://matplotlib.org/api/colors_api.html#module-matplotlib.colors)

There are several ways to name individual colors.

This comes in handy when manually setting colors for lines, markers, etc. or when configuring defaults/styles. 

#### Naming Colors by a Single Letter

The simplest way is to use a single letter to select among some basic colors.

- b: blue
- g: green
- r: red
- c: cyan
- m: magenta
- y: yellow
- k: black
- w: white

#### Popular Naming Schemes

You can also use the following color naming schemes:
- HTML/CSS HEX codes, e.g.: `"#0000FF"`
- Standard HTML/CSS color names (140 in total), e.g.: `"midnightblue"`, `"olive"`, etc. [(full list)](https://www.w3schools.com/Colors/colors_names.asp)
- About 1000 color names from the xkcd color survey, e.g.: `"xkcd:windows blue"`, `"xkcd:fire engine red"` [(full list)](https://xkcd.com/color/rgb/)
- Colors from the Tableau T10 palette (also the default mpl color cycle): `'tab:blue'`, `'tab:orange'` [(blog post)](https://www.tableau.com/about/blog/2016/7/colors-upgrade-tableau-10-56782)

#### More Color Naming Schemes

- Grayscale values as float strings between `"0.0"` (black) and `"1.0"` (white).
- Cycle reference using `"C0"`-`"C9"` to reference one of the first 10 colors in the current color cycle.
- RGB(A) tuples: Tuples of floats between 0 and 1 that indicate the intensity of the r, g, b components plus, optionally, transparency. Note: If transparency is given in an RGBA tuple and a separate alpha argument is set, the alpha argument will take precedence.
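
One way to see how these naming schemes relate is to convert each one to an RGBA tuple. The sketch below uses `matplotlib.colors.to_rgba` for this; the particular color specifications are just examples.

```python
from matplotlib.colors import to_rgba

specs = [
    "b",                    # single letter
    "#0000FF",              # hex code
    "midnightblue",         # HTML/CSS name
    "xkcd:windows blue",    # xkcd color survey name
    "tab:blue",             # Tableau palette
    "0.5",                  # grayscale value as a string
    "C0",                   # first color of the current color cycle
    (0.0, 0.0, 1.0, 0.5),   # RGBA tuple (half-transparent blue)
]

for spec in specs:
    print(repr(spec), "->", to_rgba(spec))
```

Note that `"b"` and `"#0000FF"` resolve to the same pure blue, while `"tab:blue"` is a softer shade.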

### Colormaps

[Reference Documentation](https://matplotlib.org/stable/tutorials/colors/colormaps.html?highlight=colormap)

Colormaps map ordered or continuous variables to color values.

We played with colormaps when looking at scatter plots. But they also come in handy when, e.g., plotting images.

There are several types: Sequential, Diverging, Qualitative, etc.

Matplotlib gives you many colormaps to choose from, but you can also create your own colormaps if you need to.
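
A colormap can be thought of as a function from the interval [0, 1] to RGBA colors. The sketch below looks up the diverging `"coolwarm"` map with `plt.get_cmap()`, samples it, and applies it to toy scatter data.

```python
import matplotlib.pyplot as plt

cmap = plt.get_cmap("coolwarm")

# Calling a colormap with a float in [0, 1] returns an RGBA tuple
low, mid, high = cmap(0.0), cmap(0.5), cmap(1.0)
print(low)   # bluish end of the map
print(high)  # reddish end of the map

# vmin and vmax decide which data values map to the two ends
values = list(range(-5, 6))
fig, ax = plt.subplots()
sc = ax.scatter(
    values, [v ** 2 for v in values],
    c=values, cmap=cmap, vmin=-5, vmax=5)
fig.colorbar(sc, label="value")

# In the notebook, call fig.show() afterwards to display the figure.
```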

### Markers

[Reference Documentation](http://matplotlib.org/api/markers_api.html)

Markers are the symbols used to mark individual data points. 

We adjusted them when looking at `ax.plot()`.

Here is a list of marker names.

marker     |  description  | marker    |  description    | marker   |  description  | marker    |  description  
:----------|:--------------|:----------|:----------------|:---------|:--------------|:----------|:--------------
"."        |  point        | "+"       |  plus           | ","      |  pixel        | "x"       |  cross
"o"        |  circle       | "D"       |  diamond        | "d"      |  thin_diamond |           |
"8"        |  octagon      | "s"       |  square         | "p"      |  pentagon     | "\*"      |  star
"&#124;"   |  vertical line| "\_"      | horizontal line |  "h"     |  hexagon1     | "H"       |  hexagon2
0          |  tickleft     | 4         |  caretleft      | "<"      | triangle_left | "3"       |  tri_left
1          |  tickright    | 5         |  caretright     | ">"      | triangle_right| "4"       |  tri_right
2          |  tickup       | 6         |  caretup        | "^"      | triangle_up   | "2"       |  tri_up
3          |  tickdown     | 7         |  caretdown      | "v"      | triangle_down | "1"       |  tri_down
"None"     |  nothing      | `None`    |  default        | " "      |  nothing      | ""        |  nothing

### Linestyles

[Reference Documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D)

Line styles control how lines are drawn.

Below is a list of linestyle names. You can also customize your own line styles.

linestyle          | description
-------------------|------------------------------
'-'                | solid
'--'               | dashed
'-.'               | dashdot
':'                | dotted
'None'             | draw nothing
' '                | draw nothing
''                 | draw nothing
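
Markers and line styles can also be set through the `marker` and `linestyle` keyword arguments. The sketch below uses toy data and includes one customized dash pattern, written as an `(offset, (on, off, ...))` tuple with lengths in points.

```python
import matplotlib.pyplot as plt

x = list(range(6))

fig, ax = plt.subplots()

(a,) = ax.plot(x, [v for v in x], marker="o", linestyle="-", label="circle, solid")
(b,) = ax.plot(x, [v + 1 for v in x], marker="D", linestyle=":", label="diamond, dotted")
(c,) = ax.plot(x, [v + 2 for v in x], marker="^", linestyle="None", label="triangle, no line")

# Custom pattern: 5 points on, 2 off, 1 on, 2 off
(d,) = ax.plot(x, [v + 3 for v in x], linestyle=(0, (5, 2, 1, 2)), label="custom dashes")

ax.legend()

# In the notebook, call fig.show() afterwards to display the figure.
```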

### Ticks, Tick Lines, Tick Labels, Tickers, and Spines

* A Tick is the *location* of a Tick Label.
* A Tick Line is the line that denotes the location of the tick.
* A Tick Label is the text that is displayed at that tick.
* A [`Ticker`](http://matplotlib.org/api/ticker_api.html#module-matplotlib.ticker) automatically determines the ticks for an Axis and formats the tick labels.
* Spines are the axis lines for a plot. 

[`tick_params()`](https://matplotlib.org/api/axes_api.html#ticks-and-tick-labels) is often used to configure the appearance of ticks, tick lines, and tick labels.
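
Here is a brief sketch of how these pieces fit together, using toy data. It assumes `MultipleLocator` and `PercentFormatter` from `matplotlib.ticker`: the locator decides where ticks go, the formatter decides how tick labels read, and `tick_params()` styles the result.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, PercentFormatter

fig, ax = plt.subplots()
ax.plot([0, 25, 50, 75, 100], [0, 10, 40, 70, 100])

# Locator: place a major tick at every multiple of 25 on the x axis
ax.xaxis.set_major_locator(MultipleLocator(25))

# Formatter: display y tick labels as percentages, where 100 means 100%
ax.yaxis.set_major_formatter(PercentFormatter(xmax=100))

# Appearance: rotate the x tick labels and lengthen the x tick lines
ax.tick_params(axis="x", labelrotation=45, length=8)

# In the notebook, call fig.show() afterwards to display the figure.
```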


To adjust spines, use [`set_position()`](http://matplotlib.org/api/spines_api.html#matplotlib.spines.Spine.set_position). 

In [None]:
fig, ax = plt.subplots()

ax.set(title="Plot of $y = x^2$")

ax.plot([i for i in range(-9, 10)], [i ** 2 for i in range(-9, 10)])

ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')

# move bottom spine up to y=0 position:
ax.tick_params(axis='x', labelbottom=True)
ax.spines['bottom'].set_position(('data',0))

# move left spine to the right to position x == 0:
ax.tick_params(axis='y', labelleft=True)
ax.spines['left'].set_position(('data',0))

fig.show()

## More Advanced Plotting: Grouping, Multiple Subplots & Faceting

Finally, we can preview some more advanced plotting techniques.

These techniques serve to organize information and combat overplotting.


### Grouping

Grouping is one technique used to facilitate comparing data from different groups.

We saw an example of grouping when looking at `ax.plot()`.

To plot data by group, we simply plot each group on the same `Axes`.

Here is a scatterplot example.

In [None]:
# Setup

data = df.sort_values(["id", "year"])
data["percent_votes_lag"] = data.groupby("id").percent_votes.shift()

data = data[["elected", "percent_votes_lag", "percent_votes"]].dropna()
print(data)

In [None]:
# Plot

fig, ax = plt.subplots(1)

ax.set(
    title="Relation Between Previous and Current Vote Share",
    xlabel="Previous Vote Share (%)",
    ylabel="Current Vote Share (%)")

for name, _data in data.groupby("elected"):
    ax.scatter(
        x="percent_votes_lag", 
        y="percent_votes",
        s=2,
        alpha=.3,
        data=_data,
        label=name)

ax.legend()

fig.show()

#### Exercise 9

Can you group the data above by gender?

What do you think about the resulting graph?

In [None]:
# Setup

data = df.sort_values(["id", "year"])
data["percent_votes_lag"] = data.groupby("id").percent_votes.shift()

data = data[["gender", "percent_votes_lag", "percent_votes"]].dropna()
print(data)

In [None]:
# Your code goes here


#### Grouped Bar Chart

In experimental settings, people like to show bar charts grouped by conditions. 

This is a little bit more involved to do.

In [None]:
# Setup

gs = df.groupby(["elected", "censuscategory"]).percent_votes
data = (
    pd.concat(
        [gs.mean().rename("pct_votes_mean"), gs.sem().rename("pct_votes_sem")],
        axis=1)
    .reset_index())
print(data)

In [None]:
# Graph

x = pd.Series([i for i in range(len(data.censuscategory.cat.categories))])
width = .35  # the width of the bars

fig, ax = plt.subplots()

rects1 = ax.bar(
    x - width/2, 
    data[data.elected == "Elected"].pct_votes_mean, 
    width, 
    yerr=data[data.elected == "Elected"].pct_votes_sem,
    capsize=2,
    label='Elected')

rects2 = ax.bar(
    x + width/2, 
    data[data.elected == "Not elected"].pct_votes_mean,
    width, 
    yerr=data[data.elected == "Not elected"].pct_votes_sem,
    capsize=2,
    label='Not elected')

ax.set(
    title="Vote Share by Occupation and Incumbency",
    ylabel="Vote Share (%)",
    xticks=x,
    xticklabels=data.censuscategory.cat.categories)

ax.tick_params(axis="x", labelrotation = 90)

ax.legend()

fig.tight_layout()

plt.show()

### Multiple Subplots & Faceting

Often, we want to have multiple subplots in the same figure. 

Sometimes, this is done to provide additional information.

But, often, it is done to prevent overplotting. This is called faceting.

Consider the scatterplot by gender that we created previously.

Let's split it up so that each group gets its own subplot.

In [None]:
# Setup

data = df.sort_values(["id", "year"])
data["percent_votes_lag"] = data.groupby("id").percent_votes.shift()

data = data[["gender", "percent_votes_lag", "percent_votes"]].dropna()
print(data)

In [None]:
# Plot

fig, axs = plt.subplots(ncols=3, sharex=True, sharey=True)


for ax, (name, _data) in zip(axs, data.groupby("gender")):
    ax.set(title=name)
    ax.scatter(
        x="percent_votes_lag", 
        y="percent_votes",
        s=2,
        alpha=.3,
        data=_data)

axs[0].set(ylabel="Current Vote Share (%)")
axs[1].set(xlabel="Previous Vote Share (%)")

fig.suptitle("Past and Current Vote Share by Gender")
    
fig.show()
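
When there are more facets than fit comfortably in one row, subplots can be arranged in a grid. In the sketch below (toy data), `plt.subplots()` returns a 2D array of `Axes`, and `axs.flat` iterates over them in row-major order.

```python
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True)
print(axs.shape)  # (2, 2)

xs = range(-5, 6)
for power, ax in enumerate(axs.flat, start=1):
    ax.plot(xs, [x ** power for x in xs])
    ax.set(title=f"$y = x^{power}$")

fig.tight_layout()  # keep titles and tick labels from overlapping

# In the notebook, call fig.show() afterwards to display the figure.
```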

#### Exercise 10

Try faceting a plot of current vote share against previous vote share by incumbency (i.e., elected).

What do you think of the resulting plot? Is faceting or grouping more appropriate here?

In [None]:
data = df.sort_values(["id", "year"])
data["percent_votes_lag"] = data.groupby("id").percent_votes.shift()

data = data[["elected", "percent_votes_lag", "percent_votes"]].dropna()
print(data)

In [None]:
# Your code goes here


## Conclusion

This is the end of our whirlwind tour of data visualization in Python using `matplotlib`.

I'll leave you with some pointers for further exploration.

### Advanced matplotlib

- Customizing subplot layouts
- Controlling scaling (log, semilog, etc.)
- Plotting images, quivers, and 3D data
- Creating animations

### Other Plotting Tools Using Matplotlib

- Seaborn, providing more convenient APIs for typical data science plotting (e.g. grouping)
- plotnine, grammar of graphics in Python, uses matplotlib as backend

### Epilogue

Remember to close your figures!

In [None]:
plt.close("all")