(vis-tables)=
# Tables


## Introduction

The humble table is an underloved and underappreciated element of communicating analysis. While it may not be as visually engaging as a vivid graph (and is much less suitable for a general audience), it has the advantage of conveying exact numerical information. It's also an essential part of some analysis: for example, economics papers usually have a "table 1" that contains descriptive statistics.

Unfortunately, Python is a bit weaker on tables than it ought to be. As ever, **pandas** is the Swiss Army Knife of data analysis and can produce tables in a wide range of formats, so in this chapter we'll mainly be looking at *creating tables with **pandas***. While **pandas** isn't perfect for crafting tables, using it means you can benefit from its large number of output formats. However, **pandas** is not your *only* option: you can also create visually exciting tables using **matplotlib**, as we'll see.

For more on best practice for tables, check out the advice of the UK government's [Analysis Function](https://analysisfunction.civilservice.gov.uk/policy-store/data-visualisation-tables/).

As ever, we'll start by importing some key packages and initialising any settings:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(
    seed_for_prng
)  # prng = pseudo-random number generator

In [None]:
import matplotlib_inline.backend_inline

# Plot settings
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

# Set max rows displayed for readability
pd.set_option("display.max_rows", 6)
pd.set_option("display.max_columns", 7)

## Creating tables with **pandas**

### Getting data

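We'll use the *penguins* dataset to demonstrate the use of **pandas** in creating tables. These data can be loaded via the **seaborn** package, which you'll need to have installed (the first time you load the dataset, **seaborn** fetches it from an online repository):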

In [None]:
import seaborn as sns

pen = sns.load_dataset("penguins")
pen.head()

### Preparing your table in **pandas**

There are a few operations that you'll want to do again and again to create tables, and the cross-tab is one of them. A cross-tab is just a count of the number of observations split by two groupings. Beyond the basic counts that `pd.crosstab` produces, we can add totals or show proportions using the `margins=` and `normalize=` keyword arguments.

In the example below, we normalise by row (`normalize="index"`) so that each row sums to 1. With `margins=True`, the final row, labelled `All`, shows the share of all penguins found on each island.

In [None]:
pd.crosstab(pen["species"], pen["island"], margins=True, normalize="index")

The neat thing about the cross-tabs that come out is that they are themselves **pandas** dataframes.
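
To see how the `margins=` and `normalize=` options change the result, here is a minimal sketch on a small made-up dataset. The `toy` dataframe and its values are invented for illustration and are not the penguins data; `margins_name=` is another `pd.crosstab` argument, used here to relabel the totals.

```python
import pandas as pd

# Small made-up dataset (illustrative only; not the penguins data)
toy = pd.DataFrame(
    {
        "species": ["A", "A", "A", "B", "B", "C"],
        "island": ["X", "Y", "Y", "X", "X", "Y"],
    }
)

# Raw counts, with row and column totals labelled "Total"
counts = pd.crosstab(
    toy["species"], toy["island"], margins=True, margins_name="Total"
)

# Shares within each row (each row sums to 1)
row_shares = pd.crosstab(toy["species"], toy["island"], normalize="index")

# Shares within each column (each column sums to 1)
col_shares = pd.crosstab(toy["species"], toy["island"], normalize="columns")

print(counts)
print(row_shares)
print(col_shares)
```

Because each result is an ordinary dataframe, it can be rounded, renamed, and styled like any other table in this chapter.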

Of course, the usual **pandas** functions can be used to create any table you need:

In [None]:
pen_summary = (
    pen
    .groupby(["species", "island"])
    .agg(
        median_bill=("bill_length_mm", "median"),
        mean_bill=("bill_length_mm", "mean"),
        std_flipper=("flipper_length_mm", "std"),
    )
)
pen_summary

For reasons that will become apparent later, we'll replace one of these values with a missing value (`pd.NA`).

In [None]:
pen_summary.iloc[2, 1] = pd.NA
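
As a quick check of what this kind of assignment does, the sketch below applies the same pattern to a small made-up table and counts the missing entries. The `toy` dataframe and its values are invented for illustration.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({"mean_len": [38.8, 47.5, 44.2]})

# Same pattern as in the text: overwrite one cell, chosen by position
toy.iloc[1, 0] = pd.NA

# Count how many entries in the column are now missing
n_missing = int(toy["mean_len"].isna().sum())
print(toy)
print(n_missing)
```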

The table we just saw, `pen_summary`, is not what you'd call publication quality. The numbers have lots of superfluous digits. The names are useful for when you're doing analysis, but might not be so obvious to someone coming to this table for the first time. So let's see what we can do to clean it up a bit.

First, those numbers. We can apply number rounding quickly using `.round`.

In [None]:
pen_summary.round(2)

This returns a new dataframe rather than changing `pen_summary` in place. To change the names of the columns, you can use one of the standard approaches, such as `.rename`:

In [None]:
pen_sum_clean = pen_summary.rename(
    columns={
        "median_bill": "Median bill length (mm)",
        "mean_bill": "Mean bill length (mm)",
        "std_flipper": "Std. deviation of flipper length (mm)",
    }
)
pen_sum_clean
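
Rounding and renaming can also be combined in a single chain. The following is a minimal sketch on a made-up table: `toy_summary`, its column names, and its values are hypothetical. It passes a dictionary to `.round` to set the number of decimal places per column, and uses `.rename_axis` to give the index a reader-friendly name.

```python
import pandas as pd

# Made-up summary table (hypothetical values, for illustration only)
toy_summary = pd.DataFrame(
    {"mean_len": [38.7914, 47.5049], "std_len": [2.6634, 3.0819]},
    index=pd.Index(["A", "B"], name="group"),
)

# Round each column to its own number of decimal places,
# then give the columns and the index reader-friendly names
toy_clean = (
    toy_summary
    .round({"mean_len": 1, "std_len": 2})
    .rename(
        columns={
            "mean_len": "Mean length (mm)",
            "std_len": "Std. dev. of length (mm)",
        }
    )
    .rename_axis(index="Group")
)
print(toy_clean)
```

Each method returns a new dataframe, so `toy_summary` keeps its original names and unrounded values.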

### Styling a **pandas** dataframe

As well as making direct modifications to a dataframe, you can apply *styles*. These are a more versatile way to control how a table looks: they change how values are displayed without changing the underlying data.

Behind the scenes, when a table is displayed on a webpage like this one, it is rendered using HTML (the markup language that web pages are written in). Styling modifies the default HTML for tables so that they look better.

In the example below, you can see some of the options that are available:

- `precision` sets the number of decimal places shown for floating-point numbers (like `.round`, but only the display changes)
- `na_rep` sets how missing values are rendered
- `thousands` sets the character used to separate groups of thousands (for readability)
- `formatter` gives fine-grained control over the formatting of individual columns

In [None]:
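pen_styled = pen_sum_clean.style.format(
    precision=3,  # decimal places for columns without their own formatter
    na_rep="Value missing",  # text shown in place of missing values
    thousands=",",  # separator between groups of thousands
    formatter={
        "Mean bill length (mm)": "{:.1f}",
        # Convert mm to micrometres and show as a whole number
        "Std. deviation of flipper length (mm)": lambda x: "{:,.0f} um".format(x * 1e3),
    },
)
pen_styled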
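
The entries of the `formatter` dictionary are ordinary Python format strings or functions, so it can help to try them on a single number first. The sketch below does this in plain Python; the input values are made up, and `to_micrometres` is a hypothetical name for a function with the same pattern as the lambda above.

```python
# Format specification of the kind passed to `formatter`:
# fixed-point notation with one decimal place
one_decimal = "{:.1f}".format(38.7914)

# Same pattern as the lambda in the text: scale by 1,000 (mm to micrometres),
# then show no decimals, with a comma as the thousands separator
def to_micrometres(x):
    return "{:,.0f} um".format(x * 1e3)

scaled = to_micrometres(6.54)

print(one_decimal)
print(scaled)
```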

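A styled table can also be exported to other formats. For example, `.to_latex()` returns the table as a string of LaTeX code:

In [None]:
pen_styled.to_latex()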

### Limitations of tables in **pandas**

## Creating tables with **matplotlib**