# Visualizing many distributions at once using boxplots and ridgeline plots

This is the fifth installment in a series of blog posts where we reproduce plots from Claus Wilke’s book, *Fundamentals of Data Visualization.* 

This notebook demonstrates how to recreate the boxplot, sina plot, and ridgeline plot found in the “[visualizing many distributions at once](https://clauswilke.com/dataviz/boxplots-violins.html)” chapter of the book.

We will use the `vbar()`, `scatter()`, `harea()`, and `patch()` glyphs to recreate the plots.

In [1]:
from bokeh.io import output_notebook
import pandas as pd

output_notebook()  # render plots inline in notebook

## Boxplot

The plot in this sub-section shows the mean daily temperatures in Lincoln, Nebraska, in 2016. The line in the middle of each box represents the median, and the box encloses the middle 50% of the data.

The top and bottom whiskers extend to the maximum and minimum values that fall within 1.5 times the height of the box, measured from the top and bottom of the box. Points beyond the whiskers are drawn individually as outliers.
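
As a rough sketch of that rule, the snippet below computes the quartiles, the 1.5 × IQR fences, and the whisker ends for a small made-up sample. The sample values and the helper name `box_stats` are assumptions for illustration and are not part of the notebook's data.

```python
import numpy as np


def box_stats(values):
    """Return quartiles, whisker ends and outliers for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1  # height of the box

    # fences: the furthest a whisker is allowed to reach
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # whiskers stop at the most extreme observations inside the fences
    is_inside = (values >= lower_fence) & (values <= upper_fence)
    inside = values[is_inside]
    outliers = values[~is_inside]

    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }


# made-up temperatures with one unusually warm day
stats = box_stats([20, 22, 23, 25, 26, 28, 30, 60])
print(stats)
```

Here the single value of 60 lies above the upper fence, so it is reported as an outlier and the upper whisker stops at 30. Note that the data preparation in this notebook stores the fences themselves in the `upper` and `lower` columns, so its whiskers reach the fences rather than the most extreme observation inside them.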

### Data preparation

In [2]:
file = "../data/csv_files/lincoln.csv"
df = pd.read_csv(file)

df["DATE"] = pd.to_datetime(df["DATE"])
df["TAVG"] = (df["TMAX"] + df["TMIN"]) / 2
df["MONTH"] = df.DATE.dt.strftime("%b")

df = df[
    [
        "MONTH",
        "TMIN",
        "TMAX",
        "TAVG",
    ]
]

qs = df.groupby("MONTH").TAVG.quantile([0.25, 0.5, 0.75]).unstack().reset_index()
qs.columns = ["MONTH", "Q1", "Q2", "Q3"]

iqr = qs.Q3 - qs.Q1
qs["upper"] = qs.Q3 + 1.5 * iqr
qs["lower"] = qs.Q1 - 1.5 * iqr
df = pd.merge(df, qs, on="MONTH", how="left")

df

| | MONTH | TMIN | TMAX | TAVG | Q1 | Q2 | Q3 | upper | lower |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Jan | 15.0 | 36.0 | 25.5 | 23.00 | 27.5 | 31.5 | 44.250 | 10.250 |
| 1 | Jan | 18.0 | 39.0 | 28.5 | 23.00 | 27.5 | 31.5 | 44.250 | 10.250 |
| 2 | Jan | 15.0 | 32.0 | 23.5 | 23.00 | 27.5 | 31.5 | 44.250 | 10.250 |
| 3 | Jan | 15.0 | 27.0 | 21.0 | 23.00 | 27.5 | 31.5 | 44.250 | 10.250 |
| 4 | Jan | 21.0 | 40.0 | 30.5 | 23.00 | 27.5 | 31.5 | 44.250 | 10.250 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 361 | Dec | 23.0 | 48.0 | 35.5 | 18.75 | 27.5 | 33.5 | 55.625 | -3.375 |
| 362 | Dec | 29.0 | 47.0 | 38.0 | 18.75 | 27.5 | 33.5 | 55.625 | -3.375 |
| 363 | Dec | 25.0 | 45.0 | 35.0 | 18.75 | 27.5 | 33.5 | 55.625 | -3.375 |
| 364 | Dec | 21.0 | 49.0 | 35.0 | 18.75 | 27.5 | 33.5 | 55.625 | -3.375 |


### Plotting

In [3]:
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, show

In [4]:
# create figure object
p = figure(
    title="Figure 9.3",
    x_range=df.MONTH.unique(),
    toolbar_location=None,
    height=400,
    width=600,
    x_axis_label="month",
    y_axis_label="mean temperature (F)",
)

# create column data source from the monthly summary (one row per box)
source = ColumnDataSource(qs)

# create whisker object and add it to figure
whisker = Whisker(base="MONTH", upper="upper", lower="lower", source=source)
whisker.upper_head.size = whisker.lower_head.size = 20
p.add_layout(whisker)

# create boxplot using two vbar() glyphs
p.vbar(
    x="MONTH",
    top="Q2",
    bottom="Q1",
    width=0.8,
    source=source,
    color="#E0E0E0",
    line_color="black",
)

p.vbar(
    x="MONTH",
    top="Q3",
    bottom="Q2",
    width=0.8,
    source=source,
    color="#E0E0E0",
    line_color="black",
)

# plot outliers using scatter() glyph
outliers = df[~df.TAVG.between(df.lower, df.upper)]
p.scatter("MONTH", "TAVG", source=outliers, size=5, color="black")

# customize y-axis of plot
p.y_range.start = -10
p.yaxis.ticker = [0, 25, 50, 75]
p.grid.grid_line_color = None

show(p)

## Sina plot

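The plot in this sub-section shows the mean daily temperatures in Lincoln, Nebraska, in 2016, using the same file as the boxplot. Each month's density is drawn as a violin, with the individual daily values jittered horizontally on top.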

### Data preparation

In [5]:
import numpy as np

# create a list of average temperature values for each month in the dataframe
values = []
cats = list(df.MONTH.unique())
for cat in cats:
    month = df[df["MONTH"] == cat]
    m_values = month.TAVG.values
    m_values = np.nan_to_num(m_values, nan=0)
    values.append(m_values)

positions = np.linspace(-1, 90, 500)
source = ColumnDataSource(df)

### Plotting

In [6]:
from sklearn.neighbors import KernelDensity
from bokeh.transform import jitter

In [7]:
# create figure object
p = figure(
    title="Figure 9.8",
    x_range=df.MONTH.unique(),
    toolbar_location=None,
    height=400,
    width=500,
    x_axis_label="month",
    y_axis_label="mean temperature (F)",
)

# create an offset for each category in the data


def violin(category, data, scale=5):
    return list(zip([category] * len(data), scale * data))


# calculate the KDE for each month and plot the data
for cat, value in zip(cats, values):
    kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(value[:, np.newaxis])
    log_density = kde.score_samples(positions[:, np.newaxis])
    x1, x2 = violin(cat, np.exp(log_density)), violin(cat, -(np.exp(log_density)))
    p.harea(x1=x1, x2=x2, y=positions, alpha=0.8, color="#E0E0E0")

# create a scatter plot of the average temperature for each day, jittered
# horizontally within its month; this sits outside the loop so that the
# points are drawn only once
p.scatter(
    x=jitter("MONTH", width=0.3, range=p.x_range),
    y="TAVG",
    source=df,
    color="black",
)

# customize plot
p.y_range.start = -10
p.yaxis.ticker = [0, 25, 50, 75]
p.grid.grid_line_color = None

show(p)
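
To make the KDE step less of a black box, here is a NumPy-only sketch of what a Gaussian kernel density estimate with a fixed bandwidth computes. The function name `gaussian_kde_1d` and the sample temperatures are assumptions for illustration; the notebook itself relies on scikit-learn's `KernelDensity`, whose `score_samples()` returns the logarithm of the density.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)

    # standardized distance between every grid point and every sample
    z = (grid[:, np.newaxis] - samples[np.newaxis, :]) / bandwidth

    # average the Gaussian bumps centred on each sample
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


# made-up daily temperatures for one month
temps = np.array([21.0, 23.5, 25.5, 28.5, 30.5, 35.0])
positions = np.linspace(-1, 90, 500)  # same grid as the notebook
density = gaussian_kde_1d(temps, positions, bandwidth=3)

# half-width of the violin at each temperature, mirrored left and right
scale = 5
right, left = scale * density, -scale * density

# the area under a density curve should be close to 1
step = positions[1] - positions[0]
area = float(density.sum() * step)
print(round(area, 3))
```

The mirrored `right` and `left` arrays play the role of the `x1` and `x2` offsets passed to `harea()`, once each value is paired with its month.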

## Ridgeline plot

The plot in this sub-section shows voting patterns in the U.S. House of Representatives over time. DW-NOMINATE scores are frequently used to compare voting patterns between parties and over time. Here, score distributions (the `dim_1` column in the dataframe) are shown separately for Democrats and Republicans for each Congress from 1963 to 2013. Each Congress is represented by its first year.
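
Converting a Congress number to its first year is simple arithmetic, since each Congress lasts two years and the 1st Congress began in 1789. A small sketch follows; the helper name `congress_to_year` is hypothetical, and the notebook performs the same calculation inline on the `congress` column.

```python
def congress_to_year(congress):
    """First year of a given Congress (the 1st Congress began in 1789)."""
    return congress * 2 + 1787


# the 88th and 113th Congresses bracket the range used in the plot
for congress in (1, 88, 113):
    print(congress, congress_to_year(congress))
```

The 88th Congress maps to 1963 and the 113th to 2013, matching the range described above.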

The `patch()` glyph is used for the plot.
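
On a categorical axis, Bokeh accepts coordinates as `(factor, offset)` tuples, which is what lets each density curve sit on its own year's baseline. The sketch below builds such coordinates for a made-up bell-shaped curve. The curve is an assumption for illustration, and the helper mirrors the `ridge()` function defined in the data preparation.

```python
import numpy as np


def ridge(category, density, scale=2):
    """Pair every density value with its category as (factor, offset)."""
    return [(category, float(scale * d)) for d in density]


x = np.linspace(-1.5, 1.5, 5)

# made-up bell-shaped curve standing in for a KDE
density = np.exp(-0.5 * (x / 0.5) ** 2)

y = ridge("1963", density)
for x_value, y_value in zip(x, y):
    print(x_value, y_value)
```

Passing `x` and `y` built this way to `patch()` on a figure whose `y_range` contains `"1963"` would draw the curve rising from that year's row.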

### Data preparation

In [8]:
file = "../data/csv_files/dw_nominate_house.csv"

df = pd.read_csv(file)

# add the first year of each Congress (the 1st Congress began in 1789 and each lasts two years)
df["year"] = (df.congress) * 2 + 1787

# keep only the rows from 1963 onwards that belong to the two parties of interest
year = df["year"] >= 1963
parties = (df["party_code"] == 100) | (df["party_code"] == 200)
dn = (df["cd"] != 0) & (df["cd"] != 98) & (df["cd"] != 99)

df = df[year & parties & dn].reset_index(drop=True)

# create two dataframes for both political parties
dems = df[df["party_code"] == 100]
repubs = df[df["party_code"] == 200]

In [9]:
from bokeh.models import ColumnDataSource
import numpy as np
from scipy.stats import gaussian_kde

# create a column data source containing the KDE data for each year


def tweak_df(df):
    """
    Calculate Kernel Density Estimates (KDE) for each year in the input DataFrame.

    Parameters:
        df (pd.DataFrame): Input DataFrame with columns "year" and "dim_1".

    Returns:
        bokeh.models.ColumnDataSource: A ColumnDataSource containing KDE data for each year.

    Raises:
        ValueError: If the input DataFrame does not have the required columns or if there's an issue with KDE calculation.
    """
    # Validate input DataFrame
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame.")
    if "year" not in df.columns or "dim_1" not in df.columns:
        raise ValueError("Input DataFrame must have columns 'year' and 'dim_1'.")

    # Pivot the DataFrame
    try:
        new_df = df.pivot_table(
            index=df.groupby("year").cumcount(),
            columns="year",
            values="dim_1",
            aggfunc="mean",  # string alias avoids the pandas warning raised for np.mean
        )
    except Exception as e:
        raise ValueError(f"Error occurred during DataFrame pivoting: {str(e)}")

    # Reset index and column names; years with fewer members stay padded with NaN
    new_df.reset_index(drop=True, inplace=True)
    new_df.columns.name = None
    new_df.columns = new_df.columns.astype(str)

    # Calculate KDE for each column and create plotting ridge
    x = np.linspace(-1.5, 1.5, 500)
    source = ColumnDataSource(data=dict(x=x))

    def ridge(category, data, scale=2):
        return list(zip([category] * len(data), scale * data))

    try:
        for col in new_df.columns:
            # drop the NaN padding so it does not distort the density
            # (filling it with 0 would add an artificial bump at a score of 0)
            pdf = gaussian_kde(new_df[col].dropna())
            y = ridge(col, pdf(x))
            source.add(y, col)
    except Exception as e:
        raise ValueError(f"Error occurred during KDE calculation: {str(e)}")

    return source
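
The pivot inside `tweak_df()` lines up each year's scores in its own column by numbering the rows within each year. Years with fewer members end up padded with `NaN`. A small sketch with made-up scores (the `toy` dataframe and its values are assumptions for illustration):

```python
import pandas as pd

# made-up scores: three members in 1963, two in 1965
toy = pd.DataFrame(
    {
        "year": [1963, 1963, 1963, 1965, 1965],
        "dim_1": [-0.4, -0.2, 0.1, 0.3, 0.5],
    }
)

wide = toy.pivot_table(
    index=toy.groupby("year").cumcount(),  # 0, 1, 2, ... within each year
    columns="year",
    values="dim_1",
    aggfunc="mean",
)
print(wide)

# number of padded entries per year
print(wide.isna().sum())
```

Dropping those `NaN` entries per column before estimating a density keeps the padding from influencing the curve.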

### Plotting

In [10]:
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure


def plot_ridges(df1, df2=None):
    """
    Plot ridges for one or two ColumnDataSources.

    Parameters:
        df1 (bokeh.models.ColumnDataSource): ColumnDataSource containing data for the first category.
        df2 (bokeh.models.ColumnDataSource, optional): ColumnDataSource containing data for the second category.

    Returns:
        bokeh.plotting.figure: A Bokeh figure showing the ridges for both categories.
    """
    # Input validation
    if not isinstance(df1, ColumnDataSource):
        raise ValueError(f"{df1} must be a valid ColumnDataSource")

    if df2 is not None and not isinstance(df2, ColumnDataSource):
        raise ValueError(f"{df2} must be a valid ColumnDataSource")

    # Get the list of categories from the data keys
    cats = list(reversed(df1.data.keys()))[:-1]

    # Create a Bokeh figure with specified attributes
    p = figure(
        title="Figure 9.12",
        x_axis_label="DW-NOMINATE score",
        y_axis_label="year",
        toolbar_location=None,
        x_range=(-0.75, 1.5),
        y_range=cats,
        height=400,
        width=600,
    )

    # Plot ridges for each category
    for cat in cats:
        # Plot ridges for the first category
        p.patch(
            x="x",
            y=cat,
            source=df1,
            fill_color="blue",
            line_color="white",
            legend_label="Democrats",
            alpha=0.5,
        )

        # Plot ridges for the second category if df2 is provided
        if df2 is not None:
            # Input validation: Check if df2 has the same keys as df1
            if not set(df2.data.keys()) == set(df1.data.keys()):
                raise ValueError(f"{df2} must have the same keys as {df1}")

            p.patch(
                x="x",
                y=cat,
                source=df2,
                fill_color="red",
                line_color="white",
                legend_label="Republicans",
                alpha=0.5,
            )

    # Customize axis tick positions
    p.xaxis.ticker = [-0.75, -0.5, -0.25, 0.00, 0.25, 0.5, 0.75, 1.00]
    # p.yaxis.ticker = ["2013", "2003", "1993", "1983", "1973", "1963"]

    return p

In [11]:
# create the ridgeline plot for both parties
r_cds = tweak_df(repubs)
d_cds = tweak_df(dems)
ridge = plot_ridges(d_cds, r_cds)

show(ridge)