# Swarmplot alternative

This notebook demonstrates the use of animation in a plot. We'll compare a seaborn swarmplot with an animated Holoviews scatter plot.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import random

import holoviews as hv

hv.extension('bokeh')

In these unusual times, people are encouraged to check in with the "Check In CBR" app to assist contact tracing. We'd like to visualise those checkins to see where people from different areas of Canberra are checking in.

The data is aggregated so all we know about is where people are from: Belconnen, Gungahlin, or Tuggeranong, and the general areas that they check in to.

Let's generate some data. (Don't look at this code too closely.)

In [None]:
np.random.seed(271828) # Make it repeatable.
random.seed(271828)    # random.choice() below uses Python's own generator, so seed that too.

NDAYS = 28
DAY0 = pd.Timestamp('2021-09-01')
DATES = pd.date_range('2021-09-01', periods=NDAYS, freq='1D')
PLACES = ['Tuggeranong Mall', 'Canberra Centre', 'Belconnen Town Centre', 'Bunnings Gungahlin']

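def biased_places(place):
    # Every place appears once, plus five extra copies of the favoured place,
    # so random.choice() picks the favoured place more often.
    return PLACES + [place] * 5

def generate_checkins(n, area, roll, place):
    # Draw n normal samples, rescale them to span 0..NDAYS-1, then shift by `roll` days.
    a = np.random.randn(n)
    a -= a.min()
    a = a * ((NDAYS-1)/a.max()) + roll
    bp = biased_places(place)
    # Keep only the samples that still fall inside the window after the shift.
    events = [(area, i, (DAY0+pd.Timedelta(f'{i}d')).round('1s'), random.choice(bp)) for i in a if 0<=i<=NDAYS]
    return events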

checkins = (
    generate_checkins(500, 'Belconnen', -14, PLACES[2]) +
    generate_checkins(500, 'Belconnen', 14, PLACES[1]) +
    generate_checkins(400, 'Tuggeranong', 14, PLACES[0]) +
    generate_checkins(300, 'Gungahlin', 0, PLACES[3]) + 
    generate_checkins(200, 'Gungahlin', 10, PLACES[1]) 
)
df = pd.DataFrame.from_records(checkins, columns=['From', 'Days', 'Date', 'Place']).sort_values('Place')
# df

We can visualise the data using a seaborn swarmplot. (For some reason, a swarmplot takes forever to draw if the x-axis is a date, so I've substituted "days from first date" instead.)

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.swarmplot(data=df, x='Days', y='Place', hue='From', ax=ax)

Swarmplots are pretty neat. However, as the amount of data grows, they become harder to read. The different colored dots are intermingled, making it harder to compare counts (especially when the colors are similar). And if you go back to the data-generating cell and add another zero to the `generate_checkins()` values (change 500 to 5000, etc.), the swarmplot just runs out of room to put dots and displays warning messages. (Try it.)

Let's see if we can do something with Holoviews. Holoviews doesn't have a swarmplot, so we'll have to try something different.

(It should be noted that there is a different solution that copes with very large numbers, based on a library called `datashader`, but that solution is on the heavyweight side for this example.)

We'll start by grouping the data by `From`, `Day`, and `Place`, and getting the size of each group. We can then create a scatter plot.

In [None]:
df['Day'] = df['Date'].dt.round('1d')
df_days = df.groupby(['From', 'Day', 'Place'], as_index=False).size().sort_values(['Place', 'From'], ascending=[False, True])
df_days
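
To see what the grouping step produces, here is a minimal sketch on a hypothetical four-row frame (the `toy` data is invented for illustration and is not the notebook's generated data). It also shows that `dt.round` goes to the nearest midnight, so an afternoon check-in is counted on the following day.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's generated check-ins.
toy = pd.DataFrame({
    'From': ['Belconnen', 'Belconnen', 'Belconnen', 'Gungahlin'],
    'Place': ['Canberra Centre', 'Canberra Centre', 'Canberra Centre', 'Bunnings Gungahlin'],
    'Date': pd.to_datetime([
        '2021-09-01 08:00', '2021-09-01 09:30', '2021-09-01 15:00', '2021-09-01 10:00',
    ]),
})

# round('D') goes to the *nearest* midnight, so the 15:00 check-in lands on 2 September.
toy['Day'] = toy['Date'].dt.round('D')

# One row per (From, Day, Place) combination, with the group size in a 'size' column.
toy_days = toy.groupby(['From', 'Day', 'Place'], as_index=False).size()
print(toy_days)
```

If check-ins should stay on the calendar day they happened, `dt.floor` would be the thing to try instead of `dt.round`.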

In [None]:
# Use the same colors as the seaborn swarmplot.
# The colors won't look exactly the same in this plot,
# because I'm using alpha=0.75 to try and make all of the dots visible.
#
cmap = {'Belconnen':'#4878d0', 'Gungahlin':'#ee854a', 'Tuggeranong':'#6acc64'}

hv.Scatter(df_days, 'Day', ['Place', 'From', 'size']).opts(
    width=800, height=400,
    color='From',
    size='size',
    alpha=0.75,
    cmap=cmap,
    legend_position='top'
)

That's not much good. Because all the dots are in the same places, they cover each other up, so we're really only seeing the colors with the largest dots.

The problem with both the seaborn swarmplot and the Holoviews scatter plot is that we're trying to look at four-dimensional data (`Day`, `Place`, `From`, `size/count`) in two spatial dimensions, and using color and a bunch of dots to squeeze the other data dimensions in.

One way to reduce the clutter is to give each area its own plot:

- First we use a list comprehension to split `df_days` and create a list of tuples, where each tuple is (area, dataframe for that area).
- Then we use another list comprehension to create a scatter plot using each tuple.
- Finally, we draw each of the scatter plots.

If we have three plot elements `p1`, `p2`, and `p3`, we could draw them side-by-side using `p1 + p2 + p3`, which is a shortcut for putting the elements in a list and using `hv.Layout([p1, p2, p3])`. Because we already have a list of elements `plots`, we can draw them with `hv.Layout(plots)`. Then `.cols(1)` draws the plots in a single column.

The options are applied a little differently here. The only option attached to the `hv.Scatter()` in the list comprehension is the color for that area. When the layout is drawn, `hv.opts.Scatter()` is used to apply those options to all of the Scatter elements.

In [None]:
areas = [(area, df_days[df_days['From']==area]) for area in ['Belconnen', 'Gungahlin', 'Tuggeranong']]

plots = [hv.Scatter(from_df, 'Day', ['Place', 'size']).opts(color=cmap[area]) for area,from_df in areas]

hv.Layout(plots).cols(1).opts(
    hv.opts.Scatter(
        width=800, height=400,
        size='size',
        alpha=0.75
    )
)
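
The split into (area, dataframe) tuples can also be written with `groupby`. This is only a sketch on a hypothetical stand-in for `df_days` (the `toy_days` values are made up); it checks that both spellings give the same sub-frames.

```python
import pandas as pd

# Hypothetical stand-in for df_days.
toy_days = pd.DataFrame({
    'From': ['Belconnen', 'Gungahlin', 'Belconnen', 'Tuggeranong'],
    'Place': ['Canberra Centre', 'Bunnings Gungahlin',
              'Belconnen Town Centre', 'Tuggeranong Mall'],
    'size': [12, 7, 5, 9],
})

# Boolean-mask version, as in the notebook cell.
by_mask = [(area, toy_days[toy_days['From'] == area])
           for area in ['Belconnen', 'Gungahlin', 'Tuggeranong']]

# groupby version: iterating a single-key groupby yields (key, sub-frame) pairs,
# sorted by key.
by_group = [(area, sub) for area, sub in toy_days.groupby('From')]

for (a1, d1), (a2, d2) in zip(by_mask, by_group):
    print(a1, a2, d1.equals(d2))
```

The explicit list of area names in the mask version has one advantage: it fixes the plot order, whatever order the areas appear in the data.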

Just for fun, we could overlay the individual plots and get back the original plot. (Compare this with the original plot.)

In [None]:
hv.Overlay(plots).opts(
    hv.opts.Scatter(
        width=800, height=400,
        size='size',
        alpha=0.75
    )
)

What we really want is to draw all three areas on the same plot, but be able to look at them one at a time. We can do that using a `HoloMap`, which is just a dictionary of Holoviews elements.

The `areas` list we created above is a list of (area, dataframe) tuples; turning that into a dictionary {area:scatterplot} is trivial.

A `HoloMap` is basically a visual (Python) dictionary; which plot you see depends on the dictionary key. Therefore, the key dimension (kdim) for this `HoloMap` is the area. The name we give that dimension is just a label, so we can call it whatever we want.
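
The dictionary idea can be sketched in plain Python. Here strings stand in for the `hv.Scatter` elements (a simplification for illustration only), and picking an entry in the drop-down corresponds to a key lookup.

```python
# Hypothetical stand-ins: strings play the role of the hv.Scatter elements.
areas = [('Belconnen', 'scatter for Belconnen'),
         ('Gungahlin', 'scatter for Gungahlin'),
         ('Tuggeranong', 'scatter for Tuggeranong')]

# List of (key, value) tuples -> dictionary keyed by area.
plot_map = {area: element for area, element in areas}

# Choosing an entry in the drop-down corresponds to a key lookup.
selected = 'Gungahlin'
print(plot_map[selected])  # scatter for Gungahlin
```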

(Depending on how big your browser window is, you might have to either change the width to be narrower, or make your browser bigger, then rerun the cell so you can see everything.)

In [None]:
plot_map = {area:hv.Scatter(from_df, 'Day', ['Place', 'size']).opts(color=cmap[area]) for area,from_df in areas}

holomap = hv.HoloMap(plot_map, kdims='Checkins from area').opts(
    hv.opts.Scatter(
        width=800, height=400,
        size='size',
        alpha=0.75
    )
)
holomap

Alternatively, we can type a little more code and move the widget.

In [None]:
hv.output(holomap, widget_location='top_left')

Holoviews does all the right things for us. It adds a selection drop-down and populates it with the area keys, it redraws the correct scatter plot when we select a new area, and it even puts the right title on the plot. And unlike the swarmplot, it can handle any number of events (as long as we tweak the size to handle them).
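
The notebook leaves the size tweak open. One possible approach, sketched here as an assumption rather than something the notebook does, is to scale the marker size with the square root of the count and cap it. This assumes the `size` option behaves like a marker diameter, so that marker area roughly tracks the count. `marker_size`, `scale`, and `max_size` are hypothetical names.

```python
import math

def marker_size(count, scale=2.0, max_size=40.0):
    """Map a check-in count to a marker size (hypothetical helper)."""
    # Square root keeps marker *area* roughly proportional to the count;
    # the cap stops very large groups from swamping the plot.
    return min(scale * math.sqrt(count), max_size)

for count in [1, 25, 400, 10000]:
    print(count, marker_size(count))
```

The resulting values could be stored in a new column of the grouped dataframe and that column name passed as the `size` option, in the same way the `size` column is used above.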

## Time as the key dimension

Since one of the dimensions is time, and we expect time to move, we could take time out of the plot, and change the plot over time. We'll create another HoloMap, but using the date as the key dimension instead of the area.

Let's split up the grouped data by day, and have a look at one of the days. We'll use the same y-axis as before ("where people are checking in"), but this time the x-axis is "the areas that people are coming from". (I've also bumped the size up a bit.)

In [None]:
per_day = [df_days[df_days['Day']==d] for d in DATES]

d = -1
print(per_day[d])
hv.Scatter(per_day[d], 'From', ['Place', 'size']).opts(
    width=500, height=400,
    color='From',
    size=hv.dim('size')*2,
    cmap=cmap,
    show_legend=False
)

Comparing that with the same day of the swarmplot, things match up; lots of people from Belconnen going to the Canberra Centre, and plenty of people from Tuggeranong going to the local mall.

We'll create a scatter plot from each day and store them in a HoloMap, keyed by date.
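
Before building the per-day plots, it may help to see how keying by `DATES` behaves. The sketch below uses a shortened date range and a made-up stand-in for `df_days` (both hypothetical). Dates with no check-ins produce empty frames, and rows whose `Day` is not in `DATES` never appear under any key. In the notebook's data that can happen because `Day` is rounded to the nearest day, so a check-in late on the final day may round to one day past the end of `DATES`.

```python
import pandas as pd

# Shortened, hypothetical date range.
DATES = pd.date_range('2021-09-01', periods=3, freq='1D')

# Hypothetical stand-in for df_days; note the last row falls outside DATES.
toy_days = pd.DataFrame({
    'From': ['Belconnen', 'Gungahlin', 'Belconnen'],
    'Day': pd.to_datetime(['2021-09-01', '2021-09-01', '2021-09-04']),
    'size': [3, 1, 2],
})

# One sub-frame per date; dates without check-ins give empty frames.
per_day = {d: toy_days[toy_days['Day'] == d] for d in DATES}
for d, sub in per_day.items():
    print(d.date(), len(sub))

# Rows whose Day is not in DATES never appear in any sub-frame.
missed = toy_days[~toy_days['Day'].isin(DATES)]
print(len(missed))
```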

In [None]:
def scatter_day(d):
    per_day = df_days[df_days['Day']==d]
    return hv.Scatter(per_day, 'From', ['Place', 'size']).opts(
        width=800, height=400,
        color='From',
        size=hv.dim('size')*2,
        cmap=cmap,
        show_grid=True,
        show_legend=False
    )

scatters = {d:scatter_day(d) for d in DATES}
hv.output(hv.HoloMap(scatters, kdims='Date'), widget_location='top_left')

Holoviews has recognised that the key is a date, and automatically added a date selector to the plot. You can move the slider to see what's happening on each day.

For any given day, it's easier to compare all of the areas that people come from vs all the places where checkins are happening, because both of those are in the same plot.