# Pattern of life

In this notebook we'll build a pattern-of-life visualisation. We take a collection of timestamps recording when some event happened (a bird was spotted, a shopper entered a store) and create a heatmap with the days of the week on one axis, the hours of the day on the other, and the number of events in each (day, hour) cell shown as a color.

In this way it becomes easy to spot patterns: birds mostly appearing at sunrise and sunset, or shoppers shopping after work on weekdays, but in the mornings on weekends.

We'll also play with pandas dataframes along the way.

If all you want is the visualisation without the journey, skip to the end and use the `pattern_of_life()` function. (And you'll need these imports.)

In [None]:
import pandas as pd
import holoviews as hv

hv.extension('bokeh')

## How to do it

We start with a set of event timestamps.
- From each timestamp, extract the day of week and hour of day. (This is easy with a pandas dataframe.)
- Group by the (day, hour) combination and count the size of each group. (Also easy with a pandas dataframe.)
- Using the days and hours as axes, and the group sizes as values, draw the heatmap.

Let's whip up a quick example. We'll create the processed data and plot it to see what it looks like.

## Manipulate the data

We start with a set of event timestamps, but to draw the heatmap, we need a dataframe with columns "day of the week", "hour of the day", and "events per day/hour". How do we get there from here? Pandas dataframes to the rescue.

Timestamp columns have a `dt` property that provides a whole lot of date/time methods. Let's create a dataframe with a datetime column and investigate.

In [None]:
# Create 5 timestamps with a frequency of 23 hours.
#
df = pd.DataFrame({'dtg': pd.date_range('2021-09-06', periods=5, freq='23H')})
df

In [None]:
# What can we do with a timestamp?
#
print([name for name in dir(df['dtg'].dt) if not name.startswith('_')])

That's a lot of methods.

The ones that we're interested in are the `day_name()` method and the `hour` attribute.

The `day_name()` method returns the day of the week of a timestamp.

In [None]:
df['Day'] = df['dtg'].dt.day_name()
df

We really only want the day abbreviations, ie the first three letters. There must be a string method for that.

In [None]:
# What can we do with a string?
#
print([name for name in dir(df['Day'].str) if not name.startswith('_')])

That's also a lot of methods.

In plain Python, if we wanted the first three characters of a string, we'd use a slice.

In [None]:
s = 'Hello world.'
print(s[:3])

We can't apply that slice directly to a dataframe column (because pandas takes over the `[]` and uses it to select rows instead), but the `.str` accessor has a method called `slice` that does the same thing to each string. We can combine that with the `day_name()` method.

In [None]:
df['Day'] = df['dtg'].dt.day_name().str.slice(0, 3)
df
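
As an aside, the `.str` accessor also accepts a slice in square brackets, so `.str[:3]` should give the same result as `.str.slice(0, 3)`. A minimal sketch, using a throwaway `demo` dataframe (a name introduced only for this example):

```python
import pandas as pd

# A throwaway dataframe for illustration only; 2021-09-06 was a Monday.
demo = pd.DataFrame({
    'dtg': pd.to_datetime(['2021-09-06 00:00', '2021-09-06 23:00', '2021-09-07 22:00'])
})

names = demo['dtg'].dt.day_name()

# Two spellings of "take the first three characters of every string".
sliced = names.str.slice(0, 3)
indexed = names.str[:3]

print(sliced.equals(indexed))
print(indexed.tolist())
```

Which spelling to use is a matter of taste; the rest of this notebook sticks with `.str.slice(0, 3)`.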

All we need now is the hour.

In [None]:
df['Hour'] = df['dtg'].dt.hour
df

Now we just need to count how many times each combination of `Day` and `Hour` appears. This is done by grouping on those columns and getting the size, ie counting how many of each combination there are. (We don't need the `dtg` column any more.)

In [None]:
group_df = df.groupby(['Day', 'Hour'], as_index=False).size()
group_df

Which is just what we need for making a heatmap. (The `size` column comes from the `size()` method; it's as good a name as any.)

## Holoviews HeatMap

Let's investigate the Holoviews HeatMap element. We'll start by creating an example input and plotting it.

In [None]:
df = pd.DataFrame({
    'Day': ['Tue', 'Thu', 'Thu'],
    'Hour': [9, 12, 17],
    'size': [1, 2, 3]
})
df

In [None]:
hv.HeatMap(df, kdims=['Day', 'Hour'], vdims='size')

Blergh. There's a lot missing there. We expect to see all of the days and hours, but `hv.HeatMap()` can only draw what we give it. This means we have to add everything else ourselves. In this case, not only do we have to fill out the missing days with 24 zero values each, but also the missing hours in the days that we have.

We need to build a dataframe with `Day` and `Hour` columns containing every combination of day and hour. To do that, we'll use an inside-out trick.

The `pd.DataFrame.from_records()` function takes a sequence of tuples and a list of column names. We could build a list of tuples...
```
[('Mon', 0), ('Mon', 1), ('Mon', 2), ..., ('Mon', 23), ('Tue', 0), ..., ('Sun', 23)]
```

...and then build a dataframe from that list. Then we have a list sitting around doing nothing. Instead, we create a _generator function_ (a function that contains a `yield` statement) that generates rows one at a time, and give that to `pd.DataFrame.from_records()`. Doing it this way is handy if you're retrieving lots of rows from a database query, or reading some data that you have to change on the fly, and you don't want to keep track of another list.

We only need to create the day and hour columns, because we can easily add a column of all zero values to a dataframe.

In [None]:
DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

def generate_pol_zero():
    """Generate pattern-of-life data with all zero values."""
    
    for day in DAYS:
        for hour in range(24):
            yield day, hour
            
zero_df = pd.DataFrame.from_records(generate_pol_zero(), columns=['Day', 'Hour'])
zero_df['size'] = 0
zero_df

In this case, though, we know exactly what the input list looks like, so we could do it like this.

In [None]:
zero_df = pd.DataFrame.from_records(
    [(day, hour) for day in DAYS for hour in range(24)],
    columns=['Day', 'Hour']
)
zero_df['size'] = 0
zero_df

Either way, we get the same dataframe, and the number of rows is $24 \times 7 = 168$ as expected.
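
A third way of building the same grid, sketched here as an alternative rather than something the rest of the notebook relies on, is `itertools.product` from the standard library. It also makes the row count easy to check. The name `grid_df` is introduced just for this example.

```python
from itertools import product

import pandas as pd

DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# product() yields every (day, hour) pair, with the hour varying fastest.
grid_df = pd.DataFrame(list(product(DAYS, range(24))), columns=['Day', 'Hour'])
grid_df['size'] = 0

# 7 days x 24 hours = 168 rows.
print(len(grid_df))
```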

Now we have a dataframe containing zero sightings of birds, shoppers, or anything. Let's try creating a heatmap to see if we have the right format.

In [None]:
hv.HeatMap(zero_df, kdims=['Hour', 'Day'], vdims='size')

Ignoring the horrible style, it has all of the days, all of the hours, and a uniform color (because the sizes are all zero). Excellent.

How do we combine that with the `df` dataframe that does contain counts?
- Concatenate the dataframes using `pd.concat()`, putting the real data first.
- Remove the rows that have duplicate (day, hour) values.

There should be 168 rows left.

In [None]:
# Concatenate the two dataframes.
#
full_df = pd.concat([df, zero_df])

# Drop duplicate (day, hour) rows, keeping the first row.
#
full_df = full_df.drop_duplicates(subset=['Day', 'Hour'], keep='first', ignore_index=True)

full_df
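
Concatenating and dropping duplicates is one way to fill the gaps. Another, sketched here as an alternative and not the approach used in the rest of this notebook, is to index the counts by (day, hour) and reindex them against the full grid, filling the missing combinations with zero. The names `counts_df`, `full_index`, and `filled_df` are introduced just for this example.

```python
import pandas as pd

DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# The same example counts as the small dataframe used above.
counts_df = pd.DataFrame({
    'Day': ['Tue', 'Thu', 'Thu'],
    'Hour': [9, 12, 17],
    'size': [1, 2, 3]
})

# Every (day, hour) combination, in weekday order.
full_index = pd.MultiIndex.from_product([DAYS, range(24)], names=['Day', 'Hour'])

# Look up each combination in the counts; combinations with no count become 0.
filled_df = (
    counts_df.set_index(['Day', 'Hour'])['size']
    .reindex(full_index, fill_value=0)
    .reset_index()
)

print(len(filled_df), filled_df['size'].sum())
```

With this approach the rows come out in the order of `full_index`, ie Monday to Sunday, rather than with the real data first.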

That looks like what we want. Let's see what the heatmap looks like now (with the original dataframe for comparison).

In [None]:
print(df)
hv.HeatMap(full_df, kdims=['Hour', 'Day'], vdims='size')

Ignoring the terrible style, the terrible colors, and the fact that the days are out of order, that seems fine: the counts in the dataframe match the colors in the heatmap (assuming darker means a higher count).

Now we can go ahead and start fixing the presentation.

## Label order

Let's fix the days first. Why are they out of order? By default, Holoviews will order the axes in whatever order your data is in. If you look back at `full_df`, you'll notice that the first few rows are whatever was in the original `df`, followed by the rest of the zero value rows. Since the rows start with `'Tue', 'Thu'`, that's what Holoviews uses; exactly what we asked for, even if we didn't realise it. 🙁

We can't fix it by using `full_df.sort_values()` to sort the rows, because then we'd get the days of the week in alphabetical order, and that's not right.

There are a couple of ways of fixing this.

Remember that Holoviews elements need key dimensions (`kdims`) and value dimensions (`vdims`). For `hv.HeatMap` the kdims are `Hour` and `Day`, and the vdim is `size` (because the size value depends on both Hour and Day). We can tell Holoviews not only which values the `Day` dimension has, but also the order those values are in.

When we use `kdims='name'`, we're actually using a shortcut for `kdims=hv.Dimension('name')`. We can use the `values` parameter of `hv.Dimension()` to specify the order of the day names. When Holoviews draws the heatmap, it will ensure that the `Day` dimension appears in that order. (We'll assign the `HeatMap` element to `pol_heatmap` so we can keep using it.)

In [None]:
day_dim = hv.Dimension('Day', values=DAYS)
pol_heatmap = hv.HeatMap(full_df, kdims=['Hour', day_dim], vdims='size')
pol_heatmap

That worked nicely (ignoring the "going up" order for now).

Another way of fixing the order is to do it in the dataframe. Day names are _categorical_: they have a fixed number of values. By converting the day names from string type to categorical type, your dataframe becomes a little more efficient (probably not a problem for this small dataframe), and the sorting order can be maintained.

In [None]:
tmp_df = full_df.copy()  # Copy, so the extra column doesn't end up in full_df.
tmp_df['Day_cat'] = pd.Categorical(tmp_df['Day'], DAYS, ordered=True)
tmp_df = tmp_df.sort_values(['Day_cat', 'Hour'])
tmp_df

Now that the dataframe is sorted in `Day_cat` categorical order, we can let Holoviews draw the axes in the order that they appear. This heatmap will be the same as the one above.

In [None]:
hv.HeatMap(tmp_df, kdims=['Hour', 'Day_cat'], vdims='size')
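
To see the difference between the two kinds of sorting in isolation, here is a small sketch using a handful of day names (the `days` series is made up for this example). Plain strings sort alphabetically, while an ordered categorical sorts in the order of its categories.

```python
import pandas as pd

DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# A few day names in no particular order.
days = pd.Series(['Tue', 'Thu', 'Mon', 'Sun'])

# Plain strings sort alphabetically.
alphabetical = days.sort_values().tolist()
print(alphabetical)  # ['Mon', 'Sun', 'Thu', 'Tue']

# An ordered categorical sorts in the order given by its categories.
days_cat = pd.Series(pd.Categorical(days, categories=DAYS, ordered=True))
weekday = days_cat.sort_values().tolist()
print(weekday)  # ['Mon', 'Tue', 'Thu', 'Sun']
```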

Although categorical data can be useful, in this case we'll let Holoviews do the work; we'll use the `hv.Dimension()` code.

## Style

Now we'll go through the various style and plotting options one by one, starting with the size. Remember that using `.opts()` modifies the element, so we don't have to keep repeating previous options. We'll clear any existing options to start.

In [None]:
pol_heatmap.opts.clear()

### Size

Normally, we'd use `.opts(height=something, width=something)` to set the height and width. However, for this plot, we have a special requirement: we want the individual subelements of the heatmap to be squares, because it looks nicer that way.

The `height` and `width` parameters specify the width and height of the entire plot, including the axis labels, toolbar, and the other stuff that surrounds the actual plot. To specify the size of just the data part of the plot, we use `frame_width` and `frame_height`.

There are 24 values along the x-axis and 7 values up the y-axis. Therefore, to make sure the subelements are squares, we want the height to be 7/24 times the width. (The code uses integer division, `//`, so that the result is a whole number of pixels.)

In [None]:
pol_heatmap.opts(
    frame_width=600,
    frame_height=600*7//24
)
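
The arithmetic behind those two numbers can be checked in plain Python. This is only a worked example; the variable names are introduced here for illustration.

```python
frame_width = 600
n_hours, n_days = 24, 7

# Multiply first, then floor-divide: (600 * 7) // 24 = 175.
frame_height = frame_width * n_days // n_hours

# Each cell is then 25 pixels wide and 25 pixels high, ie square.
cell_width = frame_width / n_hours
cell_height = frame_height / n_days
print(frame_height, cell_width, cell_height)  # 175 25.0 25.0

# The order of operations matters: 7 // 24 on its own is 0.
print(frame_width * (n_days // n_hours))  # 0
```

Widths that are a multiple of 24 give exactly square cells; for other widths the height is rounded down, so the cells are very slightly wider than they are tall.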

### Axes

The x-axis does not number every major tick. There's plenty of room for 24 numbers, so we'll make this happen by setting `xticks` to a list of numbers. This has the beneficial side effect of removing the minor ticks.

The y-axis is upside down. Holoviews starts both axes from the bottom-left, so this is to be expected, but we'd like Monday to be at the top, because we read from the top down. Use the `invert_yaxis` parameter.

In [None]:
pol_heatmap.opts(
    xticks=list(range(24)),
    invert_yaxis=True
)

### Colors

The default colors are ... blech.

There are quite a few colormaps available to us. Colors are a subject in themselves: is the data categorical or continuous, can a colorblind user see the colors, are the colors uniform, etc, etc.

For a pattern-of-life heatmap, the counts form a sequence, so we'll go with a perceptually uniform sequential colormap, such as one of those below.

In [None]:
def cmap_examples(category, cols=4):
    import numpy as np
    from math import ceil
    cms = hv.plotting.util.list_cmaps(records=True, category=category, reverse=False)
    bars = [hv.Image(np.linspace(0, 1, 64)[np.newaxis], ydensity=1, label=f'{r.name}')
               .opts(
                   cmap=hv.plotting.util.process_cmap(r.name, provider=r.provider),
                   frame_width=172, aspect=6,
                   xaxis=None, yaxis=None,
                   fontsize={'title':10}
               )
           for r in cms
           if r.provider in ['bokeh', 'colorcet']]
    n = len(bars) * 1.0
    c = ceil(n/cols) if n>cols else cols
    return hv.Layout(bars).opts(transpose=(n>cols)).cols(c).opts(title=category)

cmap_examples('Uniform Sequential')

If you think they're a bit garish, you can try a mono sequential colormap, but they aren't perceptually uniform: in more complex plots, they can be misleading, and colorblind people won't necessarily see the right shading.

In [None]:
cmap_examples('Mono Sequential')

I'm going to go with Inferno here, but you can try other colormaps to see if you prefer another one.

The subelement squares don't quite meet, so the background shows through. You can choose a color that matches the base color of the colormap (black for Inferno) if you want a smooth background, or a contrasting color if you'd like to see a grid.

In [None]:
pol_heatmap.opts(
    cmap='Inferno',
    bgcolor='grey'
)

We have no idea what values the colors correspond to. This isn't a complete disaster, because the colors show relative values, but it would still be good to know at least the approximate values. We could add a hover tool, but we probably don't need to know the exact values, so a colorbar should do.

In [None]:
pol_heatmap.opts(
    colorbar=True
)

Finally we'll tidy up the surrounds.
- Remove most of the tools. There's no reason to pan or zoom (and therefore reset); the only tool needed is save.
- Change the axis labels. We could have done this using `hv.Dimension()` above, but it's easier to do it here.
- Add a title.

In [None]:
pol_heatmap.opts(
    default_tools=['save'],
    xlabel='Hour of day',
    ylabel='Day of week',
    title='Pattern of life'
)

And there it is, the finished product.

All we need to do now is wrap all of that up in a function, and add an example.

The only mandatory parameters are a dataframe, and the name of the column in the dataframe that holds the timestamps. A default width is used.

In [None]:
def pattern_of_life(data_df, dtg_col, width=600):
    """Create and style a Holoviews HeatMap element.
    
    Parameters
    ----------
    data_df: DataFrame
        A dataframe containing a timestamp column.
    dtg_col: str
        The name of the timestamp column.
    width: int
        The frame_width of the plot.
        The height will be automatically calculated to make the grid cells square.
    
    Returns
    -------
    A `hv.HeatMap` element.
    """

    DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    
    df = pd.DataFrame({'dtg':data_df[dtg_col]})
    df['Day'] = df['dtg'].dt.day_name().str.slice(0, 3)
    df['Hour'] = df['dtg'].dt.hour
    df = df.drop(columns='dtg')

    df = df.groupby(['Day', 'Hour'], as_index=False).size()
    
    zero_df = pd.DataFrame.from_records(
        [(day, hour) for day in DAYS for hour in range(24)],
        columns=['Day', 'Hour']
    )
    zero_df['size'] = 0
    
    # Concatenate the two dataframes.
    #
    df = pd.concat([df, zero_df])

    # Drop duplicate (day, hour) rows, keeping the first row.
    #
    df = df.drop_duplicates(subset=['Day', 'Hour'], keep='first', ignore_index=True)
    
    day_dim = hv.Dimension('Day', values=DAYS)
    return hv.HeatMap(df, kdims=['Hour', day_dim], vdims='size').opts(
        frame_width=width,
        frame_height=width*7//24, #Ensure that the internal grid is square.
        cmap='Inferno',
        invert_yaxis=True, # Days read from top to bottom.
        xticks=list(range(24)), # Force all of the hours to be displayed.
        xlabel='Hour of day',
        ylabel='Day of week',
        title='Pattern of life',
        default_tools=['save'],
        bgcolor='white',
        colorbar=True
    )

In [None]:
# Try it out with some simple data.
# Add the DayName column so we can see if the days match the plot.
#
df = pd.DataFrame({'dtg': pd.date_range('2021-09-08 08:00', freq='25H', periods=4)})
df['DayName'] = df['dtg'].dt.day_name()
print(df)
pattern_of_life(df, 'dtg')

The `pattern_of_life()` function just returns the `hv.HeatMap` element, so we can still apply options to it.

In [None]:
hm = pattern_of_life(df, 'dtg', width=504)
hm.opts(cmap='Blues', title='What I did last week', xlabel='The Hours', ylabel='The Days', bgcolor='blue')

Let's try it out with some more specially crafted data to see if the visualisation meets our expectations. (You can keep running the cell using Ctrl+Enter to see different values.)

In [None]:
import numpy as np
timestamps = []
def events(start, width=240, n=100):
    """Return n timestamps scattered around start.

    Offsets are normally distributed with a standard deviation of width minutes.
    """
    offsets = np.random.randn(n) * width
    return [pd.Timestamp(start) + minute * pd.Timedelta('1minute') for minute in offsets]
timestamps.extend(events('2021-09-06 10:00'))
timestamps.extend(events('2021-09-07 11:00'))
timestamps.extend(events('2021-09-08 12:00'))
timestamps.extend(events('2021-09-09 13:00'))
timestamps.extend(events('2021-09-10 14:00'))
timestamps.extend(events('2021-09-11 21:00', 120, 50)) # half as many events in half as many minutes
df = pd.DataFrame({'dtg': timestamps})

pattern_of_life(df, 'dtg')

It looks like there's a standard workday on business days, starting and finishing a little later every day, with more events in the middle of the day. On Saturdays it's party on in the evenings, and Sundays are event free (apart from possible spillover from Saturdays).
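
Because the data is random, every run looks slightly different. For a repeatable cross-check of what the plot shows, the events can be drawn from a seeded generator and counted per day directly. This is a sketch under a few assumptions: the seed value is arbitrary, and `seeded_events`, `check_df`, and `per_day` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # Arbitrary fixed seed, so reruns give the same data.

def seeded_events(start, width=240, n=100):
    """Return n timestamps around start (standard deviation of width minutes)."""
    offsets = rng.normal(0, width, n)
    return list(pd.Timestamp(start) + pd.to_timedelta(offsets, unit='min'))

timestamps = []
for start in ['2021-09-06 10:00', '2021-09-07 11:00', '2021-09-08 12:00',
              '2021-09-09 13:00', '2021-09-10 14:00']:
    timestamps.extend(seeded_events(start))
timestamps.extend(seeded_events('2021-09-11 21:00', 120, 50))

check_df = pd.DataFrame({'dtg': timestamps})

# Count events per day; a few may spill over into a neighbouring day.
per_day = check_df['dtg'].dt.day_name().str.slice(0, 3).value_counts()
print(per_day)
print(per_day.sum())  # 5 * 100 + 50 = 550 events in total.
```

Passing `check_df` to `pattern_of_life(check_df, 'dtg')` should then give the same heatmap on every run.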