# Visualisation using Holoviews

This notebook contains examples of visualisation using Holoviews, an open-source Python library for data analysis and visualization. If you have any questions, contact your nearest AA representative.

## Why Holoviews?

There are quite a few plotting libraries for Python. In particular, matplotlib is widely used and very capable. However, matplotlib was first released in 2003, when the web was a simpler place, and tends to be aimed at static or printed output. It also has a very complex API, making it difficult to use.

Holoviews is designed for the modern web, including Jupyter notebooks, and provides dynamic interactive output. With Holoviews, you first describe your data, then add extra information to modify aspects of the visualisation.

## Before we start

You'll need to have Holoviews installed before continuing. If you haven't done so already, run the following at a command prompt:
```
python -m pip install --user holoviews
```

Holoviews uses a lower level plotting library to draw visualisations. The next cell imports Holoviews, and tells it to use a library called `bokeh` for drawing.

Although it is not required for Holoviews, we'll use pandas dataframes to hold our data, so we need to import `pandas` as well. Using dataframes means our data is already labelled with column names, and uses appropriate datatypes (numbers, strings, timestamps). Holoviews understands dataframes, so it makes things easier for us.

In [None]:
import pandas as pd
import holoviews as hv

hv.extension('bokeh')

This notebook has been written using Anaconda 2021.05 (Python 3.8.8) and Holoviews version 1.14.5.

In [None]:
import sys
print(sys.version)
print(hv.__version__)

## Getting started

We'll start with some simple examples to give you an idea of how Holoviews works.

Before plotting any data, we need to know how to organise the data to make it plottable. The initial data set shows how many cups of coffee I drank last week.

In [None]:
coffee = pd.DataFrame({
    'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
    'Cups': [3, 5, 3, 2, 7],
    'Condition': ['OK', 'Nice', 'OK', 'Dull', 'Wow!']
})
coffee

We start with a scatter plot, using `hv.Scatter`. If we give Holoviews a dataframe, we just need to specify which columns to use for the x and y axes.

In [None]:
hv.Scatter(coffee, 'Day', 'Cups')

That was pretty easy. Not only did we get a scatter plot, but the default options are such that the tick labels match the values, the axis labels match the column names, and we get some tools that we'll look at in a minute.

Different kinds of charts work better with different kinds of data. Instead of points, let's use a curve.

In [None]:
hv.Curve(coffee, 'Day', 'Cups')

Still pretty easy: we just changed the name of the element. Let's try a bar chart.

In [None]:
hv.Bars(coffee, 'Day', 'Cups')

## Visualisation options

Now we know how to feed data to Holoviews, we can look at the next step: modifying the visualisation.

The scatter chart is functional, but the points are a bit small. Let's fix that.

Holoviews keeps the data (days, values) and the metadata (column names, data types) separate from the presentation (colors, sizes, etc). Presentation options are specified using the `.opts()` method.

In [None]:
hv.Scatter(coffee, 'Day', 'Cups').opts(size=10)

Now the dots are more visible. If you don't like the default steelblue color, you can change that as well.

In [None]:
hv.Scatter(coffee, 'Day', 'Cups').opts(size=10, color='red')

If only the plot was a little bigger, and there was a grid so it was easier to see where the dots are. Also, circles are boring.

In [None]:
hv.Scatter(coffee, 'Day', 'Cups').opts(
    width=400, height=400,
    marker='s',
    size=10,
    color='red',
    show_grid=True
)

Some of these options can be applied to a curve as well. The `marker` and `size` options are specific to a scatter plot, but a curve has a `line_width`. The color can be a known name ('red', 'green', 'blue', etc) or you can specify a web color.

In [None]:
hv.Curve(coffee, 'Day', 'Cups').opts(
    width=400,
    height=400,
    line_width=10,
    color='#ff7f00',
    show_grid=True
)

Unsurprisingly, options work for bars as well. Note that Holoviews elements are objects that can be assigned to names.

In [None]:
coffee_bars = hv.Bars(coffee, 'Day', 'Cups').opts(
    width=400, height=400,
    color='slategrey',
    show_grid=True
)
coffee_bars

How can we find what style options (color, line style, etc) and plot options (specific to the element) are available? Use `hv.help()`. For example, the next cell shows the many options available for `hv.Bars`. We'll look at some of these options below. For now, find out what happens when you add `invert_axes=True` to the `hv.Bars` example above.

In [None]:
hv.help(hv.Bars)

## Curve: Olympics 100m sprint

Let's look at something a little more complicated: the gold, silver and bronze medal winners of the Olympics 100m sprint.

In [None]:
from bokeh.sampledata.sprint import sprint
sprint = sprint.copy()
sprint

We know how to plot a curve of year against time.

In [None]:
hv.Curve(sprint, 'Year', 'Time')

We've plotted a single curve using all of the data, but there are actually three sets of data that we want to look at and compare: gold, silver, and bronze medal times.

We'll extract each medal into a separate dataframe using standard dataframe filtering, create a plot for each one, and look at the plots. Holoviews can display plots in a side-by-side layout.

In [None]:
gold_df = sprint[sprint['Medal']=='GOLD']
silver_df = sprint[sprint['Medal']=='SILVER']
bronze_df = sprint[sprint['Medal']=='BRONZE']

gold_medals = hv.Curve(gold_df, 'Year', 'Time')
silver_medals = hv.Curve(silver_df, 'Year', 'Time')
bronze_medals = hv.Curve(bronze_df, 'Year', 'Time')

hv.Layout([gold_medals, silver_medals, bronze_medals])

We can use `+` as a shortcut for a layout. We can also use `.cols()` to specify a number of columns for the layout.

In [None]:
(gold_medals + silver_medals + bronze_medals).cols(1)

That's all very well, but what we really want to do is plot the three sets of data in one chart. We can do that using `hv.Overlay()`, or use the `*` shortcut.

In [None]:
gold_medals * silver_medals * bronze_medals

Basic, but not very useful. Let's go back and recreate the plots, but this time we'll add a label to the metadata.

In [None]:
gold_medals = hv.Curve(gold_df, 'Year', 'Time', label='Gold')
silver_medals = hv.Curve(silver_df, 'Year', 'Time', label='Silver')
bronze_medals = hv.Curve(bronze_df, 'Year', 'Time', label='Bronze')

gold_medals * silver_medals * bronze_medals

Holoviews has automatically added a legend for us, using the labels in the metadata. Nice. But the colors don't match what we think they should be; fortunately we know how to fix that.

In [None]:
medals = gold_medals.opts(color='gold') * silver_medals.opts(color='grey') * bronze_medals.opts(color='sandybrown')
medals

Let's pause for a moment and look at the tools that Holoviews provides. From top to bottom:
- By default, the panning tool is enabled: you can just grab the plot with your mouse and pan it vertically and horizontally.
- The box zoom tool lets you select a rectangle with the mouse to zoom in on that area.
- If you click the wheel zoom tool (the mouse with the magnifying glass), you can use the mouse wheel to zoom in and out.
- The save tool saves the plot as an image.
- The reset tool resets the plot to its original state.

The plot is interactive: as well as the tools, you can click on the legend elements to de-emphasise them.

Now we add some finishing touches, including a title.

In [None]:
medals.opts(width=700, show_grid=True, title='Olympic 100m medal winning times')

### Adding the winners - kdims and vdims

It's a nice chart, but it doesn't tell us anything about the gold medal winners. To find out who the winners are, we need to add labels to the plot.

To see how labels work, let's go back to the coffee situation.

In [None]:
coffee_situation = hv.Bars(coffee, kdims='Day', vdims='Cups')
coffee_situation

So far the charts have had two dimensions: x and y. Holoviews considers these as belonging to the key dimensions (`kdims`) and the value dimensions (`vdims`). The kdims contain independent variables (for example, day of the week, year of the Olympics), and the vdims contain dependent variables which depend on the kdims (for example, the number of cups of coffee drunk depends on the day, the winning time depends on the year).

Different elements have different numbers of required key dimensions and value dimensions. For instance, a `Bars` element requires a key dimension and a value dimension.

The `Labels` element requires two key dimensions and a value dimension. This makes sense: the winner's name (the vdim) depends on the year and the time (the kdims). Let's try it.

In [None]:
coffee_labels = hv.Labels(coffee, kdims=['Day', 'Cups'], vdims='Condition')
coffee_labels

Comparing that with the bar chart above, it looks pretty good. All we need to do now is overlay the bar plot with the label plot.

In [None]:
coffee_situation * coffee_labels

Hmm, not quite: the labels are in exactly the right positions to match the `Day` and `Cups` axes, but not for a pleasing visualisation. That's easily fixed, though: we just use `.opts()` to tell the `hv.Labels` element to offset the labels in the `y` direction.

In [None]:
coffee_situation * coffee_labels.opts(yoffset=0.2)

Better.

Let's try creating labels for the gold medal winners. (Now that we know what `kdims` and `vdims` are required for `Labels`, I haven't bothered using the explicit parameter names.)

In [None]:
labels = hv.Labels(gold_df, ['Year', 'Time'], 'Name')
labels.opts(width=700)

Oh dear. We have exactly what we wanted, but we didn't really want that, all overlapping and stuff. We'll have to try a different way.

In the next cell, I've explicitly labelled the `kdims` and `vdims` parameters. They're the same parameters we've been using all along, so it obviously isn't mandatory; I'm just making it obvious. We're creating a scatter plot with `kdims` `Year` and `vdims` `Time` and `Name`. The scatter plot will use the kdim (`Year`) and the first vdim (`Time`) as before to draw dots.

In [None]:
gold_scatter = hv.Scatter(gold_df, kdims='Year', vdims=['Time', 'Name']).opts(width=700, size=4, color='black')
gold_scatter

Hmm, that didn't do anything obvious. But let's add the hover tool to the available tools. We can use `gold_scatter` with its existing options, and just add a `tools` option to enable the hover tool.

In [None]:
names = gold_scatter.opts(tools=['hover'])
names

It looks exactly the same. However, you should see an extra enabled "Hover" icon in the tool list. If you move the mouse over the chart, it changes to a hover cursor. If you move over any of the points, you'll see the data for the point displayed, including the name dimension. (Of course, you can customise what displays in the hover label, but we'll leave that for now.)

So now we have a plot that shows medal winning times, and another plot that shows winners' names.

I think we all know what happens next.

In [None]:
medals * names

Something else has happened as well. We didn't specify a width option, but the width came out as 700. That's because each plot element remembers its options, and both `medals` and `names` had `width=700`, so Holoviews did the right thing when combining them. On the other hand, they each had a different title (well, names didn't have a title at all), so it refused to pick one and kept the default of no title. Likewise, `show_grid` was True in one, False in the other, so it remained the default False.

We can add some finishing touches (again). We'll add another refinement: making the `wheel_zoom` tool active by default, so you don't have to click on it before zooming. (Try it.) Also, the curves are a bit close to the edges, so we'll add some padding all around.

In [None]:
winners = medals * names
winners.opts(
    show_grid=True,
    title='Olympic 100m winners',
    active_tools=['wheel_zoom'],
    padding=0.1
)

It should be pointed out that there's some misdirection being displayed here. The time values range between roughly 9.5 and 12.5, giving an exaggerated sense of how much times have improved. To display a more realistic plot, the y-axis should really start from 0.

To do that, use the `ylim` option, with a value of `(0, None)` indicating that the y-axis should range from 0 to whatever. The legend is now in an awkward place, so we'll move that as well.

In [None]:
winners.opts(
    ylim=(0, None),
    legend_position='bottom_left'
)

Is this plot better than the previous one? It depends what you want to focus on.

## More vdims

Let's look at another use for multiple vdims.

We have some automobile data; specifically, car names, year of introduction, and miles-per-gallon. We want to find out if cars became more efficient over the years.

In [None]:
from bokeh.sampledata.autompg import autompg
autompg = autompg.copy()
autompg

The first word of each car name is the manufacturer: we'll split the string and keep the first word. Then, to see what we're dealing with, we'll group by manufacturer and year and create a scatter plot to see which manufacturers introduced a car in each year.

In [None]:
autompg['manufacturer'] = autompg['name'].str.split().apply(lambda lst:lst[0])
manu_year = autompg[['manufacturer', 'yr']].groupby(['manufacturer', 'yr'], as_index=False).size()
manu_year
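
If the split-and-group step is unfamiliar, here it is on three made-up rows (the car names are placeholders, not taken from the dataset). The sketch uses `.str[0]`, which should be equivalent to the `apply(lambda ...)` above, and assumes pandas 1.1 or later, where `size()` with `as_index=False` returns a dataframe with a `size` column.

```python
import pandas as pd

# Hypothetical rows for illustration only.
cars = pd.DataFrame({
    'name': ['ford pinto', 'ford maverick', 'vw rabbit'],
    'yr': [71, 71, 76],
})

# Keep the first word of each name as the manufacturer.
cars['manufacturer'] = cars['name'].str.split().str[0]

# Count the rows in each (manufacturer, year) group.
counts = cars.groupby(['manufacturer', 'yr'], as_index=False).size()
print(counts)
```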

In [None]:
manu_year.describe()

Describing the dataframe tells us we have the years 70 to 82, and that each manufacturer introduced between 1 and 6 cars in any year in which it appears. To create the scatter plot, we'll use the `responsive=True` option to automatically adjust the width of the plot to your browser width.

In [None]:
hv.Scatter(manu_year, 'yr', 'manufacturer').opts(height=500, responsive=True)

Sigh. Even the sample data needs cleaning. Let's get to it, and redo everything.

In [None]:
autompg.loc[autompg['manufacturer']=='chevy', 'manufacturer'] = 'chevrolet'
autompg.loc[autompg['manufacturer']=='chevroelt', 'manufacturer'] = 'chevrolet'
autompg.loc[autompg['manufacturer']=='maxda', 'manufacturer'] = 'mazda'
autompg.loc[autompg['manufacturer']=='mercedes-benz', 'manufacturer'] = 'mercedes'
autompg.loc[autompg['manufacturer']=='toyouta', 'manufacturer'] = 'toyota'
autompg.loc[autompg['manufacturer']=='vokswagen', 'manufacturer'] = 'volkswagen'
autompg.loc[autompg['manufacturer']=='vw', 'manufacturer'] = 'volkswagen'

manu_year = autompg[['manufacturer', 'yr']].groupby(['manufacturer', 'yr'], as_index=False).size().rename(columns={'size':'ncars'})
manu_year
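
The cleaning above uses one `.loc` assignment per misspelling. An alternative, sketched here on a small made-up series, is to collect the corrections in a dictionary and pass it to `Series.replace`; the `corrections` name is just for illustration.

```python
import pandas as pd

# The same corrections as in the cell above, gathered into one mapping.
corrections = {
    'chevy': 'chevrolet',
    'chevroelt': 'chevrolet',
    'maxda': 'mazda',
    'mercedes-benz': 'mercedes',
    'toyouta': 'toyota',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
}

# Hypothetical sample values for illustration.
sample = pd.Series(['chevy', 'ford', 'vw', 'maxda'], name='manufacturer')

# Values that are not keys in the mapping are left unchanged.
cleaned = sample.replace(corrections)
print(cleaned.tolist())
```

Either approach works; the dictionary just keeps the list of fixes in one place if more turn up later.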

In [None]:
max_manu_cars = manu_year['ncars'].max()
title = f'Most cars introduced by a single manufacturer in a year: {max_manu_cars}'

hv.Scatter(manu_year, 'yr', 'manufacturer').opts(
    height=500,
    size=4,
    responsive=True,
    show_grid=True,
    title=title
)

The scatter plot tells us in which years each manufacturer introduced a car, so we can tell that Volkswagen is prolific, but Triumph much less so. What the plot doesn't tell us is which manufacturers introduced more cars in which years. For that, we need to add another vdim. In the cell above, we specified `size=4` for the points. In this cell, we'll specify that the size of the dots is the value of the `ncars` vdim.

In [None]:
hv.Scatter(manu_year, 'yr', ['manufacturer', 'ncars']).opts(
    size='ncars',
    height=500,
    responsive=True,
    show_grid=True,
    title='plain'
)

That's pretty cool - the more cars introduced, the larger the dot size. It's a bit unfortunate that the number of cars ranges from 1 to 6, because the dots come out a bit small. There are a couple of ways we can fix that, though. One would be to create another column in the dataframe which is a multiple of `ncars` and use that as the vdim.

We don't need to do that: Holoviews is clever enough to help us out.

In [None]:
hv.Scatter(manu_year, 'yr', ['manufacturer', 'ncars']).opts(
    size=hv.dim('ncars')*8,
    alpha=0.5,
    height=500,
    responsive=True,
    show_grid=True,
    title='boosted'
)

When we used `size='ncars'` in the "plain" plot, we were telling `hv.Scatter` to use the `ncars` dimension as the dot size. Using `size='ncars'` is a shortcut for `size=hv.dim('ncars')`, but `hv.Scatter` knows what we mean in context.

If we want to multiply the size, we can't use `size='ncars'*8`, because that means something else in Python. (You get an Internet point if you know what `'ncars'*8` does.) Therefore, we have to explicitly use `hv.dim('ncars')` to tell `hv.Scatter` that we mean the dimension `ncars`, not the string `'ncars'`.

The resulting "boosted" plot tells us which manufacturers introduced more cars (Chevrolet and Ford - shocked Pikachu face), which was what we wanted to know. However, we don't know what the actual values are.

(Admittedly I went a bit overboard when I multiplied by 8, but it gave me an opportunity to introduce the `alpha` style option to make the dots semi-transparent.)

Now we've figured that out, let's get back to our original objective - discovering change in mpg over years.

We'll group the data by manufacturer and year, and find the mean mpg for each group.

In [None]:
# We're using the 'mpg' column.
#
col = 'mpg'

# Aggregate using mean mpg.
# These both do the same thing, use whichever one makes the most sense.
#
# agg = autompg[['manufacturer', 'yr', col]].groupby(['manufacturer', 'yr'], as_index=False).agg({col:'mean'})
agg = autompg[['manufacturer', 'yr', col]].groupby(['manufacturer', 'yr'], as_index=False)[col].agg('mean')
agg

And now we'll plot the results using `mpg` for the dot size, as above.

In [None]:
hv.Scatter(agg, 'yr', ['manufacturer', col]).opts(
    size=col,
    alpha=0.5,
    height=500,
    responsive=True,
    title=f'Size is mean {col}'
)

That worked nicely. We can see that the dots are generally getting bigger from left to right, so that sort of answers our question.

The dots are too big, but we can easily fix that in the same way we fixed `ncars` above. However, we don't really have an idea of what the values are. We could use an `hv.Labels` element to add values to the plot, but it would spoil the stark simplicity.

Let's try applying the `mpg` dimension to the color instead of the size.

In [None]:
hv.Scatter(agg, 'yr', ['manufacturer', col]).opts(
    height=500,
    responsive=True,
    size=10,
    color=col,
    cmap='Viridis'
)

Personally I think it's easier to see the change using colors - cars did generally get more miles-per-gallon over the years from 1970 to 1982.

We still don't know what the values are, but we can add a colorbar to tell us.

In [None]:
hv.Scatter(agg, 'yr', ['manufacturer', col]).opts(
    height=500,
    responsive=True,
    size=10,
    color=col,
    cmap='Viridis',
    colorbar=True
)

There are many different uniform sequential colormaps we could have used here. Pick a couple and see if you prefer a different colormap.

In [None]:
hv.plotting.util.list_cmaps(category='Uniform Sequential', records=False, reverse=False)

## Exploratory Data Analysis
Given what we've learned, we can do some exploratory data analysis on the autompg dataset.

Above we only looked at the `mpg` column. We'd really like to compare any numeric column with any other numeric column.

The cell below will open another page in your browser and display a dashboard that allows you to select any numeric value for the `x`, `y`, `color`, and `size` dimensions. The `panel` library that is used here is the same library that Holoviews uses for layouts.

Use the drop-down selections to plot `yr` vs `mpg` and see if the results match what we saw above. What other interesting trends are there?

In [None]:
import panel as pn
import panel.widgets as pnw
columns = sorted([col for col in autompg.columns if autompg[col].dtype!=object])

x = pnw.Select(name='x-axis', value='mpg', options=columns)
y = pnw.Select(name='y-axis', value='hp', options=columns)
color = pnw.Select(name='color', value='None', options=['None'] + columns)
size = pnw.Select(name='size', value='None', options=['None'] + columns)

@pn.depends(x.param.value, y.param.value, color.param.value, size.param.value) 
def create_figure(x, y, color, size):
    opts = dict(cmap='Viridis', responsive=True, height=600, line_color='black', colorbar=True)
    if color != 'None':
        opts['color'] = color
    opts['size'] = hv.dim(size).norm()*20 if size!='None' else 6
    
    return hv.Points(autompg, [x, y], label=f'{x.title()} vs {y.title()}').opts(**opts)

widgets = pn.WidgetBox(x, y, color, size, width=200)
pn.Row(widgets, create_figure).show('Cross-selector')