# Visualizing Data with UMAP

UMAP is a technique for projecting data into a lower dimension in a manner that captures as much information about the high-dimensional representation as possible. In practice this means that you can take complex high-dimensional data and compress it down to something that can be represented visually as a scatterplot. This makes it very valuable for exploratory data analysis -- just learning about, and gaining some intuition about, your data. In this notebook we'll work through an example (on sample data) to demonstrate how to use UMAP to reduce the dimension, and how to use that to build an interactive plot that lets you explore your data further.

Credit for this tutorial goes to Valerie Poulin.

<h2> Run the following cell to get pictorial mouse-overs </h2>

The code in the cell below is a near-verbatim copy of the umap.plot.interactive function. I altered it so that if the associated hover_data has 'image' as a key, the tooltip displays the image itself. Otherwise you just get a text readout of the base64 encoding of the image -- not very helpful.

I have a feature request in with the UMAP developers to add the appropriate functionality before the release of umap 0.4.

Apologies for the wall of code -- this won't be necessary for much longer. 
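
For orientation, the one behavioural change relative to the upstream function is in how the tooltips are assembled. The following is a minimal standalone sketch of that logic only; `build_tooltips` is a hypothetical helper name used for illustration and is not part of umap or bokeh.

```python
def build_tooltips(column_names):
    """Return (label, template) pairs in the form bokeh accepts for tooltips."""
    tooltips = []
    for name in column_names:
        if name == "image":
            # Render the data URI stored in the 'image' column as a picture
            template = "<img src='@image' style='margin: 2px 2px 2px 2px'></img>"
        else:
            # "@column" tells bokeh to substitute that column's value as text
            template = "@" + name
        tooltips.append((name, template))
    return tooltips


print(build_tooltips(["label", "image"]))
```

Every column other than 'image' is shown as plain text, exactly as in the unmodified function.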

In [4]:
# umap.plot.interactive -- and some dependencies

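import numpy as np
import pandas as pd
import matplotlib.colors
import colorcet
import matplotlib.pyplot as plt
import bokeh.plotting as bpl
import bokeh.transform as btr  # provides linear_cmap, used when colouring by values
from warnings import warn

# Only used by the datashader fallback for very large datasets
import holoviews as hv
import holoviews.operation.datashader as hd
import datashader as ds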

fire_cmap = matplotlib.colors.LinearSegmentedColormap.from_list("fire", colorcet.fire)
darkblue_cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
    "darkblue", colorcet.kbc
)
darkgreen_cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
    "darkgreen", colorcet.kgy
)
darkred_cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
    "darkred", colors=colorcet.linear_kry_5_95_c72[:192], N=256
)
darkpurple_cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
    "darkpurple", colorcet.linear_bmw_5_95_c89
)

plt.register_cmap("fire", fire_cmap)
plt.register_cmap("darkblue", darkblue_cmap)
plt.register_cmap("darkgreen", darkgreen_cmap)
plt.register_cmap("darkred", darkred_cmap)
plt.register_cmap("darkpurple", darkpurple_cmap)

def _to_hex(arr):
    return [matplotlib.colors.to_hex(c) for c in arr]

def _red(x):
    return (x & 0xFF0000) >> 16

def _green(x):
    return (x & 0x00FF00) >> 8

def _blue(x):
    return x & 0x0000FF


_themes = {
    "fire": {
        "cmap": "fire",
        "color_key_cmap": "rainbow",
        "background": "black",
        "edge_cmap": "fire",
    },
    "viridis": {
        "cmap": "viridis",
        "color_key_cmap": "Spectral",
        "background": "black",
        "edge_cmap": "gray",
    },
    "inferno": {
        "cmap": "inferno",
        "color_key_cmap": "Spectral",
        "background": "black",
        "edge_cmap": "gray",
    },
    "blue": {
        "cmap": "Blues",
        "color_key_cmap": "tab20",
        "background": "white",
        "edge_cmap": "gray_r",
    },
    "red": {
        "cmap": "Reds",
        "color_key_cmap": "tab20b",
        "background": "white",
        "edge_cmap": "gray_r",
    },
    "green": {
        "cmap": "Greens",
        "color_key_cmap": "tab20c",
        "background": "white",
        "edge_cmap": "gray_r",
    },
    "darkblue": {
        "cmap": "darkblue",
        "color_key_cmap": "rainbow",
        "background": "black",
        "edge_cmap": "darkred",
    },
    "darkred": {
        "cmap": "darkred",
        "color_key_cmap": "rainbow",
        "background": "black",
        "edge_cmap": "darkblue",
    },
    "darkgreen": {
        "cmap": "darkgreen",
        "color_key_cmap": "rainbow",
        "background": "black",
        "edge_cmap": "darkpurple",
    },
}

def interactive(
    umap_object,
    labels=None,
    values=None,
    hover_data=None,
    theme=None,
    cmap="Blues",
    color_key=None,
    color_key_cmap="Spectral",
    background="white",
    width=800,
    height=800,
    point_size=None,
):
    """Create an interactive bokeh plot of a UMAP embedding.
    While static plots are useful, sometimes a plot that
    supports interactive zooming, and hover tooltips for
    individual points is much more desirable. This function
    provides a simple interface for creating such plots. The
    result is a bokeh plot that will be displayed in a notebook.

    Note that more complex tooltips etc. will require custom
    code -- this is merely meant to provide fast and easy
    access to interactive plotting.

    Parameters
    ----------
    umap_object: trained UMAP object
        A trained UMAP object that has a 2D embedding.

    labels: array, shape (n_samples,) (optional, default None)
        An array of labels (assumed integer or categorical),
        one for each data sample.
        This will be used for coloring the points in
        the plot according to their label. Note that
        this option is mutually exclusive to the ``values``
        option.

    values: array, shape (n_samples,) (optional, default None)
        An array of values (assumed float or continuous),
        one for each sample.
        This will be used for coloring the points in
        the plot according to a colorscale associated
        to the total range of values. Note that this
        option is mutually exclusive to the ``labels``
        option.

    hover_data: DataFrame, shape (n_samples, n_tooltip_features)
    (optional, default None)
        A dataframe of tooltip data. Each column of the dataframe
        should be a Series of length ``n_samples`` providing a value
        for each data point. Column names will be used for
        identifying information within the tooltip.

    theme: string (optional, default None)
        A color theme to use for plotting. A small set of
        predefined themes are provided which have relatively
        good aesthetics. Available themes are:
           * 'blue'
           * 'red'
           * 'green'
           * 'inferno'
           * 'fire'
           * 'viridis'
           * 'darkblue'
           * 'darkred'
           * 'darkgreen'

    cmap: string (optional, default 'Blues')
        The name of a matplotlib colormap to use for coloring
        or shading points. If no labels or values are passed
        this will be used for shading points according to
        density (largely only of relevance for very large
        datasets). If values are passed this will be used for
        shading according to the value. Note that if theme
        is passed then this value will be overridden by the
        corresponding option of the theme.

    color_key: dict or array, shape (n_categories) (optional, default None)
        A way to assign colors to categoricals. This can either be
        an explicit dict mapping labels to colors (as strings of form
        '#RRGGBB'), or an array like object providing one color for
        each distinct category being provided in ``labels``. Either
        way this mapping will be used to color points according to
        the label. Note that if theme
        is passed then this value will be overridden by the
        corresponding option of the theme.

    color_key_cmap: string (optional, default 'Spectral')
        The name of a matplotlib colormap to use for categorical coloring.
        If an explicit ``color_key`` is not given a color mapping for
        categories can be generated from the label list and selecting
        a matching list of colors from the given colormap. Note
        that if theme
        is passed then this value will be overridden by the
        corresponding option of the theme.

    background: string (optional, default 'white')
        The color of the background. Usually this will be either
        'white' or 'black', but any color name will work. Ideally
        one wants to match this appropriately to the colors being
        used for points etc. This is one of the things that themes
        handle for you. Note that if theme
        is passed then this value will be overridden by the
        corresponding option of the theme.

    width: int (optional, default 800)
        The desired width of the plot in pixels.

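    height: int (optional, default 800)
        The desired height of the plot in pixels.

    point_size: float (optional, default None)
        The size of each point marker. If None, a size is
        chosen from the number of points as
        100 / sqrt(n_samples).

    Returns
    -------
    plot: the plot object
        A bokeh figure when the number of points is small enough
        for a plain scatterplot, otherwise a datashaded holoviews
        object. The plot is returned, not shown.
    """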
    if theme is not None:
        cmap = _themes[theme]["cmap"]
        color_key_cmap = _themes[theme]["color_key_cmap"]
        background = _themes[theme]["background"]

    if labels is not None and values is not None:
        raise ValueError(
            "Conflicting options; only one of labels or values should be set"
        )

    points = umap_object.embedding_

    if points.shape[1] != 2:
        raise ValueError("Plotting is currently only implemented for 2D embeddings")

    if point_size is None:
        point_size = 100.0 / np.sqrt(points.shape[0])

    data = pd.DataFrame(umap_object.embedding_, columns=("x", "y"))

    if labels is not None:
        data["label"] = labels

        if color_key is None:
            unique_labels = np.unique(labels)
            num_labels = unique_labels.shape[0]
            color_key = _to_hex(
                plt.get_cmap(color_key_cmap)(np.linspace(0, 1, num_labels))
            )

        if isinstance(color_key, dict):
            data["color"] = pd.Series(labels).map(color_key)
        else:
            unique_labels = np.unique(labels)
            if len(color_key) < unique_labels.shape[0]:
                raise ValueError(
                    "Color key must have enough colors for the number of labels"
                )

            new_color_key = {k: color_key[i] for i, k in enumerate(unique_labels)}
            data["color"] = pd.Series(labels).map(new_color_key)

        colors = "color"

    elif values is not None:
        data["value"] = values
        palette = _to_hex(plt.get_cmap(cmap)(np.linspace(0, 1, 256)))
        colors = btr.linear_cmap(
            "value", palette, low=np.min(values), high=np.max(values)
        )

    else:
        colors = matplotlib.colors.rgb2hex(plt.get_cmap(cmap)(0.5))

    if points.shape[0] <= width * height // 10:

        if hover_data is not None:
            tooltip_dict = {}
            for col_name in hover_data:
                data[col_name] = hover_data[col_name]
                if col_name != 'image':
                    tooltip_dict[col_name] = "@" + col_name
                else:
                    tooltip_dict[col_name] = "<img src='@image' style='margin: 2px 2px 2px 2px'></img>"
            tooltips = list(tooltip_dict.items())
        else:
            tooltips = None

        # bpl.output_notebook(hide_banner=True) # this doesn't work for non-notebook use
        data_source = bpl.ColumnDataSource(data)

        plot = bpl.figure(
            width=width,
            height=height,
            tooltips=tooltips,
            background_fill_color=background,
        )
        plot.circle(x="x", y="y", source=data_source, color=colors, size=point_size)

        plot.grid.visible = False
        plot.axis.visible = False

        # bpl.show(plot)
    else:
        if hover_data is not None:
            warn(
                "Too many points for hover data -- tooltips will not "
                "be displayed. Sorry; try subsampling your data."
            )
        hv.extension("bokeh")
        hv.output(size=300)
        hv.opts('RGB [bgcolor="{}", xaxis=None, yaxis=None]'.format(background))
        if labels is not None:
            point_plot = hv.Points(data, kdims=["x", "y"], vdims=["color"])
            plot = hd.datashade(
                point_plot,
                aggregator=ds.count_cat("color"),
                cmap=plt.get_cmap(cmap),
                width=width,
                height=height,
            )
        elif values is not None:
            # Bin the "value" column (not the whole frame) into roughly 256 categories
            min_val = data["value"].min()
            val_range = data["value"].max() - min_val
            data["val_cat"] = pd.Categorical(
                (data["value"] - min_val) // (val_range / 256)
            )
            point_plot = hv.Points(data, kdims=["x", "y"], vdims=["val_cat"])
            plot = hd.datashade(
                point_plot,
                aggregator=ds.count_cat("val_cat"),
                cmap=plt.get_cmap(cmap),
                width=width,
                height=height,
            )
        else:
            point_plot = hv.Points(data, kdims=["x", "y"])
            plot = hd.datashade(
                point_plot,
                aggregator=ds.count(),
                cmap=plt.get_cmap(cmap),
                width=width,
                height=height,
            )

    return plot
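
Two small pieces of arithmetic in this function are worth spelling out: the default marker size, and the point count at which it stops drawing a plain bokeh scatterplot and falls back to datashader. Here is a sketch of both, using hypothetical helper names that simply mirror the expressions in the function above.

```python
import math


def default_point_size(n_points):
    # Mirrors: point_size = 100.0 / np.sqrt(points.shape[0])
    return 100.0 / math.sqrt(n_points)


def uses_plain_scatter(n_points, width=800, height=800):
    # Mirrors: points.shape[0] <= width * height // 10
    return n_points <= width * height // 10


for n in (1797, 64000, 64001):
    print(n, round(default_point_size(n), 3), uses_plain_scatter(n))
```

With the default 800 x 800 canvas the cutoff is 64,000 points, so the 1797-point digits data used in this notebook takes the plain scatterplot path and hover tooltips are available.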

## Embedding data

Our first goal is to use UMAP to project data into two dimensions. To start with we will load the base libraries we might make use of: numpy for handling of numeric arrays, and pandas for working with data frames -- there are cheat sheets and other notebooks available to explain more about these libraries.

In [5]:
import numpy as np
import pandas as pd

For our test data we will use the small handwritten digits dataset that is packaged into sklearn (8x8 pixel images; a much smaller cousin of MNIST rather than MNIST itself). We can access this from ``sklearn.datasets`` via the ``load_digits`` function.

In [6]:
from sklearn.datasets import load_digits
import sklearn

Finally we will need UMAP itself for embedding the data. Here we will make use of the UMAP implementation from umap-learn version 0.4dev. 

In [7]:
from umap import UMAP
import umap.plot

Let's get the data extracted, and have a look at it, so we know what the starting point for working with data is.

In [8]:
digits = load_digits()
data = digits.data.astype(np.uint8)
data

array([[ 0,  0,  5, ...,  0,  0,  0],
       [ 0,  0,  0, ..., 10,  0,  0],
       [ 0,  0,  0, ..., 16,  9,  0],
       ...,
       [ 0,  0,  1, ...,  6,  0,  0],
       [ 0,  0,  2, ..., 12,  0,  0],
       [ 0,  0, 10, ..., 12,  1,  0]], dtype=uint8)

We have a large array of numbers. Each row is an observation -- a sequence of numeric values associated to a handwritten digit. Obviously it is hard to see what is going on here, and how these different vectors of numbers inter-relate. We would like to be able to visualise the full data set to get an understanding of any underlying structure, whether there might be outliers, etc. Right now the data has dimensions as follows:

In [22]:
data.shape

(1797, 64)

So we have 1797 observations, and each one has 64 different features. This is high dimensional data, and difficult to visualise. Remember, an embedding consists of
 (1) a numeric representation of your data, and
 (2) a distance.
Luckily, here we have numeric data. If your data set isn't numeric (e.g., if it is categorical), then you will have to find a sensible embedding of your data. See the accompanying slides, or jc-healy's "Embed all the Things" talk for some inspiration.
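
If your data were categorical, one simple (and by no means the only) way to get a numeric representation is one-hot encoding. The sketch below uses a made-up colour column; the toy values are assumptions for illustration only.

```python
import numpy as np

# Toy categorical feature (made-up values, for illustration only)
colours = np.array(["red", "green", "blue", "green"])

# Map each category to an integer code, then to a one-hot row
categories, codes = np.unique(colours, return_inverse=True)
one_hot = np.eye(len(categories))[codes]


def euclidean(a, b):
    # With a numeric representation in hand, a distance is well defined
    return float(np.linalg.norm(a - b))


print(categories)                          # the sorted category names
print(euclidean(one_hot[1], one_hot[3]))   # two rows of the same category
print(euclidean(one_hot[0], one_hot[1]))   # two rows of different categories
```

A matrix like `one_hot` could then be handed to UMAP in the same way as the digits array, although for binary data a metric other than Euclidean may be a better fit.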

Now we are ready to fit the data. It really is that simple.

In [12]:
# defaults: metric = euclidean, 
#           output_metric = euclidean, and 
#           n_components (i.e., number of output dimensions) = 2
embedding = UMAP().fit(data)
# you might try some different metrics, like cosine
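
The `metric` argument changes what "close" means in the input space. As a rough sketch of the difference between the Euclidean default and cosine, compare the two on a pair of toy vectors (the vectors are made up for illustration, and the helper names are hypothetical):

```python
import numpy as np


def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b):
    # 1 - cosine similarity: depends only on direction, not magnitude
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


faint = np.array([1.0, 2.0, 0.0])
bold = 2 * faint  # same pattern, twice the intensity

print(euclidean_distance(faint, bold))  # non-zero: the magnitudes differ
print(cosine_distance(faint, bold))     # approximately zero: same direction
```

Passing `metric='cosine'` to `UMAP(...)` would therefore tend to treat a faint and a bold copy of the same stroke pattern as near neighbours, which the Euclidean default does not guarantee.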

Here we have a little helper function to create an embeddable image out of each row of the data.

In [13]:
from PIL import Image
from io import BytesIO
import base64

def embeddable_image(data):  # Assumes data is a np.uint8 64-long vector
    buf = BytesIO()
    img = np.reshape(data, (-1, 8))  # Turn into an 8x8 array
    img = 255 - img * 8  # Invert and scale the 0-16 pixel values up into 8-bit pixels
    img = Image.fromarray(img, mode='L')  # Create a greyscale image
    img = img.resize((32, 32), Image.BICUBIC)  # Resize and interpolate
    # A little magic to save it in something that can fit in a pd DataFrame for later use
    img.save(buf, format='PNG')
    for_encoding = buf.getvalue()
    return 'data:image/png;base64,' + base64.b64encode(for_encoding).decode()

# add an entry containing the embedded images to the digits object
digits['image'] = list(map(embeddable_image, data))
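
The string returned by `embeddable_image` is a data URI: a MIME-type header followed by the base64-encoded PNG bytes. Below is a standard-library-only sketch of that encoding and its inverse, using hypothetical helper names and a few placeholder bytes in place of a real image.

```python
import base64


def to_data_uri(raw_bytes, mime="image/png"):
    """Wrap raw bytes in a data URI that a browser can use as an <img> src."""
    return "data:" + mime + ";base64," + base64.b64encode(raw_bytes).decode()


def from_data_uri(uri):
    """Recover the raw bytes from a base64 data URI."""
    header, payload = uri.split(",", 1)
    return base64.b64decode(payload)


# Placeholder payload: the 8-byte PNG file signature, not a complete image
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
print(uri)
print(from_data_uri(uri) == png_signature)
```

Because the result is an ordinary string, it can sit in a pandas column and be dropped straight into the `src` attribute of an `<img>` tag in a tooltip.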

<h1> Optional step: Cluster the data with HDBSCAN </h1>

It might be informative to cluster the data with HDBSCAN and see how the clusters match the labels. Note that the cluster labels may not be in lockstep with the data labels, and that's OK for 2 reasons. First, the cluster numbers are just _cluster labels_ and not associated with the labels assigned to the data -- remember, we wouldn't ordinarily use unsupervised techniques on labelled data, but it's convenient for evaluating how we're doing. Secondly, clusters may expose some different relationships within the umapped data than what we expected (and this _is_ a reason we might use unsupervised techniques on labelled data).

In [25]:
import hdbscan 
clusters = hdbscan.HDBSCAN().fit_predict(embedding.embedding_)
#If you don't like the clusters you're getting in the below plot,
#you can change the parameters of HDBSCAN a bit to get what you'd
#expect. But you should know what that means for your data before
#you make a judgement call on the HDBSCAN defaults.

#clusters = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=25).fit_predict(embedding.embedding_)
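
One way to see how cluster labels line up with the data labels, without assuming the numbers match, is a contingency table. The sketch below uses small made-up label arrays and a hypothetical `contingency` helper; note that HDBSCAN assigns the label -1 to points it treats as noise.

```python
import numpy as np


def contingency(true_labels, cluster_labels):
    """Count how many points fall in each (true label, cluster label) pair."""
    true_ids, t = np.unique(true_labels, return_inverse=True)
    cluster_ids, c = np.unique(cluster_labels, return_inverse=True)
    table = np.zeros((len(true_ids), len(cluster_ids)), dtype=int)
    np.add.at(table, (t, c), 1)  # accumulate one count per point
    return true_ids, cluster_ids, table


# Made-up labels: the clusters are a relabelling of the truth, plus one noise point
true_labels = np.array([0, 0, 1, 1, 2, 2])
cluster_labels = np.array([2, 2, 0, 0, 1, -1])

true_ids, cluster_ids, table = contingency(true_labels, cluster_labels)
print("rows (true labels):", true_ids)
print("columns (cluster labels):", cluster_ids)
print(table)
```

A table with one dominant entry per row indicates that the clusters recover the labels up to renaming; rows spread over several columns point to digits that were split or merged. In this notebook the same helper could be applied to `digits.target` and `clusters`.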

<h1> Create an interactive plot </h1>

In UMAP 0.4dev (and beyond), there is an interactive plotting environment baked in. I've made a tiny change to this (the large code block above) in order to view images within the plot.

First, a couple of lines of code to tell Bokeh to keep plots inline in the notebook (otherwise we'd have to switch back and forth among tabs).

In [14]:
from bokeh.plotting import output_notebook
from bokeh.resources import INLINE
output_notebook(resources=INLINE)

Now let's plot the data and see what we can see. <br>

**A few notes:** <br>

The 'hover_data' kwarg in the interactive function determines what information is displayed with mouseover. In this case, we are displaying the "ground truth" label as well as the underlying image. <br>

The 'labels' kwarg controls the colour mapping. Keep only one of the two 'labels' lines uncommented at a time, corresponding to whether you want the points coloured by "ground truth" label or HDBSCAN cluster label. <br>

The 'theme' kwarg simply controls the colour palette. 'fire' or 'viridis' are my favourites. 

Make sure you look at the bar on the right hand side of the plot -- using these options you can control the pan and zoom; you can set zoom to be by mouse rollover, you can save images to disk, etc. 
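
A note on `theme`: when it is given, it replaces the `cmap`, `color_key_cmap`, and `background` arguments rather than combining with them. Here is a small sketch of that precedence, using two of the theme entries from the large cell above and a hypothetical `resolve_style` helper.

```python
# Two entries copied from the _themes dictionary defined earlier
THEMES = {
    "fire": {"cmap": "fire", "color_key_cmap": "rainbow", "background": "black"},
    "blue": {"cmap": "Blues", "color_key_cmap": "tab20", "background": "white"},
}


def resolve_style(theme=None, cmap="Blues", color_key_cmap="Spectral", background="white"):
    """Return the (cmap, color_key_cmap, background) actually used for plotting."""
    if theme is not None:
        chosen = THEMES[theme]
        # A theme silently overrides any explicitly passed style arguments
        cmap = chosen["cmap"]
        color_key_cmap = chosen["color_key_cmap"]
        background = chosen["background"]
    return cmap, color_key_cmap, background


print(resolve_style(theme="fire", background="white"))
print(resolve_style(cmap="viridis"))
```

So if you want a custom background or colormap, leave `theme` unset and pass the individual arguments instead.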

In [15]:
hover_data = pd.DataFrame({'label': digits.target, 'image': digits.image})

# Generally, you would use the umap.plot.interactive function here,
# but we're using my custom 'interactive' function because we want
# to view the images inline
#p = umap.plot.interactive(embedding,
p = interactive(embedding,
                #labels=clusters,        # uncomment this line (and comment out the next) to colour by cluster label
                labels=digits.target,    # colour by target label
                theme='fire',
                hover_data=hover_data
               )

#Now display your plot!
umap.plot.show(p)

**And that's it.** Hopefully you can use this as a template to start your own exploratory analyses on high dimensional data. 