# Composing and Linking Multiple Scatter Plots

In this notebook, we'll learn:
1. [About `jscatter`'s API for plotting multiple scatter plots](#API-for-Composing-Multiple-Scatter-Plots)
2. [How to synchronize selections using Fashion MNIST embeddings](#Synchronizing-the-Selection-and-Hover)
3. [How to synchronize views using LLM-based sentence embeddings](#Synchronizing-Views)

---

## API for Composing Multiple Scatter Plots

We'll start out with a very simple example to get familiar with the API.

In the following, we'll compose two scatter plots next to each other using `jscatter.compose()`.

In [None]:
from jscatter import Scatter, compose
from numpy.random import rand

a = Scatter(x=rand(500), y=rand(500))
b = Scatter(x=rand(5000), y=rand(5000))

compose([a, b])

By default, `jscatter` arranges scatter plots in a single row, but we can customize the layout.

In [None]:
compose([a, b], rows=2)
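To make the row/column arithmetic concrete, here is a minimal sketch of how a list of plots could be distributed over a grid. `grid_shape` is a hypothetical helper written for this tutorial, not part of `jscatter`, and it only assumes that a missing column count is derived from the number of plots and rows.

```python
from math import ceil


def grid_shape(n_plots, rows=1, cols=None):
    """Return (rows, cols) for a grid with enough cells for n_plots."""
    if cols is None:
        # Assumption: fill rows first and derive the column count
        cols = ceil(n_plots / rows)
    if rows * cols < n_plots:
        raise ValueError("grid is too small for the number of plots")
    return rows, cols


print(grid_shape(2))          # (1, 2): two plots side by side
print(grid_shape(2, rows=2))  # (2, 1): two plots stacked
print(grid_shape(4, rows=2))  # (2, 2): a 2x2 grid
```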

So far so good, but the fun part starts when we link/synchronize the scatter plots' views and selections.

## Synchronizing the Selection and Hover

### Comparing Embedding Methods

To demonstrate the usefulness of linked/synchronized selections, let's take a look at the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist), which we embedded using four different embedding methods:

1. [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
2. [t-SNE](https://opentsne.readthedocs.io/en/stable/)
3. [UMAP](https://umap-learn.readthedocs.io/en/latest/)
4. [A convolutional autoencoder](https://blog.keras.io/building-autoencoders-in-keras.html)

In [None]:
!curl -L -C - -o data/fashion-mnist-embeddings.pq https://storage.googleapis.com/flekschas/jupyter-scatter-tutorial/fashion-mnist-embeddings.pq

In [None]:
import pandas as pd
fashion_mnist_embeddings = pd.read_parquet('data/fashion-mnist-embeddings.pq')
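# Map numeric class IDs to readable names and make only the class column categorical,
# so the x/y embedding columns stay numeric
class_names = {0: "T-shirt/top", 1: "Trouser", 2: "Pullover", 3: "Dress", 4: "Coat", 5: "Sandal", 6: "Shirt", 7: "Sneaker", 8: "Bag", 9: "Ankle boot"}
fashion_mnist_embeddings['class'] = fashion_mnist_embeddings['class'].replace(class_names).astype('category')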
fashion_mnist_embeddings.head(3)

The dataframe contains pre-embedded x/y locations of each image and the associated class.

Since we're going to visualize each embedding using the same visual encoding, we can specify most things upfront:

In [None]:
config = dict(
    background_color='#111111',
    color_by='class',
    color_map={
        "T-shirt/top": '#FFFF00',
        "Trouser": '#1CE6FF',
        "Pullover": '#FF34FF',
        "Dress": '#FF4A46',
        "Coat": '#008941',
        "Sandal": '#006FA6',
        "Shirt": '#A30059',
        "Sneaker": '#FFDBE5',
        "Bag": '#7A4900',
        "Ankle boot": '#0000A6'
    },
    legend=True,
    axes=False,
    zoom_on_selection=True, # To automatically zoom to selected points
)

Finally, we create four `jscatter` instances and compose them in a 2x2 grid. This time, however, we link/synchronize the selection and point hovering across all four instances, because each scatter plot references the same images from Fashion MNIST.

In [None]:
pca = Scatter(data=fashion_mnist_embeddings, x='pcaX', y='pcaY', **config)
tsne = Scatter(data=fashion_mnist_embeddings, x='tsneX', y='tsneY', **config)
umap = Scatter(data=fashion_mnist_embeddings, x='umapX', y='umapY', **config)
cae = Scatter(data=fashion_mnist_embeddings, x='caeX', y='caeY', **config)

compose(
    [pca, tsne, umap, cae],
    sync_selection=True,
    sync_hover=True,
    rows=2,
    row_height=240
)

Because I like to see the selected points within their local neighborhood, I activated `zoom_on_selection`. In combination with the synced selection, this makes all scatter plots automatically zoom to selected points.
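Linking works here because a row index means the same image in every plot. The following toy model illustrates that idea with plain Python. `LinkedSelection` and `make_view` are hypothetical names and this is not how `jscatter` is implemented; it is only a sketch of one shared list of row indices being broadcast to several views.

```python
class LinkedSelection:
    """Toy model: one shared list of row indices observed by several views."""

    def __init__(self):
        self._listeners = []
        self.indices = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def set(self, indices):
        # Store the selection once and notify every subscribed view
        self.indices = list(indices)
        for callback in self._listeners:
            callback(self.indices)


# Each "view" simply remembers which rows are currently selected
views = {name: [] for name in ("pca", "tsne", "umap", "cae")}


def make_view(name):
    def on_selection(indices):
        views[name] = indices
    return on_selection


shared = LinkedSelection()
for name in views:
    shared.subscribe(make_view(name))

# Selecting rows in one place updates all four views
shared.set([3, 7, 42])
print(views["umap"])  # [3, 7, 42]
```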

## Synchronizing Views

### For Faceted Exploration or Shared Latent Spaces

Beyond synchronizing the selection, `jscatter` also supports view synchronization. It does not make much sense to activate this for the Fashion MNIST example above because each scatter plot shows a different embedding space. However, we might want to explore a large dataset where all points share the same latent space. In this case, it can be interesting to facet the dataset for comparison. And since the space is the same, it can be useful to synchronize the view.

> 🚨 LLM Alert

In the next example, we're going to compare news articles from 2012-2022 by their title. For that we're using the fantastic [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset?resource=download) from [Rishabh Misra, 2022](https://arxiv.org/abs/2209.11429). We embedded the titles, the abstracts, and the combination of both using the pretrained [all-MiniLM-L6-v2 sentence transformer from 🤗](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) together with [UMAP](https://umap-learn.readthedocs.io/en/latest/).

In [None]:
!curl -L -C - -o data/huffpost-embeddings.pq https://storage.googleapis.com/flekschas/jupyter-scatter-tutorial/huffpost-embeddings.pq

In [None]:
import pandas as pd
huffpost_embeddings = pd.read_parquet('data/huffpost-embeddings.pq')
huffpost_embeddings.head(3)

In [None]:
from jscatter import glasbey_light

category_cmap = { cat: glasbey_light[i] for i, cat in enumerate(huffpost_embeddings.category.unique()) }

huffpost_scatter_config = dict(axes=False, background_color='#111111')
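The mapping above indexes the palette by position, which assumes the palette has at least as many colors as there are categories. As a hedged sketch of a more forgiving variant, the snippet below cycles through a palette so that colors repeat instead of raising an `IndexError`. The `palette` and `categories` values are placeholders made up for illustration, not the real `glasbey_light` colors or dataset categories.

```python
from itertools import cycle

# Placeholder palette and categories (hypothetical values)
palette = ["#d60000", "#8c3bff", "#018700"]
categories = ["POLITICS", "WELLNESS", "TRAVEL", "SPORTS"]

# zip() stops at the shorter input, and cycle() never runs out of colors
category_cmap = dict(zip(categories, cycle(palette)))

print(category_cmap["SPORTS"])  # wraps around to the first color: #d60000
```

The trade-off is that repeated colors become ambiguous, so this is only a fallback for cases where the number of categories exceeds the palette size.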

In [None]:
huffpost_scatter = Scatter(
    data=huffpost_embeddings,
    x='x',
    y='y',
    color_by='category',
    color_map=category_cmap,
    height=640,
    legend=True,
    **huffpost_scatter_config
)
huffpost_scatter.show()

As mentioned above, this dataset consists of news articles from 2012 to 2022. An interesting question is whether the distribution of published articles has changed over the years. To find out, we're going to facet the data frame by year and plot each year as an individual scatter plot.

In [None]:
def create_annual_scatter(year):
    return Scatter(
        data=huffpost_embeddings_years[year],
        x='x',
        y='y',
        color_by='category',
        color_map=category_cmap,
        **huffpost_scatter_config
    )

years = sorted(huffpost_embeddings.year.unique())

huffpost_embeddings_years = {
    year: huffpost_embeddings[huffpost_embeddings.year == year] for year in years
}

huffpost_scatters_years = {
    year: create_annual_scatter(year) for year in years
}

compose(
    [(scatter, year) for year, scatter in huffpost_scatters_years.items()],
    sync_view=True,
    rows=3,
    cols=4,
    row_height=320
)

In [None]:
huffpost_embeddings_years['2012'].iloc[huffpost_scatters_years['2012'].selection()]
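The `.iloc` call above matters because each facet is a filtered slice that keeps the row labels of the full data frame. Assuming, as that call implies, that the selection is reported as positions within the plotted facet, positional indexing is the safe way to look the rows up. Here is a small sketch with a made-up stand-in data frame (`df`, `facets`, and `selected` are hypothetical):

```python
import pandas as pd

# Hypothetical miniature stand-in for the news data frame
df = pd.DataFrame({
    "year": ["2012", "2013", "2012", "2013", "2012"],
    "x": [0.1, 0.2, 0.3, 0.4, 0.5],
    "y": [0.5, 0.4, 0.3, 0.2, 0.1],
})

# Facet by year; each facet keeps the original row labels (0, 2, 4 for 2012)
facets = {year: group for year, group in df.groupby("year")}

# Pretend the 2012 scatter reported the first and third plotted points
selected = [0, 2]

# .iloc looks rows up by position within the facet, not by original label
picked = facets["2012"].iloc[selected]
print(picked.index.tolist())  # [0, 4]
```

Using `.loc[selected]` here would instead look up the labels 0 and 2, which are the first and second rows of the 2012 facet rather than the first and third.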

## Synchronize Everything

### Visualizing Multiple Properties of the Same Dataset

Sometimes it can be useful to synchronize everything: the view, selection, and hover state. This is typically the case when one simply wants to explore many different properties at the same time.

This is common when exploring single-cell data. The purpose of the embedding visualization is not only to inform the biologist about cell type clusters but also to allow them to verify cluster validity by visually correlating clusters with known cell type marker expressions. To demonstrate this use case, let's take another look at the single-cell data from [Mair et al., 2022](https://www.nature.com/articles/s41586-022-04718-w) that was clustered with [Ozette](https://www.ozette.com/)'s [FAUST method](https://doi.org/10.1016/j.patter.2021.100372) and transformed with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022) prior to embedding with [UMAP](https://umap-learn.readthedocs.io/en/latest/).

Given the frequent use of this scenario and since I'm lazy, `jscatter` offers a short-hand function called `link` to synchronize everything for you.

In [None]:
from itertools import cycle
from jscatter import glasbey_light, link

mair_2022_tumor_ozette = pd.read_parquet("./data/mair-2022-tumor-006-ozette.pq")

mair_2022_colormap = dict(zip(mair_2022_tumor_ozette.faustLabels.unique(), cycle(glasbey_light[1:])))
mair_2022_colormap["0_0_0_0_0"] = (0.2, 0.2, 0.2, 1.0)

mair_2022_scatter_config = dict(
    data=mair_2022_tumor_ozette,
    x='umapX',
    y='umapY',
    background_color="#111111",
    axes=False,
)

link([
    (Scatter(color_by='faustLabels', color_map=mair_2022_colormap, **mair_2022_scatter_config), "Cell Types"),
    (Scatter(color_by='CD4_Windsorized', legend=True, color_labeling=("low", "high"), **mair_2022_scatter_config), "CD4 Expression"),
    (Scatter(color_by='CD8_Windsorized', legend=True, color_labeling=("low", "high"), **mair_2022_scatter_config), "CD8 Expression"),
    (Scatter(color_by='CD19_Windsorized', legend=True, color_labeling=("low", "high"), **mair_2022_scatter_config), "CD19 Expression"),
])

---

## Next

Next up, we'll see how to use everything we've learned so far to build bespoke interfaces for exploring large-scale datasets, starting with the LLM-based news article dataset we just explored.

➡️ [Building a Bespoke Interface for Exploring LLM-Based Sentence Embeddings.ipynb](3-LLM-Sentence-Embedding.ipynb)