# Composing and Linking Multiple Scatter Plots

In this notebook, we'll learn:
1. [About `jscatter`'s API for plotting multiple scatter plots](#API-for-Composing-Multiple-Scatter-Plots)
2. [How to synchronize selections using Fashion MNIST embeddings](#Synchronizing-the-Selection-and-Hover)
3. [How to synchronize views using LLM-based sentence embeddings](#Synchronizing-Views)

---

In [9]:
from jscatter import Scatter, compose, link

## API for Composing Multiple Scatter Plots

We'll start out with a very simple example to get familiar with the API.

Below, we place two scatter plots next to each other using `jscatter`'s `compose()` function.

In [20]:
from numpy.random import rand

a = Scatter(x=rand(500), y=rand(500))
b = Scatter(x=rand(5000), y=rand(5000))

compose([a, b])


By default, `jscatter` arranges scatter plots in a single row, but we can customize the layout, for example with the `rows` argument.

In [12]:
compose([a, b], rows=2)

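
To get a feel for how `rows` and `cols` relate to the number of plots, the arithmetic can be sketched in plain Python. The helper `grid_shape` below is hypothetical and not part of `jscatter`. It simply assumes that a grid needs at least as many cells as there are plots, and `jscatter`'s actual layout logic may differ.

```python
from math import ceil


def grid_shape(n_plots, rows=1, cols=None):
    """Hypothetical helper (not part of jscatter): return (rows, cols) that fit n_plots."""
    if n_plots < 1 or rows < 1:
        raise ValueError("n_plots and rows must be positive")
    if cols is None:
        # Derive the column count from the requested number of rows
        cols = ceil(n_plots / rows)
    if rows * cols < n_plots:
        raise ValueError(f"a {rows}x{cols} grid cannot hold {n_plots} plots")
    return rows, cols


print(grid_shape(2))          # two plots in a single row
print(grid_shape(2, rows=2))  # two plots stacked vertically
print(grid_shape(4, rows=2))  # four plots in a 2x2 grid
```

Under this assumption, a 2x2 grid like the one used for four plots follows directly from `rows=2`.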

So far so good, but the fun part starts when we link/synchronize the scatter plots' views and selections.

## Synchronizing the Selection and Hover

To demonstrate the usefulness of linked/synchronized selections, let's look at some more interesting data: multiple embeddings of the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist).

In [5]:
!curl -L -C - -o data/fashion-mnist-embeddings.pq https://storage.googleapis.com/flekschas/jupyter-scatter-tutorial/fashion-mnist-embeddings.pq



In [32]:
import pandas as pd
fashion_mnist_embeddings = pd.read_parquet('data/fashion-mnist-embeddings.pq')
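class_names = {0: "T-shirt/top", 1: "Trouser", 2: "Pullover", 3: "Dress", 4: "Coat", 5: "Sandal", 6: "Shirt", 7: "Sneaker", 8: "Bag", 9: "Ankle boot"}
# Rename the numeric class IDs and convert only the class column to categorical so the coordinates stay numeric
fashion_mnist_embeddings["class"] = fashion_mnist_embeddings["class"].replace(class_names).astype("category")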
fashion_mnist_embeddings.head(3)

Unnamed: 0,pcaX,pcaY,tsneX,tsneY,umapX,umapY,caeX,caeY,class
0,-0.207672,0.619046,-0.512748,0.862887,-0.848567,-0.177148,-0.792607,-0.95234,Ankle boot
1,0.42387,-0.392556,0.556802,-0.625932,0.973414,-0.103313,-0.493724,-0.050538,T-shirt/top
2,-0.455815,-0.708062,-0.037304,-0.186733,0.463554,-0.061681,-0.372132,-0.272005,T-shirt/top


The dataframe contains pre-embedded x/y locations of each image and the associated class. For this example we embedded the images using:

1. [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
2. [t-SNE](https://opentsne.readthedocs.io/en/stable/)
3. [UMAP](https://umap-learn.readthedocs.io/en/latest/)
4. [A convolutional autoencoder](https://blog.keras.io/building-autoencoders-in-keras.html)

Since we're going to visualize each embedding using the same visual encoding, we can specify most things upfront:

In [26]:
config = dict(
    background_color='#111111',
    color_by='class',
    color_map={
        "T-shirt/top": '#FFFF00',
        "Trouser": '#1CE6FF',
        "Pullover": '#FF34FF',
        "Dress": '#FF4A46',
        "Coat": '#008941',
        "Sandal": '#006FA6',
        "Shirt": '#A30059',
        "Sneaker": '#FFDBE5',
        "Bag": '#7A4900',
        "Ankle boot": '#0000A6'
    },
    legend=True,
    axes=False,
    zoom_on_selection=True, # To automatically zoom to selected points
)
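
Since `config` is an ordinary dictionary, a single plot can deviate from the shared settings without changing them for the others. The following is a minimal sketch of that pattern using a shortened stand-in for `config`. The function `describe` is a hypothetical placeholder for any callable that accepts keyword arguments, such as `Scatter`.

```python
# Shortened stand-in for the shared settings (for illustration only)
config = dict(background_color='#111111', legend=True, axes=False)

# Later keys win, so this copy differs from `config` only in `legend`
config_no_legend = {**config, 'legend': False}


def describe(**kwargs):
    """Hypothetical stand-in for a constructor that accepts keyword arguments."""
    return sorted(kwargs.items())


# The shared dictionary itself is left untouched
print(config['legend'])
print(describe(x='pcaX', y='pcaY', **config_no_legend))
```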

Finally, we create four `Scatter` instances and compose them in a 2x2 grid. This time, however, we link/synchronize the selection and hover state across all four instances, which works because each scatter plot references the same Fashion MNIST images.

In [27]:
pca = Scatter(data=fashion_mnist_embeddings, x='pcaX', y='pcaY', **config)
tsne = Scatter(data=fashion_mnist_embeddings, x='tsneX', y='tsneY', **config)
umap = Scatter(data=fashion_mnist_embeddings, x='umapX', y='umapY', **config)
cae = Scatter(data=fashion_mnist_embeddings, x='caeX', y='caeY', **config)

compose(
    [pca, tsne, umap, cae],
    sync_selection=True,
    sync_hover=True,
    rows=2,
    row_height=240
)


Because I like to see the selected points within their local neighborhood, I activated `zoom_on_selection`. In combination with the synced selection, this makes all scatter plots automatically zoom to selected points.

## Synchronizing Views

Beyond synchronizing the selection, `jscatter` also supports view synchronization. It does not make much sense to activate this for the Fashion MNIST example above because each scatter plot shows a different embedding space. However, we might want to explore a large dataset where all points share the same latent space. In this case, it can be interesting to facet the dataset for comparison. And since the space is the same, it is useful to synchronize the view.

> 🚨 LLM Alert

In the next example, we're going to compare news articles from 2012-2022 by their headlines. For that we're using the fantastic [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset?resource=download) from [Rishabh Misra, 2022](https://arxiv.org/abs/2209.11429). We embedded the headlines, the short descriptions, and the combination of both using the pretrained [all-MiniLM-L6-v2 sentence transformer from 🤗](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) followed by [UMAP](https://umap-learn.readthedocs.io/en/latest/).

In [None]:
!curl -L -C - -o data/huffpost-embeddings.pq https://storage.googleapis.com/flekschas/jupyter-scatter-tutorial/huffpost-embeddings.pq

In [6]:
import pandas as pd
huffpost_embeddings = pd.read_parquet('data/huffpost-embeddings.pq')
huffpost_embeddings.head(3)

Unnamed: 0,link,headline,category,short_description,authors,date,year,month,season,umap_headlines_x,umap_headlines_y,umap_short_descriptions_x,umap_short_descriptions_y,umap_combined_x,umap_combined_y
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23,2022,September,Fall,3.966085,0.465991,-1.86527,5.458853,-1.920846,5.376357
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23,2022,September,Fall,3.325465,4.053137,0.673807,7.048808,0.350905,7.271127
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23,2022,September,Fall,10.177062,5.699961,3.140072,5.262107,2.455263,4.685835


In [7]:
from jscatter import glasbey_light

category_cmap = { cat: glasbey_light[i] for i, cat in enumerate(huffpost_embeddings.category.unique()) }

huffpost_scatter_config = dict(axes=False, background_color='#111111')
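
The comprehension above indexes the palette by position, so it would raise an `IndexError` if the palette were shorter than the number of categories. One way to guard against that, sketched here with a made-up three-color palette and a hypothetical `build_color_map` helper, is to cycle through the palette so that colors are reused once it is exhausted.

```python
from itertools import cycle


def build_color_map(categories, palette):
    """Hypothetical helper: map each category to a color, reusing colors if needed."""
    if not palette:
        raise ValueError("palette must contain at least one color")
    # dict.fromkeys removes duplicates while preserving first-seen order
    unique_categories = list(dict.fromkeys(categories))
    # zip stops at the last category, so the endless cycle is safe here
    return dict(zip(unique_categories, cycle(palette)))


# Made-up palette and categories for illustration only
palette = ['#d60000', '#8c3bff', '#018700']
categories = ['COMEDY', 'U.S. NEWS', 'COMEDY', 'WELLNESS', 'SPORTS']

cmap = build_color_map(categories, palette)
print(cmap)
```

Reused colors make categories harder to tell apart, so this is a fallback rather than a replacement for a sufficiently large palette.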

In [28]:
huffpost_scatter = Scatter(
    data=huffpost_embeddings,
    x='umap_headlines_x',
    y='umap_headlines_y',
    color_by='category',
    color_map=category_cmap,
    height=640,
    legend=True,
    **huffpost_scatter_config
)
huffpost_scatter.show()


As mentioned above, this dataset consists of news articles from 2012-2022. An interesting question is whether the distribution of published articles has changed over the years. To find out, we're going to facet the data frame by year and plot each year as an individual scatter plot.

In [36]:
def create_annual_scatter(year):
    return Scatter(
        data=huffpost_embeddings_years[year],
        x='umap_headlines_x',
        y='umap_headlines_y',
        color_by='category',
        color_map=category_cmap,
        **huffpost_scatter_config
    )

years = sorted(huffpost_embeddings.year.unique())

huffpost_embeddings_years = {
    year: huffpost_embeddings[huffpost_embeddings.year == year] for year in years
}

huffpost_scatters_years = {
    year: create_annual_scatter(year) for year in years
}

# Pass a list rather than a dict view so the scatters can be indexed
compose(list(huffpost_scatters_years.values()), sync_view=True, rows=3, cols=4, row_height=320)

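
The per-year dictionary can also be built with pandas' `groupby`, which splits the frame in a single pass instead of filtering once per year. The sketch below uses a tiny made-up frame in place of `huffpost_embeddings` and assumes that `year` holds strings.

```python
import pandas as pd

# Tiny made-up frame standing in for `huffpost_embeddings`
df = pd.DataFrame({
    'year': ['2012', '2013', '2012', '2014'],
    'umap_headlines_x': [8.1, 3.3, 10.2, 4.0],
    'umap_headlines_y': [1.4, 4.1, 5.7, 0.5],
})

# groupby sorts the group keys by default, so the years come out in order
frames_by_year = {year: frame for year, frame in df.groupby('year')}

print(list(frames_by_year))
print(len(frames_by_year['2012']))
```

Each frame keeps the row labels of the original data frame, just like the boolean filtering used above.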

In [38]:
huffpost_embeddings_years['2012'].iloc[huffpost_scatters_years['2012'].selection()]

Unnamed: 0,link,headline,category,short_description,authors,date,year,month,season,umap_headlines_x,umap_headlines_y,umap_short_descriptions_x,umap_short_descriptions_y,umap_combined_x,umap_combined_y
184608,https://www.huffingtonpost.comhttp://vitals.nb...,"Extra Pounds May Put You In The Hospital, Stud...",WELLNESS,Regardless of lifestyle and other health-relat...,,2012-10-22,2012,October,Fall,8.070447,1.427714,-1.274007,2.001766,-1.355832,1.925894
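
The cell above uses `iloc` rather than `loc`, and the row label `184608` in its output hints at why: filtering by year keeps the labels of the full data frame. Assuming `selection()` returns positions relative to the data passed to that particular `Scatter`, positional indexing is the matching lookup. A small sketch with made-up data and a hypothetical selection:

```python
import pandas as pd

# Made-up data standing in for the full data frame
df = pd.DataFrame({
    'year': ['2022', '2012', '2022', '2012'],
    'headline': ['a', 'b', 'c', 'd'],
})

facet = df[df.year == '2012']  # keeps the original row labels 1 and 3
selection = [1]                # hypothetical selection: second point of the facet

print(facet.index.tolist())

# Positional lookup returns the second row of the facet
by_position = facet.iloc[selection].headline.tolist()
# Label lookup would return the row labeled 1, which is a different point
by_label = facet.loc[selection].headline.tolist()

print(by_position, by_label)
```

In this toy example the two lookups disagree, which is why mixing up positions and labels on a faceted frame can silently return the wrong rows.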


---

## Next

Next up, we'll see how to use everything we've learned so far to build bespoke interfaces for exploring large-scale datasets, starting with the LLM-based news article dataset we just explored.

➡️ [Building a Bespoke Interface for Exploring LLM-Based Sentence Embeddings.ipynb](3-LLM-Sentence-Embedding.ipynb)