# Exploring LLM-based Sentence Embeddings

In this notebook we're going to see how we can build bespoke interfaces using `jscatter` and `ipywidgets` for exploring text embeddings. For this, we're going to use the [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset?resource=download) from [Rishabh Misra, 2022](https://arxiv.org/abs/2209.11429) again. But this time, we're going to make the scatter plot configurable and display the actual news articles that correspond to the data points.

## Setup Data & Scatter Plot Config

This setup is the same as in [2-Composing-Linking-Scatter-Plots.ipynb](#Synchronizing-Views).

In [None]:
!curl -L -C - -o data/huffpost-embeddings.pq https://storage.googleapis.com/flekschas/jupyter-scatter-tutorial/huffpost-embeddings.pq

In [None]:
import numpy as np
import pandas as pd
huffpost_embeddings = pd.read_parquet('data/huffpost-embeddings.pq')
huffpost_embeddings.year = huffpost_embeddings.year.astype('category')
# Centered log-ratio of the length: log(length / geometric mean of all lengths)
huffpost_embeddings['length_clr'] = np.log(huffpost_embeddings.length.values / np.exp(np.mean(np.log(huffpost_embeddings.length.values))))
huffpost_embeddings.head(3)
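
The `length_clr` column is a centered log-ratio: each length is divided by the geometric mean of all lengths and then log-transformed, so values near zero mark typical lengths. Below is a minimal standalone sketch of the same computation on a few made-up lengths. The `centered_log_ratio` helper name is ours and not part of the notebook.

```python
import math

def centered_log_ratio(values):
    """Return log(v / geometric_mean(values)) for each positive value."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [log_v - mean_log for log_v in logs]

lengths = [20, 40, 80]  # made-up lengths; their geometric mean is 40
clr = centered_log_ratio(lengths)
```

Because the values are centered in log space, they always sum to zero, which makes a diverging colormap a natural fit.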

In the following we're setting up colormaps and the base configuration for `jscatter`.

In [None]:
from jscatter import glasbey_light

category_cmap = { cat: glasbey_light[i] for i, cat in enumerate(sorted(huffpost_embeddings.category.unique())) }
month_cmap = { 'January': '#3C33FF', 'February': '#4587E8', 'March': '#6FCFF1', 'April': '#40E52C', 'May': '#9CFA0B', 'June': '#B7F113', 'July': '#FFFF34', 'August': '#FCD66F', 'September': '#FFAEBC', 'October': '#FF5DFF', 'November': '#C75DFF', 'December': '#AD7EFF' }
season_cmap = { 'Spring': '#9CFA0B', 'Summer': '#FFFF34', 'Fall': '#FF5DFF', 'Winter': '#3C33FF' }
year_cmap = { '2012': '#ffffe0', '2013': '#ffefc1', '2014': '#ffdfa8', '2015': '#ffcc92', '2016': '#ffba81', '2017': '#ffa875', '2018': '#ff926c', '2019': '#ff7d65', '2020': '#ff635e', '2021': '#ff4251', '2022': '#ff0000' }

huffpost_scatter_config = dict(x='x', y='y', axes=False, background_color='#111111')
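
Indexing a palette by position, as done for `category_cmap`, fails once there are more categories than colors, assuming the palette is a plain list. A minimal sketch of a more forgiving variant that reuses colors when the palette runs out is shown below. The `build_color_map` helper, the three-color palette, and the category names are purely illustrative.

```python
from itertools import cycle

def build_color_map(categories, palette):
    """Map each unique category (sorted) to a palette color, reusing colors if needed."""
    return dict(zip(sorted(set(categories)), cycle(palette)))

palette = ['#ff0000', '#00ff00', '#0000ff']  # stand-in for a real palette
cmap = build_color_map(['SPORTS', 'ARTS', 'TECH', 'ARTS', 'FOOD'], palette)
# Four unique categories but only three colors: 'TECH' reuses the first color
```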

Next, we're creating a `Scatter` instance for visualizing the news articles by category as in the previous notebook.

In [None]:
from jscatter import Scatter
scatter = Scatter(
    data=huffpost_embeddings,
    color_by='category',
    color_map=category_cmap,
    height=720,
    legend=True,
    **huffpost_scatter_config
)
scatter.show()

This is the same scatter plot we saw before. While it's useful, it would be even cooler if we could easily change the color settings and facet the news articles to better see trends. We can do this with `ipywidgets` as shown in the following section.

## Connecting `jscatter` with `ipywidgets`

To enhance the exploration, we're going to introduce two drop-down menus for:

1. Changing the color settings
2. Filtering down the dataset

Additionally, we're going to use the pandas DataFrame's pretty-print functionality to show the actual news headlines related to the selected data points.

In [None]:
from ipywidgets import Dropdown

categorical_variables = ["category", "year", "month", "season"]
continuous_variables = ["length_clr"]

categories = [list(map(lambda val: f"{cat}:{val}", sorted(huffpost_embeddings[cat].unique()))) for cat in categorical_variables]
categories = [item for sublist in categories for item in sublist]

select_color = Dropdown(options=categorical_variables + continuous_variables, value="category", description="Color by")
select_filter = Dropdown(options=["-"] + categories, value="-", description="Filter to")
select_facet = Dropdown(options=["-"] + categorical_variables, value="-", description="Facet by")
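
The filter options pack a column name and a value into a single `column:value` string, which has to be split apart again when the selection changes. The following sketch illustrates both directions on made-up data. `encode_options` and `decode_option` are hypothetical helper names; splitting only once from the left is our choice so that values containing a colon stay intact.

```python
def encode_options(values_by_column):
    """Flatten {column: values} into 'column:value' strings, sorted per column."""
    return [
        f"{column}:{value}"
        for column, values in values_by_column.items()
        for value in sorted(set(values))
    ]

def decode_option(option, placeholder="-"):
    """Return (column, value), or (None, None) for the placeholder."""
    if option == placeholder:
        return (None, None)
    column, value = option.split(":", 1)
    return (column, value)

options = encode_options({"season": ["Winter", "Fall"], "year": [2013, 2012]})
```

Note that the decoded value is always a string, even if the original column held numbers.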

The most "complex" aspect of building the bespoke interface involves setting up the change event handlers and updating the scatter plot appropriately.

In [None]:
from ipywidgets import Box, HTML, Output

"""
The following part listens to selection changes, prints the related news headlines,
and captures the print output so we can display it next to the scatter plot.
"""
table = Output()
@table.capture(clear_output=True)
def on_selection_change(change):
    display(huffpost_embeddings.iloc[change.new][["category", "headline"]].style.hide(axis='index'))

scatter.widget.observe(on_selection_change, names=["selection"])

"""
The following part sets up the event handlers for changing the point color
"""
def get_cmap(color):
    if color == "category":
        return (category_cmap, None)
    if color == "month":
        return (month_cmap, None)
    if color == "season":
        return (season_cmap, None)
    if color == "year":
        return (year_cmap, None)
    if color == "length_clr":
        return ("coolwarm", [-1, 1])
    return ("auto", None)

def on_color_change(change):
    # Use `cmap` rather than `map` to avoid shadowing the built-in
    cmap, norm = get_cmap(change.new)
    scatter.color(by=change.new, map=cmap, norm=norm)

select_color.observe(on_color_change, names=["value"])

"""
The final setup involves setting up the event handlers for when the filter
setting changes
"""
def on_filter_change(change):
    if change.new == "-":
        scatter.filter(None)
        return
    # Split only once so values that contain ":" stay intact
    cat, val = change.new.split(":", 1)
    # Compare as strings because the dropdown value is always a string
    scatter.filter(huffpost_embeddings[huffpost_embeddings[cat].astype(str) == val].index)

select_filter.observe(on_filter_change, names=["value"])
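
Because the filter value is parsed from a string, it only matches columns whose values are strings too. A column such as `year` may hold integers instead (we have not verified the dtype in this dataset). Under that assumption, comparing against the string form of the column is a safer way to find the matching rows. `matching_index` is a hypothetical helper and the small DataFrame is made up.

```python
import pandas as pd

def matching_index(df, column, value):
    """Return the index labels of rows whose column, rendered as text, equals value."""
    return df.index[df[column].astype(str) == str(value)]

# Integer years stored as a categorical column
df = pd.DataFrame({"year": pd.Series([2012, 2013, 2012]).astype("category")})
idx = matching_index(df, "year", "2012")  # the string "2012" matches rows 0 and 2
```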

In the last step, we have to compose all the pieces. Thankfully, `ipywidgets` makes this straightforward with its `AppLayout`, `HBox`, and `VBox` widgets.

In [None]:
from ipywidgets import AppLayout, HTML, HBox, VBox

VBox([
    AppLayout(
        center=HBox([select_color, select_filter]),
        right_sidebar=HTML(value="Selected news articles:")
    ),
    AppLayout(center=scatter.show(), right_sidebar=table)
])

Try changing the color and filter settings to see how new patterns emerge. Also try selecting some points to view the related news headlines.

An overall insight is that there are many small clusters of news articles related to a specific topic or person.

For instance:

1. Black Friday related news articles

In [None]:
select_color.value = "season"
black_friday = [181658, 180634, 112847, 4451, 82237]
scatter.selection(black_friday).zoom(black_friday, padding=20)

Another example is hurricane-related news articles:

In [None]:
hurricane = [26535, 77027, 26235, 68812, 25048, 25129, 25972, 26092]
scatter.selection(hurricane).zoom(hurricane, padding=20)

2. Articles related to coronaviruses. These articles are much more frequent in 2021, but there are also some earlier articles related to MERS.

In [None]:
select_color.value = "year"
corona_covid = [5344, 2116, 5188, 3204, 5228, 5197, 160843, 5172, 5202, 5252, 4975]
scatter.selection(corona_covid).zoom(corona_covid, padding=50)

3. Political articles tend to have slightly longer headlines than wellness/healthy living articles.

In [None]:
select_color.value = "length_clr"
long_headlines = [14746, 72340, 52809, 67847, 51324]
short_headlines = [205616, 131555, 187525, 176811, 191559, 170954, 140922, 198538, 115228, 137532]
scatter.selection(long_headlines + short_headlines).zoom(None)
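
A visual impression like this can be cross-checked numerically by comparing the median length per category. The sketch below uses a made-up DataFrame in place of `huffpost_embeddings`. The numbers are invented purely to show the mechanics, and the `POLITICS` label is an assumed category name.

```python
import pandas as pd

# Made-up rows; in the notebook this would be `huffpost_embeddings`
articles = pd.DataFrame({
    "category": ["POLITICS", "POLITICS", "WELLNESS", "WELLNESS", "WELLNESS"],
    "length": [90, 70, 50, 60, 40],
})

# Median length per category
median_length = articles.groupby("category")["length"].median()
```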

4. There are two broad categories of "divorce" articles. One covers celebrity divorce news and the other consists of advice columns.

In [None]:
select_filter.value = "category:DIVORCE"
celebraty_divorce = [155148, 184805, 193010, 197669, 194119, 195507, 155438, 134902,]
divorse_advice = [136070, 166976, 169044, 205926, 147585, 155480, 200162, 201555]
scatter.selection(celebraty_divorce + divorse_advice).zoom(None)

5. "Wellness" articles seem to have disappeared after `2014` and reappeared during COVID.

In [None]:
select_color.value = "year"
select_filter.value = "category:WELLNESS"
# Clear the selection from the divorce example, since those points are now filtered out
scatter.selection([]).zoom(None)
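
To back up a temporal pattern like this with numbers, one could count the articles of a category per year. The following is a minimal sketch on made-up rows; with the real data, the same expression would be applied to `huffpost_embeddings`.

```python
import pandas as pd

# Made-up rows standing in for the real dataset
articles = pd.DataFrame({
    "category": ["WELLNESS", "WELLNESS", "DIVORCE", "WELLNESS", "DIVORCE"],
    "year": [2012, 2013, 2013, 2013, 2014],
})

# Number of "WELLNESS" articles per year, in chronological order
per_year = (
    articles.loc[articles["category"] == "WELLNESS", "year"]
    .value_counts()
    .sort_index()
)
```

Years without any matching article simply do not appear in the result, so gaps in the index correspond to gaps in coverage.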

---

## Next

Next up, we'll show you how to take the integration even further by using `jscatter` with a custom widget for the Fashion MNIST example we saw previously.

➡️ [Building a Bespoke Interface for Exploring Fashion MNIST](3-Fashion-MNIST.ipynb)