# Bulk Labelling as a Notebook

This notebook contains a convenient pattern to cluster and label new text data. The end goal is to discover intents that might be used in a virtual assistant setting. This is especially useful at an early stage and is part of the "iterate on your data" mindset. Note that this tactic won't generate "gold" labels, but it should generate something useful to help you get started.

## Dependencies 

You'll need to install a few things to get started. 

- [whatlies](https://rasahq.github.io/whatlies/)
- [human-learn](https://koaning.github.io/human-learn/)
- [ipywidgets](https://ipywidgets.readthedocs.io/en/stable/)

You can install all tools by running this line in an empty cell:

```python
%pip install "whatlies[all]" "human-learn" "ipywidgets"
```

If you're running JupyterLab < 3, note that for the widgets to work you'll also need to run these commands *before* starting Jupyter.

```bash
jupyter nbextension enable --py widgetsnbextension
jupyter labextension install @jupyter-widgets/jupyterlab-manager
```

Next, you *should* run this notebook on port 8888. If you can't, be sure to read [this comment](https://github.com/bokeh/bokeh/issues/8096#issuecomment-406815954) and set a flag that matches the port you use, for example:

```bash
export BOKEH_ALLOW_WS_ORIGIN=localhost:8889
python -m jupyter lab --port 8889 --allow-websocket-origin=localhost:8889
```

We use `whatlies` to fetch embeddings and to handle the dimensionality reduction. We use `human-learn` for the interactive labelling interface. Feel free to check the documentation of both packages to learn more. 

## Let's go

To get started we'll first import a few tools.

In [1]:
import pathlib 
import numpy as np
import pandas as pd
import ipywidgets as widgets

from whatlies import EmbeddingSet 
from whatlies.transformers import Pca, Umap
from hulearn.preprocessing import InteractivePreprocessor
from hulearn.experimental.interactive import InteractiveCharts
from whatlies.language import UniversalSentenceLanguage, LaBSELanguage

In [2]:
# If you want to use another dataset this is where you should define a new list of texts.
txt = pathlib.Path("nlu.md").read_text()
texts = list(set([t.replace(" - ", "") for t in txt.split("\n") if len(t) > 0 and t[0] != "#"]))
print(f"We're going to label {len(texts)} texts.")

We're going to label 1087 texts.
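
To make the parsing step concrete, here is a minimal sketch of the same filtering logic applied to an in-memory string. The sample content, its layout and the helper name `extract_texts` are hypothetical and not taken from the real `nlu.md`. Unlike the cell above, the sketch sorts the result, because the iteration order of a `set` of strings can differ between Python sessions.

```python
# Hypothetical sample in the shape the cell above seems to expect:
# header lines start with "#", examples contain a " - " bullet marker.
sample = """## intent:greet
 - hello there
 - hi
 - hi

## intent:bot_challenge
 - are you a bot?
"""


def extract_texts(raw):
    """Drop empty lines and '#' headers, strip the bullet marker, deduplicate."""
    lines = [line for line in raw.split("\n") if line and not line.startswith("#")]
    # Like the original, this removes every " - " substring, not only a leading one.
    cleaned = {line.replace(" - ", "") for line in lines}
    # Sorting is an addition here: it gives a reproducible order across runs.
    return sorted(cleaned)


texts = extract_texts(sample)
print(f"We're going to label {len(texts)} texts.")
```

A stable order can be handy if you want row positions to mean the same thing each time you rerun the notebook.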


Next, we're going to pick the language model of interest.

In [3]:
# The language-agnostic BERT model (LaBSE) is a good starting option,
# especially for non-English use cases, but it is a fair bit slower.
# You can swap this out for another embedding source if you like.
# lang = LaBSELanguage()
lang = UniversalSentenceLanguage(variant="large")

INFO:absl:Using /var/folders/d6/dmnhh0tx2k92pnf0fsms0_p40000gp/T/tfhub_modules to cache modules.
INFO:absl:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-large/5'.
INFO:absl:Downloaded https://tfhub.dev/google/universal-sentence-encoder-large/5, Total size: 577.10MB
INFO:absl:Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-large/5'.


In [4]:
# This is where we prepare all of the state
embset = lang[texts]
df = embset.transform(Umap(2)).to_dataframe().reset_index()
df.columns = ['text', 'd1', 'd2']
df['label'] = ''
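
If you want a feel for what this step produces without downloading a language model, the sketch below builds rows with the same `text`, `d1`, `d2` and `label` fields. It is an illustration only: random vectors stand in for real sentence embeddings, and a plain PCA (via NumPy's SVD) is swapped in for UMAP, so the resulting layout will not resemble what `whatlies` gives you. The helper name `pca_2d` is made up for this example.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)  # centre every feature
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T


rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 16))  # stand-in for real embeddings
coords = pca_2d(fake_embeddings)

# Same fields as the dataframe above: one row per text, label empty at first.
rows = [
    {"text": f"text {i}", "d1": float(x), "d2": float(y), "label": ""}
    for i, (x, y) in enumerate(coords)
]
```

The point is the shape of the state: every text gets two coordinates for plotting and an empty label that the interface fills in later.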

In [5]:
# Here's the global state object
state = {}
state['df'] = df.copy()
state['chart'] = InteractiveCharts(df.loc[lambda d: d['label'] == ''], labels=['group'])

## Showing Clusters 

The idea is that we project the text embeddings onto a two-dimensional space. For more info on the details watch [the first tutorial](https://www.youtube.com/watch?v=YsMoGd7sYMQ&t=1s&ab_channel=Rasa).

![](pipeline.png)

We'll be using Vincent's infamous [human-learn library](https://koaning.github.io/human-learn/guide/drawing-features/custom-features.html) to draw selections of 2D embeddings.

Drawing can be a bit tricky though, so pay attention. 

0. To start drawing, make sure the red ball icon is selected.
1. You'll want to **double-click** to start drawing. 
2. You can then click points together to form a polygon. 
3. Next you need to double-click to stop drawing. 

This allows you to draw polygons that the code below uses to fetch the examples you're interested in. Once you've drawn a polygon, click "Show Examples" to see a sample of your selection, then use the text box and the "Add label" button to assign a label. "Retrain" recomputes the 2D projection on the texts that are still unlabelled, and "Redraw" refreshes the chart.
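
Conceptually, a drawn polygon selects every text whose 2D point falls inside it. The sketch below shows that idea with a standard ray-casting test. It is not how `human-learn` implements the selection, and the function name, polygon and points are invented for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how often a ray going right from (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


# Hypothetical polygon (a square) and a few 2D points keyed by their text.
polygon = [(0, 0), (4, 0), (4, 4), (0, 4)]
points = {"hello": (1, 1), "bye": (5, 2), "thanks": (3, 3.5)}

selected = [text for text, (x, y) in points.items() if point_in_polygon(x, y, polygon)]
print(selected)
```

Points outside every polygon correspond to the rows where `group` equals 0 in the code below.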

In [7]:
# Show the full text in tables. None disables truncation; -1 is deprecated in pandas.
pd.set_option('display.max_colwidth', None)

def show_draw_chart(b=None):
    with out_table:
        out_table.clear_output()
    with out_chart:
        out_chart.clear_output()
        state['chart'].dataf = state['df'].loc[lambda d: d['label'] == '']
        state['chart'].charts = []
        state['chart'].add_chart(x='d1', y='d2', legend=False)

def show_examples(b=None):
    with out_table:
        out_table.clear_output()
        tfm = InteractivePreprocessor(json_desc=state['chart'].data())
        subset = state['df'].pipe(tfm.pandas_pipe).loc[lambda d: d['group'] != 0]
        display(subset.sample(min(15, subset.shape[0]))[['text']])

def assign_label(b=None):
    tfm = InteractivePreprocessor(json_desc=state['chart'].data())
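    # Boolean mask of the rows inside a drawn polygon. A mask avoids mixing
    # index labels with positions, which breaks once "Retrain" reorders rows.
    selected = state['df'].pipe(tfm.pandas_pipe)['group'].to_numpy() != 0
    state['df'].loc[selected, 'label'] = label_name.value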
    with out_counter:
        out_counter.clear_output()
        n_lab = state['df'].loc[lambda d: d['label'] != ''].shape[0]
        print(f"{n_lab}/{state['df'].shape[0]} labelled")

def retrain_state(b=None):
    keep = list(state['df'].loc[lambda d: d['label'] == '']['text'])
    umap = Umap(2)
    new_df = EmbeddingSet(*[e for e in embset if e.name in keep]).transform(umap).to_dataframe().reset_index()
    new_df.columns = ['text', 'd1', 'd2']
    new_df['label'] = ''
    # ignore_index=True gives a fresh 0..n-1 index so row lookups stay consistent.
    state['df'] = pd.concat([new_df, state['df'].loc[lambda d: d['label'] != '']], ignore_index=True)
    show_draw_chart(b)

out_table = widgets.Output()
out_chart = widgets.Output()
out_counter = widgets.Output()

label_name = widgets.Text("label name")

btn_examples = widgets.Button(
    description='Show Examples',
    icon='eye'
)

btn_label = widgets.Button(
    description='Add label',
    icon='check'
)

btn_retrain = widgets.Button(
    description='Retrain',
    icon='coffee'
)

btn_redraw = widgets.Button(
    description='Redraw',
    icon='check'
)

btn_examples.on_click(show_examples)
btn_label.on_click(assign_label)
btn_redraw.on_click(show_draw_chart)
btn_retrain.on_click(retrain_state)

show_draw_chart()
display(widgets.VBox([widgets.HBox([btn_retrain, btn_examples, btn_redraw]), 
                      widgets.HBox([out_chart, out_table])]), 
        label_name, 
        widgets.HBox([btn_label, out_counter]))

In [26]:
# This is the dataframe with the labels attached
# you can inspect it here or save it to disk.
state['df']

| | text | d1 | d2 | label |
|---|---|---|---|---|
| 0 | i want to know the company which generated you | 13.232250 | -2.044659 | |
| 1 | are you a rasa bot? | 6.183568 | 10.678663 | |
| 2 | Do you have a great day? | 15.074646 | 2.750982 | |
| 3 | where are your parents from? | 15.513592 | -4.858191 | |
| 4 | you are chatbot | 6.515159 | 10.655087 | |
| ... | ... | ... | ... | ... |
| 1082 | who is your creator | 13.236359 | -1.328454 | |
| 1083 | Hi, glad to meet you. | 8.927155 | -11.018297 | |
| 1084 | What's the weather like where I am right now? | 8.862854 | 7.777061 | |
| 1085 | IS there any near by restaurant? | 7.594832 | 21.447973 | |
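
Once some rows carry a label, you may want to turn them back into training data. The sketch below groups `(text, label)` pairs by label and writes them in a Rasa-style markdown layout. The `## intent:<name>` layout is an assumption about your target format, the helper name `to_intent_markdown` is made up, and the labels in the example are hypothetical.

```python
from collections import defaultdict


def to_intent_markdown(rows):
    """Group (text, label) pairs by label and render one markdown block per intent."""
    grouped = defaultdict(list)
    for text, label in rows:
        if label:  # skip rows that were never labelled
            grouped[label].append(text)
    blocks = []
    for intent in sorted(grouped):
        lines = [f"## intent:{intent}"] + [f"- {text}" for text in grouped[intent]]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical labels attached to a few of the texts shown above.
rows = [
    ("who is your creator", "ask_creator"),
    ("are you a rasa bot?", "bot_challenge"),
    ("you are chatbot", "bot_challenge"),
    ("Do you have a great day?", ""),
]
markdown = to_intent_markdown(rows)
print(markdown)
```

In the notebook you could pass `zip(state['df']['text'], state['df']['label'])` as `rows` and write the result to a file with `pathlib.Path(...).write_text(markdown)`.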
