# Bulk Labelling as a Notebook

This notebook shows a convenient pattern for clustering and labelling new text data. The end goal is to discover intents that might be used in a virtual assistant setting. This is especially useful at an early stage and is part of the "iterate on your data" mindset.

## Dependencies 

You'll need to install a few things to get started. 

- [whatlies](https://rasahq.github.io/whatlies/)
- [human-learn](https://koaning.github.io/human-learn/)

You can install both tools by running this line in an empty cell:

```python
%pip install "whatlies[tfhub]" "human-learn"
```

We use `whatlies` to fetch embeddings and to handle the dimensionality reduction. We use `human-learn` for the interactive labelling interface. Feel free to check the documentation of both packages to learn more. 

## Let's go

To get started we'll first import a few tools.

In [98]:
import pathlib 
import numpy as np
import tensorflow as tf
import tensorflow_hub
from whatlies.language import CountVectorLanguage, UniversalSentenceLanguage, BytePairLanguage, SentenceTFMLanguage
from whatlies.language import TFHubLanguage
from whatlies import Embedding, EmbeddingSet
from whatlies.transformers import Pca, Umap, Tsne, Lda
import json

In [14]:
import datasets
emotion_data=datasets.load_dataset('emotion')
app_data=datasets.load_dataset('app_reviews')
coronavirus_queries=datasets.load_dataset("bing_coronavirus_query_set", queries_by="country", start_date="2020-09-01", end_date="2020-09-30")





In [15]:
import pandas as pd
emotion_data=pd.DataFrame.from_dict(emotion_data['train'])
app_data=pd.DataFrame.from_dict(app_data['train'])
coronavirus_queries=pd.DataFrame.from_dict(coronavirus_queries['train'])

In [16]:
coronavirus_queries.head()

|  | Country | Date | IsImplicitIntent | PopularityScore | Query | id |
|---|---|---|---|---|---|---|
| 0 | Romania | 2020-09-01 | False | 3 | coronavirus worldometer | 1 |
| 1 | United States | 2020-09-01 | True | 1 | scdhec | 2 |
| 2 | United States | 2020-09-01 | False | 1 | n95 mask coronavirus | 3 |
| 3 | Brazil | 2020-09-01 | True | 28 | parcelamento fgts mp 927 | 4 |
| 4 | United States | 2020-09-01 | False | 1 | coronavirus colombia | 5 |


Next we will load some embedding frameworks. These can be very heavy, just so you know!

In [17]:
lang_cv  = CountVectorLanguage(10)
lang_use = TFHubLanguage('https://tfhub.dev/google/universal-sentence-encoder/4')
lang_bp  = BytePairLanguage("en", dim=300, vs=200_000)
lang_multi = TFHubLanguage('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')





In [135]:
def get_embedding(vec,text):
    return Embedding(text, vec)

def get_embeddingset(veclist,textlist):
    return EmbeddingSet(*[get_embedding(veclist[q],textlist[q]) for q in range(len(textlist))])

def prepare_data(lang,transformer,textlist=None):
    if isinstance(lang,EmbeddingSet):
        return lang.transform(transformer)
    return lang[textlist].transform(transformer)

def make_plot(lang,transformer,textlist=None):
    return prepare_data(lang,transformer,textlist).plot_interactive(annot=False).properties(width=200, height=200, title=type(lang).__name__)


Next we'll load in the texts that we'd like to embed/cluster. The goal here is to provide multiple datasets to test varying functionality of the algorithm in various cases.

In [128]:
# Original dataset: clean, idealised intent examples
txt = pathlib.Path("nlu.md").read_text()
texts = list(set([t.replace(" - ", "") for t in txt.split("\n") if len(t) > 0 and t[0] != "#"]))
# print(f"We're going to plot {len(texts)} texts.")

# Emotion dataset: tweets labelled with one of six emotions
# texts = emotion_data.text.tolist()[0:3000]

# App review dataset (this assignment overrides the `texts` loaded from nlu.md above)
texts = app_data.review.tolist()[0:3000]

# Coronavirus queries dataset
# texts = coronavirus_queries.head(3000)['Query'].tolist()

Keep in mind that it's better to start out with 1000 sentences or so. Much more might break the browser's memory in the next visual.
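
One way to stay within that limit is to deduplicate and subsample the texts before embedding them. The sketch below is only an illustration: `subsample_texts`, its `max_items` default, and the fixed seed are hypothetical choices, not part of the original notebook.

```python
import random

def subsample_texts(texts, max_items=1000, seed=42):
    """Return at most `max_items` unique, non-empty texts (hypothetical helper)."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(t.strip() for t in texts if t and t.strip()))
    if len(unique) <= max_items:
        return unique
    # A seeded generator keeps the sample reproducible between reruns.
    return random.Random(seed).sample(unique, max_items)

demo = ["hello", "hello", "  ", "how old are you?", "where are you from?"]
small = subsample_texts(demo, max_items=2)
```

In the notebook this could be applied as `texts = subsample_texts(texts)` right before plotting.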

## Showing Clusters 

![](pipeline.png)

The cell below takes the texts and passes them through different language backends. After this they are mapped to a two-dimensional space using [UMAP](https://umap-learn.readthedocs.io/en/latest/). It takes a while to plot everything (mainly because the universal sentence encoder and the transformer language models are heavy).

In [20]:
make_plot(lang_use,Umap(2),texts) | make_plot(lang_multi,Umap(2),texts) | make_plot(lang_bp,Umap(2),texts)  





What you see are three charts, one per language backend. You should notice that certain clusters have appeared. For your use case you might need to check which language backend makes the most sense.

## Note for Non-English 

The only model shown here that is English specific is the universal sentence encoder (`lang_use`). All the other ones also support other languages. For more information check the [bytepair documentation](https://nlp.h-its.org/bpemb/) and the [sentence transformer documentation](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models).

## Towards Labelling 

We'll now prepare a dataframe that we'll assign labels to. We'll do that by loading in the same text file but now into a pandas dataframe.

## Trying with French Datasets

Trying with French datasets requires us to use new embedding frameworks. The most renowned French embedding frameworks are BERT a




In [31]:
Bazaarvoice_df=pd.read_csv("Bazaarvoice.csv",low_memory=False)
Calls_df=pd.read_csv("calls_df.csv")
conversations_df =pd.read_csv("conversations.csv")

bazaarvoice_titles=Bazaarvoice_df['Review Title'].astype(str).head(2000).tolist()
bazaarvoice_texts=Bazaarvoice_df['Review Text'].astype(str).head(2000).tolist()
calls_texts=Calls_df['clean_text_cli'].astype(str).head(2000).tolist()
conversations=conversations_df[~conversations_df.message.isin(['ENGAGEMENT_RULE_TRIGGERED','NAVIGATION_CHANGED','CONVERSATION_PUSHED','AUTOMATIC_MESSAGE_SENT','CONVERSATION_CLOSED'])].astype(str).head(2000).message.tolist()

In [52]:
from sentence_transformers import SentenceTransformer
distil_model = SentenceTransformer('quora-distilbert-multilingual')

In [138]:


distil_titles=get_embeddingset(distil_model.encode(bazaarvoice_titles),bazaarvoice_titles)
distil_texts=get_embeddingset(distil_model.encode(bazaarvoice_texts),bazaarvoice_texts)
distil_calls=get_embeddingset(distil_model.encode(calls_texts),calls_texts)


make_plot(distil_titles,Umap(2)) | make_plot(distil_texts,Umap(2)) | make_plot(distil_calls,Umap(2))

In [142]:
from transformers import FlaubertModel, FlaubertTokenizer

In [148]:
flaubert_model = FlaubertModel.from_pretrained('flaubert/flaubert_base_cased')
tokenizer = FlaubertTokenizer.from_pretrained('flaubert/flaubert_base_cased')

In [146]:
# FlaubertModel is the PyTorch class, so request PyTorch tensors ("pt").
# Passing return_tensors="tf" here raises:
# AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'size'
inputs = tokenizer(bazaarvoice_texts[1], return_tensors="pt", padding=True)
outputs = flaubert_model(**inputs)

In [139]:
outputs

BaseModelOutput(last_hidden_state=tensor([[[ 0.2144,  0.1988,  0.4698,  ..., -0.6093,  0.0833, -0.2972],
         [-0.7298, -0.2909,  0.5778,  ..., -0.7779, -1.0305,  0.6958],
         [-0.4656,  1.5761,  0.7879,  ..., -1.2832, -1.3338,  0.2423],
         ...,
         [-2.1313, -2.2690,  1.0858,  ..., -2.5149,  0.6685,  1.2669],
         [-0.8647, -1.3122,  1.1420,  ..., -2.2087,  0.6550,  0.4402],
         [-0.5195, -0.1150,  1.1479,  ..., -1.3523, -0.8522, -1.1070]]],
       grad_fn=<MulBackward0>), hidden_states=None, attentions=None)
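
The output above holds one vector per token, while `get_embeddingset` expects one vector per text. A common way to bridge that gap is mean pooling: average the token vectors and skip padding positions. The sketch below uses plain Python lists to show the idea; `mean_pool` is a hypothetical helper and this is one common choice, not necessarily what the notebook author intended.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy example: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
```

With the real model, the same idea would presumably be applied to `outputs.last_hidden_state` and the attention mask returned by the tokenizer.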

In [43]:
make_plot(lang_multi,Umap(2),bazaarvoice_titles) | make_plot(lang_multi,Umap(2),calls_texts) | make_plot(lang_multi,Umap(2),conversations)

In [6]:
df = lang_use[texts].transform(Umap(2)).to_dataframe().reset_index()
df.columns = ['text', 'd1', 'd2']
df['label'] = ''
df.shape[0]

1087

We are now going to be labelling!

## Fancy Interactive Drawing!

We'll be using Vincent's infamous [human-learn library](https://koaning.github.io/human-learn/guide/drawing-features/custom-features.html) for this. First we'll need to instantiate some charts.

Next we get to draw! Drawing can be a bit tricky though, so pay attention. 

1. You'll want to double-click to start drawing. 
2. You can then click points together to form a polygon. 
3. Next you need to double-click to stop drawing. 

This allows you to draw polygons that can be used in the code below to fetch the examples that you're interested in.
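
To build some intuition for how a drawn polygon selects points, here is a minimal ray-casting sketch in plain Python. It only illustrates the idea: human-learn handles this internally and its implementation may differ. The function name, the polygon, and the points are all made up for the example.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as [(x0, y0), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical drawn polygon and 2D points (d1, d2).
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
points = [(1.0, 1.0), (5.0, 2.0), (3.5, 3.9)]
selected = [p for p in points if point_in_polygon(p[0], p[1], square)]
```

Each crossing of the ray flips the inside/outside state, so a point is inside when the number of crossings is odd.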

## Rerun

This is where we will start labelling. That also means that we might re-run this cell after we've added labels.

In [26]:
from hulearn.experimental.interactive import InteractiveCharts

charts = InteractiveCharts(df.loc[lambda d: d['label'] == ''], labels=['group'])

charts.add_chart(x='d1', y='d2')

We can now use this selection to retrieve a subset of rows. This is a quick verification to see if the points you selected indeed belong to the same cluster.

In [23]:
from hulearn.preprocessing import InteractivePreprocessor
tfm = InteractivePreprocessor(json_desc=charts.data())

df.pipe(tfm.pandas_pipe).loc[lambda d: d['group'] != 0].sample(10)

|  | text | d1 | d2 | label | group |
|---|---|---|---|---|---|
| 195 | can you tell me your age? | 8.330536 | -7.792078 |  | 1 |
| 190 | From where did you come? | 8.228561 | -4.920124 |  | 1 |
| 947 | Where did you come from? | 8.244767 | -4.900227 |  | 1 |
| 151 | what is your birthday? | 7.743145 | -7.129227 |  | 1 |
| 479 | how old are u | 7.773867 | -7.961021 |  | 1 |
| 797 | What area are you from? | 8.29266 | -5.523428 |  | 1 |
| 461 | can you tell me what number represents your age? | 8.243992 | -7.800375 |  | 1 |
| 663 | What city are you in? | 8.151293 | -5.675588 |  | 1 |
| 120 | do you know how old you are? | 7.778389 | -7.997067 |  | 1 |
| 141 | what is your exact age? | 8.266831 | -7.720518 |  | 1 |


If you're confident that you'd like to assign a label, you can do so below. 

In [24]:
label_name = 'origin'

In [25]:
idx = df.pipe(tfm.pandas_pipe).loc[lambda d: d['group'] != 0].index

df.iloc[idx, 3] = label_name

print(f"We just assigned {len(idx)} labels!")

We just assigned 100 labels!


That's it! You've just attached a label to a group of points! 
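
Because the polygon selection is applied to the whole dataframe, a later pass could overwrite labels assigned earlier if the drawn region overlaps already labelled points. The sketch below shows one way to guard against that and to track progress. It uses plain dictionaries as a stand-in for the dataframe rows; `assign_label` and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the dataframe rows: one dict per text.
rows = [
    {"text": "how old are u", "label": ""},
    {"text": "Where did you come from?", "label": ""},
    {"text": "can you help me?", "label": ""},
    {"text": "what is your birthday?", "label": "age"},
]

def assign_label(rows, selected_idx, label_name):
    """Label only the selected rows that are still unlabelled; return the count."""
    assigned = 0
    for i in selected_idx:
        if rows[i]["label"] == "":
            rows[i]["label"] = label_name
            assigned += 1
    return assigned

# Row 3 already has a label, so only row 1 is updated.
n_assigned = assign_label(rows, selected_idx=[1, 3], label_name="origin")
progress = Counter(r["label"] or "<unlabelled>" for r in rows)
```

Checking `progress` after each pass gives a quick view of how many texts are still waiting for a label.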

## Rerun 

You can now scroll up and start relabelling clusters that aren't assigned yet. Once you're confident that this works, you can export by running the final code below.

In [22]:
df.head()

|  | text | d1 | d2 | label |
|---|---|---|---|---|
| 0 | What languages can you communicate in? | 2.727605 | -0.036231 |  |
| 1 | what u can do? | 10.641682 | 5.367977 |  |
| 2 | What exactly is my name? | 14.029011 | 1.462056 |  |
| 3 | What does Rasa make? | 24.505444 | 6.782425 |  |
| 4 | can you help me? | 8.584467 | 4.582135 |  |


In [168]:
df.to_csv("first_order_labelled.csv")

## Final Notes

There are a few things to mention.

1. This method of labelling is great when you're working on version 0 of something. It'll get you a whole lot of data fast but it won't be high quality data. 
2. This method is most useful at the start of designing a virtual assistant. You've probably got data from social media that you'd like to use as a source of inspiration for intents. This is certainly a valid starting point, but be aware that the language people use on a feedback form differs from the language used in a chatbox. Again, these labels are a reasonable starting point, but they should not be regarded as ground truth.
3. Labelling is only part of the goal here. Another big part is understanding the data. This is very much a qualitative/human task. You might be able to quickly label 1000 points in 5 minutes with this technique but you'll lack an understanding if you don't take the time for it. 
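
To make the third point a habit, it can help to read a handful of random examples for every label after each labelling pass. The sketch below is one simple way to do that in plain Python; `sample_per_label`, its parameters, and the example pairs are hypothetical.

```python
import random
from collections import defaultdict

def sample_per_label(pairs, k=3, seed=0):
    """Group (text, label) pairs by label and draw up to k texts from each."""
    by_label = defaultdict(list)
    for text, label in pairs:
        by_label[label].append(text)
    rng = random.Random(seed)
    return {
        label: rng.sample(texts, min(k, len(texts)))
        for label, texts in by_label.items()
    }

# Hypothetical labelled texts.
pairs = [
    ("how old are u", "age"),
    ("what is your exact age?", "age"),
    ("do you know how old you are?", "age"),
    ("What city are you in?", "origin"),
]
samples = sample_per_label(pairs, k=2)
for label, texts in samples.items():
    print(label, texts)
```

Reading these samples is a cheap sanity check that a label really captures one coherent intent.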