# Introduction to zero-shot classification using chat data

In this notebook, I give a quick introduction to zero-shot classification for practitioners. We will cover what it is and explore a use case. I prepared this notebook after being positively surprised by my toy examples. I wanted to share what I see as a potentially game-changing technique for applying machine learning when ground-truth labels are unavailable or costly to collect.

This is not a comprehensive tutorial, nor does it discuss efficient approaches. It is more of a taste of zero-shot classification in practice for those with no experience with it. In the conclusion, I mention some takeaways on how zero-shot classification can help in feature engineering pipelines, since you can use its output as intermediary variables for other systems. These intermediary variables are arguably more explainable since you chose the candidate labels, and you can use them in simpler models that other people can scrutinize.


For now, go to Example #1 below!


* Original author: Adelson de Araujo
* Last update: 08/12/2021

## Imports and utilities

In [None]:
#hide_input
import os
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch

plt.rcParams.update({'font.size': 18, "font.family": "Times"})

In [None]:
from transformers import AutoTokenizer, pipeline
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoModelForSequenceClassification

def zeroshot_classifier():
    # Only works with English text
    tokenizer = AutoTokenizer\
                    .from_pretrained("facebook/bart-large-mnli")
    model = AutoModelForSequenceClassification\
                    .from_pretrained("facebook/bart-large-mnli")
    return pipeline(task='zero-shot-classification',
                    model=model, tokenizer=tokenizer)


# We will need this in the second example
def translator(src: str, dest: str):
    src = src.lower()
    dest = dest.lower()
    tokenizer = AutoTokenizer\
                    .from_pretrained(f"Helsinki-NLP/opus-mt-{src}-{dest}")
    model = AutoModelForSeq2SeqLM\
                    .from_pretrained(f"Helsinki-NLP/opus-mt-{src}-{dest}")
    return pipeline(task='translation',
                    model=model, tokenizer=tokenizer)


## Example #1: Reasoning or personal impression?

This is an example taken from [Fiacco & Rosé (2018)](https://dl.acm.org/doi/pdf/10.1145/3231644.3231655).

Suppose you want to classify a text as ***causal reasoning***, ***evaluation reasoning***, or a ***personal impression***.

If someone says 
* "Use of coal increases pollution", we expect the label to be ***causal reasoning***.
* "Use of wind power may not be reliable throughout the year", we expect the label to be ***evaluation reasoning***.
* "I prefer coal power", we expect the label to be ***personal impression***.

Of course, these classes are not completely mutually exclusive and could be defined more carefully, but suppose for now that they are.

We can use a pre-trained model that is able to perform zero-shot learning and generate labels without any previous training data:

In [None]:
pipe = zeroshot_classifier()

candidate_labels = ['causal reasoning', 
                    'evaluation reasoning', 
                    'personal impression']

input_texts = ["Use of coal increases pollution", 
              "Use of wind power may not be reliable throughout the year",
              "I prefer coal power"]

with torch.no_grad():
    predictions = pipe(input_texts, 
                       candidate_labels=candidate_labels)

In [None]:
#hide_input

fig, ax = plt.subplots(nrows=3, 
                       figsize=(6,10))

for i, p in enumerate(predictions):
    sns.barplot(y='labels', x='scores', data=p, ax=ax[i], 
                order=candidate_labels)
    ax[i].set_title(p['sequence'][:50])

fig.tight_layout()

## Okay, I want to know more. What is zero-shot classification?

To be more precise, zero-shot classification refers to the inference part of zero-shot learning, which is currently a hot research topic in the ML literature on transfer learning.

For a good overview of what this is, see [the answer Ian Goodfellow wrote on Quora](https://qr.ae/pGl1ss):

```
Zero-shot learning is being able to solve a task despite not having received any training examples of that task.
```

He answered the question quite some time ago, so this is not a new thing. In Wikipedia, you will see that this has been researched at least since 2008.

Further, Joe Davison wrote [a comprehensive blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html) about zero-shot learning, where he clearly explains the successes in transfer learning that allowed models to perform surprisingly well in several classification tasks.

## Why have I not heard about this before?

Zero-shot classification (inference) has become more accessible to NLP practitioners only recently, particularly through HuggingFace's [transformers](https://huggingface.co/docs/transformers/index) library and the [models](https://huggingface.co/models) page. New models keep being proposed as more interesting, robust, and semantically transferable [datasets](https://huggingface.co/datasets) become available. More people are using pre-trained models and engaging in transfer learning pipelines.


## Okay, but is labeling without training samples really doable?

Zero-shot learning is a particular form of transfer learning. There are different ways to do the job, and techniques vary between computer vision and NLP. Of course, I still want to see more studies discussing interrater reliability with these kinds of tools in a diverse set of scenarios before making a stronger argument for using them "in the wild".

In this notebook, we will walk through how you can get your hands on some unlabeled text data and label it automatically using models available from HuggingFace's `transformers`.


## Do you know exactly how/why this works?

I guess the answer depends a bit on the model you are using. Below, we test the BART model trained by Facebook, and I suggest you read the [description of the model](https://huggingface.co/facebook/bart-large-mnli) we are using here. They answer this question quite clearly, but you need to know what "NLI" is.

If you want to read my own (shorter) explanation of their explanation, here it is:

NLI stands for Natural Language Inference, and it refers to a particular classification task. Take two pieces of text, a premise and a hypothesis. The possible labels are *entailment* (when the premise implies the hypothesis), *contradiction* (when the premise contradicts the hypothesis), and *neutral* (when neither holds). Check some examples [here]. Several NLI datasets are available; SNLI and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/) are the most famous ones I know of. Some people also call the NLI task the entailment classification task.

The MultiNLI dataset is large (433k sentence pairs) and enables robust architectures to produce very good models that can be transferred to other semantically similar tasks.


## But how do models trained on NLI work for zero-shot classification?

Given the entailment, neutral, and contradiction classes for relating two pieces of text (premise and hypothesis), many other classification tasks can be recast as NLI. For example, you can take an arbitrary input text and treat it as the premise. For each `item` in `candidate_labels` you want to explore with zero-shot, you create a hypothesis such as "This sentence is about `{item}`". Suppose that your model does a great job at the entailment task. If the model predicts entailment for that pair, that indicates a good match between your input text and this label. Does that make sense?
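
The recipe above can be sketched without downloading a model. This is only a minimal illustration under stated assumptions, not the library's actual code: `build_hypotheses`, `single_label_scores`, and `multi_label_scores` are hypothetical helper names, the logits are made-up numbers standing in for what an NLI model would return, and the class order (contradiction, neutral, entailment) is assumed to match the one described on the `facebook/bart-large-mnli` model page.

```python
import math


def build_hypotheses(candidate_labels, template="This sentence is about {}."):
    """Turn each candidate label into an NLI hypothesis (hypothetical helper)."""
    return [template.format(label) for label in candidate_labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(nli_logits, entailment_index=2):
    """Softmax over the entailment logits of all labels, so scores sum to 1."""
    return softmax([row[entailment_index] for row in nli_logits])


def multi_label_scores(nli_logits, contradiction_index=0, entailment_index=2):
    """Per label: softmax over (contradiction, entailment), keep entailment."""
    return [
        softmax([row[contradiction_index], row[entailment_index]])[1]
        for row in nli_logits
    ]


candidate_labels = ["causal reasoning", "evaluation reasoning", "personal impression"]
premise = "I prefer coal power"
hypotheses = build_hypotheses(candidate_labels)

# Made-up (contradiction, neutral, entailment) logits, one row per
# (premise, hypothesis) pair. A real NLI model would produce these.
fake_logits = [
    [2.0, 0.5, -1.0],
    [1.0, 0.8, 0.0],
    [-1.5, 0.2, 2.5],
]

scores = single_label_scores(fake_logits)
best_label = candidate_labels[scores.index(max(scores))]
independent_scores = multi_label_scores(fake_logits)
```

The single-label variant forces the candidate labels to compete, which matches the "suppose these are exclusive" assumption of Example #1. The multi-label variant scores each label on its own, which may suit classes that overlap. In the `transformers` pipeline, I believe the corresponding knobs are the `hypothesis_template` and `multi_label` arguments, but check the documentation of your installed version.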

Let's go through another example to see it in action again, but in a more challenging context.

## Example #2: DOTA-2 in-game chats

If you want to go through a second example, I will use some data from DOTA-2 chats and classify each message as one of the following `candidate_labels = ['chitchat', 'game features', 'coordination', 'toxic offense', 'gender discrimination', 'religious intolerance', 'racism']`. This data is in Russian, so we need a translation step in between, in which we may lose some information. We also do very little preprocessing here, although it is required for more serious projects. Finally, it seems that there are zero-shot models available in a few other languages, such as French, Spanish, German, and even Russian. See the list of HuggingFace [models](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads).

### a. Load the data

In [None]:
df = pd.read_csv('/kaggle/input/gosuai-dota-2-game-chats/dota2_chat_messages.csv', nrows=100)
df['text'] = df['text'].fillna('')

print('Mean length of text', df['text'].apply(lambda x: len(x)).mean())
print(df.head(15))

In [None]:
%%time

df_sample = df.sample(50)

### b. Translate to English

In [None]:
translate = translator('ru', 'en')

In [None]:
%%time

with torch.no_grad():
    translated_text = translate([t[:100] for t in list(df_sample['text'])])
df_sample['text_en'] = [t['translation_text'] for t in translated_text]

In [None]:
df_sample.head(10)

### c. Load zero-shot classifier from HuggingFace's transformers library

In [None]:
candidate_labels = ['chitchat', 
                    'game features', 
                    'coordination', 
                    'toxic offense', 
                    'gender discrimination', 
                    'religious intolerance', 
                    'racism']

pipe = zeroshot_classifier()

### d. Generate and explore predictions

In [None]:
%%time

with torch.no_grad():
    predictions = pipe(list(df_sample['text_en']), 
                       candidate_labels=candidate_labels)
predictions[0]

In [None]:
#hide_input

labels = []
for p in predictions:
    labels.append(p['labels'][np.argmax(p['scores'])])

df_sample['label'] = labels
df_sample

In [None]:
#hide_input

how_many_to_plot = 10

fig, ax = plt.subplots(nrows=how_many_to_plot, 
                       figsize=(6,30))

for i, p in enumerate(random.sample(predictions, how_many_to_plot)):
    sns.barplot(y='labels', x='scores', data=p, ax=ax[i], 
                order=candidate_labels)
    ax[i].set_title(p['sequence'][:50])

fig.tight_layout()

## Concluding remarks

In case you want to know more about zero-shot learning, I encourage you to go through the following material:

* https://joeddav.github.io/blog/2020/05/29/ZSL.html
* https://arxiv.org/abs/1909.00161
* https://www.deeplearningbook.org/contents/representation.html (Section 15.2)
* https://www.aaai.org/Papers/AAAI/2008/AAAI08-132.pdf

As a takeaway, I think a well-designed zero-shot classifier (with suitable candidate labels) can be a game-changing tool for several AI projects. 

One use case is, for example, "expert systems," where the output probabilities for classes that you understand serve as intermediary features for a rule-based decision-making mechanism. Then you can write things like "if causal_reasoning is high, do action A; if evaluation_reasoning is high, do action B." I see a research opportunity in using zero-shot learning as a feature engineering step for this kind of system. Interestingly, one can automatically compute understandable features from black-box models. Also, building `candidate_labels` from some theory is great because you have more evidence to back up design choices.
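
As a rough sketch of such a rule-based layer, the snippet below maps zero-shot scores to actions. Everything in it is illustrative: `choose_action`, the 0.6 threshold, and the action names are hypothetical, and the example prediction uses made-up scores in the same dictionary shape the pipeline returned in Example #1.

```python
def choose_action(prediction, threshold=0.6):
    """Map zero-shot scores to an action with simple, inspectable rules.

    `threshold` and the action names are arbitrary placeholders.
    """
    scores = dict(zip(prediction["labels"], prediction["scores"]))
    if scores.get("causal reasoning", 0.0) >= threshold:
        return "action A"
    if scores.get("evaluation reasoning", 0.0) >= threshold:
        return "action B"
    return "no action"


# Made-up prediction with the keys 'sequence', 'labels', and 'scores'
example_prediction = {
    "sequence": "Use of coal increases pollution",
    "labels": ["causal reasoning", "evaluation reasoning", "personal impression"],
    "scores": [0.72, 0.20, 0.08],
}

action = choose_action(example_prediction)
```

Because the rules are plain `if` statements over named scores, someone without ML experience can read and challenge them, which is the explainability benefit argued for above.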

The main issue I see with this is how to evaluate whether the zero-shot labels are suitable. One way is to let the model label a set of samples and then have you or another person label the same samples manually. With the manual labels and the zero-shot ones, you can use Cohen's Kappa to measure the level of agreement. That, however, does not scale when you are testing several sets of `candidate_labels`.
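
To make the agreement check concrete, here is a small, dependency-free sketch of Cohen's Kappa. The two label lists are made-up examples, not results from this notebook, and `cohen_kappa` is a hypothetical helper written for illustration. In practice you would probably reach for an existing implementation such as `sklearn.metrics.cohen_kappa_score`.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; returning 1.0
        # here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels: a human annotator versus the zero-shot classifier
human = ["chitchat", "toxic offense", "chitchat",
         "coordination", "toxic offense", "chitchat"]
zero_shot = ["chitchat", "toxic offense", "coordination",
             "coordination", "chitchat", "chitchat"]

kappa = cohen_kappa(human, zero_shot)
```

With these made-up lists, the raters agree on 4 of 6 items, and chance agreement is 13/36, which gives a kappa of 11/23 (about 0.48).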

If you see interesting use cases for zero-shot learning or want to interact/discuss this notebook, please comment below!!