<a href="https://colab.research.google.com/github/componavt/neural_synset/blob/master/src/dataset/multi_label_and_hypothesis_template_%F0%9F%A4%97_Zero_Shot_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers



In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline("zero-shot-classification")
# classifier = pipeline("zero-shot-classification", device=0) # to utilize GPU

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



We can use this pipeline by passing in a sequence and a list of candidate labels. By default, the pipeline assumes that exactly one of the candidate labels is true and returns one score per label, with the scores adding up to 1.

In [None]:
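# translation: "to carry out an activity, to work, to act in some field"
sequence = "осуществлять деятельность, работать, действовать в какой-нибудь области"
# translation: "to work", "to wage war", "rudeness"
candidate_labels = ["работать", "воевать", "грубость"]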

classifier(sequence, candidate_labels)

{'sequence': 'осуществлять деятельность, работать, действовать в какой-нибудь области',
 'labels': ['работать', 'воевать', 'грубость'],
 'scores': [0.8228265643119812, 0.14131566882133484, 0.03585772588849068]}
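
The "scores add up to 1" behaviour can be illustrated with a minimal sketch. It assumes that, in single-label mode, the per-label entailment logits are normalized with a softmax across the candidate labels. The function name `single_label_scores` and the logit values are hypothetical and are not output from the model above.

```python
import math

def single_label_scores(entailment_logits):
    """Softmax across the per-label entailment logits, so the scores sum to 1."""
    peak = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical entailment logits for three candidate labels (not real model output).
logits = [2.1, 0.3, -1.0]
scores = single_label_scores(logits)
print(scores, sum(scores))
```

Because the normalization runs across labels, raising the score of one label necessarily lowers the others.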

To do multi-label classification, where more than one label may apply, pass `multi_label=True`. In this case the scores are independent of one another: each falls between 0 and 1, and they no longer need to add up to 1.

In [None]:
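# translations: "to carry out an activity, to work, to act in some field";
# "to strive: to perform a feat in something, often said of a daily struggle"
sequences = ["осуществлять деятельность, работать, действовать в какой-нибудь области",
             "подвизаться совершать подвиг в чём-либо, часто о ежедневном борении"]
# translations: "slang", "bookish", "ironic", "official"
candidate_labels = ["жаргонный", "книжный", "ироничный", "официальный"]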

classifier(sequences, candidate_labels, multi_label=True)

[{'sequence': 'осуществлять деятельность, работать, действовать в какой-нибудь области',
  'labels': ['официальный', 'ироничный', 'книжный', 'жаргонный'],
  'scores': [0.5514466762542725,
   0.48188477754592896,
   0.47814375162124634,
   0.31762558221817017]},
 {'sequence': 'подвизаться совершать подвиг в чём-либо, часто о ежедневном борении',
  'labels': ['официальный', 'ироничный', 'жаргонный', 'книжный'],
  'scores': [0.45808038115501404,
   0.43311503529548645,
   0.4227086901664734,
   0.321784108877182]}]
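
The independence of the multi-label scores can be sketched as follows. The sketch assumes each label is scored on its own by a softmax over just that label's contradiction and entailment logits, keeping the entailment probability. The function name `multi_label_scores` and the logit values are hypothetical.

```python
import math

def multi_label_scores(logit_pairs):
    """Score each label independently.

    logit_pairs holds one (contradiction_logit, entailment_logit) tuple per label.
    """
    scores = []
    for contradiction, entailment in logit_pairs:
        peak = max(contradiction, entailment)  # for numerical stability
        exp_contradiction = math.exp(contradiction - peak)
        exp_entailment = math.exp(entailment - peak)
        scores.append(exp_entailment / (exp_contradiction + exp_entailment))
    return scores

# Hypothetical logits for three labels (not real model output).
logit_pairs = [(-0.5, 0.7), (0.2, 0.1), (1.0, -1.0)]
scores = multi_label_scores(logit_pairs)
print(scores, sum(scores))
```

Each score depends only on its own pair of logits, so the total is generally not 1.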

\+ `hypothesis_template` (refining the query through a hypothesis template).

By providing a more precise hypothesis template, we can try to get a more accurate classification. The example below applies a Russian-language template to the first sequence only.

In [None]:
sequences = ["осуществлять деятельность, работать, действовать в какой-нибудь области",
             "подвизаться совершать подвиг в чём-либо, часто о ежедневном борении"]
candidate_labels = ["жаргонный", "книжный", "ироничный", "официальный"]

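# translation: "Assessment of the subject area {}."
hypothesis_template = "Оценка предметной области {}."

# `sequence` (singular) is the single string defined in an earlier cell, so one result is returned.
classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)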

{'sequence': 'осуществлять деятельность, работать, действовать в какой-нибудь области',
 'labels': ['книжный', 'официальный', 'ироничный', 'жаргонный'],
 'scores': [0.27613362669944763,
  0.25833889842033386,
  0.2516063153743744,
  0.21392114460468292]}

So how does this method work?

The underlying model is trained on the task of Natural Language Inference (NLI), which takes in two sequences and determines whether they contradict each other, entail each other, or neither.

This can be adapted to the task of zero-shot classification by treating the sequence which we want to classify as one NLI sequence (called the premise) and turning a candidate label into the other (the hypothesis). If the model predicts that the constructed premise _entails_ the hypothesis, then we can take that as a prediction that the label applies to the text. Check out [this blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html) for a more detailed explanation.

By default, the pipeline turns labels into hypotheses with the template `This example is {}.`, where `{}` is replaced with the candidate class name. This works well in many settings, but you can also customize it for your specific setting. For review sentiment analysis, for example, a more specific template such as `The sentiment of this review is {}.` can be used instead of the default.
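
The construction of premise/hypothesis pairs described above can be sketched in a few lines. This is an illustration of the idea only, not the pipeline's actual code: the helper `build_nli_pairs` is a hypothetical name, and the sketch stops before the NLI model is called.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Pair the sequence (premise) with one hypothesis per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]

pairs = build_nli_pairs(
    "Who are you voting for in 2020?",
    ["Europe", "public health", "politics"],
)
for premise, hypothesis in pairs:
    print(f"premise: {premise!r} | hypothesis: {hypothesis!r}")
```

An NLI model would then score each pair, and the entailment score for a pair is read as the score for its label.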

#### Update: Zero-shot classification in 100 languages

Interested in using the pipeline for languages other than English? We've trained a cross-lingual model on top of XLM RoBERTa which you can use by passing `model='joeddav/xlm-roberta-large-xnli'` when creating the pipeline:

In [None]:
classifier = pipeline("zero-shot-classification", model='joeddav/xlm-roberta-large-xnli')

OSError: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/joeddav/xlm-roberta-large-xnli and pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`.

You can use it with any combination of languages. For example, let's classify a Russian sentence with English candidate labels:

In [None]:
sequence = "За кого вы голосуете в 2020 году?" # translation: "Who are you voting for in 2020?"
candidate_labels = ["Europe", "public health", "politics"]

classifier(sequence, candidate_labels)

{'labels': ['politics', 'Europe', 'public health'],
 'scores': [0.9048484563827515, 0.05722189322113991, 0.03792969882488251],
 'sequence': 'За кого вы голосуете в 2020 году?'}

Now let's do the same but with the labels in French:



In [None]:
sequence = "За кого вы голосуете в 2020 году?" # translation: "Who are you voting for in 2020?"
candidate_labels = ["Europe", "santé publique", "politique"]

classifier(sequence, candidate_labels)

{'labels': ['politique', 'Europe', 'santé publique'],
 'scores': [0.9726154804229736, 0.017128489911556244, 0.010256024077534676],
 'sequence': 'За кого вы голосуете в 2020 году?'}

As discussed above, the default hypothesis template is the English `This example is {}.`. If you are working strictly within one language, it may be worthwhile to translate the template into that language:

In [None]:
sequence = "¿A quién vas a votar en 2020?"
candidate_labels = ["Europa", "salud pública", "política"]
hypothesis_template = "Este ejemplo es {}."

classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)

{'labels': ['política', 'Europa', 'salud pública'],
 'scores': [0.9109585881233215, 0.05954807624220848, 0.029493311420083046],
 'sequence': '¿A quién vas a votar en 2020?'}
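
When several languages are handled in one script, the translated templates can be kept in a small lookup table. The sketch below is one possible arrangement, not part of the library: `TEMPLATES` and `template_for` are hypothetical names, and only the English and Spanish templates come from this notebook.

```python
DEFAULT_TEMPLATE = "This example is {}."

# Templates keyed by language code; extend with your own translations.
TEMPLATES = {
    "en": DEFAULT_TEMPLATE,
    "es": "Este ejemplo es {}.",
}

def template_for(language_code):
    """Return the template for a language, falling back to the English default."""
    return TEMPLATES.get(language_code, DEFAULT_TEMPLATE)

print(template_for("es").format("política"))
print(template_for("ru").format("politics"))  # no Russian entry, so the default is used
```

The returned string can be passed to the classifier as `hypothesis_template`.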

The model is fine-tuned on XNLI, which includes 15 languages: Arabic, Bulgarian, Chinese, English, French, German, Greek, Hindi, Russian, Spanish, Swahili, Thai, Turkish, Urdu, and Vietnamese. The base model was pretrained on 85 more, so the model will work to some degree for any language in the XLM-RoBERTa training corpus (see the full list in Appendix A of the [XLM-RoBERTa paper](https://arxiv.org/abs/1911.02116)).

See the [model page](https://huggingface.co/joeddav/xlm-roberta-large-xnli) for more.