## Extended use of Hugging Face's Zero-Shot Pipeline

- In this notebook, we extend the zero-shot learning from the previous notebook by classifying custom sentences against custom labels.
- You will also see how multilingual transformer models can be used to perform tasks in many languages.

In [1]:
from transformers import pipeline

import pandas as pd

In [4]:
classifier = pipeline("zero-shot-classification", device=0)  # device=0 runs the pipeline on the first GPU

No model was supplied, defaulted to FacebookAI/roberta-large-mnli and revision 130fb28 (https://huggingface.co/FacebookAI/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.






All PyTorch model weights were used when initializing TFRobertaForSequenceClassification.

All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


We can use this pipeline by passing in a sequence and a list of candidate labels. The pipeline assumes by default that only one of the candidate labels is true, returning a list of scores for each label which add up to 1.
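The "scores add up to 1" behavior can be pictured as a softmax taken across the candidate labels. The sketch below is only an illustration, not the library's code: it assumes the model has already produced one entailment logit per label, and the logit values are made up.

```python
import math

def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical entailment logits, one per candidate label (illustrative values only)
entailment_logits = {"geography": 2.1, "delivery": 0.2}

scores = dict(zip(entailment_logits, softmax(list(entailment_logits.values()))))
print(scores)
print(sum(scores.values()))  # approximately 1.0
```

Because the normalization runs over all labels at once, adding or removing a label changes every other label's score.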

In [8]:
"""
We create a function to display our predictions from the model in a tabular form
"""
def get_predictions_score(prediction):
    pred_labels = prediction['labels']
    pred_scores = prediction['scores']
    seq = [prediction['sequence']]
    # Place sequence, labels, and scores side by side; the sequence fills only the first row
    return (
        pd.concat(
            [pd.DataFrame(seq), pd.DataFrame(pred_labels), pd.DataFrame(pred_scores)],
            axis=1,
            ignore_index=True,
        )
        .rename(columns={0: 'Sequence', 1: 'Labels', 2: 'Probability'})
        .set_index(['Sequence'])
    )

In [9]:
sequence = "Amazon is the longest river in the world"
candidate_labels = ["geography",  "delivery"]

pred = classifier(sequence, candidate_labels)
get_predictions_score(pred)

| Sequence | Labels | Probability |
|---|---|---|
| Amazon is the longest river in the world | geography | 0.870195 |
| | delivery | 0.129805 |


What if we change the spelling? Here we lowercase `Amazon` to `amazon`. The top label stays the same, although the geography score drops from about 0.87 to 0.81; in other cases such a change can matter more. <br> Try playing with spellings and adding or removing labels.

In [10]:
sequence = "amazon is the longest river in the world"
candidate_labels = ["geography",  "delivery"]

pred = classifier(sequence, candidate_labels)
get_predictions_score(pred)

| Sequence | Labels | Probability |
|---|---|---|
| amazon is the longest river in the world | geography | 0.805673 |
| | delivery | 0.194326 |


In the example below, you'll see how well these models understand context when one label has a slight spelling mistake (`bear` instead of `beer`). <br> Try changing the spelling and observe the results.

In [11]:
sequence = "are we going to Oktoberfest?"
candidate_labels = ["food", "Munich", "bear", "wine", "pretzel", "sausage"]  # What happens if you change "bear" (animal) to "beer" (drink)?

pred = classifier(sequence, candidate_labels)
get_predictions_score(pred)

| Sequence | Labels | Probability |
|---|---|---|
| are we going to Oktoberfest? | wine | 0.489264 |
| | Munich | 0.160479 |
| | sausage | 0.132136 |
| | food | 0.091372 |
| | bear | 0.085947 |
| | pretzel | 0.040801 |


In [12]:
sequence = "Who are you voting for in 2020?"
candidate_labels = ["food", "public health", "plants", "fruits","america"]

pred = classifier(sequence, candidate_labels)
get_predictions_score(pred)

| Sequence | Labels | Probability |
|---|---|---|
| Who are you voting for in 2020? | america | 0.397639 |
| | public health | 0.213483 |
| | plants | 0.137287 |
| | fruits | 0.133618 |
| | food | 0.117974 |


##### The predictions are poor because the labels are unrelated to the sequence. We can improve on this by providing target labels that relate to the input sequence.


In [None]:
## Think about other labels which can improve the predictions
## HINT: Labels related to your text

To do multi-label classification, pass `multi_label=True`. In this case, the scores are independent of one another, and each falls between 0 and 1.

In [None]:
sequence = "Who are you voting for in 2020?"
candidate_labels = ["politics", "public health", "economics", "elections"]

pred = classifier(sequence, candidate_labels, multi_label=True)
get_predictions_score(pred)
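One way to picture why multi-label scores are independent: each label is scored on its own, by comparing only its entailment logit against its contradiction logit. The sketch below illustrates that idea under stated assumptions; `entailment_probability` is a hypothetical helper and the logit values are made up, so treat it as a mental model rather than the pipeline's implementation.

```python
import math

def entailment_probability(contradiction_logit, entailment_logit):
    """Softmax over just two logits; return the share assigned to entailment."""
    peak = max(contradiction_logit, entailment_logit)
    e_contra = math.exp(contradiction_logit - peak)
    e_entail = math.exp(entailment_logit - peak)
    return e_entail / (e_contra + e_entail)

# Hypothetical (contradiction, entailment) logits per label; illustrative values only
label_logits = {
    "politics": (-2.0, 3.0),
    "elections": (-1.5, 2.5),
    "public health": (1.0, -1.0),
}

multi_label_scores = {
    label: entailment_probability(contra, entail)
    for label, (contra, entail) in label_logits.items()
}
print(multi_label_scores)
print(sum(multi_label_scores.values()))  # not constrained to equal 1
```

Since each score depends only on its own label's logits, several labels can score close to 1 at the same time.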

#### Sentiment Classification

Here's an example of sentiment classification: 

In [None]:
sequence = "I hated this movie. The acting sucked."
candidate_labels = ["positive", "negative"]

pred = classifier(sequence, candidate_labels)
get_predictions_score(pred)

So how does this method work?

The underlying model is trained on the task of Natural Language Inference (NLI), which takes in two sequences and determines whether they contradict each other, entail each other, or neither.

This can be adapted to the task of zero-shot classification by treating the sequence which we want to classify as one NLI sequence (called the premise) and turning a candidate label into the other (the hypothesis). If the model predicts that the constructed premise _entails_ the hypothesis, then we can take that as a prediction that the label applies to the text. Check out [this blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html) for a more detailed explanation.
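The premise/hypothesis construction can be sketched in a few lines. This is an illustration only: `build_nli_pairs` is a hypothetical helper, not part of the library, and it assumes the template `This example is {}.` for turning a label into a hypothesis.

```python
def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Pair the input text (premise) with one hypothesis per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]

pairs = build_nli_pairs("Who are you voting for in 2020?", ["politics", "public health"])
for premise, hypothesis in pairs:
    print(f"premise: {premise!r} -> hypothesis: {hypothesis!r}")
```

Each pair would then be scored by the NLI model, and the entailment score for a pair is read as the score for that label.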

By default, the pipeline turns labels into hypotheses with the template `This example is {}.`. This works well in many settings, but you can also customize this for your specific setting. Let's add another review to our above sentiment classification example that's a bit more challenging:

In [None]:
sequences = [
    "I hated this movie. The acting sucked.",
    "This movie didn't quite live up to my high expectations, but overall I still really enjoyed it."
]
candidate_labels = ["positive", "negative"]

classifier(sequences, candidate_labels)

The second example is a bit harder. Let's see if we can improve the results by using a hypothesis template that is more specific to review sentiment analysis. Instead of the default `This example is {}.`, we'll use `The sentiment of this review is {}.` (where `{}` is replaced with the candidate class name).

In [None]:
sequences = [
    "I hated this movie. The acting sucked.",
    "This movie didn't quite live up to my high expectations, but overall I still really enjoyed it."
]
candidate_labels = ["positive", "negative"]
hypothesis_template = "The sentiment of this review is {}."

classifier(sequences, candidate_labels, hypothesis_template=hypothesis_template)

By providing a more precise hypothesis template, we are able to see a more accurate classification of the second review.
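When comparing runs with different templates, it can help to reduce each result to its top label. The sketch below assumes the pipeline returns one dict per sequence with the keys `sequence`, `labels`, and `scores`, sorted by descending score (a single dict for a single input). `top_labels` is a hypothetical helper, and the sample input reuses the Amazon result from earlier in this notebook rather than a fresh model call.

```python
def top_labels(results):
    """Return (sequence, best_label, best_score) for each result dict."""
    if isinstance(results, dict):  # a single input yields a single dict
        results = [results]
    return [(r["sequence"], r["labels"][0], r["scores"][0]) for r in results]

# Sample input in the pipeline's output shape, copied from the Amazon example above
sample = {
    "sequence": "Amazon is the longest river in the world",
    "labels": ["geography", "delivery"],
    "scores": [0.870195, 0.129805],
}
print(top_labels(sample))
# [('Amazon is the longest river in the world', 'geography', 0.870195)]
```

Calling `top_labels` on the output of each `classifier(...)` run gives two short lists that are easy to compare side by side.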

> Note that sentiment classification is used here just as an illustrative example. The [Hugging Face Model Hub](https://huggingface.co/models?filter=text-classification) has a number of models trained specifically on sentiment tasks which can be used instead.

#### Zero-shot classification in more than 100 languages



Interested in using the pipeline for languages other than English? There is a cross-lingual model built on top of XLM-RoBERTa, which you can use by passing `model='joeddav/xlm-roberta-large-xnli'` when creating the pipeline:

In [None]:
classifier = pipeline("zero-shot-classification", model='joeddav/xlm-roberta-large-xnli', device=0)

You can use it with any combination of languages. For example, let's classify a Russian sentence with English candidate labels:

In [None]:
sequence = "За кого вы голосуете в 2020 году?" # translation: "Who are you voting for in 2020?"
candidate_labels = ["Europe", "public health", "politics"]

classifier(sequence, candidate_labels)

Now let's do the same but with the labels in French:



In [None]:
sequence = "За кого вы голосуете в 2020 году?" # translation: "Who are you voting for in 2020?"
candidate_labels = ["Europe", "santé publique", "politique"]

classifier(sequence, candidate_labels)

As we discussed in the last section, the default hypothesis template is the English `This example is {}.`. If you are working strictly within one language, it may be worthwhile to translate this to the language you are working with:

In [None]:
sequence = "¿A quién vas a votar en 2020?"
candidate_labels = ["Europa", "salud pública", "política"]
hypothesis_template = "Este ejemplo es {}."

classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)

The model is fine-tuned on XNLI, which includes 15 languages: Arabic, Bulgarian, Chinese, English, French, German, Greek, Hindi, Russian, Spanish, Swahili, Thai, Turkish, Urdu, and Vietnamese. The base model is trained on 85 more, so the model will work to some degree for any of the languages in the XLM-RoBERTa training corpus (see the full list in Appendix A of the [XLM-RoBERTa paper](https://arxiv.org/abs/1911.02116)).

See the [model page](https://huggingface.co/joeddav/xlm-roberta-large-xnli) for more.

### Different Pipeline models

[Read here](https://huggingface.co/docs/transformers/main_classes/pipelines) about the different pipelines available in Hugging Face Transformers.

#### Text Generation

In [None]:
text_gen = pipeline("text-generation", model='gpt2', device=0)  # device=0 runs the pipeline on the first GPU

In [None]:
prompt = "Data Science is"
text_gen(prompt, max_length=30, num_return_sequences=3)

You can play around with different starting sentences. Change the `max_length` argument if you want shorter or longer generated text.

#### Sentiment Analysis

The sentiment analysis example earlier in this notebook can also be done with a dedicated sentiment-analysis pipeline.

In [None]:
## Create a new sentiment-analysis pipeline and play with the examples in the new pipeline
## HINT: You don't need to provide labels to the sentiment analysis pipeline as it is trained for the same task

### OPTIONAL
#### You can create a Hugging Face account and an access token if you wish to create or push content to a repository on Hugging Face (e.g., when training a model or modifying a model card).

- Create an account at https://huggingface.co/
- After logging in
    - go to Settings->Access Tokens
    - Create new token and give write permissions

- Run these commands
    - `brew install huggingface-cli`
    - `huggingface-cli login`, then paste the access token from Hugging Face
    - **If it asks whether to add the token as a git credential, answer no**

    Reference: https://huggingface.co/docs/hub/security-tokens