## Multi-Task NLP with Transformers

__Credit:__ This notebook has been adapted from the examples in the __`transformers`__ package by HuggingFace. You can check out their repository [here](https://github.com/huggingface/transformers).

Newly introduced in transformers v2.3.0, **pipelines** provide a high-level, easy-to-use
API for running inference on a variety of downstream tasks, including:

- ***Sentence Classification _(Sentiment Analysis)_***: Indicate whether the overall sentence is positive or negative, i.e. a *binary classification task* or *logistic regression task*.
- ***Token Classification (Named Entity Recognition, Part-of-Speech tagging)***: For each sub-entity _(*token*)_ in the input, assign a label, i.e. a classification task.
- ***Question-Answering***: Provided a tuple (`question`, `context`), the model should find the span of text in `context` answering the `question`.
- ***Mask-Filling***: Suggests possible word(s) to fill the masked input with respect to the provided `context`.
- ***Summarization***: Summarizes the ``input`` article to a shorter article.
- ***Translation***: Translates the input from one language to another.
- ***Feature Extraction***: Maps the input to a high-dimensional vector space learned from the data.

Pipelines encapsulate the overall workflow shared by every NLP task:

 1. *Tokenization*: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).
 2. *Inference*: Map every token to a more meaningful representation.
 3. *Decoding*: Use the above representation to generate and/or extract the final output for the underlying task.
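
The three stages above can be pictured with a deliberately tiny, hand-written stand-in. This is only a sketch: the word lists and the `tokenize`, `infer`, `decode` and `toy_pipeline` names are hypothetical and have nothing to do with how `transformers` implements its pipelines internally, where a trained tokenizer and a neural network do the work.

```python
# Toy illustration of the three pipeline stages.
# Everything here is hypothetical; it is NOT the transformers implementation.

POSITIVE_WORDS = {"excellent", "nice", "good"}  # assumed toy lexicon
NEGATIVE_WORDS = {"bad", "boring", "awful"}     # assumed toy lexicon


def tokenize(text):
    """Stage 1: split raw text into lowercase word tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def infer(tokens):
    """Stage 2: map each token to a numeric representation (here, one score)."""
    scores = []
    for tok in tokens:
        if tok in POSITIVE_WORDS:
            scores.append(1.0)
        elif tok in NEGATIVE_WORDS:
            scores.append(-1.0)
        else:
            scores.append(0.0)
    return scores


def decode(scores):
    """Stage 3: turn the per-token representation into a task-level output."""
    total = sum(scores)
    return {"label": "POSITIVE" if total >= 0 else "NEGATIVE", "total": total}


def toy_pipeline(text):
    """Chain the three stages, mirroring what a pipeline object does in one call."""
    return decode(infer(tokenize(text)))


print(toy_pipeline("This is an excellent movie! Really nice plot."))
```

A real pipeline keeps the same shape, with each stage replaced by a learned component.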

The overall API is exposed to the end-user through the `pipeline()` function, which has the following
structure:

```python
from transformers import pipeline

# Using default model and tokenizer for the task
pipeline("<task-name>")

# Using a user-specified model
pipeline("<task-name>", model="<model_name>")

# Using a custom model and tokenizer, both given as strings
pipeline("<task-name>", model="<model_name>", tokenizer="<tokenizer_name>")
```

<div style="text-align: right; font-size:80%"><i>Tutorial by: <a href="https://www.linkedin.com/in/dipanzan">Dipanjan (DJ) Sarkar</a></i></div>


These models have already been fine-tuned for specific tasks and are available on the Hugging Face Hub.

# Install dependencies

In [1]:
!nvidia-smi

zsh:1: command not found: nvidia-smi


In [None]:
#!pip install -q transformers


In [2]:
from transformers import pipeline

## 1. Sentence Classification - Sentiment Analysis

In [3]:
nlp_sentiment_model = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [4]:
nlp_sentiment_model('This is an excellent movie! Really nice plot and casting.')

[{'label': 'POSITIVE', 'score': 0.9998741149902344}]

In [5]:
nlp_sentiment_model('This movie was so NOT good!')

[{'label': 'NEGATIVE', 'score': 0.9998019337654114}]

In [6]:
nlp_sentiment_model('I tried to like this book but I definitely did not enjoy reading it!')

[{'label': 'NEGATIVE', 'score': 0.9994773268699646}]
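
Each call above returns a list with one dictionary per input, holding a `label` and a `score`. When comparing many sentences it can be handy to fold both into a single signed number. The helper below is a small sketch; `signed_score` is a hypothetical name, and the dictionaries are copied from the three outputs above rather than recomputed.

```python
def signed_score(result):
    """Map one result dict to a value in [-1, 1]: positive for POSITIVE labels."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]


# Results copied from the three cells above.
results = [
    {"label": "POSITIVE", "score": 0.9998741149902344},
    {"label": "NEGATIVE", "score": 0.9998019337654114},
    {"label": "NEGATIVE", "score": 0.9994773268699646},
]

for res in results:
    print(f"{res['label']:<8} {signed_score(res):+.4f}")
```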

## 2. Token Classification - Named Entity Recognition

In [7]:
nlp_token_class = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [8]:
text = """Three more countries have joined an "international grand committee" of parliaments, adding to calls for
Facebook's boss, Mark Zuckerberg, to give evidence on misinformation to the coalition. Brazil, Latvia and Singapore
bring the total to eight different parliaments across the world, with plans to send representatives to London on 27
November with the intention of hearing from Zuckerberg. Since the Cambridge Analytica scandal broke, the Facebook chief
has only appeared in front of two legislatures: the American Senate and House of Representatives, and the European parliament.
Facebook has consistently rebuffed attempts from others, including the UK and Canadian parliaments, to hear from Zuckerberg.
He added that an article in the New York Times on Thursday, in which the paper alleged a pattern of behaviour from Facebook
to "delay, deny and deflect" negative news stories, "raises further questions about how recent data breaches were allegedly
dealt with within Facebook."
"""

nlp_token_class(text)



[{'entity': 'I-ORG',
  'score': 0.99894947,
  'index': 20,
  'word': 'Facebook',
  'start': 104,
  'end': 112},
 {'entity': 'I-PER',
  'score': 0.99953496,
  'index': 25,
  'word': 'Mark',
  'start': 121,
  'end': 125},
 {'entity': 'I-PER',
  'score': 0.99878114,
  'index': 26,
  'word': 'Z',
  'start': 126,
  'end': 127},
 {'entity': 'I-PER',
  'score': 0.9470854,
  'index': 27,
  'word': '##uck',
  'start': 127,
  'end': 130},
 {'entity': 'I-PER',
  'score': 0.7890564,
  'index': 28,
  'word': '##er',
  'start': 130,
  'end': 132},
 {'entity': 'I-PER',
  'score': 0.9926771,
  'index': 29,
  'word': '##berg',
  'start': 132,
  'end': 136},
 {'entity': 'I-LOC',
  'score': 0.9998429,
  'index': 42,
  'word': 'Brazil',
  'start': 191,
  'end': 197},
 {'entity': 'I-LOC',
  'score': 0.9998599,
  'index': 44,
  'word': 'Latvia',
  'start': 199,
  'end': 205},
 {'entity': 'I-LOC',
  'score': 0.9998784,
  'index': 46,
  'word': 'Singapore',
  'start': 210,
  'end': 219},
 {'entity': 'I-LOC',


## 3. Question Answering

In [9]:
nlp_qa = pipeline('question-answering')

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [10]:
context = """
Coronaviruses are a large family of viruses which may cause illness in animals or humans.
In humans, several coronaviruses are known to cause respiratory infections ranging from the
common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS).
The most recently discovered coronavirus causes coronavirus disease COVID-19.
COVID-19 is the infectious disease caused by the most recently discovered coronavirus.
This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019.
COVID-19 is now a pandemic affecting many countries globally.
The most common symptoms of COVID-19 are fever, dry cough, and tiredness.
Other symptoms that are less common and may affect some patients include aches
and pains, nasal congestion, headache, conjunctivitis, sore throat, diarrhea,
loss of taste or smell or a rash on skin or discoloration of fingers or toes.
These symptoms are usually mild and begin gradually.
Some people become infected but only have very mild symptoms.
Most people (about 80%) recover from the disease without needing hospital treatment.
Around 1 out of every 5 people who gets COVID-19 becomes seriously ill and develops difficulty breathing.
Older people, and those with underlying medical problems like high blood pressure, heart and lung problems,
diabetes, or cancer, are at higher risk of developing serious illness.
However, anyone can catch COVID-19 and become seriously ill.
People of all ages who experience fever and/or  cough associated with difficulty breathing/shortness of breath,
chest pain/pressure, or loss of speech or movement should seek medical attention immediately.
If possible, it is recommended to call the health care provider or facility first,
so the patient can be directed to the right clinic.
People can catch COVID-19 from others who have the virus.
The disease spreads primarily from person to person through small droplets from the nose or mouth,
which are expelled when a person with COVID-19 coughs, sneezes, or speaks.
These droplets are relatively heavy, do not travel far and quickly sink to the ground.
People can catch COVID-19 if they breathe in these droplets from a person infected with the virus.
This is why it is important to stay at least 1 meter) away from others.
These droplets can land on objects and surfaces around the person such as tables, doorknobs and handrails.
People can become infected by touching these objects or surfaces, then touching their eyes, nose or mouth.
This is why it is important to wash your hands regularly with soap and water or clean with alcohol-based hand rub.
Practicing hand and respiratory hygiene is important at ALL times and is the best way to protect others and yourself.
When possible maintain at least a 1 meter distance between yourself and others.
This is especially important if you are standing by someone who is coughing or sneezing.
Since some infected persons may not yet be exhibiting symptoms or their symptoms may be mild,
maintaining a physical distance with everyone is a good idea if you are in an area where COVID-19 is circulating.
"""

In [11]:
nlp_qa(context=context, question='What is a coronavirus ?')

{'score': 0.6717585325241089,
 'start': 19,
 'end': 89,
 'answer': 'a large family of viruses which may cause illness in animals or humans'}

In [12]:
nlp_qa(context=context, question='What is covid-19 ?')

{'score': 0.4254707396030426,
 'start': 407,
 'end': 476,
 'answer': 'infectious disease caused by the most recently discovered coronavirus'}

In [13]:
nlp_qa(context=context, question='What are covid-19 symptoms ?')

{'score': 0.8406685590744019,
 'start': 682,
 'end': 713,
 'answer': 'fever, dry cough, and tiredness'}

In [14]:
nlp_qa(context=context, question='How do people get infected by covid-19 ?')

{'score': 0.3460104167461395,
 'start': 2464,
 'end': 2498,
 'answer': 'touching these objects or surfaces'}

In [16]:
nlp_qa(context=context, question='Did someone apply a large rudder input during takeoff ?')

{'score': 0.04208790138363838,
 'start': 1444,
 'end': 1495,
 'answer': 'anyone can catch COVID-19 and become seriously ill.'}
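
The last query shows that the pipeline returns a span even when the context cannot answer the question; only the low `score` hints at the problem. A simple guard is to reject answers below a cutoff. The sketch below uses a hypothetical `accept_answer` helper and an assumed threshold of `0.3`, which would need tuning on real data. It also illustrates that `start` and `end` are character offsets into the context string, using only the first sentence of the context above.

```python
def accept_answer(result, context, min_score=0.3):
    """Return the answer span if the score clears the cutoff, else None."""
    if result["score"] < min_score:
        return None
    # 'start' and 'end' are character offsets into the context string.
    return context[result["start"]:result["end"]]


# First sentence of the context above, including the leading newline.
context_head = (
    "\nCoronaviruses are a large family of viruses which may cause "
    "illness in animals or humans.\n"
)

# Results copied from the outputs above.
on_topic = {
    "score": 0.6717585325241089,
    "start": 19,
    "end": 89,
    "answer": "a large family of viruses which may cause illness in animals or humans",
}
off_topic = {
    "score": 0.04208790138363838,
    "start": 1444,
    "end": 1495,
    "answer": "anyone can catch COVID-19 and become seriously ill.",
}

print(accept_answer(on_topic, context_head))
print(accept_answer(off_topic, context_head))
```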

## 4. Text Generation - Mask Filling

In [None]:
nlp_fill = pipeline('fill-mask')

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


In [None]:
nlp_fill('The ship is reaching the ' + nlp_fill.tokenizer.mask_token)

[{'score': 0.09211686998605728,
  'sequence': 'The ship is reaching the surface',
  'token': 4084,
  'token_str': ' surface'},
 {'score': 0.07959000021219254,
  'sequence': 'The ship is reaching the port',
  'token': 4103,
  'token_str': ' port'},
 {'score': 0.053496353328228,
  'sequence': 'The ship is reaching the shore',
  'token': 8373,
  'token_str': ' shore'},
 {'score': 0.04391564428806305,
  'sequence': 'The ship is reaching the dock',
  'token': 15261,
  'token_str': ' dock'},
 {'score': 0.031611498445272446,
  'sequence': 'The ship is reaching the destination',
  'token': 6381,
  'token_str': ' destination'}]
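
The scores above are probabilities over the model's vocabulary, so adding them up shows how much probability mass the five suggestions cover. The short calculation below uses the values copied from the output above; it suggests the top five account for only roughly 30% of the mass, meaning the model is far from certain about this blank.

```python
# Candidates copied from the fill-mask output above.
candidates = [
    {"token_str": " surface", "score": 0.09211686998605728},
    {"token_str": " port", "score": 0.07959000021219254},
    {"token_str": " shore", "score": 0.053496353328228},
    {"token_str": " dock", "score": 0.04391564428806305},
    {"token_str": " destination", "score": 0.031611498445272446},
]

# Best single suggestion and total probability covered by the listed ones.
best = max(candidates, key=lambda c: c["score"])
top_mass = sum(c["score"] for c in candidates)

print("best guess:", best["token_str"].strip())
print(f"probability mass covered by the top {len(candidates)}: {top_mass:.1%}")
```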

## 5. Summarization

Summarization is currently supported by `Bart` and `T5`.

In [None]:
summarizer = pipeline('summarization', model='facebook/bart-large-cnn',
                      tokenizer='facebook/bart-large-cnn')

In [None]:
BIG_DOC = """
Coronaviruses are a large family of viruses which may cause illness in animals or humans.
In humans, several coronaviruses are known to cause respiratory infections ranging from the
common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS).
The most recently discovered coronavirus causes coronavirus disease COVID-19.
COVID-19 is the infectious disease caused by the most recently discovered coronavirus.
This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019.
COVID-19 is now a pandemic affecting many countries globally.
The most common symptoms of COVID-19 are fever, dry cough, and tiredness.
Other symptoms that are less common and may affect some patients include aches
and pains, nasal congestion, headache, conjunctivitis, sore throat, diarrhea,
loss of taste or smell or a rash on skin or discoloration of fingers or toes.
These symptoms are usually mild and begin gradually.
Some people become infected but only have very mild symptoms.
Most people (about 80%) recover from the disease without needing hospital treatment.
Around 1 out of every 5 people who gets COVID-19 becomes seriously ill and develops difficulty breathing.
Older people, and those with underlying medical problems like high blood pressure, heart and lung problems,
diabetes, or cancer, are at higher risk of developing serious illness.
However, anyone can catch COVID-19 and become seriously ill.
People of all ages who experience fever and/or  cough associated with difficulty breathing/shortness of breath,
chest pain/pressure, or loss of speech or movement should seek medical attention immediately.
If possible, it is recommended to call the health care provider or facility first,
so the patient can be directed to the right clinic.
People can catch COVID-19 from others who have the virus.
The disease spreads primarily from person to person through small droplets from the nose or mouth,
which are expelled when a person with COVID-19 coughs, sneezes, or speaks.
These droplets are relatively heavy, do not travel far and quickly sink to the ground.
People can catch COVID-19 if they breathe in these droplets from a person infected with the virus.
This is why it is important to stay at least 1 meter) away from others.
These droplets can land on objects and surfaces around the person such as tables, doorknobs and handrails.
People can become infected by touching these objects or surfaces, then touching their eyes, nose or mouth.
This is why it is important to wash your hands regularly with soap and water or clean with alcohol-based hand rub.
Practicing hand and respiratory hygiene is important at ALL times and is the best way to protect others and yourself.
When possible maintain at least a 1 meter distance between yourself and others.
This is especially important if you are standing by someone who is coughing or sneezing.
Since some infected persons may not yet be exhibiting symptoms or their symptoms may be mild,
maintaining a physical distance with everyone is a good idea if you are in an area where COVID-19 is circulating.
"""


result = summarizer(BIG_DOC)

In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
summary = result[0]['summary_text']
print('\n'.join(nltk.sent_tokenize(summary)))


COVID-19 is the infectious disease caused by the most recently discovered coronavirus.
It is now a pandemic affecting many countries globally.
Most people recover from the disease without needing hospital treatment.
Around 1 out of every 5 people who gets COVID- 19 becomes seriously ill and develops difficulty breathing.


## 6. Translation

Translation is currently supported by `T5` for the language mappings English-to-French (`translation_en_to_fr`), English-to-German (`translation_en_to_de`) and English-to-Romanian (`translation_en_to_ro`).

In [None]:
# English to French
translator = pipeline('translation_en_to_fr')

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


In [None]:
translator("The quick brown fox jumped over the lazy dog")

[{'translation_text': 'Le renard brun rapide saute au-dessus du chien piètre'}]

In [None]:
# English to German
translator = pipeline('translation_en_to_de')

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


In [None]:
translator("The quick brown fox jumped over the lazy dog")

[{'translation_text': 'Der schnelle braune Fuchs sprang über den faulen Hund'}]

## 7. Text Generation

Text generation is currently supported by GPT-2, OpenAI GPT, Transformer-XL, XLNet, CTRL and Reformer.

In [None]:
text_generator = pipeline("text-generation", model='gpt2', tokenizer='gpt2')

In [None]:
result = text_generator("The football game is about to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
print(result[0]['generated_text'])

The football game is about to be transformed by the latest evolution in the evolution of the school sport – a football match where the entire pitch is involved – and there will be a major change to the way that fans see the game."

As well


## 8. Projection - Feature Extraction

In [None]:
import numpy as np
nlp_features = pipeline('feature-extraction')
output = nlp_features('This is a short sentence')
print('Shape:', np.array(output).shape)   # (Samples, Tokens, Vector Size)
np.array(output)

No model was supplied, defaulted to distilbert-base-cased (https://huggingface.co/distilbert-base-cased)


Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Shape: (1, 7, 768)


array([[[ 0.31193659,  0.06641161, -0.0303604 , ...,  0.04635397,
          0.34695518,  0.07248024],
        [-0.1749889 , -0.08628027,  0.32349387, ...,  0.51969773,
          0.24023356,  0.37703645],
        [ 0.10653968,  0.27738377,  0.3384136 , ...,  0.66813093,
          0.31344935,  0.13851486],
        ...,
        [ 0.27000797, -0.05821696,  0.10258093, ...,  0.46564671,
          0.34796861,  0.26275688],
        [ 0.14764583,  0.32623157,  0.0352209 , ...,  0.09629995,
          0.21582821,  0.09945892],
        [ 0.89270836, -0.05968712, -0.14042372, ...,  0.35298911,
          0.8723703 , -0.42272568]]])
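
The `(1, 7, 768)` array above holds one 768-dimensional vector per token. A common way to obtain a single vector per sentence is to average over the token axis, after which two sentences can be compared with cosine similarity. The sketch below uses random arrays as stand-ins for real pipeline output so that it runs without downloading a model; `mean_pool` and `cosine_similarity` are hypothetical helper names, and mean pooling is just one of several possible pooling choices.

```python
import numpy as np


def mean_pool(features):
    """Average token vectors: (1, tokens, hidden) -> (hidden,)."""
    arr = np.asarray(features, dtype=np.float64)
    return arr[0].mean(axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins with the same layout as the feature-extraction output.
rng = np.random.default_rng(0)
fake_a = rng.normal(size=(1, 7, 768))  # e.g. a 7-token sentence
fake_b = rng.normal(size=(1, 5, 768))  # e.g. a 5-token sentence

vec_a = mean_pool(fake_a)
vec_b = mean_pool(fake_b)

print("Pooled shape:", vec_a.shape)
print("Similarity:", cosine_similarity(vec_a, vec_b))
```

With real data, `fake_a` would be replaced by `nlp_features('some sentence')`.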

Now you have a nice picture of what is possible through transformers' pipelines.

Feel free to try these different pipelines with your own inputs.