## Multi-Task NLP Applications with Transformers

__Credit:__ This notebook has been adapted from the examples in the __`transformers`__ package by HuggingFace. You can check out their repository [here](https://github.com/huggingface/transformers).

Introduced in transformers v2.3.0, **pipelines** provide a high-level, easy-to-use
API for running inference on a variety of downstream tasks, including:

- ***Sentence Classification _(Sentiment Analysis)_***: Indicate whether the overall sentence is positive or negative, i.e. a *binary classification task* (or *logistic regression task*).
- ***Token Classification (Named Entity Recognition, Part-of-Speech tagging)***: Assign a label to each sub-entity _(*token*)_ in the input, i.e. a classification task.
- ***Question-Answering***: Given a tuple (`question`, `context`), the model should find the span of text in `context` answering the `question`.
- ***Mask-Filling***: Suggest possible word(s) to fill the masked input with respect to the provided `context`.
- ***Summarization***: Summarize the ``input`` article into a shorter text.
- ***Translation***: Translate the input from one language to another.
- ***Feature Extraction***: Map the input to a higher, multi-dimensional space learned from the data.

Pipelines encapsulate the steps common to every NLP task:
 
 1. *Tokenization*: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).
 2. *Inference*: Map every token into a more meaningful representation.
 3. *Decoding*: Use the above representation to generate and/or extract the final output for the underlying task.
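
The three steps above can be sketched with a deliberately tiny, pure-Python stand-in. This is not how `transformers` implements them: the word weights (`TOY_WEIGHTS`) and the helper names `tokenize`, `infer`, `decode` and `toy_pipeline` are hypothetical, chosen only to show how tokens become scores and scores become a labelled result.

```python
import math

# Hypothetical word weights standing in for a trained model.
TOY_WEIGHTS = {"excellent": 2.0, "nice": 1.0, "not": -1.5, "bad": -2.0}
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Step 1: split the input into lowercase word tokens."""
    return [word.strip(".,!?").lower() for word in text.split()]


def infer(tokens):
    """Step 2: map the tokens to [negative, positive] logits."""
    total = sum(TOY_WEIGHTS.get(token, 0.0) for token in tokens)
    return [-total, total]


def decode(logits):
    """Step 3: turn logits into a label and a softmax probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = probs.index(max(probs))
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring the shape of a pipeline result."""
    return decode(infer(tokenize(text)))


toy_result = toy_pipeline("This is an excellent movie! Really nice plot and casting.")
print(toy_result)
```

A real pipeline replaces the lookup table with a learned tokenizer and a transformer model, but the flow from text to tokens to scores to a `label`/`score` dictionary is the same idea.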

The overall API is exposed to the end-user through the `pipeline()` function with the following 
structure:

```python
from transformers import pipeline

# Using default model and tokenizer for the task
pipeline("<task-name>")

# Using a user-specified model
pipeline("<task-name>", model="<model_name>")

# Using a custom model and tokenizer, both given as strings
pipeline("<task-name>", model="<model_name>", tokenizer="<tokenizer_name>")
```

<div style="text-align: right; font-size:80%"><i>Tutorial by: [Dipanjan (DJ) Sarkar](https://www.linkedin.com/in/dipanzan)</i></div>

# Install dependencies

In [1]:
!nvidia-smi

Wed Jun 15 13:58:20 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [2]:
!pip install -q transformers


In [3]:
from transformers import pipeline

## 1. Sentence Classification - Sentiment Analysis

In [4]:
nlp_sentiment_model = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



In [5]:
nlp_sentiment_model('This is an excellent movie! Really nice plot and casting.')

[{'label': 'POSITIVE', 'score': 0.9998741149902344}]

In [6]:
nlp_sentiment_model('This movie was so NOT good!')

[{'label': 'NEGATIVE', 'score': 0.9998019337654114}]

In [7]:
nlp_sentiment_model('I tried to like this book but I definitely did not enjoy reading it!')

[{'label': 'NEGATIVE', 'score': 0.9994773268699646}]

In [9]:
nlp_sentiment_model = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-emotion")


In [10]:
nlp_sentiment_model('This is an excellent movie! Really nice plot and casting.')

[{'label': 'optimism', 'score': 0.952976405620575}]

In [11]:
nlp_sentiment_model('This movie was so NOT good!')

[{'label': 'joy', 'score': 0.9188464879989624}]

In [12]:
nlp_sentiment_model('I tried to like this book but I definitely did not enjoy reading it!')

[{'label': 'sadness', 'score': 0.7316125631332397}]
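
Each result carries a `score` alongside its `label`, so predictions can be split by confidence before they are used downstream. The sketch below copies the three emotion outputs shown above into a list; the helper `split_by_confidence` and the `0.9` threshold are illustrative assumptions, not part of the library.

```python
# Outputs copied from the three emotion-model calls above.
predictions = [
    {'label': 'optimism', 'score': 0.952976405620575},
    {'label': 'joy', 'score': 0.9188464879989624},
    {'label': 'sadness', 'score': 0.7316125631332397},
]


def split_by_confidence(preds, threshold=0.9):
    """Separate predictions at or above the threshold from the rest."""
    confident = [p for p in preds if p['score'] >= threshold]
    uncertain = [p for p in preds if p['score'] < threshold]
    return confident, uncertain


confident, uncertain = split_by_confidence(predictions)
print('confident:', [p['label'] for p in confident])
print('uncertain:', [p['label'] for p in uncertain])
```

A high score is not the same as a correct label: the `joy` prediction for "This movie was so NOT good!" passes the threshold, so a confidence filter complements manual spot checks rather than replacing them.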

## 2. Token Classification - Named Entity Recognition

In [13]:
nlp_token_class = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



In [15]:
text = """Three more countries have joined an "international grand committee" of parliaments, adding to calls for 
Facebook's boss, Mark Zuckerberg, to give evidence on misinformation to the coalition. Brazil, Latvia and Singapore 
bring the total to eight different parliaments across the world, with plans to send representatives to London on 27 
November with the intention of hearing from Zuckerberg. Since the Cambridge Analytica scandal broke, the Facebook chief 
has only appeared in front of two legislatures: the American Senate and House of Representatives, and the European parliament. 
Facebook has consistently rebuffed attempts from others, including the UK and Canadian parliaments, to hear from Zuckerberg. 
He added that an article in the New York Times on Thursday, in which the paper alleged a pattern of behaviour from Facebook 
to "delay, deny and deflect" negative news stories, "raises further questions about how recent data breaches were allegedly 
dealt with within Facebook."
"""

entities = nlp_token_class(text)

In [16]:
[item for item in entities if item['entity'] == 'I-ORG']

[{'end': 113,
  'entity': 'I-ORG',
  'index': 20,
  'score': 0.99894947,
  'start': 105,
  'word': 'Facebook'},
 {'end': 414,
  'entity': 'I-ORG',
  'index': 82,
  'score': 0.99405766,
  'start': 405,
  'word': 'Cambridge'},
 {'end': 418,
  'entity': 'I-ORG',
  'index': 83,
  'score': 0.9975211,
  'start': 415,
  'word': 'Ana'},
 {'end': 420,
  'entity': 'I-ORG',
  'index': 84,
  'score': 0.98012805,
  'start': 418,
  'word': '##ly'},
 {'end': 424,
  'entity': 'I-ORG',
  'index': 85,
  'score': 0.9603263,
  'start': 420,
  'word': '##tica'},
 {'end': 452,
  'entity': 'I-ORG',
  'index': 90,
  'score': 0.9987739,
  'start': 444,
  'word': 'Facebook'},
 {'end': 527,
  'entity': 'I-ORG',
  'index': 104,
  'score': 0.99672997,
  'start': 521,
  'word': 'Senate'},
 {'end': 537,
  'entity': 'I-ORG',
  'index': 106,
  'score': 0.9995024,
  'start': 532,
  'word': 'House'},
 {'end': 540,
  'entity': 'I-ORG',
  'index': 107,
  'score': 0.9990483,
  'start': 538,
  'word': 'of'},
 ...]

In [17]:
[item for item in entities if item['entity'] == 'I-LOC']

[{'end': 198,
  'entity': 'I-LOC',
  'index': 42,
  'score': 0.9998429,
  'start': 192,
  'word': 'Brazil'},
 {'end': 206,
  'entity': 'I-LOC',
  'index': 44,
  'score': 0.9998599,
  'start': 200,
  'word': 'Latvia'},
 {'end': 220,
  'entity': 'I-LOC',
  'index': 46,
  'score': 0.9998784,
  'start': 211,
  'word': 'Singapore'},
 {'end': 331,
  'entity': 'I-LOC',
  'index': 65,
  'score': 0.99962544,
  'start': 325,
  'word': 'London'},
 {'end': 661,
  'entity': 'I-LOC',
  'index': 127,
  'score': 0.9994717,
  'start': 659,
  'word': 'UK'}]
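
In the `I-ORG` output above, "Analytica" is split into the word pieces `Ana`, `##ly` and `##tica`, and "Cambridge Analytica" spans several tokens. A minimal sketch of how such pieces could be stitched back together is shown below, assuming that tokens belonging to one entity share a label and have consecutive `index` values. The helper `merge_entities` is hypothetical, and the sample tokens are copied from the output above (scores omitted).

```python
# Tokens copied from the I-ORG output above (scores omitted for brevity).
org_tokens = [
    {'word': 'Cambridge', 'entity': 'I-ORG', 'index': 82, 'start': 405, 'end': 414},
    {'word': 'Ana', 'entity': 'I-ORG', 'index': 83, 'start': 415, 'end': 418},
    {'word': '##ly', 'entity': 'I-ORG', 'index': 84, 'start': 418, 'end': 420},
    {'word': '##tica', 'entity': 'I-ORG', 'index': 85, 'start': 420, 'end': 424},
    {'word': 'Facebook', 'entity': 'I-ORG', 'index': 90, 'start': 444, 'end': 452},
]


def merge_entities(tokens):
    """Group same-label tokens with consecutive indices into one entity."""
    merged = []
    for tok in tokens:
        is_piece = tok['word'].startswith('##')
        text = tok['word'][2:] if is_piece else tok['word']
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev['entity'] == tok['entity']
                and tok['index'] == prev['last_index'] + 1):
            # Word pieces attach directly; whole words get a space.
            prev['word'] += text if is_piece else ' ' + text
            prev['end'] = tok['end']
            prev['last_index'] = tok['index']
        else:
            merged.append({'word': text, 'entity': tok['entity'],
                           'start': tok['start'], 'end': tok['end'],
                           'last_index': tok['index']})
    return merged


merged_entities = merge_entities(org_tokens)
for entity in merged_entities:
    print(entity['word'], entity['start'], entity['end'])
```

Recent versions of `transformers` also provide built-in grouping for the token-classification pipeline (the `aggregation_strategy` argument); check the documentation for the installed version before relying on it.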

## 3. Question Answering

In [18]:
nlp_qa = pipeline('question-answering')

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)



In [19]:
context = """
Coronaviruses are a large family of viruses which may cause illness in animals or humans.  
In humans, several coronaviruses are known to cause respiratory infections ranging from the 
common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). 
The most recently discovered coronavirus causes coronavirus disease COVID-19.
COVID-19 is the infectious disease caused by the most recently discovered coronavirus. 
This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019. 
COVID-19 is now a pandemic affecting many countries globally.
The most common symptoms of COVID-19 are fever, dry cough, and tiredness. 
Other symptoms that are less common and may affect some patients include aches 
and pains, nasal congestion, headache, conjunctivitis, sore throat, diarrhea, 
loss of taste or smell or a rash on skin or discoloration of fingers or toes. 
These symptoms are usually mild and begin gradually. 
Some people become infected but only have very mild symptoms.
Most people (about 80%) recover from the disease without needing hospital treatment. 
Around 1 out of every 5 people who gets COVID-19 becomes seriously ill and develops difficulty breathing. 
Older people, and those with underlying medical problems like high blood pressure, heart and lung problems, 
diabetes, or cancer, are at higher risk of developing serious illness.  
However, anyone can catch COVID-19 and become seriously ill.  
People of all ages who experience fever and/or  cough associated with difficulty breathing/shortness of breath, 
chest pain/pressure, or loss of speech or movement should seek medical attention immediately. 
If possible, it is recommended to call the health care provider or facility first, 
so the patient can be directed to the right clinic.
People can catch COVID-19 from others who have the virus. 
The disease spreads primarily from person to person through small droplets from the nose or mouth, 
which are expelled when a person with COVID-19 coughs, sneezes, or speaks. 
These droplets are relatively heavy, do not travel far and quickly sink to the ground. 
People can catch COVID-19 if they breathe in these droplets from a person infected with the virus.  
This is why it is important to stay at least 1 meter) away from others. 
These droplets can land on objects and surfaces around the person such as tables, doorknobs and handrails.  
People can become infected by touching these objects or surfaces, then touching their eyes, nose or mouth.  
This is why it is important to wash your hands regularly with soap and water or clean with alcohol-based hand rub.
Practicing hand and respiratory hygiene is important at ALL times and is the best way to protect others and yourself.
When possible maintain at least a 1 meter distance between yourself and others. 
This is especially important if you are standing by someone who is coughing or sneezing.  
Since some infected persons may not yet be exhibiting symptoms or their symptoms may be mild, 
maintaining a physical distance with everyone is a good idea if you are in an area where COVID-19 is circulating. 
"""

In [20]:
nlp_qa(context=context, question='What is a coronavirus ?')


{'answer': 'a large family of viruses which may cause illness in animals or humans',
 'end': 89,
 'score': 0.6717580556869507,
 'start': 19}
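
The `start` and `end` fields are character offsets into `context`, so the answer can be recovered, or highlighted in place, by slicing the original string. The sketch below uses only the first sentence of the context and the result shown above; `show_in_context` is a hypothetical helper added for illustration.

```python
# First lines of the context above, including the leading newline.
context_head = (
    "\n"
    "Coronaviruses are a large family of viruses which may cause "
    "illness in animals or humans.  \n"
)

# Result copied from the output above.
qa_result = {
    'answer': 'a large family of viruses which may cause illness in animals or humans',
    'end': 89,
    'score': 0.6717580556869507,
    'start': 19,
}

# Slicing the context with the offsets reproduces the answer text.
span = context_head[qa_result['start']:qa_result['end']]


def show_in_context(context, result, window=20):
    """Return the answer in brackets with some surrounding characters."""
    lo = max(0, result['start'] - window)
    hi = min(len(context), result['end'] + window)
    return (context[lo:result['start']]
            + '[' + context[result['start']:result['end']] + ']'
            + context[result['end']:hi])


print(show_in_context(context_head, qa_result).strip())
```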

In [21]:
nlp_qa(context=context, question='What is covid-19 ?')


{'answer': 'infectious disease caused by the most recently discovered coronavirus',
 'end': 480,
 'score': 0.4254714548587799,
 'start': 411}

In [22]:
nlp_qa(context=context, question='What are covid-19 symptoms ?')


{'answer': 'fever, dry cough, and tiredness',
 'end': 719,
 'score': 0.840667724609375,
 'start': 688}

In [23]:
nlp_qa(context=context, question='How do people get infected by covid-19 ?')


{'answer': 'touching these objects or surfaces',
 'end': 2528,
 'score': 0.3460101783275604,
 'start': 2494}

In [24]:
nlp_qa(context=context, question='How can we protect ourselves from covid-19 ?')


{'answer': 'Practicing hand and respiratory hygiene',
 'end': 2727,
 'score': 0.6492879390716553,
 'start': 2688}

## 4. Text Generation - Mask Filling

In [25]:
nlp_fill = pipeline('fill-mask')

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)



In [26]:
nlp_fill('The ship is reaching the ' + nlp_fill.tokenizer.mask_token)

[{'score': 0.09211745113134384,
  'sequence': 'The ship is reaching the surface',
  'token': 4084,
  'token_str': ' surface'},
 {'score': 0.07958997040987015,
  'sequence': 'The ship is reaching the port',
  'token': 4103,
  'token_str': ' port'},
 {'score': 0.053496330976486206,
  'sequence': 'The ship is reaching the shore',
  'token': 8373,
  'token_str': ' shore'},
 {'score': 0.04391562566161156,
  'sequence': 'The ship is reaching the dock',
  'token': 15261,
  'token_str': ' dock'},
 {'score': 0.031611308455467224,
  'sequence': 'The ship is reaching the destination',
  'token': 6381,
  'token_str': ' destination'}]
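
The scores above are probabilities for individual candidate tokens, and the five listed candidates cover only part of the total probability mass. The sketch below, using values copied from the output above, ranks the candidate words and sums their scores; `summarize_candidates` is a hypothetical helper. Note that the `token_str` values above start with a space, which is stripped here.

```python
# Scores and tokens copied from the fill-mask output above.
candidates = [
    {'score': 0.09211745113134384, 'token_str': ' surface'},
    {'score': 0.07958997040987015, 'token_str': ' port'},
    {'score': 0.053496330976486206, 'token_str': ' shore'},
    {'score': 0.04391562566161156, 'token_str': ' dock'},
    {'score': 0.031611308455467224, 'token_str': ' destination'},
]


def summarize_candidates(cands):
    """Return candidate words ranked by score and their combined probability."""
    ranked = sorted(cands, key=lambda c: c['score'], reverse=True)
    words = [c['token_str'].strip() for c in ranked]
    covered = sum(c['score'] for c in cands)
    return words, covered


words, covered = summarize_candidates(candidates)
print(words)
print(f'probability covered by these candidates: {covered:.2f}')
```

Here the five candidates together account for roughly 30% of the probability, which suggests the model considers many other completions plausible for this short prompt.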

## 5. Summarization

Summarization is currently supported by `Bart` and `T5`.

In [27]:
summarizer = pipeline('summarization', model='facebook/bart-large-cnn', 
                      tokenizer='facebook/bart-large-cnn')


In [28]:
BIG_DOC = """ 
Coronaviruses are a large family of viruses which may cause illness in animals or humans.  
In humans, several coronaviruses are known to cause respiratory infections ranging from the 
common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). 
The most recently discovered coronavirus causes coronavirus disease COVID-19.
COVID-19 is the infectious disease caused by the most recently discovered coronavirus. 
This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019. 
COVID-19 is now a pandemic affecting many countries globally.
The most common symptoms of COVID-19 are fever, dry cough, and tiredness. 
Other symptoms that are less common and may affect some patients include aches 
and pains, nasal congestion, headache, conjunctivitis, sore throat, diarrhea, 
loss of taste or smell or a rash on skin or discoloration of fingers or toes. 
These symptoms are usually mild and begin gradually. 
Some people become infected but only have very mild symptoms.
Most people (about 80%) recover from the disease without needing hospital treatment. 
Around 1 out of every 5 people who gets COVID-19 becomes seriously ill and develops difficulty breathing. 
Older people, and those with underlying medical problems like high blood pressure, heart and lung problems, 
diabetes, or cancer, are at higher risk of developing serious illness.  
However, anyone can catch COVID-19 and become seriously ill.  
People of all ages who experience fever and/or  cough associated with difficulty breathing/shortness of breath, 
chest pain/pressure, or loss of speech or movement should seek medical attention immediately. 
If possible, it is recommended to call the health care provider or facility first, 
so the patient can be directed to the right clinic.
People can catch COVID-19 from others who have the virus. 
The disease spreads primarily from person to person through small droplets from the nose or mouth, 
which are expelled when a person with COVID-19 coughs, sneezes, or speaks. 
These droplets are relatively heavy, do not travel far and quickly sink to the ground. 
People can catch COVID-19 if they breathe in these droplets from a person infected with the virus.  
This is why it is important to stay at least 1 meter) away from others. 
These droplets can land on objects and surfaces around the person such as tables, doorknobs and handrails.  
People can become infected by touching these objects or surfaces, then touching their eyes, nose or mouth.  
This is why it is important to wash your hands regularly with soap and water or clean with alcohol-based hand rub.
Practicing hand and respiratory hygiene is important at ALL times and is the best way to protect others and yourself.
When possible maintain at least a 1 meter distance between yourself and others. 
This is especially important if you are standing by someone who is coughing or sneezing.  
Since some infected persons may not yet be exhibiting symptoms or their symptoms may be mild, 
maintaining a physical distance with everyone is a good idea if you are in an area where COVID-19 is circulating. 
"""


result = summarizer(BIG_DOC)

In [29]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [30]:
summary = result[0]['summary_text']
print('\n'.join(nltk.sent_tokenize(summary)))


COVID-19 is the infectious disease caused by the most recently discovered coronavirus.
It is now a pandemic affecting many countries globally.
Most people recover from the disease without needing hospital treatment.
Around 1 out of every 5 people who gets COVID- 19 becomes seriously ill and develops difficulty breathing.


## 6. Translation

Translation is currently supported by `T5` for the language mappings English-to-French (`translation_en_to_fr`), English-to-German (`translation_en_to_de`) and English-to-Romanian (`translation_en_to_ro`).
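
The task names listed above follow a `translation_<source>_to_<target>` pattern. A small sketch of building such a name from language codes is shown below; the helper `translation_task` and the `SUPPORTED_PAIRS` set are illustrative assumptions that simply encode the three pairs named in the text, not a list exported by the library.

```python
# The three language pairs named in the text above.
SUPPORTED_PAIRS = {("en", "fr"), ("en", "de"), ("en", "ro")}


def translation_task(source, target):
    """Build a task name such as 'translation_en_to_fr'."""
    if (source, target) not in SUPPORTED_PAIRS:
        raise ValueError(f"Unsupported language pair: {source} -> {target}")
    return f"translation_{source}_to_{target}"


task_name = translation_task("en", "ro")
print(task_name)
```

The resulting string can be passed to `pipeline()` in place of a hard-coded task name.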

In [31]:
# English to French
translator = pipeline('translation_en_to_fr')

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)





In [32]:
translator("The quick brown fox jumped over the lazy dog")

[{'translation_text': 'Le renard brun rapide saute au-dessus du chien piètre'}]

In [35]:
# English to German
translator = pipeline('translation_en_to_de')

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


In [36]:
translator("The quick brown fox jumped over the lazy dog")

[{'translation_text': 'Der schnelle braune Fuchs sprang über den faulen Hund'}]

## 7. Text Generation

Text generation is currently supported by GPT-2, OpenAI GPT, TransfoXL, XLNet, CTRL and Reformer.

In [37]:
text_generator = pipeline("text-generation", model='gpt2', tokenizer='gpt2')


In [38]:
result = text_generator("The football game is about to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [39]:
print(result[0]['generated_text'])

The football game is about to change," said Kuzma.

More from WorldPost:

DUKE'S FALCON FADES POSSESSION BY MARY DUNCAN, STRIKE IN PITCH



## 8. Projection - Feature Extraction

In [40]:
import numpy as np
nlp_features = pipeline('feature-extraction')
output = nlp_features('This is a short sentence')
print('Shape:', np.array(output).shape)   # (Samples, Tokens, Vector Size)
np.array(output)

No model was supplied, defaulted to distilbert-base-cased (https://huggingface.co/distilbert-base-cased)



Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Shape: (1, 7, 768)


array([[[ 0.31193665,  0.06641193, -0.03036013, ...,  0.04635384,
          0.34695518,  0.07248037],
        [-0.17498867, -0.08628001,  0.32349417, ...,  0.51969761,
          0.24023347,  0.37703663],
        [ 0.10653973,  0.27738383,  0.33841369, ...,  0.66813082,
          0.31344929,  0.13851519],
        ...,
        [ 0.27000755, -0.05821747,  0.10258091, ...,  0.46564612,
          0.34796897,  0.26275694],
        [ 0.14764586,  0.32623136,  0.03522095, ...,  0.09630004,
          0.21582818,  0.09945892],
        [ 0.89270818, -0.05968791, -0.14042342, ...,  0.35298908,
          0.87237018, -0.42272508]]])
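
The output has one vector per token, with shape `(1, 7, 768)` for the sentence above. One common way to obtain a single sentence-level vector is to average the token vectors (mean pooling). The sketch below shows the idea in pure Python on a toy `(1, 3, 4)` array so that it stays self-contained; `mean_pool`, `cosine` and the toy numbers are illustrative assumptions, not values produced by the model.

```python
import math


def mean_pool(token_vectors):
    """Average equal-length token vectors into one sentence vector."""
    count = len(token_vectors)
    return [sum(dim) / count for dim in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-in shaped like the pipeline output: (samples, tokens, vector size).
toy_output = [[[1.0, 0.0, 2.0, 0.0],
               [3.0, 0.0, 0.0, 0.0],
               [2.0, 3.0, 1.0, 0.0]]]

sentence_vector = mean_pool(toy_output[0])
print(sentence_vector)
print(cosine(sentence_vector, toy_output[0][0]))
```

With the real output, the same pooling can be written as `np.array(output).mean(axis=1)`, which yields an array of shape `(1, 768)`; pooled vectors for different sentences can then be compared with cosine similarity.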

You now have a good picture of what is possible with transformers' pipelines.

Feel free to try these different pipelines with your own inputs.