## Multi-Task NLP with Transformers

__Credit:__ This notebook is adapted from the examples in HuggingFace's __`transformers`__ package. You can check out their repository [here](https://github.com/huggingface/transformers).

Introduced in transformers v2.3.0, **pipelines** provide a high-level, easy-to-use
API for running inference on a variety of downstream tasks, including:

- ***Sentence Classification _(Sentiment Analysis)_***: Indicates whether the overall sentence is positive or negative, i.e. a *binary classification* or *logistic regression* task.
- ***Token Classification (Named Entity Recognition, Part-of-Speech tagging)***: Assigns a label to each sub-entity _(token)_ in the input, i.e. a per-token classification task.
- ***Question-Answering***: Given a tuple (`question`, `context`), the model finds the span of text in `context` that answers the `question`.
- ***Mask-Filling***: Suggests possible word(s) to fill the masked position in the input, given the surrounding `context`.
- ***Summarization***: Summarizes the `input` article into a shorter text.
- ***Translation***: Translates the input from one language to another.
- ***Feature Extraction***: Maps the input to a multi-dimensional vector space learned from the data.

Pipelines encapsulate the overall process behind every NLP task:
 
 1. *Tokenization*: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).
 2. *Inference*: Map every token to a more meaningful representation.
 3. *Decoding*: Use the above representation to generate and/or extract the final output for the underlying task.
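
To make the three stages concrete, here is a toy, dependency-free sketch. Everything in it (the `VOCAB` table, the `TOKEN_SCORES` values, and the function names) is hypothetical and only stands in for the learned tokenizer and transformer model that a real pipeline uses; it illustrates the flow, not how `transformers` is implemented.

```python
import math

# Hypothetical vocabulary and per-token scores for (NEGATIVE, POSITIVE).
VOCAB = {"[UNK]": 0, "great": 1, "bad": 2, "movie": 3, "a": 4}
TOKEN_SCORES = {0: (0.0, 0.0), 1: (-2.0, 2.0), 2: (2.0, -2.0),
                3: (0.0, 0.0), 4: (0.0, 0.0)}
LABELS = ("NEGATIVE", "POSITIVE")


def tokenize(text):
    """Stage 1: split the text into tokens and map them to integer ids."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def infer(token_ids):
    """Stage 2: turn the token ids into one raw score (logit) per label."""
    negative = sum(TOKEN_SCORES[i][0] for i in token_ids)
    positive = sum(TOKEN_SCORES[i][1] for i in token_ids)
    return [negative, positive]


def decode(logits):
    """Stage 3: convert logits into the task output via a softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, mirroring what a pipeline does end to end."""
    return decode(infer(tokenize(text)))


print(toy_pipeline("a great movie"))
print(toy_pipeline("a bad movie"))
```

The returned dictionary deliberately has the same `label`/`score` shape as the sentiment-analysis output shown later in this notebook.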

The overall API is exposed to the end user through the `pipeline()` function, with the following
structure:

```python
from transformers import pipeline

# Using default model and tokenizer for the task
pipeline("<task-name>")

# Using a user-specified model
pipeline("<task-name>", model="<model_name>")

# Using a custom model and tokenizer, both given as strings
pipeline("<task-name>", model="<model_name>", tokenizer="<tokenizer_name>")
```

## Install dependencies

In [1]:
!nvidia-smi

Fri Sep 18 03:30:26 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    23W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ae/05/c8c55b600308dc04e95100dc8ad8a244dd800fe75dfafcf1d6348c6f6209/transformers-3.1.0-py3-none-any.whl (884kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.8.1.rc2
  Downloading https://files.pythonhosted.org/packages/80/83/8b9fccb9e48eeb575ee19179e2bdde0ee9a1904f97de5f02d19016b8804f/tokenizers-0.8.1rc2-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [3]:
from transformers import pipeline

## 1. Sentence Classification - Sentiment Analysis

In [4]:
nlp_sentiment_model = pipeline('sentiment-analysis')

In [5]:
nlp_sentiment_model('This is an excellent movie! Really nice plot and casting.')

[{'label': 'POSITIVE', 'score': 0.9998741149902344}]

In [6]:
nlp_sentiment_model('This movie was so NOT good!')

[{'label': 'NEGATIVE', 'score': 0.9998019337654114}]
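
When several sentences are scored, it can be handy to fold the label and the score into a single signed number so the results can be averaged or sorted. The sketch below does this for the two predictions above; the `signed_score` helper is hypothetical (not part of `transformers`) and assumes the only labels are `POSITIVE` and `NEGATIVE`.

```python
def signed_score(prediction):
    """Map a {'label', 'score'} dict to a value in [-1, 1]."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Predictions copied from the two cells above.
predictions = [
    {"label": "POSITIVE", "score": 0.9998741149902344},
    {"label": "NEGATIVE", "score": 0.9998019337654114},
]

scores = [signed_score(p) for p in predictions]
average = sum(scores) / len(scores)

print(scores)
print(f"Average sentiment: {average:+.6f}")
```

With one strongly positive and one strongly negative review, the average lands close to zero, which is what one would expect from a balanced pair.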

## 2. Token Classification - Named Entity Recognition

In [7]:
nlp_token_class = pipeline('ner')

In [8]:
text = """Three more countries have joined an "international grand committee" of parliaments, adding to calls for 
Facebook's boss, Mark Zuckerberg, to give evidence on misinformation to the coalition. Brazil, Latvia and Singapore 
bring the total to eight different parliaments across the world, with plans to send representatives to London on 27 
November with the intention of hearing from Zuckerberg. Since the Cambridge Analytica scandal broke, the Facebook chief 
has only appeared in front of two legislatures: the American Senate and House of Representatives, and the European parliament. 
Facebook has consistently rebuffed attempts from others, including the UK and Canadian parliaments, to hear from Zuckerberg. 
He added that an article in the New York Times on Thursday, in which the paper alleged a pattern of behaviour from Facebook 
to "delay, deny and deflect" negative news stories, "raises further questions about how recent data breaches were allegedly 
dealt with within Facebook."
"""

nlp_token_class(text)



[{'entity': 'I-ORG',
  'index': 20,
  'score': 0.998949408531189,
  'word': 'Facebook'},
 {'entity': 'I-PER', 'index': 25, 'score': 0.999535083770752, 'word': 'Mark'},
 {'entity': 'I-PER', 'index': 26, 'score': 0.9987813234329224, 'word': 'Z'},
 {'entity': 'I-PER',
  'index': 27,
  'score': 0.9470852017402649,
  'word': '##uck'},
 {'entity': 'I-PER', 'index': 28, 'score': 0.7890573740005493, 'word': '##er'},
 {'entity': 'I-PER',
  'index': 29,
  'score': 0.9926771521568298,
  'word': '##berg'},
 {'entity': 'I-LOC',
  'index': 42,
  'score': 0.9998429417610168,
  'word': 'Brazil'},
 {'entity': 'I-LOC',
  'index': 44,
  'score': 0.9998599290847778,
  'word': 'Latvia'},
 {'entity': 'I-LOC',
  'index': 46,
  'score': 0.9998783469200134,
  'word': 'Singapore'},
 {'entity': 'I-LOC',
  'index': 65,
  'score': 0.9996254444122314,
  'word': 'London'},
 {'entity': 'I-PER', 'index': 75, 'score': 0.9987867474555969, 'word': 'Z'},
 {'entity': 'I-PER',
  'index': 76,
  'score': 0.9003939032554626,
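
The output above reports one entry per sub-word token, so a name like "Zuckerberg" is split into `Z`, `##uck`, `##er`, `##berg`. Below is a minimal sketch of one way to stitch these pieces back together. The `merge_word_pieces` helper is hypothetical, and it rests on two assumptions: a leading `##` marks a continuation of the previous token, and tokens with consecutive `index` values and the same `entity` label belong to the same entity. Taking the minimum score of the pieces is a conservative choice, not the library's behaviour.

```python
def merge_word_pieces(tokens):
    """Merge '##' continuations and adjacent same-label tokens into entities."""
    merged = []
    for tok in tokens:
        prev = merged[-1] if merged else None
        adjacent = (
            prev is not None
            and tok["index"] == prev["last_index"] + 1
            and tok["entity"] == prev["entity"]
        )
        if adjacent:
            if tok["word"].startswith("##"):
                prev["word"] += tok["word"][2:]    # glue the sub-word piece
            else:
                prev["word"] += " " + tok["word"]  # next word of the same entity
            prev["last_index"] = tok["index"]
            prev["score"] = min(prev["score"], tok["score"])
        else:
            merged.append({
                "entity": tok["entity"],
                "word": tok["word"],
                "score": tok["score"],
                "last_index": tok["index"],
            })
    # Drop the bookkeeping field before returning.
    return [{k: m[k] for k in ("entity", "word", "score")} for m in merged]


# The first few tokens copied from the output above.
tokens = [
    {"entity": "I-ORG", "index": 20, "score": 0.998949408531189, "word": "Facebook"},
    {"entity": "I-PER", "index": 25, "score": 0.999535083770752, "word": "Mark"},
    {"entity": "I-PER", "index": 26, "score": 0.9987813234329224, "word": "Z"},
    {"entity": "I-PER", "index": 27, "score": 0.9470852017402649, "word": "##uck"},
    {"entity": "I-PER", "index": 28, "score": 0.7890573740005493, "word": "##er"},
    {"entity": "I-PER", "index": 29, "score": 0.9926771521568298, "word": "##berg"},
    {"entity": "I-LOC", "index": 42, "score": 0.9998429417610168, "word": "Brazil"},
    {"entity": "I-LOC", "index": 44, "score": 0.9998599290847778, "word": "Latvia"},
]

entities = merge_word_pieces(tokens)
for entity in entities:
    print(entity)
```

Note that `Brazil` and `Latvia` stay separate because their indices are not consecutive (a comma sits between them in the text).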
 

## 3. Question Answering

In [9]:
nlp_qa = pipeline('question-answering')

In [10]:
context = """
Coronaviruses are a large family of viruses which may cause illness in animals or humans.  
In humans, several coronaviruses are known to cause respiratory infections ranging from the 
common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). 
The most recently discovered coronavirus causes coronavirus disease COVID-19.
COVID-19 is the infectious disease caused by the most recently discovered coronavirus. 
This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019. 
COVID-19 is now a pandemic affecting many countries globally.
The most common symptoms of COVID-19 are fever, dry cough, and tiredness. 
Other symptoms that are less common and may affect some patients include aches 
and pains, nasal congestion, headache, conjunctivitis, sore throat, diarrhea, 
loss of taste or smell or a rash on skin or discoloration of fingers or toes. 
These symptoms are usually mild and begin gradually. 
Some people become infected but only have very mild symptoms.
Most people (about 80%) recover from the disease without needing hospital treatment. 
Around 1 out of every 5 people who gets COVID-19 becomes seriously ill and develops difficulty breathing. 
Older people, and those with underlying medical problems like high blood pressure, heart and lung problems, 
diabetes, or cancer, are at higher risk of developing serious illness.  
However, anyone can catch COVID-19 and become seriously ill.  
People of all ages who experience fever and/or  cough associated with difficulty breathing/shortness of breath, 
chest pain/pressure, or loss of speech or movement should seek medical attention immediately. 
If possible, it is recommended to call the health care provider or facility first, 
so the patient can be directed to the right clinic.
People can catch COVID-19 from others who have the virus. 
The disease spreads primarily from person to person through small droplets from the nose or mouth, 
which are expelled when a person with COVID-19 coughs, sneezes, or speaks. 
These droplets are relatively heavy, do not travel far and quickly sink to the ground. 
People can catch COVID-19 if they breathe in these droplets from a person infected with the virus.  
This is why it is important to stay at least 1 meter) away from others. 
These droplets can land on objects and surfaces around the person such as tables, doorknobs and handrails.  
People can become infected by touching these objects or surfaces, then touching their eyes, nose or mouth.  
This is why it is important to wash your hands regularly with soap and water or clean with alcohol-based hand rub.
Practicing hand and respiratory hygiene is important at ALL times and is the best way to protect others and yourself.
When possible maintain at least a 1 meter distance between yourself and others. 
This is especially important if you are standing by someone who is coughing or sneezing.  
Since some infected persons may not yet be exhibiting symptoms or their symptoms may be mild, 
maintaining a physical distance with everyone is a good idea if you are in an area where COVID-19 is circulating. 
"""

In [11]:
nlp_qa(context=context, question='What is a coronavirus ?')



{'answer': 'a large family of viruses which may cause illness in animals or humans.',
 'end': 92,
 'score': 0.6717595458030701,
 'start': 19}

In [12]:
nlp_qa(context=context, question='What is covid-19 ?')



{'answer': 'infectious disease caused by the most recently discovered coronavirus.',
 'end': 482,
 'score': 0.42547109723091125,
 'start': 411}

In [13]:
nlp_qa(context=context, question='What are covid-19 symptoms ?')



{'answer': 'fever, dry cough, and tiredness.',
 'end': 721,
 'score': 0.8865543603897095,
 'start': 688}

In [14]:
nlp_qa(context=context, question='How does covid-19 spread ?')



{'answer': 'small droplets from the nose or mouth,',
 'end': 2016,
 'score': 0.3333466649055481,
 'start': 1977}

In [15]:
nlp_qa(context=context, question='How do people get infected by covid-19 ?')



{'answer': 'by touching these objects or surfaces, then touching their eyes, nose or mouth.',
 'end': 2572,
 'score': 0.22112037241458893,
 'start': 2491}

In [16]:
nlp_qa(context=context, question='How can we protect ourselves from covid-19 ?')



{'answer': 'Practicing hand and respiratory hygiene',
 'end': 2727,
 'score': 0.6725284457206726,
 'start': 2688}
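
The `score` values above vary quite a lot between questions, so it may be worth separating answers the model is fairly sure about from those it is not. The sketch below applies a cut-off to the scores from the cells above; the `split_by_confidence` helper and the `0.5` threshold are assumptions chosen for illustration, not recommended values.

```python
def split_by_confidence(results, threshold=0.5):
    """Partition QA results into (confident, uncertain) by their score."""
    confident = [r for r in results if r["score"] >= threshold]
    uncertain = [r for r in results if r["score"] < threshold]
    return confident, uncertain


# Questions and scores copied from the cells above.
qa_results = [
    {"question": "What is a coronavirus ?", "score": 0.6717595458030701},
    {"question": "What is covid-19 ?", "score": 0.42547109723091125},
    {"question": "What are covid-19 symptoms ?", "score": 0.8865543603897095},
    {"question": "How does covid-19 spread ?", "score": 0.3333466649055481},
    {"question": "How do people get infected by covid-19 ?", "score": 0.22112037241458893},
    {"question": "How can we protect ourselves from covid-19 ?", "score": 0.6725284457206726},
]

confident, uncertain = split_by_confidence(qa_results)

print("Confident:")
for r in confident:
    print(f"  {r['score']:.2f}  {r['question']}")
print("Uncertain:")
for r in uncertain:
    print(f"  {r['score']:.2f}  {r['question']}")
```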

## 4. Text Generation - Mask Filling

In [17]:
nlp_fill = pipeline('fill-mask')

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
nlp_fill('I am attending the open data science ' + nlp_fill.tokenizer.mask_token)

[{'score': 0.38388171792030334,
  'sequence': '<s>I am attending the open data science conference</s>',
  'token': 1019,
  'token_str': 'Ġconference'},
 {'score': 0.20860441029071808,
  'sequence': '<s>I am attending the open data science workshop</s>',
  'token': 9780,
  'token_str': 'Ġworkshop'},
 {'score': 0.06517862528562546,
  'sequence': '<s>I am attending the open data science congress</s>',
  'token': 12442,
  'token_str': 'Ġcongress'},
 {'score': 0.04182310402393341,
  'sequence': '<s>I am attending the open data science summit</s>',
  'token': 3564,
  'token_str': 'Ġsummit'},
 {'score': 0.03647162765264511,
  'sequence': '<s>I am attending the open data science conferences</s>',
  'token': 14041,
  'token_str': 'Ġconferences'}]
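
The raw output carries some tokenizer markup: the `Ġ` prefix in `token_str` marks a leading space in the byte-level BPE vocabulary used by RoBERTa-style models, and `<s>` / `</s>` are the sequence start and end markers. A small sketch for turning predictions into display-ready text follows; `clean_prediction` is a hypothetical helper, and it assumes these three markers are the only ones present.

```python
def clean_prediction(prediction):
    """Strip tokenizer markup from a single fill-mask prediction."""
    token = prediction["token_str"].replace("Ġ", " ").strip()
    sequence = (
        prediction["sequence"].replace("<s>", "").replace("</s>", "").strip()
    )
    return {"token": token, "sequence": sequence, "score": prediction["score"]}


# The top two predictions copied from the output above.
raw_predictions = [
    {"score": 0.38388171792030334,
     "sequence": "<s>I am attending the open data science conference</s>",
     "token": 1019,
     "token_str": "Ġconference"},
    {"score": 0.20860441029071808,
     "sequence": "<s>I am attending the open data science workshop</s>",
     "token": 9780,
     "token_str": "Ġworkshop"},
]

cleaned = [clean_prediction(p) for p in raw_predictions]
for item in cleaned:
    print(f"{item['score']:.3f}  {item['token']:<12} {item['sequence']}")
```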

## 5. Summarization

Summarization is currently supported by `Bart` and `T5`, and more recently by `Pegasus`.

In [20]:
summarizer = pipeline('summarization')

In [21]:
BIG_DOC = """ 
Coronaviruses are a large family of viruses which may cause illness in animals or humans.  
In humans, several coronaviruses are known to cause respiratory infections ranging from the 
common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). 
The most recently discovered coronavirus causes coronavirus disease COVID-19.
COVID-19 is the infectious disease caused by the most recently discovered coronavirus. 
This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019. 
COVID-19 is now a pandemic affecting many countries globally.
The most common symptoms of COVID-19 are fever, dry cough, and tiredness. 
Other symptoms that are less common and may affect some patients include aches 
and pains, nasal congestion, headache, conjunctivitis, sore throat, diarrhea, 
loss of taste or smell or a rash on skin or discoloration of fingers or toes. 
These symptoms are usually mild and begin gradually. 
Some people become infected but only have very mild symptoms.
Most people (about 80%) recover from the disease without needing hospital treatment. 
Around 1 out of every 5 people who gets COVID-19 becomes seriously ill and develops difficulty breathing. 
Older people, and those with underlying medical problems like high blood pressure, heart and lung problems, 
diabetes, or cancer, are at higher risk of developing serious illness.  
However, anyone can catch COVID-19 and become seriously ill.  
People of all ages who experience fever and/or  cough associated with difficulty breathing/shortness of breath, 
chest pain/pressure, or loss of speech or movement should seek medical attention immediately. 
If possible, it is recommended to call the health care provider or facility first, 
so the patient can be directed to the right clinic.
People can catch COVID-19 from others who have the virus. 
The disease spreads primarily from person to person through small droplets from the nose or mouth, 
which are expelled when a person with COVID-19 coughs, sneezes, or speaks. 
These droplets are relatively heavy, do not travel far and quickly sink to the ground. 
People can catch COVID-19 if they breathe in these droplets from a person infected with the virus.  
This is why it is important to stay at least 1 meter) away from others. 
These droplets can land on objects and surfaces around the person such as tables, doorknobs and handrails.  
People can become infected by touching these objects or surfaces, then touching their eyes, nose or mouth.  
This is why it is important to wash your hands regularly with soap and water or clean with alcohol-based hand rub.
Practicing hand and respiratory hygiene is important at ALL times and is the best way to protect others and yourself.
When possible maintain at least a 1 meter distance between yourself and others. 
This is especially important if you are standing by someone who is coughing or sneezing.  
Since some infected persons may not yet be exhibiting symptoms or their symptoms may be mild, 
maintaining a physical distance with everyone is a good idea if you are in an area where COVID-19 is circulating. 
"""


result = summarizer(BIG_DOC)

In [22]:
import nltk
nltk.download('punkt')
 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [23]:
summary = result[0]['summary_text']
print('\n'.join(nltk.sent_tokenize(summary)))


 COVID-19 is the infectious disease caused by the most recently discovered coronavirus .
The most common symptoms of the disease are fever, dry cough, tiredness and tiredness .
Around 80% of people recover from the disease without needing hospital treatment .
Around 1 out of every 5 people who gets the disease becomes seriously ill and develops difficulty breathing .
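
The generated sentences come back with a space before the final period and a leading space on the first line. A minimal clean-up sketch is shown below; `tidy_summary` is a hypothetical helper, and the regular expression is a simple heuristic that would also touch legitimate cases such as a space before a decimal like ` .5`, so treat it as a starting point.

```python
import re


def tidy_summary(text):
    """Remove whitespace before sentence punctuation and trim both ends."""
    return re.sub(r"\s+([.,;:!?])", r"\1", text).strip()


# The first two summary lines copied from the output above.
summary_lines = [
    " COVID-19 is the infectious disease caused by the most recently discovered coronavirus .",
    "The most common symptoms of the disease are fever, dry cough, tiredness and tiredness .",
]

tidied = [tidy_summary(line) for line in summary_lines]
print("\n".join(tidied))
```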


## 6. Translation

Translation is currently supported by `T5` for the language mappings English-to-French (`translation_en_to_fr`), English-to-German (`translation_en_to_de`) and English-to-Romanian (`translation_en_to_ro`).

In [24]:
# English to French
translator = pipeline('translation_en_to_fr')

In [25]:
translator("The quick brown fox jumped over the lazy dog")

[{'translation_text': 'Le renard brun rapide saute au-dessus du chien piètre'}]

In [26]:
# English to German
translator = pipeline('translation_en_to_de')

In [27]:
translator("The quick brown fox jumped over the lazy dog")

[{'translation_text': 'Der schnelle braune Fuchs sprang über den faulen Hund'}]
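
`T5` is a text-to-text model, and to my understanding it selects the task from a plain-text prefix prepended to the input, which the translation pipeline takes care of for you. The sketch below only illustrates that idea: `T5_PREFIXES` and `build_t5_input` are hypothetical names, and the prefix wording is an assumption that should be checked against the model configuration before relying on it.

```python
# Assumed task prefixes for the three language pairs listed above.
T5_PREFIXES = {
    "translation_en_to_fr": "translate English to French: ",
    "translation_en_to_de": "translate English to German: ",
    "translation_en_to_ro": "translate English to Romanian: ",
}


def build_t5_input(task, text):
    """Prepend the task prefix so the model knows which translation to run."""
    try:
        prefix = T5_PREFIXES[task]
    except KeyError:
        raise ValueError(f"Unsupported translation task: {task!r}") from None
    return prefix + text


sentence = "The quick brown fox jumped over the lazy dog"
for task in T5_PREFIXES:
    print(build_t5_input(task, sentence))
```

Requesting a language pair outside the table raises a `ValueError` rather than silently sending unprefixed text.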

## 7. Text Generation

Text generation is currently supported by GPT-2, OpenAI GPT, Transformer-XL, XLNet, CTRL, and Reformer.

In [28]:
text_generator = pipeline("text-generation", model='gpt2', tokenizer='gpt2')

In [55]:
result = text_generator("Today I am attending the Open Data Science Europe 2020 Conference because it is")

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


In [56]:
print(result[0]['generated_text'])

Today I am attending the Open Data Science Europe 2020 Conference because it is the right place to start. There is nothing wrong with it. I have the same passion for open data, working with local and foreign organizations on projects. All the other members agree


## 8. Projection - Feature Extraction

In [57]:
import numpy as np
nlp_features = pipeline('feature-extraction')
output = nlp_features('This is a short sentence')
print('Shape:', np.array(output).shape)   # (Samples, Tokens, Vector Size)
np.array(output)

Shape: (1, 7, 768)


array([[[ 0.31193641,  0.06641171, -0.03036025, ...,  0.04635391,
          0.34695509,  0.07248025],
        [-0.17498924, -0.08628017,  0.32349387, ...,  0.51969755,
          0.24023351,  0.37703678],
        [ 0.10653978,  0.27738377,  0.33841342, ...,  0.66813123,
          0.31344947,  0.13851488],
        ...,
        [ 0.27000779, -0.0582172 ,  0.10258101, ...,  0.46564651,
          0.34796879,  0.26275688],
        [ 0.14764577,  0.32623127,  0.03522119, ...,  0.09630016,
          0.21582849,  0.09945899],
        [ 0.89270842, -0.05968782, -0.14042357, ...,  0.35298914,
          0.87237084, -0.4227255 ]]])
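
The output holds one vector per token, so comparing two sentences usually requires collapsing each one into a single vector first. One common option (an assumption here, not something this pipeline does for you) is to average the token vectors and then compare sentences with cosine similarity. The sketch below uses plain Python lists and made-up 3-dimensional vectors in place of the 768-dimensional ones; `mean_pool` and `cosine_similarity` are hypothetical helpers.

```python
import math


def mean_pool(token_vectors):
    """Average a (Tokens, Vector Size) list of lists into one vector."""
    count = len(token_vectors)
    return [sum(dimension) / count for dimension in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors standing in for real model output.
sentence_a = [[1.0, 0.0, 1.0], [3.0, 2.0, 1.0]]
sentence_b = [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

vector_a = mean_pool(sentence_a)
vector_b = mean_pool(sentence_b)

print(vector_a)
print(cosine_similarity(vector_a, vector_b))
```

With the real pipeline, `mean_pool(output[0])` would pool the tokens of the first (and only) sample in `output`.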

Now you have a good picture of what is possible with the pipelines in `transformers`.

Feel free to try these pipelines with your own inputs.