<a href="https://colab.research.google.com/github/denocris/MHPC-Natural-Language-Processing-Lectures-2020/blob/master/lectrue_2_intro_huggingface_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to the Transformers Library by Hugging Face (Lecture II)


### My Contacts
For any questions or doubts you can find my contacts here:

* [Linkedin](https://www.linkedin.com/in/cristiano-de-nobili/) and [Twitter](https://twitter.com/denocris) (here I regularly post about AI and Science news)
* My [Personal Website](https://denocris.com)
* My [Instagram](https://www.instagram.com/denocris/?hl=it) (I am a Pilot, so here I mostly post about traveling, flying and adventures)
* My recent TEDx on [AI and Human Creativity](https://youtu.be/8-hrmer9d_E)

### Course Repository

All notebooks can be found [here!](https://github.com/denocris/MHPC-Natural-Language-Processing-Lectures-2020)

### Goals of this lecture

* Understand the basics of the Transformers library and its pipelines
* Understand Transfer Learning, in particular the concept of fine-tuning
* Warm up and get to know the most common problems in NLP

In this lecture (and throughout the course) **we will never train a model from scratch**: that is beyond our means at the moment. In this 2nd lecture, we will use pre-trained models and extract useful features from them (**feature extraction**). Only in the 3rd lecture will we **fine-tune a model**. So, keep in mind the differences between

* Training from scratch (going from random weights/parameters to learned ones). This is beyond our means;

* Feature extraction (using an already trained model, without modifying it, as a generator of learned features);

* Fine-tuning (re-training an already trained model on a specific task or dataset).

## Introduction

[Transformers](https://huggingface.co/transformers/) was built by [Hugging Face](https://huggingface.co/), a startup based in Paris and New York whose mission is to democratize NLP. Over the last year, they have contributed strongly to the recent NLP revolution by building an easy-to-use interface between the latest available models and real-world applications.

The Transformers library provides state-of-the-art, general-purpose transformer-based architectures (such as BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, T5, CTRL...) for Natural Language Processing, Understanding, and Generation, with thousands of pre-trained models in 100+ languages and deep interoperability between PyTorch and TensorFlow 2.0.

Transformers is an opinionated library built for NLP researchers seeking to use/study/extend large-scale transformer models. The library was designed with two strong goals in mind:

* be as easy and fast to use as possible;
* provide state-of-the-art models with performances as close as possible to the original models.

The aim of this section is to use the Transformers `pipelines` API at the highest possible level. Without any training, we take advantage of pre-trained models to tackle a variety of downstream tasks (Sentence Classification, Question Answering). The idea is to warm up with the most popular NLP problems.

In the following lectures, we will go lower-level and look in greater detail at what is hidden behind this high-level API.

In [2]:
!pip install -q transformers



In [0]:
from __future__ import print_function
import ipywidgets as widgets

import numpy as np
import copy
import torch

from transformers import BertTokenizer, BertForMaskedLM
from transformers import pipeline
from transformers import AutoTokenizer, AutoModel, AutoModelForQuestionAnswering

## About Transfer Learning, Pre-Trained Models and BERT


**What is Transfer Learning?**

Many successful NLP applications rely on transfer learning. The same is true for Computer Vision and other deep learning fields.

Transfer learning is a technique that consists of training a machine learning model on one task and reusing the knowledge it gained on a different but related task.

![alt text](https://capstoneretire.com/wp-content/uploads/2014/10/grandpa-granddaughter-walking.jpg)

So why should we use Transfer Learning in NLP?

* Many NLP tasks, such as question answering or named entity recognition (NER), share common knowledge about language (underlying semantics…)

* It gives us the opportunity to reuse the huge quantity of unlabeled text available on the web (through self-supervised learning).

The idea behind Transfer Learning is to store the knowledge gained while solving a source task in a source domain and apply it to another, similar problem of interest. It is the same concept as learning by experience: we learn something and then use that knowledge to solve a similar task.

Recent models such as BERT, RoBERTa and GPT-n have a huge number of parameters. Training them requires a lot of text data and a lot of energy. So, for the moment, only big companies, well-funded startups, or research institutions can afford this effort. That is why we will not train one of these models from scratch. The good news is that many of those companies and institutions have trained, and will keep training, models for us. That is why transfer learning is important!

Regarding these models, let us take a bird's-eye view of how they are trained.

**But first, what is a language model?**

Language modeling is the development of probabilistic models that predict a word in a sequence given the other words in that sequence. Example:

`I <?> holidays` ( prob_love = 0.99, prob_hate = 0.009, prob_spaghetti = 0.001).
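To make the idea concrete, here is a minimal sketch of a count-based "language model" in plain Python. The toy corpus and the helper `fill_probabilities` are assumptions made up for illustration; a real model such as BERT learns these probabilities with a neural network trained on a very large corpus.

```python
from collections import Counter

# Hypothetical toy corpus; a real language model learns from billions of words.
corpus = [
    "i love holidays",
    "i love holidays",
    "i love holidays",
    "i hate holidays",
    "i love spaghetti",
]

def fill_probabilities(left, right, sentences):
    """Estimate P(word | left, right) by counting matching three-word sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if len(words) == 3 and words[0] == left and words[2] == right:
            counts[words[1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # the context was never observed
    return {word: count / total for word, count in counts.items()}

probs = fill_probabilities("i", "holidays", corpus)
print(probs)  # {'love': 0.75, 'hate': 0.25}
```

The probabilities for a given context always sum to one, which is the property the example `I <?> holidays` relies on.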

**How is BERT trained from scratch?**

A common practice in transfer learning (in NLP) is to train a language model from scratch in a self-supervised manner (see the slide below). During training, 15% of the tokens in each sentence are randomly selected.
The selected tokens are then altered according to the following rule:

- 80% of the time they are replaced with the MASK token
(as shown below, `I <mask> like a spritz`);
- 10% of the time they are replaced with a random token (`I peperoni like a spritz`);
- 10% of the time the original token is kept (`I would like a spritz`).
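The rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `mask_tokens`, the tiny replacement vocabulary and the use of `None` as the "not selected" label are all hypothetical, and the selection probability is raised to 0.5 so that the effect is visible on a five-word sentence.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 80/10/10 masking rule to a list of token strings."""
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position is not predicted
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected: left untouched, no label
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with the MASK token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return masked, labels

rng = random.Random(0)  # fixed seed so the run is reproducible
tokens = "i would like a spritz".split()
vocab = ["peperoni", "treno", "mercato"]  # hypothetical replacement vocabulary
masked, labels = mask_tokens(tokens, vocab, rng, select_prob=0.5)
print(masked)
print(labels)
```

Positions whose label is `None` are never altered, while every selected position keeps its original token as the label, whatever replacement was applied to the input.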

<center>  <img src="https://docs.google.com/uc?export=download&id=1CUGSrqD6TPmojldiWY0enFganC7PjL8K" width="600" height="350"> </center> 

BERT is also trained on a second task in addition to language modeling: next-sentence prediction. To avoid adding too much information, I will not explain it in detail, since it is not relevant for this lecture. In short, next-sentence prediction is a binary classification task: given two sentences, the model outputs whether the second one follows the first or not.

Once training is done and the pre-trained model is released, we can use the model for a downstream task, and this is where fine-tuning comes into play. One common workflow is to keep the pre-trained model's internals unchanged and add linear layers on top of it, or to use the model's output as input to a separate model.
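As a rough illustration of the "frozen model plus a small layer on top" workflow, the sketch below uses plain Python only. Everything in it is a toy assumption: `frozen_features` stands in for a pre-trained model whose parameters are never touched, and `train_head` fits a logistic-regression head on its outputs with simple gradient descent.

```python
import math

def frozen_features(text):
    """Stand-in for a pre-trained model: nothing in here is ever updated."""
    words = text.lower().split()
    return [
        sum(w in {"good", "great", "happy"} for w in words),
        sum(w in {"bad", "ugly", "sad"} for w in words),
    ]

def train_head(examples, epochs=200, lr=0.5):
    """Fit a logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = frozen_features(text)
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            error = pred - label  # gradient of the log loss with respect to z
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(text, weights, bias):
    """Probability that the text is positive, according to the trained head."""
    x = frozen_features(text)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

examples = [("a great happy day", 1), ("a bad sad day", 0),
            ("good food", 1), ("ugly weather", 0)]
weights, bias = train_head(examples)
print(predict("what a great film", weights, bias) > 0.5)  # True
print(predict("what a sad film", weights, bias) > 0.5)    # False
```

Only `weights` and `bias` change during training; in fine-tuning, by contrast, the parameters of the pre-trained model itself would also be updated.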

<center>  <img src="https://docs.google.com/uc?export=download&id=1x5eBU67IdCiuQTOGHiZLn75_UDoRigR2" width="600" height="350"> </center> 



---
---


The Transformers library provides many pre-trained models:
 * List of models pre-trained by Hugging Face ([look here](https://huggingface.co/transformers/pretrained_models.html))

 * List of models pre-trained and uploaded by the community ([here!](https://huggingface.co/models))

Let us start by importing a very special pre-trained model. It was trained simultaneously on 104 different languages, which is why it is called `multilingual`. Until a few months ago, it was the only available model that was also trained on less common languages.

If you are working with English text, I suggest you use an English-only pre-trained model such as `bert-base-cased`.






In [0]:
# Set multilingual pre-trained model
pretrained_model = 'bert-base-multilingual-cased'

# English only
#pretrained_model = 'bert-base-cased'

Once a model has been chosen, there are two more steps to take before we can start using the library:

* Load the pre-trained model's tokenizer
* Load the pre-trained model (there are several possibilities, as we will see later)

In [0]:
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained(pretrained_model)

# Load model, in particular the base model with a head for masked language model
model = BertForMaskedLM.from_pretrained(pretrained_model)

# Set the model in evaluation mode
model = model.eval()


## Tokenizers and Preprocessing


A tokenizer is in charge of preparing the inputs for a model. In particular, it takes care of

* tokenizing (splitting strings into sub-word token strings), converting token strings to IDs and back, and encoding/decoding (i.e. tokenizing + converting to integers);

* adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…);

* managing special tokens such as the mask and beginning-of-sentence tokens.
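To give a feeling for what sub-word tokenization does, here is a minimal sketch of a greedy longest-match-first split in the WordPiece style. The vocabulary, its IDs and the helper names `wordpiece` and `encode` are hypothetical and exist only for this example; the real tokenizer has a much larger vocabulary and more elaborate pre-processing.

```python
# Hypothetical toy vocabulary; the real one has tens of thousands of entries.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "che": 4, "tempo": 5, "far": 6, "##à": 7,
         "doma": 8, "##ni": 9, "?": 10}
id_to_token = {i: t for t, i in vocab.items()}

def wordpiece(word):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a previous piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(sentence):
    """Tokenize, add the special tokens and convert to integer IDs."""
    tokens = ["[CLS]"]
    for word in sentence.replace("?", " ?").split():
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return [vocab[t] for t in tokens]

ids = encode("che tempo farà domani?")
print([id_to_token[i] for i in ids])
# ['[CLS]', 'che', 'tempo', 'far', '##à', 'doma', '##ni', '?', '[SEP]']
```

The `##` prefix marks a piece that continues the previous one, which is how a word missing from the vocabulary can still be represented without falling back to `[UNK]`.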

In [0]:
# Define sentences to be processed
sentence_list = ['che tempo farà domani?',
                 'hi, what is your name?',
                 'sono stato al mercato a fare la spesa',
                 'i do not trust artificial intelligence',
                 'prenderò il treno per arrivare a trieste']

In [0]:
# Tokenize sentences

# Set print format
fmt = '{:<8}{:<20}'
# Set padding length
padding_len = 20
# Empty lists
sentence_ids = []
sentence_tokens = []

# Loop over sentences
for sentence in sentence_list:

  # Tokenize (convert to dictionary IDs) and pad sentence adding special tokens at the beginning and at the end
  sentence_ids.append(tokenizer.encode(sentence, add_special_tokens=True, max_length=padding_len, pad_to_max_length=True))

  # Convert back IDs to string for visualization
  sentence_tokens.append(tokenizer.convert_ids_to_tokens(sentence_ids[-1]))

  # Print original sentence and its tokenized version
  print('Original:', sentence)
  print('-----------------')
  print(fmt.format('ID', 'Token'))
  print('-----------------')
  for id, token in zip(sentence_ids[-1], sentence_tokens[-1]):
    print(fmt.format(id, token))
  print()

Original: che tempo farà domani?
-----------------
ID      Token               
-----------------
101     [CLS]               
10262   che                 
12238   tempo               
13301   far                 
10816   ##à                 
70908   doma                
10342   ##ni                
136     ?                   
102     [SEP]               
0       [PAD]               
0       [PAD]               
0       [PAD]               
0       [PAD]               
0       [PAD]               
0       [PAD]               
0       [PAD]               
0       [PAD]               
0       [PAD]               
0       [PAD]               
0       [PAD]               

Original: hi, what is your name?
-----------------
ID      Token               
-----------------
101     [CLS]               
11520   hi                  
117     ,                   
12976   what                
10124   is                  
20442   your                
11324   name                
136     ?           

In [0]:
# Print additional special tokens: ID 103 is [MASK]; an ID outside the vocabulary (here -1) falls back to [UNK]
print(tokenizer.convert_ids_to_tokens(103), tokenizer.convert_ids_to_tokens(-1))

[MASK] [UNK]


## BERT Language Model

Here we select one token in each sentence to be predicted by the pre-trained model. We also generate the `attention_mask`: because the sequences have different lengths and are padded to a common size, we need a way to tell the model not to attend to the padded positions.
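The relation between padding and the attention mask can be sketched as follows. The helper `pad_and_mask` is a hypothetical name used only here; it builds the mask from the sequence length rather than by comparing IDs with the padding ID, which is one possible design choice.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad (or naively truncate) a sequence and build its attention mask."""
    ids = list(token_ids)[:max_length]  # simplistic truncation, for illustration
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

padded, mask = pad_and_mask([101, 10262, 12238, 102], max_length=6)
print(padded)  # [101, 10262, 12238, 102, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0]
```

Both lists have the same fixed length, so several sentences can be stacked into one tensor while the mask records which positions carry real tokens.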


In [0]:
# Set print format
fmt = '{:<15}{:<15}{:<15}{:<15}{:<3}'

# Input tokens to be masked (predicted)
replace_tokens = ['tempo', 'your', 'mercato', 'artificial', 'treno']

# Copy of the original sentences, to be masked (deepcopy: changes made to the copy are not reflected in the original object)
input_ids = copy.deepcopy(sentence_ids)
# Labels = original values of masked tokens
label_ids = []
# Mask for padding (0 for PAD token, 1 otherwise)
attn_masks = []

# Loop over sentences
for i, tokens in enumerate(sentence_tokens):

  # Find index of token to be replaced by MASK
  ids_replace_token = tokens.index(replace_tokens[i])

  # Replace token in original sentence with MASK
  input_ids[i][ids_replace_token] = 103

  # Store original value of masked token in labels, everything else is -100
  label_ids.append([-100]*padding_len)
  label_ids[i][ids_replace_token] = sentence_ids[i][ids_replace_token]

  # Create attention mask (or padding mask)
  attn_masks.append([0 if ids == 0 else 1 for ids in input_ids[i]])

  # Print original and new (masked) sentence along with labels and padding masks
  print('Original:', sentence_list[i])
  print('---------------------------------------------------------------------')
  print(fmt.format('ID', 'Token', 'Label ID', 'Label Token', 'Pad mask'))
  print('---------------------------------------------------------------------')
  for id, token, label_id, label_token, pad_mask in zip(input_ids[i], tokenizer.convert_ids_to_tokens(input_ids[i]), 
                                                        label_ids[i], tokenizer.convert_ids_to_tokens(label_ids[i]), attn_masks[i]):
    print(fmt.format(id, token, label_id, label_token, pad_mask))
  print()

Original: che tempo farà domani?
---------------------------------------------------------------------
ID             Token          Label ID       Label Token    Pad mask
---------------------------------------------------------------------
101            [CLS]          -100           [UNK]          1  
10262          che            -100           [UNK]          1  
103            [MASK]         12238          tempo          1  
13301          far            -100           [UNK]          1  
10816          ##à            -100           [UNK]          1  
70908          doma           -100           [UNK]          1  
10342          ##ni           -100           [UNK]          1  
136            ?              -100           [UNK]          1  
102            [SEP]          -100           [UNK]          1  
0              [PAD]          -100           [UNK]          0  
0              [PAD]          -100           [UNK]          0  
0              [PAD]          -100           [UNK]    

In [0]:
# Convert to pytorch tensors
input_ids = torch.tensor(input_ids)
label_ids = torch.tensor(label_ids)
attn_masks = torch.tensor(attn_masks)

In [0]:
# Get predicted tokens with logits
outputs = model(input_ids=input_ids, attention_mask=attn_masks)
predicted_logits = outputs[0]
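A note on the label value `-100` used above: it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions carrying that label do not contribute to the loss. The plain-Python sketch below mimics that behaviour on made-up logits; the function `masked_lm_loss` is a hypothetical helper, not the library's implementation.

```python
import math

IGNORE_INDEX = -100

def masked_lm_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    losses = []
    for position_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position is not part of the prediction task
        max_logit = max(position_logits)
        log_norm = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in position_logits))
        losses.append(log_norm - position_logits[label])
    # Returning 0.0 when nothing is labelled is a choice made for this sketch.
    return sum(losses) / len(losses) if losses else 0.0

# Three positions, a vocabulary of three tokens; only the middle one is labelled.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
labels = [-100, 2, -100]
print(masked_lm_loss(logits, labels))  # log(3), about 1.0986
```

Changing the logits of the first or last position leaves the result unchanged, because those positions are ignored.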

In [0]:
# Get top k = 5 predicted tokens for each masked token

# Set print format
fmt = '{:<15}{:<15}'

# Select k
k = 5

# Loop over sentences
for i in range(predicted_logits.shape[0]):

  # Find index of masked token within the sentence
  masked_indexes = np.where(label_ids[i].numpy() != -100)
  # Convert logits to probabilities for selected masked token
  predicted_probs = torch.nn.functional.softmax(predicted_logits[i, masked_indexes], dim=2)
  # Get top k probabilities and predicted tokens 
  predicted_topk_probs, predicted_topk_ids = torch.topk(predicted_probs, k=k, dim=2)

  # Print original sentence and masked token (ground truth) vs top k predicted tokens and their probabilities
  print('Original:', sentence_list[i])
  print('Masked token:', replace_tokens[i])
  print('---------------------------')
  print(fmt.format('Prediction', 'Probability'))
  print('---------------------------')
  for token, probability in zip(tokenizer.convert_ids_to_tokens(predicted_topk_ids.view(-1).numpy()), 
                                [round(elem, 2) for elem in predicted_topk_probs.squeeze().tolist()]):
    print(fmt.format(token, probability))
  print()

Original: che tempo farà domani?
Masked token: tempo
---------------------------
Prediction     Probability    
---------------------------
si             0.26           
che            0.13           
chi            0.1            
non            0.06           
cosa           0.04           

Original: hi, what is your name?
Masked token: your
---------------------------
Prediction     Probability    
---------------------------
the            0.8            
your           0.08           
that           0.03           
a              0.02           
my             0.01           

Original: sono stato al mercato a fare la spesa
Masked token: mercato
---------------------------
Prediction     Probability    
---------------------------
mondo          0.26           
centro         0.06           
servizio       0.06           
lavoro         0.05           
##pini         0.03           

Original: i do not trust artificial intelligence
Masked token: artificial
----------------------
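The conversion from logits to a ranked list of predictions can also be written without any library, which may help to see what `softmax` and `topk` do. The vocabulary and logit values below are invented for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, vocab, k):
    """Return the k most probable (token, probability) pairs."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

vocab = ["si", "che", "tempo", "treno"]  # hypothetical four-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]           # made-up scores for the masked position
probs = softmax(logits)
best = top_k(probs, vocab, k=2)
for token, probability in best:
    print('{:<15}{:<15}'.format(token, round(probability, 2)))
```

Softmax preserves the order of the logits, so the top-k tokens are the same before and after the conversion; only the scale changes.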

## RoBERTa Language Model (Italian)



In [0]:
from transformers import AutoModelWithLMHead

# Set Italian pre-trained model
pretrained_model = "idb-ita/gilberto-uncased-from-camembert"
#pretrained_model = "Musixmatch/umberto-commoncrawl-cased-v1"


# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
model = AutoModelWithLMHead.from_pretrained(pretrained_model)
model = model.eval()






In [0]:
# Define sentences to be processed
sentence_list = ['che tempo farà domani?',
                 'il mio cane è bello',
                 'ho chiamato il medico ma non ha risposto',
                 'il segreto di un buon caffè è la tazzina',
                 'prenderò il treno per arrivare a trieste']

# Tokenize sentences

# Set print format
fmt = '{:<8}{:<20}'
# Set padding length
padding_len = 20
# Empty lists
sentence_ids = []
sentence_tokens = []

# Loop over sentences
for sentence in sentence_list:

  # Tokenize (convert to dictionary IDs) and pad sentence adding special tokens at the beginning and at the end
  sentence_ids.append(tokenizer.encode(sentence, add_special_tokens=True, max_length=padding_len, pad_to_max_length=True))

  # Convert back IDs to string for visualization
  sentence_tokens.append(tokenizer.convert_ids_to_tokens(sentence_ids[-1]))

  # Print original sentence and its tokenized version
  print('Original:', sentence)
  print('-----------------')
  print(fmt.format('ID', 'Token'))
  print('-----------------')
  for id, token in zip(sentence_ids[-1], sentence_tokens[-1]):
    print(fmt.format(id, token))
  print()

In [0]:
# Set print format
fmt = '{:<15}{:<15}{:<15}{:<15}{:<3}'

# Input tokens to be masked (predicted)
replace_tokens = ['▁tempo', '▁il', '▁chiamato', '▁di', '▁treno']

We must check the special tokens for this tokenizer.

In [0]:
tokenizer.convert_ids_to_tokens([0,1,2,3,4,5,6,32004])

['<s>NOTUSED',
 '<pad>',
 '</s>NOTUSED',
 '<unk>',
 '<unk>',
 '<s>',
 '</s>',
 '<mask>']

In [0]:
# Copy of the original sentences, to be masked (deepcopy: changes made to the copy are not reflected in the original object)
input_ids = copy.deepcopy(sentence_ids)
# Labels = original values of masked tokens
label_ids = []
# Mask for padding (0 for PAD token, 1 otherwise)
attn_masks = []

# Loop over sentences
for i, tokens in enumerate(sentence_tokens):
  ids_replace_token = tokens.index(replace_tokens[i])
  print(ids_replace_token)
  # Replace token in original sentence with MASK (ID 32004 for this tokenizer)
  input_ids[i][ids_replace_token] = 32004
  # Store original value of masked token in labels, everything else is 3 (the <unk> ID)
  label_ids.append([3]*padding_len)
  label_ids[i][ids_replace_token] = sentence_ids[i][ids_replace_token]
  # Create padding mask (the <pad> ID is 1 for this tokenizer)
  attn_masks.append([0 if ids == 1 else 1 for ids in input_ids[i]])
  # Print original and new (masked) sentence along with labels and padding masks
  print('Original:', sentence_list[i])
  print('---------------------------------------------------------------------')
  print(fmt.format('ID', 'Token', 'Label ID', 'Label Token', 'Pad mask'))
  print('---------------------------------------------------------------------')
  for id, token, label_id, label_token, pad_mask in zip(input_ids[i], tokenizer.convert_ids_to_tokens(input_ids[i]), 
                                                        label_ids[i], tokenizer.convert_ids_to_tokens(label_ids[i]), attn_masks[i]):
    print(fmt.format(id, token, label_id, label_token, pad_mask))
  print()




2
Original: che tempo farà domani?
---------------------------------------------------------------------
ID             Token          Label ID       Label Token    Pad mask
---------------------------------------------------------------------
5              <s>            3              <unk>          1  
59             ▁che           3              <unk>          1  
32004          <mask>         490            ▁tempo         1  
4970           ▁farà          3              <unk>          1  
3814           ▁domani        3              <unk>          1  
31978          ?              3              <unk>          1  
6              </s>           3              <unk>          1  
1              <pad>          3              <unk>          0  
1              <pad>          3              <unk>          0  
1              <pad>          3              <unk>          0  
1              <pad>          3              <unk>          0  
1              <pad>          3              <unk>  

In [0]:
# Convert to pytorch tensors
input_ids = torch.tensor(input_ids)
label_ids = torch.tensor(label_ids)
attn_masks = torch.tensor(attn_masks)

# Get predicted tokens with logits
outputs = model(input_ids=input_ids, attention_mask=attn_masks)
predicted_logits = outputs[0]

In [0]:
# Get top k = 5 predicted tokens for each masked token

# Set print format
fmt = '{:<15}{:<15}'

# Select k
k = 5

# Loop over sentences
for i in range(predicted_logits.shape[0]):

  # Find index of masked token within the sentence
  masked_indexes = np.where(label_ids[i].numpy() != 3)
  # Convert logits to probabilities for selected masked token
  predicted_probs = torch.nn.functional.softmax(predicted_logits[i, masked_indexes], dim=2)
  # Get top k probabilities and predicted tokens 
  predicted_topk_probs, predicted_topk_ids = torch.topk(predicted_probs, k=k, dim=2)

  # Print original sentence and masked token (ground truth) vs top k predicted tokens and their probabilities
  print('Original:', sentence_list[i])
  print('Masked token:', replace_tokens[i])
  print('---------------------------')
  print(fmt.format('Prediction', 'Probability'))
  print('---------------------------')
  for token, probability in zip(tokenizer.convert_ids_to_tokens(predicted_topk_ids.view(-1).numpy()), 
                                [elem for elem in predicted_topk_probs.squeeze().tolist()]):
    print(fmt.format(token, probability))
  print()

Original: che tempo farà domani?
Masked token: ▁tempo
---------------------------
Prediction     Probability    
---------------------------
▁cosa          0.34816819429397583
▁si            0.3047598600387573
▁tempo         0.024781608954072
▁ci            0.024151140823960304
▁lo            0.01677447371184826

Original: il mio cane è nero
Masked token: ▁il
---------------------------
Prediction     Probability    
---------------------------
▁il            0.9495894908905029
▁perché        0.004644022323191166
▁del           0.0011792572913691401
▁-             0.0011448031291365623
▁perchè        0.001053792075254023

Original: ho chiamato il medico ma non ha risposto
Masked token: ▁chiamato
---------------------------
Prediction     Probability    
---------------------------
▁chiamato      0.8601860404014587
▁contattato    0.07490289956331253
▁chiesto       0.017966171726584435
▁consultato    0.016229720786213875
▁sentito       0.012799498625099659

Original: il segreto di un buo

## Pipelines (Transformers API)

The aim of this section is to use the Transformers `pipelines` API at the highest possible level. Without any training, we take advantage of pre-trained models to tackle a variety of downstream tasks.

In the next lecture, we will dive into a lower-level understanding, but for the moment let us rely on this high-level API. The idea is to warm up with the most popular NLP problems.

We will try the following downstream tasks:

- ***Sentence Classification _(Sentiment Analysis)_***: Indicate whether the overall sentence is positive or negative, i.e. a *binary classification task* or *logistic regression task*.
- ***Token Classification (Named Entity Recognition, Part-of-Speech tagging)***: For each sub-entity _(*token*)_ in the input, assign a label, i.e. a classification task.
- ***Question-Answering***: Given a tuple (`question`, `context`), the model should find the span of text in `context` that answers the `question`.
- ***Mask-Filling***: Suggest possible word(s) to fill the masked input, given the provided `context`.
- ***Summarization***: Summarize the ``input`` article into a shorter text.
- ***Translation***: Translate the input from one language into another.
- ***Feature Extraction***: Map the input to a high-dimensional space learned from the data.

Pipelines encapsulate the overall flow of every NLP task:
 
 1. *Tokenization*: Split the initial input into multiple sub-entities (i.e. tokens).
 2. *Inference*: Map every token into a more meaningful representation. 
 3. *Decoding*: Use the above representation to generate and/or extract the final output for the underlying task.
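The three steps can be mimicked with a toy example in plain Python. This is only a structural sketch under loud assumptions: the word lists and the functions `tokenize`, `infer`, `decode` and `toy_pipeline` are invented here, and the "model" is a word counter rather than a neural network.

```python
import math

POSITIVE_WORDS = {"lovely", "happy", "beautiful"}  # hypothetical lexicon
NEGATIVE_WORDS = {"ugly", "sad", "boring"}

def tokenize(text):
    """Step 1: split the raw string into tokens."""
    return text.lower().split()

def infer(tokens):
    """Step 2: map tokens to scores [negative, positive] (stand-in for a model)."""
    negative = sum(t in NEGATIVE_WORDS for t in tokens)
    positive = sum(t in POSITIVE_WORDS for t in tokens)
    return [float(negative), float(positive)]

def decode(logits):
    """Step 3: turn the scores into a label and a probability."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": ["NEGATIVE", "POSITIVE"][best], "score": probs[best]}]

def toy_pipeline(text):
    """Chain the three steps, like a pipeline object does internally."""
    return decode(infer(tokenize(text)))

print(toy_pipeline("It was a lovely night"))
print(toy_pipeline("He is ugly"))
```

The caller only sees a string going in and a list of labelled scores coming out; the intermediate representations stay hidden, which is the convenience a pipeline offers.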

The overall API is exposed to the end user through the `pipeline()` function, with the following structure:

```python
from transformers import pipeline

# Using custom model/tokenizer as str
pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')
```

In [0]:
from transformers import pipeline

### 1. Sentiment Analysis (English) 

Without any specification, `pipeline` sets a default model for each task. Here is a list of the [default models](https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines.py#L1459) for each task.

In [0]:
nlp_sentence_classif = pipeline('sentiment-analysis',) 
# sst = Stanford Sentiment Treebank 
#nlp_sentence_classif = pipeline('sentiment-analysis', model = 'distilbert-base-uncased-finetuned-sst-2-english', tokenizer = 'distilbert-base-uncased')

print(nlp_sentence_classif('It was a lovelly night !'))
print(nlp_sentence_classif('That film is not at all worth seing'))
print(nlp_sentence_classif('The event was pretty but it could be much better'))
print(nlp_sentence_classif('He was kind last year but now I do not trust him'))
print(nlp_sentence_classif('He is pretty ugly'))
print(nlp_sentence_classif('He is pretty ugly but when I am with him I am really happy'))



[{'label': 'POSITIVE', 'score': 0.9991254806518555}]
[{'label': 'NEGATIVE', 'score': 0.9998056292533875}]
[{'label': 'NEGATIVE', 'score': 0.9092473983764648}]
[{'label': 'NEGATIVE', 'score': 0.9994627237319946}]
[{'label': 'NEGATIVE', 'score': 0.9998031258583069}]
[{'label': 'POSITIVE', 'score': 0.9996960759162903}]


### 2. Token Classification and Named Entity Recognition (NER)

In [0]:
nlp_token_class = pipeline('ner')
nlp_token_class('Trieste, where SISSA is located, is a beautiful city in Italy, rich of science and sea')





[{'entity': 'I-LOC', 'index': 1, 'score': 0.9994376301765442, 'word': 'Tri'},
 {'entity': 'I-LOC',
  'index': 2,
  'score': 0.9989838004112244,
  'word': '##este'},
 {'entity': 'I-ORG', 'index': 5, 'score': 0.9903546571731567, 'word': 'S'},
 {'entity': 'I-ORG', 'index': 6, 'score': 0.9937306046485901, 'word': '##IS'},
 {'entity': 'I-ORG', 'index': 7, 'score': 0.9824469685554504, 'word': '##SA'},
 {'entity': 'I-LOC',
  'index': 16,
  'score': 0.9986383318901062,
  'word': 'Italy'}]
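The output above is given per sub-word token (`Tri`, `##este`, ...). A possible post-processing step, sketched below, glues the `##` continuation pieces back together. The helper `merge_subwords` is hypothetical, the scores are rounded copies of the ones printed above, and keeping the minimum score of the merged pieces is just one possible choice.

```python
def merge_subwords(entities):
    """Glue continuation pieces ('##...') back onto the previous piece."""
    merged = []
    for ent in entities:
        word = ent["word"]
        previous = merged[-1] if merged else None
        if (word.startswith("##") and previous is not None
                and ent["index"] == previous["last_index"] + 1):
            previous["word"] += word[2:]  # drop the '##' marker
            previous["last_index"] = ent["index"]
            previous["score"] = min(previous["score"], ent["score"])
        else:
            merged.append({"word": word, "entity": ent["entity"],
                           "score": ent["score"], "last_index": ent["index"]})
    return merged

entities = [
    {"entity": "I-LOC", "index": 1, "score": 0.9994, "word": "Tri"},
    {"entity": "I-LOC", "index": 2, "score": 0.9990, "word": "##este"},
    {"entity": "I-ORG", "index": 5, "score": 0.9904, "word": "S"},
    {"entity": "I-ORG", "index": 6, "score": 0.9937, "word": "##IS"},
    {"entity": "I-ORG", "index": 7, "score": 0.9824, "word": "##SA"},
    {"entity": "I-LOC", "index": 16, "score": 0.9986, "word": "Italy"},
]
words = [(e["word"], e["entity"]) for e in merge_subwords(entities)]
print(words)
# [('Trieste', 'I-LOC'), ('SISSA', 'I-ORG'), ('Italy', 'I-LOC')]
```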

### 3. Question Answering (Q&A)

In [0]:
nlp_qa = pipeline('question-answering')
nlp_qa(context='Trieste, where SISSA is located, is a beautiful city in Italy, rich of science and sea', question='Where is SISSA ?')





{'answer': 'Trieste,', 'end': 8, 'score': 0.9469309193730311, 'start': 0}

In [0]:
nlp_qa(context='I left my keys at home. I cannot find them in my bag', question='Where are my keys ?')

{'answer': 'at home.', 'end': 23, 'score': 0.30126233578139505, 'start': 15}

In [0]:
nlp_qa(context='I cannot find my keys. I left them at home', question='Where are my keys ?')

{'answer': 'at home', 'end': 41, 'score': 0.8362338183813876, 'start': 35}
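Extractive question answering models of this kind typically produce, for every token of the context, a score for being the start of the answer and a score for being its end. The sketch below shows one simple way to pick the best span from such scores; the logits are made up, `best_span` is a hypothetical helper, and the real pipeline applies further steps that are not reproduced here.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = (None, None, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

tokens = ["I", "left", "them", "at", "home"]
start_logits = [0.1, 0.0, 0.2, 3.0, 1.0]  # made-up scores
end_logits = [0.0, 0.1, 0.3, 0.5, 4.0]    # made-up scores
s, e, score = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # at home
```

Because the answer must be a contiguous span of the context, the model can only return text that literally appears in it, which is consistent with the `start` and `end` offsets in the outputs above.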

In [0]:
tokenizer = AutoTokenizer.from_pretrained("mrm8488/bert-italian-finedtuned-squadv1-it-alfa")

model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/bert-italian-finedtuned-squadv1-it-alfa")





In [0]:
nlp_qa_ita = pipeline('question-answering', model=model, tokenizer=tokenizer)
nlp_qa_ita(context='Abito da anni a Trieste e mi trovo molto bene', question='dove vivo ?')

{'answer': 'Trieste', 'end': 23, 'score': 0.9935057631761453, 'start': 16}

In [0]:
nlp_qa_ita(context='Non trovo le mie chiavi nella borsa, saranno a casa !', question='dove sono le chiavi ?')

{'answer': 'a casa', 'end': 51, 'score': 0.17909527376506418, 'start': 45}

In [0]:
nlp_qa_ita(context='Camilla è figlia di mio padre Umberto e mia madre Isabella', question='Come si chiama mia sorella ?')

{'answer': 'Camilla', 'end': 7, 'score': 0.9818619528071864, 'start': 0}

In [0]:
nlp_qa_ita(context='Mio padre Umberto e mia madre Isabella hanno una figlia di nome Camilla', question='Come si chiama mia sorella ?')

{'answer': 'Camilla', 'end': 70, 'score': 0.8663843820615824, 'start': 64}

In [0]:
nlp_qa_ita(context='Mio padre di nome Umberto e mia madre di nome Isabella hanno una figlia di nome Camilla', question='Come si chiama mia mamma ?')

{'answer': 'Isabella', 'end': 54, 'score': 0.24890632735482754, 'start': 46}

In [0]:
nlp_qa_ita(context='Udine è bella ma Trieste lo è ancora di più', question='Quale città è più bella ?')

{'answer': 'Trieste', 'end': 24, 'score': 0.7583352469427496, 'start': 17}

In [0]:
nlp_qa_ita(context='Trieste è stupenda ma anche Udine è da visitare ', question='Quale città è più bella ?')

{'answer': 'Trieste', 'end': 7, 'score': 0.6377941095610851, 'start': 0}

In [0]:
nlp_qa_ita(context='Trieste è stupenda e Udine è più tranquilla ', question='Quale città è più tranquilla?')

{'answer': 'Trieste', 'end': 7, 'score': 0.6217968349910699, 'start': 0}

In [0]:
nlp_qa_ita(context='Mio padre Umberto ha una figlia Camilla con mia madre Isabella', question='Chi è mia madre?')

{'answer': 'Camilla', 'end': 39, 'score': 0.7555628830058865, 'start': 32}

### 4. Mask-filling



In English...

In [0]:
nlp_fill = pipeline('fill-mask')
nlp_fill('Hugging Face is a' + nlp_fill.tokenizer.mask_token + ' company based in Paris' )

[{'score': 0.07349598407745361,
  'sequence': '<s> Hugging Face is a newscompany based in Paris</s>',
  'token': 340},
 {'score': 0.0586683489382267,
  'sequence': '<s> Hugging Face is a techcompany based in Paris</s>',
  'token': 2903},
 {'score': 0.05019419267773628,
  'sequence': '<s> Hugging Face is a webcompany based in Paris</s>',
  'token': 3748},
 {'score': 0.03670884668827057,
  'sequence': '<s> Hugging Face is a photographycompany based in Paris</s>',
  'token': 11075},
 {'score': 0.031769949942827225,
  'sequence': '<s> Hugging Face is a publishingcompany based in Paris</s>',
  'token': 10467}]

In Italian...

In [0]:

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
model = AutoModelWithLMHead.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

nlp_fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
nlp_fill('Dopo lavoro ci vediamo tutti per un ' + nlp_fill.tokenizer.mask_token)

[{'score': 0.17030610144138336,
  'sequence': '<s> Dopo lavoro ci vediamo tutti per un caffè</s>',
  'token': 5087},
 {'score': 0.08458418399095535,
  'sequence': '<s> Dopo lavoro ci vediamo tutti per un aperitivo</s>',
  'token': 17649},
 {'score': 0.06797470152378082,
  'sequence': '<s> Dopo lavoro ci vediamo tutti per un incontro</s>',
  'token': 2620},
 {'score': 0.04713313281536102,
  'sequence': '<s> Dopo lavoro ci vediamo tutti per un colloquio</s>',
  'token': 12639},
 {'score': 0.033142443746328354,
  'sequence': '<s> Dopo lavoro ci vediamo tutti per un pranzo</s>',
  'token': 5012}]
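
The pipeline returns a list of candidate completions, each with a probability-like `score`. Here is a minimal sketch of post-processing that list, using the output copied by hand from the cell above (so no model is needed); `clean` is a hypothetical helper that only strips the `<s>` and `</s>` markers.

```python
# Output of nlp_fill copied from the cell above.
predictions = [
    {'score': 0.17030610144138336,
     'sequence': '<s> Dopo lavoro ci vediamo tutti per un caffè</s>',
     'token': 5087},
    {'score': 0.08458418399095535,
     'sequence': '<s> Dopo lavoro ci vediamo tutti per un aperitivo</s>',
     'token': 17649},
    {'score': 0.06797470152378082,
     'sequence': '<s> Dopo lavoro ci vediamo tutti per un incontro</s>',
     'token': 2620},
    {'score': 0.04713313281536102,
     'sequence': '<s> Dopo lavoro ci vediamo tutti per un colloquio</s>',
     'token': 12639},
    {'score': 0.033142443746328354,
     'sequence': '<s> Dopo lavoro ci vediamo tutti per un pranzo</s>',
     'token': 5012},
]


def clean(sequence):
    """Hypothetical helper: drop the <s> and </s> markers around a sequence."""
    return sequence.replace('<s>', '').replace('</s>', '').strip()


# Pick the most likely completion and measure how much probability
# mass the five candidates cover together.
best = max(predictions, key=lambda p: p['score'])
best_sentence = clean(best['sequence'])
top5_mass = sum(p['score'] for p in predictions)

print(best_sentence)
print(round(top5_mass, 3))
```

The five scores sum to well below 1, which shows that the model spreads a large share of the probability over other tokens in the vocabulary.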

In [0]:
nlp_fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
nlp_fill('Prendo il ' + nlp_fill.tokenizer.mask_token + ' così andiamo al mare')

[{'score': 0.052609652280807495,
  'sequence': '<s> Prendo il mare così andiamo al mare</s>',
  'token': 1893},
 {'score': 0.020827364176511765,
  'sequence': '<s> Prendo il sole così andiamo al mare</s>',
  'token': 2580},
 {'score': 0.018294621258974075,
  'sequence': '<s> Prendo il cellulare così andiamo al mare</s>',
  'token': 5776},
 {'score': 0.016604848206043243,
  'sequence': '<s> Prendo il tempo così andiamo al mare</s>',
  'token': 506},
 {'score': 0.015503967180848122,
  'sequence': '<s> Prendo il bagno così andiamo al mare</s>',
  'token': 2600}]

### 5. Summarization

Summarization is currently supported by `Bart` and `T5`.

In [0]:
TEXT_TO_SUMMARIZE = """ 
New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. 
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. 
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other. 
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage. 
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the 
2010 marriage license application, according to court documents. 
Prosecutors said the marriages were part of an immigration scam. 
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further. 
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective 
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. 
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say. 
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages. 
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted. 
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s 
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali. 
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force. 
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

summarizer = pipeline('summarization')
summarizer(TEXT_TO_SUMMARIZE)

### 6. Translation

Translation is currently supported by `T5` for the language mappings English-to-French (`translation_en_to_fr`), English-to-German (`translation_en_to_de`) and English-to-Romanian (`translation_en_to_ro`).

In [0]:
# English to French
translator = pipeline('translation_en_to_fr')
translator("HuggingFace is a French company that is based in New York City. HuggingFace's mission is to solve NLP one commit at a time")

[{'translation_text': 'HuggingFace est une entreprise française basée à New York et dont la mission est de résoudre les problèmes de NLP, un engagement à la fois.'}]

In [0]:
# English to German
translator = pipeline('translation_en_to_de')
translator("The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.")

[{'translation_text': 'Die Geschichte der natürlichen Sprachenverarbeitung (NLP) begann im Allgemeinen in den 1950er Jahren, obwohl die Arbeit aus früheren Zeiten zu finden ist.'}]

### 7. Text Generation

Text generation is currently supported by GPT-2, OpenAI GPT, Transformer-XL, XLNet, CTRL and Reformer.

In [0]:
text_generator = pipeline("text-generation")
text_generator("Today is a beautiful day and I will")

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'Today is a beautiful day and I will celebrate my birthday!"\n\nThe mother told CNN the two had planned their meal together. After dinner, she added that she and I walked down the street and stopped at a diner near her home. "He'}]

### 8. Feature Extraction and Attention Visualization

In [5]:
nlp_features = pipeline('feature-extraction')
output = nlp_features('Hugging Face is a French company based in Paris')
np.array(output).shape   # (Samples, Tokens, Vector Size)


(1, 12, 768)
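
The pipeline returns one 768-dimensional vector per token. A common way to turn these into a single sentence vector is to average over the token axis (mean pooling). This is only one possible choice and an assumption of this sketch, not something the pipeline does for you. To keep the example self-contained, a random array with the shape printed above stands in for `np.array(output)`.

```python
import numpy as np

# Stand-in for np.array(output): random values with the shape printed above.
rng = np.random.default_rng(0)
features = rng.normal(size=(1, 12, 768))  # (samples, tokens, vector size)

# Mean pooling: average over the token axis to get one vector per sentence.
sentence_vectors = features.mean(axis=1)  # shape (1, 768)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Sanity check: a vector compared with itself has similarity 1.
self_similarity = cosine_similarity(sentence_vectors[0], sentence_vectors[0])

print(sentence_vectors.shape)
print(self_similarity)
```

With real features from two different sentences, the same `cosine_similarity` function gives a rough measure of how close their meanings are.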

Credits to the [BertViz repo](https://github.com/jessevig/bertviz) by [Jesse Vig](https://twitter.com/jesse_vig): BertViz is a tool for visualizing attention in Transformer models, and it supports most models from the Transformers library.

In [6]:
import sys
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if 'bertviz_repo' not in sys.path:
  sys.path.append('bertviz_repo')

Cloning into 'bertviz_repo'...
remote: Enumerating objects: 1074, done.
remote: Total 1074 (delta 0), reused 0 (delta 0), pack-reused 1074
Receiving objects: 100% (1074/1074), 99.41 MiB | 23.72 MiB/s, done.
Resolving deltas: 100% (687/687), done.


In [0]:
from bertviz import head_view, model_view

In [0]:
# Load require.js, d3 and jQuery into the Colab output cell so that BertViz can render.
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))
  

In [11]:
from transformers import BertModel, BertTokenizer

model_version = 'bert-base-uncased'
do_lower_case = True

model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)

#sentence_a = "Attention is important, we need to understand it necessarily"
sentence_a = "he is going to take his train to milan quite soon"


inputs = tokenizer.encode_plus(sentence_a, return_tensors='pt', add_special_tokens=False)
token_type_ids = inputs['token_type_ids']
input_ids = inputs['input_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
call_html()

head_view(attention, tokens)
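
With `output_attentions=True`, `attention` is a tuple with one tensor per layer, each of shape `(batch, heads, tokens, tokens)`; for `bert-base-uncased` that means 12 layers with 12 heads each. The sketch below mimics this structure with random NumPy arrays (an assumption made so that the example runs without downloading the model; `seq_len` is an arbitrary illustrative value) and checks two properties: every row of attention weights sums to 1, and averaging over the heads gives one token-by-token matrix per layer.

```python
import numpy as np

# Layer and head counts for bert-base; seq_len is an assumed, illustrative value.
num_layers, num_heads, seq_len = 12, 12, 11

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# Random stand-in for the model output: one array per layer,
# shaped (batch, heads, seq_len, seq_len).
fake_attention = tuple(
    softmax(rng.normal(size=(1, num_heads, seq_len, seq_len)))
    for _ in range(num_layers)
)

last_layer = fake_attention[-1][0]       # (heads, seq_len, seq_len)
row_sums = last_layer.sum(axis=-1)       # weights of each query token
head_average = last_layer.mean(axis=0)   # (seq_len, seq_len)

print(len(fake_attention), fake_attention[0].shape)
print(np.allclose(row_sums, 1.0))
print(head_average.shape)
```

`head_view` shows these weights head by head for a chosen layer, while `model_view` lays out all layers and heads in a single grid.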


In [12]:
# Redefine the helper with a newer d3 version (5.7.0 instead of 3.5.8) before calling model_view.
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

call_html()

model_view(attention, tokens)


