<a href="https://colab.research.google.com/github/embarced/notebooks/blob/master/deep/transformers-pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers: examples of basic pipeline tasks

Examples taken from https://huggingface.co/transformers/task_summary.html

In [1]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
tf.__version__

'2.6.0'

In [2]:
# inference only: since we are not training, we do not need a GPU, so nvidia-smi is expected to fail here
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [3]:
# https://huggingface.co/transformers/installation.html
!pip install -q transformers

In [4]:
import transformers
transformers.__version__

'4.9.2'

In [5]:
from transformers import pipeline

In [6]:
# show the docstring of pipeline, which lists all available tasks
pipeline?

## Sentiment Analysis

The default model is DistilBERT fine-tuned on SST-2, one of the tasks in the GLUE benchmark: https://huggingface.co/datasets/glue


In [7]:
classifier = pipeline(task="sentiment-analysis")
classifier.model.name_or_path

'distilbert-base-uncased-finetuned-sst-2-english'

In [8]:
classifier("I hate you")

[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]

In [9]:
classifier("I love you")

[{'label': 'POSITIVE', 'score': 0.9998656511306763}]
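
As the two outputs above show, the pipeline returns a list with one dict per input, each holding a `label` and a `score`. A minimal sketch of folding such a dict into one signed number, which can be easier to sort or compare; `signed_score` is a hypothetical helper and not part of transformers, and the two result dicts are copied from the outputs above.

```python
# Hypothetical helper, not part of the transformers API.
def signed_score(result):
    """Map a {'label', 'score'} dict to a value in [-1, 1]."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

# Result dicts copied from the classifier outputs above.
results = [
    {"label": "NEGATIVE", "score": 0.9991129040718079},
    {"label": "POSITIVE", "score": 0.9998656511306763},
]

signed = [signed_score(r) for r in results]
print(signed)
```

This assumes the two label names shown above; a model with other labels would need its own mapping.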

## Are two sequences paraphrases of each other?

In [10]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model-specific special tokens (e.g. [CLS] and [SEP]) to the sequence pair, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")

paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]

paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

Some layers from the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing TFBertForSequenceClassification: ['dropout_183']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at bert-base-cased-finetuned-mrpc.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


not paraphrase: 10%
is paraphrase: 90%
not paraphrase: 94%
is paraphrase: 6%
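
The cell above turns the two raw logits per sequence pair into percentages with `tf.nn.softmax`. A minimal sketch of that step in plain Python, for readers who want to see the arithmetic; the logit values are assumptions chosen for illustration, since the notebook does not print the model's raw logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["not paraphrase", "is paraphrase"]
logits = [-1.0, 1.2]  # assumed values, not taken from the model

probs = softmax(logits)
for name, p in zip(classes, probs):
    print(f"{name}: {int(round(p * 100))}%")
```

The probabilities always sum to 1, so the two printed percentages are complementary up to rounding.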


## Extractive Question Answering

https://huggingface.co/transformers/task_summary.html#extractive-question-answering

In [11]:
question_answerer = pipeline("question-answering")
question_answerer.model.name_or_path

'distilbert-base-cased-distilled-squad'

In [12]:
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = question_answerer(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
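
The `start` and `end` values in the output are character offsets into `context`, so the answer can be recovered by slicing. A small sketch that checks this without loading a model; the offsets are copied from the output above, and the leading newline of the raw string counts as character 0.

```python
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

# (start, end) pairs copied from the pipeline output above
spans = [(34, 95), (147, 160)]

# end is exclusive, as in ordinary Python slicing
answers = [context[start:end] for start, end in spans]
for answer in answers:
    print(repr(answer))
```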


# German

Unfortunately, German models and datasets are rare.

Models
* https://huggingface.co/bert-base-german-cased

Data Sets
* https://tblock.github.io/10kGNAD/

In [13]:
from transformers import TFAutoModelWithLMHead, AutoTokenizer
import tensorflow as tf

# model_name = "distilbert-base-cased"
model_name = "bert-base-german-cased"
# the next checkpoint works with PyTorch only
# model_name = "bert-base-german-dbmdz-cased"
# model_name = "bert-base-multilingual-cased"


tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelWithLMHead.from_pretrained(model_name)

# sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
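# English: "Germany is a great country. As a citizen you have the right to do [MASK] that is not against the law."
sequence = f"Deutschland ist ein tolles Land. Als Bürger hast du das Recht {tokenizer.mask_token} zu tun was nicht gegen das Gesetz ist."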

input = tokenizer.encode(sequence, return_tensors="tf")
mask_token_index = tf.where(input == tokenizer.mask_token_id)[0, 1]

token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-german-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


Deutschland ist ein tolles Land. Als Bürger hast du das Recht etwas zu tun was nicht gegen das Gesetz ist.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht [unused_punctuation1] zu tun was nicht gegen das Gesetz ist.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht alles zu tun was nicht gegen das Gesetz ist.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht nichts zu tun was nicht gegen das Gesetz ist.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht so zu tun was nicht gegen das Gesetz ist.
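
The second candidate above, `[unused_punctuation1]`, appears to be a placeholder vocabulary entry rather than a word. A minimal sketch of picking the top-k candidates while skipping such entries; the vocabulary, the logits and the helper `top_k_tokens` are all made up for illustration and do not come from the model.

```python
# Toy vocabulary and logits; all values are assumptions for illustration.
vocab = ["etwas", "[unused_punctuation1]", "alles", "nichts", "so", "Haus"]
logits = [9.1, 8.7, 8.5, 8.2, 7.9, 1.0]

def top_k_tokens(vocab, logits, k, skip_placeholders=True):
    """Return the k highest-scoring tokens, optionally skipping [unused...] entries."""
    ranked = sorted(zip(logits, vocab), key=lambda pair: pair[0], reverse=True)
    tokens = [
        tok for _, tok in ranked
        if not (skip_placeholders and tok.startswith("[unused"))
    ]
    return tokens[:k]

sequence = "Als Bürger hast du das Recht [MASK] zu tun."
for token in top_k_tokens(vocab, logits, k=3):
    print(sequence.replace("[MASK]", token))
```

With a real tokenizer the same filter could be applied to the decoded candidates after asking for a few more than five tokens, so that five remain after filtering.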


In [14]:
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

model_name = "bert-base-german-dbmdz-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

# sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
# English: "Germany is a great country. As a citizen you have the right to do [MASK]."
sequence = f"Deutschland ist ein tolles Land. Als Bürger hast du das Recht {tokenizer.mask_token} zu tun."

input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]

token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Some weights of the model checkpoint at bert-base-german-dbmdz-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Deutschland ist ein tolles Land. Als Bürger hast du das Recht nichts zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht etwas zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht das zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht was zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht , zu tun.


In [15]:
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

model_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

# sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
# English: "Germany is a great country. As a citizen you have the right to do [MASK]."
sequence = f"Deutschland ist ein tolles Land. Als Bürger hast du das Recht {tokenizer.mask_token} zu tun."

input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]

token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Deutschland ist ein tolles Land. Als Bürger hast du das Recht Deutschland zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht , zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht Deutsch zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht zu zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht Freiheit zu tun.
