
In [1]:
%%html
<div style="font-weight: bold; font-size: 36px">
    Transformers, what can they do?
</div>

Hugging Face is a community. It maintains the `transformers` Python library, which contains implementations of many models created by companies such as Google and Facebook AI. There is also a Model Hub that hosts pretrained models, and you can upload your own models to it. Models on the Hub do not have to be based on the Transformer architecture.

# Setup

In [3]:
!pip install -q transformers


# Working with pipelines

In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



[{'label': 'POSITIVE', 'score': 0.9598048329353333}]

In [5]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
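
The pipeline returns one dictionary per input sentence, in input order. As a small illustration of how that output might be consumed downstream, the sketch below folds `label` and `score` into a single signed number. The `results` list is copied from the output above; `signed_score` is a hypothetical helper and not part of `transformers`.

```python
# Results copied from the cell output above (one dict per input sentence).
results = [
    {"label": "POSITIVE", "score": 0.9598048329353333},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]


def signed_score(result):
    """Map a sentiment result to [-1, 1]: NEGATIVE scores get a minus sign."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]


signed = [signed_score(r) for r in results]
print(signed)
```

A single signed value can be easier to average or plot than a label/score pair, at the cost of hiding which class the model actually predicted.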

# Zero-shot classification

In [6]:
clf = pipeline("zero-shot-classification")
clf(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)



{'labels': ['education', 'business', 'politics'],
 'scores': [0.8445987105369568, 0.11197440326213837, 0.043426960706710815],
 'sequence': 'This is a course about the Transformers library'}

In [7]:
clf(
    "Überweisen Sie bitte zum AL35202111090000000001234567",
    candidate_labels=["IBAN", "literature"],
)

{'labels': ['IBAN', 'literature'],
 'scores': [0.6859300136566162, 0.3140700161457062],
 'sequence': 'Überweisen Sie bitte zum AL35202111090000000001234567'}

In [10]:
clf(
    "AL35202111090000000001234567",
    candidate_labels=["IBAN", "MISC", "NUMBER"],
)

{'labels': ['NUMBER', 'IBAN', 'MISC'],
 'scores': [0.681299090385437, 0.19177097082138062, 0.12692993879318237],
 'sequence': 'AL35202111090000000001234567'}

Unfortunately, the zero-shot classification pipeline does not recognize a bare IBAN reliably: without the surrounding sentence, the model ranks `NUMBER` (0.68) well above `IBAN` (0.19).
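
For strictly formatted identifiers such as IBANs, a deterministic check may be a useful complement to a classifier. The following is a minimal sketch, assuming the standard mod-97 checksum from ISO 13616; `is_valid_iban` is a hypothetical helper, and it does not verify country-specific lengths or formats.

```python
def is_valid_iban(iban):
    """Check the mod-97 checksum of an IBAN (country-specific rules are NOT checked)."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and the two check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace letters with numbers: A=10, B=11, ..., Z=35 (digits stay as they are).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A well-formed IBAN leaves a remainder of 1 when divided by 97.
    return int(digits) % 97 == 1


print(is_valid_iban("AL35202111090000000001234567"))
```

The IBAN used in the cells above passes this check, while changing one of its check digits makes it fail.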

In [20]:
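# Note: dslim/bert-base-NER-uncased is a token-classification (NER) checkpoint, not an NLI model,
# so it has no 'entailment' label and the near-uniform scores below are not meaningful.
cls_ner = pipeline("zero-shot-classification", model="dslim/bert-base-NER-uncased")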
cls_ner(
    "AL35202111090000000001234567",
    candidate_labels=["IBAN", "MISC", "NUMBER"],
)

Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


{'labels': ['IBAN', 'MISC', 'NUMBER'],
 'scores': [0.34621167182922363, 0.33154550194740295, 0.3222428262233734],
 'sequence': 'AL35202111090000000001234567'}
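
The warning about the missing `entailment` label hints at how the zero-shot pipeline works: each candidate label is turned into a hypothesis sentence, a natural language inference (NLI) model scores whether the input entails it, and the entailment scores are normalized across labels. The sketch below mimics that flow in plain Python. The template string is, to my knowledge, the pipeline's default `hypothesis_template`, but treat it as an assumption; the logits are made-up numbers standing in for real model output.

```python
import math


def build_hypotheses(candidate_labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in candidate_labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["IBAN", "MISC", "NUMBER"]
hypotheses = build_hypotheses(labels)

# HYPOTHETICAL entailment logits, one per (sequence, hypothesis) pair.
# A real NLI model would produce these; the values here are invented.
entailment_logits = [0.3, -0.1, 1.5]

scores = softmax(entailment_logits)
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
print(hypotheses)
print(ranked)
```

A checkpoint trained for NER has no entailment class to read these logits from, which is consistent with the nearly uniform scores above.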

# Text generation

In [9]:
gen = pipeline("text-generation")
gen("In this course, we will teach you how to")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to add an individual value to your business, and how to use this for all your purposes. You will learn how to use the values provided through these values in your website or mobile application. In this course'}]

Generate two sequences, each at most 15 tokens long (`max_length` counts tokens, not words, and includes the prompt):

In [11]:
gen("In this course, we will teach you how to", num_return_sequences=2, max_length=15)



[{'generated_text': 'In this course, we will teach you how to create the following web-'},
 {'generated_text': 'In this course, we will teach you how to create an image which is'}]
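
Both continuations stop after only a handful of words because `max_length` bounds the prompt and the generated tokens together. A rough sketch of the arithmetic, assuming the prompt is about 10 tokens under the GPT-2 tokenizer (an estimate that is not verified here); `new_token_budget` is a hypothetical helper.

```python
def new_token_budget(max_length, prompt_tokens):
    """Tokens the model may still generate when max_length includes the prompt."""
    return max(0, max_length - prompt_tokens)


# ASSUMPTION: "In this course, we will teach you how to" is roughly 10 tokens.
prompt_tokens = 10
budget = new_token_budget(max_length=15, prompt_tokens=prompt_tokens)
print(budget)
```

Recent versions of `transformers` also accept `max_new_tokens`, which limits only the generated part and avoids this bookkeeping.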

In [12]:
gen("Das ist üblich, dass")



[{'generated_text': 'Das ist üblich, dass das ipschöter gehölt, der haben Sie gewändigen das es ein, und die ein Wirtschaftsbuch der Er'}]

# Using any model from the Hub in a pipeline

In [13]:
gen = pipeline("text-generation", model="distilgpt2")
gen(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)




[{'generated_text': 'In this course, we will teach you how to apply different techniques to various topics across your journey.'},
 {'generated_text': 'In this course, we will teach you how to create an animated video that will allow you to watch all you talk about when you are in jail.'}]

In [15]:
gen(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)



[{'generated_text': 'In this course, we will teach you how to master the subject of how to learn to use them on your journey. This course provides a clear understanding'},
 {'generated_text': 'In this course, we will teach you how to perform the basic exercises in the Kinesiology class.\n\n\nTo learn more:'}]

# Mask filling

In [16]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)



[{'score': 0.196198508143425,
  'sequence': 'This course will teach you all about mathematical models.',
  'token': 30412,
  'token_str': ' mathematical'},
 {'score': 0.040527332574129105,
  'sequence': 'This course will teach you all about computational models.',
  'token': 38163,
  'token_str': ' computational'}]

In [18]:
unmasker_bert = pipeline("fill-mask", model="bert-base-cased")
unmasker_bert("This course will teach you all about [MASK] models.", top_k=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.2596317529678345,
  'sequence': 'This course will teach you all about role models.',
  'token': 1648,
  'token_str': 'role'},
 {'score': 0.0942726731300354,
  'sequence': 'This course will teach you all about the models.',
  'token': 1103,
  'token_str': 'the'}]
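
The two cells above use different mask tokens: `<mask>` for `distilroberta-base` and `[MASK]` for `bert-base-cased`. Passing the wrong token for a model is an easy mistake, so one option is to keep the prompt as a template and fill in the token per model. This is a minimal sketch; `with_mask` is a hypothetical helper, and the tokens are hard-coded from the cells above rather than read from a tokenizer.

```python
def with_mask(template, mask_token):
    """Insert a model-specific mask token into a template with a {mask} placeholder."""
    return template.format(mask=mask_token)


template = "This course will teach you all about {mask} models."

roberta_prompt = with_mask(template, "<mask>")  # token used with distilroberta-base above
bert_prompt = with_mask(template, "[MASK]")     # token used with bert-base-cased above

print(roberta_prompt)
print(bert_prompt)
```

In a live session the token can be read from the pipeline itself, for example `unmasker.tokenizer.mask_token`, instead of being typed by hand.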

# Named entity recognition

Named entity recognition (NER) is a task in which the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

In [19]:
# `grouped_entities=True` is deprecated; newer releases use aggregation_strategy="simple" instead.
ner = pipeline("ner", grouped_entities=True)
ner("My name is Dima and I work at Innopract in Karlsruhe")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



  "`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to"


[{'end': 15,
  'entity_group': 'PER',
  'score': 0.9987012,
  'start': 11,
  'word': 'Dima'},
 {'end': 39,
  'entity_group': 'ORG',
  'score': 0.99009633,
  'start': 30,
  'word': 'Innopract'},
 {'end': 52,
  'entity_group': 'LOC',
  'score': 0.971206,
  'start': 43,
  'word': 'Karlsruhe'}]
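
Each grouped entity carries `start` and `end` character offsets into the input string, so the matched text can be recovered by slicing. The sketch below uses the output above (scores rounded) and adds a score filter; `extract_spans` and its `min_score` threshold are hypothetical, chosen only for illustration.

```python
text = "My name is Dima and I work at Innopract in Karlsruhe"

# Entities copied from the cell output above, with scores rounded.
entities = [
    {"entity_group": "PER", "score": 0.9987, "start": 11, "end": 15, "word": "Dima"},
    {"entity_group": "ORG", "score": 0.9901, "start": 30, "end": 39, "word": "Innopract"},
    {"entity_group": "LOC", "score": 0.9712, "start": 43, "end": 52, "word": "Karlsruhe"},
]


def extract_spans(text, entities, min_score=0.9):
    """Return (surface text, entity type) pairs for sufficiently confident entities."""
    return [
        (text[e["start"]:e["end"]], e["entity_group"])
        for e in entities
        if e["score"] >= min_score
    ]


found = extract_spans(text, entities)
print(found)
```

Slicing by offsets keeps the original casing and spacing of the input, which matters for uncased models whose `word` field may not match the source text exactly.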

In [21]:
ner("This is an example of IBAN: AL35202111090000000001234567")

[{'end': 23,
  'entity_group': 'MISC',
  'score': 0.7996624,
  'start': 22,
  'word': 'I'}]

In [23]:
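# This call fails with OSError: "pos-english-fast" does not resolve to a checkpoint that `transformers` can load.
clf_pos = pipeline("ner", model="pos-english-fast")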
clf_pos("My name is Dima and I work at Innopract in Karlsruhe")

OSError: ignored