<a href="https://colab.research.google.com/github/fastdaima/huggingface-X-fastai-course/blob/main/fastai%2Bhuggingface_session_1_Transformers%2C_what_can_they_do%3F%5BOriginal%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers, what can they do?

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
!pip install -qq transformers[sentencepiece]
!pip install -qq datasets

from transformers import pipeline


## Tokenizer basics

In [3]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
print(type(tokenizer))
print(tokenizer)

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
PreTrainedTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
print(type(tokenizer))
print(tokenizer)

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
PreTrainedTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


In [5]:
inputs = tokenizer('hi my name is siva')

print(type(inputs), type(inputs.data))
print(inputs.data)

<class 'transformers.tokenization_utils_base.BatchEncoding'> <class 'dict'>
{'input_ids': [101, 20844, 1139, 1271, 1110, 27466, 2497, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
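The three fields in the output above line up position by position. As a rough illustration of what `attention_mask` is for, the sketch below pads two id lists to a common length by hand. `pad_batch` is a hypothetical helper, not part of Transformers, and the pad id of 0 is an assumption about `bert-base-cased`. In practice, passing `padding=True` to the tokenizer does this for you.

```python
# Toy sketch: pad a batch of token-id lists by hand and build the matching masks.
# `pad_batch` is a hypothetical helper; PAD_ID = 0 is assumed for bert-base-cased.
PAD_ID = 0

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one, mirroring padding_side='right'."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask, token_type_ids = [], [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Single-sentence inputs use segment 0 everywhere.
        token_type_ids.append([0] * max_len)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

batch = pad_batch([
    [101, 20844, 1139, 1271, 1110, 27466, 2497, 102],  # [CLS] hi my name is si ##va [SEP]
    [101, 20844, 102],                                   # [CLS] hi [SEP]
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sequence is filled with pad ids, and its mask drops to 0 at exactly those positions.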


In [6]:
# Show the first six (token, id) pairs of the vocabulary.
for idx, (tok, tok_id) in enumerate(tokenizer.vocab.items()):
    print(f'{tok}: {tok_id}')
    if idx > 4:
        break

shakes: 11075
Jean: 2893
Chi: 11318
Black: 2117
suites: 26683
stands: 4061


In [7]:
print(tokenizer.decode(inputs.input_ids))
print(tokenizer.convert_ids_to_tokens(inputs.input_ids))

[CLS] hi my name is siva [SEP]
['[CLS]', 'hi', 'my', 'name', 'is', 'si', '##va', '[SEP]']
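The split of `siva` into `si` and `##va` comes from WordPiece, the subword scheme BERT tokenizers use: each word is matched greedily against the vocabulary, longest piece first, and continuation pieces carry a `##` prefix. The sketch below is a simplified illustration of that idea, not the library's implementation. `TOY_VOCAB` and `wordpiece` are hypothetical names, and the vocabulary is made up to cover this one sentence.

```python
# Toy sketch of WordPiece-style greedy longest-match-first splitting.
# TOY_VOCAB is hypothetical and tiny; the real bert-base-cased vocabulary has 28,996 entries.
TOY_VOCAB = {"hi", "my", "name", "is", "si", "##va", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

tokens = [piece for word in "hi my name is siva".split() for piece in wordpiece(word)]
print(tokens)
```

Words that are in the vocabulary come back whole, while `siva` falls back to the two pieces seen in the real tokenizer output.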


## Sentiment

In [8]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]

In [9]:
classifier([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!"
])

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

In [10]:
classifier.model.name_or_path, classifier.modelcard

('distilbert-base-uncased-finetuned-sst-2-english', None)
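The `score` values above are probabilities. As a hedged sketch of the final step such a classifier typically performs, the code below applies a softmax to raw logits and reports the most likely label. The logits and the `ID2LABEL` mapping are invented for illustration, and `postprocess` is a hypothetical helper rather than a pipeline method.

```python
import math

# Toy sketch: turn raw classifier logits into a {'label', 'score'} dict.
# The logits and the id-to-label mapping below are invented for illustration.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label=ID2LABEL):
    """Pick the most probable class and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

result = postprocess([-1.5, 1.7])  # hypothetical logits for one sentence
print(result)
```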

You can also run sentiment analysis with a user-contributed model [from the model hub](https://huggingface.co/oliverguhr/german-sentiment-bert?text=Das+it+gut). Click the "Use in Transformers" button on the model page to get the snippet that loads it. The page also offers options for running the model through the accelerated API or in SageMaker, and it hosts an inference API widget so you can test the model in the browser before downloading it.

In [11]:
classifier = pipeline("sentiment-analysis", model='oliverguhr/german-sentiment-bert')
print(f'Results: {classifier("Das is gut!")}')
print(classifier.model.name_or_path, classifier.modelcard)

Results: [{'label': 'positive', 'score': 0.9705124497413635}]
oliverguhr/german-sentiment-bert None


## Zero-shot

In [12]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'labels': ['education', 'business', 'politics'],
 'scores': [0.844597339630127, 0.11197540909051895, 0.043427303433418274],
 'sequence': 'This is a course about the Transformers library'}

In [13]:
classifier.model.name_or_path, classifier.modelcard

('facebook/bart-large-mnli', None)
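The default checkpoint here is an NLI model, which hints at how zero-shot classification works: each candidate label is turned into a hypothesis sentence, the model scores how strongly the input entails it, and the entailment scores are normalized across labels. The sketch below only mimics that flow. `fake_entailment_logit` is a hypothetical stand-in that returns invented numbers, and the template string is an assumption.

```python
import math

# Toy sketch of NLI-based zero-shot classification.
# `fake_entailment_logit` stands in for a real NLI model and returns invented numbers.
TEMPLATE = "This example is {}."  # assumed hypothesis template

def fake_entailment_logit(premise, hypothesis):
    """Hypothetical stand-in: a real model would score whether premise entails hypothesis."""
    invented = {"education": 2.0, "business": 0.0, "politics": -1.0}
    return next(v for k, v in invented.items() if k in hypothesis)

def zero_shot(sequence, candidate_labels):
    logits = [fake_entailment_logit(sequence, TEMPLATE.format(lbl)) for lbl in candidate_labels]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Sort labels by descending score, as in the pipeline output above.
    ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "sequence": sequence,
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }

out = zero_shot(
    "This is a course about the Transformers library",
    ["education", "politics", "business"],
)
print(out["labels"])
```

Because the scores come from a softmax across labels, they sum to 1, which matches the real output above.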

## Text generation

In [14]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to program and develop an excellent online business management professional. We will also help you understand how to implement, enforce and monitor online marketing rules and online advertising policies. In this course, you will learn how to'}]

You can control how many sequences are sampled with the `num_return_sequences` argument, and cap the total length of each output (prompt included) with `max_length`.

For more on text generation parameters, see this article:
https://huggingface.co/blog/how-to-generate

In [15]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create and manage a network across the network. It would be easy to put in "You Can Build'},
 {'generated_text': 'In this course, we will teach you how to make your own tools.\n\n\nBefore you start on the course, it should be easy to'}]
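To make the roles of `max_length` and `num_return_sequences` concrete, here is a toy sampling loop. It is only a sketch: `NEXT_WORDS` is an invented next-word table rather than a language model, `generate` is a hypothetical function, and it counts whitespace-separated words where a real model counts tokens.

```python
import random

# Toy sketch of sampled generation with `max_length` and `num_return_sequences`.
# NEXT_WORDS is an invented next-word table, not a real language model.
NEXT_WORDS = {
    "to": ["build", "train", "use"],
    "build": ["models", "tools"],
    "train": ["models"],
    "use": ["models", "tools"],
    "models": ["and", "<eos>"],
    "tools": ["and", "<eos>"],
    "and": ["build", "train", "use"],
}

def generate(prompt, max_length, num_return_sequences, seed=0):
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    outputs = []
    for _ in range(num_return_sequences):
        words = prompt.split()
        # max_length counts the prompt words too, so a long prompt leaves less room.
        while len(words) < max_length:
            nxt = rng.choice(NEXT_WORDS.get(words[-1], ["<eos>"]))
            if nxt == "<eos>":
                break
            words.append(nxt)
        outputs.append({"generated_text": " ".join(words)})
    return outputs

samples = generate("we will teach you how to", max_length=12, num_return_sequences=2)
for sample in samples:
    print(sample["generated_text"])
```

Each returned sequence is an independent sample that starts from the same prompt and stops at an end marker or at the length cap, whichever comes first.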

## Language modeling - MLM

In [23]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=5)

[{'score': 0.19619838893413544,
  'sequence': 'This course will teach you all about mathematical models.',
  'token': 30412,
  'token_str': ' mathematical'},
 {'score': 0.040527306497097015,
  'sequence': 'This course will teach you all about computational models.',
  'token': 38163,
  'token_str': ' computational'},
 {'score': 0.033017951995134354,
  'sequence': 'This course will teach you all about predictive models.',
  'token': 27930,
  'token_str': ' predictive'},
 {'score': 0.031941480934619904,
  'sequence': 'This course will teach you all about building models.',
  'token': 745,
  'token_str': ' building'},
 {'score': 0.024522798135876656,
  'sequence': 'This course will teach you all about computer models.',
  'token': 3034,
  'token_str': ' computer'}]
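The `top_k=5` argument keeps the five most probable fillers for the masked position. The mask token depends on the checkpoint (`<mask>` here, `[MASK]` for the BERT tokenizer shown earlier). The sketch below illustrates only the selection step. The candidate probabilities are invented, and `fill_mask` is a hypothetical helper.

```python
import heapq

# Toy sketch of what `top_k` does: keep the k most probable fillers for the mask.
# The candidate probabilities are invented, not taken from a model.
CANDIDATES = {
    "mathematical": 0.20,
    "computational": 0.04,
    "predictive": 0.033,
    "building": 0.032,
    "computer": 0.025,
    "language": 0.01,
}

def fill_mask(template, candidates, top_k=5, mask="<mask>"):
    """Return the top_k candidates, highest probability first, with the filled sentence."""
    best = heapq.nlargest(top_k, candidates.items(), key=lambda kv: kv[1])
    return [
        {"score": prob, "token_str": word, "sequence": template.replace(mask, word)}
        for word, prob in best
    ]

preds = fill_mask("This course will teach you all about <mask> models.", CANDIDATES, top_k=3)
for pred in preds:
    print(pred["sequence"])
```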

## Token classification (e.g., NER)

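What does `grouped_entities` do? With `grouped_entities=True`, the pipeline merges the word pieces that belong to the same entity into a single result (`Sylvain`, `Hugging Face`). With `grouped_entities=False`, it returns one prediction per token (`S`, `##yl`, `##va`, ...). The argument is deprecated in favor of `aggregation_strategy`, as the warning in the output notes.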

In [17]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


[{'end': 18,
  'entity_group': 'PER',
  'score': 0.9981694,
  'start': 11,
  'word': 'Sylvain'},
 {'end': 45,
  'entity_group': 'ORG',
  'score': 0.97960204,
  'start': 33,
  'word': 'Hugging Face'},
 {'end': 57,
  'entity_group': 'LOC',
  'score': 0.99321055,
  'start': 49,
  'word': 'Brooklyn'}]

In [18]:
ner = pipeline("ner", grouped_entities=False)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


[{'end': 12,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.9993828,
  'start': 11,
  'word': 'S'},
 {'end': 14,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.99815476,
  'start': 12,
  'word': '##yl'},
 {'end': 16,
  'entity': 'I-PER',
  'index': 6,
  'score': 0.99590725,
  'start': 14,
  'word': '##va'},
 {'end': 18,
  'entity': 'I-PER',
  'index': 7,
  'score': 0.9992327,
  'start': 16,
  'word': '##in'},
 {'end': 35,
  'entity': 'I-ORG',
  'index': 12,
  'score': 0.97389334,
  'start': 33,
  'word': 'Hu'},
 {'end': 40,
  'entity': 'I-ORG',
  'index': 13,
  'score': 0.976115,
  'start': 35,
  'word': '##gging'},
 {'end': 45,
  'entity': 'I-ORG',
  'index': 14,
  'score': 0.98879766,
  'start': 41,
  'word': 'Face'},
 {'end': 57,
  'entity': 'I-LOC',
  'index': 16,
  'score': 0.99321055,
  'start': 49,
  'word': 'Brooklyn'}]
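The per-token output above can be merged back into the grouped output shown earlier. The sketch below does this by hand, under two assumptions that happen to fit this example: tokens with consecutive `index` values and the same entity type belong to one entity, and the group score is the mean of the token scores. `group_entities` is a hypothetical helper, not the pipeline's own aggregation code.

```python
# Toy sketch: merge per-token NER predictions into entity groups.
# Assumption: tokens with consecutive `index` and the same entity type belong together,
# and the group score is the mean of the token scores.
TEXT = "My name is Sylvain and I work at Hugging Face in Brooklyn."
TOKENS = [
    {"entity": "I-PER", "index": 4, "score": 0.9993828, "start": 11, "end": 12},
    {"entity": "I-PER", "index": 5, "score": 0.99815476, "start": 12, "end": 14},
    {"entity": "I-PER", "index": 6, "score": 0.99590725, "start": 14, "end": 16},
    {"entity": "I-PER", "index": 7, "score": 0.9992327, "start": 16, "end": 18},
    {"entity": "I-ORG", "index": 12, "score": 0.97389334, "start": 33, "end": 35},
    {"entity": "I-ORG", "index": 13, "score": 0.976115, "start": 35, "end": 40},
    {"entity": "I-ORG", "index": 14, "score": 0.98879766, "start": 41, "end": 45},
    {"entity": "I-LOC", "index": 16, "score": 0.99321055, "start": 49, "end": 57},
]

def group_entities(tokens, text):
    runs = []
    for tok in tokens:
        kind = tok["entity"].split("-")[-1]  # drop the B-/I- prefix
        if runs and runs[-1]["kind"] == kind and runs[-1]["last_index"] + 1 == tok["index"]:
            run = runs[-1]  # extend the current entity
            run["scores"].append(tok["score"])
            run["end"] = tok["end"]
            run["last_index"] = tok["index"]
        else:
            runs.append({"kind": kind, "scores": [tok["score"]], "start": tok["start"],
                         "end": tok["end"], "last_index": tok["index"]})
    return [
        {"entity_group": r["kind"], "score": sum(r["scores"]) / len(r["scores"]),
         "word": text[r["start"]:r["end"]], "start": r["start"], "end": r["end"]}
        for r in runs
    ]

groups = group_entities(TOKENS, TEXT)
for g in groups:
    print(g["entity_group"], g["word"], round(g["score"], 4))
```

Slicing the original text by character offsets recovers `Hugging Face` with its space intact, which joining the `##` pieces alone would not guarantee.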

## Question answering

In [19]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn"
)

{'answer': 'Hugging Face', 'end': 45, 'score': 0.6949757933616638, 'start': 33}
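The answer above is a span copied out of the context, identified by `start` and `end`. Extractive QA models typically produce a start score and an end score per token, and decoding picks the span whose combined score is highest. The sketch below illustrates that idea with invented tokens and logits. `best_span` and its `max_answer_len` default are hypothetical and simplified.

```python
import math

# Toy sketch of extractive QA decoding: pick the best (start, end) token span.
# The tokens and logits are invented; a real model emits one start and one end logit per token.
TOKENS = ["I", "work", "at", "Hugging", "Face", "in", "Brooklyn"]
START_LOGITS = [0.1, 0.2, 0.3, 4.0, 1.0, 0.2, 2.5]
END_LOGITS = [0.1, 0.1, 0.2, 1.5, 4.2, 0.1, 2.8]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=4):
    """Return (score, start, end) maximizing P(start) * P(end) over valid spans."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best = (0.0, 0, 0)
    for i, ps in enumerate(p_start):
        # Only consider spans that end at or after their start and stay short.
        for j in range(i, min(i + max_answer_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[0]:
                best = (score, i, j)
    return best

score, i, j = best_span(START_LOGITS, END_LOGITS)
answer = " ".join(TOKENS[i:j + 1])
print({"answer": answer, "score": round(score, 3)})
```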

## Summarization

In [20]:
summarizer = pipeline("summarization")
summarizer("""
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""")

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

## Translation

In [21]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]