In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Collecting huggingface-hub<1.0,>=0.19.3 (from transformers)

In [2]:
!pip install transformers[sentencepiece]

Collecting sentencepiece!=0.1.92,>=0.1.91 (from transformers[sentencepiece])
  Downloading sentencepiece-0.1.99-cp311-cp311-win_amd64.whl (977 kB)

In [3]:
import transformers



# **Transformers**



### Working with pipelinesPipeline function returns an end-to-end object that performs an NLP task on one or several texts.


In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier([
    "I don't like apples!",
    "I hate this so much!"
])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'NEGATIVE', 'score': 0.9972789883613586},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

- This pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English.
- The model is downloaded and cached when you create the classifier object.

Some of the currently available pipelines are:

- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification






### Zero-shot classification
- Classify texts that haven’t been labelled.
- For this use case, the zero-shot-classification pipeline is very powerful.
  - It allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model.


In [6]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445989489555359, 0.11197426170110703, 0.04342678561806679]}

### Text Generation
- The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text.


In [8]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("WE will learn about Nepal")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'WE will learn about Nepal\'s current and future economic situation and be able to bring us together to help build a sustainable future for Nepal, our neighbors and those we serve on the continent," he said.\n\n"We will work to implement a comprehensive'}]

### Using any model from the Hub in a pipeline

Let’s try the distilgpt2 model! Here’s how to load it in the same pipeline as before:


In [9]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)


[{'generated_text': 'In this course, we will teach you how to use the tools, and how they could allow you to do both. We have only published a few'},
 {'generated_text': 'In this course, we will teach you how to set up a self-help website and start implementing the best practices. The goal, however, is'}]

#### The Inference API
All the models can be tested directly in your browser using the Inference API, which is available on the Hugging Face website.

### Mask filling

The idea of this task is to fill in the blanks in a given text:


In [10]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'score': 0.1961977779865265,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052717983722687,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

### Named Entity Recognition

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

In [11]:
from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

### Question answering
The question-answering pipeline answers questions using information from a given context.

Note that this pipeline works by extracting information from the provided context; it does not generate the answer.





In [12]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Aavash and I work in Fusemachines",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9582499861717224, 'start': 32, 'end': 44, 'answer': 'Fusemachines'}

### Summarization

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text.

In [14]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'summary_text': ' The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as the U.S. does . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .'}]

### Translation

In [15]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")


[{'translation_text': 'This course is produced by Hugging Face.'}]

### History
![History of transformers](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_chrono.svg)


The Transformer architecture was introduced in June 2017. The focus of the original research was on translation tasks. This was followed by the introduction of several influential models.

Broadly, they can be grouped into three categories:

- GPT-like (also called auto-regressive Transformer models)
- BERT-like (also called auto-encoding Transformer models)
- BART/T5-like (also called sequence-to-sequence Transformer models)

### Transformers are language models

All the Transformer models have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model.

This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks.

Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.

An example of such a task is **predicting the next word** in a sentence after reading the previous words. This is called causal language modeling, because the output depends on past and present inputs but not on future ones.

![Causal language modeling](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/causal_modeling.svg)

Another example is masked language modeling, in which the model predicts a masked word in the sentence.

![Masked Language Modeling](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/masked_modeling.svg)
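
As a rough illustration of how both objectives take their labels from the raw text itself, the sketch below builds training pairs from a whitespace-split sentence. The helper names `causal_examples` and `masked_example`, and the `[MASK]` placeholder, are assumptions made for this example and are not part of the Transformers API; real models operate on token IDs rather than words.

```python
def causal_examples(words):
    """Pair every prefix of the sentence with the word that follows it."""
    return [(words[:i], words[i]) for i in range(1, len(words))]


def masked_example(words, position, mask_token="[MASK]"):
    """Hide one word; the hidden word becomes the label to predict."""
    masked = list(words)
    label = masked[position]
    masked[position] = mask_token
    return masked, label


words = "This is a course".split()

# Causal language modeling: only the words to the left are visible.
for context, target in causal_examples(words):
    print(context, "->", target)

# Masked language modeling: words on both sides of the gap are visible.
print(masked_example(words, position=2))
```

In both cases no human annotation is needed: the target is simply a word taken from the input text.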




Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is by increasing the models’ sizes as well as the amount of data they are pretrained on.

![Transformers are big](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/model_parameters.png)


### Transfer Learning

***Pretraining*** is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.

***Fine-tuning***, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task.

### General Architecture

The model is primarily composed of two blocks:

- Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
- Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

![Architecture](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_blocks.svg)
    

Each of these parts can be used independently, depending on the task:

- **Encoder-only models**: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
- **Decoder-only models**: Good for generative tasks such as text generation.
- **Encoder-decoder models** or **sequence-to-sequence models**: Good for generative tasks that require an input, such as translation or summarization.



### Attention layers
This layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.
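
One common formulation of this idea is scaled dot-product attention, sketched below in plain Python. This is a simplified sketch under stated assumptions: the word vectors are made-up numbers, and the same vectors are used as queries, keys, and values, whereas a real attention layer first applies learned projections to obtain each of them.

```python
import math


def softmax(values):
    """Turn raw scores into weights that sum to 1."""
    peak = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # How strongly this word should attend to every word in the sentence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Three toy 2-dimensional word vectors (made-up numbers).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = attention(vectors, vectors, vectors)
print(contextual)
```

Each output vector is a blend of all the input vectors, with larger weights on the words whose vectors are most similar to the word being represented.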

### The original Architecture

- The Transformer architecture was originally designed for translation.
- During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language.
- In the encoder, the attention layers can use all the words in a sentence.
- The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated.
- To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words.

![Transformer Architecture](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers.svg)

The *attention mask* can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.
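
Both restrictions, no access to future words in the decoder and no attention to padding, can be pictured as a grid of 1s (allowed) and 0s (blocked). The sketch below is only an illustration with hypothetical helper names (`causal_mask`, `combine`); actual implementations build such masks internally and typically apply them to the attention scores before the softmax.

```python
def causal_mask(length):
    """Row i marks the positions that position i may attend to (1) or not (0)."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]


def combine(causal, attention_mask):
    """Also hide padding positions, which are marked 0 in attention_mask."""
    return [[c * keep for c, keep in zip(row, attention_mask)] for row in causal]


# Four positions; the last one is padding.
attention_mask = [1, 1, 1, 0]
for row in combine(causal_mask(4), attention_mask):
    print(row)
```

The first row shows that the first position can only see itself, and the last column is all zeros because no position may attend to the padding token.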



### Architectures vs. checkpoints

- **Architecture**: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
- **Checkpoints**: These are the weights that will be loaded in a given architecture.
- **Model**: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.

For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”




While transformer models are powerful tools, they come with limitations. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet.

# **Using Transformers**

## Introduction
The library’s main features are:

- Ease of use: Downloading, loading, and using a state-of-the-art NLP model for inference can be done in just two lines of code.
- Flexibility: At their core, all models are simple PyTorch nn.Module or TensorFlow tf.keras.Model classes and can be handled like any other models in their respective machine learning (ML) frameworks.
- Simplicity: Hardly any abstractions are made across the library. The “All in one file” is a core concept: a model’s forward pass is entirely defined in a single file, so that the code itself is understandable and hackable.


## Behind the pipeline

The pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing:

![Pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)



### Preprocessing with a tokenizer

The first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which is responsible for:
  - Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
  - Mapping each token to an integer
  - Adding additional inputs that may be useful to the model


- All this preprocessing needs to be done in exactly the same way as when the model was pretrained.
- We use the AutoTokenizer class and its `from_pretrained()` method.
- Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it.
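
To make the three responsibilities concrete, here is a toy sketch. The vocabulary, the IDs, and the functions `toy_tokenize` and `toy_encode` are all invented for illustration; a real tokenizer loads its vocabulary and splitting rules from the checkpoint and is considerably more elaborate.

```python
# Hypothetical toy vocabulary; real tokenizers ship one with the checkpoint.
vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9,
}


def toy_tokenize(text):
    """Split lowercased text into words, treating '!' as its own token."""
    return text.lower().replace("!", " !").split()


def toy_encode(text):
    # Add the special tokens that mark the start and end of the sequence.
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Map each token to an integer, falling back to the unknown token.
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    # Extra input for the model: 1 marks a real token (no padding here).
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


print(toy_encode("I hate this so much!"))
```

The result has the same structure as the dictionary returned by the real tokenizer in the cells that follow, only with made-up IDs.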

In [22]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [23]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


### Going through the model
Transformers provides an `AutoModel` class which also has a `from_pretrained()` method:



In [24]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In this code snippet, we have downloaded the same checkpoint we used in our pipeline before (it should actually have been cached already) and instantiated a model with it.

This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call hidden states, also known as features.

For each model input, we’ll retrieve a high-dimensional vector representing the **contextual understanding of that input by the Transformer model**.

The hidden states are usually inputs to another part of the model, known as the head. Different tasks can be performed with the same architecture, but each task has a different head associated with it.





#### A high-dimensional vector?
The vector output by the Transformer module is usually large. It generally has three dimensions:

- Batch size: The number of sequences processed at a time (2 in our example).
- Sequence length: The length of the numerical representation of the sequence (16 in our example).
- Hidden size: The vector dimension of each model input.

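The hidden size can be very large: 768 is common for smaller models (as in the output below), and in larger models it can reach 3072 or more.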

In [25]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


#### Model Heads
- Model heads take the high-dimensional vector of hidden states as input and project it onto a different dimension.
- The output of the Transformer model is sent directly to the model head to be processed.
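
A classification head can be pictured as a linear projection from the hidden size down to one score per label. The sketch below uses toy sizes and made-up weights purely for illustration; the head attached to a given checkpoint may contain additional layers, and `linear_head` is a hypothetical name, not a library function.

```python
def linear_head(hidden_vector, weights, bias):
    """Project one hidden vector to one score (logit) per label."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]


# Toy sizes: hidden size 4 instead of 768, and 2 labels. All numbers are made up.
hidden_vector = [0.5, -1.0, 0.25, 2.0]
weights = [
    [1.0, 0.0, 0.0, 0.5],  # row producing the score for label 0
    [0.0, 1.0, 2.0, 0.0],  # row producing the score for label 1
]
bias = [0.0, 0.5]

logits = linear_head(hidden_vector, weights, bias)
print(logits)
```

The input has four values and the output has two, which mirrors the drop from a 768-dimensional hidden state to two logits in the sequence classification example.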

There are many different architectures available in Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list:

- `*Model` (retrieve the hidden states)
- `*ForCausalLM`
- `*ForMaskedLM`
- `*ForMultipleChoice`
- `*ForQuestionAnswering`
- `*ForSequenceClassification`
- `*ForTokenClassification`
- and others



In [26]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

If we look at our outputs, the dimensionality is much lower: the model head takes the high-dimensional vectors as input and outputs vectors containing two values, one per label.

In [27]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [28]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [29]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
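
The values printed by the model are logits (raw, unnormalized scores), not probabilities; softmax converts them, and `id2label` names the winning index. As a hedged cross-check, the sketch below redoes that postprocessing in plain Python, using the logits and label mapping copied from the outputs above. The `softmax` helper is written here for illustration and is not the PyTorch function.

```python
import math

# Logits and label mapping copied from the outputs shown above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Exponentiate and normalize so the values sum to 1."""
    peak = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


for row in logits:
    probabilities = softmax(row)
    # Index of the highest probability, then its human-readable label.
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    print(id2label[best], round(probabilities[best], 4))
```

The first sentence comes out as POSITIVE and the second as NEGATIVE, in line with the probabilities in the tensor printed above.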

## Models
### Creating a Transformer

The first thing we need to do to initialize a BERT model is load a configuration object:



In [31]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

In [32]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



#### Different loading methods

Creating a model from the default configuration initializes it with random values:



In [33]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

The model can be used in this state, but its outputs will be meaningless until it is trained. Loading a checkpoint that has already been trained is done with the `from_pretrained()` method:

In [34]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")



In [35]:
model.save_pretrained("/content/models")

### Using a Transformer model for inference
Before we discuss tokenizers, let’s explore what inputs the model accepts.
Let’s say we have a couple of sequences:





In [38]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [39]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

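This is a list of encoded sequences (the input IDs produced by the tokenizer): a list of lists. Tensors only accept rectangular shapes (think matrices). This list is already rectangular, since every sequence contains four IDs, so it can be converted to a tensor directly.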
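
When the sequences in a batch have different lengths, they first have to be padded to a common length. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical helper, the second sequence is shortened on purpose, and the padding ID of 0 is taken from the `pad_token_id` shown in the BERT configuration above. In practice, calling the tokenizer with `padding=True` (as in the earlier preprocessing cell) handles this.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# The second sequence is shortened on purpose to make the batch ragged.
ragged = [[101, 7592, 999, 102], [101, 4658, 102]]
input_ids, attention_mask = pad_batch(ragged)
print(input_ids)
print(attention_mask)
```

The attention mask records which positions hold real tokens, so the model can ignore the padding.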



#### Using the tensors as inputs to the model

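Once the encoded sequences have been converted to a tensor (for example with `torch.tensor(encoded_sequences)`), we just call the model with the inputs:

`output = model(model_inputs)`

While the model accepts a lot of different arguments, only the input IDs are necessary.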


## Tokenizers
- Translate text into data that can be processed by the model.
- Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.

### Word-based
The first type of tokenizer that comes to mind is word-based. As shown in the image below, the goal is to split the raw text into words and find a numerical representation for each of them:

![Word-based](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/word_based_tokenization.svg)

There are different ways to split the text. For example, we could use whitespace to tokenize the text into words by applying Python’s `split()` method:

In [40]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


- There are also variations of word tokenizers that have extra rules for punctuation.
- Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.

- If we want to completely cover a language with a word-based tokenizer, we’ll need to have an identifier for each word in the language, which will generate a huge number of tokens.

- Furthermore, words like “dog” are represented differently from words like “dogs”, and the model will initially have no way of knowing that “dog” and “dogs” are similar.

- Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the “unknown” token, often represented as `[UNK]` or `<unk>`. It is generally a bad sign when a lot of these tokens are produced, because it means the tokenizer wasn’t able to retrieve a sensible representation of a word.
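
The ID assignment and the unknown-token fallback can be sketched in a few lines. The functions `build_vocab` and `encode` are hypothetical helpers written for this example, and the one-sentence corpus is far smaller than anything used in practice.

```python
def build_vocab(corpus, unk_token="[UNK]"):
    """Give every distinct whitespace-separated word an ID; 0 is reserved for unknowns."""
    vocab = {unk_token: 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, unk_token="[UNK]"):
    """Map each word to its ID, falling back to the unknown token."""
    return [vocab.get(word, vocab[unk_token]) for word in text.split()]


vocab = build_vocab(["Jim Henson was a puppeteer"])
print(vocab)

# "dog" never appeared in the corpus, so it maps to the unknown token.
print(encode("Jim was a dog", vocab))
```

Every word outside the corpus collapses onto the same ID, which is exactly the information loss described above.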

One way to reduce the number of unknown tokens is to go one level deeper, using a character-based tokenizer.


### Character-Based

Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:

- The vocabulary is much smaller.
- There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

![Character-based](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/character_based_tokenization.svg)



- Since the representation is now based on characters rather than words, one could argue that, intuitively, it’s less meaningful: each character doesn’t mean a lot on its own.
- However, this again differs according to the language: in Chinese, for example, each character carries more information than a character in a Latin language.

Another thing to consider is that we’ll end up with a very large number of tokens to be processed by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters.
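
As a rough sketch of this trade-off, the snippet below encodes the earlier example sentence character by character. The vocabulary of 26 lowercase ASCII letters and the `char_encode` helper are assumptions for illustration; a real character-level tokenizer would also cover digits, punctuation, casing, and other scripts.

```python
import string

# Assumed toy vocabulary: lowercase ASCII letters only.
char_vocab = {ch: idx for idx, ch in enumerate(string.ascii_lowercase)}


def char_encode(word):
    """Turn one word into a list of character IDs (lowercased for simplicity)."""
    return [char_vocab[ch] for ch in word.lower()]


text = "Jim Henson was a puppeteer"
word_tokens = text.split()
char_ids = [char_encode(word) for word in word_tokens]
tokens_per_word = [len(ids) for ids in char_ids]

print(len(char_vocab))       # 26 entries cover every word written with these letters
print(len(word_tokens))      # 5 tokens with a word-based split
print(tokens_per_word)       # [3, 6, 3, 1, 9]
print(sum(tokens_per_word))  # 22 tokens with a character-based split
print(char_encode("dogs"))   # an unseen word still encodes without an unknown token
```

The vocabulary stays tiny and nothing is unknown, but the same sentence costs 22 tokens instead of 5.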

To get the best of both worlds, we can use a third technique that combines the two approaches: subword tokenization.



### Subword Tokenization
Frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

- “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”.
- Both pieces are likely to appear more frequently as standalone subwords, while the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.
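
One way to picture this decomposition is a greedy longest-match split over a fixed vocabulary, in the style of WordPiece inference. The sketch below is a simplification under stated assumptions: the vocabulary is hypothetical and hand-written, `subword_tokenize` is an illustrative helper rather than a library function, and real tokenizers learn their vocabulary from data.

```python
# Hypothetical toy vocabulary. "##" marks a piece that continues a word,
# the convention visible in BERT-style tokenizer output.
vocab = {"annoying", "##ly", "token", "##ization", "do", "[UNK]"}


def subword_tokenize(word):
    """Split one word greedily, always taking the longest piece found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it matches a vocabulary entry.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: give up on the whole word.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


print(subword_tokenize("annoyingly"))    # ['annoying', '##ly']
print(subword_tokenize("tokenization"))  # ['token', '##ization']
print(subword_tokenize("do"))            # ['do']
print(subword_tokenize("xyz"))           # ['[UNK]']
```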




Here is an example showing how a subword tokenization algorithm would tokenize the sequence “Let’s do tokenization!”:

![Subword](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/bpe_subword.svg)

This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.


### Loading and saving

Loading and saving tokenizers is as simple as it is with models. It’s based on the same two methods: `from_pretrained()` and `save_pretrained()`.

These methods will load or save the algorithm used by the tokenizer as well as its vocabulary.

In [42]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")



In [43]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [44]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [45]:
tokenizer.save_pretrained("tokenizer2")

('tokenizer2\\tokenizer_config.json',
 'tokenizer2\\special_tokens_map.json',
 'tokenizer2\\vocab.txt',
 'tokenizer2\\added_tokens.json',
 'tokenizer2\\tokenizer.json')

### Encoding
- Translating text to numbers is known as encoding.
- Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.
- The first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens.
- The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model.
- To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the `from_pretrained()` method.


In [46]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


In [47]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


### Decoding

Decoding is going the other way around:
- From vocabulary indices, we want to get a string.
- This can be done with the `decode()` method.

In [48]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple


The `decode()` method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence.

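
The grouping step can be sketched in plain Python. This is only an illustration: `toy_decode` is a hypothetical helper, the ID-to-token table contains just the seven tokens and IDs printed in the encoding cells, and the real `decode()` method handles special tokens and spacing rules that are ignored here.

```python
# Tokens and IDs copied from the encoding example (bert-base-cased output).
tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]
id_to_token = dict(zip(ids, tokens))


def toy_decode(token_ids):
    """Look up each ID, then glue '##' continuation pieces onto the previous word."""
    words = []
    for token_id in token_ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


decoded = toy_decode(ids)
print(decoded)  # Using a Transformer network is simple
```
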
## Handling multiple sequences
- How do we handle multiple sequences?
- How do we handle multiple sequences of different lengths?
- Are vocabulary indices the only inputs that allow a model to work well?
- Is there such a thing as too long a sequence?


In [50]:
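# Assumes `tokenizer` and `sequence` are the DistilBERT tokenizer and the
# "I've been waiting..." sentence defined in the following cell.
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])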

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


In [51]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


*Batching* is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can still build a batch from it, for example by repeating the same sequence:



In [52]:
batched_ids = [ids, ids]

This is a batch of two identical sequences!

- Batching allows the model to work when you feed it multiple sentences.
- Using multiple sequences is just as simple as building a batch with a single sequence.

When you’re trying to batch together two (or more) sentences, they might be of different lengths. To work around this problem, we usually pad the inputs.



### Padding the inputs
The following list of lists cannot be converted to a tensor:




In [54]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

We’ll use *padding* to make our tensors have a rectangular shape.

Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values.
For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this:



In [55]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
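
The padding step can be sketched as a small helper in plain Python. `pad_batch` and `is_rectangular` are hypothetical names, not library functions, and the padding ID of 100 is just the placeholder value used in the cell above.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


def is_rectangular(batch):
    """True when all sequences have the same length, so a tensor can be built."""
    return len({len(seq) for seq in batch}) <= 1


ragged = [[200, 200, 200], [200, 200]]
padded = pad_batch(ragged, pad_id=100)

print(is_rectangular(ragged))  # False
print(padded)                  # [[200, 200, 200], [200, 200, 100]]
print(is_rectangular(padded))  # True
```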


In [56]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


### Attention Masks
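*Attention masks* are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s:

1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to. Without a mask, the model also attends to the padding token, which is why the second row of the batched logits above differs from the logits of the second sequence on its own.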


In [57]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
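
To see why a mask makes the padded row behave like the unpadded sequence, here is a minimal sketch of one common way to apply a mask inside attention: masked positions are set to negative infinity before the softmax, so they receive zero weight. `masked_softmax` is a hypothetical helper and the scores are made-up numbers; the model's actual internals may differ in detail.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask == 0 get zero weight."""
    # Replacing masked scores with -inf makes exp() return 0 for them.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 3.0]  # made-up attention scores; the last position is padding

unmasked = masked_softmax(scores, [1, 1, 1])
with_mask = masked_softmax(scores, [1, 1, 0])
short = masked_softmax(scores[:2], [1, 1])

print(unmasked)   # the padding position takes a share of the attention
print(with_mask)  # the padding position gets weight 0.0
print(short)      # same weights as the first two entries of with_mask
```

With the mask, the real tokens get the same weights as they would in the unpadded sequence, which matches the logits printed above.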


### Longer sequences

With Transformer models, there is a limit to the lengths of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

- Use a model with a longer supported sequence length.
- Truncate your sequences.
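
Truncation can be sketched by hand as follows. This assumes that 101 and 102 are the IDs of the `[CLS]` and `[SEP]` special tokens for this checkpoint, as the tokenizer outputs in these notes suggest, and that the length limit counts those two tokens. `truncate_with_special_tokens` is a hypothetical helper; in practice the tokenizer does this for you.

```python
CLS_ID, SEP_ID = 101, 102  # assumed IDs of [CLS] and [SEP] for this checkpoint


def truncate_with_special_tokens(ids, max_length):
    """Keep at most max_length IDs in total, including [CLS] and [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    body = ids[: max_length - 2]
    return [CLS_ID] + body + [SEP_ID]


# IDs of "I've been waiting for a HuggingFace course my whole life."
# without special tokens, as printed earlier in these notes.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

truncated = truncate_with_special_tokens(ids, max_length=8)
print(truncated)       # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
print(len(truncated))  # 8
```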


## Putting it all together

The 🤗 Transformers API can handle tokenization, conversion to input IDs, padding, truncation, and attention masks for us with a high-level function that we’ll dive into here.

When you call your tokenizer directly on the sentence, you get back inputs that are ready to pass through your model:




In [58]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [60]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

In [61]:
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

In [62]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

model_inputs


{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

In [65]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


In [66]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


In [67]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)