[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mUI8lDUxtENZFHlhkKMz1e4sp-NNUuKD?usp=sharing)

# Chapter 2. Using 🤗Transformers

## What Happens Inside the pipeline Function?

The ```pipeline()``` function groups together three steps:
- preprocessing (i.e., tokenizing and creating an e
- passing the inputs to the model
- postprocessing 

In [2]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  A

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

sample_sentences = [
                    "I've been waiting for HuggingFace course my whole life.",
                    "I hate this so much!"
]

classifier(sample_sentences)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.985034167766571},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

Now let's go through each of the three steps one at a time.

## Tokenizer 

Tokenizers serve three purposes:
- split the input into tokens
- map each token to an integer
- add padding and/or special tokens (e.g., begin/end of sequence tokens) to the input

**Key Point**: 🤗Transformers models only accept tensors as inputs, so we have to specify the type of tensor we want with the ```return_tensors``` argument.  
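
The three jobs in the list above can be sketched without any library. This is only a toy: the vocabulary and its IDs are made up, and the whitespace split is far cruder than the learned subword tokenizer loaded in the next cell.

```python
# Toy illustration only: this vocabulary and its IDs are hypothetical.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8}

def toy_encode(text, max_length):
    # 1. split the input into tokens (naive lowercase whitespace split)
    tokens = text.lower().split()
    # 2. map each token to an integer, falling back to [UNK]
    ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
    # 3. add special tokens, then pad up to max_length
    ids = [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]
    ids += [toy_vocab["[PAD]"]] * (max_length - len(ids))
    return ids

print(toy_encode("I hate this so much", max_length=8))
# [2, 4, 5, 6, 7, 8, 3, 0]
```

A word that is missing from the toy vocabulary, such as "love", would come back as the ```[UNK]``` ID.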

In [3]:
# Tokenizer 

from transformers import AutoTokenizer

In [4]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
raw_inputs = [
              "I've been waiting for a HuggingFace course my whole life.", 
              "I hate this so much!",
              ]

inputs = tokenizer(raw_inputs, 
                   padding=True,
                   truncation=True,
                   return_tensors='pt')

In [6]:
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [7]:
inputs['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])

**Key Point**: look at ```attention_mask``` above. See the zeros? Those correspond to padding tokens, so we mask them and the model's attention layers ignore them. 
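
To make the role of the mask concrete, here is a minimal sketch of how a mask can remove padded positions from an attention softmax. It is a plain-Python simplification, not DistilBERT's actual implementation, and the scores are made-up numbers.

```python
import math

def masked_softmax(scores, mask):
    # Positions with mask == 0 get a score of -inf, so exp() gives them weight 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for a 4-token sequence whose last two tokens are padding.
scores = [2.0, 1.0, 3.0, 3.0]
mask = [1, 1, 0, 0]

weights = masked_softmax(scores, mask)
print(weights)
```

Even though the padded positions had the highest raw scores, they end up with a weight of exactly zero and the real tokens share all of the attention.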

## Going through the Model

We download our pretrained model the same way we did our tokenizer.

In [8]:
from transformers import AutoModel

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 16, 768])


The code above outputs the *hidden states* aka *features* of the model.

What does that mean? Basically, we're going to use those features as inputs for the head of the model. 

Now, while human heads can do many things, transformer heads can essentially do one thing REALLY well so, to actually solve our classification problem, we need a model with a **sequence classification** head as opposed to the headless ```AutoModel``` we just loaded. 
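
As a rough picture of what a head does, the sketch below takes hidden states of shape [batch, sequence, hidden], keeps the first token's vector for each sentence, and applies one linear layer to get one score per label. The sizes and weights are hypothetical and tiny (hidden size 4 instead of 768), and the real head is more involved: the weight names in the warning above show it also has a ```pre_classifier``` layer.

```python
import random

random.seed(0)
batch_size, seq_len, hidden_size, num_labels = 2, 16, 4, 2

# Stand-in for outputs.last_hidden_state: one vector per token per sentence.
hidden_states = [[[random.uniform(-1, 1) for _ in range(hidden_size)]
                  for _ in range(seq_len)]
                 for _ in range(batch_size)]

# Hypothetical head parameters: a single linear layer, hidden_size -> num_labels.
weight = [[random.uniform(-1, 1) for _ in range(hidden_size)]
          for _ in range(num_labels)]
bias = [0.0] * num_labels

def classification_head(states):
    logits = []
    for sentence in states:
        first_token = sentence[0]  # vector at the first position ([CLS])
        logits.append([sum(w * x for w, x in zip(row, first_token)) + b
                       for row, b in zip(weight, bias)])
    return logits

logits = classification_head(hidden_states)
print(len(logits), len(logits[0]))  # 2 2
```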

In [9]:
from transformers import  AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)

In [10]:
print(outputs.logits.shape)

torch.Size([2, 2])


Since we have two sentences and two labels, we get a two-by-two output: one row per sentence and one column per label. 

## Postprocessing the output
Unfortunately, the outputs don't make any sense on their own because they are raw logits.

In [11]:
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

The question, of course, is what do those logits actually mean? To answer that, we move on to postprocessing. 

We apply a softmax to the logits to turn them into probabilities.

In [12]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


The model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second. 

But what are the labels for those probabilities?

In [13]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [14]:
raw_inputs

["I've been waiting for a HuggingFace course my whole life.",
 'I hate this so much!']

So, for the first sentence, the model predicts with 96% confidence that it is positive while the second is predicted at nearly 100% as being negative.
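
The whole postprocessing step can be reproduced by hand. The sketch below uses only the standard library, with the logits and the label mapping copied from the outputs above, and builds the same kind of dictionaries the pipeline returned at the start of the chapter.

```python
import math

logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]  # copied from the output above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}         # copied from model.config.id2label

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    results.append({"label": id2label[best], "score": probs[best]})

for result in results:
    print(result)
```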

## Deep Dive on [Models](https://huggingface.co/course/chapter2/3?fw=pt)

It's easy to load a model based on a checkpoint:

In [15]:
from transformers import AutoModel

In [16]:
bert_checkpoint = 'bert-base-cased'
gpt2_checkpoint = 'gpt2'
bart_checkpoint = 'facebook/bart-base'

In [17]:
bert_model = AutoModel.from_pretrained(bert_checkpoint)
gpt_model = AutoModel.from_pretrained(gpt2_checkpoint)
bart_model = AutoModel.from_pretrained(bart_checkpoint)

print(type(bert_model))
print(type(gpt_model))
print(type(bart_model))

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<class 'transformers.models.bert.modeling_bert.BertModel'>
<class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>
<class 'transformers.models.bart.modeling_bart.BartModel'>


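Each architecture has its own config class, and the config has to match the checkpoint or else our model will not load properly. ```AutoConfig``` picks the right class for us.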

In [18]:
from transformers import  AutoConfig

bert_config = AutoConfig.from_pretrained(bert_checkpoint)
gpt_config = AutoConfig.from_pretrained(gpt2_checkpoint)
bart_config = AutoConfig.from_pretrained(bart_checkpoint)

In [19]:
print(type(bert_config))
print(type(gpt_config))
print(type(bart_config))

<class 'transformers.models.bert.configuration_bert.BertConfig'>
<class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>
<class 'transformers.models.bart.configuration_bart.BartConfig'>


## Creating a Transformer

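To build a model from scratch, we need the architecture class and a config. A model created this way has randomly initialized weights, so it has to be trained before it is useful.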

In [20]:
from transformers import BertConfig, BertModel

config = BertConfig()

model = BertModel(config)

print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.15.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



It is also possible to load the config for the desired checkpoint like below.

In [21]:
from transformers import  BertConfig

bert_config = BertConfig.from_pretrained(bert_checkpoint)
print(type(bert_config))
print(bert_config)

<class 'transformers.models.bert.configuration_bert.BertConfig'>
BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.15.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



__Key Point__: The config provides all the information necessary to load the model. Namely, it provides the hyperparameters needed to create the architecture.
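
To see how config values turn into architecture, the sketch below derives two shapes from fields copied out of the printed configs above. It uses plain dictionaries rather than the real ```BertConfig``` class, and the helper name ```derived_shapes``` is made up for this example.

```python
import json

# Three fields copied from each printed config above; everything else is omitted.
default_config = json.loads('{"hidden_size": 768, "num_attention_heads": 12, "vocab_size": 30522}')
cased_config = json.loads('{"hidden_size": 768, "num_attention_heads": 12, "vocab_size": 28996}')

def derived_shapes(cfg):
    return {
        # each attention head works on an equal slice of the hidden vector
        "head_dim": cfg["hidden_size"] // cfg["num_attention_heads"],
        # the token embedding table has one row per vocabulary entry
        "token_embedding_params": cfg["vocab_size"] * cfg["hidden_size"],
    }

print(derived_shapes(default_config))
print(derived_shapes(cased_config))
```

The only field that differs between the two configs here is ```vocab_size```, and that alone changes the size of the token embedding table by more than a million parameters.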

## Tokenizers
Machine learning models can only work with numbers, so we have to convert our raw text into numbers. That is exactly what our transformer tokenizers do: take raw text and convert it into useful pieces of information for our model to process. 

To get a better understanding of how tokenizers work, we'll go over three kinds: 
- word based
- character based
- subword based 

Which tokenizer you use will be based on your checkpoint which, of course, is based on the task you are performing. 

### Word Based
Every word is treated as a unique token.



In [22]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


Unfortunately, this means "play", "played", "plays", and "playing" are all considered unique words. Needless to say, this creates a HUGE vocabulary and, with it, a huge model. 

To combat this issue, we can simply limit the size of our vocabulary by telling our tokenizer to only remember the top k most frequent words. The problem this practice gives rise to is that every other word is now mapped to the same [unknown] token, so the model cannot tell those words apart. 
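
Here is a small sketch of that top-k trade-off. The corpus and the budget ```k``` are made up for illustration.

```python
from collections import Counter

# Hypothetical mini-corpus; a real one would have millions of words.
corpus = "we play we play we played they play they playing".split()

k = 3  # hypothetical vocabulary budget
counts = Counter(corpus)
vocab = {word for word, _ in counts.most_common(k)}

def word_tokenize(text):
    # any word outside the top-k vocabulary collapses to the same [UNK] token
    return [word if word in vocab else "[UNK]" for word in text.split()]

print(word_tokenize("they played we playing"))
# ['they', '[UNK]', 'we', '[UNK]']
```

"played" and "playing" are now indistinguishable, even though both are close relatives of a word the vocabulary does contain.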

One solution is to use character-based tokenizers. 

### Character-based Tokenizers

Instead of mapping a number to every word, we'll map a number to every character. 

In [23]:
def split(word):
  return [char for char in word if char.strip()]

text = "Jim Henson was a puppeteer"

print(split(text))

['J', 'i', 'm', 'H', 'e', 'n', 's', 'o', 'n', 'w', 'a', 's', 'a', 'p', 'u', 'p', 'p', 'e', 't', 'e', 'e', 'r']


The new problem is quite obvious: individual characters in languages which use alphabets do not convey nearly as much meaning as words.

Additionally, we now have to process an incredible number of tokens for our model. 
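
The size of that cost is easy to measure on the sentence used above: the vocabulary shrinks to a handful of characters, but the sequence gets several times longer.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = [char for char in text if char.strip()]  # same rule as split() above

print(len(word_tokens), "word tokens")       # 5 word tokens
print(len(char_tokens), "character tokens")  # 22 character tokens
print(len(set(char_tokens)), "distinct characters")
```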

To that end, our last approach is a fusion of the two previously mentioned. 

### Subword tokenization

**Key Point**: Frequently used words should not be split into subwords while less frequently used words should. 

By following the above principle, we avoid the pitfalls of word-based tokenizers (i.e., having a huge number of unique tokens as well as unknown tokens) as well as the issues with character-based tokenizers (i.e., having tokens which carry little meaning and having too many tokens for our model to process). 

Consequently, nearly all state of the art (SOTA) architectures use a form of subword tokenization. 
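
A minimal sketch of the idea, assuming a tiny hand-written vocabulary: each word is split greedily, longest match first, and pieces that continue a word are marked with "##". This is similar in spirit to how BERT-style tokenizers split words at inference time, but the vocabulary here is hypothetical and a real one is learned from data.

```python
# Hypothetical subword vocabulary; "##" marks a piece that continues a word.
subword_vocab = {"using", "a", "network", "transform", "##er", "##ers", "##ed", "[UNK]"}

def split_word(word):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in subword_vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def subword_tokenize(text):
    return [piece for word in text.lower().split() for piece in split_word(word)]

print(subword_tokenize("Using a transformer network"))
# ['using', 'a', 'transform', '##er', 'network']
```

Frequent words such as "network" stay whole, while "transformer", "transformers", and "transformed" all share the piece "transform" instead of each needing its own vocabulary entry.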

## Loading and saving

It's just like loading and saving models.

We can load it directly: 

In [24]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Or we can use ```AutoTokenizer```

In [25]:
from transformers import AutoTokenizer

tokenizer2 = AutoTokenizer.from_pretrained("bert-base-cased")

In [26]:
sample_text = "Using a Transformer network is simple."

In [27]:
tokenizer(sample_text)['input_ids'] == tokenizer2(sample_text)['input_ids']

True

Saving it is just as simple: 

In [28]:
# tokenizer.save_pretrained("some_directory_somewhere")

### [Encoding](https://huggingface.co/course/chapter2/4?fw=pt#encoding)

Encoding is simply converting text to numbers. Since machines don't understand raw text, we have to map text to numbers and then pass those numbers as inputs to our model.

Therefore, we have to tokenize our raw text and then map these tokens to the numbers which are stored in our model's ```vocabulary```. 

### Tokenization 

In [29]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

sequence = "Using a transformer network is piss easy."

tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'transform', '##er', 'network', 'is', 'piss', 'easy', '.']


Since we're using a BERT-based tokenizer, it splits rare words into subwords in order to avoid ```UNK``` tokens.

### From tokens to input IDs

Now we can convert the tokens to numbers aka input IDs.

In [30]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7993, 170, 11303, 1200, 2443, 1110, 20693, 3123, 119]


### Decoding 
Just like *decrypt* is the opposite of *encrypt*, *decode* is the opposite of *encode*. 

In [31]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

Using a transformer network is piss easy.
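
The round trip can be mimicked with a plain dictionary. The token/ID pairs below are copied from the outputs above; the helper names ```toy_encode``` and ```toy_decode``` are made up, and the real ```decode``` handles many more cases than this sketch does.

```python
# Token/ID pairs copied from the outputs above; the real vocabulary has 28996 entries.
vocab = {"Using": 7993, "a": 170, "transform": 11303, "##er": 1200,
         "network": 2443, "is": 1110, "easy": 3123, ".": 119}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def toy_encode(tokens):
    return [vocab[tok] for tok in tokens]

def toy_decode(ids):
    text = ""
    for tok in (id_to_token[i] for i in ids):
        if tok.startswith("##"):
            text += tok[2:]        # glue a continuation piece to the previous one
        elif tok == "." or not text:
            text += tok            # no space before a period or at the start
        else:
            text += " " + tok
    return text

ids = toy_encode(["Using", "a", "transform", "##er", "network", "is", "easy", "."])
print(ids)
print(toy_decode(ids))
# Using a transformer network is easy.
```

Decoding is more than a reverse lookup: the "##" pieces have to be merged back into whole words to recover readable text.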


## Handling Multiple Sequences

### Models expect a batch of inputs
Of course, we're not going to be working with one sentence at a time so we need to work with batches of inputs. Luckily, 🤗Transformers makes that dead simple and, in fact, its models expect multiple sequences by default, which is why the IDs below are wrapped in an extra list before being turned into a tensor.


In [3]:
import torch 
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

In [8]:
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])

print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


### Padding the inputs
All sequences in a batch must be the same length, meaning we either need to truncate or pad our sequences. 

To make this process super simple, we can just pass ```padding=True``` to our tokenizer. 

In [12]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentences = [
             "I've been waiting for this course for what seems like forever.",
             "I can't stand this!"
]

tokens = tokenizer(sentences, padding=True)

In [19]:
for x in tokens['input_ids']:
  print(x)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2005, 2054, 3849, 2066, 5091, 1012, 102]
[101, 1045, 2064, 1005, 1056, 3233, 2023, 999, 102, 0, 0, 0, 0, 0, 0, 0]


And how do we get the model to ignore the padding? We pass it an attention mask literally telling the model to pay attention to tokens marked 1 and ignore 0.

In [18]:
for x in tokens['attention_mask']:
  print(x)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]


## Putting it all together

As hinted at above, we can trust the API to pad and/or truncate our inputs accordingly. Additionally, we can explicitly tell the API how it should be done like so: 

In [33]:
# Pad to the length of the longest sentence
model_inputs = tokenizer(sentences, padding='longest')

# Pad to the max length of the model 
# model_inputs = tokenizer(sentences, padding='max_length', truncation=True)

# Pad and truncate to a specified max length
model_inputs = tokenizer(sentences, 
                         padding='max_length', 
                         max_length=8,
                         truncation=True) 

In [34]:
for x in model_inputs['input_ids']:
  print(x)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
[101, 1045, 2064, 1005, 1056, 3233, 2023, 102]
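
The output above can be reproduced by hand. The sketch below is a simplification that works on IDs which already include the special tokens (the real tokenizer truncates before adding them); the helper name ```fit_to_length``` is made up, and the IDs are copied from the earlier outputs.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs visible in the outputs above

def fit_to_length(ids, max_length):
    # Truncate: keep the leading tokens that fit, then close with [SEP].
    if len(ids) > max_length:
        ids = ids[:max_length - 1] + [SEP_ID]
    # Pad: fill whatever room is left with the padding ID.
    return ids + [PAD_ID] * (max_length - len(ids))

first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2005, 2054, 3849, 2066, 5091, 1012, 102]
second = [101, 1045, 2064, 1005, 1056, 3233, 2023, 999, 102]

print(fit_to_length(first, 8))    # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
print(fit_to_length(second, 8))   # [101, 1045, 2064, 1005, 1056, 3233, 2023, 102]
print(fit_to_length(second, 12))  # padded with three zeros
```

With ```max_length=8``` both sentences are too long, so both are truncated and no padding is needed; with a larger limit the shorter sentence is padded instead.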


Additionally, we can specify the tensor type to return: NumPy ("np"), TensorFlow ("tf"), or PyTorch ("pt"). 

In [35]:
model_inputs = tokenizer(sentences, 
                         padding='max_length', 
                         max_length=8,
                         truncation=True, 
                         return_tensors="pt")

In [36]:
for x in model_inputs['input_ids']:
  print(x)

tensor([ 101, 1045, 1005, 2310, 2042, 3403, 2005,  102])
tensor([ 101, 1045, 2064, 1005, 1056, 3233, 2023,  102])


All together now! 

In [48]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

output = model(**tokens)

predictions = torch.nn.functional.softmax(output.logits, dim=-1)

In [50]:
for prediction in predictions:
  print(prediction)

tensor([0.0402, 0.9598], grad_fn=<UnbindBackward0>)
tensor([5.3534e-04, 9.9946e-01], grad_fn=<UnbindBackward0>)
