[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mUI8lDUxtENZFHlhkKMz1e4sp-NNUuKD?usp=sharing)

# Chapter 2. Using 🤗Transformers

## What Happens Inside the pipeline Function?

The ```pipeline()``` function groups together three steps:
- preprocessing (i.e., tokenizing and creating an e
- passing the inputs to the model
- postprocessing 

In [None]:
!pip install transformers

In [3]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

sample_sentences = [
                    "I've been waiting for HuggingFace course my whole life.",
                    "I hate this so much!"
]

classifier(sample_sentences)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.985034167766571},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

Now let's go through each of the three steps one at a time.

## Tokenizer 

Tokenizers serve three purposes:
- split the input into tokens
- map each token to an integer ID
- add padding and/or special tokens (e.g., beginning- and end-of-sequence tokens) to the input

**Key Point**: 🤗Transformers models only accept tensors as inputs, so we have to specify the type of tensor we want with the ```return_tensors``` argument.

In [4]:
# Tokenizer 

from transformers import AutoTokenizer

In [5]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [6]:
raw_inputs = [
              "I've been waiting for a HuggingFace course my whole life.", 
              "I hate this so much!",
              ]

inputs = tokenizer(raw_inputs, 
                   padding=True,
                   truncation=True,
                   return_tensors='pt')

In [9]:
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


**Key Point**: look at ```attention_mask``` above. See the zeros? They correspond to the padding tokens added to the shorter sentence. The mask tells the model's attention layers to ignore those positions, so the padding does not affect the result.
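To make the padding and the mask concrete, here is a minimal pure-Python sketch that rebuilds the ```attention_mask``` above by hand. The helper name ```pad_batch``` is hypothetical and is not part of 🤗Transformers. The token IDs are copied from the output above, and the padding ID of 0 is read off that same output.

```python
# Token IDs copied from the tokenizer output above, with the padding removed.
sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]
PAD_ID = 0  # assumption: padding ID as seen in the output above


def pad_batch(batch, pad_id):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(sequences, PAD_ID)
print(attention_mask[1])
```

The second row of the mask has eight ones followed by eight zeros, matching the tokenizer output: only the first eight positions of the shorter sentence hold real tokens.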

## Going through the Model

We download our pretrained model the same way we did our tokenizer.

In [10]:
from transformers import  AutoModel

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 16, 768])


The code above outputs the hidden states of the model, with shape (batch size, sequence length, hidden size).

To actually solve our classification problem, we need a model with a **sequence classification** head.

In [11]:
from transformers import  AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)

In [12]:
print(outputs.logits.shape)

torch.Size([2, 2])


Since we have two sentences and two labels, the logits have shape 2 x 2: one row per sentence and one column per label.

## Postprocessing the Output
Unfortunately, the outputs don't make much sense on their own.

In [17]:
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

The question, of course, is what do those logits actually mean? To answer that, we move on to postprocessing.

We apply a softmax layer to the logits to get probabilities.

In [18]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


The model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second.

But what are the labels for those probabilities?
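As a sanity check on what softmax does, the sketch below recomputes the probabilities with plain Python instead of ```torch```. The logits are copied (rounded to four decimals) from the output above, and the ```softmax``` helper is written here only for illustration.

```python
import math

# Logits copied from the model output above (rounded to four decimals).
logits = [
    [-1.5607, 1.6123],
    [4.1692, -3.3464],
]


def softmax(row):
    """Exponentiate each logit and normalize so the row sums to 1."""
    shift = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the values agree with the ```torch``` result up to the rounding of the copied logits.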

In [19]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [20]:
raw_inputs

["I've been waiting for a HuggingFace course my whole life.",
 'I hate this so much!']

So, for the first sentence, the model predicts with 96% confidence that it is positive, while the second is predicted as negative with nearly 100% confidence.
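The last step, turning probabilities into labelled predictions, can be sketched in a few lines of plain Python. The probabilities are rounded values taken from the softmax output above, and ```id2label``` is copied from ```model.config.id2label```. The variable names are illustrative only.

```python
# Rounded probabilities from the softmax output above.
probabilities = [
    [0.0402, 0.9598],
    [0.9995, 0.0005],
]
# Mapping copied from model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in probabilities:
    best_id = max(range(len(row)), key=lambda i: row[i])  # index of the largest probability
    results.append({"label": id2label[best_id], "score": row[best_id]})

print(results)
```

This produces the same kind of list of label and score dictionaries that ```pipeline()``` returned at the start of the chapter.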

## Deep Dive on [Models](https://huggingface.co/course/chapter2/3?fw=pt)

It's easy to load a model based on a checkpoint:

In [None]:
from transformers import AutoModel

In [None]:
bert_checkpoint = 'bert-base-cased'
gpt2_checkpoint = 'gpt2'
bart_checkpoint = 'facebook/bart-base'

In [None]:
bert_model = AutoModel.from_pretrained(bert_checkpoint)
gpt_model = AutoModel.from_pretrained(gpt2_checkpoint)
bart_model = AutoModel.from_pretrained(bart_checkpoint)

print(type(bert_model))
print(type(gpt_model))
print(type(bart_model))

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Each architecture needs its matching configuration, or the model cannot be built. ```AutoConfig``` picks the right config class for a given checkpoint.

In [None]:
from transformers import  AutoConfig

bert_config = AutoConfig.from_pretrained(bert_checkpoint)
gpt_config = AutoConfig.from_pretrained(gpt2_checkpoint)
bart_config = AutoConfig.from_pretrained(bart_checkpoint)

In [None]:
print(type(bert_config))
print(type(gpt_config))
print(type(bart_config))

It is also possible to import the architecture-specific config class directly and load it for the desired checkpoint, as below.

In [None]:
from transformers import  BertConfig

bert_config = BertConfig.from_pretrained(bert_checkpoint)
print(type(bert_config))
print(bert_config)

### __Key Point__

The config provides all the information necessary to load the model. Namely, it provides the information needed to create the architecture.


## Tokenizers

HuggingFace provides several types of tokenizers, including word-based, character-based, byte-level BPE (used in GPT-2), WordPiece (used in BERT), and SentencePiece or Unigram (used in several multilingual models).

### Loading and Saving

Same concepts as loading and saving models and configs: 

In [None]:
albert_checkpoint = 'albert-base-v1'
bart_checkpoint = 'facebook/bart-base'
bert_checkpoint = 'bert-base-cased'
gpt2_checkpoint = 'gpt2'
roberta_checkpoint = 'roberta-base'

from transformers import  BertTokenizer

tokenizer = BertTokenizer.from_pretrained(bert_checkpoint)

In [None]:
from transformers import  AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(bert_checkpoint)

In [None]:
sequence = "Using a Transformer network is simple" 
tokenizer(sequence)

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Encoding




In [None]:
inputs = tokenizer("Let's try to tokenize!")

print(inputs['input_ids'])

[101, 2421, 112, 188, 2222, 1106, 22559, 3708, 106, 102]


In [None]:
tokenizer.tokenize("Let's try to tokenize!")

['Let', "'", 's', 'try', 'to', 'token', '##ize', '!']
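The ```##ize``` piece above comes from subword splitting. The sketch below is a simplified illustration of greedy longest-match-first splitting, the general idea behind WordPiece. It is not the library's implementation: ```wordpiece_split``` and ```toy_vocab``` are hypothetical, and the real tokenizer also handles normalization, punctuation, and a much larger vocabulary.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedily take the longest vocabulary entry that matches at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # no match: try a shorter candidate
        if piece is None:
            return [unk]  # nothing in the vocabulary matches at this position
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, an assumption made for this example only.
toy_vocab = {"token", "##ize", "try", "to"}

print(wordpiece_split("tokenize", toy_vocab))
print(wordpiece_split("try", toy_vocab))
```

With this toy vocabulary, "tokenize" is not a known word, so it is split into "token" plus the continuation piece "##ize", while "try" is kept whole.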

In [None]:
print(tokenizer.decode(inputs['input_ids']))

[CLS] Let's try to tokenize! [SEP]
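Putting the three outputs above side by side suggests how encoding works: tokens are mapped to IDs, then the special tokens are added at both ends. The sketch below redoes this by hand. The token-to-ID pairs are read off the outputs above by aligning them position by position, the IDs 101 and 102 for ```[CLS]``` and ```[SEP]``` are taken from the same outputs, and ```encode``` and ```toy_vocab``` are hypothetical names, not library functions.

```python
# Tokens and IDs aligned position by position from the outputs above.
tokens = ["Let", "'", "s", "try", "to", "token", "##ize", "!"]
ids = [2421, 112, 188, 2222, 1106, 22559, 3708, 106]
toy_vocab = dict(zip(tokens, ids))  # tiny stand-in for the real vocabulary

CLS_ID, SEP_ID = 101, 102  # first and last IDs in the output above


def encode(tokens, vocab):
    """Map tokens to IDs and wrap them in the [CLS] and [SEP] special tokens."""
    return [CLS_ID] + [vocab[t] for t in tokens] + [SEP_ID]


input_ids = encode(tokens, toy_vocab)
print(input_ids)
```

The result matches ```inputs['input_ids']``` above, which is why decoding those IDs shows ```[CLS]``` and ```[SEP]``` around the original sentence.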


## Handling Multiple Inputs

In [None]:
tokenizer.pad_token_id

0