# Chapter 2. Using 🤗Transformers

## What Happens Inside the pipeline Function?

The ```pipeline()``` function groups together three steps:
- preprocessing (i.e., tokenizing and creating an e
- passing the inputs to the model
- postprocessing 

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

sample_sentences = [
    "I've been waiting for HuggingFace course my whole life.",
    "I hate this so much!",
]

classifier(sample_sentences)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.985034167766571},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

Now let's go through each of the three steps one at a time.

## Tokenizer 

Tokenizers serve three purposes:
- split the input into tokens
- map each token to an integer ID
- add padding and/or special tokens (e.g., beginning- and end-of-sequence tokens) to the input
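
The three steps can be sketched by hand in plain Python. This is a toy illustration, not the real WordPiece algorithm: the splitting rule, the helper names, and the tiny `TOY_VOCAB` are assumptions, with IDs picked to match what the real tokenizer prints for this sentence later in this section.

```python
import re

# Hypothetical excerpt of a vocabulary; the real one has tens of thousands of entries.
TOY_VOCAB = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "this": 2023, "so": 2061, "much": 2172, "!": 999,
}

def toy_tokenize(text):
    """Step 1: lowercase the text, then split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text, max_length):
    """Run all three steps and return (tokens, padded IDs)."""
    # Step 3a: wrap the tokens in begin/end special tokens.
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Step 2: look up the integer ID of every token.
    ids = [TOY_VOCAB[token] for token in tokens]
    # Step 3b: pad with the [PAD] ID up to max_length.
    ids += [TOY_VOCAB["[PAD]"]] * (max_length - len(ids))
    return tokens, ids

tokens, ids = toy_encode("I hate this so much!", max_length=10)
print(tokens)
print(ids)
```

A real tokenizer additionally splits unknown words into subword pieces, which this sketch does not attempt.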

**Key Point**: 🤗Transformers models only accept tensors as inputs, so we have to specify the type of tensor we want with the ```return_tensors``` argument (here 'pt', for PyTorch).

In [3]:
# Tokenizer 

from transformers import AutoTokenizer

In [4]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

inputs = tokenizer(raw_inputs, 
                   padding=True,
                   truncation=True,
                   return_tensors='pt')

In [6]:
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [8]:
inputs['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])

**Key Point**: look at ```attention_mask``` above. See the zeros? They correspond to the padding tokens. The mask tells the model to ignore those positions when computing attention, so the padding does not influence the result.
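
How padding and the mask relate can be sketched in a few lines of plain Python. The helper name `pad_batch` is hypothetical, and the two ID lists are copied from the `input_ids` printed above with the trailing zeros removed.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```

The shorter sentence gains eight padding IDs and eight zeros in its mask, which matches the second row of each tensor above.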

## Going through the Model

We download our pretrained model the same way we did our tokenizer.

In [10]:
from transformers import AutoModel

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 16, 768])


The code above outputs the *hidden states*, also known as the *features*, of the model. The shape `[2, 16, 768]` reads as batch size (2 sentences), sequence length (16 tokens after padding), and hidden size (768 values per token).

What does that mean? Basically, we're going to use those features as inputs for the head of the model.

Now, while human heads can do many things, a transformer head essentially does one thing REALLY well. So, to actually solve our classification problem, we need a model with a **sequence classification** head rather than the bare base model we just loaded.

In [13]:
from transformers import  AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)

In [14]:
print(outputs.logits.shape)

torch.Size([2, 2])


Since we have two sentences and two possible labels, the logits have shape 2 x 2: one row per sentence and one column per label.
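
To see how a head can turn hidden states into logits, here is a toy sketch in plain Python. It assumes the head reads only the first token's vector of each sentence and applies a single linear layer. The real head in this checkpoint also contains a `pre_classifier` layer (its weights are named in the warning above), and the hidden size of 4 and every weight below are made up for illustration.

```python
def linear(vector, weights, bias):
    """Compute weights @ vector + bias for a single vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def toy_classification_head(hidden_states, weights, bias):
    """Map (batch, seq_len, hidden) features to (batch, num_labels) logits."""
    # Keep only the first token's vector of each sequence, then project it.
    return [linear(sequence[0], weights, bias) for sequence in hidden_states]

# 2 sentences x 3 tokens x hidden size 4 (the real shape above is 2 x 16 x 768).
hidden_states = [
    [[0.5, -1.0, 0.0, 2.0], [0.1, 0.1, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0]],
    [[-1.0, 0.5, 1.0, 0.0], [0.2, 0.2, 0.2, 0.2], [0.0, 0.0, 0.0, 0.0]],
]
weights = [
    [1.0, 0.0, 0.0, -1.0],  # row producing the logit for label 0
    [0.0, 1.0, 0.0, 1.0],   # row producing the logit for label 1
]
bias = [0.0, 0.5]

logits = toy_classification_head(hidden_states, weights, bias)
print(len(logits), len(logits[0]))  # batch size, number of labels
print(logits)
```

Whatever the sequence length, the result has one row per sentence and one column per label, which is why the real model returns a 2 x 2 tensor.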

## Postprocessing the Output
Unfortunately, the outputs don't make much sense on their own because they are raw logits.

In [16]:
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

The question, of course, is what do those logits actually mean? To answer that, we move on to postprocessing.

We apply a softmax function to the logits to turn them into probabilities.

In [25]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


The model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second.
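
Softmax exponentiates each logit and divides by the sum of the exponentials, so every row becomes a set of probabilities adding up to 1. A minimal plain-Python sketch, using the logits printed above (the `softmax` function here is our own helper, not the PyTorch one):

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

rows = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in rows]

for row in probabilities:
    print([round(p, 4) for p in row])
```

Because the printed logits are themselves rounded, the results should agree with the PyTorch tensor only up to small rounding differences.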

But what are the labels for those probabilities?

In [26]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [27]:
raw_inputs

["I've been waiting for a HuggingFace course my whole life.",
 'I hate this so much!']

So, for the first sentence, the model predicts with 96% confidence that it is positive while the second is predicted at nearly 100% as being negative.
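
This last lookup can be written out explicitly: for each row, take the index of the largest probability and translate it with `id2label`. The sketch below copies the probabilities and the mapping from the outputs above; the helper name `to_labels` is hypothetical. It produces dictionaries in the same format that `pipeline()` returned at the start of the chapter.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [4.0195e-02, 9.5980e-01],
    [9.9946e-01, 5.4418e-04],
]

def to_labels(probabilities, id2label):
    """Pick the most probable class in each row and attach its label."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest value
        results.append({"label": id2label[best], "score": row[best]})
    return results

results = to_labels(probabilities, id2label)
for result in results:
    print(result)
```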

## Deep Dive on [Models](https://huggingface.co/course/chapter2/3?fw=pt)

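It's easy to load a model based on a checkpoint. ```AutoModel``` instantiates the architecture class that matches each checkpoint, as the printed types show.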

In [28]:
from transformers import AutoModel

In [29]:
bert_checkpoint = 'bert-base-cased'
gpt2_checkpoint = 'gpt2'
bart_checkpoint = 'facebook/bart-base'

In [30]:
bert_model = AutoModel.from_pretrained(bert_checkpoint)
gpt_model = AutoModel.from_pretrained(gpt2_checkpoint)
bart_model = AutoModel.from_pretrained(bart_checkpoint)

print(type(bert_model))
print(type(gpt_model))
print(type(bart_model))

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<class 'transformers.models.bert.modeling_bert.BertModel'>
<class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>
<class 'transformers.models.bart.modeling_bart.BartModel'>


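Every checkpoint comes with a config file that describes its architecture. A config that does not match the architecture will not work, so we let `AutoConfig` load the right one for each checkpoint.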

In [31]:
from transformers import  AutoConfig

bert_config = AutoConfig.from_pretrained(bert_checkpoint)
gpt_config = AutoConfig.from_pretrained(gpt2_checkpoint)
bart_config = AutoConfig.from_pretrained(bart_checkpoint)

In [32]:
print(type(bert_config))
print(type(gpt_config))
print(type(bart_config))

<class 'transformers.models.bert.configuration_bert.BertConfig'>
<class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>
<class 'transformers.models.bart.configuration_bart.BartConfig'>


## Creating a Transformer

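To create a model from scratch, we instantiate a config and pass it to the architecture class. A model built this way has randomly initialized weights.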

In [35]:
from transformers import BertConfig, BertModel

config = BertConfig()

model = BertModel(config)

print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.15.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



It is also possible to load the config of a specific checkpoint directly, as below.

In [36]:
from transformers import  BertConfig

bert_config = BertConfig.from_pretrained(bert_checkpoint)
print(type(bert_config))
print(bert_config)

<class 'transformers.models.bert.configuration_bert.BertConfig'>
BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.15.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



**Key Point**: The config provides all the information needed to build the architecture (number of layers, hidden size, vocabulary size, and so on). It does not contain the trained weights; those come from the checkpoint.
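
As a small worked example of what the config determines, the sketch below derives two quantities from the fields printed above: the size of the token-embedding matrix (`vocab_size` times `hidden_size`) and the width of each attention head (`hidden_size` divided by `num_attention_heads`). The dictionaries are hand-copied subsets of the two configs, and the helper name `derived_sizes` is hypothetical.

```python
# Subsets of the two configs printed above, copied by hand.
default_config = {"vocab_size": 30522, "hidden_size": 768, "num_attention_heads": 12}
bert_base_cased = {"vocab_size": 28996, "hidden_size": 768, "num_attention_heads": 12}

def derived_sizes(config):
    """Return (token-embedding parameter count, per-head dimension)."""
    embedding_params = config["vocab_size"] * config["hidden_size"]
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    return embedding_params, head_dim

print(derived_sizes(default_config))
print(derived_sizes(bert_base_cased))
```

Among these fields the two configs differ only in `vocab_size`, so their embedding tables have different sizes while the attention heads have the same width.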

## Tokenizers

