<a href="https://colab.research.google.com/github/freud-sensei/imfine_torch/blob/main/%5B%EC%95%88%EC%95%84%EC%A4%98%EC%9A%94%5D_2_Using_HuggingFace_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Behind the pipeline

## Preprocessing with a tokenizer

* splitting the input into tokens (words, subwords, symbols)
* mapping each token to an integer
* adding additional inputs that may be useful to the model

In [None]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
raw_inputs = [
    "I am very tired today, so if you disturb me, I will breath fire.",
    "What a lovely day! Go back inside!"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# return_tensors can be "pt" (PyTorch), "tf" (TensorFlow), or "np" (NumPy)
print(inputs)

{'input_ids': tensor([[  101,  1045,  2572,  2200,  5458,  2651,  1010,  2061,  2065,  2017,
         22995,  2033,  1010,  1045,  2097,  3052,  2543,  1012,   102],
        [  101,  2054,  1037,  8403,  2154,   999,  2175,  2067,  2503,   999,
           102,     0,     0,     0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


## Going through the model

In [None]:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

* outputs **hidden states (features)**: high-dimensional vectors representing the model's contextual understanding of the input
* hidden states are usually inputs to another part of the model (**head**)

**vector output**

(batch size, sequence length, hidden size)

* batch size: number of sequences processed at a time (2 here)
* sequence length: length of the numerical representation of the sequence (19 here)
* hidden size: vector dimension of each model input (high-dimensional; 768 here)

In [None]:
output = model(**inputs)
# ** unpacks the dict, so the call becomes model(input_ids=<tensor>, attention_mask=<tensor>)
print(output["last_hidden_state"].shape)

torch.Size([2, 19, 768])


**model heads**

* *model input* (tokenized input)
* **embeddings layer**: converts each input ID into a vector that represents the token
* **subsequent layers**: manipulate the vectors using the attention mechanism
* *hidden states*: final representation of the input
* **head**: projects the hidden states onto a different dimension; each head is designed for a specific task
* *model output*


In [None]:
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [None]:
print(outputs["logits"].shape)
# Expected result: there are 2 sentences, and 2 labels as well

torch.Size([2, 2])


## Postprocessing the output

In [None]:
print(outputs["logits"])

tensor([[ 2.6428, -2.2403],
        [-4.3428,  4.6658]], grad_fn=<AddmmBackward0>)


* not probabilities, but logits (raw, unnormalized scores)
* the loss function usually includes the softmax, so model implementations typically stop at computing the logits

In [None]:
import torch
predictions = torch.nn.functional.softmax(outputs["logits"], dim=-1)
print(predictions)

tensor([[9.9248e-01, 7.5164e-03],
        [1.2233e-04, 9.9988e-01]], grad_fn=<SoftmaxBackward0>)


In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
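
To turn the logits into a readable prediction, one can apply softmax, pick the most probable index, and look it up in `id2label`. The sketch below does this in plain Python with the logits and label mapping printed above hard-coded; the `softmax` helper is written here for illustration and is not a library function.

```python
import math

# Logits and label mapping copied from the outputs above.
logits = [[2.6428, -2.2403], [-4.3428, 4.6658]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

results = []
for row in logits:
    probs = softmax(row)
    # Index of the largest probability is the predicted class ID.
    best = max(range(len(probs)), key=probs.__getitem__)
    results.append((id2label[best], probs[best]))

for label, prob in results:
    print(f"{label}: {prob:.4f}")
```

With the real model, `outputs.logits.argmax(dim=-1)` gives the same class IDs directly as a tensor.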

# Models

Our objective: making a BERT Model

## Creating a Transformer

In [None]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)  # the model is randomly initialized
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



## Different loading methods

In [None]:
from transformers import BertModel
# AutoModel also works; it inspects the checkpoint and loads the matching model class.
# Here a pretrained model is loaded without going through BertConfig.
model = BertModel.from_pretrained('bert-base-cased')  # checkpoint published by the BERT authors

## Saving methods

In [None]:
model.save_pretrained("i_am_a_pizza")

In [None]:
! ls i_am_a_pizza

config.json  model.safetensors


Of the saved files, `config.json` holds the configuration values (e.g. `hidden_size`), and `model.safetensors` holds the trained weights (also called the state dictionary).
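
The configuration values determine the shapes of the stored weights. As a rough illustration, the arithmetic below counts the parameters of the embedding block for the default `BertConfig` values printed earlier, assuming the standard BERT layout (word, position, and token-type embedding tables, plus a LayerNorm with a weight and a bias vector). The dictionary names are only labels chosen for this sketch.

```python
# Values copied from the default BertConfig printed earlier.
config = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
}

hidden = config["hidden_size"]

# Each embedding table has shape (number of entries, hidden_size).
embedding_params = {
    "word_embeddings": config["vocab_size"] * hidden,
    "position_embeddings": config["max_position_embeddings"] * hidden,
    "token_type_embeddings": config["type_vocab_size"] * hidden,
    # LayerNorm stores one weight and one bias per hidden unit.
    "LayerNorm": 2 * hidden,
}

total = sum(embedding_params.values())
for name, count in embedding_params.items():
    print(f"{name:>22}: {count:,}")
print(f"{'total':>22}: {total:,}")
```

On a loaded model, `sum(p.numel() for p in model.embeddings.parameters())` should give a comparable count for that model's own configuration.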

In [None]:
pizza_model = BertModel.from_pretrained("i_am_a_pizza")
pizza_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

## AutoConfig

In [None]:
from transformers import AutoConfig
bert_config = AutoConfig.from_pretrained("bert-base-cased")
print(bert_config)

BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



# Tokenizers

* translating text inputs to numerical data that can be processed by the model

## word-based

e.g. "Let's do tokenization!" -> (Let), ('s), (do), (tokenization), (!)

In [None]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


* each word gets assigned an ID (from 0 to size of vocabulary)
* cannot identify related words (dog-dogs, run-running)
* vocabulary size can end up very large -> solution: limiting the amount of vocabulary
* [UNK] (unknown token): represents words not in vocabulary
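
The points above can be sketched with a toy word-based tokenizer. The corpus, the `build_vocab` and `encode` helpers, and the vocabulary limit are all made up for illustration; they are not part of any library.

```python
def build_vocab(corpus, max_size):
    """Keep the most frequent words; ID 0 is reserved for [UNK]."""
    counts = {}
    for sentence in corpus:
        for word in sentence.split():
            counts[word] = counts.get(word, 0) + 1
    # Sort by frequency (descending), then alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    vocab = {"[UNK]": 0}
    for word in ranked[: max_size - 1]:
        vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map each whitespace-separated word to its ID, or to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

corpus = ["the dog runs", "the dog sleeps", "the dogs run"]
vocab = build_vocab(corpus, max_size=3)  # limit the vocabulary size
ids = encode("the dogs run", vocab)

print(vocab)
print(ids)
```

Because the vocabulary is capped at three entries, "dogs" falls back to `[UNK]` even though "dog" is known, which is the related-words problem listed above.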

## character-based

In [None]:
tokenized_text = list("Let's do tokenization!".replace(" ", ""))
print(tokenized_text)

['L', 'e', 't', "'", 's', 'd', 'o', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', '!']


* vocabulary is much smaller
* much fewer unknown tokens
* each character is less meaningful than a word
* large number of tokens to be processed by the model ([Jim] vs [J], [i], [m])
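
A quick way to see the trade-off is to count tokens and distinct symbols for the same sentence under both schemes. This is a small illustration on one sentence, not a benchmark.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text.replace(" ", ""))

# Sequence length the model would have to process under each scheme.
print("word tokens:", len(word_tokens))
print("char tokens:", len(char_tokens))

# Distinct symbols each scheme needs in its vocabulary for this sentence.
print("distinct words:", len(set(word_tokens)))
print("distinct chars:", len(set(char_tokens)))
```

On a full corpus the gap widens: the set of characters stays small while the set of distinct words keeps growing.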

## subword tokenization

* frequently used words should not be split into smaller subwords ('dog' -> [dog])
* complex words should be decomposed into meaningful subwords ('dogs' -> [dog], [##s])

e.g., 'annoyingly' -> [annoying], [ly]

e.g., 'Let's do tokenization!' -> [Let's], [do], [token], [##ization], [!]

* relatively good coverage with small vocabularies
* close to no unknown tokens
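
One common way to apply a subword vocabulary is greedy longest-match-first splitting, which is how WordPiece-style tokenization is usually described. The sketch below uses a tiny hypothetical vocabulary and a hypothetical `split_word` helper; it is a simplified illustration, not the tokenizer implementation used by BERT.

```python
def split_word(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab.

    Pieces that do not start the word carry the '##' prefix.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, made up for this example.
vocab = {"dog", "##s", "token", "##ization", "annoying", "##ly"}

for word in ["dog", "dogs", "tokenization", "annoyingly", "cat"]:
    print(word, "->", split_word(word, vocab))
```

Frequent words such as "dog" stay whole, while "dogs" and "tokenization" are decomposed into known pieces, matching the two goals listed above.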

## Other algorithms

WordPiece (BERT), Byte-Pair Encoding (GPT-2), ...

To understand them properly, you need to read the papers.

## loading and saving

In [None]:
# loading
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer.tokenize("Using a Transformer network is simple")

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

In [None]:
tokenizer.save_pretrained("i_am_a_chicken")

('i_am_a_chicken/tokenizer_config.json',
 'i_am_a_chicken/special_tokens_map.json',
 'i_am_a_chicken/vocab.txt',
 'i_am_a_chicken/added_tokens.json',
 'i_am_a_chicken/tokenizer.json')

## encoding

* translating text to numbers
* two steps: tokenization -> conversion to input IDs

In [None]:
# tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


'##': the token is NOT the beginning of a word; it continues the previous token (the notation may differ between tokenizers)

In [None]:
# conversion to input IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


In [None]:
# calling the tokenizer directly performs both steps at once
print(tokenizer(sequence))

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


[101] ([CLS]) and [102] ([SEP]) are special tokens that appear at the beginning and the end of the sentence.
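
The difference between the two outputs can be reproduced by hand. The sketch below wraps the ID list from `convert_tokens_to_ids` with the two special-token IDs seen above; the values 101 and 102 are specific to this checkpoint (in real code, `tokenizer.cls_token_id` and `tokenizer.sep_token_id` would be the safer source), and `add_special_tokens` is a hypothetical helper.

```python
# IDs observed in the output above; they are checkpoint-specific.
CLS_ID, SEP_ID = 101, 102

def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a single sequence as [CLS] ids [SEP] and build the extra inputs."""
    input_ids = [cls_id] + list(ids) + [sep_id]
    return {
        "input_ids": input_ids,
        # A single sequence belongs entirely to segment 0.
        "token_type_ids": [0] * len(input_ids),
        # No padding here, so every position is attended to.
        "attention_mask": [1] * len(input_ids),
    }

ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]
encoded = add_special_tokens(ids)
print(encoded)
```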

## decoding

In [None]:
# decode also groups subword tokens back into whole words
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])

In [None]:
print(decoded_string)

Using a transformer network is simple
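
The grouping step can be sketched in plain Python: tokens that start with `##` are glued onto the previous token. `group_subwords` is a hypothetical helper; the real `decode` also handles details such as spacing around punctuation, which this sketch ignores.

```python
def group_subwords(tokens, prefix="##"):
    """Merge continuation tokens (marked with prefix) into the previous word."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: append it to the last word without the prefix.
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)

tokens = ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
text = group_subwords(tokens)
print(text)
```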


# Handling multiple sequences

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."
tokenized_inputs = tokenizer(sequence, return_tensors='pt')
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


* the tokenizer doesn't just convert the list of IDs into a tensor; it also adds a batch dimension on top of it

In [None]:
# Same result as above, except that [CLS] and [SEP] are missing
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])
print(input_ids)

tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])


In [None]:
output = model(input_ids)
print(output.logits)

tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


* batching: sending multiple sentences through the model

In [None]:
batched_ids = torch.tensor([ids, ids])
output = model(batched_ids)
print(output.logits)

tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


## padding the inputs

* while batching, sentences might have different lengths -> solution: padding
* uses the padding token

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [[200, 200, 200],
               [200, 200, tokenizer.pad_token_id]]

In [None]:
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


Why the logits for sequence2_ids differ from those of the second row of batched_ids:
* the attention layers also took the padding token into account
* to get identical logits, the attention layers must be told to ignore the padding token -> attention mask

## attention masks
* same shape as the input IDs tensor
* 0s and 1s: 0 means the token must be ignored by the attention layers
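
One common way a mask takes effect is by pushing the attention scores of masked positions to negative infinity before the softmax, so that their attention weights become zero. The sketch below shows that idea for a single row of scores in plain Python; `masked_softmax` and the score values are made up for illustration, and real models apply this to whole tensors.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask == 0 get zero weight.

    Assumes at least one position has mask == 1.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    # exp(-inf) evaluates to 0.0, so masked positions contribute nothing.
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0]  # hypothetical attention scores for three tokens
with_padding = masked_softmax(scores, [1, 1, 1])  # padding token is attended to
ignored = masked_softmax(scores, [1, 1, 0])       # padding token is masked out

print(with_padding)
print(ignored)
```

With the mask, the weights over the first two tokens are the same as if the third token were not there at all, which is why masked batches give the same logits as unpadded single sequences.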

In [None]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id]
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0]
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


## practice

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

sequence1 = "I've been waiting for a HuggingFace course my whole life."
sequence2 = "I hate this so much!"

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

ids1 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sequence1))
ids2 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sequence2))
output1 = model(torch.tensor([ids1]))
output2 = model(torch.tensor([ids2]))
print(output1.logits)
print(output2.logits)

tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
tensor([[ 3.1931, -2.6685]], grad_fn=<AddmmBackward0>)


In [None]:
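def padding(x, y):
  # Pad the shorter ID list with the pad token so both lists have the same length.
  # New lists are built, so the caller's lists are not modified in place.
  max_len = max(len(x), len(y))
  x = x + [tokenizer.pad_token_id] * (max_len - len(x))
  y = y + [tokenizer.pad_token_id] * (max_len - len(y))
  return torch.tensor([x, y])

def attention_mask(input_ids):
  # 0 where the ID equals the pad token, 1 everywhere else.
  # Assumes the pad ID never occurs as a real token in the input.
  return torch.where(input_ids == tokenizer.pad_token_id, 0, 1)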

In [None]:
batched_input = padding(ids1, ids2)
batched_mask = attention_mask(batched_input)
batched_output = model(batched_input, attention_mask=batched_mask)
print(batched_output.logits)

tensor([[-2.7276,  2.8789],
        [ 3.1931, -2.6685]], grad_fn=<AddmmBackward0>)
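
The two helpers above pad exactly two sequences and infer the mask from the pad ID. A possible generalization, sketched below in plain Python, pads any number of sequences and builds the mask from the original lengths instead, so it still works if the pad ID happens to appear as a real token. `pad_batch` is a hypothetical helper, and `pad_id=0` stands in for `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every ID list to the longest length and build the masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 for real tokens, 0 for the padding that was just added.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch_ids, batch_mask = pad_batch([[200, 200, 200], [200, 200], [200]], pad_id=0)
print(batch_ids)
print(batch_mask)
```

Both lists can then be wrapped in `torch.tensor(...)` and passed to the model as `input_ids` and `attention_mask`.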


In [None]:
output = tokenizer([sequence1, sequence2], padding=True, return_tensors="pt")
output

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

In [None]:
model(**output).logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

In [None]:
?tokenizer

## longer sequences

* most transformer models can handle sequences up to 512 or 1024 tokens
* solution 1: use a model with a longer supported sequence length (Longformer, LED)
* solution 2: truncate sequences

```
sequence = sequence[:max_sequence_length]
```

# Putting it all together

To summarize what we have covered so far:
* tokenization
* conversion to input IDs
* padding
* truncation
* attention masks
* after these steps, the result can be fed to the model as input

In [None]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been reading books of old. The legends and the myths. Achilles and his gold. Hercules and his gifts. Spider-Man's control."
model_inputs = tokenizer(sequence)

* the `model_inputs` variable contains the input IDs and the attention mask (for this DistilBERT checkpoint; a BERT tokenizer also returns `token_type_ids`)

In [None]:
sequences = ["Anything you can do, I can do better", "I can do anything better than you"]
model_inputs = tokenizer(sequences)
print(model_inputs)

{'input_ids': [[101, 2505, 2017, 2064, 2079, 1010, 1045, 2064, 2079, 2488, 102], [101, 1045, 2064, 2079, 2505, 2488, 2084, 2017, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}


In [None]:
# truncation
model_inputs = tokenizer(sequences, max_length=6, truncation=True)
print(model_inputs)

{'input_ids': [[101, 2505, 2017, 2064, 2079, 102], [101, 1045, 2064, 2079, 2505, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
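
Note that the truncated output still ends with 102 ([SEP]): truncation shortens the content but keeps the special tokens. The sketch below reproduces the first row by hand and contrasts it with naive slicing; `truncate_with_special_tokens` is a hypothetical helper that assumes the input starts with [CLS], ends with [SEP], and that `max_length` is at least 2.

```python
def truncate_with_special_tokens(input_ids, max_length, cls_id=101, sep_id=102):
    """Truncate the content tokens, then put [CLS] and [SEP] back."""
    content = input_ids[1:-1]            # strip the existing [CLS] and [SEP]
    content = content[: max_length - 2]  # leave room for the two special tokens
    return [cls_id] + content + [sep_id]

# First sequence from the untruncated output above.
full = [101, 2505, 2017, 2064, 2079, 1010, 1045, 2064, 2079, 2488, 102]

truncated = truncate_with_special_tokens(full, max_length=6)
naive = full[:6]  # plain slicing loses the closing [SEP]

print(truncated)
print(naive)
```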


In [None]:
# results can be returned as 'pt' (PyTorch), 'tf' (TensorFlow), or 'np' (NumPy)
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
print(model_inputs)
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
print(model_inputs)
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
print(model_inputs)

{'input_ids': tensor([[ 101, 2505, 2017, 2064, 2079, 1010, 1045, 2064, 2079, 2488,  102],
        [ 101, 1045, 2064, 2079, 2505, 2488, 2084, 2017,  102,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
{'input_ids': <tf.Tensor: shape=(2, 11), dtype=int32, numpy=
array([[ 101, 2505, 2017, 2064, 2079, 1010, 1045, 2064, 2079, 2488,  102],
       [ 101, 1045, 2064, 2079, 2505, 2488, 2084, 2017,  102,    0,    0]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 11), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]], dtype=int32)>}
{'input_ids': array([[ 101, 2505, 2017, 2064, 2079, 1010, 1045, 2064, 2079, 2488,  102],
       [ 101, 1045, 2064, 2079, 2505, 2488, 2084, 2017,  102,    0,    0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}


## Special tokens

In [None]:
sequence = "You might know everything I'm going to do, but that's not going to help you, since I know everything you're going to do! Strange, isn't it?"
model_inputs = tokenizer(sequence, return_tensors="pt")
print(model_inputs["input_ids"])

tensor([[ 101, 2017, 2453, 2113, 2673, 1045, 1005, 1049, 2183, 2000, 2079, 1010,
         2021, 2008, 1005, 1055, 2025, 2183, 2000, 2393, 2017, 1010, 2144, 1045,
         2113, 2673, 2017, 1005, 2128, 2183, 2000, 2079,  999, 4326, 1010, 3475,
         1005, 1056, 2009, 1029,  102]])


In [None]:
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[2017, 2453, 2113, 2673, 1045, 1005, 1049, 2183, 2000, 2079, 1010, 2021, 2008, 1005, 1055, 2025, 2183, 2000, 2393, 2017, 1010, 2144, 1045, 2113, 2673, 2017, 1005, 2128, 2183, 2000, 2079, 999, 4326, 1010, 3475, 1005, 1056, 2009, 1029]


In [None]:
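# What are the 101 and 102 tokens at the very beginning and end?
# input_ids has a batch dimension, so decode the first (and only) row.
print(tokenizer.decode(model_inputs['input_ids'][0]))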

[CLS] you might know everything i'm going to do, but that's not going to help you, since i know everything you're going to do! strange, isn't it? [SEP]


In [None]:
model(**model_inputs)

SequenceClassifierOutput(loss=None, logits=tensor([[ 3.6823, -2.9822]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

[CLS] is the special token added at the very beginning of the sentence, and [SEP] is the one added at the very end.


## from tokenizer to model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["Anything you can do, I can do better", "I can do anything better than you"]
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

In [None]:
output.logits

tensor([[ 2.9842, -2.4813],
        [ 2.0235, -1.6446]], grad_fn=<AddmmBackward0>)