In [1]:
### Tokenizer call
from transformers import AutoTokenizer
name = 'klue/bert-base'



In [6]:
# https://huggingface.co/docs/transformers/model_doc/auto
tokenizer = AutoTokenizer.from_pretrained(name)
# Available configuration options
# https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/configuration#transformers.PretrainedConfig

In [7]:
tokenizer

PreTrainedTokenizerFast(name_or_path='klue/bert-base', vocab_size=32000, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [8]:
# The tokenizer has model_max_len,
# and "Parameters for sequence generation" has max_length.

![image.png](attachment:b28e4070-ce6a-4498-9aeb-002e36857e8f.png)

In [9]:
# AutoTokenizer
# Tokenizer types
# https://github.com/huggingface/transformers/blob/v4.18.0/src/transformers/models/auto/tokenization_auto.py#L351

![image.png](attachment:b9ffef42-918d-481a-8a58-37ec9fa486db.png)

### How a Tokenizer comes into being
* SpecialTokensMixin and PushToHubMixin are inherited by
* [PreTrainedTokenizerBase](https://github.com/huggingface/transformers/blob/31ec2cb2badfbdd4c1ac9c6c9b8a74e974984206/src/transformers/tokenization_utils_base.py#L1428), which is inherited by
* [PreTrainedTokenizer](https://github.com/huggingface/transformers/blob/31ec2cb2badfbdd4c1ac9c6c9b8a74e974984206/src/transformers/tokenization_utils.py#L332), which is inherited by
* the concrete model's Tokenizer

In [None]:
### What actually runs when the Tokenizer is called?


### Which kwargs apply when constructing a Tokenizer, and which when invoking `__call__`?
* Why do I want to know? -> I mistakenly assumed that setting the Tokenizer's model_max_length or max_len would make inputs fit automatically

In [14]:
# Korean test sentence, roughly: "This competition is hard, but I am studying step by step"
test_sentence = ['이번 대회는 어렵지만 차근차근 공부를 해봅니다']
tokenizer = AutoTokenizer.from_pretrained(name,
                                          max_len=5,
                                          use_fast=True)

In [15]:
tokenizer(test_sentence)  # I thought I had set max_len, so why is nothing truncated?

{'input_ids': [[2, 3686, 3931, 2259, 4258, 3683, 16276, 4244, 2138, 1897, 29384, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [18]:
tokenizer(test_sentence, max_length=5)
# It is not even max_len or model_max_length; so what is max_length?
# Either way, the input got truncated..

{'input_ids': [[2, 3686, 3931, 2259, 3]], 'token_type_ids': [[0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1]]}

In [19]:
tokenizer(test_sentence, max_length=5, stride=2)  # passing stride alone does not do the trick

{'input_ids': [[2, 3686, 3931, 2259, 3]], 'token_type_ids': [[0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1]]}

In [20]:
tokenizer(test_sentence, max_length=5, stride=2, return_overflowing_tokens=True)

{'input_ids': [[2, 3686, 3931, 2259, 3], [2, 3931, 2259, 4258, 3], [2, 2259, 4258, 3683, 3], [2, 4258, 3683, 16276, 3], [2, 3683, 16276, 4244, 3], [2, 16276, 4244, 2138, 3], [2, 4244, 2138, 1897, 3], [2, 2138, 1897, 29384, 3]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]], 'overflow_to_sample_mapping': [0, 0, 0, 0, 0, 0, 0, 0]}
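The eight windows in this output follow from simple arithmetic. With `max_length=5` and two special tokens (`[CLS]`=2, `[SEP]`=3), each window holds 3 content tokens, and `stride=2` means consecutive windows share 2 tokens, so the window advances by 1 token at a time. Below is a minimal pure-Python sketch of that behaviour. `sliding_windows` is a hypothetical helper, not a transformers API, and it assumes a single sentence wrapped in exactly two special tokens.

```python
def sliding_windows(ids, max_length, stride, cls_id=2, sep_id=3):
    """Split content ids into overlapping windows (illustrative helper only)."""
    window = max_length - 2   # room left after [CLS] and [SEP]
    step = window - stride    # how far each new window advances
    if window <= 0 or step <= 0:
        raise ValueError("max_length must leave room and stride must be smaller than the window")
    chunks = []
    start = 0
    while True:
        chunk = ids[start:start + window]
        chunks.append([cls_id] + chunk + [sep_id])
        if start + window >= len(ids):
            break  # the last window reached the end of the sentence
        start += step
    return chunks


# Content ids of the test sentence, taken from the untruncated output above
# (special tokens 2 and 3 removed).
content_ids = [3686, 3931, 2259, 4258, 3683, 16276, 4244, 2138, 1897, 29384]
windows = sliding_windows(content_ids, max_length=5, stride=2)
for w in windows:
    print(w)
```

With 10 content tokens, a window of 3 and a step of 1, this yields 10 - 3 + 1 = 8 windows, matching the `input_ids` above.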

### Let's look at what runs when the Tokenizer's `__call__` is invoked
* Looking at `__call__` in [PreTrainedTokenizerBase](https://github.com/huggingface/transformers/blob/31ec2cb2badfbdd4c1ac9c6c9b8a74e974984206/src/transformers/tokenization_utils_base.py#L2372)...
* it returns `encode_plus` (or `batch_encode_plus` for batched input), and `encode_plus` returns `_encode_plus`
* `_encode_plus` raises NotImplementedError. In other words, subclasses are apparently expected to override it.
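The call chain above is a plain template-method pattern. The toy classes below sketch its shape only; `ToyTokenizerBase` and `ToyWhitespaceTokenizer` are hypothetical names with a made-up vocabulary, and they do not reproduce the real signatures in transformers.

```python
class ToyTokenizerBase:
    """Hypothetical stand-in for the base class: public entry points live here."""

    def __call__(self, text, **kwargs):
        # The public call simply forwards to encode_plus.
        return self.encode_plus(text, **kwargs)

    def encode_plus(self, text, **kwargs):
        # encode_plus forwards again to the private hook.
        return self._encode_plus(text, **kwargs)

    def _encode_plus(self, text, **kwargs):
        # The base class leaves the real work to subclasses.
        raise NotImplementedError


class ToyWhitespaceTokenizer(ToyTokenizerBase):
    """Hypothetical subclass that fills in the hook with a toy vocabulary."""

    def __init__(self, vocab):
        self.vocab = vocab

    def _encode_plus(self, text, **kwargs):
        unk = self.vocab["[UNK]"]
        ids = [self.vocab.get(tok, unk) for tok in text.split()]
        return {"input_ids": ids}


toy = ToyWhitespaceTokenizer({"[UNK]": 0, "hello": 1, "world": 2})
result = toy("hello tokenizer world")
print(result)

# Calling the base class directly hits the unimplemented hook.
try:
    ToyTokenizerBase()("hello")
    base_raised = False
except NotImplementedError:
    base_raised = True
print(base_raised)
```

Only the subclass knows how to turn text into ids; the base class owns the public API and the shared argument handling.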

### Looking at `_encode_plus` in [PreTrainedTokenizer](https://github.com/huggingface/transformers/blob/v4.18.0/src/transformers/tokenization_utils.py#L332)!!!
* tokenize runs, and then convert_tokens_to_ids runs
![image.png](attachment:b74d1adf-1014-4938-be29-01f69dce7061.png)
* Then `prepare_for_model` runs, and that one is implemented back [in the Base class](https://github.com/huggingface/transformers/blob/31ec2cb2badfbdd4c1ac9c6c9b8a74e974984206/src/transformers/tokenization_utils_base.py#L2897).

### Difference between encode and encode_plus
* encode_plus attaches the extra fields needed for training [(reference)](https://stackoverflow.com/questions/61708486/whats-difference-between-tokenizer-encode-and-tokenizer-encode-plus-in-hugging)
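Put differently, `encode` gives back only the list of ids, while `encode_plus` returns a dict that also carries `token_type_ids` and `attention_mask`. A toy sketch of that relationship follows. The `toy_` functions and the hard-coded special-token ids (2 and 3, as seen in the outputs earlier in this notebook) are for illustration only and are not the library's implementation.

```python
CLS_ID, SEP_ID = 2, 3  # special-token ids observed in this notebook's outputs


def toy_encode_plus(content_ids):
    """Return ids plus the extra fields a model usually needs (toy version)."""
    input_ids = [CLS_ID] + list(content_ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence: all segment 0
        "attention_mask": [1] * len(input_ids),  # no padding: attend everywhere
    }


def toy_encode(content_ids):
    """Return only the ids, discarding the extra fields."""
    return toy_encode_plus(content_ids)["input_ids"]


full = toy_encode_plus([3686, 3931, 2259])
ids_only = toy_encode([3686, 3931, 2259])
print(full)
print(ids_only)
```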



In [22]:
# https://github.com/huggingface/transformers/blob/31ec2cb2badfbdd4c1ac9c6c9b8a74e974984206/src/transformers/tokenization_utils_base.py#L2191
tokenizer.encode(test_sentence[0],
                 max_length=5, 
                 stride=2,
                 return_overflowing_tokens=True )

[[2, 3686, 3931, 2259, 3],
 [2, 3931, 2259, 4258, 3],
 [2, 2259, 4258, 3683, 3],
 [2, 4258, 3683, 16276, 3],
 [2, 3683, 16276, 4244, 3],
 [2, 16276, 4244, 2138, 3],
 [2, 4244, 2138, 1897, 3],
 [2, 2138, 1897, 29384, 3]]

In [23]:
tokenizer.encode(test_sentence[0],
                 max_length=5, 
                 stride=2, )

[2, 3686, 3931, 2259, 3]

In [24]:
tokenizer.encode_plus(test_sentence[0],
                 max_length=5, 
                 stride=2,
                 return_overflowing_tokens=True )

{'input_ids': [[2, 3686, 3931, 2259, 3], [2, 3931, 2259, 4258, 3], [2, 2259, 4258, 3683, 3], [2, 4258, 3683, 16276, 3], [2, 3683, 16276, 4244, 3], [2, 16276, 4244, 2138, 3], [2, 4244, 2138, 1897, 3], [2, 2138, 1897, 29384, 3]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]], 'overflow_to_sample_mapping': [0, 0, 0, 0, 0, 0, 0, 0]}

### What [prepare_for_model](https://github.com/huggingface/transformers/blob/31ec2cb2badfbdd4c1ac9c6c9b8a74e974984206/src/transformers/tokenization_utils_base.py#L2897) does == a whole lot
* truncation
* managing overflowing tokens
* padding
* Hmm, but it seems to have been moved somewhere.. [here](https://github.com/huggingface/transformers/blob/6d80c92c77593dc674052b5a46431902e6adfe88/src/transformers/models/layoutlmv2/tokenization_layoutlmv2.py#L913)..? Is that a consolidated version?
![image.png](attachment:050b8142-27c3-4a58-8d3b-5cd325e470db.png)
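To make the three steps concrete, here is a toy version of truncate, collect overflow, then pad, for a single sentence. `toy_prepare_for_model` is a hypothetical helper, and `pad_id=0` is an assumption about the `[PAD]` id; the real method handles many more options (sentence pairs, strategies, return types).

```python
def toy_prepare_for_model(ids, max_length, pad_id=0, cls_id=2, sep_id=3):
    """Truncate, keep the overflow, add special tokens, then pad (toy version)."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    budget = max_length - 2                      # room left for content tokens
    kept, overflow = ids[:budget], ids[budget:]  # 1) truncate, 2) remember overflow
    input_ids = [cls_id] + kept + [sep_id]
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)          # 3) pad on the right
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "overflowing_tokens": overflow,
    }


short = toy_prepare_for_model([3686, 3931], max_length=6)
long = toy_prepare_for_model([3686, 3931, 2259, 4258, 3683], max_length=5)
print(short)
print(long)
```

The short input is padded up to `max_length`, while the long input is cut and its tail is returned as overflow instead of being silently dropped.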


### model_max_length, max_len, max_length: what are they?
* [PreTrainedTokenizerBase line 1456](https://github.com/huggingface/transformers/blob/v4.18.0/src/transformers/tokenization_utils_base.py#L1428) shows that the Tokenizer's two init inputs, `model_max_length` and `max_len`, are handled as below
![image.png](attachment:76433940-7934-45e1-8e00-18759536d1c1.png)

* In [INIT_TOKENIZER_DOCSTRING](https://github.com/huggingface/transformers/blob/31ec2cb2badfbdd4c1ac9c6c9b8a74e974984206/src/transformers/tokenization_utils_base.py#L1384), model_max_length is described as below,
![image.png](attachment:e7cb0e65-748a-4fb1-9ee6-9c403a3c323d.png)


* However, [according to ENCODE_KWARGS_DOCSTRING, what actually takes part in truncation](https://github.com/huggingface/transformers/blob/31ec2cb2badfbdd4c1ac9c6c9b8a74e974984206/src/transformers/tokenization_utils_base.py#L1281) is `max_length`, so I am not sure what the difference between them is

![image.png](attachment:6e18e5eb-cb51-4822-8e51-28be8e5f1316.png)
![image.png](attachment:9fe241f6-bbb1-4849-8156-b4c0541181f5.png)
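One way to read the relationship, as far as I can tell from the source and from the outputs in this notebook: `max_length` is the per-call limit, while `model_max_length` is a tokenizer-level default that only comes into play when truncation is requested without a `max_length`. The sketch below models just that reading. `resolve_max_length` is a hypothetical helper, the explicit opt-out case is not modelled, and the whole thing is a simplification worth checking against the library source.

```python
VERY_LARGE_INTEGER = int(1e30)  # sentinel meaning "no model limit configured"


def resolve_max_length(max_length=None, truncation=None,
                       model_max_length=VERY_LARGE_INTEGER):
    """Return the length to truncate to, or None for no truncation (simplified)."""
    if max_length is not None:
        # Observed in In [18]: passing max_length alone was enough to truncate.
        return max_length
    if not truncation:
        # Observed in In [15]: with neither argument, nothing was cut.
        return None
    if model_max_length >= VERY_LARGE_INTEGER:
        # Assumption: no usable model limit means there is nothing to fall back to.
        return None
    # Assumption: truncation requested without max_length uses the model limit.
    return model_max_length


print(resolve_max_length())                                         # plain call
print(resolve_max_length(max_length=5))                             # per-call limit
print(resolve_max_length(truncation=True, model_max_length=512))    # fallback
```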

### Where model_max_length is used
* Looking at actual usages, it is read when certain property functions are called, but I do not yet know at what point those functions get invoked.
![image.png](attachment:0db400ab-97c3-4692-8f9a-9f0933c7ff33.png)
![image.png](attachment:0e38cf75-8c82-45a2-b3ef-06bfe08f1b81.png)
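Properties such as `max_len_single_sentence` and `max_len_sentences_pair` on the base class subtract the number of special tokens from `model_max_length`; presumably these are the ones meant here. A toy sketch of that idea is below. `ToyLengthInfo` is a hypothetical class, and the special-token counts (2 for one sentence, 3 for a pair) assume a BERT-style `[CLS] A [SEP]` / `[CLS] A [SEP] B [SEP]` layout.

```python
class ToyLengthInfo:
    """Hypothetical holder showing how length properties derive from model_max_length."""

    def __init__(self, model_max_length, specials_single=2, specials_pair=3):
        self.model_max_length = model_max_length
        self.specials_single = specials_single  # [CLS] ... [SEP]
        self.specials_pair = specials_pair      # [CLS] ... [SEP] ... [SEP]

    @property
    def max_len_single_sentence(self):
        # Content tokens that fit when encoding one sentence.
        return self.model_max_length - self.specials_single

    @property
    def max_len_sentences_pair(self):
        # Content tokens that fit when encoding a sentence pair.
        return self.model_max_length - self.specials_pair


info = ToyLengthInfo(model_max_length=512)
print(info.max_len_single_sentence)
print(info.max_len_sentences_pair)
```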

### Where max_length is used
* It ends up being used in [truncate_sequences](https://github.com/huggingface/transformers/blob/31ec2cb2badfbdd4c1ac9c6c9b8a74e974984206/src/transformers/tokenization_utils_base.py#L2979)
![image.png](attachment:ed42b67d-eeb1-474b-865c-64e8f6061ad6.png)
* With the [BertFast](https://github.com/huggingface/transformers/blob/6d80c92c77593dc674052b5a46431902e6adfe88/src/transformers/tokenization_utils_fast.py#L76) version, truncation and padding are applied together
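For a sentence pair, the interesting part of truncation is deciding which side loses tokens. Below is a toy sketch of a "longest first" strategy, where one token at a time is dropped from whichever sequence is currently longer. `toy_truncate_longest_first` is a hypothetical helper that assumes right-side truncation; the real `truncate_sequences` supports other strategies and also tracks overflowing tokens.

```python
def toy_truncate_longest_first(ids, pair_ids, num_tokens_to_remove):
    """Drop tokens one by one from the end of the currently longer sequence."""
    ids, pair_ids = list(ids), list(pair_ids)  # work on copies
    for _ in range(num_tokens_to_remove):
        if not ids and not pair_ids:
            break  # nothing left to remove
        if len(ids) > len(pair_ids):
            ids.pop()       # first sequence is longer: trim it
        else:
            pair_ids.pop()  # otherwise trim the second sequence
    return ids, pair_ids


first, second = toy_truncate_longest_first([1, 2, 3, 4, 5], [6, 7], 4)
print(first, second)
```

In this example the first sequence is trimmed from 5 tokens down to 2, and only then does the second sequence lose a token.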

In [25]:
### encode_batch has to be understood first to see how offsets work
# https://github.com/huggingface/transformers/blob/6d80c92c77593dc674052b5a46431902e6adfe88/src/transformers/tokenization_utils_fast.py#L425

![image.png](attachment:0f65c99a-d2ec-4cf7-97e2-309c8bb34150.png)
![image.png](attachment:0410c993-319b-4d29-a4df-808af1043bdc.png)