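# What Is a Pretrained BERT Tokenizer?
**BERT (Bidirectional Encoder Representations from Transformers)** is a pretrained Transformer model developed by Google.
BERT performs strongly on a range of natural language processing (NLP) tasks, including sentence understanding, sentence classification, and question answering.

The BERT tokenizer is the tool that converts text into numbers (IDs) that can be fed to a BERT model.
It processes text with the same rules that were used when the BERT model was trained.
In other words, because BERT operates only on numbers (token IDs) and not on raw text, the text must first be split into tokens and then converted to integer indices.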
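
The text-to-ID step described above can be sketched in plain Python. This is only an illustration under stated assumptions: `TOY_VOCAB` and `encode` are hypothetical names, the IDs are made up rather than taken from the real `bert-base-uncased` vocabulary, and splitting on whitespace stands in for real tokenization.

```python
# Toy vocabulary with made-up IDs (an assumption, not the real BERT vocabulary).
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "movie": 5, "was": 6, "great": 7}

def encode(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, wrap in [CLS]/[SEP], and map tokens to IDs."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    unk = vocab["[UNK]"]
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    return [vocab.get(tok, unk) for tok in tokens]

print(encode("The movie was great"))  # [2, 4, 5, 6, 7, 3]
print(encode("The film was great"))   # [2, 4, 1, 6, 7, 3]
```

The second call shows why a fixed vocabulary matters: a word the vocabulary does not contain collapses to `[UNK]`, which is the problem subword tokenization is meant to reduce.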

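### BERT Tokenizer Features
1. Uses a WordPiece tokenizer
  - Splits sentences into subword units rather than whole words
  - Frequent words are kept intact,
  - while rare words are broken into smaller units (subwords).
2. Pretrained tokenizer
  - Its vocabulary has already been learned from a large corpus (books, Wikipedia, etc.)
  - So it can be used right away, with no separate training
3. Automated sentence preprocessing
  - Handles tokenization and, when requested, padding and truncation
  - Adjusts the token length to fit the model and automatically adds special tokens such as [CLS] and [SEP].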
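
The subword splitting in item 1 can be sketched as a greedy longest-match-first lookup, which is how WordPiece segments a single word at tokenization time. The function name `wordpiece` and the tiny vocabulary are hypothetical, chosen only for illustration. The real tokenizer also normalizes text and splits off punctuation before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (an assumption, not the real BERT vocabulary).
TOY_VOCAB = {"survey", "93", "##5", "token", "##izer", "##s"}
print(wordpiece("survey", TOY_VOCAB))      # ['survey']
print(wordpiece("935", TOY_VOCAB))         # ['93', '##5']
print(wordpiece("tokenizers", TOY_VOCAB))  # ['token', '##izer', '##s']
```

A word that is in the vocabulary stays whole, while a rarer one is assembled from smaller pieces, matching the two bullets under item 1.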

In [91]:
import torch

tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')

# Example text
text = """
SA new survey in Japan has shown that many people may be giving up on "happily ever after" when they get older than 35.

A survey by the Nippon Foundation asked 6,000 people, aged 15 to 45, about the country's falling birth rate. Among the 3,935 single participants, 46% said they want to get married, while 33% said they don't.

However, when asked if they really think they will get married, only 27% said yes, while 39% said no.

Younger people were more hopeful about marriage, with about 40% of those aged 15 to 25 thinking they will marry. But over 50% of single people aged 36 to 45 believe they won't get married, suggesting people may give up on marriage as they age.

Of the 1,313 people who said they don't want to get married, 40% said they prefer being single. Other reasons for staying single included not being able to see any benefit to marrying, and wanting to put their own life first. More than 20% didn't have a clear reason or just didn't know why.

For the 267 people who want to marry but think they won't, the top reason for thinking it won't happen is not having chances to meet potential partners, which was mentioned by 49% of these people. Others said they weren't good at getting along with people of the opposite sex, or felt financially insecure.

Where people lived also affected views on marriage. About 30% of people in central Tokyo and big cities thought they would marry, but only just over 20% of those in smaller towns and villages felt the same.

Marriages in Japan have been decreasing for years. In 2023, there were 3.9 marriages per 1,000 people in Japan — down from a high of 12 marriages per 1,000 people in 1947.
"""

# Tokenize the text
tokens = tokenizer.tokenize(text)
print(len(text))    # number of characters
print(len(tokens))  # number of WordPiece tokens
print(tokens)



1660
378
['sa', 'new', 'survey', 'in', 'japan', 'has', 'shown', 'that', 'many', 'people', 'may', 'be', 'giving', 'up', 'on', '"', 'happily', 'ever', 'after', '"', 'when', 'they', 'get', 'older', 'than', '35', '.', 'a', 'survey', 'by', 'the', 'nippon', 'foundation', 'asked', '6', ',', '000', 'people', ',', 'aged', '15', 'to', '45', ',', 'about', 'the', 'country', "'", 's', 'falling', 'birth', 'rate', '.', 'among', 'the', '3', ',', '93', '##5', 'single', 'participants', ',', '46', '%', 'said', 'they', 'want', 'to', 'get', 'married', ',', 'while', '33', '%', 'said', 'they', 'don', "'", 't', '.', 'however', ',', 'when', 'asked', 'if', 'they', 'really', 'think', 'they', 'will', 'get', 'married', ',', 'only', '27', '%', 'said', 'yes', ',', 'while', '39', '%', 'said', 'no', '.', 'younger', 'people', 'were', 'more', 'hopeful', 'about', 'marriage', ',', 'with', 'about', '40', '%', 'of', 'those', 'aged', '15', 'to', '25', 'thinking', 'they', 'will', 'marry', '.', 'but', 'over', '50', '%', 'of', 
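
In the token list printed above, the number 935 appears as `'93', '##5'`: the `##` prefix marks a piece that continues the previous token. A minimal sketch of gluing such pieces back together follows. The helper name `merge_wordpieces` is hypothetical, and the sketch does not restore the original casing or spacing.

```python
def merge_wordpieces(tokens):
    """Attach every '##' continuation piece to the token before it."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the ## marker and append
        else:
            words.append(tok)
    return words

sample = ['among', 'the', '3', ',', '93', '##5', 'single', 'participants']
print(merge_wordpieces(sample))
# ['among', 'the', '3', ',', '935', 'single', 'participants']
```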

In [92]:
from datasets import load_dataset
from torch.utils.data import DataLoader
ds = load_dataset("stanfordnlp/imdb")
print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [93]:
print(ds['train'])

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


In [94]:
print(ds['train'].shape)

(25000, 2)


In [95]:
print(ds['train'][0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [96]:
#ds['train']['text']

In [97]:
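# Encode the first 10 reviews, truncating each one to at most 400 tokens
tokens = tokenizer(ds['train']['text'][:10], truncation=True, max_length=400)
# len() counts the keys of the result (input_ids, token_type_ids, attention_mask), not the reviews
print(len(tokens))
print(tokens)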

3
{'input_ids': [[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 

In [98]:
# padding=True pads every sequence to the length of the longest one in the batch
tokens_with_padding = tokenizer(ds['train']['text'][:10], padding=True, truncation=True, max_length=400)
print(len(tokens_with_padding))
print(tokens_with_padding)


3
{'input_ids': [[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 

In [99]:
print("== Before padding ==")
print(f"Token IDs: {tokens['input_ids']}")
print(f"Length: {len(tokens['input_ids'])}\n")


== Before padding ==
Token IDs: [[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3

In [100]:
print("== After padding ==")
print(f"Token IDs: {tokens_with_padding['input_ids']}")

== After padding ==
Token IDs: [[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3

In [101]:
#print(tokens_with_padding['input_ids'].shape)
# len() of the nested list is the number of sequences in the batch
print(f"Length: {len(tokens_with_padding['input_ids'])}\n")


Length: 10



In [102]:
#print(f"Padded input_ids:\n{tokens_with_padding['input_ids']}")
print(f"Padding mask:\n{tokens_with_padding['attention_mask']}")  # 0 marks padding, 1 marks real tokens

Padding mask:
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
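
Because the mask holds 1 for real tokens and 0 for padding, summing a row gives that sequence's true length, and filtering by the mask recovers the unpadded IDs. The sketch below uses small made-up rows rather than the IMDb batch, and `strip_padding` is a hypothetical helper name.

```python
def strip_padding(input_ids, attention_mask):
    """Keep only the positions where the mask is 1."""
    return [[tok for tok, keep in zip(ids, mask) if keep == 1]
            for ids, mask in zip(input_ids, attention_mask)]

# Made-up example rows: 101 and 102 stand for [CLS] and [SEP], 0 for [PAD].
padded_ids = [[101, 7, 8, 102, 0, 0], [101, 5, 6, 7, 8, 102]]
mask = [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]

print([sum(row) for row in mask])       # [4, 6]
print(strip_padding(padded_ids, mask))  # [[101, 7, 8, 102], [101, 5, 6, 7, 8, 102]]
```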

In [103]:
# This fails: the unpadded sequences have different lengths, so they cannot form a rectangular tensor.
t = torch.LongTensor(tokens.input_ids)
print(t.shape)

ValueError: expected sequence of length 363 at dim 1 (got 304)
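
The error above arises because a tensor must be rectangular while the unpadded rows differ in length. A rough sketch of what truncation plus padding does to such a batch is shown below. `pad_batch` is a hypothetical helper, and `pad_id=0` is assumed because `[PAD]` has ID 0 in `bert-base-uncased`. Unlike the real tokenizer, this simple slice does not keep `[SEP]` at the end of a truncated row.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate rows to max_length, then right-pad to the longest remaining row."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Made-up ragged batch: 101 and 102 stand for [CLS] and [SEP].
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 9, 102]]
ids, mask = pad_batch(batch, max_length=5)
print(ids)   # [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 5, 6, 7, 8]]
print(mask)  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Every row now has the same width, which is the property a tensor constructor needs.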

In [None]:
# After padding, every sequence has the same length, so the batch can become a tensor
t = torch.LongTensor(tokens_with_padding.input_ids)
print(t.shape)