### Comparing Hugging Face Transformers code for BERT and GPT-2

In [2]:
from transformers import AutoTokenizer, AutoModel

text = "What is Huggingface Transformers?"

# Using BERT: AutoModel picks the right model class for the checkpoint
bert_model = AutoModel.from_pretrained("bert-base-uncased")

# AutoTokenizer loads the tokenizer that matches the checkpoint
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoded_input = bert_tokenizer(text, return_tensors='pt')
bert_output = bert_model(**encoded_input)

# Using GPT-2: the same code, only the checkpoint name changes
gpt_model = AutoModel.from_pretrained('gpt2')
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')
encoded_input = gpt_tokenizer(text, return_tensors='pt')
gpt_output = gpt_model(**encoded_input)
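
The BERT and GPT-2 halves of the cell above differ only in the checkpoint name, because the `Auto*` classes choose the concrete class from the checkpoint's configuration. The following is a toy sketch of that dispatch idea, not the library's actual implementation; `ToyBertModel`, `ToyGPT2Model`, `TOY_REGISTRY`, and `toy_auto_model` are hypothetical names, and the config dictionaries are simplified stand-ins.

```python
# Toy illustration of config-driven dispatch (not the transformers implementation).
from dataclasses import dataclass


@dataclass
class ToyBertModel:
    hidden_size: int


@dataclass
class ToyGPT2Model:
    hidden_size: int


# Hypothetical registry: maps the "model_type" field of a config to a class.
TOY_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


def toy_auto_model(config: dict):
    """Pick the class registered for config["model_type"] and instantiate it."""
    model_type = config["model_type"]
    if model_type not in TOY_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return TOY_REGISTRY[model_type](hidden_size=config["hidden_size"])


bert_like = toy_auto_model({"model_type": "bert", "hidden_size": 768})
gpt_like = toy_auto_model({"model_type": "gpt2", "hidden_size": 768})
print(type(bert_like).__name__, type(gpt_like).__name__)
```

The caller never names a model class; adding a new architecture only means adding one registry entry.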



### Loading a model by model ID

In [3]:
from transformers import AutoModel
model_id = 'klue/roberta-base'
model = AutoModel.from_pretrained(model_id)

# Contents of the checkpoint's config.json
# {
#   "architectures": ["RobertaForMaskedLM"],
#   "attention_probs_dropout_prob": 0.1,
#   "bos_token_id": 0,
#   "eos_token_id": 2,
#   "gradient_checkpointing": false,
#   "hidden_act": "gelu",
#   "hidden_dropout_prob": 0.1,
#   "hidden_size": 768,
#   "initializer_range": 0.02,
#   "intermediate_size": 3072,
#   "layer_norm_eps": 1e-05,
#   "max_position_embeddings": 514,
#   "model_type": "roberta",
#   "num_attention_heads": 12,
#   "num_hidden_layers": 12,
#   "pad_token_id": 1,
#   "type_vocab_size": 1,
#   "vocab_size": 32000,
#   "tokenizer_class": "BertTokenizer"
# }


Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
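
A few quantities follow directly from the `config.json` fields listed above. This is a minimal sketch using only a subset of those fields; the interpretation of `max_position_embeddings` assumes the RoBERTa convention that position ids start at `pad_token_id + 1`.

```python
import json

# A subset of the config.json fields shown above.
config = json.loads("""
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 514,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "vocab_size": 32000
}
""")

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Assumption: RoBERTa-style position ids start at pad_token_id + 1,
# so the first pad_token_id + 1 position slots are not used for real tokens.
max_tokens = config["max_position_embeddings"] - config["pad_token_id"] - 1

# Size of the token embedding table (vocab_size x hidden_size).
token_embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim, max_tokens, token_embedding_params)  # 64 512 24576000
```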


### Loading a tokenizer

In [4]:
from transformers import AutoTokenizer
model_id = 'klue/roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_id)

### Using the tokenizer

In [5]:
tokenized = tokenizer("토크나이저는 텍스트를 토큰 단위로 나눈다")


print(tokenized)
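# input_ids: the vocabulary ID of each token
# token_type_ids: segment IDs that tell sentences apart; all 0 here because the input is a single sentence
# attention_mask: 1 for real tokens, 0 for padding tokens; all 1 here because nothing was padded
#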
# {'input_ids': [0, 9157, 7461, 2190, 2259, 8509, 2138, 1793, 2855, 5385, 2200, 20950, 2],
#  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


print(tokenizer.convert_ids_to_tokens(tokenized['input_ids']))
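# convert_ids_to_tokens maps the numeric IDs back to token strings.
# [CLS] and [SEP] mark the start and the end of the sentence, respectively.
# "##" is the prefix attached to every token after the first when a single word is split into several tokens.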

# ['[CLS]', '토크', '##나이', '##저', '##는', '텍스트', '##를', '토', '##큰', '단위', '##로', '나눈다', '[SEP]']

print(tokenizer.decode(tokenized['input_ids']))
# [CLS] 토크나이저는 텍스트를 토큰 단위로 나눈다 [SEP]
# decode converts the token IDs back into a single text string.

print(tokenizer.decode(tokenized['input_ids'], skip_special_tokens=True))
# 토크나이저는 텍스트를 토큰 단위로 나눈다
# skip_special_tokens=True drops the special tokens from the decoded text.


{'input_ids': [0, 9157, 7461, 2190, 2259, 8509, 2138, 1793, 2855, 5385, 2200, 20950, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', '토크', '##나이', '##저', '##는', '텍스트', '##를', '토', '##큰', '단위', '##로', '나눈다', '[SEP]']
[CLS] 토크나이저는 텍스트를 토큰 단위로 나눈다 [SEP]
토크나이저는 텍스트를 토큰 단위로 나눈다
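
The `##` pieces in the output above are what a WordPiece-style tokenizer produces (the config lists `BertTokenizer` as the tokenizer class). The sketch below assumes the greedy longest-match-first rule commonly used by WordPiece tokenizers; `toy_wordpiece` is a hypothetical helper and `toy_vocab` is a hand-picked subset of tokens taken from the output above, not the real vocabulary.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"토크", "##나이", "##저", "##는", "텍스트", "##를", "토", "##큰"}
for word in ["토크나이저는", "텍스트를", "토큰"]:
    print(word, "->", toy_wordpiece(word, toy_vocab))
```

With this toy vocabulary the three words split into the same pieces that `convert_ids_to_tokens` printed for them.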


### Downloading the KLUE MRC dataset

In [7]:
from datasets import load_dataset
klue_mrc_dataset = load_dataset('klue', 'mrc')
# klue_mrc_dataset_only_train = load_dataset('klue', 'mrc', split='train')

Generating train split: 100%|██████████| 17554/17554 [00:00<00:00, 335532.09 examples/s]
Generating validation split: 100%|██████████| 5841/5841 [00:00<00:00, 476614.33 examples/s]


### Training a model

In [1]:
from datasets import load_dataset
klue_tc_train = load_dataset('klue', 'ynat', split='train')
klue_tc_eval = load_dataset('klue', 'ynat', split='validation')
klue_tc_train



Dataset({
    features: ['guid', 'title', 'label', 'url', 'date'],
    num_rows: 45678
})

In [2]:
klue_tc_train.features['label'].names


['IT과학', '경제', '사회', '생활문화', '세계', '스포츠', '정치']

In [3]:
klue_tc_train = klue_tc_train.remove_columns(['guid', 'url', 'date'])
klue_tc_eval = klue_tc_eval.remove_columns(['guid', 'url', 'date'])
klue_tc_train

Dataset({
    features: ['title', 'label'],
    num_rows: 45678
})

In [4]:
klue_tc_train.features['label']
# ClassLabel(names=['IT과학', '경제', '사회', '생활문화', '세계', '스포츠', '정치'], id=None)

klue_tc_train.features['label'].int2str(1)
# '경제'

klue_tc_label = klue_tc_train.features['label']

def make_str_label(batch):
  batch['label_str'] = klue_tc_label.int2str(batch['label'])
  return batch

klue_tc_train = klue_tc_train.map(make_str_label, batched=True, batch_size=1000)

klue_tc_train[0]
# {'title': '유튜브 내달 2일까지 크리에이터 지원 공간 운영', 'label': 3, 'label_str': '생활문화'}

{'title': '유튜브 내달 2일까지 크리에이터 지원 공간 운영', 'label': 3, 'label_str': '생활문화'}
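
The mapping that `int2str` performs can be pictured as a plain list lookup. This is a minimal stand-in for illustration, not the `datasets` implementation; the module-level `int2str` and `str2int` below are hypothetical helpers built from the label names printed earlier.

```python
names = ['IT과학', '경제', '사회', '생활문화', '세계', '스포츠', '정치']

# Integer id -> name is a list lookup; name -> id is the inverse dictionary.
str2int = {name: idx for idx, name in enumerate(names)}


def int2str(ids):
    """Accept a single id or a list of ids, like the batched call in make_str_label."""
    if isinstance(ids, int):
        return names[ids]
    return [names[i] for i in ids]


print(int2str(1))           # 경제
print(int2str([3, 0, 6]))   # ['생활문화', 'IT과학', '정치']
print(str2int['생활문화'])    # 3
```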

In [5]:
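# Use 10,000 shuffled examples from the original train split as the training set.
train_dataset = klue_tc_train.train_test_split(test_size=10000, shuffle=True, seed=42)['test']
# Carve 1,000 test examples out of the original validation split ...
dataset = klue_tc_eval.train_test_split(test_size=1000, shuffle=True, seed=42)
test_dataset = dataset['test']
# ... and take 1,000 of the remaining examples for validation during training.
valid_dataset = dataset['train'].train_test_split(test_size=1000, shuffle=True, seed=42)['test']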
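
The test and validation sets above do not overlap, because the validation set is drawn from what remains after the test set is removed. The sketch below shows that idea with the standard library only; `toy_split` is a hypothetical helper and does not reproduce the shuffling that `datasets` uses internally.

```python
import random


def toy_split(indices, test_size, seed):
    """Shuffle a copy of the indices with a fixed seed, then cut off test_size items."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]  # (remaining part, test part)


eval_indices = range(20)                      # stand-in for klue_tc_eval
rest, test_idx = toy_split(eval_indices, 5, seed=42)
_, valid_idx = toy_split(rest, 5, seed=42)    # validation comes from the remainder

print(sorted(test_idx), sorted(valid_idx))
print(set(test_idx).isdisjoint(valid_idx))    # True: no example is in both sets
```

Fixing the seed makes the split reproducible across runs.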


In [6]:
import torch
import numpy as np
from transformers import (
    Trainer,
    TrainingArguments,
    AutoModelForSequenceClassification,
    AutoTokenizer
)

def tokenize_function(examples):
    return tokenizer(examples["title"], padding="max_length", truncation=True)

model_id = "klue/roberta-base"
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=len(train_dataset.features['label'].names))
tokenizer = AutoTokenizer.from_pretrained(model_id)

train_dataset = train_dataset.map(tokenize_function, batched=True)
valid_dataset = valid_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1000/1000 [00:00<00:00, 10519.11 examples/s]
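
`tokenize_function` asks for `padding="max_length"` and `truncation=True`, so every title ends up with the same length. The sketch below shows the effect on `input_ids` and `attention_mask`; `toy_pad_and_truncate` is a hypothetical helper, the pad ID of 1 is taken from the config shown earlier, and unlike a real tokenizer it does not keep the closing special token when it truncates.

```python
def toy_pad_and_truncate(input_ids, max_length, pad_token_id=1):
    """Cut the sequence to max_length, then right-pad it and build the attention mask."""
    ids = input_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 0 marks padding
    }


example = [0, 9157, 7461, 2190, 2]
print(toy_pad_and_truncate(example, max_length=8))
# {'input_ids': [0, 9157, 7461, 2190, 2, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 0, 0, 0]}
print(toy_pad_and_truncate(example, max_length=3))
# {'input_ids': [0, 9157, 7461], 'attention_mask': [1, 1, 1]}
```

Padding every short title to the maximum length is simple but spends computation on padding positions.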


In [9]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    push_to_hub=False,
    eval_strategy="epoch",
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}
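
To see what `compute_metrics` calculates, here is a dependency-free sketch of the same two steps (argmax over the logits, then the share of matching labels) on hand-made values. `toy_accuracy` is a hypothetical helper, and the logits are invented for illustration.

```python
def toy_accuracy(logits, labels):
    """Argmax over each row of logits, then the fraction of rows that match the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


logits = [
    [2.0, 0.1, -1.0],   # argmax 0
    [0.3, 0.2, 1.5],    # argmax 2
    [0.0, 0.9, 0.1],    # argmax 1
    [1.2, 1.1, 0.0],    # argmax 0
]
labels = [0, 2, 2, 0]
print(toy_accuracy(logits, labels))  # {'accuracy': 0.75}
```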

In [10]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

trainer.evaluate(test_dataset)  # test accuracy 0.858 in this run (validation accuracy was 0.84)



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5136 | 0.523952 | 0.84 |




{'eval_loss': 0.4556775987148285,
 'eval_accuracy': 0.858,
 'eval_runtime': 19.7255,
 'eval_samples_per_second': 50.696,
 'eval_steps_per_second': 6.337,
 'epoch': 1.0}
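
The throughput figures in the evaluation output can be checked against the dataset and batch sizes used above. This sketch assumes a single device and no gradient accumulation; the numbers are copied from the cells and output above.

```python
import math

train_examples, eval_examples = 10_000, 1_000
batch_size = 8
eval_runtime = 19.7255  # seconds, from the evaluate() output above

# Assumption: a single device and no gradient accumulation.
train_steps_per_epoch = math.ceil(train_examples / batch_size)
eval_steps = math.ceil(eval_examples / batch_size)

print(train_steps_per_epoch)                    # 1250
print(round(eval_examples / eval_runtime, 3))   # 50.696, matches eval_samples_per_second
print(round(eval_steps / eval_runtime, 3))      # 6.337, matches eval_steps_per_second
```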

### Pushing the model to the Hub

In [13]:
token = ""

from huggingface_hub import login

login(token=token)
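repo_id = "JMYEO/roberta-base-klue-ynat-classification"
# If you trained with Trainer.
# Note: the first positional argument of Trainer.push_to_hub is the commit message;
# the target repository comes from TrainingArguments (for example hub_model_id).
trainer.push_to_hub(repo_id)
# If you trained with your own training loop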
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)


No files have been modified since last commit. Skipping to prevent empty commit.
model.safetensors: 100%|██████████| 443M/443M [01:18<00:00, 5.65MB/s] 


CommitInfo(commit_url='https://huggingface.co/JMYEO/roberta-base-klue-ynat-classification/commit/07289b2e12b05ea5dc4fe9dd70088f16635840ac', commit_message='Upload tokenizer', commit_description='', oid='07289b2e12b05ea5dc4fe9dd70088f16635840ac', pr_url=None, repo_url=RepoUrl('https://huggingface.co/JMYEO/roberta-base-klue-ynat-classification', endpoint='https://huggingface.co', repo_type='model', repo_id='JMYEO/roberta-base-klue-ynat-classification'), pr_revision=None, pr_num=None)

In [None]:
from transformers import pipeline

model_id = "<your-username>/roberta-base-klue-ynat-classification"

model_pipeline = pipeline("text-classification", model=model_id)

# `dataset` is a DatasetDict of splits, so take the titles from the test split instead.
model_pipeline(test_dataset[:5]["title"])
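
For single-label classification, a text-classification pipeline typically turns the model's logits into a label and a score with a softmax. The sketch below illustrates that step on invented logits; `toy_classify` is a hypothetical helper. Because the model above was created without an `id2label` mapping, the pipeline is expected to report generic names such as `LABEL_3`, which can be mapped back to the KLUE label names by index.

```python
import math

label_names = ['IT과학', '경제', '사회', '생활문화', '세계', '스포츠', '정치']


def toy_classify(logits):
    """Softmax over the logits, then report the most probable class."""
    shifted = [x - max(logits) for x in logits]   # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": f"LABEL_{best}", "name": label_names[best], "score": probs[best]}


result = toy_classify([0.1, 0.2, 0.0, 3.0, 0.1, -0.5, 0.3])
print(result["label"], result["name"], round(result["score"], 3))
```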