<a href="https://colab.research.google.com/github/freud-sensei/lab_intern/blob/main/headline_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Topic Prediction Model from News Headlines

In [None]:
!pip install transformers[sentencepiece]

In [None]:
!pip install transformers[torch]

In [None]:
!pip install evaluate

In [None]:
!pip install datasets

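I ran a very simple fine-tuning experiment.

The pretrained model is `klue/bert-base`, a Korean BERT-family model, and the downstream task is classifying news headlines into topics such as politics, economy, and sports.

For this, I fine-tuned the model on the `klue/ynat` dataset. The dataset consists of Korean news headlines, and each label is an integer that encodes the article's topic category.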

In [None]:
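# Note: this cell needs raw_datasets, which is loaded in the next cell.
# _str2int maps label name -> id, so the result is a label2id mapping.
label2id = raw_datasets['train'].features['label']._str2int
label2id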

{'IT과학': 0, '경제': 1, '사회': 2, '생활문화': 3, '세계': 4, '스포츠': 5, '정치': 6}
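
The mapping printed above goes from label name to integer id. As a minimal sketch, assuming that dictionary is copied in by hand rather than read from the dataset, the reverse direction (id to name) can be obtained by inverting it:

```python
# Mapping copied from the output above (label name -> integer id).
label2id = {'IT과학': 0, '경제': 1, '사회': 2, '생활문화': 3, '세계': 4, '스포츠': 5, '정치': 6}

# Invert it to obtain integer id -> label name.
id2label = {idx: name for name, idx in label2id.items()}

print(id2label[3])
```

The inversion is only safe because every id appears exactly once in the mapping.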

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("klue", "ynat")
checkpoint = "klue/bert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Generating train split:   0%|          | 0/45678 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9107 [00:00<?, ? examples/s]

In [None]:
raw_datasets["train"][0]

{'guid': 'ynat-v1_train_00000',
 'title': '유튜브 내달 2일까지 크리에이터 지원 공간 운영',
 'label': 3,
 'url': 'https://news.naver.com/main/read.nhn?mode=LS2D&mid=shm&sid1=105&sid2=227&oid=001&aid=0008508947',
 'date': '2016.06.30. 오전 10:36'}

In [None]:
labels = raw_datasets["train"].features['label'].names
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}

In [None]:
def tokenize_function(example):
  tokenized_example = tokenizer(example["title"], truncation=True)
  return tokenized_example

token_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
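
Because the tokenizer call above truncates but does not pad, sequences are padded later, per batch, by the data collator. The sketch below illustrates the idea of padding to the longest sequence in a batch. `pad_batch` is a hypothetical helper, not the library's implementation, and the pad id of 0 and the token ids are assumptions made up for illustration.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in this batch (hypothetical helper)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        # Real tokens get attention 1, padding gets attention 0.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
    return batch


# Toy token ids, not real tokenizer output.
features = [{"input_ids": [2, 10, 11, 12, 3]}, {"input_ids": [2, 10, 3]}]
batch = pad_batch(features)
print(batch["input_ids"])
```

Padding per batch keeps short headlines from being padded to the longest headline in the whole dataset.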

In [None]:
label2id

{'IT과학': 0, '경제': 1, '사회': 2, '생활문화': 3, '세계': 4, '스포츠': 5, '정치': 6}

In [None]:
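from transformers import TrainingArguments
# label_names expects input-dict keys (default: ["labels"]), not class names, so it is left unset.
# Newer transformers releases rename evaluation_strategy to eval_strategy.
training_args = TrainingArguments("dummy-kosts", save_strategy="epoch", evaluation_strategy="epoch", push_to_hub=True)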

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=7)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
import evaluate
import numpy as np

# Load the metric once instead of on every evaluation call
metric = evaluate.load("accuracy")

def compute_metrics(eval_preds):
  # Convert logits to predicted class ids
  predictions = np.argmax(eval_preds.predictions, axis=-1)
  return metric.compute(predictions=predictions, references=eval_preds.label_ids)
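
To make the metric concrete, here is a dependency-free sketch of what the accuracy computation amounts to: take the argmax of each row of logits and count how often it matches the reference label. The helper names and the toy logits are assumptions for illustration and are not part of `evaluate`.

```python
from typing import Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits: Sequence[Sequence[float]], label_ids: Sequence[int]) -> float:
    """Fraction of rows whose argmax equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, label_ids))
    return correct / len(label_ids)


# Made-up logits for four examples and three classes.
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 3.0], [0.9, 0.8, 0.7]]
toy_labels = [1, 0, 2, 1]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)
```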

In [None]:
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=token_datasets["train"],
    eval_dataset=token_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.4139 | 0.476314 | 0.852751 |
| 2 | 0.3358 | 0.552685 | 0.865049 |
| 3 | 0.2045 | 0.643323 | 0.86417 |
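
The logged metrics can be read more closely with a small sketch. The numbers below are copied from the results above, and the dictionary keys are names chosen here for illustration. The sketch picks the best epoch by accuracy and by validation loss and checks whether the validation loss rises every epoch.

```python
# Per-epoch metrics copied from the training log above.
history = [
    {"epoch": 1, "train_loss": 0.4139, "eval_loss": 0.476314, "eval_accuracy": 0.852751},
    {"epoch": 2, "train_loss": 0.3358, "eval_loss": 0.552685, "eval_accuracy": 0.865049},
    {"epoch": 3, "train_loss": 0.2045, "eval_loss": 0.643323, "eval_accuracy": 0.86417},
]

best_by_accuracy = max(history, key=lambda r: r["eval_accuracy"])
best_by_loss = min(history, key=lambda r: r["eval_loss"])

# True if the validation loss increases from each epoch to the next.
eval_loss_rising = all(a["eval_loss"] < b["eval_loss"] for a, b in zip(history, history[1:]))

print(best_by_accuracy["epoch"], best_by_loss["epoch"], eval_loss_rising)
```

Accuracy peaks at epoch 2 while the validation loss is lowest at epoch 1 and rises afterwards, even as the training loss keeps falling. That pattern is consistent with overfitting. `TrainingArguments` offers `load_best_model_at_end` and `metric_for_best_model` for keeping the best checkpoint, though this run did not use them.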


TrainOutput(global_step=17130, training_loss=0.34157393066352804, metrics={'train_runtime': 1415.3665, 'train_samples_per_second': 96.819, 'train_steps_per_second': 12.103, 'total_flos': 1391867548479300.0, 'train_loss': 0.34157393066352804, 'epoch': 3.0})
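
The `global_step` value can be checked by hand. The sketch below assumes a single device and a per-device batch size of 8, the `Trainer` default when none is set. Both are assumptions, since the notebook does not state them. The split size and runtime are taken from the outputs above.

```python
import math

train_examples = 45678    # size of the train split reported when the dataset was generated
num_epochs = 3            # from 'epoch': 3.0 in TrainOutput
batch_size = 8            # assumption: default per-device batch size, one device
train_runtime = 1415.3665 # seconds, from TrainOutput

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = train_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
```

Under these assumptions the total matches the reported `global_step=17130`, and the throughput agrees with the reported `train_samples_per_second`.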

In [None]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/freud-sensei/dummy-kosts/commit/8c5194ca8b452f6ce9a0b84625ce1dcacd66cf16', commit_message='End of training', commit_description='', oid='8c5194ca8b452f6ce9a0b84625ce1dcacd66cf16', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('freud-sensei/dummy-kosts', id2label=id2label, label2id=label2id)
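
Passing `id2label` when reloading lets predictions be reported as topic names instead of bare integers. The sketch below shows that last step in isolation: a softmax over one row of logits, then a lookup in `id2label`. The logits are made-up values, not real model output, and `predict_label` is a hypothetical helper.

```python
import math
from typing import Dict, List, Tuple

# Same id -> name mapping as built from the dataset's label names.
id2label = {0: 'IT과학', 1: '경제', 2: '사회', 3: '생활문화', 4: '세계', 5: '스포츠', 6: '정치'}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: List[float], mapping: Dict[int, str]) -> Tuple[str, float]:
    """Return the most probable label name and its probability (hypothetical helper)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return mapping[best], probs[best]


toy_logits = [0.2, -1.0, 0.1, 0.3, -0.5, 4.0, 0.0]  # made-up values
label, prob = predict_label(toy_logits, id2label)
print(label)
```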

The model reached a validation accuracy of about 86%. It is saved in the `dummy-kosts` repository.

https://huggingface.co/freud-sensei/headline_classification