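Document classification is the task of assigning a category to a given document. For example, a model may take a news article as input and predict a category such as politics, economy, or entertainment, or it may classify a review as positive or negative.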

# Model architecture
The sentence-level pooler_output is taken from the BERT model and used to compute the classification output.
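
As a rough illustration of that last step, the sketch below applies a dropout layer and a linear layer to a sentence-level vector to produce one logit per class. The random tensor is only a stand-in for a real pooler_output, and the hidden size of 768, dropout rate of 0.1, and two labels are assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Assumed sizes for illustration: BERT-base style hidden size, binary labels
batch_size, hidden_size, num_labels = 4, 768, 2

# Stand-in for the pooler_output a BERT encoder would return (one vector per sentence)
pooler_output = torch.randn(batch_size, hidden_size)

# Classification head: dropout followed by a linear projection to the label space
head = nn.Sequential(nn.Dropout(p=0.1), nn.Linear(hidden_size, num_labels))
head.eval()  # disable dropout so the output is deterministic

with torch.no_grad():
    logits = head(pooler_output)

print(logits.shape)
```

Each row of logits holds the unnormalized scores for one sentence; a softmax over the last dimension turns them into class probabilities.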

In [2]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


In [None]:
!pip install ratsnlp

In [13]:
import torch
from ratsnlp.nlpbook.classification import ClassificationTrainArguments
args = ClassificationTrainArguments(
    pretrained_model_name='beomi/kcbert-base',      # name of the language model registered on the Hugging Face hub
    downstream_corpus_name='nsmc',                  # name of the downstream dataset
    downstream_model_dir='/gdrive/MyDrive/Colab Notebooks/Do it 자연어처리/checkpoint-doccls',
    downstream_corpus_root_dir='/gdrive/MyDrive/Colab Notebooks/Do it 자연어처리/checkpoint-doccls',
    batch_size=32,
    learning_rate=5e-5,
    max_seq_length=128,
    epochs=3,
    tpu_cores=0,
    seed=7
)

In [14]:
from ratsnlp import nlpbook
# Fix the random seed to the value given in args
nlpbook.set_seed(args)
# Configure logging output
nlpbook.set_logger(args)

INFO:ratsnlp:Training/evaluation parameters ClassificationTrainArguments(pretrained_model_name='beomi/kcbert-base', downstream_task_name='document-classification', downstream_corpus_name='nsmc', downstream_corpus_root_dir='/gdrive/MyDrive/Colab Notebooks/Do it 자연어처리/checkpoint-doccls', downstream_model_dir='/gdrive/MyDrive/Colab Notebooks/Do it 자연어처리/checkpoint-doccls', max_seq_length=128, save_top_k=1, monitor='min val_loss', seed=7, overwrite_cache=False, force_download=False, test_mode=False, learning_rate=5e-05, epochs=3, batch_size=32, cpu_workers=2, fp16=False, tpu_cores=0)


set seed: 7
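
Fixing the seed matters because sampling order, dropout, and weight initialization all draw from random number generators. The sketch below shows the general idea with a hypothetical helper, set_all_seeds, that seeds Python's random module and PyTorch. It is not the implementation of nlpbook.set_seed, only an illustration of why two runs with the same seed produce the same draws.

```python
import random

import torch


def set_all_seeds(seed):
    """Hypothetical helper: seed the Python and PyTorch random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)


# Draw once after seeding
set_all_seeds(7)
first_draw = (random.random(), torch.rand(1).item())

# Seed again with the same value and draw again
set_all_seeds(7)
second_draw = (random.random(), torch.rand(1).item())

print(first_draw == second_draw)
```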


In [8]:
# Download the corpus
from Korpora import Korpora
Korpora.fetch(
    corpus_name=args.downstream_corpus_name,
    root_dir=args.downstream_corpus_root_dir,
    force_download=True
)

[nsmc] download ratings_train.txt: 14.6MB [00:00, 85.5MB/s]                            
[nsmc] download ratings_test.txt: 4.90MB [00:00, 48.5MB/s]


In [9]:
# Tokenizer
# Load the tokenizer used by the kcbert-base model
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(
    args.pretrained_model_name,
    do_lower_case=False
)

(…)beomi/kcbert-base/resolve/main/vocab.txt:   0%|          | 0.00/250k [00:00<?, ?B/s]

(…)-base/resolve/main/tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

(…)omi/kcbert-base/resolve/main/config.json:   0%|          | 0.00/619 [00:00<?, ?B/s]

In [15]:
# Dataset
from ratsnlp.nlpbook.classification import NsmcCorpus, ClassificationDataset
corpus = NsmcCorpus()
train_DS = ClassificationDataset(
    args=args, corpus=corpus, tokenizer=tokenizer, mode='train'
)

INFO:ratsnlp:Creating features from dataset file at /gdrive/MyDrive/Colab Notebooks/Do it 자연어처리/checkpoint-doccls/nsmc
INFO:ratsnlp:loading train data... LOOKING AT /gdrive/MyDrive/Colab Notebooks/Do it 자연어처리/checkpoint-doccls/nsmc/ratings_train.txt
INFO:ratsnlp:tokenize sentences, it could take a lot of time...
INFO:ratsnlp:tokenize sentences [took 53.513 s]
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence: 아 더빙.. 진짜 짜증나네요 목소리
INFO:ratsnlp:tokens: [CLS] 아 더 ##빙 . . 진짜 짜증나네 ##요 목소리 [SEP] [PAD] [PAD] [PAD] ... (remaining [PAD] tokens up to max_seq_length=128; output truncated)

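NsmcCorpus reads the NSMC data, stored in a CSV-style file format, as pairs of a sentence (a movie review) and a label.

ClassificationDataset takes the sentences and labels passed on by NsmcCorpus and uses the tokenizer to convert each pair into a form the model can learn from (ClassificationFeatures).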
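
The conversion can be pictured with a small pure-Python sketch. The function build_features and the toy vocabulary below are hypothetical and stand in for the real tokenizer and dataset classes. They only show how a tokenized sentence becomes fixed-length input_ids, attention_mask, and token_type_ids.

```python
def build_features(tokens, vocab, max_seq_length):
    """Hypothetical helper: turn word pieces into fixed-length model inputs."""
    # Reserve two positions for the special tokens, truncating if needed
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens

    # Pad on the right up to max_seq_length; 0 in the mask marks padding
    pad_length = max_seq_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad_length
    attention_mask += [0] * pad_length

    # A single-sentence input belongs entirely to segment 0
    token_type_ids = [0] * max_seq_length
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


# Toy vocabulary, assumed for illustration only
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "아": 4, "더": 5, "##빙": 6}
features = build_features(["아", "더", "##빙", "진짜"], toy_vocab, max_seq_length=8)
print(features)
```

The token "진짜" is missing from the toy vocabulary, so it maps to the [UNK] id, and the last two positions are padding that the attention mask tells the model to ignore.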

In [20]:
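# Elements of ClassificationFeatures
# token_type_ids : next sentence prediction is also one of BERT's pretraining tasks, so the features carry segment ids.
#                  In this exercise each input is a single review, so they are all filled with 0.
# Every element except label has length 128 because max_seq_length is set to 128 (shorter inputs are padded with 0).

[x for x in dir(train_DS[0]) if not x.startswith('__')]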

['attention_mask', 'input_ids', 'label', 'token_type_ids']

In [34]:
from torch.utils.data import DataLoader, RandomSampler
train_loader = DataLoader(train_DS,
                          batch_size=args.batch_size,
                          sampler=RandomSampler(train_DS, replacement=False),
                          # Gathers input_ids, attention_mask, etc. by field across the instances in a batch and converts them to tensors
                          collate_fn=nlpbook.data_collator,
                          drop_last=False,
                          num_workers=args.cpu_workers)
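
To make the role of collate_fn concrete, here is a minimal sketch of a collator. ToyFeatures and simple_collator are hypothetical names and this is not the code of nlpbook.data_collator. Renaming label to labels is an assumption based on the argument name that Hugging Face classification models accept.

```python
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class ToyFeatures:
    """Hypothetical stand-in for one preprocessed instance."""
    input_ids: List[int]
    attention_mask: List[int]
    token_type_ids: List[int]
    label: int


def simple_collator(features):
    """Group each field across the batch and stack it into one tensor."""
    return {
        "input_ids": torch.tensor([f.input_ids for f in features], dtype=torch.long),
        "attention_mask": torch.tensor([f.attention_mask for f in features], dtype=torch.long),
        "token_type_ids": torch.tensor([f.token_type_ids for f in features], dtype=torch.long),
        # Assumption: the model expects the target under the key "labels"
        "labels": torch.tensor([f.label for f in features], dtype=torch.long),
    }


toy_batch = simple_collator([
    ToyFeatures([2, 4, 3, 0], [1, 1, 1, 0], [0, 0, 0, 0], 0),
    ToyFeatures([2, 5, 6, 3], [1, 1, 1, 1], [0, 0, 0, 0], 1),
])
print({name: tensor.shape for name, tensor in toy_batch.items()})
```

Because every instance is already padded to the same length, stacking is a plain tensor conversion with no extra padding logic.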


In [29]:
# Build the evaluation dataset
from torch.utils.data import DataLoader, SequentialSampler
val_DS = ClassificationDataset(
    args=args, corpus=corpus, tokenizer=tokenizer, mode='test')
val_loader = DataLoader(val_DS, batch_size=args.batch_size, sampler=SequentialSampler(val_DS),
                        collate_fn=nlpbook.data_collator, drop_last=False, num_workers=args.cpu_workers)

INFO:ratsnlp:Loading features from cached file /gdrive/MyDrive/Colab Notebooks/Do it 자연어처리/checkpoint-doccls/nsmc/cached_test_BertTokenizer_128_nsmc_document-classification [took 12.159 s]


# Model

In [31]:
# Initialize the model
from transformers import BertConfig, BertForSequenceClassification
pretrained_model_config = BertConfig.from_pretrained(
    args.pretrained_model_name,
    num_labels=corpus.num_labels)
model = BertForSequenceClassification.from_pretrained(
    args.pretrained_model_name,
    config=pretrained_model_config)

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initiali

In [41]:
# Run one training batch through the model to inspect its input and output
temp_input = next(iter(train_loader))
temp_output = model(**temp_input)

In [43]:
temp_output.logits

tensor([[ 1.2235, -0.7268],
        [-0.7749,  1.1573],
        [ 1.0981, -0.4304],
        [ 0.8057, -0.6648],
        [ 0.9424, -0.6168],
        [ 0.6497, -0.5349],
        [ 1.0555, -0.5246],
        [ 1.1967, -0.8944],
        [-1.0130,  1.1617],
        [ 0.2904,  0.0126],
        [-0.9663,  0.9679],
        [ 1.2870, -0.7691],
        [ 0.4145, -0.0040],
        [ 1.1725, -0.8149],
        [ 0.1263, -0.0654],
        [ 0.1338,  0.4108],
        [ 0.5446, -0.1864],
        [ 1.0788, -0.5609],
        [ 0.4037,  0.1351],
        [-1.2228,  1.3812],
        [-0.0162,  0.1407],
        [ 0.1811, -0.3455],
        [-0.8531,  0.8717],
        [ 0.9532, -0.2924],
        [ 0.4112,  0.0530],
        [ 1.6251, -0.7879],
        [ 0.4582, -0.2145],
        [ 0.7524, -0.2621],
        [ 0.9470, -0.3083],
        [ 0.4364, -0.5868],
        [ 1.2692, -0.7501],
        [-0.8572,  1.1828]], grad_fn=<AddmmBackward0>)
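
Each row of the logits holds two unnormalized scores, one per class. As a small sketch, the first three rows of the output above can be turned into probabilities with a softmax and into predicted classes with an argmax. Reading index 0 as negative and index 1 as positive is an assumption that matches how the inference function in this notebook reads the probabilities.

```python
import torch

# First three rows of the logits printed above
logits = torch.tensor([[ 1.2235, -0.7268],
                       [-0.7749,  1.1573],
                       [ 1.0981, -0.4304]])

# Softmax over the class dimension gives probabilities that sum to 1 per row
probs = logits.softmax(dim=1)

# The predicted class is the index of the larger probability
preds = probs.argmax(dim=1)

print(probs)
print(preds)
```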

In [37]:
# Training
# The task is defined by subclassing PyTorch Lightning's LightningModule
# The optimizer, learning rate scheduler, and so on are defined there

# Interrupted partway because training takes a long time
from ratsnlp.nlpbook.classification import ClassificationTask
task = ClassificationTask(model, args)
trainer = nlpbook.get_trainer(args)
trainer.fit(task, train_loader, val_loader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
  rank_zero_warn(
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                          | Params
--------------------------------------------------------
0 | model | BertForSequenceClassification | 108 M 
--------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.680   Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
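
For readers who want to see what an optimizer and a learning rate scheduler do inside such a task, the following is a minimal plain-PyTorch sketch on a toy linear model. The choice of AdamW with an exponentially decaying learning rate (gamma of 0.9) is an assumption made for illustration and may differ from what ClassificationTask actually configures. The toy data is random.

```python
import torch
from torch import nn

torch.manual_seed(7)

# Toy stand-ins: a linear classifier and one random batch of 32 examples
toy_model = nn.Linear(8, 2)
toy_inputs = torch.randn(32, 8)
toy_labels = torch.randint(0, 2, (32,))

# Assumed setup: AdamW plus a learning rate that decays by 10% per epoch
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
loss_fn = nn.CrossEntropyLoss()

learning_rates = []
for epoch in range(3):
    toy_model.train()
    optimizer.zero_grad()                       # clear gradients from the previous step
    loss = loss_fn(toy_model(toy_inputs), toy_labels)
    loss.backward()                             # compute gradients
    optimizer.step()                            # update the weights
    scheduler.step()                            # decay the learning rate
    learning_rates.append(scheduler.get_last_lr()[0])

print(learning_rates)
```

A Lightning trainer runs the same loop internally, so the task only has to declare the loss computation and the optimizer configuration.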


In [46]:
# Inference function
def inference_fn(sentence):
    # Tokenize a single sentence and pad or truncate it to max_seq_length
    inputs = tokenizer(
        [sentence], max_length=args.max_seq_length, padding='max_length', truncation=True)
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        outputs = model(**{k: torch.tensor(v) for k, v in inputs.items()})
        prob = outputs.logits.softmax(dim=1)
        # index 1 is the positive class, index 0 is the negative class
        pos_prob = round(prob[0][1].item(), 4)
        neg_prob = round(prob[0][0].item(), 4)
    return {'sentence': sentence, 'positive_probability': pos_prob, 'negative_probability': neg_prob}

In [47]:
inference_fn('이 영화 정말 쓰레기네요. 돈 주고 보기 아까워요')

{'sentence': '이 영화 정말 쓰레기네요. 돈 주고 보기 아까워요',
 'positive_probability': 0.1367,
 'negative_probability': 0.8633}
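
If a single label is more convenient than two probabilities, the returned dictionary can be reduced with a threshold. The helper to_label below is hypothetical, and the 0.5 threshold is an assumption that can be adjusted. The example input copies the result shown above.

```python
def to_label(result, threshold=0.5):
    """Hypothetical helper: map the positive probability to a text label."""
    if result["positive_probability"] >= threshold:
        return "positive"
    return "negative"


# Result dictionary copied from the inference output above
example_result = {
    "sentence": "이 영화 정말 쓰레기네요. 돈 주고 보기 아까워요",
    "positive_probability": 0.1367,
    "negative_probability": 0.8633,
}
print(to_label(example_result))
```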