# Environment check
Check the current project path and change the working directory to it.

In [1]:
from chrisbase.util import to_dataframe
from chrislab.common.util import GpuProjectEnv

env = GpuProjectEnv(project_name="DeepKorean", working_gpus="0")
to_dataframe(env)


| | key | value |
|---|---|---|
| 0 | hostname | dl012 |
| 1 | hostaddr | 129.254.182.78 |
| 2 | python_path | /data/dlt/anaconda3/envs/DeepKorean-23.03/bin/python |
| 3 | project_name | DeepKorean |
| 4 | project_path | /data/dlt/proj/DeepKorean-23.03 |
| 5 | working_path | /data/dlt/proj/DeepKorean-23.03 |
| 6 | running_file | tests/1-doc_cls-train.ipynb |
| 7 | working_gpus | 0 |
| 8 | number_of_gpus | 1 |
| 9 | cuda_home_dir | /usr/local/cuda-11.4 |
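
The internals of `GpuProjectEnv` are not shown in this notebook. As a rough idea of what an environment check of this kind involves, the sketch below uses only the standard library to move into a project directory, restrict the visible GPUs, and report a few of the same fields. The function name `describe_env` and its return layout are assumptions made for illustration, not part of `chrislab`.

```python
import os
import socket
import sys
from pathlib import Path


def describe_env(project_path: str, working_gpus: str = "0") -> dict:
    """Move into the project directory and report basic environment facts."""
    root = Path(project_path).resolve()
    os.chdir(root)  # relative paths such as "data" now resolve against the project
    # Limit which GPUs CUDA-based libraries can see; set this before CUDA is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = working_gpus
    return {
        "hostname": socket.gethostname(),
        "python_path": sys.executable,
        "project_path": str(root),
        "working_path": os.getcwd(),
        "working_gpus": working_gpus,
        "number_of_gpus": len(working_gpus.split(",")),  # count of listed GPU ids
    }
```

Calling `describe_env(".", working_gpus="0")` from the project root would return a dictionary comparable to the key/value listing above.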


# Settings
Declare the configuration, including the model hyperparameters and the location where results are saved.

In [2]:
from ratsnlp.nlpbook.classification import ClassificationTrainArguments

args = ClassificationTrainArguments(
    pretrained_model_name="pretrained/KcBERT-Base",
    downstream_corpus_name="nsmc",
    downstream_corpus_root_dir="data",
    downstream_model_dir="checkpoints/nsmc",
    batch_size=32,
    learning_rate=5e-5,
    max_seq_length=128,
    epochs=3,
    seed=7,
)
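
To illustrate how a settings container like this can be organized, here is a minimal sketch built on a standard-library dataclass. `TrainArgs` and its validation rules are hypothetical stand-ins, not the actual `ClassificationTrainArguments` class; only the field names and values are taken from the cell above.

```python
from dataclasses import asdict, dataclass


@dataclass
class TrainArgs:
    """Hypothetical stand-in for a training-arguments container."""

    pretrained_model_name: str
    downstream_corpus_name: str
    downstream_model_dir: str
    batch_size: int = 32
    learning_rate: float = 5e-5
    max_seq_length: int = 128
    epochs: int = 3
    seed: int = 7

    def __post_init__(self):
        # Fail early on values that would otherwise only break mid-training.
        if self.batch_size <= 0 or self.epochs <= 0:
            raise ValueError("batch_size and epochs must be positive")
        if not 0 < self.learning_rate < 1:
            raise ValueError("learning_rate should lie in (0, 1)")


demo_args = TrainArgs(
    pretrained_model_name="pretrained/KcBERT-Base",
    downstream_corpus_name="nsmc",
    downstream_model_dir="checkpoints/nsmc",
)
print(asdict(demo_args))
```

Keeping every setting in one object means the seed, the logger, and the data loaders can all read from the same source.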

# Fix the random seed
Fix the random seed so that training runs are reproducible.

In [3]:
from ratsnlp import nlpbook

nlpbook.set_seed(args)

set seed: 7
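
What `nlpbook.set_seed` does internally is not shown here. A seed helper of this kind usually seeds every random number generator in use, along the lines of the sketch below. The name `set_all_seeds` is an assumption, and the NumPy and PyTorch calls are skipped when those packages are not installed.

```python
import os
import random


def set_all_seeds(seed: int) -> None:
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    # Only affects hash randomization in Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed


set_all_seeds(7)
first_draw = random.random()
set_all_seeds(7)
second_draw = random.random()
print(first_draw == second_draw)
```

Seeding alone may not make GPU training bit-for-bit identical, because some CUDA kernels are non-deterministic.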


# Logger setup
Set up a logger for printing messages.

In [4]:
nlpbook.set_logger(args)

INFO:ratsnlp:Training/evaluation parameters ClassificationTrainArguments(pretrained_model_name='pretrained/KcBERT-Base', downstream_task_name='document-classification', downstream_corpus_name='nsmc', downstream_corpus_root_dir='data', downstream_model_dir='checkpoints/nsmc', max_seq_length=128, save_top_k=1, monitor='min val_loss', seed=7, overwrite_cache=False, force_download=False, test_mode=False, learning_rate=5e-05, epochs=3, batch_size=32, cpu_workers=64, fp16=False, tpu_cores=0)
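
The `INFO:ratsnlp:...` line above has the `LEVEL:name:message` layout that Python's `logging` module uses by default. The sketch below reproduces that layout with a standard-library logger. `make_logger` is a hypothetical helper, not the `nlpbook.set_logger` implementation.

```python
import io
import logging


def make_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes `LEVEL:name:message` lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines through the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
    logger.handlers = [handler]  # replace handlers so re-running a cell does not stack them
    return logger


buffer = io.StringIO()
log = make_logger("demo", buffer)
log.info("Training/evaluation parameters %s", {"seed": 7})
log.debug("not shown, because the level is INFO")
print(buffer.getvalue(), end="")
```

Replacing the handler list rather than appending to it matters in notebooks, where the same cell is often executed more than once.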


# Download the corpus
Download the corpus used in this exercise.

In [5]:
from Korpora import Korpora

Korpora.fetch(
    corpus_name=args.downstream_corpus_name,
    root_dir=args.downstream_corpus_root_dir,
)

[Korpora] Corpus `nsmc` is already installed at /data/dlt/proj/DeepKorean-23.03/data/nsmc/ratings_train.txt
[Korpora] Corpus `nsmc` is already installed at /data/dlt/proj/DeepKorean-23.03/data/nsmc/ratings_test.txt
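
The NSMC files are tab-separated text with a header row (`id`, `document`, `label`), where label 1 marks a positive review and 0 a negative one. The sketch below parses that layout from an in-memory sample. The two rows are made up for illustration, and `read_nsmc` is a hypothetical helper rather than part of Korpora or ratsnlp.

```python
import csv
import io

# Made-up rows in the NSMC layout (tab-separated: id, document, label).
SAMPLE = "id\tdocument\tlabel\n1\t정말 재미있어요\t1\n2\t시간이 아까웠다\t0\n"


def read_nsmc(handle):
    """Yield (text, label) pairs, skipping the header row."""
    # QUOTE_NONE keeps quotation marks inside reviews as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # header: id, document, label
    for _id, document, label in reader:
        yield document, int(label)


examples = list(read_nsmc(io.StringIO(SAMPLE)))
print(examples)
```

To read the downloaded data instead, pass a handle opened with `open(path, encoding="utf-8", newline="")` on `ratings_train.txt`.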


# Prepare the tokenizer
Declare the tokenizer that performs tokenization.

In [6]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    args.pretrained_model_name,
    do_lower_case=False,
)
print(tokenizer.tokenize("안녕하세요. 반갑습니다."))
tokenizer

['안녕', '##하세요', '.', '반', '##갑', '##습니다', '.']


BertTokenizer(name_or_path='pretrained/KcBERT-Base', vocab_size=30000, model_max_length=300, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
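
The `##` prefixes in the output come from WordPiece, which splits each word greedily into the longest vocabulary entries it can find. The sketch below shows that matching step for a single word. The five-entry `toy_vocab` is an assumption for illustration; the real KcBERT vocabulary has 30,000 entries, and `BertTokenizer` splits off punctuation before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match-first search."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shorten the candidate and try again
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"안녕", "##하세요", "반", "##갑", "##습니다"}
print(wordpiece("안녕하세요", toy_vocab))
print(wordpiece("반갑습니다", toy_vocab))
```

With this toy vocabulary the two words split the same way as in the tokenizer output above.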

# Build the training data
Build the training dataset and its data loader.

In [7]:
from ratsnlp.nlpbook.classification import NsmcCorpus, ClassificationDataset
from torch.utils.data import DataLoader, RandomSampler

corpus = NsmcCorpus()
train_dataset = ClassificationDataset(
    args=args,
    corpus=corpus,
    tokenizer=tokenizer,
    mode="train",
)
train_dataloader = DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=RandomSampler(train_dataset, replacement=False),
    collate_fn=nlpbook.data_collator,
    drop_last=False,
    num_workers=args.cpu_workers,
)

INFO:ratsnlp:Loading features from cached file data/nsmc/cached_train_BertTokenizer_128_nsmc_document-classification [took 21.244 s]
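
Each cached feature is a fixed-length encoding of one review, padded or truncated to `max_seq_length`. The sketch below shows the usual BERT-style layout for a single sentence. The special-token ids (`cls_id=2`, `sep_id=3`, `pad_id=0`) are placeholders, not values read from the KcBERT vocabulary, and `encode` is a hypothetical helper rather than the `ClassificationDataset` code.

```python
def encode(token_ids, max_seq_length, cls_id=2, sep_id=3, pad_id=0):
    """Build fixed-length BERT-style inputs for a single sentence."""
    body = token_ids[: max_seq_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)
    padding = max_seq_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding positions
        "token_type_ids": [0] * max_seq_length,  # a single segment
    }


features = encode([11, 12, 13], max_seq_length=8)
print(features["input_ids"])       # [2, 11, 12, 13, 3, 0, 0, 0]
print(features["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Fixed-length inputs let a collate function stack examples into one tensor per field without extra padding logic.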


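# Build the evaluation data
Build the evaluation dataset used during training. The code below loads the NSMC test split (`mode="test"`) for this purpose.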

In [8]:
from torch.utils.data import SequentialSampler

val_dataset = ClassificationDataset(
    args=args,
    corpus=corpus,
    tokenizer=tokenizer,
    mode="test",
)
val_dataloader = DataLoader(
    val_dataset,
    batch_size=args.batch_size,
    sampler=SequentialSampler(val_dataset),
    collate_fn=nlpbook.data_collator,
    drop_last=False,
    num_workers=args.cpu_workers,
)

INFO:ratsnlp:Loading features from cached file data/nsmc/cached_test_BertTokenizer_128_nsmc_document-classification [took 7.323 s]
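
The two loaders differ only in sampling order: the training loader shuffles without replacement, while the evaluation loader reads sequentially. Both keep the final short batch (`drop_last=False`). The sketch below mimics that batching in plain Python; `batch_indices` is a hypothetical helper, not the PyTorch implementation. It also checks the batch counts, assuming the standard NSMC split sizes of 150,000 training and 50,000 test reviews.

```python
import math
import random


def batch_indices(num_examples, batch_size, shuffle, seed=7):
    """Return index batches, keeping the final short batch (drop_last=False)."""
    order = list(range(num_examples))
    if shuffle:
        random.Random(seed).shuffle(order)  # a permutation, so no index repeats
    return [order[i : i + batch_size] for i in range(0, num_examples, batch_size)]


# Assumed NSMC split sizes: 150,000 training and 50,000 test reviews.
print(math.ceil(150_000 / 32), math.ceil(50_000 / 32))  # 4688 1563

print(batch_indices(10, batch_size=4, shuffle=False))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The counts 4688 and 1563 match the step totals shown in the training and validation progress bars later in this notebook.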


# Initialize the model
Load the pretrained model and initialize a model for document classification.

In [9]:
from transformers import BertConfig, BertForSequenceClassification

pretrained_model_config = BertConfig.from_pretrained(
    args.pretrained_model_name,
    num_labels=corpus.num_labels,
)
model = BertForSequenceClassification.from_pretrained(
    args.pretrained_model_name,
    config=pretrained_model_config,
)

Some weights of the model checkpoint at pretrained/KcBERT-Base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not ini
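
As a sanity check on the initialized model, its parameter count can be worked out by hand. The sketch below assumes standard BERT-base sizes (hidden size 768, 12 layers, intermediate size 3072), which are not printed in this notebook. The vocabulary size (30,000) and the 300 positions come from the tokenizer output above, and using 300 as the number of position embeddings is also an assumption.

```python
def bert_classifier_params(vocab, positions, hidden, layers, intermediate, labels, types=2):
    """Count weights and biases of a BERT encoder with a linear classification head."""
    layer_norm = 2 * hidden  # scale and shift vectors
    embeddings = (vocab + positions + types) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, and output projections
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    pooler = hidden * hidden + hidden
    classifier = hidden * labels + labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier


total = bert_classifier_params(
    vocab=30000, positions=300, hidden=768, layers=12, intermediate=3072, labels=2
)
print(total)                      # 108920066
print(round(total * 4 / 1e6, 3))  # 435.68 (MB at 4 bytes per float32 parameter)
```

Under these assumptions the result agrees with the model summary printed when training starts (108 M parameters, 435.680 MB).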

# Prepare for training
Prepare the Task and the Trainer.

In [10]:
from ratsnlp.nlpbook.classification import ClassificationTask

task = ClassificationTask(model, args)

In [11]:
trainer = nlpbook.get_trainer(args)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
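
The `ClassificationTask` source is not shown here. The metrics it logs during training (`acc`, `val_loss`, `val_acc`) are the kind a classification step usually computes from the model's logits. The sketch below assumes softmax cross-entropy as the loss; the logits and labels are made up for illustration.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the correct class under a softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]


def accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    hits = sum(row.index(max(row)) == y for row, y in zip(batch_logits, labels))
    return hits / len(labels)


# Made-up logits for a batch of three two-class examples.
batch_logits = [[2.0, -1.0], [0.5, 1.5], [0.0, 0.0]]
labels = [0, 1, 1]
loss = sum(cross_entropy(row, y) for row, y in zip(batch_logits, labels)) / len(labels)
print(round(loss, 4), accuracy(batch_logits, labels))
```

In a Lightning-style task, the training step would return this loss for backpropagation and log the accuracy for the progress bar.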


# Start training
Start training with the prepared data and model. The training results are saved to the location configured earlier (`args.downstream_model_dir`).

In [12]:
import torch

torch.set_float32_matmul_precision('high')
trainer.fit(
    task,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)

Missing logger folder: /data/dlt/proj/DeepKorean-23.03/checkpoints/nsmc/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                          | Params
--------------------------------------------------------
0 | model | BertForSequenceClassification | 108 M 
--------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.680   Total estimated model params size (MB)


Epoch 0: 100%|██████████| 4688/4688 [04:36<00:00, 16.95it/s, v_num=0, acc=0.938]

`Trainer.fit` stopped: `max_epochs=3` reached.


Epoch 2: 100%|██████████| 4688/4688 [05:06<00:00, 15.28it/s, v_num=0, acc=1.000, val_loss=0.310, val_acc=0.891]
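
The logged arguments include `save_top_k=1` and `monitor='min val_loss'`, which suggests that only the checkpoint with the lowest validation loss is kept. The selection rule can be sketched as follows. `best_checkpoints` is a hypothetical helper, and the per-epoch losses are toy values, not results from this run.

```python
def best_checkpoints(history, top_k=1):
    """Return the `top_k` records with the lowest validation loss."""
    ranked = sorted(history, key=lambda record: record["val_loss"])
    return ranked[:top_k]


# Toy values for illustration only; they are not taken from the run above.
history = [
    {"epoch": 0, "val_loss": 0.9},
    {"epoch": 1, "val_loss": 0.5},
    {"epoch": 2, "val_loss": 0.7},
]
print(best_checkpoints(history))  # [{'epoch': 1, 'val_loss': 0.5}]
```

Under this rule the saved checkpoint is not necessarily the one from the final epoch.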
