
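# Chapter 4. Tagging Documents

## 4-1 A Quick Look at the Document Classification Model
- Document classification: the task of assigning a given document to a category (positive/negative, politics/economy/entertainment, ...)
- Uses the Naver Sentiment Movie Corpus (NSMC)

### Model structure
- Add the [CLS] and [SEP] tokens, which mark the start and end of the input, before and after the original tokens
- Feed the tokens into the BERT model and take its pooler_output (a sentence-level vector)
- Attach an additional module on top to predict positive/negative

### Task module
- Apply dropout to the pooler_output vector
- Project pooler_output to a vector whose dimension equals the number of categories to classify
- Apply softmax
- Compare the final prediction with the target and update the entire model (*fine-tuning)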
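
The task module steps listed above can be sketched in plain Python. This is a minimal illustration, not the code inside BertForSequenceClassification: the 4-dimensional `pooled` vector, `weights`, and `bias` are made-up values, and dropout is left out because it acts as an identity operation at inference time.

```python
import math


def linear(vector, weights, bias):
    """Project a vector to one score (logit) per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Loss is small when the target class receives a high probability."""
    return -math.log(probs[target])


# Hypothetical values: a tiny stand-in for pooler_output and a 2-class head.
pooled = [0.5, -1.0, 2.0, 0.1]
weights = [[0.1, 0.2, -0.3, 0.0],   # row for class 0 (negative)
           [-0.2, 0.1, 0.4, 0.3]]   # row for class 1 (positive)
bias = [0.0, 0.1]

logits = linear(pooled, weights, bias)
probs = softmax(logits)
loss = cross_entropy(probs, target=1)
print(probs, loss)
```

During fine-tuning, the gradient of this loss flows back through both the classification head and BERT itself, which is why the whole model is updated.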

## 4-2 Training a Document Classification Model
### Building a movie review sentiment analysis model
#### Configuring the settings

In [1]:
!pip install ratsnlp

Collecting ratsnlp
  Downloading ratsnlp-0.0.9999-py3-none-any.whl (53 kB)
Collecting flask-cors>=3.0.10
  Downloading Flask_Cors-3.0.10-py2.py3-none-any.whl (14 kB)
Collecting Korpora>=0.2.0
  Downloading Korpora-0.2.0-py3-none-any.whl (57 kB)
Collecting flask-ngrok>=0.0.25
  Downloading flask_ngrok-0.0.25-py3-none-any.whl (3.1 kB)
Collecting pytorch-lightning==1.3.4
  Downloading pytorch_lightning-1.3.4-py3-none-any.whl (806 kB)
Collecting transformers==4.10.0
  Downloading transformers-4.10.0-py3-none-any.whl (2.8 MB)
Collecting torchmetrics>=0.2.0
  Downloading torchmetrics-0.6.2-py3-none-any.whl (332 kB)
Collecting fsspec[http]>=2021.4.0
  Downloading fsspec-202

In [2]:
import torch
from ratsnlp.nlpbook.classification import ClassificationTrainArguments

args = ClassificationTrainArguments(pretrained_model_name='beomi/kcbert-base',
                                    downstream_corpus_name='nsmc',
                                    downstream_model_dir='/drive/My drive/nlpbook/checkpoint-doccls',
                                    batch_size=32 if torch.cuda.is_available() else 4,
                                    learning_rate=5e-5,
                                    max_seq_length=128,
                                    epochs=3,
                                    tpu_cores=0 if torch.cuda.is_available() else 8,
                                    seed=7)

In [3]:
from ratsnlp import nlpbook

nlpbook.set_seed(args)

set seed: 7


In [4]:
# Set up the logger that prints logs while the code runs
nlpbook.set_logger(args)

INFO:ratsnlp:Training/evaluation parameters ClassificationTrainArguments(pretrained_model_name='beomi/kcbert-base', downstream_task_name='document-classification', downstream_corpus_name='nsmc', downstream_corpus_root_dir='/root/Korpora', downstream_model_dir='/drive/My drive/nlpbook/checkpoint-doccls', max_seq_length=128, save_top_k=1, monitor='min val_loss', seed=7, overwrite_cache=False, force_download=False, test_mode=False, learning_rate=5e-05, epochs=3, batch_size=32, cpu_workers=2, fp16=False, tpu_cores=0)


#### Downloading the corpus

In [5]:
from Korpora import Korpora

Korpora.fetch(corpus_name=args.downstream_corpus_name,
              root_dir=args.downstream_corpus_root_dir,
              force_download=True)

[nsmc] download ratings_train.txt: 14.6MB [00:00, 93.6MB/s]                            
[nsmc] download ratings_test.txt: 4.90MB [00:00, 51.0MB/s]


#### Preparing the tokenizer

In [6]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(args.pretrained_model_name, do_lower_case=False)


#### Data Preprocessing

In [7]:
from ratsnlp.nlpbook.classification import NsmcCorpus, ClassificationDataset

corpus = NsmcCorpus()
# Plays the role of a PyTorch Dataset (it is passed to DataLoader below)
train_dataset = ClassificationDataset(args=args, 
                                      corpus=corpus,
                                      tokenizer=tokenizer,
                                      mode='train')

INFO:ratsnlp:Creating features from dataset file at /root/Korpora/nsmc
INFO:ratsnlp:loading train data... LOOKING AT /root/Korpora/nsmc/ratings_train.txt
INFO:ratsnlp:tokenize sentences, it could take a lot of time...
INFO:ratsnlp:tokenize sentences [took 85.799 s]
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence: 아 더빙.. 진짜 짜증나네요 목소리
INFO:ratsnlp:tokens: [CLS] 아 더 ##빙 . . 진짜 짜증나네 ##요 목소리 [SEP] [PAD] [PAD] [PAD] ... (remaining [PAD] tokens omitted; output truncated)

In [12]:
corpus.num_labels

2

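- ClassificationDataset
    - Uses the tokenizer to turn each sentence and label handed over by NsmcCorpus into a form the model can learn from (ClassificationFeatures)

- ClassificationFeatures
    - input_ids: the token sequence converted to indices (list of int)
    - attention_mask: marks whether each token is padding (0) or not (1) (list of int)
    - token_type_ids: segment information (list of int)
    - label: the label converted to an integer (int)

- ClassificationFeatures.token_type_ids: BERT's pretraining tasks are masked language modeling (filling in blanks) and next sentence prediction (***binary classification*** of whether two documents are consecutive). In token_type_ids, a value of 0 marks tokens from the first document and a value of 1 marks tokens from the second document. Each NSMC review is a single document, so the values are all 0 here.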
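
As a rough sketch of how these fields relate to each other, the function below wraps a list of token ids in [CLS]/[SEP] and pads it to a fixed length. It is not the ratsnlp implementation: `build_features` is a hypothetical helper, the special token ids (2, 3, 0) are read off the sample output in this notebook, and `max_seq_length=16` is used only to keep the printout short.

```python
def build_features(token_ids, max_seq_length, cls_id=2, sep_id=3, pad_id=0):
    """Wrap token ids in [CLS]/[SEP], then pad to max_seq_length."""
    # Truncate so that the two special tokens still fit.
    body = token_ids[: max_seq_length - 2]
    ids = [cls_id] + body + [sep_id]
    num_padding = max_seq_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * num_padding,
        # 1 for real tokens (including [CLS]/[SEP]), 0 for padding.
        "attention_mask": [1] * len(ids) + [0] * num_padding,
        # A single-document input belongs entirely to segment 0.
        "token_type_ids": [0] * max_seq_length,
    }


# Token ids of the first training review, without special tokens.
review_ids = [2170, 832, 5045, 17, 17, 7992, 29734, 4040, 10720]
features = build_features(review_ids, max_seq_length=16)
print(features["input_ids"])
print(features["attention_mask"])
```

The attention mask lets the model ignore the padded positions, so padding every review to the same length does not change what the model attends to.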

In [8]:
# Example
train_dataset[0]

ClassificationFeatures(input_ids=[2, 2170, 832, 5045, 17, 17, 7992, 29734, 4040, 10720, 3, 0, 0, 0, ...], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...], token_type_ids=[0, 0, 0, ... (repeated zeros omitted; output truncated)

In [9]:
from torch.utils.data import DataLoader, RandomSampler

train_dataloader = DataLoader(train_dataset,
                              batch_size=args.batch_size,
                              sampler=RandomSampler(train_dataset, replacement=False), # random sampling without replacement
                              collate_fn=nlpbook.data_collator, # gathers instances field by field and converts them to tensors
                              drop_last=False,
                              num_workers=args.cpu_workers)
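
To make the sampler and collator roles concrete, here is a small plain-Python sketch of the same idea. It is only an approximation under stated assumptions: `make_batches` and `toy_dataset` are hypothetical names, the batches hold plain lists, whereas `nlpbook.data_collator` produces tensors, and the real DataLoader also handles worker processes.

```python
import random


def make_batches(dataset, batch_size, seed=7):
    """Yield shuffled batches, grouping each field across instances."""
    rng = random.Random(seed)
    # Drawing every index exactly once is sampling without replacement.
    order = rng.sample(range(len(dataset)), k=len(dataset))
    for start in range(0, len(order), batch_size):
        chunk = [dataset[i] for i in order[start:start + batch_size]]
        # Collate step: one list per field instead of one dict per instance.
        yield {key: [item[key] for item in chunk] for key in chunk[0]}


# Hypothetical toy instances standing in for ClassificationFeatures.
toy_dataset = [{"input_ids": [2, 10 + i, 3], "label": i % 2} for i in range(10)]
batches = list(make_batches(toy_dataset, batch_size=4))
print([len(batch["label"]) for batch in batches])  # [4, 4, 2]
```

The last batch holds only two instances because nothing is dropped, which mirrors `drop_last=False` in the cell above.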

In [10]:
# Validation data (built from the NSMC test split, mode='test')
from torch.utils.data import SequentialSampler

val_dataset = ClassificationDataset(args=args,
                                    corpus=corpus,
                                    tokenizer=tokenizer,
                                    mode='test')

val_dataloader = DataLoader(val_dataset,
                            batch_size=args.batch_size,
                            sampler=SequentialSampler(val_dataset),
                            collate_fn=nlpbook.data_collator,
                            drop_last=False,
                            num_workers=args.cpu_workers)

INFO:ratsnlp:Creating features from dataset file at /root/Korpora/nsmc
INFO:ratsnlp:loading test data... LOOKING AT /root/Korpora/nsmc/ratings_test.txt
INFO:ratsnlp:tokenize sentences, it could take a lot of time...
INFO:ratsnlp:tokenize sentences [took 28.280 s]
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence: 굳 ㅋ
INFO:ratsnlp:tokens: [CLS] 굳 ㅋ [SEP] [PAD] [PAD] [PAD] ... (remaining [PAD] tokens omitted; output truncated)

#### Loading the model
- BertForSequenceClassification: a class that puts a classification task module on top of the pretrained BERT model

In [13]:
from transformers import BertConfig, BertForSequenceClassification

pretrained_model_config = BertConfig.from_pretrained(args.pretrained_model_name,
                                                     num_labels=corpus.num_labels) # 2

model = BertForSequenceClassification.from_pretrained(args.pretrained_model_name,
                                                      config=pretrained_model_config)

Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initiali

#### Training the model
- Define the task by subclassing the LightningModule class provided by PyTorch Lightning
- The task defines the model, the optimizer, the training procedure, and so on

In [None]:
from ratsnlp.nlpbook.classification import ClassificationTask

task = ClassificationTask(model, args)
trainer = nlpbook.get_trainer(args)

trainer.fit(task, train_dataloader=train_dataloader, val_dataloaders=val_dataloader)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                          | Params
--------------------------------------------------------
0 | model | BertForSequenceClassification | 108 M 
--------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.680   Total estimated model params size (MB)



- ClassificationTask: uses the AdamW optimizer and the ExponentialLR learning rate scheduler
    - ExponentialLR: sets the current epoch's lr to the previous epoch's lr * gamma (0.9)

    - See https://ratsgo.github.io/nlpbook/docs/doc_cls/detail for details
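
The ExponentialLR rule described above can be worked through by hand. The sketch below assumes the scheduler steps once per epoch and reuses the settings from this notebook (learning_rate=5e-5, gamma=0.9, epochs=3); `exponential_lr` is a hypothetical helper, not the PyTorch class.

```python
def exponential_lr(initial_lr, gamma, epochs):
    """Learning rate per epoch: lr_t = initial_lr * gamma ** t."""
    return [initial_lr * gamma ** epoch for epoch in range(epochs)]


schedule = exponential_lr(initial_lr=5e-5, gamma=0.9, epochs=3)
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.3e}")
```

With these settings the learning rate goes from 5e-5 to 4.5e-5 to 4.05e-5, a gentle decay over the three training epochs.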

## 4-3 Deploying the Trained Model