# In this lecture, we load an existing pretrained model, SKT KoBERT, run further pre-training on it with masked language modeling (MLM), and then fine-tune it.

## 0. Package installation

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Install the prerequisite packages
!pip install transformers
!pip install sentencepiece

Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp38-cp38-win_amd64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.8.1 tokenizers-0.12.1 transformers-4.21.1
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-win_amd64.whl (1.1 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97


In [5]:
# Check the transformers version
import transformers
transformers.__version__

'4.21.1'

In [6]:
# Import the packages we will use
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset, RandomSampler, SequentialSampler
import os
import pandas as pd
import numpy as np


## 1. Loading and cleaning the NSMC data

In [8]:
# Download the NSMC data
# %%time
!rm -f ratings_train.txt ratings_test.txt
!wget -nc https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt
!wget -nc https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt

'rm' is not recognized as an internal or external command,
operable program or batch file.
--2022-08-21 23:09:28--  https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
ERROR: cannot verify raw.githubusercontent.com's certificate, issued by 'CN=DigiCert TLS RSA SHA256 2020 CA1,O=DigiCert Inc,C=US':
  Unable to locally verify the issuer's authority.
To connect to raw.githubusercontent.com insecurely, use `--no-check-certificate'.
--2022-08-21 23:09:28--  https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
ERROR: cannot verify raw.githubusercontent.com's

In [6]:
# Parse the raw txt data (tab-separated columns: id, document, label)
import codecs
with codecs.open("ratings_train.txt", encoding='utf-8') as f:
    data = [line.split('\t') for line in f.read().splitlines()]
    data = data[1:]   # skip the header row
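
As an alternative to splitting each line by hand, the same parsing can be sketched with Python's standard `csv` module. This is only an illustration, not the notebook's own code: the three-line `sample` string and the `parse_nsmc` helper are hypothetical, and the sketch assumes the NSMC layout of `id`, `document`, `label` separated by tabs.

```python
import csv
import io

# Hypothetical miniature sample in the NSMC layout: id <TAB> document <TAB> label.
sample = "id\tdocument\tlabel\n1\tgreat movie\t1\n2\tso boring\t0\n"

def parse_nsmc(handle):
    """Return (text, label) pairs from a tab-separated file, skipping the header row."""
    # QUOTE_NONE keeps quotation marks inside reviews as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    # Rows that do not have exactly three fields are dropped.
    return [(row[1], int(row[2])) for row in reader if len(row) == 3]

pairs = parse_nsmc(io.StringIO(sample))
print(pairs)
```

With the real file, the `io.StringIO(sample)` argument would be replaced by an open file handle, for example `open("ratings_train.txt", encoding="utf-8", newline="")`.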

In [7]:
# Convert to a DataFrame.
text_list=[]
rate_list=[]
for i in range(len(data)):
  text_list.append(data[i][1])
  rate_list.append(data[i][2])
df_train=pd.DataFrame(columns=['text','label'])
df_train['text']=text_list
df_train['label']=rate_list

In [8]:
df_train

,text,label
0,아 더빙.. 진짜 짜증나네요 목소리,0
1,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,너무재밓었다그래서보는것을추천한다,0
3,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1
...,...,...
149995,인간이 문제지.. 소는 뭔죄인가..,0
149996,평점이 너무 낮아서...,1
149997,이게 뭐요? 한국인은 거들먹거리고 필리핀 혼혈은 착하다?,0
149998,청춘 영화의 최고봉.방황과 우울했던 날들의 자화상,1


## 2. Loading the KoBERT tokenizer

In [9]:
# To use KoBERT on Colab, run the code below and upload the bundled tokenization_kobert.py
from google.colab import files
src = list(files.upload().values())[0]
open('tokenization_kobert.py','wb').write(src)


Saving tokenization_kobert.py to tokenization_kobert (1).py


10994

In [10]:
# Once the file above has been uploaded, run the code below to load the tokenizer.
from tokenization_kobert import KoBertTokenizer
model_version = 'monologg/kobert'
tokenizer = KoBertTokenizer.from_pretrained(model_version)
# This tokenizer class works only with KoBERT

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'KoBertTokenizer'.


In [11]:
# Quick test of the tokenizer we just loaded
tokenizer.tokenize('안녕하세요, 저는 서수민입니다')

['▁안', '녕', '하세요', ',', '▁저', '는', '▁서', '수', '민', '입니다']

## 3. Splitting and tokenizing the DataFrame training data

In [12]:
# Split the data: one half for further pre-training (MLM), one half for fine-tuning. This example uses only the first 2,000 reviews.

from sklearn.model_selection import train_test_split
x_mlm, x_fine, y_mlm, y_fine =train_test_split(df_train['text'][:2000], df_train['label'][:2000], test_size=0.5,
                                                  random_state=4444,stratify=df_train['label'][:2000])
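
The sample budget that results from the successive splits in this notebook can be checked with plain arithmetic. This is a minimal sketch, assuming the held-out count of a fractional `test_size` is rounded up and that the 90/10 train-validation splits use `int(0.9 * n)` as in the training-set cells. `holdout_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def holdout_sizes(n_samples, holdout_fraction):
    """Return (kept, held_out) counts for a fractional hold-out split."""
    n_holdout = math.ceil(n_samples * holdout_fraction)
    return n_samples - n_holdout, n_holdout

n_mlm, n_fine = holdout_sizes(2000, 0.5)            # MLM half vs. fine-tuning half
n_fine_train, n_test = holdout_sizes(n_fine, 0.3)   # fine-tuning set vs. test set

# 90/10 train-validation split inside each stage
mlm_train = int(0.9 * n_mlm)
fine_train = int(0.9 * n_fine_train)

print(n_mlm, n_fine)                          # 1000 1000
print(n_fine_train, n_test)                   # 700 300
print(mlm_train, n_mlm - mlm_train)           # 900 100
print(fine_train, n_fine_train - fine_train)  # 630 70
```

These counts agree with the sample counts printed by the dataset cells (900/100 for MLM, 630/70 for fine-tuning).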

In [13]:
# Further pre-training does not need label information, so we use only the text.
sentences = list(x_mlm)

In [14]:
# Print the original sentence.
print(' Original: ', sentences[2])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[2]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[2])))

 Original:  김태균은 감독으로써의 재능이 의심된다
Tokenized:  ['▁김태', '균', '은', '▁감독', '으로써', '의', '▁재능', '이', '▁의심', '된다']
Token IDs:  [1334, 5536, 7086, 786, 7081, 7095, 3975, 7096, 3632, 5900]
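
The `▁` prefix in the tokenized output marks a piece that begins a new word, which is the usual SentencePiece convention. Below is a minimal sketch of reversing the tokenization under that assumption, using the pieces printed above. `detokenize` is a hypothetical helper and not part of `KoBertTokenizer`.

```python
def detokenize(pieces):
    """Join subword pieces and turn the word-boundary marker back into spaces."""
    return "".join(pieces).replace("▁", " ").strip()

pieces = ['▁김태', '균', '은', '▁감독', '으로써', '의', '▁재능', '이', '▁의심', '된다']
restored = detokenize(pieces)
print(restored)
```

Pieces without the marker (for example `균` or `은`) attach to the preceding piece, so the round trip recovers the original spacing.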


In [15]:
# Preprocess every sentence for MLM training with the code below, then inspect the output.
examples=tokenizer(sentences, return_special_tokens_mask=True, truncation=True, max_length=128, padding='max_length', return_tensors='pt')

In [16]:
examples

{'input_ids': tensor([[   2, 1585, 6928,  ...,    1,    1,    1],
        [   2, 3969, 6269,  ...,    1,    1,    1],
        [   2, 1334, 5536,  ...,    1,    1,    1],
        ...,
        [   2, 1773, 6160,  ...,    1,    1,    1],
        [   2, 1406, 4208,  ...,    1,    1,    1],
        [   2, 5182, 7170,  ...,    1,    1,    1]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'special_tokens_mask': tensor([[1, 0, 0,  ..., 1, 1, 1],
        [1, 0, 0,  ..., 1, 1, 1],
        [1, 0, 0,  ..., 1, 1, 1],
        ...,
        [1, 0, 0,  ..., 1, 1, 1],
        [1, 0, 0,  ..., 1, 1, 1],
        [1, 0, 0,  ..., 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1

In [17]:
#input_ids
print(examples['input_ids'])

tensor([[   2, 1585, 6928,  ...,    1,    1,    1],
        [   2, 3969, 6269,  ...,    1,    1,    1],
        [   2, 1334, 5536,  ...,    1,    1,    1],
        ...,
        [   2, 1773, 6160,  ...,    1,    1,    1],
        [   2, 1406, 4208,  ...,    1,    1,    1],
        [   2, 5182, 7170,  ...,    1,    1,    1]])


In [18]:
#special_tokens_mask
#print(examples['special_tokens_mask'])

In [19]:
#attention_mask
print(examples['attention_mask'])

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
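
To make the three tensors above easier to read, here is a plain-Python sketch of how a single sentence could be turned into `input_ids`, `attention_mask`, and `special_tokens_mask`. It is an illustration only: `encode` is a hypothetical helper, and the special-token IDs (2 for `[CLS]`, 3 for `[SEP]`, 1 for `[PAD]`) are assumed from the tensors printed in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 2, 3, 1  # assumed from the printed tensors in this notebook

def encode(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then pad on the right."""
    ids = [CLS_ID] + token_ids[: max_length - 2] + [SEP_ID]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        # 1 for real tokens (including [CLS]/[SEP]), 0 for padding
        "attention_mask": [1] * n_real + [0] * n_pad,
        # 1 for [CLS], [SEP] and padding, 0 for ordinary tokens
        "special_tokens_mask": [1] + [0] * (n_real - 2) + [1] + [1] * n_pad,
    }

enc = encode([1334, 5536, 7086], max_length=8)
for key, value in enc.items():
    print(key, value)

long_enc = encode(list(range(10, 30)), max_length=8)  # 20 tokens get truncated to 6
```

The `special_tokens_mask` matters for the next step: positions marked 1 are the ones that should never be selected for masking.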


## 4. MLM Masking


In [20]:
# Once the text has been converted to token IDs, [MASK] tokens must be applied at random for MLM training. The class below handles this for us.
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling( 
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

In [21]:
# Wrap the preprocessed data from above in a list ([]) and pass it to the collator
mlm_data=data_collator([examples])

In [22]:
# Inspect the result
mlm_data

{'input_ids': tensor([[[   2, 1585, 6928,  ...,    1,    1,    1],
         [   2,    4, 6269,  ...,    1,    1,    1],
         [   2, 1334, 5536,  ...,    1,    1,    1],
         ...,
         [   2, 1773, 6160,  ...,    1,    1,    1],
         [   2, 1406, 4208,  ...,    1,    1,    1],
         [   2, 5182, 7170,  ...,    1,    1,    1]]]), 'token_type_ids': tensor([[[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]]), 'attention_mask': tensor([[[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]]), 'labels': tensor([[[-100, -100, -100,  ..., -100, -100, -100],
         [-100, 3969, -100,  ..., -100, -100, -100],
         [-100, -100, -100,  ..., -100, -100,
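
In the output above, a masked position shows `4` in `input_ids` and the original token ID in `labels`, while every other position has the label `-100`. The following plain-Python sketch shows the commonly described BERT masking rule that produces this pattern: select about 15% of the non-special tokens, then replace 80% of them with `[MASK]`, 10% with a random token, and leave 10% unchanged. `mask_tokens` is a hypothetical helper written for illustration, not the collator's actual code. The `[MASK]` ID of 4 and the vocabulary size of 8002 are read off this notebook's outputs.

```python
import random

MASK_ID = 4        # assumed from the masked input_ids printed above
VOCAB_SIZE = 8002  # assumed from the embedding shape printed for the model
IGNORE = -100      # label value for positions that do not contribute to the loss

def mask_tokens(input_ids, special_tokens_mask, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE] * len(input_ids)
    for i, (tok, special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Special tokens ([CLS], [SEP], [PAD]) are never selected.
        if special or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                            # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

example_ids = [2, 1585, 6928, 3969, 6269, 3, 1, 1]
example_special = [1, 0, 0, 0, 0, 1, 1, 1]
masked_ids, labels = mask_tokens(example_ids, example_special, rng=random.Random(0))
print(masked_ids)
print(labels)
```

Note that this notebook calls the collator once on the whole dataset, so the same masked positions are reused in every epoch. Passing the collator to a `DataLoader` as its `collate_fn` would instead draw new masks each time a batch is built.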

In [23]:
# For short sentences, lower the proportion of masked tokens

## 5. Building the training set

In [24]:
# Build the training set
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(mlm_data['input_ids'].squeeze(), mlm_data['attention_mask'].squeeze(), mlm_data['labels'].squeeze())

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(len(train_dataset)))
print('{:>5,} validation samples'.format(len(val_dataset)))

  900 training samples
  100 validation samples


In [25]:
#from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
# Adjust batch_size to what your GPU can handle
batch_size = 26

train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )
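
With 900 training samples and `batch_size = 26`, the last batch of each epoch is smaller than the others. A small sketch of the batch sizes a loader yields when incomplete batches are kept (which is the default `DataLoader` behaviour). `batch_sizes` is a hypothetical helper.

```python
def batch_sizes(n_samples, batch_size):
    """Sizes of the batches produced when the final, incomplete batch is kept."""
    full, rest = divmod(n_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

train_batches = batch_sizes(900, 26)
val_batches = batch_sizes(100, 26)
print(len(train_batches), train_batches[-1])  # 35 16
print(len(val_batches), val_batches[-1])      # 4 22
```

The training loop later averages the per-batch loss over `len(train_dataloader)`, so the smaller final batch counts as much as a full one. For a rough progress signal this is usually acceptable.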

In [26]:
# When using a GPU (device index 0)
device = torch.device(0)

## 7. Loading BERT

In [27]:
# Load SKT KoBERT
# NSP only: BertForNextSentencePrediction
# Both MLM and NSP: BertForPreTraining
from transformers import BertForMaskedLM, AdamW
model_version = 'monologg/kobert'
model = BertForMaskedLM.from_pretrained(model_version)

Some weights of BertForMaskedLM were not initialized from the model checkpoint at monologg/kobert and are newly initialized: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
# Move the model to the GPU device
model.cuda(device)

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(8002, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tru

In [29]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 202 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                   (8002, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (
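
The count of 202 named parameters can be cross-checked by hand. The sketch below is a back-of-the-envelope calculation that rests on two assumptions not shown in the output: the encoder has 12 layers, and the MLM head's decoder weight and bias are shared with other tensors, so `named_parameters()` does not list them a second time.

```python
# Assumptions (not printed in the notebook output): 12 encoder layers, and an MLM
# head whose decoder weight and bias are shared with other tensors.
EMBEDDING_TENSORS = 5    # word, position, token-type embeddings + LayerNorm weight/bias
TENSORS_PER_LAYER = 16   # the slice params[5:21] printed for the first transformer layer
NUM_LAYERS = 12
MLM_HEAD_TENSORS = 5     # head bias, dense weight/bias, LayerNorm weight/bias

total_named = EMBEDDING_TENSORS + NUM_LAYERS * TENSORS_PER_LAYER + MLM_HEAD_TENSORS
print(total_named)

# Number of scalar weights in the embedding block, from the shapes printed above.
embedding_params = 8002 * 768 + 512 * 768 + 2 * 768 + 768 + 768
print(f"{embedding_params:,}")
```

Under these assumptions the total comes to 202, which matches the number reported by the cell above.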

## 8. Setting the training parameters

In [30]:
# Optimizer setup
# Note: AdamW is a class from the huggingface library (as opposed to pytorch)
# The 'W' refers to the decoupled weight decay fix.
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )



In [31]:
# Set the number of epochs and the learning-rate schedule
from transformers import get_linear_schedule_with_warmup

epochs = 10
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
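
To see what the scheduler does to the learning rate, here is a plain-Python sketch of the multiplier that a linear schedule with warmup applies at each optimizer step. `linear_schedule_factor` is a hypothetical re-implementation for illustration, not the library function itself. The step counts assume the 900 training samples and `batch_size = 26` used in this notebook.

```python
import math

def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = math.ceil(900 / 26)   # number of batches per epoch
total_steps = steps_per_epoch * 10      # epochs = 10
base_lr = 2e-5

for step in (0, total_steps // 2, total_steps):
    factor = linear_schedule_factor(step, 0, total_steps)
    print(step, factor, base_lr * factor)
```

With `num_warmup_steps = 0` there is no ramp-up: the learning rate starts at its full value and decays linearly to zero by the last step.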

In [32]:
# Helper for formatting the elapsed training time
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string h:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
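
A short usage sketch of the helper, repeated here so the snippet runs on its own. The input values are arbitrary examples.

```python
import datetime

def format_time(elapsed):
    """Round a duration in seconds and format it via datetime.timedelta."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))

print(format_time(39.4))    # 0:00:39
print(format_time(3725.6))  # 1:02:06
print(format_time(90000))   # 1 day, 1:00:00
```

The hour field is not zero-padded, and durations of 24 hours or more gain a leading day count.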

In [33]:
# Some of these were already imported above; feel free to drop any duplicates.
import random
import pickle as pkl
import shutil
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning) 
import copy
from tqdm.notebook import tqdm_notebook

In [34]:
# Check the GPU status
!nvidia-smi

Sun Feb 27 06:43:06 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0    73W / 149W |   1006MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [35]:
# Set the save location and the random seed
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()


# Enter your own Google Drive path here
out_dir = '/content/drive/MyDrive/BERT_멘토링/BERT_further_pre_training/save_bert'
output_dir = out_dir + 'random_all'  # plain concatenation, no '/' is inserted: the path ends in 'save_bertrandom_all'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
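
Because the path is built with `+`, the folder name and the suffix run together, which is why the checkpoint directories in this notebook are named `save_bertrandom_all_epoch_N`. The sketch below contrasts that with `os.path.join` and uses `exist_ok=True` in place of the explicit existence check. It works in a temporary directory as a stand-in for the Drive folder, and every path in it is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:        # stand-in for the Drive folder
    out_dir = os.path.join(base, "save_bert")

    concatenated = out_dir + "random_all"          # what plain string concatenation builds
    joined = os.path.join(out_dir, "random_all")   # a sub-directory of save_bert instead

    os.makedirs(joined, exist_ok=True)             # no separate exists() check needed
    os.makedirs(joined, exist_ok=True)             # calling it again is harmless

    print(os.path.basename(concatenated))          # save_bertrandom_all
    print(os.path.relpath(joined, base))
    created = os.path.isdir(joined)
```

If you switch to the joined form, the paths used later to reload the checkpoint have to change accordingly.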

In [36]:
### Depending on the library version, saving the tokenizer may not work properly. This example uses a small workaround to save it.
from transformers import BertTokenizer
def save_models(out_dir):
    global tokenizer
    # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()

    output_dir = out_dir

    # Create output directory if needed
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    print("Saving model to %s" % output_dir)

    # Save a trained model, configuration and tokenizer using `save_pretrained()`.
    # They can then be reloaded using `from_pretrained()`
    model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
    model_to_save.save_pretrained(output_dir)

    tokenizer.save_vocabulary(output_dir)
    vocab_path = output_dir + "/vocab.txt"
    tokenizer = BertTokenizer(vocab_file=vocab_path, do_lower_case=True) 
    tokenizer.save_pretrained(output_dir)
    # Good practice: save your training arguments together with the trained model
    # torch.save(args, os.path.join(output_dir, 'training_args.bin'))

## 9. Running the training

In [37]:
# Start training and evaluation
# For each epoch...
for epoch_i in range(0, epochs):
  
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    t0 = time.time()

    total_train_loss = 0

    model.train()

    for step, batch in enumerate(train_dataloader):

        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        model.zero_grad()        

        train_outs = model(b_input_ids, 
                          token_type_ids=None, 
                          attention_mask=b_input_mask,
                          labels=b_labels)
        loss = train_outs['loss']
        logits = train_outs['logits']

        total_train_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_train_loss / len(train_dataloader)            
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.5f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    for step, batch in enumerate(validation_dataloader):

        # Batch layout: [0] input IDs, [1] attention mask, [2] labels
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        with torch.no_grad():        
            valid_outs= model(b_input_ids,
                             token_type_ids=None, 
                             attention_mask=b_input_mask,
                             labels=b_labels)
            
            loss = valid_outs['loss']
            logits = valid_outs['logits']

        total_eval_loss += loss.item()

    avg_val_loss = total_eval_loss / len(validation_dataloader)
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.5f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))
    
    
    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )
    
    if epoch_i > 0 and avg_val_loss < training_stats[epoch_i-1]['Valid. Loss']:
        epoch_dir = out_dir + 'random_all_epoch_' + str(epoch_i+1) + '/'
        print(epoch_dir)
        model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
        model_to_save.save_pretrained(epoch_dir)

        tokenizer.save_vocabulary(epoch_dir)
        vocab_path = epoch_dir + "vocab.txt"
        custom_tokenizer = BertTokenizer(vocab_file=vocab_path, do_lower_case=True) 
        custom_tokenizer.save_pretrained(epoch_dir)


print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...

  Average training loss: 7.79959
  Training epcoh took: 0:00:39

Running Validation...
  Validation Loss: 7.28602
  Validation took: 0:00:02

Training...

  Average training loss: 7.02180
  Training epcoh took: 0:00:39

Running Validation...
  Validation Loss: 7.23210
  Validation took: 0:00:02
/content/drive/MyDrive/BERT_멘토링/BERT_further_pre_training/save_bertrandom_all_epoch_2/

Training...

  Average training loss: 6.79875
  Training epcoh took: 0:00:39

Running Validation...
  Validation Loss: 7.17758
  Validation took: 0:00:02
/content/drive/MyDrive/BERT_멘토링/BERT_further_pre_training/save_bertrandom_all_epoch_3/

Training...

  Average training loss: 6.53698
  Training epcoh took: 0:00:39

Running Validation...
  Validation Loss: 7.11187
  Validation took: 0:00:02
/content/drive/MyDrive/BERT_멘토링/BERT_further_pre_training/save_bertrandom_all_epoch_4/

Training...

  Average training loss: 6.19101
  Training epcoh took: 0:00:39

Running Validation...
  V

In [38]:
# When pre-training, we do not select the checkpoint by lowest validation loss

# Inspect the training results
import pandas as pd

# Display floats with four decimal places.
pd.set_option('display.precision', 4)

# Create a DataFrame from our training statistics.
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')

#df_stats.to_excel('')  # To save the training results, fill in a file path here
df_stats

epoch,Training Loss,Valid. Loss,Training Time,Validation Time
1,7.7996,7.286,0:00:39,0:00:02
2,7.0218,7.2321,0:00:39,0:00:02
3,6.7988,7.1776,0:00:39,0:00:02
4,6.537,7.1119,0:00:39,0:00:02
5,6.191,7.0169,0:00:39,0:00:02
6,5.8819,6.9079,0:00:39,0:00:02
7,5.5835,6.7705,0:00:39,0:00:02
8,5.3836,6.7755,0:00:39,0:00:02
9,5.2142,6.7255,0:00:39,0:00:02
10,5.0967,6.6798,0:00:39,0:00:02
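
The training loop saves a checkpoint only when the validation loss is lower than in the previous epoch. The sketch below applies that rule to the validation losses from the table above, to show which epoch directories should exist afterwards. It is a re-computation for illustration only.

```python
# Validation losses for epochs 1..10, copied from the table above.
val_losses = [7.2860, 7.2321, 7.1776, 7.1119, 7.0169,
              6.9079, 6.7705, 6.7755, 6.7255, 6.6798]

# Rule used in the loop: from epoch 2 on, save if the loss beat the previous epoch.
saved_epochs = [i + 1 for i in range(1, len(val_losses))
                if val_losses[i] < val_losses[i - 1]]

# Epoch with the lowest validation loss overall.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1

print(saved_epochs)  # [2, 3, 4, 5, 6, 7, 9, 10]
print(best_epoch)    # 10
```

Epoch 1 is never saved because there is no earlier epoch to compare against, and epoch 8 is skipped because its loss rose slightly.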


## 10. Fine-tuning after further pre-training


In [40]:
# Load the tokenizer saved with the further pre-trained checkpoint
from tokenization_kobert import KoBertTokenizer
tokenizer = KoBertTokenizer.from_pretrained('/content/drive/MyDrive/BERT_멘토링/BERT_further_pre_training/save_bertrandom_all_epoch_6/')

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'KoBertTokenizer'.


In [41]:
# Re-split the held-out half from earlier into fine-tuning and test sets

from sklearn.model_selection import train_test_split
X_fine, x_test, Y_fine, y_test =train_test_split(x_fine, y_fine, test_size=0.3,
                                                  random_state=4444,stratify=y_fine)


In [42]:
# Convert to plain Python lists (labels as integers)
sentences = list(X_fine)
labels = list(map(int, Y_fine))

In [43]:
print(' Original: ', sentences[0])
print('Tokenized: ', tokenizer.tokenize(sentences[0]))
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))

 Original:  그냥 불법체류자 때려잡는 영화면 좋았을텐데...무슨 우상화를 만든다고 미국의 따뜻한 설정...이건 뭥미??
Tokenized:  ['▁그냥', '▁불법', '체', '류', '자', '▁때', '려', '잡', '는', '▁영화', '면', '▁좋았', '을', '텐', '데', '...', '무', '슨', '▁우', '상', '화를', '▁만든', '다', '고', '▁미국의', '▁따뜻한', '▁설정', '...', '이', '건', '▁', '뭥', '미', '??']
Token IDs:  [1189, 2496, 7436, 6107, 7147, 1844, 6060, 7176, 5760, 3394, 6198, 4208, 7088, 7621, 5850, 55, 6228, 6696, 3498, 6527, 7942, 1939, 5782, 5439, 2151, 1834, 2778, 55, 7096, 5384, 517, 0, 6255, 260]
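
One ID in the output above is `0`. Assuming that ID 0 is the vocabulary's unknown-token ID (an assumption, not something the output states), the sketch below pairs each piece with its ID to find which pieces the vocabulary could not represent. The two lists are copied from the output above.

```python
UNK_ID = 0  # assumption: ID 0 is the unknown-token ID in this vocabulary

tokens = ['▁그냥', '▁불법', '체', '류', '자', '▁때', '려', '잡', '는', '▁영화', '면',
          '▁좋았', '을', '텐', '데', '...', '무', '슨', '▁우', '상', '화를', '▁만든',
          '다', '고', '▁미국의', '▁따뜻한', '▁설정', '...', '이', '건', '▁', '뭥', '미', '??']
token_ids = [1189, 2496, 7436, 6107, 7147, 1844, 6060, 7176, 5760, 3394, 6198,
             4208, 7088, 7621, 5850, 55, 6228, 6696, 3498, 6527, 7942, 1939,
             5782, 5439, 2151, 1834, 2778, 55, 7096, 5384, 517, 0, 6255, 260]

# Pieces whose ID equals the assumed unknown-token ID.
unknown_pieces = [tok for tok, idx in zip(tokens, token_ids) if idx == UNK_ID]
unk_ratio = len(unknown_pieces) / len(token_ids)

print(unknown_pieces)
print(f"{unk_ratio:.1%} of the pieces are unknown")
```

Informal review text contains slang and typos, so checking the share of unknown pieces on a sample is a cheap way to judge how well the vocabulary covers the domain.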


## 11. Building the fine-tuning dataset

In [44]:
# Convert the text to token IDs (this differs from what we did for the MLM step); check the code.
from tqdm.notebook import tqdm_notebook
import logging
logging.basicConfig(level=logging.ERROR)

input_ids = []
attention_masks = []

# For every sentence...
for sent in tqdm_notebook(sentences):
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 128,          # Pad & truncate all sentences.
                        truncation = True,
                        padding = 'max_length',
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',    # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Print sentence 2, now as a list of IDs, together with its attention mask.
print('Original: ', sentences[2])
print('Token IDs:', input_ids[2])
print('Attention mask:', attention_masks[2])

  0%|          | 0/700 [00:00<?, ?it/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Original:  이게 어떻게 평점이 낮을수가 있지?
Token IDs: tensor([   2, 3647, 5400, 3225, 4841, 7224, 1429, 7088, 6630, 3884,  258,    3,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1])
labels: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0,

In [45]:
#from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(len(train_dataset)))
print('{:>5,} validation samples'.format(len(val_dataset)))

  630 training samples
   70 validation samples


In [46]:
#from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 16

train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

In [47]:
!nvidia-smi

Sun Feb 27 07:05:24 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    74W / 149W |   5603MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 12. Loading the further pre-trained weights for fine-tuning classification

In [49]:
from transformers import BertForSequenceClassification, AdamW, BertConfig
model = BertForSequenceClassification.from_pretrained(
    '/content/drive/MyDrive/BERT_멘토링/BERT_further_pre_training/save_bertrandom_all_epoch_6/',
    num_labels=2,
    output_attentions = True, # Whether the model returns attentions weights.
    output_hidden_states = True
)

Some weights of the model checkpoint at /content/drive/MyDrive/BERT_멘토링/BERT_further_pre_training/save_bertrandom_all_epoch_6/ were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification wer

In [50]:
model.cuda(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(8002, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementw

In [51]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                   (8002, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (

## 13. Setting training parameters: unlike the MLM stage, a few functions such as accuracy and F1 score are added

In [52]:
# Optimizer setup
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )



In [53]:
from transformers import get_linear_schedule_with_warmup
# Set the number of epochs

epochs = 3

total_steps = len(train_dataloader) * epochs

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
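
To make the schedule concrete, here is a minimal pure-Python sketch of the learning-rate multiplier that a linear warmup-then-decay schedule applies at each step. It reflects the documented behaviour of `get_linear_schedule_with_warmup` rather than reproducing the library code; `linear_lr_factor` is a hypothetical helper, and `total_steps = 300` is an assumed value chosen only for illustration.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step (illustrative helper)."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: ramp linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 300  # assumed; the notebook uses len(train_dataloader) * epochs

for step in (0, 150, 300):
    print(step, base_lr * linear_lr_factor(step, 0, total_steps))
```

With `num_warmup_steps = 0`, as in the cell above, the learning rate starts at its full value and shrinks linearly to zero by the last step.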

In [54]:
# Accuracy function
import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
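
As a quick sanity check of what `flat_accuracy` computes, the sketch below does the same thing with plain Python lists: take the argmax of each row of scores and compare it with the label. `flat_accuracy_plain` is a hypothetical stand-in, and the scores are made-up values.

```python
def flat_accuracy_plain(logits, labels):
    """Plain-Python equivalent of flat_accuracy for lists of per-class scores."""
    # The index of the largest score in each row is the predicted class.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.2, 0.4]]  # made-up scores
labels = [0, 1, 0, 0]

print(flat_accuracy_plain(logits, labels))  # 3 of 4 rows are correct
```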

In [55]:
# F1-score function
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
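
The `average='weighted'` option computes one F1 score per class and averages them weighted by each class's number of true samples. The following is a rough hand-rolled sketch of that calculation, intended only to show the arithmetic; `weighted_f1` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its number of true samples.
        total += f1 * n
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 1, 1, 1]  # made-up labels
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

print(round(weighted_f1(y_true, y_pred), 4))
```

Here class 0 has F1 = 4/6 with 3 true samples and class 1 has F1 = 0.8 with 5 true samples, so the weighted score is (2 + 4) / 8 = 0.75.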

In [56]:
# Evaluation function
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [57]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
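
A short usage sketch of `format_time`, with the helper repeated so the snippet runs on its own. The input durations are arbitrary examples.

```python
import datetime


def format_time(elapsed):
    """Takes a time in seconds and returns a string h:mm:ss."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))


print(format_time(27.6))    # 0:00:28
print(format_time(3725.2))  # 1:02:05
```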

In [58]:
# When using a GPU
device = torch.device("cuda:0")

In [61]:
# Path where the fine-tuned weights are saved
path='/content/drive/MyDrive/BERT_멘토링/BERT_further_pre_training/fine_tuning/'

## 14. Start fine-tuning

In [62]:
import random
import numpy as np

seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

training_stats = []
total_t0 = time.time()


for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0
    total_train_accuracy = 0
  
    model.train()

    for step, batch in enumerate(train_dataloader):

        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()        
        train_outs= model(b_input_ids, 
                          token_type_ids=None, 
                          attention_mask=b_input_mask, 
                          labels=b_labels)
        
        loss = train_outs['loss']
        logits = train_outs['logits']

        total_train_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        total_train_accuracy += flat_accuracy(logits, label_ids)


    avg_train_accuracy = total_train_accuracy / len(train_dataloader)
    avg_train_loss = total_train_loss / len(train_dataloader)           
    training_time = format_time(time.time() - t0)

    print("")
    print("  Accuracy: {0:.3f}".format(avg_train_accuracy))
    print("  Average training loss: {0:.3f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))

    torch.save(model, path + "movie_train_v1_" + str(epoch_i) + ".pt")  # save the entire model
    torch.save(model.state_dict(), path + "movie_train_state_dict_v1_" + str(epoch_i) + ".pt")  # save only the model's state_dict
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    for batch in validation_dataloader:
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        with torch.no_grad():        
            valid_outs = model(b_input_ids, 
                                   token_type_ids=None, 
                                   attention_mask=b_input_mask,
                                   labels=b_labels)
            
            loss = valid_outs['loss']
            logits = valid_outs['logits']

        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        total_eval_accuracy += flat_accuracy(logits, label_ids)

    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.3f}".format(avg_val_accuracy))
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.3f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    val_loss, predictions, true_vals = evaluate(validation_dataloader)
    val_f1 = f1_score_func(predictions, true_vals)
    print(("Validation loss: {0:.3f}".format(val_loss)))
    print(("F1 Score (Weighted): {0:.3f}".format(val_f1)))


    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Training Accur' : avg_train_accuracy,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Valid. f1.': val_f1,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...

  Accuracy: 0.811
  Average training loss: 0.496
  Training epcoh took: 0:00:28

Running Validation...
  Accuracy: 0.725
  Validation Loss: 0.549
  Validation took: 0:00:01
Validation loss: 0.549
F1 Score (Weighted): 0.685

Training...

  Accuracy: 0.833
  Average training loss: 0.413
  Training epcoh took: 0:00:28

Running Validation...
  Accuracy: 0.738
  Validation Loss: 0.520
  Validation took: 0:00:01
Validation loss: 0.520
F1 Score (Weighted): 0.698

Training...

  Accuracy: 0.869
  Average training loss: 0.354
  Training epcoh took: 0:00:28

Running Validation...
  Accuracy: 0.738
  Validation Loss: 0.520
  Validation took: 0:00:01
Validation loss: 0.520
F1 Score (Weighted): 0.698

Training complete!
Total training took 0:01:40 (h:mm:ss)
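
The training loop above calls `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before each optimizer step. As a rough illustration of the idea, the sketch below clips a flat list of gradient values by their global L2 norm in plain Python. `clip_by_global_norm` is a hypothetical helper, not the PyTorch implementation, and the gradient values are made up.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only shrink; gradients already within the limit are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # norm before clipping
print(clipped)  # approximately [0.6, 0.8]
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.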


In [63]:
# Check the results
import pandas as pd

# Display floats with four decimal places.
pd.set_option('display.precision', 4)

# Create a DataFrame from our training statistics.
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')

# A hack to force the column headers to wrap.
#df = df.style.set_table_styles([dict(selector="th",props=[('max-width', '70px')])])

# Display the table.
df_stats

| epoch | Training Loss | Training Accur | Valid. Loss | Valid. Accur. | Valid. f1. | Training Time | Validation Time |
|---|---|---|---|---|---|---|---|
| 1 | 0.4956 | 0.8109 | 0.5495 | 0.725 | 0.6847 | 0:00:28 | 0:00:01 |
| 2 | 0.4128 | 0.8333 | 0.52 | 0.7375 | 0.6983 | 0:00:28 | 0:00:01 |
| 3 | 0.3542 | 0.8688 | 0.52 | 0.7375 | 0.6983 | 0:00:28 | 0:00:01 |
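
One possible way to use these statistics is to pick the epoch with the lowest validation loss and map it back to the checkpoint saved for that epoch. The sketch below copies the validation columns from the table above; choosing by validation loss is an assumption for illustration, not something the notebook itself does. Checkpoint files are indexed by `epoch_i`, which starts at 0, so epoch `n` corresponds to suffix `n - 1`.

```python
# Validation statistics copied from the results table.
training_stats = [
    {"epoch": 1, "Valid. Loss": 0.5495, "Valid. f1.": 0.6847},
    {"epoch": 2, "Valid. Loss": 0.5200, "Valid. f1.": 0.6983},
    {"epoch": 3, "Valid. Loss": 0.5200, "Valid. f1.": 0.6983},
]

# min() returns the first minimum, so ties resolve to the earliest epoch.
best = min(training_stats, key=lambda s: s["Valid. Loss"])

# The training loop names files with the 0-based epoch_i.
checkpoint_name = "movie_train_state_dict_v1_" + str(best["epoch"] - 1) + ".pt"

print(best["epoch"], checkpoint_name)
```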
