## Team members
- 김장현 / Department of Computer Science and Engineering / 2019-26471
- 안형서 / Department of Electrical and Computer Engineering / 2017-12932
- 양서연 /
- 이욱재 / 

## Intro 
- Here, we train neural networks to solve four Korean language tasks ([link](https://corpus.korean.go.kr/task/taskDownload.do?taskId=1&clCd=END_TASK&subMenuId=sub02)).
- Our basic approach is to fine-tune pre-trained **Korean language models** (https://huggingface.co/models?language=ko&sort=downloads).
- We mainly use the **KLUE-RoBERTa** models from https://github.com/KLUE-benchmark/KLUE, a state-of-the-art Korean language model.
- We referred to the following sources for parts of the data processing and fine-tuning techniques.
 - Sun et al., 'How to Fine-Tune BERT for Text Classification?', https://arxiv.org/abs/1905.05583
 - https://github.com/NIKL-Team-BC/NIKL-KLUE
- For more **detailed code and experiment logs**, please refer to our [github page](https://github.com/codestella/nlp_final_project).

## Import functions

In [1]:
import os
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import transformers
import pandas
from transformers import AutoTokenizer
from transformers import RobertaModel, RobertaConfig
from transformers import AdamW
import time
import argparse

transformers.logging.set_verbosity(40) # Turn off warning
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

## 1. Yes/No questions (BoolQ)

We explored the hyperparameters and model settings below. The best-performing setting is shown in bold. For more detailed code and log files for this experiment, see our [github page](https://github.com/codestella/nlp_final_project).
- Model size: **Large** (\~85%) / Base (\~79%)
- Epochs: 10
- Warm-up: 10% of the training steps (training is unstable without it)
- Learning rate: 1e-5, **8e-6**, 5e-6 (no large difference, but training becomes unstable as the rate grows)
- Batch size: **5**, 20, 60
- Fine-tuning: **All**, Only classifier (i.e., freeze feature extractor, \~57%)
- Classifier: a linear model is sufficient (adding more layers gives little gain)
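The warm-up setting above can be illustrated with a minimal sketch, assuming a linear warm-up followed by a linear decay to zero, as in the "linear" scheduler used in the training code. The function name `linear_warmup_factor` and the step counts are hypothetical and only serve the illustration.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 8e-6                    # best learning rate from the list above
total_steps = 1000                # hypothetical: num_epochs * len(train_loader)
warmup_steps = total_steps // 10  # 10% of the training steps

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_factor(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The learning rate starts at zero, reaches its peak at the end of warm-up, and then decays. This keeps the first updates to the pre-trained weights small, which is one plausible reason training is more stable with warm-up.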

In [2]:
def load_data(path, tokenizer):
    ''' Tokenization for BoolQ data'''
    dataset = pandas.read_csv(path,
                              delimiter='\t',
                              names=['ID', 'text', 'question', 'answer'],
                              header=0)

    tokenized = tokenizer(dataset['text'].tolist(),
                          dataset['question'].tolist(),
                          padding=True,
                          truncation=True,
                          return_tensors="pt")
    dataset['label'] = torch.tensor(dataset['answer'])
    return dataset, tokenized


class Roberta(RobertaModel):
    ''' Classification layer added Roberta model'''
    def __init__(self, config, model_name):
        super(Roberta, self).__init__(config)
        self.roberta = RobertaModel.from_pretrained(model_name, config=config)
        self.hdim = config.hidden_size
        self.nclass = config.nclass
        self.classifier = nn.Linear(self.hdim, self.nclass)

    def forward(self, input_ids, attention_mask, **kwargs):
        outputs = self.roberta(input_ids, attention_mask=attention_mask)
        h = outputs[0][:, 0, :]
        logits = self.classifier(h)
        return logits

In [3]:
class TensorDataset(Dataset):
    ''' Define Torch Dataset '''
    def __init__(self, tokenized_dataset, labels):
        self.tokenized_dataset = tokenized_dataset
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.tokenized_dataset.items()}
        label = self.labels[idx]
        return item, label

    def __len__(self):
        return len(self.labels)

In [4]:
# Define tokenizer and model type
model_type = "Roberta"
size = 'large'
model_name = f"klue/roberta-{size}"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Set data path below!

In [5]:
# Data path
base_path = './data'

train_dataset, train_tokenized = load_data(os.path.join(base_path, 'SKT_BoolQ_Train.tsv'), tokenizer)
val_dataset, val_tokenized = load_data(os.path.join(base_path, 'SKT_BoolQ_Dev.tsv'), tokenizer)

train_dataset = TensorDataset(train_tokenized, train_dataset['label'])
val_dataset = TensorDataset(val_tokenized, val_dataset['label'])

# Define loader
if size == 'base':
    batch_size = 16
else:
    batch_size = 5
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [6]:
# Data example
tokenizer.decode(train_tokenized['input_ids'][0])

'[CLS] 로마 시대의 오리엔트의 범위는 제국 내에 동부 지방은 물론 제국 외부에 있는 다른 국가에 광범위하게 쓰이는 단어였다. 그 후에 로마 제국이 분열되고 서유럽이 그들의 중심적인 세계를 형성하는 과정에서 자신들을 옥시덴트 ( occident ), 서방이라 부르며 오리엔트는 이와 대조되는 문화를 가진 동방세계라는 뜻이 부가되어, 인도와 중국, 일본을 이루는 광범위한 지역을 지칭하는 단어가 되었다. [SEP] 오리엔트는 인도와 중국, 일본을 이루는 광범위한 지역을 지칭하는 단어로 쓰인다. [SEP] [PAD] [PAD] [PAD] ...

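### Training results (BoolQ)
- Repeating training 10 times with the best setting gave validation accuracies of roughly 84% \~ 87%.
- Each run takes about 1 hour (6 min/epoch).
- (Note) Because jupyter was run on a server, some print output was lost when the connection to the client dropped, but model training and saving were not affected.
- Ensembling these models yields a final validation accuracy of **88.14\%**.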
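The ensembling step can be sketched in plain Python as a majority vote over the binary predictions of the models that pass an accuracy threshold. This is a simplified illustration under that assumption; the function name `majority_vote` and the toy predictions are hypothetical and are not the actual experiment data.

```python
def majority_vote(predictions, accuracies, acc_threshold=0.85):
    """Majority vote over binary predictions of models with accuracy >= threshold."""
    kept = [p for p, a in zip(predictions, accuracies) if a >= acc_threshold]
    if not kept:
        raise ValueError("no model passed the accuracy threshold")
    voted = []
    for i in range(len(kept[0])):
        mean_vote = sum(p[i] for p in kept) / len(kept)
        voted.append(1 if mean_vote >= 0.5 else 0)  # ties go to class 1
    return voted


# Toy example: four models, four validation samples (hypothetical values)
preds = [[1, 0, 1, 1],
         [1, 1, 0, 1],
         [0, 0, 1, 1],
         [0, 0, 0, 0]]
accs = [0.87, 0.86, 0.85, 0.80]   # the last model is filtered out
answer = [1, 0, 1, 1]

ensemble = majority_vote(preds, accs)
acc = sum(int(p == a) for p, a in zip(ensemble, answer)) / len(answer)
print(ensemble, acc)
```

In this toy case no single kept model is fully correct, yet the vote is. An ensemble can beat its best member in the same way when the members make different mistakes.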

In [7]:
def train_epoch(epoch, model, train_loader, optimizer, scheduler):
    ''' One epoch fine-tuning '''
    model.train()
    total_loss = 0
    cor = 0
    n_sample = 0
    s = time.time()
    criterion = nn.CrossEntropyLoss()

    for data, target in train_loader:
        item = {key: val.to(device) for key, val in data.items()}
        target = target.to(device)

        logits = model(**item)
        loss = criterion(logits, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        with torch.no_grad():
            preds = torch.argmax(logits, dim=-1)

        total_loss += loss.item()
        cor += (preds == target).sum().item()
        n_sample += len(target)

        print(f"{cor}/{n_sample}", end='\r')

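    # Note: total_loss sums per-batch mean losses, so dividing by the number of
    # samples reports roughly (mean loss / batch size), not the per-sample loss.
    loss_avg = total_loss / n_sample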
    acc = cor / n_sample
    print(
        f"[Epoch {epoch}] Train loss: {loss_avg:.3f}, acc: {acc*100:.2f}, time: {time.time()-s:.1f}s"
    )
    return acc


def validate(epoch, model, val_loader, verbose=True):
    ''' Evaluate on validation set '''
    model.eval()
    total_loss = 0
    cor = 0
    n_sample = 0
    criterion = nn.CrossEntropyLoss()
    pred_all = []
    
    with torch.no_grad():
        for data, target in val_loader:
            item = {key: val.to(device) for key, val in data.items()}
            target = target.to(device)

            logits = model(**item)
            loss = criterion(logits, target)
            preds = torch.argmax(logits, dim=-1)
            pred_all.append(preds)

            total_loss += loss.item()
            cor += (preds == target).sum().item()
            n_sample += len(target)

    # Note: as in train_epoch, this is roughly (mean loss / batch size).
    loss_avg = total_loss / n_sample
    acc = cor / n_sample
    pred_all = torch.cat(pred_all)
    
    if verbose:
        print(f"[Epoch {epoch}] Valid loss: {loss_avg:.3f}, acc: {acc*100:.2f}")
    return acc, pred_all


def train(idx, num_epochs, lr, train_loader, val_loader, config, save_dir='./results'):
    ''' Train for multiple epochs and validate '''    
    print(f"Start trining {idx}th model")
    model = Roberta(config, model_name).to(device)
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = transformers.get_scheduler("linear",
                                           optimizer=optimizer,
                                           num_warmup_steps=num_epochs * len(train_loader) // 10,
                                           num_training_steps=num_epochs * len(train_loader))
    best_acc = 0
    for epoch in range(num_epochs):
        train_acc = train_epoch(epoch, model, train_loader, optimizer, scheduler)
        val_acc, _ = validate(epoch, model, val_loader)
        if val_acc > best_acc:
            best_acc = val_acc

            model_to_save = model.module if hasattr(model, "module") else model
            model_to_save.save_pretrained(os.path.join(save_dir, f'{idx}'))
            
    print(f"Training finish! Best validation accuracy: {best_acc*100:.2f}\n")

In [8]:
def validate_ensemble(save_dir, model_name, val_loader, answer, idx_max=10, acc_threshold=0.85):
    ''' Measure ensemble accuracy '''
    pred_ensemble = []
    for idx in range(idx_max):
        model = Roberta.from_pretrained(os.path.join(save_dir, f'{idx}'), model_name)
        model.to(device)
        acc, pred_all = validate('best', model, val_loader, verbose=False)
        print(f"Load {idx}th model (acc: {acc*100:.2f})")
        if acc >= acc_threshold:
            pred_ensemble.append(pred_all)
        
    pred_ensemble = torch.stack(pred_ensemble, dim=-1).float()
    pred_ensemble = (pred_ensemble.mean(-1) >= 0.5).long().to(answer.device)
    acc_ensemble = (pred_ensemble == answer).sum() / len(answer)
    print(f"\nEnsemble accuracy: {acc_ensemble*100:.2f}")

In [9]:
# Fine-tuning networks (10 repeat)
lr = 8e-6
num_epochs = 10
save_dir = './results_qa'

config = RobertaConfig.from_pretrained(model_name)
config.nclass = 2
for i in range(10):
    train(i, num_epochs, lr, train_loader, val_loader, config=config, save_dir=save_dir)

Start trining 0th model


Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

[Epoch 0] Train loss: 0.135, acc: 55.66, time: 346.7s
[Epoch 0] Valid loss: 0.111, acc: 72.00
[Epoch 1] Train loss: 0.086, acc: 81.53, time: 348.5s
[Epoch 1] Valid loss: 0.090, acc: 78.71
[Epoch 2] Train loss: 0.037, acc: 93.40, time: 350.1s
[Epoch 2] Valid loss: 0.094, acc: 82.57
[Epoch 3] Train loss: 0.016, acc: 97.33, time: 349.0s
[Epoch 3] Valid loss: 0.112, acc: 84.00
[Epoch 4] Train loss: 0.006, acc: 98.99, time: 347.3s
[Epoch 4] Valid loss: 0.167, acc: 84.00
[Epoch 5] Train loss: 0.004, acc: 99.40, time: 350.4s
[Epoch 5] Valid loss: 0.165, acc: 84.00
[Epoch 6] Train loss: 0.003, acc: 99.48, time: 350.7s
[Epoch 6] Valid loss: 0.144, acc: 87.00
[Epoch 7] Train loss: 0.001, acc: 99.81, time: 349.5s
[Epoch 7] Valid loss: 0.181, acc: 85.71
[Epoch 8] Train loss: 0.001, acc: 99.89, time: 353.4s
[Epoch 8] Valid loss: 0.189, acc: 85.00
[Epoch 9] Train loss: 0.001, acc: 99.89, time: 350.9s
[Epoch 9] Valid loss: 0.193, acc: 84.43
Training finish! Best validation accuracy: 87.00

Start trin

[Epoch 0] Train loss: 0.130, acc: 59.15, time: 351.2s
[Epoch 0] Valid loss: 0.101, acc: 75.86
[Epoch 1] Train loss: 0.081, acc: 82.18, time: 349.3s
[Epoch 1] Valid loss: 0.070, acc: 84.71
[Epoch 2] Train loss: 0.031, acc: 94.22, time: 350.0s
[Epoch 2] Valid loss: 0.126, acc: 84.71
[Epoch 3] Train loss: 0.013, acc: 97.93, time: 354.0s
[Epoch 3] Valid loss: 0.132, acc: 84.29
[Epoch 4] Train loss: 0.007, acc: 99.07, time: 353.1s
[Epoch 4] Valid loss: 0.152, acc: 84.57
[Epoch 5] Train loss: 0.002, acc: 99.54, time: 353.0s
[Epoch 5] Valid loss: 0.164, acc: 84.71
[Epoch 6] Train loss: 0.003, acc: 99.54, time: 352.5s
[Epoch 6] Valid loss: 0.150, acc: 86.14
[Epoch 7] Train loss: 0.002, acc: 99.73, time: 351.3s
[Epoch 7] Valid loss: 0.168, acc: 85.71
[Epoch 8] Train loss: 0.001, acc: 99.97, time: 354.4s
[Epoch 8] Valid loss: 0.179, acc: 85.29
[Epoch 9] Train loss: 0.001, acc: 99.95, time: 353.6s
[Epoch 9] Valid loss: 0.179, acc: 85.71
Training finish! Best validation accuracy: 86.14

Start trin

[Epoch 0] Train loss: 0.140, acc: 50.97, time: 353.7s
[Epoch 0] Valid loss: 0.134, acc: 59.14
[Epoch 1] Train loss: 0.104, acc: 73.29, time: 353.2s
[Epoch 1] Valid loss: 0.083, acc: 81.14
3170/3485

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[Epoch 4] Train loss: 0.006, acc: 99.13, time: 352.3s
[Epoch 4] Valid loss: 0.148, acc: 83.00
[Epoch 5] Train loss: 0.004, acc: 99.29, time: 354.9s
[Epoch 5] Valid loss: 0.143, acc: 83.43
[Epoch 6] Train loss: 0.003, acc: 99.56, time: 353.8s
[Epoch 6] Valid loss: 0.193, acc: 82.86
[Epoch 7] Train loss: 0.000, acc: 99.95, time: 356.7s
[Epoch 7] Valid loss: 0.202, acc: 84.71
[output truncated: IOPub message rate exceeded]

[Epoch 5] Train loss: 0.002, acc: 99.70, time: 353.3s
[Epoch 5] Valid loss: 0.179, acc: 85.00
[Epoch 6] Train loss: 0.002, acc: 99.59, time: 353.9s
[Epoch 6] Valid loss: 0.188, acc: 84.86
[Epoch 7] Train loss: 0.001, acc: 99.84, time: 355.2s
[Epoch 7] Valid loss: 0.184, acc: 85.00
[Epoch 8] Train loss: 0.001, acc: 99.92, time: 355.2s
[Epoch 8] Valid loss: 0.184, acc: 85.86
[output truncated: IOPub message rate exceeded]

[Epoch 2] Train loss: 0.035, acc: 94.00, time: 354.7s
[Epoch 2] Valid loss: 0.100, acc: 82.00
[Epoch 3] Train loss: 0.012, acc: 98.25, time: 357.2s
[Epoch 3] Valid loss: 0.128, acc: 83.57
[Epoch 4] Train loss: 0.007, acc: 98.80, time: 354.8s
[Epoch 4] Valid loss: 0.150, acc: 82.57
[Epoch 5] Train loss: 0.005, acc: 98.99, time: 356.3s
[Epoch 5] Valid loss: 0.187, acc: 81.43
[output truncated: IOPub message rate exceeded]

[Epoch 7] Train loss: 0.002, acc: 99.75, time: 358.5s
[Epoch 7] Valid loss: 0.181, acc: 83.57
[Epoch 8] Train loss: 0.001, acc: 99.92, time: 358.0s
[Epoch 8] Valid loss: 0.193, acc: 82.57
[Epoch 9] Train loss: 0.000, acc: 99.97, time: 356.7s
[Epoch 9] Valid loss: 0.209, acc: 81.86
Training finish! Best validation accuracy: 83.57

Start trining 5th model


[Epoch 0] Train loss: 0.140, acc: 51.24, time: 357.3s
[Epoch 0] Valid loss: 0.123, acc: 68.14
[output truncated: IOPub message rate exceeded]

[Epoch 9] Train loss: 0.001, acc: 99.95, time: 359.1s
[Epoch 9] Valid loss: 0.218, acc: 84.43
Training finish! Best validation accuracy: 84.86

Start trining 6th model


[Epoch 0] Train loss: 0.137, acc: 53.12, time: 359.3s
[Epoch 0] Valid loss: 0.098, acc: 76.57
[Epoch 1] Train loss: 0.086, acc: 81.17, time: 354.8s
[Epoch 1] Valid loss: 0.086, acc: 81.86
[Epoch 2] Train loss: 0.033, acc: 94.00, time: 358.5s
[Epoch 2] Valid loss: 0.113, acc: 81.14
[output truncated: IOPub message rate exceeded]

[Epoch 6] Train loss: 0.002, acc: 99.86, time: 356.8s
[Epoch 6] Valid loss: 0.181, acc: 84.57
[Epoch 7] Train loss: 0.001, acc: 99.89, time: 358.8s
[Epoch 7] Valid loss: 0.199, acc: 84.57
[Epoch 8] Train loss: 0.000, acc: 99.95, time: 357.7s
[Epoch 8] Valid loss: 0.205, acc: 85.29
[Epoch 9] Train loss: 0.000, acc: 100.00, time: 355.7s
[Epoch 9] Valid loss: 0.209, acc: 85.43
Training finish! Best validation accuracy: 85.43

Start trining 7th model


[output truncated: IOPub message rate exceeded]

[Epoch 1] Train loss: 0.085, acc: 80.90, time: 355.3s
[Epoch 1] Valid loss: 0.097, acc: 81.00
[Epoch 2] Train loss: 0.037, acc: 93.70, time: 356.8s
[Epoch 2] Valid loss: 0.088, acc: 84.57
[Epoch 3] Train loss: 0.013, acc: 98.25, time: 355.4s
[Epoch 3] Valid loss: 0.115, acc: 84.57
[Epoch 4] Train loss: 0.009, acc: 98.64, time: 357.4s
[Epoch 4] Valid loss: 0.121, acc: 85.00
[output truncated: IOPub message rate exceeded]

[Epoch 3] Train loss: 0.015, acc: 97.44, time: 357.1s
[Epoch 3] Valid loss: 0.130, acc: 83.71
[Epoch 4] Train loss: 0.006, acc: 99.02, time: 356.7s
[Epoch 4] Valid loss: 0.135, acc: 84.29
[Epoch 5] Train loss: 0.003, acc: 99.48, time: 356.5s
[Epoch 5] Valid loss: 0.177, acc: 84.00
[Epoch 6] Train loss: 0.001, acc: 99.95, time: 359.4s
[Epoch 6] Valid loss: 0.232, acc: 84.00
[output truncated: IOPub message rate exceeded]

[Epoch 9] Train loss: 0.001, acc: 99.86, time: 361.4s
[Epoch 9] Valid loss: 0.224, acc: 83.71
Training finish! Best validation accuracy: 84.29

Start trining 9th model


[Epoch 0] Train loss: 0.141, acc: 51.65, time: 359.3s
[Epoch 0] Valid loss: 0.153, acc: 46.57
[Epoch 1] Train loss: 0.104, acc: 74.05, time: 357.5s
[Epoch 1] Valid loss: 0.088, acc: 79.43
[Epoch 2] Train loss: 0.048, acc: 90.86, time: 356.8s
[Epoch 2] Valid loss: 0.084, acc: 82.71
[Epoch 3] Train loss: 0.017, acc: 97.57, time: 357.0s
[Epoch 3] Valid loss: 0.150, acc: 79.57
[output truncated: IOPub message rate exceeded]

[Epoch 4] Train loss: 0.007, acc: 98.83, time: 360.3s
[Epoch 4] Valid loss: 0.162, acc: 83.14
[Epoch 5] Train loss: 0.004, acc: 99.32, time: 360.5s
[Epoch 5] Valid loss: 0.185, acc: 83.57
[Epoch 6] Train loss: 0.004, acc: 99.51, time: 360.3s
[Epoch 6] Valid loss: 0.161, acc: 84.29
[Epoch 7] Train loss: 0.003, acc: 99.62, time: 360.3s
[Epoch 7] Valid loss: 0.164, acc: 83.86
[output truncated: IOPub message rate exceeded]

In [10]:
answer = torch.tensor(val_dataset.labels)
save_dir = './results_qa'

validate_ensemble(save_dir, model_name, val_loader, answer, idx_max=10, acc_threshold=0.85)

Load 0th model (acc: 87.00)
Load 1th model (acc: 86.14)
Load 2th model (acc: 85.43)
Load 3th model (acc: 85.86)
Load 4th model (acc: 83.57)
Load 5th model (acc: 84.86)
Load 6th model (acc: 85.43)
Load 7th model (acc: 85.43)
Load 8th model (acc: 84.29)
Load 9th model (acc: 84.29)

Ensemble accuracy: 88.14


## 2. Homonym disambiguation (WiC)

- The overall code skeleton and hyperparameters follow our team's BoolQ code.
- Model: KLUE-roberta-large
- Epochs: 10
- Warm-up: 10% of the training steps
- Learning rate: 8e-6
- Batch size: 4 (on Colab, anything above 4 runs out of memory)
- Classifier: 1-layer linear model (adding layers/activations has little effect)

In [1]:
#!pip install transformers  # uncomment when running on Colab
#cur_dir="./drive/MyDrive/NLP/final/"   # uncomment when running on Colab

#cur_dir="../" # alternative local path
cur_dir="./" # local path

In [2]:
import os
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import transformers
import pandas
from transformers import AutoTokenizer
from transformers import RobertaModel, RobertaConfig ,RobertaPreTrainedModel
from transformers import AdamW
import time
import argparse
from tqdm.auto import tqdm
import numpy as np
import random

transformers.logging.set_verbosity(40) # Turn off warning

In [3]:
#from google.colab import drive
#drive.mount('/content/drive') #if colab!

In [4]:
ensen_num=6

save_dir = cur_dir+'result_wic'

if not os.path.exists( save_dir):
    os.makedirs(save_dir)


In [5]:
# for testing randomness

"""
seed=5555

torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

"""

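### Load data
- The Train and Dev data must be located in base_path. (default: './data')

### Building the dataset
 - The target word must be tokenized cleanly, so we insert special tokens before and after it. This prevents the word from merging with neighboring characters during tokenization.
 - At the end of the model, only the last-layer outputs of the target word's tokens are used as input to the classifier, and the outputs of all other tokens are discarded. We therefore need a mask that is 1 at those positions and 0 elsewhere, and we build it from the special tokens.
 - Our first attempt used torch.roll (a shift operation) to mark only the position one step before the end marker of the homonym. This cannot fully cover a homonym that is split into two or more tokens.
 - We therefore use torch.cumsum (a cumulative sum) so that every position between the start and end markers of the homonym becomes 1.
 - The embedding vectors at the positions marked 1 are summed.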
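The shift-and-cumulative-sum trick described above can be traced on one toy sequence in plain Python. This is a minimal sketch, not the notebook's implementation: the token ids, the function names `span_mask` and `pooled_sum`, and the two-dimensional "hidden states" are hypothetical, and each list operation mirrors the corresponding `torch.roll` or `torch.cumsum` call applied along the sequence dimension.

```python
from itertools import accumulate


def span_mask(input_ids, start_id, end_id):
    """Return a 0/1 mask covering the tokens strictly between the two markers."""
    is_start = [int(t == start_id) for t in input_ids]
    is_end = [int(t == end_id) for t in input_ids]
    shifted = is_start[-1:] + is_start[:-1]           # like torch.roll(is_start, 1)
    delta = [s - e for s, e in zip(shifted, is_end)]  # +1 after start, -1 at end
    return list(accumulate(delta))                    # like torch.cumsum(delta)


def pooled_sum(hidden, mask):
    """Sum the hidden vectors at the positions where the mask is 1."""
    dim = len(hidden[0])
    return [sum(m * h[d] for m, h in zip(mask, hidden)) for d in range(dim)]


# Hypothetical token ids: the marked word is split into two tokens (21, 22)
CLS, SEP, START, END = 0, 2, 100, 101
ids = [CLS, 11, START, 21, 22, END, 12, SEP]

mask = span_mask(ids, START, END)
print(mask)  # [0, 0, 0, 1, 1, 0, 0, 0]

# Toy hidden states: position index in the first dimension, 1.0 in the second
hidden = [[float(i), 1.0] for i in range(len(ids))]
print(pooled_sum(hidden, mask))  # [7.0, 2.0]
```

Because the start indicator is shifted right by one and the end indicator is subtracted in place, the running sum is 1 only on the word's own tokens and excludes both markers, however many tokens the word is split into.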

In [6]:
def load_data(path, tokenizer):
    dataset = pandas.read_csv(path,
                              delimiter='\t',
                              names=['ID', 'Target', 'text1', 'text2','answer','s_s1','e_s1','s_s2','e_s2'],
                              header=0)

    dataset_text1=[]
    dataset_text2=[]

    for i,text1 in enumerate(dataset['text1']):
        text= text1[:dataset["s_s1"][i]]+"[WORD1S]"+text1[dataset["s_s1"][i]:dataset["e_s1"][i]]+"[WORD1E]"+text1[dataset["e_s1"][i]:]
        dataset_text1.append(text)

    for i,text2 in enumerate(dataset['text2']):
        text= text2[:dataset["s_s2"][i]]+"[WORD2S]"+text2[dataset["s_s2"][i]:dataset["e_s2"][i]]+"[WORD2E]"+text2[dataset["e_s2"][i]:]
        dataset_text2.append(text)
    
    word1s_tok_idx=tokenizer.encode("[WORD1S]")[1]
    word1e_tok_idx=tokenizer.encode("[WORD1E]")[1]
    word2s_tok_idx=tokenizer.encode("[WORD2S]")[1]
    word2e_tok_idx=tokenizer.encode("[WORD2E]")[1]

    tokenized = tokenizer(dataset_text1,
                          dataset_text2,
                          padding=True,
                          truncation=True,
                          return_tensors="pt")
    dataset['label'] = torch.tensor(dataset['answer'],dtype=int)

    #print(tokenized["input_ids"][0])

    #word_mask1=  torch.roll((tokenized["input_ids"]==word1e_tok_idx),-1,dims=1).int()
    #word_mask2=  torch.roll((tokenized["input_ids"]==word2e_tok_idx),-1,dims=1).int()
    
    word_mask1 = torch.roll((tokenized["input_ids"]==word1s_tok_idx),1,dims=1).int()
    word_mask1 = word_mask1 - (tokenized["input_ids"]==word1e_tok_idx).int()
    word_mask1 = torch.cumsum(word_mask1,dim=1)

    word_mask2 = torch.roll((tokenized["input_ids"]==word2s_tok_idx),1,dims=1).int()
    word_mask2 = word_mask2 - (tokenized["input_ids"]==word2e_tok_idx).int()
    word_mask2 = torch.cumsum(word_mask2,dim=1)
    

    #print(word_mask[0][0])
    #print(word_mask[1][0])

    #print( torch.sum(torch.roll((tokenized["input_ids"]==worde_tok_idx),-2,dims=1).int()-(tokenized["input_ids"]==words_tok_idx).int()) )

    return dataset, tokenized, word_mask1, word_mask2


class TensorDataset(Dataset):
    def __init__(self, tokenized_dataset, word_mask1, word_mask2, labels):
        self.tokenized_dataset = tokenized_dataset
        self.word_mask1= word_mask1
        self.word_mask2= word_mask2
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.tokenized_dataset.items()}
        item["word_mask1"]=self.word_mask1[idx]
        item["word_mask2"]=self.word_mask2[idx]
        label = self.labels[idx]
        return item, label

    def __len__(self):
        return len(self.labels)
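
The mask construction in `load_data` can be traced on a single toy sequence. The following is a minimal sketch in plain Python rather than the notebook's torch code; the token ids (`START`, `END`) and the helper name `span_mask` are made up for illustration and are not part of the original notebook.

```python
from itertools import accumulate


def span_mask(input_ids, start_id, end_id):
    """Return a 0/1 list that is 1 strictly between start_id and end_id."""
    n = len(input_ids)
    is_start = [int(t == start_id) for t in input_ids]
    is_end = [int(t == end_id) for t in input_ids]
    # Shift the start indicator right by one position; like torch.roll(..., 1),
    # the last element wraps around to the front.
    shifted = [is_start[(i - 1) % n] for i in range(n)]
    # +1 right after the start marker, -1 at the end marker
    delta = [s - e for s, e in zip(shifted, is_end)]
    # The running sum is 1 on the tokens between the markers and 0 elsewhere
    return list(accumulate(delta))


START, END = 90, 91  # hypothetical ids standing in for [WORD1S] / [WORD1E]
ids = [0, 11, 12, START, 13, 14, END, 15, 2]
mask = span_mask(ids, START, END)
print(mask)
```

Only the two tokens between the markers are selected; the markers themselves stay at 0.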

In [7]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model_type = "Roberta"
size = 'large'
model_name = f"klue/roberta-{size}"
tokenizer = AutoTokenizer.from_pretrained(model_name)

new_special_tokens=["[WORD1S]","[WORD1E]","[WORD2S]","[WORD2E]"]
tokenizer.add_special_tokens({"additional_special_tokens":new_special_tokens})
print(f"added {len(new_special_tokens)} new special token")

print(len(tokenizer))

# Add special tokens so the homonym (target word) can be told apart from the rest of the sentence

added 4 new special token
32004


In [8]:

base_path = cur_dir+'data'

train_dataset, train_tokenized, train_word_mask1, train_word_mask2 = load_data(os.path.join(base_path, 'NIKL_SKT_WiC_Train.tsv'),
                                           tokenizer)
val_dataset, val_tokenized, val_word_mask1, val_word_mask2 = load_data(os.path.join(base_path, 'NIKL_SKT_WiC_Dev.tsv'), tokenizer)

train_dataset = TensorDataset(train_tokenized, train_word_mask1, train_word_mask2, train_dataset['label'])
val_dataset = TensorDataset(val_tokenized, val_word_mask1, val_word_mask2, val_dataset['label'])

if size == 'base':
    batch_size = 16
else:
    batch_size = 4
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [9]:
# Data example
tokenizer.decode(train_tokenized['input_ids'][0])

'[CLS] 그의 죽음은 타살로 [WORD1S] 단정 [WORD1E] 이 되었다. [SEP] [WORD2S] 단정 [WORD2E] 이 된 교실은 정돈되어 있다. [SEP] [PAD] [PAD] [PAD] ...
(output truncated: the rest of the sequence consists of [PAD] tokens)

In [10]:
# Data example 2: homonym mask
train_word_mask1[0]

tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0])

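### Load pretrained model
- We use KLUE-RoBERTa as the model (https://github.com/KLUE-benchmark/KLUE).
- For each sentence, we sum the encoder outputs selected by that sentence's homonym mask. This gives two vectors of size hid_dim (the encoder's hidden size), one representing the homonym in each sentence.
- We concatenate the two vectors and feed the result to the classifier.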
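
The pooling step described above can be sketched on toy numbers. This is a dependency-free illustration rather than the model code: the hidden states, the masks, and the helper name `masked_sum` are all made up, and the notebook performs the same operation on batched torch tensors.

```python
def masked_sum(hidden, mask):
    """Sum the hidden vectors at the positions where mask is 1."""
    dim = len(hidden[0])
    out = [0.0] * dim
    for vec, m in zip(hidden, mask):
        for j in range(dim):
            out[j] += m * vec[j]
    return out


# One toy input sequence holding both sentences: 4 tokens, hidden size 2
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mask1 = [0, 1, 1, 0]  # homonym of sentence 1 spans tokens 1 and 2
mask2 = [0, 0, 0, 1]  # homonym of sentence 2 is token 3

# Concatenate the two pooled vectors into one feature of size 2 * hidden size
features = masked_sum(hidden, mask1) + masked_sum(hidden, mask2)
print(features)
```

Because the vectors are summed rather than averaged, a homonym split into several sub-word tokens contributes a larger-magnitude vector than a single-token one.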

### Model variants tried
- 1. **1-layer classifier** (92.54%)
- 2. 2-layer classifier (92.11%)
- 3. 2-layer classifier with ReLU (91.25%)
- 4. 2-layer classifier with tanh (91.42%)
- Making the classifier more complex had a negligible effect on accuracy while slowing down training.

In [11]:
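from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel  # base class for the model below


class Roberta(RobertaPreTrainedModel):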
    def __init__(self, config, model_name):
        super(Roberta, self).__init__(config)
        self.roberta = RobertaModel.from_pretrained(model_name, config=RobertaConfig.from_pretrained(model_name))
        self.roberta.resize_token_embeddings(config.new_tok_size)
        self.hdim = config.hidden_size
        self.nclass = config.nclass
        self.classifier = nn.Linear(self.hdim*2, self.nclass)
        #self.classifier = nn.Linear(self.hdim*2, self.hdim)
        #self.activation = nn.ReLU()
        #self.activation = nn.Tanh()
        #self.classifier2 = nn.Linear(self.hdim, self.nclass)

    def forward(self, input_ids, attention_mask, word_mask1, word_mask2 , **kwargs):
        outputs = self.roberta(input_ids, attention_mask=attention_mask)[0]
        # outputs: (batch_size, seq_len, hidden_size)

        # word_mask1 / word_mask2: (batch_size, seq_len), 1 on the homonym tokens and 0 elsewhere.
        # unsqueeze(2) turns each mask into (batch_size, seq_len, 1) so it broadcasts over hidden_size;
        # summing over dim=1 leaves one (batch_size, hidden_size) vector per sentence.
        h = torch.cat( [torch.sum( word_mask1.unsqueeze(2) * outputs , dim=1 ),torch.sum( word_mask2.unsqueeze(2) * outputs , dim=1 )], dim=1)
        # h: (batch_size, hidden_size*2)


        logits = self.classifier(h)

        #h1 = self.classifier(h)
        #h1_a = self.activation(h1)
        #logits = self.classifier2(h1_a)

        return logits


config = RobertaConfig.from_pretrained(model_name)
config.nclass = 2
config.new_tok_size = len(tokenizer)
print(config)

RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "nclass": 2,
  "new_tok_size": 32004,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tokenizer_class": "BertTokenizer",
  "transformers_version": "4.12.5",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32000
}



In [12]:
from tqdm.auto import tqdm  # progress bar used in the training loop


def train_epoch(epoch, model, train_loader, optimizer, scheduler):
    model.train()
    total_loss = 0
    cor = 0
    n_sample = 0
    s = time.time()
    criterion = nn.CrossEntropyLoss()

    for data, target in tqdm(train_loader):
        item = {key: val.to(device) for key, val in data.items()}
        target = target.to(device)

        logits = model(**item)
        loss = criterion(logits, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        with torch.no_grad():
            preds = torch.argmax(logits, dim=-1)

        total_loss += loss.item()
        cor += (preds == target).sum().item()
        n_sample += len(target)

        print(f"{cor}/{n_sample}", end='\r')

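    # Note: loss.item() is already a per-batch mean, so this value is roughly
    # the mean loss divided by the batch size, not the per-sample loss.
    loss_avg = total_loss / n_sample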
    acc = cor / n_sample
    print(
        f"[Epoch {epoch}] Train loss: {loss_avg:.3f}, acc: {acc*100:.2f}, time: {time.time()-s:.1f}s"
    )
    return acc


def validate(epoch, model, val_loader, verbose=True):
    model.eval()
    total_loss = 0
    cor = 0
    n_sample = 0
    criterion = nn.CrossEntropyLoss()
    pred_all = []
    
    with torch.no_grad():
        for data, target in val_loader:
            item = {key: val.to(device) for key, val in data.items()}
            target = target.to(device)

            logits = model(**item)
            loss = criterion(logits, target)
            preds = torch.argmax(logits, dim=-1)
            pred_all.append(preds)

            total_loss += loss.item()
            cor += (preds == target).sum().item()
            n_sample += len(target)

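    # Note: loss.item() is already a per-batch mean, so this value is roughly
    # the mean loss divided by the batch size, not the per-sample loss.
    loss_avg = total_loss / n_sample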
    acc = cor / n_sample
    pred_all = torch.cat(pred_all)
    
    if verbose:
        print(f"[Epoch {epoch}] Valid loss: {loss_avg:.3f}, acc: {acc*100:.2f}")
    return acc, pred_all


def train(idx, num_epochs, lr, train_loader, val_loader, tokenizer):
    print(f"Start trining {idx}th model")
    model = Roberta(config, model_name).to(device)


    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = transformers.get_scheduler("linear",
                                           optimizer=optimizer,
                                           num_warmup_steps=num_epochs * len(train_loader) // 10,
                                           num_training_steps=num_epochs * len(train_loader))
    best_acc = 0
    for epoch in range(num_epochs):
        train_acc = train_epoch(epoch, model, train_loader, optimizer, scheduler)
        val_acc, _ = validate(epoch, model, val_loader)
        if val_acc > best_acc:
            best_acc = val_acc

            model_to_save = model.module if hasattr(model, "module") else model

            if not os.path.exists( os.path.join(save_dir, f'{idx}') ):
                os.makedirs( os.path.join(save_dir, f'{idx}') )

            model_to_save.save_pretrained(os.path.join(save_dir, f'{idx}'))
            
    print(f"Training finish! Best validation accuracy: {best_acc*100:.2f}\n")
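
`train` sets the warm-up to 10% of all optimizer steps. The sketch below shows the learning-rate multiplier that a linear warm-up followed by linear decay to zero produces, assuming that is how the `"linear"` scheduler behaves. `linear_schedule_factor` is a hypothetical helper written for this illustration, not a `transformers` function, and the step counts are example values.

```python
def linear_schedule_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warm-up
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


num_epochs, steps_per_epoch, base_lr = 10, 1937, 8e-6  # example values
total_steps = num_epochs * steps_per_epoch
warmup_steps = total_steps // 10  # 10% warm-up, as in train()

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_factor(step, warmup_steps, total_steps)
    print(f"step {step:6d}: lr = {lr:.2e}")
```

With 10 epochs and a 10% warm-up, the ramp covers exactly the first epoch, and the learning rate peaks at the base value before decaying.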

In [13]:
lr = 8e-6
num_epochs = 10

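## Results
 - Each model reaches roughly 91% to 92% validation accuracy (best per run: 91.68% to 92.54%).
 - One epoch is 1937 steps and takes about 430 s.
 - We train six models and then ensemble them.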

In [14]:
for i in range(ensen_num):
    train(i, num_epochs, lr, train_loader, val_loader, tokenizer)

Start trining 0th model

[Epoch 0] Train loss: 0.127, acc: 73.70, time: 429.8s
[Epoch 0] Valid loss: 0.084, acc: 87.82
[Epoch 1] Train loss: 0.064, acc: 90.49, time: 429.7s
[Epoch 1] Valid loss: 0.071, acc: 88.16
[Epoch 2] Train loss: 0.026, acc: 96.32, time: 429.9s
[Epoch 2] Valid loss: 0.073, acc: 91.17
[Epoch 3] Train loss: 0.012, acc: 98.46, time: 430.1s
[Epoch 3] Valid loss: 0.093, acc: 90.14
[Epoch 4] Train loss: 0.007, acc: 99.01, time: 429.5s
[Epoch 4] Valid loss: 0.090, acc: 91.25
[Epoch 5] Train loss: 0.005, acc: 99.48, time: 432.8s
[Epoch 5] Valid loss: 0.106, acc: 90.91
[Epoch 6] Train loss: 0.002, acc: 99.81, time: 432.7s
[Epoch 6] Valid loss: 0.117, acc: 91.42
[Epoch 7] Train loss: 0.001, acc: 99.91, time: 432.9s
[Epoch 7] Valid loss: 0.133, acc: 92.02
[Epoch 8] Train loss: 0.000, acc: 99.99, time: 433.2s
[Epoch 8] Valid loss: 0.148, acc: 92.28
[Epoch 9] Train loss: 0.001, acc: 99.94, time: 432.8s
[Epoch 9] Valid loss: 0.142, acc: 92.28
Training finish! Best validation accuracy: 92.28

Start trining 1th model

[Epoch 0] Train loss: 0.127, acc: 72.66, time: 432.9s
[Epoch 0] Valid loss: 0.085, acc: 85.42
[Epoch 1] Train loss: 0.057, acc: 91.29, time: 433.7s
[Epoch 1] Valid loss: 0.076, acc: 88.59
[Epoch 2] Train loss: 0.021, acc: 97.11, time: 433.5s
[Epoch 2] Valid loss: 0.069, acc: 90.22
[Epoch 3] Train loss: 0.011, acc: 98.66, time: 432.9s
[Epoch 3] Valid loss: 0.103, acc: 88.85
[Epoch 4] Train loss: 0.006, acc: 99.37, time: 433.1s
[Epoch 4] Valid loss: 0.104, acc: 90.82
[Epoch 5] Train loss: 0.004, acc: 99.57, time: 433.3s
[Epoch 5] Valid loss: 0.089, acc: 90.91
[Epoch 6] Train loss: 0.002, acc: 99.81, time: 432.8s
[Epoch 6] Valid loss: 0.101, acc: 92.54
[Epoch 7] Train loss: 0.000, acc: 99.95, time: 433.1s
[Epoch 7] Valid loss: 0.130, acc: 91.60
[Epoch 8] Train loss: 0.001, acc: 99.94, time: 433.1s
[Epoch 8] Valid loss: 0.140, acc: 91.60
[Epoch 9] Train loss: 0.000, acc: 99.99, time: 433.3s
[Epoch 9] Valid loss: 0.130, acc: 92.20
Training finish! Best validation accuracy: 92.54

Start trining 2th model

[Epoch 0] Train loss: 0.130, acc: 72.73, time: 433.2s
[Epoch 0] Valid loss: 0.091, acc: 84.73
[Epoch 1] Train loss: 0.056, acc: 91.62, time: 433.4s
[Epoch 1] Valid loss: 0.071, acc: 88.34
[Epoch 2] Train loss: 0.022, acc: 97.02, time: 433.3s
[Epoch 2] Valid loss: 0.079, acc: 90.14
[Epoch 3] Train loss: 0.010, acc: 98.50, time: 433.1s
[Epoch 3] Valid loss: 0.096, acc: 90.31
[Epoch 4] Train loss: 0.006, acc: 99.16, time: 433.2s
[Epoch 4] Valid loss: 0.095, acc: 91.08
[Epoch 5] Train loss: 0.003, acc: 99.64, time: 434.0s
[Epoch 5] Valid loss: 0.121, acc: 90.48
[Epoch 6] Train loss: 0.002, acc: 99.74, time: 433.7s
[Epoch 6] Valid loss: 0.107, acc: 91.17
[Epoch 7] Train loss: 0.002, acc: 99.83, time: 433.5s
[Epoch 7] Valid loss: 0.114, acc: 91.85
[Epoch 8] Train loss: 0.000, acc: 100.00, time: 433.7s
[Epoch 8] Valid loss: 0.128, acc: 92.11
[Epoch 9] Train loss: 0.000, acc: 99.96, time: 433.2s
[Epoch 9] Valid loss: 0.134, acc: 91.94
Training finish! Best validation accuracy: 92.11

Start trining 3th model

[Epoch 0] Train loss: 0.126, acc: 73.52, time: 434.0s
[Epoch 0] Valid loss: 0.076, acc: 87.99
[Epoch 1] Train loss: 0.059, acc: 90.95, time: 433.2s
[Epoch 1] Valid loss: 0.070, acc: 89.62
[Epoch 2] Train loss: 0.023, acc: 96.77, time: 433.9s
[Epoch 2] Valid loss: 0.076, acc: 90.22
[Epoch 3] Train loss: 0.013, acc: 98.26, time: 433.5s
[Epoch 3] Valid loss: 0.132, acc: 88.25
[Epoch 4] Train loss: 0.007, acc: 99.04, time: 433.3s
[Epoch 4] Valid loss: 0.072, acc: 91.42
[Epoch 5] Train loss: 0.003, acc: 99.65, time: 434.6s
[Epoch 5] Valid loss: 0.127, acc: 90.82
[Epoch 6] Train loss: 0.002, acc: 99.73, time: 433.4s
[Epoch 6] Valid loss: 0.166, acc: 87.82
[Epoch 7] Train loss: 0.001, acc: 99.86, time: 433.6s
[Epoch 7] Valid loss: 0.122, acc: 90.74
[Epoch 8] Train loss: 0.001, acc: 99.92, time: 433.9s
[Epoch 8] Valid loss: 0.118, acc: 92.20
[Epoch 9] Train loss: 0.000, acc: 99.94, time: 433.7s
[Epoch 9] Valid loss: 0.122, acc: 92.11
Training finish! Best validation accuracy: 92.20

Start trining 4th model

[Epoch 0] Train loss: 0.126, acc: 74.42, time: 432.7s
[Epoch 0] Valid loss: 0.085, acc: 84.56
[Epoch 1] Train loss: 0.058, acc: 91.34, time: 433.5s
[Epoch 1] Valid loss: 0.064, acc: 89.88
[Epoch 2] Train loss: 0.022, acc: 96.97, time: 432.9s
[Epoch 2] Valid loss: 0.062, acc: 90.74
[Epoch 3] Train loss: 0.011, acc: 98.70, time: 433.6s
[Epoch 3] Valid loss: 0.079, acc: 91.51
[Epoch 4] Train loss: 0.007, acc: 99.06, time: 433.2s
[Epoch 4] Valid loss: 0.077, acc: 91.25
[Epoch 5] Train loss: 0.003, acc: 99.54, time: 433.6s
[Epoch 5] Valid loss: 0.106, acc: 91.08
[Epoch 6] Train loss: 0.002, acc: 99.73, time: 433.4s
[Epoch 6] Valid loss: 0.113, acc: 91.60
[Epoch 7] Train loss: 0.001, acc: 99.88, time: 432.9s
[Epoch 7] Valid loss: 0.108, acc: 92.02
[Epoch 8] Train loss: 0.000, acc: 99.94, time: 433.9s
[Epoch 8] Valid loss: 0.119, acc: 92.28
[Epoch 9] Train loss: 0.000, acc: 99.97, time: 433.4s
[Epoch 9] Valid loss: 0.125, acc: 92.28
Training finish! Best validation accuracy: 92.28

Start trining 5th model

[Epoch 0] Train loss: 0.129, acc: 72.23, time: 433.1s
[Epoch 0] Valid loss: 0.111, acc: 80.02
[Epoch 1] Train loss: 0.058, acc: 91.21, time: 433.8s
[Epoch 1] Valid loss: 0.059, acc: 90.14
[Epoch 2] Train loss: 0.022, acc: 97.11, time: 433.7s
[Epoch 2] Valid loss: 0.081, acc: 90.48
[Epoch 3] Train loss: 0.012, acc: 98.40, time: 433.1s
[Epoch 3] Valid loss: 0.108, acc: 89.79
[Epoch 4] Train loss: 0.007, acc: 99.23, time: 434.3s
[Epoch 4] Valid loss: 0.093, acc: 90.48
[Epoch 5] Train loss: 0.003, acc: 99.56, time: 433.6s
[Epoch 5] Valid loss: 0.134, acc: 90.22
[Epoch 6] Train loss: 0.002, acc: 99.73, time: 433.0s
[Epoch 6] Valid loss: 0.129, acc: 90.99
[Epoch 7] Train loss: 0.001, acc: 99.83, time: 434.0s
[Epoch 7] Valid loss: 0.120, acc: 91.60
[Epoch 8] Train loss: 0.001, acc: 99.94, time: 433.5s
[Epoch 8] Valid loss: 0.121, acc: 91.51
[Epoch 9] Train loss: 0.000, acc: 99.99, time: 433.1s
[Epoch 9] Valid loss: 0.126, acc: 91.68
Training finish! Best validation accuracy: 91.68



In [15]:
def train_from_my_model(idx, num_epochs, lr, train_loader, val_loader, tokenizer,model_saved_name):
    print(f"Start trining on {model_saved_name} directory ")
    model = Roberta.from_pretrained(os.path.join(save_dir, model_saved_name), model_name).to(device)

    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = transformers.get_scheduler("linear",
                                           optimizer=optimizer,
                                           num_warmup_steps=num_epochs * len(train_loader) // 10,
                                           num_training_steps=num_epochs * len(train_loader))
    best_acc = 0
    for epoch in range(num_epochs):
        train_acc = train_epoch(epoch, model, train_loader, optimizer, scheduler)
        val_acc, _ = validate(epoch, model, val_loader)
        if val_acc > best_acc:
            best_acc = val_acc

            model_to_save = model.module if hasattr(model, "module") else model

            if not os.path.exists( os.path.join(save_dir, f'{idx}') ):
                os.makedirs( os.path.join(save_dir, f'{idx}') )

            model_to_save.save_pretrained(os.path.join(save_dir, f'{idx}'))
            
    print(f"Training finish! Best validation accuracy: {best_acc*100:.2f}\n")


#train_from_my_model(0, num_epochs, lr, train_loader, val_loader, tokenizer,"layer2_relu_0_epoch5_91.25_seed18")

## Test models (validation: 93.31%)
- Build the final model by ensembling the six models trained above.
- Models with validation accuracy below 87% are excluded from the ensemble.
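
The ensembling rule can be sketched in plain Python on toy predictions. The helper name `majority_vote`, the example predictions, and the accuracies are made up for illustration; the notebook applies the same idea to torch tensors of binary predictions.

```python
def majority_vote(predictions, accuracies, min_acc=0.87):
    """Combine 0/1 predictions from several models by majority vote.

    predictions: one list of 0/1 labels per model (all the same length)
    accuracies:  validation accuracy of each model, used only for filtering
    """
    kept = [p for p, a in zip(predictions, accuracies) if a >= min_acc]
    if not kept:
        raise ValueError("no model passed the accuracy threshold")
    n_samples = len(kept[0])
    # Average the votes per sample; a mean of at least 0.5 predicts class 1,
    # so an exact tie goes to class 1.
    return [int(sum(p[i] for p in kept) / len(kept) >= 0.5) for i in range(n_samples)]


preds = [
    [1, 0, 1, 0],  # model A
    [1, 1, 0, 0],  # model B
    [0, 1, 1, 1],  # model C (below the threshold, ignored)
]
accs = [0.92, 0.91, 0.80]
voted = majority_vote(preds, accs)
print(voted)
```

With an even number of kept models, ties are possible and resolve to class 1 under this rule, which is worth keeping in mind when ensembling six models.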

In [17]:
def validate_tqdm(epoch, model, val_loader, verbose=True):
    model.eval()
    total_loss = 0
    cor = 0
    n_sample = 0
    criterion = nn.CrossEntropyLoss()
    pred_all = []
    
    with torch.no_grad():
        for data, target in tqdm(val_loader):
            item = {key: val.to(device) for key, val in data.items()}
            target = target.to(device)

            logits = model(**item)
            loss = criterion(logits, target)
            preds = torch.argmax(logits, dim=-1)
            pred_all.append(preds)

            total_loss += loss.item()
            cor += (preds == target).sum().item()
            n_sample += len(target)

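    # Note: loss.item() is already a per-batch mean, so this value is roughly
    # the mean loss divided by the batch size, not the per-sample loss.
    loss_avg = total_loss / n_sample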
    acc = cor / n_sample
    pred_all = torch.cat(pred_all)
    
    if verbose:
        print(f"[Epoch {epoch}] Valid loss: {loss_avg:.3f}, acc: {acc*100:.2f}")
    return acc, pred_all

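def validate_mymodel(val_loader, answer, model_saved_name):
    # Evaluate a single saved checkpoint, identified by its directory name under save_dir
    # (answer is not used here; the labels come from val_loader)
    model = Roberta.from_pretrained(os.path.join(save_dir, model_saved_name), model_name)
    model.to(device)
    acc, pred_all = validate_tqdm('best', model, val_loader, verbose=False)
    print(f"Load {model_saved_name} model (acc: {acc*100:.2f})")
    return acc, pred_all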

answer = torch.tensor(val_dataset.labels)
#validate_mymodel(val_loader, answer, "0_91.51_6epoch_seed12")

In [20]:
def validate_ensemble(val_loader, answer, idx_max=10):
    pred_ensemble = []
    for idx in range(idx_max):
        model = Roberta.from_pretrained(os.path.join(save_dir, f'{idx}'), model_name)
        model.to(device)
        acc, pred_all = validate('best', model, val_loader, verbose=False)
        print(f"Load {idx}th model (acc: {acc*100:.2f})")
        if acc >= 0.87:
            pred_ensemble.append(pred_all)
        
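    # Majority vote over the kept models: stack to (n_samples, n_models),
    # average the 0/1 predictions, and predict 1 when at least half vote 1
    # (an exact tie therefore goes to class 1).
    pred_ensemble = torch.stack(pred_ensemble, dim=-1).float()
    pred_ensemble = (pred_ensemble.mean(-1) >= 0.5).long().to(answer.device)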
    acc_ensemble = (pred_ensemble == answer).sum() / len(answer)
    print(f"\nEnsemble accuracy: {acc_ensemble*100:.2f}")

In [21]:
answer = torch.tensor(val_dataset.labels)

validate_ensemble(val_loader, answer, idx_max=ensen_num)

Load 0th model (acc: 92.28)
Load 1th model (acc: 92.54)
Load 2th model (acc: 92.11)
Load 3th model (acc: 92.20)
Load 4th model (acc: 92.28)
Load 5th model (acc: 91.68)

Ensemble accuracy: 93.31
