## Team members
- 김장현 / Department of Computer Science and Engineering / 2019-26471
- 안형서 / 
- 양서연 /
- 이욱재 / 

## Intro 
- Here, we train neural networks to solve four Korean language tasks ([link](https://corpus.korean.go.kr/task/taskDownload.do?taskId=1&clCd=END_TASK&subMenuId=sub02)).
- Our basic approach is to fine-tune pre-trained **Korean language models** (https://huggingface.co/models?language=ko&sort=downloads).
- We mainly use the **KLUE-RoBERTa** models from https://github.com/KLUE-benchmark/KLUE, a state-of-the-art Korean language model.
- We refer to the following sources for parts of the data processing and fine-tuning techniques.
 - Sun et al., 'How to Fine-Tune BERT for Text Classification?', https://arxiv.org/abs/1905.05583
 - https://github.com/NIKL-Team-BC/NIKL-KLUE
- For more **detailed code and experiment logs**, please refer to our [github page](https://github.com/codestella/nlp_final_project).

## Import functions

In [1]:
import os
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import transformers
import pandas
from transformers import AutoTokenizer
from transformers import RobertaModel, RobertaConfig
from transformers import AdamW
import time
import argparse

transformers.logging.set_verbosity(40)  # 40 = ERROR level: suppress warnings
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

## 1. Yes/No Questions (BoolQ)

We explored the hyperparameter and model settings below. The best-performing setting is shown in bold. For more detailed code and log files for this experiment, see our [github page](https://github.com/codestella/nlp_final_project).
- Model size: **Large** (\~85%) / Base (\~79%)
- Epochs: 10
- Warm-up: 10% of the training steps (training is unstable without it)
- Learning rate: 1e-5, **8e-6**, 5e-6 (little difference, but training becomes unstable as the rate grows)
- Batch size: **5**, 20, 60
- Fine-tuning: **All**, Only classifier (i.e., freeze feature extractor, \~57%)
- Classifier: a linear model is sufficient (adding more layers gives little gain)
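The warm-up setting above can be illustrated with a small sketch. It assumes the learning rate ramps linearly from 0 to its base value over the first 10% of the steps and then decays linearly to 0, which is the general shape of the `"linear"` schedule requested from `transformers.get_scheduler` later in this notebook. The function name `linear_schedule_factor` and the step counts are hypothetical and chosen only for illustration.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warm-up phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Hypothetical sizes, not taken from the experiments.
steps_per_epoch = 100
num_epochs = 10
num_training_steps = num_epochs * steps_per_epoch  # 1000 steps in total
num_warmup_steps = num_training_steps // 10        # 10% warm-up, as in the list above
base_lr = 8e-6

# Learning rate at a few selected steps.
lrs = {
    step: base_lr * linear_schedule_factor(step, num_warmup_steps, num_training_steps)
    for step in (0, 50, 100, 550, 1000)
}
```

In this sketch the rate peaks at `base_lr` exactly when warm-up ends, so the earliest updates, taken while the classifier is still randomly initialized, are the smallest ones.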

In [2]:
def load_data(path, tokenizer):
    ''' Tokenization for BoolQ data'''
    dataset = pandas.read_csv(path,
                              delimiter='\t',
                              names=['ID', 'text', 'question', 'answer'],
                              header=0)

    tokenized = tokenizer(dataset['text'].tolist(),
                          dataset['question'].tolist(),
                          padding=True,
                          truncation=True,
                          return_tensors="pt")
    dataset['label'] = torch.tensor(dataset['answer'])
    return dataset, tokenized


class Roberta(RobertaModel):
    ''' Classification layer added Roberta model'''
    def __init__(self, config, model_name):
        super(Roberta, self).__init__(config)
        self.roberta = RobertaModel.from_pretrained(model_name, config=config)
        self.hdim = config.hidden_size
        self.nclass = config.nclass
        self.classifier = nn.Linear(self.hdim, self.nclass)

    def forward(self, input_ids, attention_mask, **kwargs):
        outputs = self.roberta(input_ids, attention_mask=attention_mask)
        h = outputs[0][:, 0, :]
        logits = self.classifier(h)
        return logits

In [3]:
class TensorDataset(Dataset):
    ''' Define Torch Dataset '''
    def __init__(self, tokenized_dataset, labels):
        self.tokenized_dataset = tokenized_dataset
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.tokenized_dataset.items()}
        label = self.labels[idx]
        return item, label

    def __len__(self):
        return len(self.labels)

In [4]:
# Define tokenizer and model type
model_type = "Roberta"
size = 'large'
model_name = f"klue/roberta-{size}"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Set data path below!

In [5]:
# Data path
base_path = './data'

train_dataset, train_tokenized = load_data(os.path.join(base_path, 'SKT_BoolQ_Train.tsv'), tokenizer)
val_dataset, val_tokenized = load_data(os.path.join(base_path, 'SKT_BoolQ_Dev.tsv'), tokenizer)

train_dataset = TensorDataset(train_tokenized, train_dataset['label'])
val_dataset = TensorDataset(val_tokenized, val_dataset['label'])

# Define loader
if size == 'base':
    batch_size = 16
else:
    batch_size = 5
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [6]:
# Data example
tokenizer.decode(train_tokenized['input_ids'][0])

'[CLS] 로마 시대의 오리엔트의 범위는 제국 내에 동부 지방은 물론 제국 외부에 있는 다른 국가에 광범위하게 쓰이는 단어였다. 그 후에 로마 제국이 분열되고 서유럽이 그들의 중심적인 세계를 형성하는 과정에서 자신들을 옥시덴트 ( occident ), 서방이라 부르며 오리엔트는 이와 대조되는 문화를 가진 동방세계라는 뜻이 부가되어, 인도와 중국, 일본을 이루는 광범위한 지역을 지칭하는 단어가 되었다. [SEP] 오리엔트는 인도와 중국, 일본을 이루는 광범위한 지역을 지칭하는 단어로 쓰인다. [SEP] [PAD] [PAD] [PAD] ... (remaining [PAD] tokens omitted; output truncated)
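The long run of `[PAD]` tokens in the decoded example comes from `load_data` calling the tokenizer once on the whole split with `padding=True`, so every example is padded to the longest sequence in that split. A possible alternative, not used in this notebook, is to pad each batch only to its own longest sequence. The sketch below compares the number of tokens processed under both strategies on toy data. `PAD_ID`, `pad_to`, and `total_padded_tokens` are hypothetical names, and the sequence lengths are made up.

```python
PAD_ID = 1  # hypothetical padding id, for illustration only


def pad_to(seqs, length, pad_id):
    """Right-pad every token-id list in `seqs` to `length`."""
    return [seq + [pad_id] * (length - len(seq)) for seq in seqs]


def total_padded_tokens(seqs, batch_size, per_batch):
    """Count the tokens fed to the model over one pass through `seqs`."""
    global_max = max(len(seq) for seq in seqs)
    total = 0
    for start in range(0, len(seqs), batch_size):
        batch = seqs[start:start + batch_size]
        # Pad to the batch maximum (dynamic) or to the split-wide maximum.
        target = max(len(seq) for seq in batch) if per_batch else global_max
        padded = pad_to(batch, target, PAD_ID)
        total += sum(len(seq) for seq in padded)
    return total


# Toy token-id lists with lengths 3, 4, 10, 2, 3, 3.
toy = [[5] * n for n in (3, 4, 10, 2, 3, 3)]
global_tokens = total_padded_tokens(toy, batch_size=2, per_batch=False)
batch_tokens = total_padded_tokens(toy, batch_size=2, per_batch=True)
```

With `transformers`, per-batch padding is usually done by tokenizing without padding and passing a collator such as `DataCollatorWithPadding` to the `DataLoader`. Whether this would shorten the 6 min/epoch reported here was not measured.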

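### Training results (BoolQ)
- Ten repeated runs with the best setting gave validation accuracies of 84% \~ 87%.
- Each run takes about 1 hour (6 min/epoch).
- (Note) Jupyter was run on a server, so some printed output is missing where the connection to the client was lost. Model training and saving were not affected.
- Ensembling these models achieves a final validation accuracy of **88.14\%**.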

In [7]:
def train_epoch(epoch, model, train_loader, optimizer, scheduler):
    ''' One epoch fine-tuning '''
    model.train()
    total_loss = 0
    cor = 0
    n_sample = 0
    s = time.time()
    criterion = nn.CrossEntropyLoss()

    for data, target in train_loader:
        item = {key: val.to(device) for key, val in data.items()}
        target = target.to(device)

        logits = model(**item)
        loss = criterion(logits, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        with torch.no_grad():
            preds = torch.argmax(logits, dim=-1)

        total_loss += loss.item()
        cor += (preds == target).sum().item()
        n_sample += len(target)

        print(f"{cor}/{n_sample}", end='\r')

    loss_avg = total_loss / n_sample
    acc = cor / n_sample
    print(
        f"[Epoch {epoch}] Train loss: {loss_avg:.3f}, acc: {acc*100:.2f}, time: {time.time()-s:.1f}s"
    )
    return acc


def validate(epoch, model, val_loader, verbose=True):
    ''' Evaluate on validation set '''
    model.eval()
    total_loss = 0
    cor = 0
    n_sample = 0
    criterion = nn.CrossEntropyLoss()
    pred_all = []
    
    with torch.no_grad():
        for data, target in val_loader:
            item = {key: val.to(device) for key, val in data.items()}
            target = target.to(device)

            logits = model(**item)
            loss = criterion(logits, target)
            preds = torch.argmax(logits, dim=-1)
            pred_all.append(preds)

            total_loss += loss.item()
            cor += (preds == target).sum().item()
            n_sample += len(target)

    loss_avg = total_loss / n_sample
    acc = cor / n_sample
    pred_all = torch.cat(pred_all)
    
    if verbose:
        print(f"[Epoch {epoch}] Valid loss: {loss_avg:.3f}, acc: {acc*100:.2f}")
    return acc, pred_all


def train(idx, num_epochs, lr, train_loader, val_loader, config, save_dir='./results'):
    ''' Train for multiple epochs and validate '''    
    print(f"Start training {idx}th model")
    model = Roberta(config, model_name).to(device)
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = transformers.get_scheduler("linear",
                                           optimizer=optimizer,
                                           num_warmup_steps=num_epochs * len(train_loader) // 10,
                                           num_training_steps=num_epochs * len(train_loader))
    best_acc = 0
    for epoch in range(num_epochs):
        train_acc = train_epoch(epoch, model, train_loader, optimizer, scheduler)
        val_acc, _ = validate(epoch, model, val_loader)
        if val_acc > best_acc:
            best_acc = val_acc

            model_to_save = model.module if hasattr(model, "module") else model
            model_to_save.save_pretrained(os.path.join(save_dir, f'{idx}'))
            
    print(f"Training finish! Best validation accuracy: {best_acc*100:.2f}\n")
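A note on reading the printed losses: `nn.CrossEntropyLoss` averages over the batch by default, and `train_epoch` and `validate` sum these batch means and then divide by the number of samples. The reported value is therefore roughly the per-sample loss divided by the batch size. The sketch below reproduces this bookkeeping in plain Python for a hypothetical two-class model at chance level. The helper names are illustrative and are not part of the notebook.

```python
import math


def reported_loss(batch_mean_losses, batch_sizes):
    """Mirror the notebook's bookkeeping: sum of batch means / number of samples."""
    return sum(batch_mean_losses) / sum(batch_sizes)


def per_sample_loss(batch_mean_losses, batch_sizes):
    """Sample-weighted mean loss, shown for comparison."""
    weighted = sum(loss * n for loss, n in zip(batch_mean_losses, batch_sizes))
    return weighted / sum(batch_sizes)


# A two-class model that predicts 50/50 has cross-entropy ln(2) on every sample.
chance = math.log(2)
batch_sizes = [5] * 4            # four batches of size 5, as in the large-model setting
batch_mean_losses = [chance] * 4

reported = reported_loss(batch_mean_losses, batch_sizes)  # about ln(2) / 5
actual = per_sample_loss(batch_mean_losses, batch_sizes)  # about ln(2)
```

This is consistent with the epoch-0 train losses of about 0.13 to 0.14 in the logs below, which sit near ln(2) / 5 ≈ 0.139. Losses from runs with different batch sizes are not directly comparable under this bookkeeping.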

In [8]:
def validate_ensemble(save_dir, model_name, val_loader, answer, idx_max=10, acc_threshold=0.85):
    ''' Measure ensemble accuracy '''
    pred_ensemble = []
    for idx in range(idx_max):
        model = Roberta.from_pretrained(os.path.join(save_dir, f'{idx}'), model_name)
        model.to(device)
        acc, pred_all = validate('best', model, val_loader, verbose=False)
        print(f"Load {idx}th model (acc: {acc*100:.2f})")
        if acc >= acc_threshold:
            pred_ensemble.append(pred_all)
        
    pred_ensemble = torch.stack(pred_ensemble, dim=-1).float()
    pred_ensemble = (pred_ensemble.mean(-1) >= 0.5).long().to(answer.device)
    acc_ensemble = (pred_ensemble == answer).sum() / len(answer)
    print(f"\nEnsemble accuracy: {acc_ensemble*100:.2f}")
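The ensemble rule in `validate_ensemble` averages the 0/1 predictions of the selected models and thresholds the mean at 0.5, which amounts to a majority vote where ties go to class 1. A minimal plain-Python sketch of the same rule follows, using made-up predictions. `majority_vote` is a hypothetical helper, not part of the notebook.

```python
def majority_vote(per_model_preds):
    """Combine equal-length lists of 0/1 predictions, one list per model."""
    n_models = len(per_model_preds)
    if n_models == 0:
        # Corresponds to the case where no model passes the accuracy threshold.
        raise ValueError("no model predictions to combine")
    votes = []
    for sample_preds in zip(*per_model_preds):
        mean = sum(sample_preds) / n_models
        # Mean >= 0.5 selects class 1, so an even split resolves to class 1.
        votes.append(1 if mean >= 0.5 else 0)
    return votes


# Made-up predictions from four models on four samples.
preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
ensemble = majority_vote(preds)
```

The last two samples are 2-2 splits and both resolve to class 1. In the notebook version, an empty selection would make `torch.stack` fail, so `acc_threshold` has to be low enough that at least one model is kept.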

In [9]:
# Fine-tune the network (10 repeated runs)
lr = 8e-6
num_epochs = 10
save_dir = './results_qa'

config = RobertaConfig.from_pretrained(model_name)
config.nclass = 2
for i in range(10):
    train(i, num_epochs, lr, train_loader, val_loader, config=config, save_dir=save_dir)

Start trining 0th model


Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

[Epoch 0] Train loss: 0.135, acc: 55.66, time: 346.7s
[Epoch 0] Valid loss: 0.111, acc: 72.00
[Epoch 1] Train loss: 0.086, acc: 81.53, time: 348.5s
[Epoch 1] Valid loss: 0.090, acc: 78.71
[Epoch 2] Train loss: 0.037, acc: 93.40, time: 350.1s
[Epoch 2] Valid loss: 0.094, acc: 82.57
[Epoch 3] Train loss: 0.016, acc: 97.33, time: 349.0s
[Epoch 3] Valid loss: 0.112, acc: 84.00
[Epoch 4] Train loss: 0.006, acc: 98.99, time: 347.3s
[Epoch 4] Valid loss: 0.167, acc: 84.00
[Epoch 5] Train loss: 0.004, acc: 99.40, time: 350.4s
[Epoch 5] Valid loss: 0.165, acc: 84.00
[Epoch 6] Train loss: 0.003, acc: 99.48, time: 350.7s
[Epoch 6] Valid loss: 0.144, acc: 87.00
[Epoch 7] Train loss: 0.001, acc: 99.81, time: 349.5s
[Epoch 7] Valid loss: 0.181, acc: 85.71
[Epoch 8] Train loss: 0.001, acc: 99.89, time: 353.4s
[Epoch 8] Valid loss: 0.189, acc: 85.00
[Epoch 9] Train loss: 0.001, acc: 99.89, time: 350.9s
[Epoch 9] Valid loss: 0.193, acc: 84.43
Training finish! Best validation accuracy: 87.00

Start trin

[Epoch 0] Train loss: 0.130, acc: 59.15, time: 351.2s
[Epoch 0] Valid loss: 0.101, acc: 75.86
[Epoch 1] Train loss: 0.081, acc: 82.18, time: 349.3s
[Epoch 1] Valid loss: 0.070, acc: 84.71
[Epoch 2] Train loss: 0.031, acc: 94.22, time: 350.0s
[Epoch 2] Valid loss: 0.126, acc: 84.71
[Epoch 3] Train loss: 0.013, acc: 97.93, time: 354.0s
[Epoch 3] Valid loss: 0.132, acc: 84.29
[Epoch 4] Train loss: 0.007, acc: 99.07, time: 353.1s
[Epoch 4] Valid loss: 0.152, acc: 84.57
[Epoch 5] Train loss: 0.002, acc: 99.54, time: 353.0s
[Epoch 5] Valid loss: 0.164, acc: 84.71
[Epoch 6] Train loss: 0.003, acc: 99.54, time: 352.5s
[Epoch 6] Valid loss: 0.150, acc: 86.14
[Epoch 7] Train loss: 0.002, acc: 99.73, time: 351.3s
[Epoch 7] Valid loss: 0.168, acc: 85.71
[Epoch 8] Train loss: 0.001, acc: 99.97, time: 354.4s
[Epoch 8] Valid loss: 0.179, acc: 85.29
[Epoch 9] Train loss: 0.001, acc: 99.95, time: 353.6s
[Epoch 9] Valid loss: 0.179, acc: 85.71
Training finish! Best validation accuracy: 86.14

Start trin

[Epoch 0] Train loss: 0.140, acc: 50.97, time: 353.7s
[Epoch 0] Valid loss: 0.134, acc: 59.14
[Epoch 1] Train loss: 0.104, acc: 73.29, time: 353.2s
[Epoch 1] Valid loss: 0.083, acc: 81.14
3170/3485

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[Epoch 4] Train loss: 0.006, acc: 99.13, time: 352.3s
[Epoch 4] Valid loss: 0.148, acc: 83.00
[Epoch 5] Train loss: 0.004, acc: 99.29, time: 354.9s
[Epoch 5] Valid loss: 0.143, acc: 83.43
[Epoch 6] Train loss: 0.003, acc: 99.56, time: 353.8s
[Epoch 6] Valid loss: 0.193, acc: 82.86
[Epoch 7] Train loss: 0.000, acc: 99.95, time: 356.7s
[Epoch 7] Valid loss: 0.202, acc: 84.71
[output omitted: IOPub message rate exceeded]

[Epoch 5] Train loss: 0.002, acc: 99.70, time: 353.3s
[Epoch 5] Valid loss: 0.179, acc: 85.00
[Epoch 6] Train loss: 0.002, acc: 99.59, time: 353.9s
[Epoch 6] Valid loss: 0.188, acc: 84.86
[Epoch 7] Train loss: 0.001, acc: 99.84, time: 355.2s
[Epoch 7] Valid loss: 0.184, acc: 85.00
[Epoch 8] Train loss: 0.001, acc: 99.92, time: 355.2s
[Epoch 8] Valid loss: 0.184, acc: 85.86
[output omitted: IOPub message rate exceeded]

[Epoch 2] Train loss: 0.035, acc: 94.00, time: 354.7s
[Epoch 2] Valid loss: 0.100, acc: 82.00
[Epoch 3] Train loss: 0.012, acc: 98.25, time: 357.2s
[Epoch 3] Valid loss: 0.128, acc: 83.57
[Epoch 4] Train loss: 0.007, acc: 98.80, time: 354.8s
[Epoch 4] Valid loss: 0.150, acc: 82.57
[Epoch 5] Train loss: 0.005, acc: 98.99, time: 356.3s
[Epoch 5] Valid loss: 0.187, acc: 81.43
[output omitted: IOPub message rate exceeded]

[Epoch 7] Train loss: 0.002, acc: 99.75, time: 358.5s
[Epoch 7] Valid loss: 0.181, acc: 83.57
[Epoch 8] Train loss: 0.001, acc: 99.92, time: 358.0s
[Epoch 8] Valid loss: 0.193, acc: 82.57
[Epoch 9] Train loss: 0.000, acc: 99.97, time: 356.7s
[Epoch 9] Valid loss: 0.209, acc: 81.86
Training finish! Best validation accuracy: 83.57

Start trining 5th model


[Epoch 0] Train loss: 0.140, acc: 51.24, time: 357.3s
[Epoch 0] Valid loss: 0.123, acc: 68.14
[output omitted: IOPub message rate exceeded]

[Epoch 9] Train loss: 0.001, acc: 99.95, time: 359.1s
[Epoch 9] Valid loss: 0.218, acc: 84.43
Training finish! Best validation accuracy: 84.86

Start trining 6th model


[Epoch 0] Train loss: 0.137, acc: 53.12, time: 359.3s
[Epoch 0] Valid loss: 0.098, acc: 76.57
[Epoch 1] Train loss: 0.086, acc: 81.17, time: 354.8s
[Epoch 1] Valid loss: 0.086, acc: 81.86
[Epoch 2] Train loss: 0.033, acc: 94.00, time: 358.5s
[Epoch 2] Valid loss: 0.113, acc: 81.14
[output omitted: IOPub message rate exceeded]

[Epoch 6] Train loss: 0.002, acc: 99.86, time: 356.8s
[Epoch 6] Valid loss: 0.181, acc: 84.57
[Epoch 7] Train loss: 0.001, acc: 99.89, time: 358.8s
[Epoch 7] Valid loss: 0.199, acc: 84.57
[Epoch 8] Train loss: 0.000, acc: 99.95, time: 357.7s
[Epoch 8] Valid loss: 0.205, acc: 85.29
[Epoch 9] Train loss: 0.000, acc: 100.00, time: 355.7s
[Epoch 9] Valid loss: 0.209, acc: 85.43
Training finish! Best validation accuracy: 85.43

Start trining 7th model


[output omitted: IOPub message rate exceeded]

[Epoch 1] Train loss: 0.085, acc: 80.90, time: 355.3s
[Epoch 1] Valid loss: 0.097, acc: 81.00
[Epoch 2] Train loss: 0.037, acc: 93.70, time: 356.8s
[Epoch 2] Valid loss: 0.088, acc: 84.57
[Epoch 3] Train loss: 0.013, acc: 98.25, time: 355.4s
[Epoch 3] Valid loss: 0.115, acc: 84.57
[Epoch 4] Train loss: 0.009, acc: 98.64, time: 357.4s
[Epoch 4] Valid loss: 0.121, acc: 85.00
[output omitted: IOPub message rate exceeded]

[Epoch 3] Train loss: 0.015, acc: 97.44, time: 357.1s
[Epoch 3] Valid loss: 0.130, acc: 83.71
[Epoch 4] Train loss: 0.006, acc: 99.02, time: 356.7s
[Epoch 4] Valid loss: 0.135, acc: 84.29
[Epoch 5] Train loss: 0.003, acc: 99.48, time: 356.5s
[Epoch 5] Valid loss: 0.177, acc: 84.00
[Epoch 6] Train loss: 0.001, acc: 99.95, time: 359.4s
[Epoch 6] Valid loss: 0.232, acc: 84.00
[output omitted: IOPub message rate exceeded]

[Epoch 9] Train loss: 0.001, acc: 99.86, time: 361.4s
[Epoch 9] Valid loss: 0.224, acc: 83.71
Training finish! Best validation accuracy: 84.29

Start trining 9th model


[Epoch 0] Train loss: 0.141, acc: 51.65, time: 359.3s
[Epoch 0] Valid loss: 0.153, acc: 46.57
[Epoch 1] Train loss: 0.104, acc: 74.05, time: 357.5s
[Epoch 1] Valid loss: 0.088, acc: 79.43
[Epoch 2] Train loss: 0.048, acc: 90.86, time: 356.8s
[Epoch 2] Valid loss: 0.084, acc: 82.71
[Epoch 3] Train loss: 0.017, acc: 97.57, time: 357.0s
[Epoch 3] Valid loss: 0.150, acc: 79.57
[output omitted: IOPub message rate exceeded]

[Epoch 4] Train loss: 0.007, acc: 98.83, time: 360.3s
[Epoch 4] Valid loss: 0.162, acc: 83.14
[Epoch 5] Train loss: 0.004, acc: 99.32, time: 360.5s
[Epoch 5] Valid loss: 0.185, acc: 83.57
[Epoch 6] Train loss: 0.004, acc: 99.51, time: 360.3s
[Epoch 6] Valid loss: 0.161, acc: 84.29
[Epoch 7] Train loss: 0.003, acc: 99.62, time: 360.3s
[Epoch 7] Valid loss: 0.164, acc: 83.86
[output omitted: IOPub message rate exceeded]

In [10]:
answer = torch.tensor(val_dataset.labels)
save_dir = './results_qa'

validate_ensemble(save_dir, model_name, val_loader, answer, idx_max=10, acc_threshold=0.85)

Load 0th model (acc: 87.00)
Load 1th model (acc: 86.14)
Load 2th model (acc: 85.43)
Load 3th model (acc: 85.86)
Load 4th model (acc: 83.57)
Load 5th model (acc: 84.86)
Load 6th model (acc: 85.43)
Load 7th model (acc: 85.43)
Load 8th model (acc: 84.29)
Load 9th model (acc: 84.29)

Ensemble accuracy: 88.14
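As a quick check of which models enter the ensemble, the sketch below applies `acc_threshold=0.85` to the per-model accuracies printed above. It uses the rounded percentages from the log rather than the exact values computed inside `validate_ensemble`, so it is only an approximation of the actual selection.

```python
# Validation accuracies (%) as printed in the log above, indexed by model.
reported_acc = [87.00, 86.14, 85.43, 85.86, 83.57, 84.86, 85.43, 85.43, 84.29, 84.29]
acc_threshold = 0.85

# Keep the models whose accuracy reaches the threshold.
selected = [idx for idx, acc in enumerate(reported_acc) if acc / 100 >= acc_threshold]
n_selected = len(selected)

# With an even number of voters, a 50/50 split can occur on some samples.
tie_possible = n_selected % 2 == 0
```

Under this reading six of the ten models are kept. Because that count is even, some samples may receive a 3-3 split, which the `>= 0.5` rule in `validate_ensemble` resolves to class 1.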
