# torch roberta code transcription
competition : Sentence Type Classification AI Competition (문장 유형 분류 AI 경진대회)  
notebook link : https://dacon.io/competitions/official/236037/codeshare/7334?page=1&dtype=recent

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
import random
import os

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from torch.optim import Adam

import matplotlib as mpl
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action='ignore')

In [3]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [4]:
device

device(type='cpu')

In [5]:
CFG = {
    "epochs" : 5,
    "learning_rate" : 1e-6,
    "batch_size" : 8,
    "seed" : 2023
}

In [6]:
def seed_everything(seed):
    random.seed(seed)
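    os.environ['PYTHONHASHSEED'] = str(seed)  # the variable name must be upper case to be recognized
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark mode auto-tunes kernels and can break determinism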

seed_everything(CFG['seed'])
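
As a quick sanity check of what seeding provides, the sketch below reseeds Python's `random` module twice and confirms that the same sequence comes back. It uses only the standard library; `draw` is a hypothetical helper that is not part of the notebook, and the same idea applies to the NumPy and torch generators seeded above.

```python
import random

def draw(seed, n=5):
    """Reseed the stdlib generator and return n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw(2023)
second = draw(2023)   # same seed, so the same sequence is produced
print(first == second)
```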

# data load

In [13]:
path = "C:/Users/zelo9/Desktop/ff/code/data/sentence_classification"

In [15]:
train = pd.read_csv(path + "/train.csv")
test = pd.read_csv(path + "/test.csv")

In [16]:
train.head()

Unnamed: 0,ID,문장,유형,극성,시제,확실성,label
0,TRAIN_00000,0.75%포인트 금리 인상은 1994년 이후 28년 만에 처음이다.,사실형,긍정,현재,확실,사실형-긍정-현재-확실
1,TRAIN_00001,이어 ＂앞으로 전문가들과 함께 4주 단위로 상황을 재평가할 예정＂이라며 ＂그 이전이...,사실형,긍정,과거,확실,사실형-긍정-과거-확실
2,TRAIN_00002,정부가 고유가 대응을 위해 7월부터 연말까지 유류세 인하 폭을 30%에서 37%까지...,사실형,긍정,미래,확실,사실형-긍정-미래-확실
3,TRAIN_00003,"서울시는 올해 3월 즉시 견인 유예시간 60분을 제공하겠다고 밝혔지만, 하루 만에 ...",사실형,긍정,과거,확실,사실형-긍정-과거-확실
4,TRAIN_00004,익사한 자는 사다리에 태워 거꾸로 놓고 소금으로 코를 막아 가득 채운다.,사실형,긍정,현재,확실,사실형-긍정-현재-확실


# label encoding

In [18]:
le1 = LabelEncoder()
le1 = le1.fit(train["유형"])
train["유형"] = le1.transform(train["유형"])

le2 = LabelEncoder()
le2 = le2.fit(train["극성"])
train["극성"] = le2.transform(train["극성"])

le3 = LabelEncoder()
le3 = le3.fit(train["시제"])
train["시제"]= le3.transform(train["시제"])

le4 = LabelEncoder()
le4 = le4.fit(train["확실성"])
train["확실성"] = le4.transform(train["확실성"])
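
To make the encoding concrete, here is a plain-Python stand-in for what `LabelEncoder` does: it collects the distinct labels, sorts them, and uses each label's position as its integer code. `fit_label_map` is a hypothetical helper written for illustration, not scikit-learn's implementation; the tense values are the ones visible in `train.head()`.

```python
def fit_label_map(values):
    """Map each distinct label to an integer code, in sorted order."""
    classes = sorted(set(values))
    return classes, {label: code for code, label in enumerate(classes)}

tense = ["현재", "과거", "미래", "과거", "현재"]   # tense labels seen in train.head()
classes, to_code = fit_label_map(tense)
codes = [to_code[t] for t in tense]            # analogous to le3.transform
decoded = [classes[c] for c in codes]          # analogous to le3.inverse_transform
print(classes, codes)
```

Keeping one encoder per column, as the notebook does, is what lets the predictions be mapped back to text at inference time.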

# train/val split

In [20]:
valid = train[13000:].reset_index(drop = True)
train = train[:13000].reset_index(drop = True)

train_len = len(train)
val_len = len(valid)

train_len, val_len

(13000, 3541)
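
The split above is positional: the first 13,000 rows are used for training and the remaining rows for validation. If the file happens to be ordered in some way, a seeded shuffle of the row indices is a common alternative. A minimal sketch, assuming the same sizes; `split_indices` is a hypothetical helper that is not part of the original notebook.

```python
import random

def split_indices(n_rows, n_train, seed):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)   # local generator, global state untouched
    return idx[:n_train], idx[n_train:]

train_idx, val_idx = split_indices(13000 + 3541, 13000, seed=2023)
print(len(train_idx), len(val_idx))
```

With pandas, the two index lists could be passed to `train.iloc[...]` before calling `reset_index(drop = True)`.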

# tokenizer define

In [26]:
from transformers import AutoModel, AutoTokenizer

model_name = AutoModel.from_pretrained("klue/roberta-large")
tokenizers = AutoTokenizer.from_pretrained("klue/roberta-large")

Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

# custom dataset

In [27]:
class CustomDataset(Dataset):
    def __init__(self, data, mode = "train"):
        self.dataset = data
        self.tokenizer = tokenizers
        self.mode = mode
        
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        text = self.dataset["문장"][idx]
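        # tokenizer.tokenize(text) splits the text into subword tokens
        # tokenizer(text) builds the input encoding that is fed to the model
        inputs = self.tokenizer(text, padding = "max_length", max_length = 512,
                                truncation = True, return_tensors = "pt")
        # inputs["input_ids"] holds the token ids; return_tensors="pt" adds a batch dimension, so the shape is (1, 512)
        # [0] drops that batch dimension (the sequence itself still begins with the start token and ends with the end token, followed by padding)
        input_ids = inputs["input_ids"][0]
        # attention_mask tells the model which positions are real tokens (1) and which are padding (0)
        attention_mask = inputs["attention_mask"][0]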
        
        if self.mode =="train":
            st_type = self.dataset["유형"][idx]
            st_polarity = self.dataset["극성"][idx]
            st_tense = self.dataset["시제"][idx]
            st_certainty = self.dataset["확실성"][idx]
            return input_ids, attention_mask, st_type, st_polarity, st_tense, st_certainty
        else:
            return input_ids, attention_mask
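
The effect of `padding = "max_length"` and `truncation = True` can be sketched without a tokenizer. The snippet below pads a short id list to a fixed length and builds the matching attention mask. `pad_and_mask`, the token ids, and `pad_id` are made up for illustration, and the truncation is simplified (a real tokenizer keeps the end token when it truncates).

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or pad token_ids to max_length and build the attention mask."""
    token_ids = token_ids[:max_length]                    # simplified truncation
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad              # pad on the right
    attention_mask = [1] * len(token_ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_and_mask([0, 101, 102, 2], max_length=8, pad_id=1)
print(ids)
print(mask)
```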

In [28]:
train = CustomDataset(train, mode = "train")
valid = CustomDataset(valid, mode = "train")

train_dataloader = torch.utils.data.DataLoader(train, batch_size = CFG["batch_size"], shuffle = True)
val_dataloader = torch.utils.data.DataLoader(valid, batch_size = CFG["batch_size"], shuffle = False)

# model define

In [29]:
class BaseModel(nn.Module):
    def __init__(self, dropout = 0.5):
        super(BaseModel, self).__init__()
        self.nlp_model = model_name
        
        # out_features is the number of classes each head has to predict
        
        self.type_classifier = nn.Sequential(
            nn.Dropout(p=0.3),
            nn.Linear(in_features = 1024, out_features = 4),
        )
        
        self.polarity_classifier = nn.Sequential(
            nn.Dropout(p=0.3),
            nn.Linear(in_features = 1024, out_features = 3),
        )
        
        self.tense_classifier = nn.Sequential(
            nn.Dropout(p=0.3),
            nn.Linear(in_features = 1024, out_features = 3),
        )
        
        self.certainty_classifier = nn.Sequential(
            nn.Dropout(p = 0.3),
            nn.Linear(in_features = 1024, out_features = 2),
        )
        
    def forward(self, input_id, mask):
        _, pooled_output = self.nlp_model(input_ids = input_id, attention_mask = mask, return_dict = False)
        type_output = self.type_classifier(pooled_output)
        polarity_output = self.polarity_classifier(pooled_output)
        tense_output = self.tense_classifier(pooled_output)
        certainty_output = self.certainty_classifier(pooled_output)
        
        return type_output, polarity_output, tense_output, certainty_output

# train

In [30]:
def train(model, optimizer, train_dataloader, val_dataloader, scheduler, device):
    model.to(device)
    
    criterion = {
        "type" : nn.CrossEntropyLoss().to(device),
        "polarity" : nn.CrossEntropyLoss().to(device),
        "tense" : nn.CrossEntropyLoss().to(device),
        "certainty" : nn.CrossEntropyLoss().to(device)
    }
    
    best_loss = 999999
    best_model = None
    
    for epoch in range(1, CFG["epochs"] + 1):
        model.train()
        train_loss = []
        
        for sentence, attention_mask, type_label, polarity_label, tense_label, \
            certainty_label in tqdm(iter(train_dataloader)):
            
            sentence = sentence.to(device)
            type_label = type_label.type(torch.LongTensor).to(device)
            polarity_label = polarity_label.type(torch.LongTensor).to(device)
            tense_label = tense_label.type(torch.LongTensor).to(device)
            certainty_label = certainty_label.type(torch.LongTensor).to(device)
            mask = attention_mask.to(device)
            
            optimizer.zero_grad()
            
            type_logit, polarity_logit, tense_logit, certainty_logit = model(sentence, mask)
            
            loss = 0.25 * criterion['type'](type_logit, type_label) + \
                    0.25 * criterion['polarity'](polarity_logit, polarity_label) + \
                    0.25 * criterion['tense'](tense_logit, tense_label) + \
                    0.25 * criterion['certainty'](certainty_logit, certainty_label)
            
            loss.backward()
            optimizer.step()
            
            train_loss.append(loss.item())
            
        val_loss, val_type_f1, val_polarity_f1, val_tense_f1, val_certainty_f1 = \
        validation(model,val_dataloader, criterion, device)
        print(f'Epoch : [{epoch}] Train Loss : [{np.mean(train_loss):.5f}] Val Loss : [{val_loss:.5f}] Type F1 : [{val_type_f1:.5f}] Polarity F1 : [{val_polarity_f1:.5f}] Tense F1 : [{val_tense_f1:.5f}] Certainty F1 : [{val_certainty_f1:.5f}]')
        
        if scheduler is not None:
            scheduler.step(val_loss)
            
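        if best_loss > val_loss:
            best_loss = val_loss
            best_model = model
            os.makedirs("results", exist_ok = True)  # torch.save fails if the directory does not exist
            torch.save(model, os.path.join("results", "Allineone-KOR-best-model.pth"))
            print("Model saved")
        
    # return after the epoch loop; returning inside it would stop training after the first epoch
    return best_model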
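
The training objective is an equally weighted sum of four cross-entropy terms, one per head. The sketch below reproduces that arithmetic for a single example in plain Python; `cross_entropy` and `multitask_loss` are hypothetical helpers, whereas the notebook itself uses `nn.CrossEntropyLoss`, which additionally averages over the batch.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax of the true class for one example."""
    m = max(logits)   # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def multitask_loss(head_logits, head_labels, weight=0.25):
    """Sum of per-head cross-entropies, each scaled by the same weight."""
    return sum(weight * cross_entropy(z, y)
               for z, y in zip(head_logits, head_labels))

# All-zero logits for heads with 4, 3, 3 and 2 classes, as in BaseModel
logits = [[0.0] * 4, [0.0] * 3, [0.0] * 3, [0.0] * 2]
labels = [0, 1, 2, 1]
loss = multitask_loss(logits, labels)
print(round(loss, 4))
```

With uniform logits each head contributes the log of its class count, which gives a rough reference value for the loss of an untrained model.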

In [33]:
def validation(model, val_dataloader, criterion, device):
    model.eval()
    val_loss = []
    
    type_preds, polarity_preds, tense_preds, certainty_preds = [], [], [], []
    type_labels, polarity_labels, tense_labels, certainty_labels = [], [], [], []
    
    with torch.no_grad():
        for sentence, attention_mask, type_label, polarity_label, tense_label, \
        certainty_label in tqdm(iter(val_dataloader)):
            
            sentence = sentence.to(device)
            type_label = type_label.type(torch.LongTensor).to(device)
            polarity_label = polarity_label.type(torch.LongTensor).to(device)
            tense_label = tense_label.type(torch.LongTensor).to(device)
            certainty_label = certainty_label.type(torch.LongTensor).to(device)
            mask = attention_mask.to(device)
                                     
            type_logit, polarity_logit, tense_logit, certainty_logit = model(sentence, mask)
                                     
            loss = 0.25 * criterion['type'](type_logit, type_label) + \
                    0.25 * criterion['polarity'](polarity_logit, polarity_label) + \
                    0.25 * criterion['tense'](tense_logit, tense_label) + \
                    0.25 * criterion['certainty'](certainty_logit, certainty_label)
                                     
            val_loss.append(loss.item())
                                     
            type_preds += type_logit.argmax(1).detach().cpu().numpy().tolist()
            type_labels += type_label.detach().cpu().numpy().tolist()
            
            polarity_preds += polarity_logit.argmax(1).detach().cpu().numpy().tolist()
            polarity_labels += polarity_label.detach().cpu().numpy().tolist()
            
            tense_preds += tense_logit.argmax(1).detach().cpu().numpy().tolist()
            tense_labels += tense_label.detach().cpu().numpy().tolist()
            
            certainty_preds += certainty_logit.argmax(1).detach().cpu().numpy().tolist()
            certainty_labels += certainty_label.detach().cpu().numpy().tolist()
                                     
    type_f1 = f1_score(type_labels, type_preds, average='weighted')
    polarity_f1 = f1_score(polarity_labels, polarity_preds, average='weighted')
    tense_f1 = f1_score(tense_labels, tense_preds, average='weighted')
    certainty_f1 = f1_score(certainty_labels, certainty_preds, average='weighted')
    
    return np.mean(val_loss), type_f1, polarity_f1, tense_f1, certainty_f1
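
For reference, `average='weighted'` computes one F1 score per class and averages them weighted by how often each class occurs in the true labels. The following is a minimal plain-Python sketch of that calculation, assuming integer labels; `weighted_f1` is a hypothetical helper, not scikit-learn's code.

```python
def weighted_f1(labels, preds):
    """Per-class F1 averaged with weights proportional to class support."""
    total = len(labels)
    score = 0.0
    for c in set(labels):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn                  # number of true examples of class c
        score += f1 * support / total
    return score

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because the weights follow class frequency, a rare class that is predicted poorly moves this score less than it would move a macro average.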

# run

In [34]:
model = BaseModel()
model.eval()
optimizer = torch.optim.AdamW(params = model.parameters(), lr = CFG["learning_rate"])
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode = "min", factor = 0.5, patience = 2, threshold_mode = "abs", min_lr = 1e-8, verbose = True)

infer_model = train(model, optimizer, train_dataloader, val_dataloader, scheduler, device)

  0%|                                                                             | 1/1625 [01:43<46:29:34, 103.06s/it]


KeyboardInterrupt: 
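
Since the run was interrupted before the scheduler ever stepped, a small simulation can show how the `ReduceLROnPlateau` settings above would behave. This is a simplified sketch under stated assumptions: `simulate_plateau` is a hypothetical helper, it ignores cooldown and the `eps` check, the `threshold` of 1e-4 is the PyTorch default, and the validation losses are invented.

```python
def simulate_plateau(val_losses, lr, factor=0.5, patience=2,
                     threshold=1e-4, min_lr=1e-8):
    """Return the learning rate in effect after each validation loss."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best - threshold:    # "abs" threshold mode
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:      # more than `patience` epochs without improvement
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history

# Invented validation losses that stall for three epochs
history = simulate_plateau([1.0, 1.0, 1.0, 1.0, 0.9], lr=1e-6)
print(history)
```

Under this simplified model, `patience = 2` combined with only 5 epochs leaves room for at most one reduction of the learning rate.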

# inference

In [41]:
infer_model = torch.load(os.path.join("results", "Allineone-KOR-best-model.pth"))

test = CustomDataset(test, mode = "test")
test_dataloader = torch.utils.data.DataLoader(test, batch_size= CFG['batch_size'], shuffle=False)

In [38]:
def inference(model, test_loader, device):
    model.to(device)
    model.eval()
    
    type_preds, polarity_preds, tense_preds, certainty_preds = [], [], [], []
    
    with torch.no_grad():
        for sentence, attention_mask in tqdm(test_loader):
            sentence = sentence.to(device)
            mask = attention_mask.to(device)
            
            type_logit, polarity_logit, tense_logit, certainty_logit = model(sentence, mask)
            
            type_preds += type_logit.argmax(1).detach().cpu().numpy().tolist()
            polarity_preds += polarity_logit.argmax(1).detach().cpu().numpy().tolist()
            tense_preds += tense_logit.argmax(1).detach().cpu().numpy().tolist()
            certainty_preds += certainty_logit.argmax(1).detach().cpu().numpy().tolist()
            
        return type_preds, polarity_preds, tense_preds, certainty_preds

In [42]:
# use the best checkpoint loaded above rather than the in-memory `model`
type_preds, polarity_preds, tense_preds, certainty_preds = inference(infer_model, test_dataloader, device)

  0%|                                                                                          | 0/887 [00:03<?, ?it/s]


KeyboardInterrupt: 

In [43]:
type_preds = le1.inverse_transform(type_preds)
polarity_preds = le2.inverse_transform(polarity_preds)
tense_preds = le3.inverse_transform(tense_preds)
certainty_preds = le4.inverse_transform(certainty_preds)

NameError: name 'type_preds' is not defined

In [None]:
predictions = []
for type_pred, polarity_pred, tense_pred, certainty_pred in zip(type_preds, polarity_preds, tense_preds, certainty_preds):
    predictions.append(type_pred+'-'+polarity_pred+'-'+tense_pred+'-'+certainty_pred)