## Pre-training - Classification
In this notebook we examine our first pre-training technique. We train transformer models on the Mandalina emoji dataset that we collected. The training has two objectives.

The first is the Masked Language Model (MLM) objective, which is also used in BERT.

The second is emoji classification: we remove the emojis from the tweets in *the Mandalina emoji dataset* and, for each tweet, predict the category of the emoji it contained from among 8 classes.

Categories
* laughing = kahkaha

😀 😃 😄 😁 😆 😅 🤣 😂
* smiling = gülücük

🙂 🙃 😉 😊 😇
* affection = ilgi

🥰 😍 🤩 😘 😗 😚 😙
* tongue = dil

😛 😜 🤪 😝 🤑
* neutral = nötr

🤐 🤨 😐 😑 😶
* unwell = hasta

😷 🤒 🤕 🤢 🤮 🤧 🥵 🥶 🥴 😵
* concerned = endişeli

😕 😟 🙁 ☹️ 😮 😯 😲 😳 🥺 😦 😧 😨 😰 😥 😢 😭 😖 😣 😞 😓 😩 😫
* negative = negatif

😤 😡 😠 🤬 😈 👿
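
The category list above can be written as a lookup table. The following is a minimal sketch, not code from the notebook: `EMOJI_CATEGORIES`, `category_of` and `strip_emojis` are hypothetical names, and how the original dataset handled tweets with several emojis is not stated, so this sketch simply returns the first table entry that occurs in the text.

```python
# Hypothetical lookup table built from the category list above.
EMOJI_CATEGORIES = {
    "laughing": "😀 😃 😄 😁 😆 😅 🤣 😂",
    "smiling": "🙂 🙃 😉 😊 😇",
    "affection": "🥰 😍 🤩 😘 😗 😚 😙",
    "tongue": "😛 😜 🤪 😝 🤑",
    "neutral": "🤐 🤨 😐 😑 😶",
    "unwell": "😷 🤒 🤕 🤢 🤮 🤧 🥵 🥶 🥴 😵",
    "concerned": "😕 😟 🙁 ☹️ 😮 😯 😲 😳 🥺 😦 😧 😨 😰 😥 😢 😭 😖 😣 😞 😓 😩 😫",
    "negative": "😤 😡 😠 🤬 😈 👿",
}

# Invert the table so that each emoji maps to its category.
EMOJI_TO_CATEGORY = {
    emoji: category
    for category, emojis in EMOJI_CATEGORIES.items()
    for emoji in emojis.split()
}


def category_of(text):
    """Return the category of the first table entry that occurs in text, or None."""
    for emoji, category in EMOJI_TO_CATEGORY.items():
        if emoji in text:
            return category
    return None


def strip_emojis(text):
    """Remove every listed emoji from text and collapse leftover whitespace."""
    for emoji in EMOJI_TO_CATEGORY:
        text = text.replace(emoji, "")
    return " ".join(text.split())


tweet = "bugün hava çok güzel 😂"
print(category_of(tweet))
print(strip_emojis(tweet))
```

The stripped text would be the model input and the category the classification target.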

***

Note: Training this model takes a long time, so using a good GPU is recommended.

Required libraries

In [1]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, accuracy_score
from transformers import AdamW, AutoConfig, AutoModelForPreTraining, AutoTokenizer
import torch
import numpy as np
import random
import math
import pandas as pd
from tqdm.notebook import tqdm
from collections import Counter
from random import shuffle
from random import random as rand
from torch.nn import CrossEntropyLoss
import sys
from datetime import datetime
import os
from random import randrange

In [2]:
learning_rate = 0.00003
wd = 0.1
loss_multiplier = 5
transformer_name = "distilberturk"

# Define the ids of the CLS and SEP tokens in the transformer's vocabulary
cls_tag = 101
sep_tag = 102
if transformer_name=="distilberturk":
    cls_tag = 2
    sep_tag = 3

embedding_layer = True
last_free_layer = 0

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
print(torch.cuda.device_count())
print(torch.cuda.is_available())

cuda:0
1
True


In [3]:
# Mandalina emoji dataset
df = pd.read_csv("../Veriler/emoji.csv", index_col=0)
all_labels = sorted(list(set(df["emoji class"])))

In [4]:
if transformer_name=="mbert":
    config = AutoConfig.from_pretrained('bert-base-multilingual-cased')
    config.output_hidden_states = True
elif transformer_name=="distilberturk":
    config = AutoConfig.from_pretrained('dbmdz/distilbert-base-turkish-cased')
    config.output_hidden_states = True

Defining the model with the Masked Language Model head attached

In [5]:
class Combined_Net(nn.Module):
    def __init__(self):
        super(Combined_Net, self).__init__()
        # Load the pretrained weights directly. return_dict=False keeps the
        # tuple outputs that forward() unpacks below.
        if transformer_name=="distilberturk":
            self.net_bert = AutoModelForPreTraining.from_pretrained('dbmdz/distilbert-base-turkish-cased', output_hidden_states=True, return_dict=False)
        elif transformer_name=="mbert":
            self.net_bert = AutoModelForPreTraining.from_pretrained('bert-base-multilingual-cased', output_hidden_states=True, return_dict=False)
        unfrozen_layers = ["cls", "pooler", "vocab"]
        if embedding_layer:
            unfrozen_layers.append('embedding')
        
        for idx in range(last_free_layer, 12):
            if transformer_name=="distilberturk":
                unfrozen_layers.append('transformer.layer.'+str(idx))
            elif transformer_name=="mbert":
                unfrozen_layers.append('encoder.layer.'+str(idx))
            
        print(unfrozen_layers)
        for name, param in self.net_bert.named_parameters():
            if not any([layer in name for layer in unfrozen_layers]):
                print("[FROZE]: %s" % name)
                param.requires_grad = False
            else:
                print("[FREE]: %s" % name)
                param.requires_grad = True

        self.fc1 = nn.Linear(768, 8)

    def forward(self, input_ids, input_attention, input_types, input_ids_masked, input_attention_masked, input_types_masked):
        if transformer_name=="mbert":
            _, _, x  = self.net_bert(input_ids, attention_mask=input_attention, token_type_ids=input_types)
            probs, _, _  = self.net_bert(input_ids_masked, attention_mask=input_attention_masked, token_type_ids=input_types_masked)
        elif transformer_name=="distilberturk":
            _, x  = self.net_bert(input_ids, attention_mask=input_attention)
            probs, _  = self.net_bert(input_ids_masked, attention_mask=input_attention_masked)
    
        # Take the first ([CLS]) position of the last hidden layer
        x = x[-1][:,0,:]

        x = self.fc1(x)
        return x, probs
    

def weight_reset(m):
    if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
        m.reset_parameters()

try:
    combined_net.apply(weight_reset)
    print('Combined Net reset.')
except NameError:
    # combined_net does not exist yet the first time this cell runs
    pass

combined_net = Combined_Net().to(device)
print('Combined Net defined.')

['cls', 'pooler', 'vocab', 'embedding', 'transformer.layer.0', 'transformer.layer.1', 'transformer.layer.2', 'transformer.layer.3', 'transformer.layer.4', 'transformer.layer.5', 'transformer.layer.6', 'transformer.layer.7', 'transformer.layer.8', 'transformer.layer.9', 'transformer.layer.10', 'transformer.layer.11']
[FREE]: distilbert.embeddings.word_embeddings.weight
[FREE]: distilbert.embeddings.position_embeddings.weight
[FREE]: distilbert.embeddings.LayerNorm.weight
[FREE]: distilbert.embeddings.LayerNorm.bias
[FREE]: distilbert.transformer.layer.0.attention.q_lin.weight
[FREE]: distilbert.transformer.layer.0.attention.q_lin.bias
[FREE]: distilbert.transformer.layer.0.attention.k_lin.weight
[FREE]: distilbert.transformer.layer.0.attention.k_lin.bias
[FREE]: distilbert.transformer.layer.0.attention.v_lin.weight
[FREE]: distilbert.transformer.layer.0.attention.v_lin.bias
[FREE]: distilbert.transformer.layer.0.attention.out_lin.weight
[FREE]: distilbert.transformer.layer.0.attention.o
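
The freezing rule used above is a plain substring match on parameter names. The sketch below isolates that rule without loading a model; `build_patterns` and `is_trainable` are hypothetical helper names, the first two parameter names are copied from the log above, and the layer-5 name is an assumed name that follows the same pattern.

```python
def build_patterns(last_free_layer, embedding_layer, prefix="transformer.layer."):
    """Rebuild the list of substrings that mark a parameter as trainable."""
    patterns = ["cls", "pooler", "vocab"]
    if embedding_layer:
        patterns.append("embedding")
    for idx in range(last_free_layer, 12):
        patterns.append(prefix + str(idx))
    return patterns


def is_trainable(name, patterns):
    """A parameter stays trainable if any pattern occurs in its name."""
    return any(pattern in name for pattern in patterns)


names = [
    "distilbert.embeddings.word_embeddings.weight",
    "distilbert.transformer.layer.0.attention.q_lin.weight",
    "distilbert.transformer.layer.5.attention.q_lin.weight",  # assumed name
]

# Settings used in the notebook: everything is trainable.
for name in names:
    print("[FREE]" if is_trainable(name, build_patterns(0, True)) else "[FROZE]", name)

# Alternative settings: freeze the embeddings and layers 0 to 3.
for name in names:
    print("[FREE]" if is_trainable(name, build_patterns(4, False)) else "[FROZE]", name)
```

With `last_free_layer = 0` and `embedding_layer = True`, as in this notebook, nothing is frozen, which matches the `[FREE]` lines in the log.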

In [6]:
if transformer_name=="distilberturk":
    tokenizer = AutoTokenizer.from_pretrained('dbmdz/distilbert-base-turkish-cased')
elif transformer_name=="mbert":
    tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

Automatically building the training data for the Masked Language Model, and the feature extraction code

In [7]:
tokenizer_vocabs = list(tokenizer.vocab.keys())
def to_masked(original_sent):

    max_pred = 20
    mask_prob = 0.15
    max_len = 256
    first_part = list(tokenizer.encode(original_sent, add_special_tokens=False))
    feature = [cls_tag]+first_part+[sep_tag]
    if len(feature)>256:
        print('Error in ',original_sent)
        feature = feature[:255]+[sep_tag]
    mid_index = len(first_part)+2
    input_type = [0]*mid_index+(len(feature)-mid_index)*[1]
    
    masked_tokens = []
    masked_pos = []
    tokens = tokenizer.convert_ids_to_tokens(feature)
    input_mask = [1]*len(tokens)
    n_pred = min(max_pred, max(1, int(round(len(tokens)*mask_prob))))
    cand_pos = [i for i, token in enumerate(tokens)
                if token != '[CLS]' and token != '[SEP]' and token != '[unused5]']
    shuffle(cand_pos)

    for pos in cand_pos[:n_pred]:
        masked_tokens.append(tokens[pos])
        masked_pos.append(pos)
        if rand() < 0.8: # 80%
            tokens[pos] = '[MASK]'
        elif rand() < 0.5: # 10%
            tokens[pos] = tokenizer_vocabs[random.randrange(len(tokenizer_vocabs))]

            
#     masked_weights = [1]*len(masked_tokens)
    
    masked_lm_labels = [-100]*max_len
    for pos in masked_pos:
        masked_lm_labels[pos] = feature[pos]
    # Token indexing: the tokens are already word pieces, so map them straight to ids
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
#     masked_ids = tokenizer.encode(masked_tokens, add_special_tokens=False)
    n_pad = max_len - len(input_ids)
    input_ids.extend([0]*n_pad)
    input_mask.extend([0]*n_pad)
    input_type.extend([0]*n_pad)
    
        # Zero Padding for masked target
#     if max_pred > n_pred:
#         n_pad = max_pred - n_pred
#         masked_ids.extend([0]*n_pad)
#         masked_pos.extend([0]*n_pad)
#         masked_weights.extend([0]*n_pad)

    return torch.tensor(input_ids), torch.tensor(input_mask), torch.tensor(input_type), torch.tensor(masked_lm_labels)

def to_masked_all(X):
    ids = []
    mask = []
    types = []
    labels = []
    for _, original_sent in X:
        try:
            input_ids, input_mask, input_type, masked_lm_labels = to_masked(original_sent)
        except Exception:
            # Show the offending sentence, then re-raise with the original traceback
            print(original_sent)
            raise
        ids.append(input_ids)
        mask.append(input_mask)
        types.append(input_type)
        labels.append(masked_lm_labels)
    return torch.stack(ids),torch.stack(mask),torch.stack(types),torch.stack(labels)

def to_id(text):
    ids_1 = tokenizer.encode(text, add_special_tokens=False)
    return torch.tensor([cls_tag]+ids_1+[sep_tag])

def feat_ext(data):
    features = []
    attention_masks = []
    type_ids = []
    max_len = 256
    for input_ids, _ in data:
        first_ind = list(input_ids).index(sep_tag)+1
        if transformer_name=="distilberturk":
            type_id = []
        elif transformer_name=="mbert":
            type_id = torch.cat((torch.LongTensor([0]*(first_ind)), torch.LongTensor([1]*(len(input_ids)-first_ind)),torch.LongTensor([0]*(max_len-len(input_ids)))), 0)
        attention_mask = torch.cat((torch.tensor([1.0]*(len(input_ids))), torch.tensor([0.0]*(max_len-len(input_ids)))), 0)
        input_ids = torch.cat((input_ids, torch.tensor([0]*(max_len-len(input_ids)))), 0)        
        attention_masks.append(attention_mask)
        features.append(input_ids)
        type_ids.append(torch.tensor(type_id))
    return torch.stack(features),torch.stack(attention_masks),torch.stack(type_ids)

def feat_ext_batch(subset_df):
    X = []
    y = []
    ct = 0
    for idx, row in subset_df.iterrows():
        id_seq = to_id(row['text'])
        if len(id_seq)>=256:
            ct+=1
            continue
        X.append((id_seq, row['text']))
        y.append(all_labels.index(row['emoji class']))
    X_feat, X_attention, X_types = feat_ext(X)
    X_feat_masked, X_attention_masked, X_types_masked, X_masked_lm_labels = to_masked_all(X)
    y = torch.tensor(y)
    return X_feat, X_attention, X_types, X_feat_masked, X_attention_masked, X_types_masked, X_masked_lm_labels, y
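
The masking step in `to_masked` follows the usual BERT recipe: pick about 15% of the non-special tokens (at least 1, at most 20), then replace 80% of them with `[MASK]`, 10% with a random vocabulary token, and leave 10% unchanged. The sketch below shows that recipe on plain string tokens. It is a simplified illustration, not the notebook's function: `mask_tokens` is a hypothetical name, the toy vocabulary is made up, and labels are `None` instead of `-100` at positions that are not predicted.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, max_pred=20, rng=random):
    """Return (masked tokens, labels); labels are None where nothing is predicted."""
    special = {"[CLS]", "[SEP]"}
    n_pred = min(max_pred, max(1, int(round(len(tokens) * mask_prob))))
    candidates = [i for i, token in enumerate(tokens) if token not in special]
    rng.shuffle(candidates)

    masked = list(tokens)
    labels = [None] * len(tokens)
    for pos in candidates[:n_pred]:
        labels[pos] = tokens[pos]      # target: the original token
        draw = rng.random()
        if draw < 0.8:                 # 80%: replace with [MASK]
            masked[pos] = "[MASK]"
        elif draw < 0.9:               # 10%: replace with a random vocabulary token
            masked[pos] = rng.choice(vocab)
        # remaining 10%: keep the original token
    return masked, labels


rng = random.Random(0)
tokens = ["[CLS]", "bugün", "hava", "çok", "güzel", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab=["ev", "su", "yol"], rng=rng)
print(masked)
print(labels)
```

The notebook draws two random numbers (`rand() < 0.8`, then `rand() < 0.5`) instead of one, which gives the same 80/10/10 split.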

Splitting the emoji dataset into training, validation and test sets

In [8]:
msk = np.random.rand(len(df)) < 0.95
train_df = df[msk]
test_df = df[~msk].reset_index(drop=True)
msk = np.random.rand(len(train_df)) < 0.94
val_df = train_df[~msk].reset_index(drop=True)
train_df = train_df[msk].reset_index(drop=True)
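
The two random masks above imply the following expected split proportions. This is a small worked calculation, assuming the thresholds 0.95 and 0.94 from the cell; the variable names are illustrative, and the actual set sizes vary from run to run because the masks are random and unseeded.

```python
first_keep = 0.95   # threshold of the first mask: rows kept for train + validation
second_keep = 0.94  # threshold of the second mask: rows kept for train

expected = {
    "train": first_keep * second_keep,
    "validation": first_keep * (1 - second_keep),
    "test": 1 - first_keep,
}

for name, fraction in expected.items():
    print(f"{name}: {fraction:.3f}")
# train: 0.893
# validation: 0.057
# test: 0.050
```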

Choosing the optimizer and the loss functions

In [9]:
criterion = CrossEntropyLoss()
lm_criterion = CrossEntropyLoss()

optimizer = AdamW(combined_net.parameters(), lr=learning_rate,  correct_bias=False, weight_decay=wd)

Code that returns the predictions for the language model and the emoji classification

In [10]:
def get_predictions(df):
    batch_size = 4
    lm_loss = 0
    cl_loss = 0
    # Evaluate with dropout disabled, then restore the previous train/eval mode
    was_training = combined_net.training
    combined_net.eval()
    with torch.no_grad():
        outputs = torch.tensor([], device='cpu')
        y_test = torch.tensor([], dtype=torch.long, device='cpu')
        n_batches = math.ceil(len(df)/batch_size)
        for idx in tqdm(range(n_batches), total=n_batches):
            inputs_0, input_attention, input_type, inputs_0_m, input_attention_m, input_type_m, input_masked_labels, y_test_sub = feat_ext_batch(df[idx*batch_size: min(len(df), (idx+1)*batch_size)])

            inputs_0 = inputs_0.to(device)
            input_attention = input_attention.to(device)
            input_type = input_type.to(device)
            inputs_0_m = inputs_0_m.to(device)
            input_attention_m = input_attention_m.to(device)
            input_type_m = input_type_m.to(device)

            o, probs = combined_net(inputs_0, input_attention, input_type, inputs_0_m, input_attention_m, input_type_m) 
            outputs = torch.cat((outputs, o.to('cpu')), 0)
            y_test = torch.cat((y_test, y_test_sub), 0)
            
            # Both losses are summed over batches (not averaged)
            lm_loss += lm_criterion(probs.to('cpu').view(-1, config.vocab_size), input_masked_labels.view(-1))
            cl_loss += criterion(o.to('cpu'), y_test_sub)
        _, predicted_test = torch.max(outputs, 1)
        total = y_test.size(0)
        correct = (predicted_test == y_test).sum().item()
        test_acc = correct/total
    combined_net.train(was_training)

    # scikit-learn expects the true labels first, then the predictions
    test_f1 = f1_score(y_test.cpu(), predicted_test.cpu(), average="weighted")
    return test_f1, test_acc, lm_loss, cl_loss

In [11]:
best_val_acc = 0
batch_size = 4
best_val_loss = np.inf
accumulation_steps = 64
for epoch in range(10):
    
    running_loss = 0.0
    running_cl_loss = 0.0
    running_lm_loss = 0.0
    total_loss = 0
    total_cl_loss = 0.0
    total_lm_loss = 0.0
    total = 0
    correct = 0
    train_df = train_df.sample(frac=1).reset_index(drop=True)
    test_df = test_df.sample(frac=1).reset_index(drop=True)
    val_df = val_df.sample(frac=1).reset_index(drop=True)
    train_outputs = torch.LongTensor([]).to(device)
    for idx in range(math.ceil(len(train_df)/batch_size)):
        inputs_0, input_attention, input_type, inputs_0_m, input_attention_m, input_type_m, input_masked_labels, labels = feat_ext_batch(train_df[idx*batch_size: min(len(train_df), (idx+1)*batch_size)])
        inputs_0 = inputs_0.to(device)
        input_attention = input_attention.to(device)
        input_type = input_type.to(device)
        inputs_0_m = inputs_0_m.to(device)
        input_attention_m = input_attention_m.to(device)
        input_type_m = input_type_m.to(device)
        input_masked_labels = input_masked_labels.to(device)
        labels = labels.to(device)
        
        outputs, probs = combined_net(inputs_0, input_attention, input_type, inputs_0_m, input_attention_m, input_type_m)
        
        lm_loss = lm_criterion(probs.view(-1, config.vocab_size), input_masked_labels.view(-1)) / accumulation_steps 

        _, predicted = torch.max(outputs.data, 1)
        correct += (predicted == labels).sum().item()
        total+= len(labels)
        train_outputs = torch.cat((train_outputs, predicted), 0)
        # forward + backward + optimize
        cl_loss = criterion(outputs, labels) / accumulation_steps 
        
        loss = loss_multiplier*cl_loss + lm_loss
        loss.backward()
        
        if (idx+1) % accumulation_steps == 0:             # Wait for several backward steps
            optimizer.step()                            # Now we can do an optimizer step
            optimizer.zero_grad()                           # Reset gradients tensors
#             scheduler.step()
        
        

        # print statistics
        running_loss += loss.item()
        running_cl_loss += cl_loss.item()
        running_lm_loss += lm_loss.item()
        if (idx+1) % accumulation_steps == 0:    # print once per accumulation window
            print('[%d_%d-%d, %5d/%d] loss: %.3f lm_loss: %.3f cl_loss: %.3f  |  accuracy: %.3f' %
                  (epoch + 1, (idx + 1)//accumulation_steps,(idx + 1)%accumulation_steps, idx + 1, len(train_df)//batch_size, running_loss, running_lm_loss, running_cl_loss, correct/total))
            total_loss += running_loss
            total_cl_loss += running_cl_loss
            total_lm_loss += running_lm_loss
            running_loss = 0.0
            running_cl_loss = 0 
            running_lm_loss = 0
        

            
    train_acc = correct/total
    print(train_acc)
    
    test_f1, test_acc, test_lm_loss, test_cl_loss = get_predictions(test_df)
    test_loss = (loss_multiplier*test_cl_loss) + test_lm_loss
    val_f1, val_acc, val_lm_loss, val_cl_loss = get_predictions(val_df)
    val_loss = (loss_multiplier*val_cl_loss) + val_lm_loss
   
    if True: #val_loss<best_val_loss:
        now = datetime.now()
        torch.save({
                'epoch': epoch+1,
                'model_state_dict': combined_net.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': total_loss
                }, f'../Models/{transformer_name}_emoji_f1_{test_f1}_acc_{test_acc}_{epoch+1}_big.pt')
        if val_loss<best_val_loss:
            best_val_loss = val_loss
    print('Epoch: ',epoch+1)
    print(f'Loss: {total_loss}, LM Loss: {total_lm_loss}, CL Loss: {total_cl_loss}, Training accuracy:{train_acc}, Validation accuracy:{val_acc}, Test accuracy:{test_acc}')
    print(f'Val F1:{val_f1} \t Val Loss: {val_loss} Val CL Loss: {val_cl_loss} Val LM Loss: {val_lm_loss}')
    print(f'Test F1:{test_f1} \t Test Loss: {test_loss} Test CL Loss: {test_cl_loss} Test LM Loss: {test_lm_loss}')


[1_1-0,    64/171287] loss: 15.373 lm_loss: 4.758 cl_loss: 2.123  |  accuracy: 0.059
[1_2-0,   128/171287] loss: 14.778 lm_loss: 4.516 cl_loss: 2.052  |  accuracy: 0.137
[1_3-0,   192/171287] loss: 14.461 lm_loss: 4.435 cl_loss: 2.005  |  accuracy: 0.159
[1_4-0,   256/171287] loss: 14.703 lm_loss: 4.303 cl_loss: 2.080  |  accuracy: 0.163
[1_5-0,   320/171287] loss: 14.433 lm_loss: 4.212 cl_loss: 2.044  |  accuracy: 0.170
[1_6-0,   384/171287] loss: 14.310 lm_loss: 4.261 cl_loss: 2.010  |  accuracy: 0.178
[1_7-0,   448/171287] loss: 14.169 lm_loss: 4.184 cl_loss: 1.997  |  accuracy: 0.181
[1_8-0,   512/171287] loss: 14.317 lm_loss: 4.275 cl_loss: 2.008  |  accuracy: 0.177
[1_9-0,   576/171287] loss: 14.066 lm_loss: 4.154 cl_loss: 1.982  |  accuracy: 0.188
[1_10-0,   640/171287] loss: 14.440 lm_loss: 4.392 cl_loss: 2.010  |  accuracy: 0.191
[1_11-0,   704/171287] loss: 14.045 lm_loss: 4.265 cl_loss: 1.956  |  accuracy: 0.199
[1_12-0,   768/171287] loss: 14.169 lm_loss: 4.214 cl_loss: 1.9

KeyboardInterrupt: 
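
The logged numbers can be checked against the loss definition `loss = loss_multiplier*cl_loss + lm_loss`. Each logged value is a sum over 64 micro-batches whose losses were already divided by `accumulation_steps`, so it is the mean over one accumulation window. The sketch below redoes the arithmetic for the first logged line and compares the classification loss with the cross-entropy of a uniform guess over 8 classes; the variable names `combined`, `chance_level` and `effective_batch_size` are illustrative.

```python
import math

loss_multiplier = 5               # value set in the notebook
batch_size = 4                    # micro-batch size used in the training loop
accumulation_steps = 64           # micro-batches per optimizer step
lm_loss, cl_loss = 4.758, 2.123   # first logged line above

# Weighted sum of the two objectives, as in the training loop.
combined = loss_multiplier * cl_loss + lm_loss
print(f"combined loss: {combined:.3f}")
# combined loss: 15.373

# Cross-entropy of a uniform prediction over 8 classes.
chance_level = math.log(8)
print(f"chance-level cross-entropy: {chance_level:.3f}")
# chance-level cross-entropy: 2.079

# Number of tweets that contribute to one optimizer step.
effective_batch_size = batch_size * accumulation_steps
print(f"effective batch size: {effective_batch_size}")
# effective batch size: 256
```

The first logged `cl_loss` of 2.123 is close to the chance level, which is what one would expect from a freshly initialised classification layer.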