## T5 (Text-to-Text Transfer Transformer)

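Most earlier natural language processing models either encode the input sentence into a vector or matrix and generate the output sentence from that representation, or return a class label or a span of the input as their output.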

T5 instead uses a text-to-text structure: whatever the task, the output is always produced as a token sequence.

Because inputs and outputs are both plain text, their form can be chosen freely, which makes the architecture flexible and easy to extend.
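
As a rough illustration of the text-to-text idea, the sketch below turns different tasks into plain input strings by prepending a task prefix. The prefix strings follow the style used by the released T5 checkpoints, but `TASK_PREFIXES` and `to_text_to_text` are hypothetical names introduced only for this example.

```python
# Hypothetical helper for illustration; only the prefix strings come from T5 usage.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
    "cola": "cola sentence: ",
}


def to_text_to_text(task: str, text: str) -> str:
    """Turn a (task, text) pair into a single input string for the model."""
    return TASK_PREFIXES[task] + text


print(to_text_to_text("summarize", "Some long news article."))
print(to_text_to_text("translate_en_de", "That is good."))
print(to_text_to_text("cola", "The course is jumping well."))
```

Every task then shares the same interface: a string goes in and a string comes out, so the summarization setup used later in this notebook only needs the `summarize: ` prefix.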

T5 masks text with sentinel tokens: each masked span in a sentence gets its own token, such as `<extra_id_0>` or `<extra_id_1>`. By default there are 100 of them, numbered 0 to 99.
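
A minimal sketch of how sentinel tokens might be applied, assuming whitespace tokenization and hand-picked spans (real pre-training samples spans randomly over subword tokens). The function name `span_corrupt` is hypothetical.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns the corrupted input and the target that lists the masked spans.
    Spans are half-open index ranges and are assumed not to overlap.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # the span collapses to one sentinel
        target.append(sentinel)                 # the target names the sentinel...
        target.extend(tokens[start:end])        # ...followed by the hidden tokens
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # final sentinel closes the target
    return " ".join(corrupted), " ".join(target)


tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Each sentinel appears once in the input and once in the target, which is how the model knows which masked span it is filling in.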

In [11]:
import numpy as np
from datasets import load_dataset

In [12]:
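# Load the test split of the news-summary dataset and sample 5,000 articles.
news = load_dataset('argilla/news-summary', split='test')
df = news.to_pandas().sample(5000, random_state=42)[['text', 'prediction']]
# T5 selects the task through a text prefix on the input.
df['text'] = 'summarize: ' + df['text']
# Each prediction holds a sequence of records; keep the text of the first one.
df['prediction'] = df['prediction'].map(lambda x: x[0]['text'])
# Shuffle, then cut at 60% and 80% to get train, validation, and test sets.
train, valid, test = np.split(
    df.sample(frac=1, random_state=42), [int(0.6*len(df)), int(0.8*len(df))]
)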
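
The two cut points passed to `np.split` produce a 60/20/20 split. The same idea can be sketched with the standard library alone. `split_60_20_20` is a hypothetical helper, and its shuffle order will not match the pandas sample above.

```python
import random


def split_60_20_20(items, seed=42):
    """Shuffle a copy of `items` and cut it at the 60% and 80% marks."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    first_cut = int(0.6 * len(shuffled))
    second_cut = int(0.8 * len(shuffled))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


train_part, valid_part, test_part = split_60_20_20(range(5000))
print(len(train_part), len(valid_part), len(test_part))  # 3000 1000 1000
```

With 5,000 sampled rows this gives 3,000 training, 1,000 validation, and 1,000 test examples.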

In [13]:
train

| | text | prediction |
|---|---|---|
| 9209 | summarize: DANANG, Vietnam (Reuters) - Russian... | Putin says had useful interaction with Trump a... |
| 13236 | summarize: NEW YORK (Reuters) - A showdown bet... | NY mayor criticizes Trump's closing public atr... |
| 7828 | summarize: (This January 3 story was correcte... | Oil business seen in strong position as Trump ... |
| 18839 | summarize: NEW YORK (Reuters) - Washington sta... | Courts likely to probe Trump's intent in issui... |
| 19844 | summarize: WASHINGTON (Reuters) - Kristie Kenn... | Kristie Kenney named State Department's new co... |
| ... | ... | ... |
| 7920 | summarize: MOSCOW (Reuters) - President Vladim... | Putin warns North Korea situation on verge of ... |
| 751 | summarize: DANANG, Vietnam (Reuters) - It is n... | New Zealand says unclear if TPP agreement can ... |
| 16847 | summarize: CORALVILLE, Iowa (Reuters) - U.S. R... | Republican candidate Rubio: Fed needs clear ru... |
| 1037 | summarize: WASHINGTON (Reuters) - It would not... | Germany's Schaeuble presses ECB to unwind loos... |


In [14]:
train['text'][9209]

'summarize: DANANG, Vietnam (Reuters) - Russian President Vladimir Putin said on Saturday he had a normal dialogue with U.S. leader Donald Trump at a summit in Vietnam, and described Trump as civil, well-educated, and comfortable to deal with. Putin said that a mooted bilateral sit-down meeting with Trump did not happen at the Asia-Pacific Economic Cooperation summit, citing scheduling issues on both sides and unspecified protocol issues. Putin, at a briefing for reporters at the end of the summit, said there was still a need for further U.S.-Russia contacts, both at the level of heads of state and their officials, to discuss issues including security and economic development.   '

In [15]:
train['prediction'][9209]

'Putin says had useful interaction with Trump at Vietnam summit'

In [16]:
import sys
# Environment-specific workaround: add a local site-packages directory to the import path.
sys.path.append("C:/Users/dohyeong/miniconda3/Lib/site-packages/")

In [17]:
import torch
from transformers import T5Tokenizer
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import RandomSampler, SequentialSampler

In [18]:
from torch import optim

In [19]:
def make_dataset(data, tokenizer, device):
    # Tokenize the source articles, padding or truncating each to 128 tokens.
    # padding='max_length' already pads fully, so the deprecated
    # pad_to_max_length flag is not needed.
    source = tokenizer(
        text=data.text.tolist(),
        padding='max_length',
        max_length=128,
        truncation=True,
        return_tensors='pt'
    )

    # Tokenize the reference summaries the same way.
    target = tokenizer(
        text=data.prediction.tolist(),
        padding='max_length',
        max_length=128,
        truncation=True,
        return_tensors='pt'
    )

    # The tensors already have shape (num_examples, 128), so no squeeze is needed.
    source_ids = source['input_ids'].to(device)
    source_mask = source['attention_mask'].to(device)
    target_ids = target['input_ids'].to(device)
    target_mask = target['attention_mask'].to(device)
    return TensorDataset(source_ids, source_mask, target_ids, target_mask)
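
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified pure-Python sketch for a single sequence. `pad_or_truncate` is a hypothetical helper, and the ids 0 (padding) and 1 (end of sequence) are assumed to match the `t5-small` tokenizer.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0, eos_id=1):
    """Return (input_ids, attention_mask), each exactly `max_length` long."""
    # Leave room for the end-of-sequence token, then append it.
    ids = list(token_ids)[: max_length - 1] + [eos_id]
    attention_mask = [1] * len(ids)
    # Right-pad with pad_id; padded positions get attention mask 0.
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding


ids, mask = pad_or_truncate([21603, 10, 377], max_length=6)
print(ids)   # [21603, 10, 377, 1, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Long articles are cut so that the sequence still ends with the end-of-sequence id, while short summaries are filled with zeros, which is the pattern visible in the printed batch further down.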

In [20]:
def get_dataloader(dataset, sampler, batch_size):
    data_sampler = sampler(dataset)
    dataloader = DataLoader(dataset, sampler = data_sampler, batch_size = batch_size)
    return dataloader

In [21]:
epochs = 3
batch_size = 8
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [22]:
device

'cuda'

In [23]:
tokenizer = T5Tokenizer.from_pretrained(
    pretrained_model_name_or_path= 't5-small'
)

In [24]:
train_dataset = make_dataset(train, tokenizer, device)
train_dataloader = get_dataloader(train_dataset, RandomSampler, batch_size)

In [25]:
valid_dataset = make_dataset(valid, tokenizer, device)
# Validation does not need shuffling, so iterate in a fixed order.
valid_dataloader = get_dataloader(valid_dataset, SequentialSampler, batch_size)

In [26]:
test_dataset = make_dataset(test, tokenizer, device)
test_dataloader = get_dataloader(test_dataset, RandomSampler, batch_size)

In [27]:
print(next(iter(train_dataloader)))

[tensor([[21603,    10,   377,  ...,   141,  5132,     1],
        [21603,    10,    71,  ...,  1506,  2542,     1],
        [21603,    10,   549,  ...,   888,    12,     1],
        ...,
        [21603,    10,  8161,  ...,    81,    69,     1],
        [21603,    10,  5422,  ...,    19, 11970,     1],
        [21603,    10,  6045,  ...,  7402,   593,     1]], device='cuda:0'), tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0'), tensor([[11882, 18486,  2231,  ...,     0,     0,     0],
        [ 2523,    31,     7,  ...,     0,     0,     0],
        [ 1589,   212,    76,  ...,     0,     0,     0],
        ...,
        [  571,  2770,  6420,  ...,     0,     0,     0],
        [18263,    27,  1967,  ...,     0,     0,     0],
        [16870,   789,     3,  ...,     0,     0,     0]], device='cuda:0'), ten

In [28]:
from torch import optim
from transformers import T5ForConditionalGeneration

In [29]:
model = T5ForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path= 't5-small'
).to(device)

In [30]:
optimizer = optim.AdamW(model.parameters(), lr=1e-5, eps=1e-8)

In [31]:
import numpy as np
from torch import nn

In [32]:
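# Note: this definition rebinds the name `train`, which until now referred to the
# training DataFrame, so the datasets must already be built before this cell runs.
def train(model, optimizer, dataloader):
    model.train()
    train_loss = 0.0

    for source_ids, source_mask, target_ids, target_mask in dataloader:
        # Teacher forcing: the decoder reads target tokens 0..n-2 and is
        # trained to predict tokens 1..n-1.
        decoder_input_ids = target_ids[:, :-1].contiguous()
        labels = target_ids[:, 1:].clone().detach()
        # Label -100 is ignored by the loss, so padding does not contribute.
        labels[target_ids[:, 1:] == tokenizer.pad_token_id] = -100

        outputs = model(
            input_ids=source_ids,
            attention_mask=source_mask,
            decoder_input_ids=decoder_input_ids,
            labels=labels
        )

        loss = outputs.loss
        train_loss += loss.item()

        # Standard update: clear old gradients, backpropagate, step.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Average the per-batch losses over the epoch.
    train_loss = train_loss / len(dataloader)
    return train_loss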
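
The loop above builds decoder inputs and labels by slicing the target by hand. For comparison, when only `labels` is passed, the `transformers` T5 model derives the decoder inputs itself by shifting the labels one position to the right and prepending a start token. The pure-Python sketch below illustrates that shift on a single sequence. `mask_padding` and `shift_right` are hypothetical helper names, and the ids (0 for both padding and decoder start, 1 for end of sequence) are assumed to match `t5-small`.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_padding(target_ids, pad_id=0):
    """Replace padding ids with IGNORE_INDEX so they do not count in the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in target_ids]


def shift_right(labels, start_id=0, pad_id=0):
    """Build decoder inputs: prepend the start id and drop the last label."""
    shifted = [start_id] + list(labels[:-1])
    # Decoder inputs must be real token ids, so ignored positions become padding.
    return [pad_id if t == IGNORE_INDEX else t for t in shifted]


target_ids = [11882, 18486, 2231, 1, 0, 0]
labels = mask_padding(target_ids)
decoder_input_ids = shift_right(labels)
print(labels)             # [11882, 18486, 2231, 1, -100, -100]
print(decoder_input_ids)  # [0, 11882, 18486, 2231, 1, 0]
```

With this convention the first target token is also predicted, from the start token, which mirrors how `generate` begins decoding.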

In [33]:
def evaluation(model, dataloader):
    model.eval()
    val_loss = 0.0

    # Gradients are not needed for evaluation.
    with torch.no_grad():
        for source_ids, source_mask, target_ids, target_mask in dataloader:
            decoder_input_ids = target_ids[:, :-1].contiguous()
            labels = target_ids[:, 1:].clone().detach()
            labels[target_ids[:, 1:] == tokenizer.pad_token_id] = -100

            outputs = model(
                input_ids=source_ids,
                attention_mask=source_mask,
                decoder_input_ids=decoder_input_ids,
                labels=labels,
            )

            loss = outputs.loss
            # Accumulate a Python float rather than a tensor.
            val_loss += loss.item()

    val_loss = val_loss / len(dataloader)
    return val_loss
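
Both functions read `outputs.loss`, a cross-entropy averaged over the label positions that are not set to -100. The toy sketch below shows that masking on a two-word vocabulary. `masked_cross_entropy` is a hypothetical helper that takes probabilities directly, whereas the model works from logits.

```python
import math


def masked_cross_entropy(probs, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignored.

    Assumes at least one label differs from `ignore_index`.
    """
    losses = [
        -math.log(dist[label])
        for dist, label in zip(probs, labels)
        if label != ignore_index
    ]
    return sum(losses) / len(losses)


# Three decoding steps; the last one is padding and is skipped.
probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
labels = [0, 1, -100]
print(masked_cross_entropy(probs, labels))
```

Only the first two steps contribute, so the padded tail of a short summary does not dilute the loss.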

In [34]:
best_loss = 10000

for epoch in range(epochs):
    train_loss = train(model, optimizer, train_dataloader)
    val_loss = evaluation(model, valid_dataloader)
    print(f"epoch: {epoch+1}, train_loss: {train_loss:.4f}, val_loss: {val_loss:.4f}")
    
    if val_loss < best_loss:
        best_loss = val_loss
        torch.save(model.state_dict(), './t5generator.pt')
        print()
        print('model saved')

epoch: 1, train_loss: 4.3346, val_loss: 3.3429

model saved
epoch: 2, train_loss: 3.4221, val_loss: 2.9161

model saved
epoch: 3, train_loss: 3.1400, val_loss: 2.7666

model saved
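
The checkpoint rule in the loop above saves the model whenever the validation loss improves on the best value seen so far. A small sketch of that rule, applied to the validation losses printed above (`epochs_that_save` is a hypothetical helper):

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a new best validation loss appears."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved


print(epochs_that_save([3.3429, 2.9161, 2.7666]))  # [1, 2, 3]
```

Because the validation loss fell in every epoch, the file on disk ends up holding the weights from the final epoch.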


In [35]:
model.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [37]:
with torch.no_grad():
    for source_ids, source_mask, target_ids, target_mask in test_dataloader:
        generated_ids = model.generate(
            input_ids = source_ids, 
            attention_mask = source_mask,
            max_length = 128,
            num_beams = 3,
            repetition_penalty = 2.5,
            length_penalty = 1.0,
            early_stopping = True
        )
        
        for generated, target in zip(generated_ids, target_ids):
            pred = tokenizer.decode(
                generated, skip_special_tokens= True, clean_up_tokenization_spaces= True
            )
            actual = tokenizer.decode(
                target, skip_special_tokens=True, clean_up_tokenization_spaces= True,
            )
            
            print('generated_headline_text: ', pred)
            print('actual_headline: ', actual)
            print('')
        break

generated_headline_text:  a top Republican defends border-adjustable tax provision against Trump criticism. House of Representatives says reform measure to tax imports but not exports remains part of debate.
actual_headline:  Republican defends border-adjustment tax after Trump criticism

generated_headline_text:  Israeli intelligence minister says Bashar al-Assad is ready to permit Iran to set up Syrian bases. Israel worries that Assad's recent gains have given Iranian and Lebanese Hezbollah allies foothold on Syria front.
actual_headline:  After Russia, Iran seeks deal for long-term Syria garrison: Israel

generated_headline_text:  U.S. officials seeking way to reverse gains by militant groups. three U.S. service members killed in Afghanistan operations near Pakistan border.
actual_headline:  Risk of deeper involvement as U.S. weighs its options in Afghanistan

generated_headline_text:  independent human rights investigator says he had information about tortured inmate at Guantanamo 