## T5 (Text-to-Text Transfer Transformer)

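Most earlier NLP models work in one of two ways: they convert the input sentence into a vector or matrix and generate an output sentence from it, or they return a class label or a span of the input as the output.

T5 uses a text-to-text structure: inputs and outputs are both handled as token sequences, whatever the task.

Because the input and output formats can be chosen freely, the architecture is flexible and easy to extend to new tasks.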
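The text-to-text idea can be illustrated with a small sketch. The helper `to_text_pair` and the example sentences are hypothetical and only show the shape of the data; the `summarize` prefix is the one used later in this notebook, and the other two prefixes follow the style of the original T5 setup.

```python
from typing import Tuple

def to_text_pair(task_prefix: str, source: str, target: str) -> Tuple[str, str]:
    """Cast one example of any task into an (input text, target text) pair."""
    return f"{task_prefix}: {source}", target

examples = [
    # Summarization: the target is a shorter text.
    to_text_pair("summarize", "Putin said he had a normal dialogue with Trump.",
                 "Putin says had useful interaction with Trump"),
    # Translation: the target is text in another language.
    to_text_pair("translate English to German", "That is good.", "Das ist gut."),
    # Classification: the class label is emitted as a word, not as an index.
    to_text_pair("cola sentence", "The course is jumping well.", "unacceptable"),
]

for model_input, model_target in examples:
    print(f"{model_input!r} -> {model_target!r}")
```

Since every task shares this one format, the same model, loss, and decoding procedure can serve all of them.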

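T5 uses sentinel tokens as the mask tokens within each sentence. They are named `<extra_id_0>`, `<extra_id_1>`, and so on; by default the tokenizer provides 100 of them, numbered 0 through 99.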
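How sentinel tokens mask spans can be sketched in plain Python. The function `corrupt_spans` below is a hypothetical illustration, not the T5 preprocessing code: it assumes whitespace-split words stand in for tokens and that the spans to mask are given by hand rather than sampled.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the matching target.

    `spans` must be sorted, non-overlapping, half-open index ranges into `tokens`.
    """
    corrupted, target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # one sentinel replaces the whole span
        target.append(sentinel)                 # the target repeats the sentinel
        target.extend(tokens[start:end])        # followed by the tokens it hides
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep the text after the last span
    target.append(f"<extra_id_{len(spans)}>")   # a final sentinel closes the target
    return " ".join(corrupted), " ".join(target)

tokens = "Thank you for inviting me to your party last week".split()
model_input, model_target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(model_input)   # Thank you <extra_id_0> me to your party <extra_id_1> week
print(model_target)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Each masked span gets its own numbered sentinel, so the target can state which missing text belongs to which gap.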

In [1]:
import numpy as np
from datasets import load_dataset

In [2]:
news = load_dataset('argilla/news-summary', split='test')
df = news.to_pandas().sample(5000, random_state=42)[['text', 'prediction']]
df['text'] = 'summarize: ' + df['text']
df['prediction'] = df['prediction'].map(lambda x: x[0]['text'])
train, valid, test = np.split(
    df.sample(frac = 1, random_state = 42), [int(0.6*len(df)), int(0.8*len(df))]
)
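The two cut points passed to `np.split` produce a 60/20/20 split. The following is a minimal sketch of the same arithmetic on a plain list, assuming a hypothetical helper `split_three_ways`; the shuffle here uses Python's `random` module, so the row order differs from the pandas shuffle above.

```python
import random

def split_three_ways(rows, first_cut=0.6, second_cut=0.8):
    """Slice `rows` at the same two cut points the notebook passes to np.split."""
    n = len(rows)
    i, j = int(first_cut * n), int(second_cut * n)
    return rows[:i], rows[i:j], rows[j:]

rows = list(range(5000))           # stand-in for the 5,000 sampled articles
random.Random(42).shuffle(rows)    # shuffle before slicing, as df.sample(frac=1) does
train_rows, valid_rows, test_rows = split_three_ways(rows)
print(len(train_rows), len(valid_rows), len(test_rows))  # 3000 1000 1000
```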


In [3]:
train

| | text | prediction |
|---|---|---|
| 9209 | summarize: DANANG, Vietnam (Reuters) - Russian... | Putin says had useful interaction with Trump a... |
| 13236 | summarize: NEW YORK (Reuters) - A showdown bet... | NY mayor criticizes Trump's closing public atr... |
| 7828 | summarize: (This January 3 story was correcte... | Oil business seen in strong position as Trump ... |
| 18839 | summarize: NEW YORK (Reuters) - Washington sta... | Courts likely to probe Trump's intent in issui... |
| 19844 | summarize: WASHINGTON (Reuters) - Kristie Kenn... | Kristie Kenney named State Department's new co... |
| ... | ... | ... |
| 7920 | summarize: MOSCOW (Reuters) - President Vladim... | Putin warns North Korea situation on verge of ... |
| 751 | summarize: DANANG, Vietnam (Reuters) - It is n... | New Zealand says unclear if TPP agreement can ... |
| 16847 | summarize: CORALVILLE, Iowa (Reuters) - U.S. R... | Republican candidate Rubio: Fed needs clear ru... |
| 1037 | summarize: WASHINGTON (Reuters) - It would not... | Germany's Schaeuble presses ECB to unwind loos... |


In [4]:
train['text'][9209]

'summarize: DANANG, Vietnam (Reuters) - Russian President Vladimir Putin said on Saturday he had a normal dialogue with U.S. leader Donald Trump at a summit in Vietnam, and described Trump as civil, well-educated, and comfortable to deal with. Putin said that a mooted bilateral sit-down meeting with Trump did not happen at the Asia-Pacific Economic Cooperation summit, citing scheduling issues on both sides and unspecified protocol issues. Putin, at a briefing for reporters at the end of the summit, said there was still a need for further U.S.-Russia contacts, both at the level of heads of state and their officials, to discuss issues including security and economic development.   '

In [5]:
train['prediction'][9209]

'Putin says had useful interaction with Trump at Vietnam summit'

In [6]:
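import sys
# Environment-specific workaround: expose the local conda site-packages to this kernel.
# Adjust or remove this path on other machines.
sys.path.append("C:/Users/dohyeong/miniconda3/Lib/site-packages/")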

In [7]:
import torch
from transformers import T5Tokenizer
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import RandomSampler, SequentialSampler

In [26]:
from torch import optim

In [18]:
def make_dataset(data, tokenizer, device):
    # Tokenize the articles; pad or truncate every sequence to exactly 128 tokens.
    # padding='max_length' already covers the deprecated pad_to_max_length flag.
    source = tokenizer(
        text=data.text.tolist(),
        padding='max_length',
        max_length=128,
        truncation=True,
        return_tensors='pt'
    )

    # Tokenize the reference summaries the same way.
    target = tokenizer(
        text=data.prediction.tolist(),
        padding='max_length',
        max_length=128,
        truncation=True,
        return_tensors='pt'
    )

    # The tensors already have shape (num_examples, 128). No squeeze() is applied,
    # because it would drop the example dimension when there is only one row.
    source_ids = source['input_ids'].to(device)
    source_mask = source['attention_mask'].to(device)
    target_ids = target['input_ids'].to(device)
    target_mask = target['attention_mask'].to(device)
    return TensorDataset(source_ids, source_mask, target_ids, target_mask)
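What `padding='max_length'` and `truncation=True` do to a single sequence can be sketched without the tokenizer. The helper `encode_fixed_length` is hypothetical and the input ids are arbitrary; the sketch assumes T5's conventions of pad id 0 and end-of-sequence id 1, which match the trailing `1` and `0` values in the batch printed further down.

```python
def encode_fixed_length(token_ids, max_length, pad_id=0, eos_id=1):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    kept = list(token_ids[: max_length - 1])    # truncation leaves room for EOS
    input_ids = kept + [eos_id]                 # the sequence always ends with EOS
    attention_mask = [1] * len(input_ids)       # 1 marks real tokens
    padding = max_length - len(input_ids)
    # Pad positions get pad_id in input_ids and 0 in the mask.
    return input_ids + [pad_id] * padding, attention_mask + [0] * padding

short_ids, short_mask = encode_fixed_length([749, 18, 18610], max_length=6)
long_ids, long_mask = encode_fixed_length([21603, 10, 3, 7, 8, 9, 11], max_length=6)
print(short_ids, short_mask)  # [749, 18, 18610, 1, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [21603, 10, 3, 7, 8, 1] [1, 1, 1, 1, 1, 1]
```

Short sequences are padded and masked out, while long ones are cut so that the end-of-sequence token still fits in the last position.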

In [19]:
def get_dataloader(dataset, sampler, batch_size):
    data_sampler = sampler(dataset)
    dataloader = DataLoader(dataset, sampler = data_sampler, batch_size = batch_size)
    return dataloader
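The effect of the sampler and `batch_size` can be sketched on bare indices. `batch_indices` is a hypothetical stand-in for what a sampler plus `DataLoader` yields, assuming the default behaviour of keeping a smaller final batch.

```python
import random

def batch_indices(num_examples, batch_size, shuffle, seed=0):
    """Group example indices into batches, in order or in a shuffled order."""
    order = list(range(num_examples))
    if shuffle:
        random.Random(seed).shuffle(order)      # random order, as RandomSampler gives
    # Consecutive chunks of `batch_size`; the last chunk may be smaller.
    return [order[i:i + batch_size] for i in range(0, num_examples, batch_size)]

sequential = batch_indices(10, 4, shuffle=False)
shuffled = batch_indices(10, 4, shuffle=True)
print(sequential)                                  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(len(batch_indices(3000, 8, shuffle=True)))   # 375
```

With 3,000 training rows and a batch size of 8, one epoch therefore consists of 375 batches.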

In [20]:
epochs = 3
batch_size = 8
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [21]:
device

'cuda'

In [22]:
tokenizer = T5Tokenizer.from_pretrained(
    pretrained_model_name_or_path= 't5-small'
)

In [23]:
train_dataset = make_dataset(train, tokenizer, device)
train_dataloader = get_dataloader(train_dataset, RandomSampler, batch_size)

In [24]:
valid_dataset = make_dataset(valid, tokenizer, device)
valid_dataloader = get_dataloader(valid_dataset, SequentialSampler, batch_size)

In [25]:
test_dataset = make_dataset(test, tokenizer, device)
test_dataloader = get_dataloader(test_dataset, SequentialSampler, batch_size)

In [28]:
print(next(iter(train_dataloader)))

[tensor([[21603,    10,     3,  ..., 13191,  6230,     1],
        [21603,    10,  6554,  ...,  7070,   166,     1],
        [21603,    10,     3,  ...,   105, 20857,     1],
        ...,
        [21603,    10,   549,  ...,  7896,   134,     1],
        [21603,    10,     3,  ...,    12,    66,     1],
        [21603,    10,   549,  ...,   988,    13,     1]], device='cuda:0'), tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0'), tensor([[  749,    18, 18610,  ...,     0,     0,     0],
        [18263,   819,    13,  ...,     0,     0,     0],
        [ 2523, 14667,   265,  ...,     0,     0,     0],
        ...,
        [16388,     7,    13,  ...,     0,     0,     0],
        [ 7676, 15093,     7,  ...,     0,     0,     0],
        [  412,     5,   134,  ...,     0,     0,     0]], device='cuda:0'), ten