<a href="https://colab.research.google.com/github/cafecatwang/github_cafecatwang.github.io/blob/master/Minimalistic_training_of_T5_transformer_with_Pytorch_Lightning_and_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Few-shot text generation with T5 Transformer and PyTorch Lightning**

Author: Ramsri Goutham Golla

LinkedIn: https://www.linkedin.com/in/ramsrig/

Twitter: https://twitter.com/ramsri_goutham

Thanks to [Venelin Valkov](https://www.youtube.com/user/VulkovVenelin) and [Suraj Patil](https://github.com/patil-suraj) for their T5 transformer code. Their notebooks were instrumental in putting this one together.

## 1. Install libraries

In [None]:
!pip install --quiet transformers==4.1.1
!pip install --quiet pytorch-lightning==1.1.3
!pip install --quiet tokenizers==0.9.4 
!pip install --quiet sentencepiece==0.1.94
!pip install --quiet tqdm==4.56.0

In [None]:
# restart runtime
import os

def restart_runtime():
  os.kill(os.getpid(), 9)

restart_runtime()

In [None]:
# Check that a GPU is available and see how much memory it has
!nvidia-smi

Fri Jan 15 07:31:50 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## 2. Prepare Model

In [None]:

import random
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl

from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
)



pl.seed_everything(42)

42

In [None]:
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')


Some weights of the model checkpoint at t5-base were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# dataset preparation

true_false_adjective_tuples_train = [
                               ("The cat is alive","The cat is dead"),
                               ("The old woman is beautiful","The old woman is ugly"),
                               ("The purse is cheap","The purse is expensive"),
                               ("Her hair is curly","Her hair is straight"),
                               ("The bathroom is clean","The bathroom is dirty"),
                               ("The exam was easy","The exam was difficult"),
                               ("The house is big","The house is small"),
                               ("The house owner is good","The house owner is bad"),
                               ("The little kid is fat","The little kid is thin"),
                               ("She arrived early","She arrived late."),
                               ("John is very hardworking","John is very lazy"),
                               ("The fridge is empty","The fridge is full")

]


true_false_adjective_tuples_validation = [
                               ("Her face was bright","Her face was dull"),
                               ("The kid is very active","The kid is very silent")
                              
]

In [None]:
from tqdm.notebook import tqdm
import copy


class FalseGenerationDataset(Dataset):
    def __init__(self, tokenizer, tf_list, max_len_inp=96,max_len_out=96):

        self.true_false_adjective_tuples = tf_list

        self.max_len_input = max_len_inp
        self.max_len_output = max_len_out
        self.tokenizer = tokenizer
        self.inputs = []
        self.targets = []
        self.skippedcount =0
        self._build()

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        source_ids = self.inputs[index]["input_ids"].squeeze()
        target_ids = self.targets[index]["input_ids"].squeeze()

        src_mask = self.inputs[index]["attention_mask"].squeeze()
        target_mask = self.targets[index]["attention_mask"].squeeze()

        # Replace padding ids with -100 so the loss ignores padded positions
        labels = target_ids.clone()
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {"source_ids": source_ids, "source_mask": src_mask, "target_ids": target_ids, "target_mask": target_mask,"labels":labels}

    def _build(self):
        for inputs, outputs in self.true_false_adjective_tuples:
          # Task prefixes mark the source and the target sentence
          input_sent = "falsify: " + inputs
          output_sent = "falsified: " + outputs

          # tokenize inputs, padding and truncating to a fixed length
          tokenized_inputs = self.tokenizer.batch_encode_plus(
              [input_sent], max_length=self.max_len_input, padding="max_length", truncation=True, return_tensors="pt"
          )
          # tokenize targets, padding and truncating to a fixed length
          tokenized_targets = self.tokenizer.batch_encode_plus(
              [output_sent], max_length=self.max_len_output, padding="max_length", truncation=True, return_tensors="pt"
          )

          self.inputs.append(tokenized_inputs)
          self.targets.append(tokenized_targets)
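
The dataset class above pads every example to a fixed length and sets the padded label positions to -100 so that they do not contribute to the loss. The following is a minimal pure-Python sketch of that idea, not the notebook's implementation: `pad_and_mask` is a hypothetical helper, the token ids are made up, and it assumes T5's pad token id of 0.

```python
PAD_ID = 0            # T5's pad token id
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss

def pad_and_mask(token_ids, max_len):
    """Pad (or truncate) a list of token ids and derive attention mask and labels."""
    ids = token_ids[:max_len]                      # truncate if too long
    n_pad = max_len - len(ids)
    padded = ids + [PAD_ID] * n_pad                # right-pad with the pad id
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    # Labels copy the ids, but padded positions become -100
    labels = [t if t != PAD_ID else IGNORE_INDEX for t in padded]
    return padded, attention_mask, labels

# Made-up ids for illustration; 1 stands for the end-of-sequence token
padded, mask, labels = pad_and_mask([5, 6, 1], max_len=5)
print(padded)
print(mask)
print(labels)
```

In the dataset class the same steps are carried out on tensors, with the tokenizer doing the padding and the attention mask.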


       

In [None]:
train_dataset = FalseGenerationDataset(t5_tokenizer,true_false_adjective_tuples_train)
validation_dataset = FalseGenerationDataset(t5_tokenizer,true_false_adjective_tuples_validation)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [None]:
class T5FineTuner(pl.LightningModule):
    def __init__(self,hparams, t5model, t5tokenizer):
        super(T5FineTuner, self).__init__()
        self.hparams = hparams
        self.model = t5model
        self.tokenizer = t5tokenizer


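    def forward(self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None):
        # decoder_input_ids is accepted but not passed on: when labels are given,
        # T5 builds the decoder inputs itself by shifting the labels to the right.
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_attention_mask=decoder_attention_mask,
            labels=lm_labels,
        )

        return outputs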


    def training_step(self, batch, batch_idx):
        outputs = self.forward(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            decoder_input_ids = batch["target_ids"],
            decoder_attention_mask=batch['target_mask'],
            lm_labels=batch['labels']
        )

        loss = outputs[0]
        self.log('train_loss',loss)
        return loss

    def validation_step(self, batch, batch_idx):
        outputs = self.forward(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            decoder_input_ids = batch["target_ids"],
            decoder_attention_mask=batch['target_mask'],
            lm_labels=batch['labels']
        )

        loss = outputs[0]
        self.log("val_loss",loss)
        return loss

    def train_dataloader(self):
        return DataLoader(train_dataset, batch_size=self.hparams.batch_size,num_workers=4)

    def val_dataloader(self):
        return DataLoader(validation_dataset, batch_size=self.hparams.batch_size,num_workers=4)



    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=3e-4, eps=1e-8)
        return optimizer
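
The imports earlier in the notebook include `get_linear_schedule_with_warmup`, but `configure_optimizers` returns only the optimizer, so the learning rate stays constant at 3e-4. As an illustration of what a linear warmup and decay schedule would do to that rate, here is a small pure-Python sketch. `linear_warmup_factor` is a hypothetical name, the warmup length of 6 steps is an assumption, and the 60 total steps come from the notebook's 12 training pairs, batch size of 1 and 5 epochs.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 3e-4
total_steps = 12 * 5   # 12 training pairs, batch size 1, 5 epochs
warmup_steps = 6       # hypothetical choice, not taken from the notebook

for step in (0, 3, 6, 33, 60):
    print(step, base_lr * linear_warmup_factor(step, warmup_steps, total_steps))
```

The notebook itself trains without a scheduler. Whether warmup helps with a dataset this small would have to be checked empirically.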




## 3. Train Model

In [None]:
import argparse
args_dict = dict(
    batch_size=1,
)

args = argparse.Namespace(**args_dict)


model = T5FineTuner(args,t5_model,t5_tokenizer)

trainer = pl.Trainer(max_epochs = 5, gpus=1,progress_bar_refresh_rate=30)

trainer.fit(model)



GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params


1

## 4. Test model

In [None]:
# The T5 tokenizer appends the </s> end-of-sequence token itself, so it is not added by hand
test_sent = 'falsify: The sailor was happy and joyful.'
test_tokenized = t5_tokenizer.encode_plus(test_sent, return_tensors="pt")

test_input_ids = test_tokenized["input_ids"]
test_attention_mask = test_tokenized["attention_mask"]

model.model.eval()
beam_outputs = model.model.generate(
    input_ids=test_input_ids, attention_mask=test_attention_mask,
    max_length=64,
    early_stopping=True,
    num_beams=10,
    num_return_sequences=3,
    no_repeat_ngram_size=2
)

for beam_output in beam_outputs:
    sent = t5_tokenizer.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(sent)


falsified: The sailor was sad
falsified: The sailor was unhappy
falsified: The sailor was happy


In [None]:
# The T5 tokenizer appends the </s> end-of-sequence token itself, so it is not added by hand
test_sent = 'falsify: This is a safe neighbourhood.'
test_tokenized = t5_tokenizer.encode_plus(test_sent, return_tensors="pt")

test_input_ids = test_tokenized["input_ids"]
test_attention_mask = test_tokenized["attention_mask"]

model.model.eval()
beam_outputs = model.model.generate(
    input_ids=test_input_ids, attention_mask=test_attention_mask,
    max_length=64,
    early_stopping=True,
    num_beams=10,
    num_return_sequences=3,
    no_repeat_ngram_size=2
)

for beam_output in beam_outputs:
    sent = t5_tokenizer.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(sent)


falsified: This is a dangerous neighbourhood
falsified: This is a safe neighbourhood
falsified: This is a dangerous neighbourhood.


In [None]:
# The T5 tokenizer appends the </s> end-of-sequence token itself, so it is not added by hand
test_sent = 'falsify: The tortoise was very slow.'
test_tokenized = t5_tokenizer.encode_plus(test_sent, return_tensors="pt")

test_input_ids = test_tokenized["input_ids"]
test_attention_mask = test_tokenized["attention_mask"]

model.model.eval()
beam_outputs = model.model.generate(
    input_ids=test_input_ids, attention_mask=test_attention_mask,
    max_length=64,
    early_stopping=True,
    num_beams=10,
    num_return_sequences=3,
    no_repeat_ngram_size=2
)

for beam_output in beam_outputs:
    sent = t5_tokenizer.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(sent)


falsified: The tortoise was very slow
falsified: The tortoise was very fast
falsified: The tortoise was slow
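
The three test cells above repeat the same tokenize, generate and decode steps with a different sentence each time. One way to avoid the repetition is a small wrapper such as the sketch below. `falsify` is a hypothetical helper, not part of the original notebook. It assumes it is given the fine-tuned model (`model.model` here) and `t5_tokenizer`, and it reuses the generation settings from the cells above.

```python
def falsify(sentence, model, tokenizer, num_return_sequences=3):
    """Return candidate falsified versions of `sentence` using beam search."""
    # Add the task prefix used during fine-tuning
    encoded = tokenizer.encode_plus("falsify: " + sentence, return_tensors="pt")

    model.eval()
    beam_outputs = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_length=64,
        early_stopping=True,
        num_beams=10,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,
    )

    # Decode each returned beam back into text
    return [
        tokenizer.decode(beam, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for beam in beam_outputs
    ]
```

With the objects defined in this notebook, a call would look like `falsify("The tortoise was very slow.", model.model, t5_tokenizer)`, which returns a list of strings instead of printing them.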
