<a href="https://colab.research.google.com/github/dentadelta/123/blob/master/T5_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# I borrowed a few lines of code from the Google Colab notebook below:

https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=coOmS2s_xDBy

In [0]:
!pip install nlp transformers
!pip install pyarrow==0.17


## Make Sure You Restart Your Runtime After Running the Above Code


In [0]:
import pyarrow as pa
import pandas as pd
import os
import torch
from transformers import (T5Config, T5Tokenizer, T5ForConditionalGeneration, TextDataset, DataCollator, Trainer, TrainingArguments)
import ipywidgets as widgets
import random
from typing import Dict, List
import nlp
from dataclasses import dataclass
from tqdm.auto import tqdm
import re
import pathlib
import numpy as np
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

@dataclass
class T2TDataCollator(DataCollator):
    def collate_batch(self, batch: List) -> Dict[str, torch.Tensor]:
        input_ids = torch.stack([example['input_ids'] for example in batch])
        lm_labels = torch.stack([example['target_ids'] for example in batch])
        lm_labels[lm_labels[:, :] == 0] = -100
        attention_mask = torch.stack([example['attention_mask'] for example in batch])
        decoder_attention_mask = torch.stack([example['target_attention_mask'] for example in batch])
        

        return {
            'input_ids': input_ids, 
            'attention_mask': attention_mask,
            'lm_labels': lm_labels, 
            'decoder_attention_mask': decoder_attention_mask
        }

class Custom_T5_Training():
  def __init__(self, dataset_path,working_folder, maximum_input_length, maximum_output_length, epochs=1, logging_step=1000, model_name = 't5-base'):
    self.dataset_path = dataset_path
    self.working_folder = working_folder
    self.model_name = model_name
    self.maximum_input_length = maximum_input_length
    self.maximum_output_length = maximum_output_length
    self.epochs = epochs
    self.load_data()
    self.load_model()
    self.training_args = TrainingArguments(
                        output_dir= self.working_folder,
                        overwrite_output_dir=True,
                        do_train=True,
                        do_eval =True,
                        num_train_epochs=self.epochs,   
                        per_device_train_batch_size=                1, 
                        logging_steps=                              logging_step,   
                        save_steps=                                 -1,
                        )

  def load_data(self):
    file = pathlib.Path('{}/train_data.pt'.format(self.working_folder))
    if file.exists():
      self.train_dataset = torch.load('{}/train_data.pt'.format(self.working_folder))
      self.valid_dataset = torch.load('{}/valid_data.pt'.format(self.working_folder))
      self.test_dataset =  torch.load('{}/test_data.pt'.format(self.working_folder))

      self.tokenizer = T5Tokenizer.from_pretrained(self.working_folder)
    else:
      self.create_dataset()

  def load_model(self):
    file = pathlib.Path('{}/pytorch_model.bin'.format(self.working_folder))
    if file.exists():
      self.model = T5ForConditionalGeneration.from_pretrained(self.working_folder)
    
    else:
      config = T5Config.from_pretrained(self.model_name)
      self.model = T5ForConditionalGeneration.from_pretrained(self.model_name, config=config)

  def train_model(self, epochs=1):
    data_collator = T2TDataCollator()
    progress = widgets.FloatProgress(value=0.1, min=0.0, max=1.0, bar_style = 'info')

    trainer = Trainer(
                        model= self.model,
                        args=self.training_args,
                        data_collator=data_collator,
                        train_dataset=self.train_dataset,
                        eval_dataset =self.test_dataset,
                        prediction_loss_only=True
                      )
    
    progress.value = 0.4
    p_start, p_end = 0.4, 1.
    
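    def progressify(f):
      def inner(*args, **kwargs):
        # Update the progress bar only when the epoch is known,
        # but always run the wrapped training step.
        if trainer.epoch is not None:
          progress.value = p_start + trainer.epoch / self.epochs * (p_end - p_start)
        return f(*args, **kwargs)
      return inner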

    try:
      trainer._training_step = progressify(trainer._training_step)
      trainer.train()
    
    except KeyboardInterrupt:
      print("Keyboard interrupted, but don't worry because...")
    finally:
      trainer.save_model(self.working_folder)
      print('the model has been saved')

  def add_eos_to_examples(self,example):
    example['input_text'] = '<{}>: <{}> </s>'.format(example['prefix'] , example['input_text'] )
    example['target_text'] = '"<{}> </s>"'.format(example['target_text'])
    return example

  def convert_to_features(self,example_batch):
    input_encodings = self.tokenizer.batch_encode_plus(example_batch['input_text'], pad_to_max_length=True, max_length=self.maximum_input_length)     ########## Specify the maximum input lengths (context + question)
    target_encodings = self.tokenizer.batch_encode_plus(example_batch['target_text'], pad_to_max_length=True, max_length=self.maximum_output_length)     ########## Specify the maximum output length
    encodings = {
        'input_ids': input_encodings['input_ids'], 
        'attention_mask': input_encodings['attention_mask'],
        'target_ids': target_encodings['input_ids'],
        'target_attention_mask': target_encodings['attention_mask']
    }
    return encodings

  def create_dataset(self):
    self.tokenizer = T5Tokenizer.from_pretrained(self.model_name)
    df = pd.read_csv(self.dataset_path)
    df_train, df_valid, df_test= np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

    fields = [
          ('input_text', pa.string()),
          ('target_text', pa.string()),
          ('prefix', pa.string())
      ]

    train_dataset = nlp.arrow_dataset.Dataset(pa.Table.from_pandas(df_train,pa.schema(fields)))
    valid_dataset = nlp.arrow_dataset.Dataset(pa.Table.from_pandas(df_valid,pa.schema(fields)))
    test_dataset  = nlp.arrow_dataset.Dataset(pa.Table.from_pandas(df_test,pa.schema(fields)))

    train_dataset = train_dataset.map(self.add_eos_to_examples)
    train_dataset = train_dataset.map(self.convert_to_features, batched=True)

    valid_dataset = valid_dataset.map(self.add_eos_to_examples, load_from_cache_file=False)
    valid_dataset = valid_dataset.map(self.convert_to_features, batched=True, load_from_cache_file=False)

    test_dataset = test_dataset.map(self.add_eos_to_examples, load_from_cache_file=False)
    test_dataset = test_dataset.map(self.convert_to_features, batched=True, load_from_cache_file=False)

    columns = ['input_ids', 'target_ids', 'attention_mask', 'target_attention_mask']
    train_dataset.set_format(type='torch', columns=columns)
    valid_dataset.set_format(type='torch', columns=columns)
    test_dataset.set_format(type='torch', columns=columns)

    torch.save(train_dataset, '{}/train_data.pt'.format(self.working_folder))
    torch.save(valid_dataset, '{}/valid_data.pt'.format(self.working_folder))
    torch.save(test_dataset, '{}/test_data.pt'.format(self.working_folder))

    self.train_dataset = train_dataset
    self.valid_dataset = valid_dataset
    self.test_dataset = test_dataset
    self.tokenizer.save_pretrained(self.working_folder)

  def validate_model(self, dataset = None, batch_size=32):
    if dataset is None:
      dataset = self.valid_dataset
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    evaluate_model = T5ForConditionalGeneration.from_pretrained(self.working_folder).to(device)
    try:
      answers = []

      for batch in tqdm(dataloader):
        outs = evaluate_model.generate(input_ids=batch['input_ids'].to(device), 
                        attention_mask=batch['attention_mask'].to(device),
                        early_stopping=True)
        outs = [self.tokenizer.decode(ids) for ids in outs]
        answers.extend(outs)
    except KeyboardInterrupt:
      print('proceeds to evaluation')

    finally:
      predictions = []
      references = []
      input_texts = []
      for ref, pred in zip(dataset, answers):
        pred = pred[4:]
        predictions.append(pred)
        
        input_ = self.tokenizer.decode(ref['input_ids'])
        input_ = ''.join(input_)
        input_ = re.sub('[!@#$*-]', '', input_)
        input_ = input_.lstrip().title()

        start_index = input_.index('>:')
        prefix = input_[:start_index]
        input_ = input_[start_index + 6:]

        input_texts.append(input_)

        output_ = self.tokenizer.decode(ref['target_ids'])
        output_ = ''.join(output_)[4:-3]
        references.append(output_)
    
      for _ in range(min(10, len(predictions))):
        # randrange excludes the upper bound, so the index is always valid
        i = random.randrange(len(predictions))
        print('Input:             {}\nPredicted Answer:  {}\nReal Answer:       {}\n'.format(input_texts[i],predictions[i], references[i]))

      return {'input_text': input_texts, 'target_text': predictions}


  def work(self, file_path, batch_size = 100):
    file = pathlib.Path('{}/work_data.pt'.format(self.working_folder))
    if file.exists():
      work_dataset = torch.load('{}/work_data.pt'.format(self.working_folder))
      self.tokenizer = T5Tokenizer.from_pretrained(self.working_folder)
    else:
      da = pd.read_csv(file_path)
      da['target_text'] = ''
      fields = [
          ('input_text', pa.string()),
          ('target_text', pa.string()),
          ('prefix', pa.string())
      ]

      work_dataset = nlp.arrow_dataset.Dataset(pa.Table.from_pandas(da,pa.schema(fields)))
      work_dataset = work_dataset.map(self.add_eos_to_examples)
      work_dataset = work_dataset.map(self.convert_to_features, batched=True)
      columns = ['input_ids', 'target_ids', 'attention_mask', 'target_attention_mask']
      work_dataset.set_format(type='torch', columns=columns)
      torch.save(work_dataset, '{}/work_data.pt'.format(self.working_folder))
      self.tokenizer.save_pretrained(self.working_folder)
    
    results = self.validate_model(work_dataset, batch_size=batch_size)
    ds = pd.DataFrame(results)
  
    ds.to_csv('{}/work_output.csv'.format(self.working_folder), index=None)
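
The string templates in `add_eos_to_examples` are easier to follow with a concrete row. The sketch below is a standalone copy of that method without `self`. The question and answer come from the sample output further down; the `'question'` prefix is a hypothetical value, since the real prefixes in the dataset are not shown.

```python
def add_eos_to_examples(example):
    """Standalone copy of Custom_T5_Training.add_eos_to_examples (no `self`)."""
    example = dict(example)  # work on a copy so the caller's row is unchanged
    example['input_text'] = '<{}>: <{}> </s>'.format(example['prefix'], example['input_text'])
    example['target_text'] = '"<{}> </s>"'.format(example['target_text'])
    return example


row = {
    'prefix': 'question',  # hypothetical prefix value
    'input_text': 'Where did Hanna Holborn Gray go after Yale?',
    'target_text': 'University of Chicago',
}

formatted = add_eos_to_examples(row)
print(formatted['input_text'])   # <question>: <Where did Hanna Holborn Gray go after Yale?> </s>
print(formatted['target_text'])  # "<University of Chicago> </s>"
```

Note that the target template wraps the answer in literal double quotes and angle brackets, so the model is trained to produce them as part of its output.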
    

# Run Your Own Model Here

In [0]:
My_T5 = Custom_T5_Training(
    dataset_path= 'https://www.dropbox.com/s/6w5z4qvt8vytngm/training_data.csv?dl=1', # Host your data in your own cloud storage (as a URL) so that you can keep training the model on the latest available data
    working_folder= '/content/',   # Change this to your Google Drive so that you don't have to retrain your model from scratch in the future
    maximum_input_length=512,
    maximum_output_length= 60,
    model_name= 't5-base',
    logging_step = 100,
    epochs = 1)  
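
The CSV behind `dataset_path` is read with `pd.read_csv` and converted using a schema with three string columns: `input_text`, `target_text` and `prefix`. A minimal sketch of writing a file in that layout with the standard library might look like this. The file location, the rows and the `'question'` prefix are all hypothetical.

```python
import csv
import os
import tempfile

# Column names taken from the `fields` schema in create_dataset().
COLUMNS = ['input_text', 'target_text', 'prefix']

# Hypothetical rows; the 'question' prefix is an assumption for illustration.
rows = [
    {'input_text': 'Where did Hanna Holborn Gray go after Yale?',
     'target_text': 'University of Chicago',
     'prefix': 'question'},
    {'input_text': 'What particles are pushed through the antenna by a transmitter?',
     'target_text': 'electrons',
     'prefix': 'question'},
]

# Write the file into a temporary folder so the sketch has no side effects.
csv_path = os.path.join(tempfile.mkdtemp(), 'training_data.csv')
with open(csv_path, 'w', newline='', encoding='utf-8') as handle:
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the header and row count.
with open(csv_path, newline='', encoding='utf-8') as handle:
    loaded = list(csv.DictReader(handle))
print(len(loaded), list(loaded[0].keys()))
```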

Uncomment and run the function below if you updated your dataset:

In [0]:
#My_T5.create_dataset()  

52559it [00:01, 52210.78it/s]
100%|██████████| 53/53 [00:17<00:00,  2.97it/s]
17520it [00:00, 50764.71it/s]
100%|██████████| 18/18 [00:05<00:00,  3.06it/s]
17520it [00:00, 48899.43it/s]
100%|██████████| 18/18 [00:05<00:00,  3.07it/s]
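
The three row counts in this output appear to match the 60/20/20 split that `create_dataset` performs with `np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])`. The sketch below reproduces the arithmetic in plain Python. The total of 87599 rows is inferred by adding the three counters, and `split_sizes` is a hypothetical helper, not part of the notebook.

```python
def split_sizes(n_rows, train_frac=0.6, valid_end_frac=0.8):
    """Sizes of the three pieces cut at int(.6*n) and int(.8*n), as np.split does."""
    first_cut = int(train_frac * n_rows)
    second_cut = int(valid_end_frac * n_rows)
    return first_cut, second_cut - first_cut, n_rows - second_cut


# 87599 is inferred from the counters above: 52559 + 17520 + 17520.
print(split_sizes(87599))  # (52559, 17520, 17520)
```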


In [0]:
My_T5.train_model()   # you can interrupt training at any time and the model will be saved


{"loss": 2.4776219486892224, "learning_rate": 4.619482496194825e-05, "epoch": 0.076103500761035, "step": 2000}
{"loss": 2.434477653712034, "learning_rate": 4.2389649923896504e-05, "epoch": 0.15220700152207, "step": 4000}
{"loss": 2.4032016327381136, "learning_rate": 3.8584474885844754e-05, "epoch": 0.228310502283105, "step": 6000}
{"loss": 2.4020297024846076, "learning_rate": 3.4779299847793e-05, "epoch": 0.30441400304414, "step": 8000}
{"loss": 2.3949736602306366, "learning_rate": 3.097412480974125e-05, "epoch": 0.380517503805175, "step": 10000}
{"loss": 2.3567006936371326, "learning_rate": 2.71689497716895e-05, "epoch": 0.45662100456621, "step": 12000}
{"loss": 2.3792035594284533, "learning_rate": 2.3363774733637747e-05, "epoch": 0.532724505327245, "step": 14000}
{"loss": 2.3893420217037202, "learning_rate": 1.9558599695585997e-05, "epoch": 0.60882800608828, "step": 16000}
{"loss": 2.3774873881340026, "learning_rate": 1.5753424657534248e-05, "epoch": 0.684931506849315, "step": 18000}
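
The `learning_rate` values in this log are consistent with a linear decay to zero with no warmup. Each line has `epoch = step / 26280` (for example `2000 / 0.076103500761035`), so one epoch is 26280 steps. The sketch below checks the logged values against that schedule. The starting rate of `5e-5` is an assumption, chosen because it reproduces the numbers; the notebook never sets it explicitly.

```python
import math

TOTAL_STEPS = 26280   # derived from the log: step / epoch
INITIAL_LR = 5e-5     # assumed starting rate; it reproduces the logged values


def linear_decay_lr(step, total_steps=TOTAL_STEPS, initial_lr=INITIAL_LR):
    """Learning rate after `step` optimizer steps, decaying linearly to zero."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


# A few (step, learning_rate) pairs copied from the log above.
logged = {
    2000: 4.619482496194825e-05,
    10000: 3.097412480974125e-05,
    18000: 1.5753424657534248e-05,
}

for step, lr in logged.items():
    matches = math.isclose(linear_decay_lr(step), lr, rel_tol=1e-9)
    print(step, linear_decay_lr(step), matches)
```

The loss only falls from about 2.48 to about 2.38 over these steps, so the shrinking learning rate is worth keeping in mind when deciding whether to train for more epochs.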

In [0]:
_ = My_T5.validate_model()  # Validate on data the model has never seen before



Input:             What Is The Term For The Line Across The Field Where The Ball Is Positioned Before
Predicted Answer:  the ball>
Real Answer:       line of scrimmage

Input:             What Can Sometimes Be Translated As Tetragraph?>
Predicted Answer:  the tetragraph>
Real Answer:       Square-Block Characters

Input:             Which Term For His Religious Outlook Did Popper Prefer?>
Predicted Answer:  the sacramental>
Real Answer:       agnosticism

Input:             Where Did Hanna Holborn Gray Go After Yale?>
Predicted Answer:  St. Louis>
Real Answer:       University of Chicago

Input:             What Particles Are Pushed Through The Antenna By A Transmitter?>
Predicted Answer:  sea particles>
Real Answer:       electrons

Input:             What Did Japan Call The Occupied Group Of Asian Nations?>
Predicted Answer:  the "Asian">
Real Answer:       Greater East Asia Co-Prosperity Sphere

Input:             How Many Mortgage Lenders Went Bankrupt During 2007 And 2008?>
Predi

I removed the 'context' from the dataset so that I could use it for a plain sequence-to-sequence task rather than a question-answering task. The model therefore has to answer each question without the passage that contains the answer, which explains why the predicted answers are not accurate.

Nevertheless, T5 is a state-of-the-art NLP model.

If you need to train a question-answering model, you need to use a Long Form pretrained model (not a T5 model).

In [0]:
My_T5.work('PATH_TO_REAL_WORLD_DATASET_FOR_THE_MACHINE_TO CARRYOUT_AN_ANALYSIS.csv',1)