# **Paraphrase Generation**

# **Datasets**

## Quora Question Pairs

The Quora Question Pairs2 dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.

    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'question1': Text(shape=(), dtype=tf.string),
    'question2': Text(shape=(), dtype=tf.string),


## MRPC

The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.


    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=tf.string),
    'sentence2': Text(shape=(), dtype=tf.string),

## MS COCO Annotations


COCO (*Common objects in context*) is a large-scale object detection, segmentation, and captioning dataset. Roughly each example in the dataset consists of an image, its captions and its label. All the examples in the dataset are segregated into 80 classes. And each image has 5 annoted captions for each image. These 5 captions try to explain the image, and hence are paraphrased sentences. We will be breaking this set of 5 captions per each image into two example pairs and leaving out one. 


    'caption': Text(shape=(), dtype=tf.string,
    'image': Image(shape=(None, None, 3), dtype=tf.uint8),
    'image/filename': Text(shape=(), dtype=tf.string),
    'image/id': tf.int64,
    'objects': Sequence({
        'area': tf.int64,
        'bbox': BBoxFeature(shape=(4,), dtype=tf.float32),
        'id': tf.int64,
        'is_crowd': tf.bool,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=80),
    }),

# **What models can be used**


## T5 (Encoder decoder transformer)


  T5 (*Text-to-text Transfer Transformer*) is yet another transformer model for Google Research Group. This is a vanilla transformer i.e this model has both the encoder and decoder layers as described in the paper [ Attention is All You Need](https://arxiv.org/abs/1706.03762). T5 is a encoder-decoder model that reaches SOTA results by solving NLP problems with a text-to-text approach. This is where text is used as both an input and an output for solving all types of tasks. This was introduced in the recent paper, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer ([paper](https://arxiv.org/abs/1910.10683)).

  ![](https://1.bp.blogspot.com/-89OY3FjN0N0/XlQl4PEYGsI/AAAAAAAAFW4/knj8HFuo48cUFlwCHuU5feQ7yxfsewcAwCLcBGAsYHQ/s640/image2.png)

  As we can see from the above picture, we have prepend our task name in front of the input and pass it to the model. If we want to fine tune our model to a specific task then our inputs should be modified to "task_name: input_sentence < /s> " and the output should be modified to "output_sentence < /s>". This model then can be trained on this modified input and output sequences.



To explore this model:

1. [T5 Paper](https://arxiv.org/abs/1910.10683)

2. [A Brief Paper Analysis](https://towardsdatascience.com/t5-text-to-text-transformer-a-brief-paper-analysis-e4bba797bd68)

3. [Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)

4. [Collin Rafel's Talk](https://www.youtube.com/watch?v=eKqWC577WlI)

## Bart (Encoder decoder transformer)

This is similar to the T5 model. This was released by the Facebook's AI team. To explore more about this model I suggest you to go through these links:

1. [Blog Post](https://mc.ai/introducing-bart-combining-the-power-of-bert-and-gpt/)

2. [Introducing BART](https://sshleifer.github.io/blog_v2/jupyter/2020/03/12/bart.html)

3. [BART Paper](https://www.aclweb.org/anthology/2020.acl-main.703.pdf)

4. [Hugging Face Docs](https://huggingface.co/transformers/model_doc/bart.html)

## Transformer + seq2seq model

  The output from encoder models like BERT, is fed in to a seq2seq encoder-decoder model. To explore this model in detail read through the [paper](https://www.aclweb.org/anthology/D19-5627.pdf).



## Encoder-Decoder model

  In this model, one BERT model is used as the encoder and the other BERT model is used as a decoder. Hugging face has a method to implement this model using the Encoder-Decoder Class where we can instantiate one model as an encoder and any other model as a decoder. As of now only the support of BERT2BERT models exist. You can head to the [docs](https://huggingface.co/transformers/model_doc/encoderdecoder.html) for futher implementation details. Do note that there is no Tensorflow implementation of this Encoder-Decoder class as of now only the Pytorch bindings exist.






# **Building the T5 model for fine tuning**

## **Running a Pre-trained Model**

In [1]:
!pip install sentencepiece



In [2]:
!pip install transformers



In [3]:
import torch
from transformers import T5ForConditionalGeneration,T5Tokenizer
import sentencepiece

def set_seed(seed):
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

set_seed(42)

model = T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_paraphraser')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print ("device ",device)
model = model.to(device)

sentence = "There was something in the closet so i had to be careful"

text =  "paraphrase: " + sentence + " </s>"


max_len = 256

encoding = tokenizer.encode_plus(text,pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)


beam_outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    do_sample=True,
    max_length=256,
    top_k=120,
    top_p=0.98,
    early_stopping=True,
    num_return_sequences=5 # Number of sentences to return
)

print(f"Sentence: {sentence}")

print("Paraphrase: ")

for i,line in enumerate(beam_outputs):
  paraphrase = tokenizer.decode(line,skip_special_tokens=True,clean_up_tokenization_spaces=True)
  print(f"{i+1}. {paraphrase}")


Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

device  cuda


  f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated eos tokens being added."


Sentence: There was something in the closet so i had to be careful
Paraphrase: 
1. I knew what was in my closet and had to be careful. I could not see what it was all the time. I think I have all of it. But I'm so sure it's not.
2. Because there was stuff in my closet that would not help, I had to be careful about this.
3. I had something in my closet so i had to be careful. I had to open it to be careful.
4. What I had in the closet is a closet. Something in it always comes out. It's not everything. I just had it!
5. I'm worried about something in the closet because i always felt like he was hiding something, and I knew there was something in there for years. It went away and came back to me and cleaned everything up.


## **Fine tuning our own model**

## **Fine Tuner Class**
















In [4]:
!pip install pytorch-lightning
!pip install transformers



In [5]:
import argparse
import glob
import os
import json
import time
import logging
import random
import re
from itertools import chain
from string import punctuation

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl


from transformers import AdamW, T5ForConditionalGeneration, T5Tokenizer, get_linear_schedule_with_warmup

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
def set_seed(seed):
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

set_seed(42)

In [101]:
class T5FineTuner(pl.LightningModule):

  def __init__(self,hparams):

    # Calling the super constructer
    super().__init__()
    self.params = hparams

    self.model = T5ForConditionalGeneration.from_pretrained(hparams.model_name_or_path)
    self.tokenizer = T5Tokenizer.from_pretrained(hparams.tokenizer_name_or_path)


  def forward(self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, labels=None):

    return self.model(input_ids, attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
            labels=labels,)
    
  def is_logger(self):
      return self.trainer.global_rank <= 0
    

  def _step(self, batch):
        labels = batch["target_ids"]
        labels[labels[:, :] == self.tokenizer.pad_token_id] = -100

        outputs = self(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            labels=labels,
            decoder_attention_mask=batch['target_mask']
        )

        loss = outputs[0]

        return loss

  def training_step(self, batch, batch_idx):
      loss = self._step(batch)

      tensorboard_logs = {"train_loss": loss}
      return {"loss": loss, "log": tensorboard_logs}


  def training_epoch_end(self, outputs):
      avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
      tensorboard_logs = {"avg_train_loss": avg_train_loss}
      return {"avg_train_loss": avg_train_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

  def validation_step(self, batch, batch_idx):
      loss = self._step(batch)
      return {"val_loss": loss}

  def validation_epoch_end(self, outputs):
      avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
      tensorboard_logs = {"val_loss": avg_loss}
      return {"avg_val_loss": avg_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}


  def configure_optimizers(self):
    "Prepare optimizer and schedule (linear warmup and decay)"

    model = self.model
    no_decay = ["bias", "LayerNorm.weight"]

    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": self.params.weight_decay,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.1,
        },
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=self.params.learning_rate, eps=self.params.adam_epsilon)
    self.opt = optimizer
    return [optimizer]


  # def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure=None, using_native_amp=None, on_tpu=None, using_lbfgs=None, optimizer_closure=None):
  #   optimizer.step(closure=optimizer_closure)
  #   optimizer.zero_grad()
  #   self.lr_scheduler.step()


  def get_tqdm_dict(self):
    tqdm_dict = {"loss": "{:.3f}".format(self.trainer.avg_loss), "lr": self.lr_scheduler.get_last_lr()[-1]}

    return tqdm_dict

  def train_dataloader(self):
    train_dataset = CustomDataset(tokenizer=self.tokenizer, type_path="PAW_Train",data_dir=self.params.data_dir, max_len=self.params.max_seq_length)
    dataloader = DataLoader(train_dataset, batch_size=self.params.train_batch_size, drop_last=True, shuffle=True,
                            num_workers=4)
    t_total = (
            (len(dataloader.dataset) // (self.params.train_batch_size * max(1, self.params.n_gpu)))
            // self.params.gradient_accumulation_steps
            * float(self.params.num_train_epochs)
    )
    scheduler = get_linear_schedule_with_warmup(
        self.opt, num_warmup_steps=self.params.warmup_steps, num_training_steps=t_total
    )
    self.lr_scheduler = scheduler
    return dataloader

  def val_dataloader(self):
    val_dataset = CustomDataset(tokenizer=self.tokenizer, type_path="PAW_Test",data_dir=self.params.data_dir, max_len=self.params.max_seq_length)
    return DataLoader(val_dataset, batch_size=self.params.eval_batch_size, num_workers=4)
  



In [102]:
logger = logging.getLogger(__name__)

class LoggingCallback(pl.Callback):
  def on_validation_end(self, trainer, pl_module):
    logger.info("***** Validation results *****")
    if pl_module.is_logger():
      metrics = trainer.callback_metrics
      # Log results
      for key in sorted(metrics):
        if key not in ["log", "progress_bar"]:
          logger.info("{} = {}\n".format(key, str(metrics[key])))

  def on_test_end(self, trainer, pl_module):
    logger.info("***** Test results *****")

    if pl_module.is_logger():
      metrics = trainer.callback_metrics

      # Log and save results to file
      output_test_results_file = os.path.join(pl_module.hparams.output_dir, "test_results.txt")
      with open(output_test_results_file, "w") as writer:
        for key in sorted(metrics):
          if key not in ["log", "progress_bar"]:
            logger.info("{} = {}\n".format(key, str(metrics[key])))
            writer.write("{} = {}\n".format(key, str(metrics[key])))


In [103]:
# Hyper parameters
args_dict = dict(
    data_dir="", # path for data files
    output_dir="", # path to save the checkpoints
    model_name_or_path='t5-base',
    tokenizer_name_or_path='t5-base',
    max_seq_length=512,
    learning_rate=3e-4,
    weight_decay=0.1,
    adam_epsilon=1e-8,
    warmup_steps=0,
    train_batch_size=6,
    eval_batch_size=6,
    num_train_epochs=2,
    gradient_accumulation_steps=16,
    n_gpu=1,
    # early_stop_callback=False,
    fp_16=False, # if you want to enable 16-bit training then install apex and set this to true
    opt_level='O2', # you can find out more on optimisation levels here https://nvidia.github.io/apex/amp.html#opt-levels-and-properties
    max_grad_norm=1.0, # if you enable 16-bit training then set this to a sensible value, 0.5 is a good default
    seed=42,
)

In [104]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')

class CustomDataset(Dataset):
    def __init__(self, tokenizer, data_dir, type_path, max_len=256):
        # self.path = os.path.join(data_dir, type_path + '.csv')

        self.source_column = "question1"
        self.target_column = "question2"
        
        self.data = []
        
        with open(type_path+".csv","r") as csv_file:
          csv_reader = csv.reader(csv_file, delimiter=',')
          line_count = 0
          for row in csv_reader:
            self.data.append(row)

        self.max_len = max_len
        self.tokenizer = tokenizer
        self.inputs = []
        self.targets = []

        self._build()

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        source_ids = self.inputs[index]["input_ids"].squeeze()
        target_ids = self.targets[index]["input_ids"].squeeze()

        src_mask = self.inputs[index]["attention_mask"].squeeze()  # might need to squeeze
        target_mask = self.targets[index]["attention_mask"].squeeze()  # might need to squeeze

        return {"source_ids": source_ids, "source_mask": src_mask, "target_ids": target_ids, "target_mask": target_mask}

    def _build(self):
        for example in self.data:
            
            input_ = example[0]
            target = example[1]

            input_ = "paraphrase: "+ input_ + ' </s>'
            target = target + " </s>"

            # tokenize inputs
            tokenized_inputs = self.tokenizer.batch_encode_plus(
                [input_], max_length=self.max_len, pad_to_max_length=True, truncation=True, return_tensors="pt"
            )
            # tokenize targets
            tokenized_targets = self.tokenizer.batch_encode_plus(
                [target], max_length=self.max_len, pad_to_max_length=True,truncation=True, return_tensors="pt"
            )

            self.inputs.append(tokenized_inputs)
            self.targets.append(tokenized_targets)





In [71]:
# def get_dataset(tokenizer, type_path, args):
#   return CustomDataset(tokenizer=tokenizer, data_dir=args.data_dir, type_path=type_path,  max_len=args.max_seq_length)

## Quora Question Pairs

In [None]:
# Data Preparation
import tensorflow_datasets as tfds

(ds_train,ds_test,ds_validation) ,ds_info = tfds.load("glue/qqp",split=["train","test","validation"],with_info=True)

print(ds_info)


train_examples = []
test_examples = []


for example in ds_train:

  if(example["label"] == 1):
    train_examples.append((example["question1"].numpy().decode(),example["question2"].numpy().decode()))
  

for example in ds_validation:
  
  if(example["label"] == 1):
    test_examples.append((example["question1"].numpy().decode(),example["question2"].numpy().decode()))

In [None]:
import csv

with open('Train.csv','w') as out:
    csv_out=csv.writer(out)
    # csv_out.writerow(['question1',''])
    for row in train_examples:
        csv_out.writerow(row)

with open('Test.csv','w') as out:
    csv_out = csv.writer(out)

    for row in test_examples:
        csv_out.writerow(row)    

In [None]:
if not os.path.exists('t5_QQP'):
    os.makedirs('t5_QQP')


args_dict.update({'output_dir': 't5_paraphrase','num_train_epochs':3,'max_seq_length':256})
args = argparse.Namespace(**args_dict)
print(args_dict)

In [None]:
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    filepath=args.output_dir, prefix="checkpoint", monitor="val_loss", mode="min", save_top_k=5
)

train_params = dict(
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gpus=args.n_gpu,
    max_epochs=args.num_train_epochs,
    early_stop_callback=False,
    precision= 16 if args.fp_16 else 32,
    amp_level=args.opt_level,
    gradient_clip_val=args.max_grad_norm,
    checkpoint_callback=checkpoint_callback,
    callbacks=[LoggingCallback()],
)

In [None]:
model = T5FineTuner(args)

In [None]:

trainer = pl.Trainer(**train_params)

print (" Training model")
trainer.fit(model)

print ("training finished")

print ("Saving model")
model.model.save_pretrained('t5_paraphrase')

print ("Saved model")

In [None]:
# Getting the output


# model = T5ForConditionalGeneration.from_pretrained('./t5_paraphraser')
# tokenizer = T5Tokenizer.from_pretrained('t5-small')

sentence = "In order to make something we have to work hard."

text =  "paraphrase: " + sentence + " </s>"


max_len = 256

encoding = tokenizer.encode_plus(text,pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")


# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
# Have to read about these decodings
beam_outputs = model.model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    do_sample=True,
    max_length=256,
    top_k=120,
    top_p=0.98,
    early_stopping=True,
    num_return_sequences=10
)


print ("\nOriginal Question ::")
print (sentence)
print ("\n")
print ("Paraphrased Questions :: ")
final_outputs =[]
for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    if sent.lower() != sentence.lower() and sent not in final_outputs:
        final_outputs.append(sent)

for i, final_output in enumerate(final_outputs):
    print("{}: {}".format(i, final_output))


Original Question ::
In order to make something we have to work hard.


Paraphrased Questions :: 
0: To Make Something you can also work hard.
1: How to write will work hard if there’s not enough.
2: The most basic equation is free speech technology but it can be used to create someone wrong.
3: Now we have to work hard...
4: Make her guess is sure to work for a lot of others.
5: How important is making something necessary.
6: But when thinking about it, we should have a hard learning experience.
7: As a society you can expect to make something into your work.
8: Whether we create something is an investment do enough to push this way.
9: To learn a little we are trying to make something a dream.


## MRPC

In [None]:
# Data Preparation
import tensorflow_datasets as tfds

(ds_train,ds_test,ds_validation),ds_info = tfds.load("glue/mrpc",split=["train","test","validation"],with_info=True)


mrpc_train = []
mrpc_test = []


for example in ds_train:

  if(example["label"] == 1):
    mrpc_train.append((example["sentence1"].numpy().decode(),example["sentence2"].numpy().decode()))
  

for example in ds_validation:
  
  if(example["label"] == 1):
    mrpc_test.append((example["sentence1"].numpy().decode(),example["sentence2"].numpy().decode()))


In [None]:

with open('MRPC_Train.csv','w') as out:
    csv_out=csv.writer(out)
    # csv_out.writerow(['question1',''])
    for row in mrpc_train:
        csv_out.writerow(row)

with open('MRPC_Test.csv','w') as out:
    csv_out = csv.writer(out)

    for row in mrpc_test:
        csv_out.writerow(row)     

In [None]:
if not os.path.exists('t5_mrpc'):
    os.makedirs('t5_mrpc')


args_dict.update({'output_dir': 't5_mrpc','num_train_epochs':5,'max_seq_length':256})
args = argparse.Namespace(**args_dict)
print(args_dict)

In [None]:
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    filepath=args.output_dir, prefix="checkpoint", monitor="val_loss", mode="min", save_top_k=5
)

train_params = dict(
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gpus=args.n_gpu,
    max_epochs=args.num_train_epochs,
    early_stop_callback=False,
    precision= 16 if args.fp_16 else 32,
    amp_level=args.opt_level,
    gradient_clip_val=args.max_grad_norm,
    checkpoint_callback=checkpoint_callback,
    callbacks=[LoggingCallback()],
)

In [None]:
model = T5FineTuner(args)

In [None]:

trainer = pl.Trainer(**train_params)

print (" Training model")
5: How important is making something necessary.
trainer.fit(model)

print ("training finished")

print ("Saving model")
model.model.save_pretrained('t5_paraphrase')

print ("Saved model")

In [None]:
sentence = "In order to make something we have to work hard."

text =  "paraphrase: " + sentence + " </s>"


max_len = 256

encoding = tokenizer.encode_plus(text,pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")


# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
# Have to read about these decodings
beam_outputs = model.model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    do_sample=True,
    max_length=256,
    top_k=120,
    top_p=0.98,
    early_stopping=True,
    num_return_sequences=10
)


print ("\nOriginal Question ::")
print (sentence)
print ("\n")
print ("Paraphrased Questions :: ")
final_outputs =[]
for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    if sent.lower() != sentence.lower() and sent not in final_outputs:
        final_outputs.append(sent)

for i, final_output in enumerate(final_outputs):
    print("{}: {}".format(i, final_output))


Original Question ::
In order to make something we have to work hard.


Paraphrased Questions :: 
0: In order to make something the people wants us to work harder.
1: As many people we get rid of, people work hard to create the product people!
2: The purpose of the law is to work hard to make something one more.
3: Just enter the story in yourself, in order to make something, some people should work hard.
4: Rather, in order to do a man something is to work hard.
5: Pour une réalisation, we know that we are not capable of being more self-heavy.
6: Eroldened a career from life without technology.
7: Our main advantages are the ability to work hard.
8: To do things and work we have to work hard.
9: Just because of our work, we have to work hard.


## MS COCO Annotations

In [None]:
# Data Preparation

## Downloading the coco dataset and unzipping the content
!wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
!unzip annotations_trainval2017.zip

--2020-07-15 11:03:20--  http://images.cocodataset.org/annotations/annotations_trainval2017.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.216.77.204
Connecting to images.cocodataset.org (images.cocodataset.org)|52.216.77.204|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252907541 (241M) [application/zip]
Saving to: ‘annotations_trainval2017.zip’


2020-07-15 11:03:23 (80.9 MB/s) - ‘annotations_trainval2017.zip’ saved [252907541/252907541]

Archive:  annotations_trainval2017.zip
  inflating: annotations/instances_train2017.json  
  inflating: annotations/instances_val2017.json  
  inflating: annotations/captions_train2017.json  
  inflating: annotations/captions_val2017.json  
  inflating: annotations/person_keypoints_train2017.json  
  inflating: annotations/person_keypoints_val2017.json  


In [None]:
import json

paraphrase_examples = []
paraphrase_test = []

with open("annotations/captions_train2017.json","r") as file:

  data = json.load(file)


  annotations = data["annotations"]

  annotations.sort(key=lambda x:x["image_id"])


with open("annotations/captions_val2017.json","r") as file:

  data_test = json.load(file)

  annotations_test = data_test["annotations"]

  annotations_test.sort(key=lambda x:x["image_id"])

In [None]:
for examples in range(0,len(annotations),5):

    pairs = annotations[examples:examples+5]


    if(len(pairs) >= 4):

      paraphrase_examples.append((pairs[0]['caption'],pairs[1]['caption']))

      paraphrase_examples.append((pairs[2]['caption'],pairs[3]['caption']))




for examples in range(0,len(annotations_test),5):

    pairs = annotations_test[examples:examples+5]


    if(len(pairs) >= 4):

      paraphrase_test.append((pairs[0]['caption'],pairs[1]['caption']))

      paraphrase_test.append((pairs[2]['caption'],pairs[3]['caption']))

In [None]:
print(len(paraphrase_examples),len(paraphrase_test))
ind = len(paraphrase_examples)//2
paraphrase_examples = paraphrase_examples[:ind]
print(len(paraphrase_examples))
print(paraphrase_examples[0])

236700 10006
118350
('Closeup of bins of food that include broccoli and bread.', 'A meal is presented in brightly colored plastic trays.')


In [None]:
import csv

with open("COCO_Train.csv","w") as file:

  csv_out = csv.writer(file)

  for row in paraphrase_examples:

    csv_out.writerow(row)


with open("COCO_Test.csv","w") as file:

  csv_out = csv.writer(file)

  for row in paraphrase_test:

    csv_out.writerow(row)

In [None]:
if not os.path.exists('t5_coco'):
    os.makedirs('t5_coco')

args_dict.update({'output_dir': 't5_coco','num_train_epochs':1,'max_seq_length':256})
args = argparse.Namespace(**args_dict)
print(args_dict)

{'data_dir': '', 'output_dir': 't5_coco', 'model_name_or_path': 't5-small', 'tokenizer_name_or_path': 't5-small', 'max_seq_length': 256, 'learning_rate': 0.0003, 'weight_decay': 0.0, 'adam_epsilon': 1e-08, 'warmup_steps': 0, 'train_batch_size': 6, 'eval_batch_size': 6, 'num_train_epochs': 1, 'gradient_accumulation_steps': 16, 'n_gpu': 1, 'early_stop_callback': False, 'fp_16': False, 'opt_level': 'O1', 'max_grad_norm': 1.0, 'seed': 42}


In [None]:
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    filepath=args.output_dir, prefix="checkpoint", monitor="val_loss", mode="min", save_top_k=5
)

train_params = dict(
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gpus=args.n_gpu,
    max_epochs=args.num_train_epochs,
    early_stop_callback=False,
    precision= 16 if args.fp_16 else 32,
    amp_level=args.opt_level,
    gradient_clip_val=args.max_grad_norm,
    checkpoint_callback=checkpoint_callback,
    callbacks=[LoggingCallback()],
)

In [None]:
model = T5FineTuner(args)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1197.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=242136741.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at t5-small were not used when initializing T5ForConditionalGeneration: ['encoder.block.0.layer.0.layer_norm.bias', 'encoder.block.0.layer.1.layer_norm.bias', 'encoder.block.1.layer.0.layer_norm.bias', 'encoder.block.1.layer.1.layer_norm.bias', 'encoder.block.2.layer.0.layer_norm.bias', 'encoder.block.2.layer.1.layer_norm.bias', 'encoder.block.3.layer.0.layer_norm.bias', 'encoder.block.3.layer.1.layer_norm.bias', 'encoder.block.4.layer.0.layer_norm.bias', 'encoder.block.4.layer.1.layer_norm.bias', 'encoder.block.5.layer.0.layer_norm.bias', 'encoder.block.5.layer.1.layer_norm.bias', 'encoder.final_layer_norm.bias', 'decoder.block.0.layer.0.layer_norm.bias', 'decoder.block.0.layer.1.layer_norm.bias', 'decoder.block.0.layer.2.layer_norm.bias', 'decoder.block.1.layer.0.layer_norm.bias', 'decoder.block.1.layer.1.layer_norm.bias', 'decoder.block.1.layer.2.layer_norm.bias', 'decoder.block.2.layer.0.layer_norm.bias', 'decoder.block.2.layer.1.layer_norm.bias

In [None]:

trainer = pl.Trainer(**train_params)

print (" Training model")
trainer.fit(model)

print ("training finished")

print ("Saving model")
model.model.save_pretrained('t5_paraphrase')

print ("Saved model")

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


 Training model



  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 60 M  


HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validation sanity check', layout=Layout…



HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Training', layout=Layout(flex='2'), max…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…


training finished
Saving model
Saved model


In [None]:
# Getting the output


model = T5ForConditionalGeneration.from_pretrained('t5_paraphrase')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

model.to("cuda")

sentence = "People are so fragile that i cannot even perceive this intuition"

text =  "paraphrase: " + sentence + " </s>"


max_len = 256

encoding = tokenizer.encode_plus(text,pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")


# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
# Have to read about these decodings
beam_outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    do_sample=True,
    max_length=256,
    top_k=120,
    top_p=0.98,
    early_stopping=True,
    num_return_sequences=5
)


print ("\nOriginal Question ::")
print (sentence)
print ("\n")
print ("Paraphrased Questions :: ")
final_outputs =[]
for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    if sent.lower() != sentence.lower() and sent not in final_outputs:
        final_outputs.append(sent)

for i, final_output in enumerate(final_outputs):
    print("{}: {}".format(i, final_output))


Original Question ::
People are so fragile that i cannot even perceive this intuition


Paraphrased Questions :: 
0: A hand tilts and twists on a glass balcony.
1: Some people in the woods and some people in the woods.
2: A man doing a trick on his hand.
3: A blind dog on a bridge in a grass field.
4: A group of people that are going through a physical relationship.


# Google's PAW Dataset

In [11]:
!wget https://storage.googleapis.com/paws/english/paws_wiki_labeled_final.tar.gz

--2022-01-13 10:46:38--  https://storage.googleapis.com/paws/english/paws_wiki_labeled_final.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.213.128, 173.194.215.128, 173.194.216.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.213.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4687157 (4.5M) [application/gzip]
Saving to: ‘paws_wiki_labeled_final.tar.gz.1’


2022-01-13 10:46:38 (233 MB/s) - ‘paws_wiki_labeled_final.tar.gz.1’ saved [4687157/4687157]



In [12]:
! tar -xzf paws_wiki_labeled_final.tar.gz

In [13]:
import csv

train_examples = []
test_examples = []
dev_examples = []

with open("final/train.tsv","r") as csvfile:

  reader = csv.reader(csvfile,delimiter="\t")
  
  next(reader)

  for row in reader:

    if row[3] == "1":
      train_examples.append((row[1],row[2]))



with open("final/test.tsv","r") as csvfile:

  reader = csv.reader(csvfile,delimiter="\t")
  
  next(reader)

  for row in reader:

    if row[3] == "1":
      test_examples.append((row[1],row[2]))


with open("final/dev.tsv","r") as csvfile:

  reader = csv.reader(csvfile,delimiter="\t")
  
  next(reader)

  for row in reader:

    if row[3] == "1":
      dev_examples.append((row[1],row[2]))






In [14]:
test_examples = dev_examples + test_examples

In [15]:
len(test_examples)

7075

In [16]:
with open("PAW_Train.csv","w") as csvfile:
  writer = csv.writer(csvfile)

  for row in train_examples:
    writer.writerow(row)


In [17]:
with open("PAW_Test.csv","w") as csvfile:
  writer = csv.writer(csvfile)

  for row in test_examples:
    writer.writerow(row)


In [18]:
if not os.path.exists('t5_paw_global'):
    os.makedirs('t5_paw_global')

args_dict.update({'output_dir': 't5_paw_global','num_train_epochs':1,'max_seq_length':256})
args = argparse.Namespace(**args_dict)
print(args_dict)

{'data_dir': '', 'output_dir': 't5_paw_global', 'model_name_or_path': 't5-base', 'tokenizer_name_or_path': 't5-base', 'max_seq_length': 256, 'learning_rate': 0.0003, 'weight_decay': 0.1, 'adam_epsilon': 1e-08, 'warmup_steps': 0, 'train_batch_size': 6, 'eval_batch_size': 6, 'num_train_epochs': 1, 'gradient_accumulation_steps': 16, 'n_gpu': 1, 'fp_16': False, 'opt_level': 'O2', 'max_grad_norm': 1.0, 'seed': 42}


In [105]:
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    filename="checkpoint" + args.output_dir, monitor="val_loss", mode="min", save_top_k=5
)

train_params = dict(
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gpus=args.n_gpu,
    max_epochs=args.num_train_epochs,
    # early_stop_callback=False,
    precision= 16 if args.fp_16 else 32,
    amp_level=args.opt_level,
    gradient_clip_val=args.max_grad_norm,
    checkpoint_callback=checkpoint_callback,
    callbacks=[LoggingCallback()],
    amp_backend='apex'
)

In [106]:
model = T5FineTuner(args)

In [None]:
import csv
trainer = pl.Trainer(**train_params)

print (" Training model")
trainer.fit(model)

print ("training finished")

print ("Saving model")
model.model.save_pretrained('t5_paw_global')

print ("Saved model")

  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


 Training model



  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

  f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated eos tokens being added."
  cpuset_checked))


Training: 0it [00:00, ?it/s]

  f"One of the returned values {set(extra.keys())} has a `grad_fn`. We will detach it automatically"


In [None]:
# Getting the output


model = T5ForConditionalGeneration.from_pretrained('t5_paraphrase')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

model.to("cuda")

sentence = "This is something which i cannot understand at all"

text =  "paraphrase: " + sentence + " </s>"


max_len = 256

encoding = tokenizer.encode_plus(text,pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")


# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
# Have to read about these decodings
beam_outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    do_sample=True,
    max_length=256,
    top_k=220,
    top_p=1,
    early_stopping=True,
    num_return_sequences=5
)


print ("\nOriginal Question ::")
print (sentence)
print ("\n")
print ("Paraphrased Questions :: ")
final_outputs =[]
for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    if sent.lower() != sentence.lower() and sent not in final_outputs:
        final_outputs.append(sent)

for i, final_output in enumerate(final_outputs):
    print("{}: {}".format(i, final_output))


Original Question ::
This is something which i cannot understand at all


Paraphrased Questions :: 
0: This is something I cannot understand at all so there is nothing i could understand it at all.
1: This is something that i can have absolutely no comprehension of.
2: This is something which i cannot explain at all.
3: ... It is something at all that i can not understand myself at.
4: This is something i cannot understand at all.


### Additonal Training Data From PAWS

In [None]:
!wget https://storage.googleapis.com/paws/english/paws_wiki_unlabeled_final.tar.gz

--2020-07-18 03:39:59--  https://storage.googleapis.com/paws/english/paws_wiki_unlabeled_final.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.79.128, 108.177.119.128, 108.177.126.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.79.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47393331 (45M) [application/gzip]
Saving to: ‘paws_wiki_unlabeled_final.tar.gz’


2020-07-18 03:40:01 (45.3 MB/s) - ‘paws_wiki_unlabeled_final.tar.gz’ saved [47393331/47393331]



In [None]:

!tar -xzf paws_wiki_unlabeled_final.tar.gz

In [None]:
unlabeled_train = []
unlabeled_test = []


with open("unlabeled/final/train.tsv","r") as csvfile:

  reader = csv.reader(csvfile,delimiter="\t")

  for row in reader:

    if row[3] == "1":
      unlabeled_train.append((row[1],row[2]))

with open("unlabeled/final/dev.tsv","r") as csvfile:

  reader = csv.reader(csvfile,delimiter="\t")

  for row in reader:

    if row[3] == "1":
      unlabeled_test.append((row[1],row[2]))



In [None]:
ind = len(unlabeled_train) // 3
print(ind)
unlabeled_train = unlabeled_train[:ind]

107608


In [None]:
len(unlabeled_test)

5000

In [None]:
train_data = train_examples + unlabeled_train

In [None]:
test_data = test_examples + unlabeled_test

In [None]:
import random


random.shuffle(train_data)
random.shuffle(test_data)


In [None]:
len(test_data)

12075

In [None]:
with open("PAW_Train_Global.csv","w") as csvfile:
  writer = csv.writer(csvfile)

  for row in train_data:
    writer.writerow(row)


with open("PAW_Test_Global.csv","w") as csvfile:
  writer = csv.writer(csvfile)

  for row in test_data:
    writer.writerow(row)