Paraphrasing a question with T5 (Text-To-Text Transfer Transformer) means using the model, developed by Google Research, to reword a question while keeping its meaning. T5 casts every natural language processing task as text-to-text generation: the model reads an input string and generates an output string. This notebook fine-tunes the `t5-base` checkpoint on Quora duplicate question pairs so that it generates paraphrases.

To paraphrase a question with T5, pass the original question to the model as text, prefixed with a task marker (this notebook uses `paraphrase: `), and let the model generate a reworded version that preserves the meaning.
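
As a minimal sketch of the input format, the helper below builds the string that would be handed to the tokenizer. The function name `build_paraphrase_input` is hypothetical and only for illustration; the `paraphrase: ` prefix matches the `PREFIX` constant defined later in this notebook.

```python
def build_paraphrase_input(question: str, prefix: str = "paraphrase: ") -> str:
    """Return the text-to-text input for one question (hypothetical helper)."""
    # strip stray whitespace so the prefix is followed directly by the question
    return prefix + question.strip()


example = build_paraphrase_input(
    "  What are the effects of climate change on the environment? "
)
print(example)
```

The model never sees a separate task flag: the prefix is part of the input text, which is why the same prefix has to be used for training and inference.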

Here's an example:
- Original Question: "What are the effects of climate change on the environment?"
- Paraphrased Question (generated by T5): "How does climate change impact the natural world?"

In this example, the paraphrase keeps the core meaning and intent of the original query. This is useful for generating diverse variations of a question or for adapting a question to a particular context. The same text-to-text setup also covers tasks such as summarization, translation, and question answering.

Dataset: Quora Question Pairs (first Quora dataset release), https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs
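
To make the dataset layout concrete, here is a small sketch that filters duplicate pairs out of tab-separated text using only the standard library. The column names are the ones `data.info()` reports later in this notebook; the two sample rows and the helper name `duplicate_pairs` are made up for illustration.

```python
import csv
import io

# Two invented rows in the same column layout as quora_duplicate_questions.tsv
SAMPLE_TSV = (
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "1\t3\t4\tHow do I learn Python?\tHow do I bake bread?\t0\n"
)


def duplicate_pairs(tsv_text):
    """Return (question1, question2) tuples for rows labeled as duplicates."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["question1"], row["question2"])
        for row in reader
        if row["is_duplicate"] == "1"  # csv yields strings, not ints
    ]


pairs = duplicate_pairs(SAMPLE_TSV)
print(pairs)
```

Only rows with `is_duplicate == 1` are usable as paraphrase training pairs; the non-duplicate rows are questions that merely look similar.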

## Set up the environment

In [None]:
!pip install transformers datasets accelerate evaluate rouge_score -q

In [None]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv

--2023-10-29 12:26:31--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 162.159.152.17, 162.159.153.247
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|162.159.152.17|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv [following]
--2023-10-29 12:26:31--  https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|162.159.152.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv’


2023-10-29 12:26:31 (82.2 MB/s) - ‘quora_duplicate_questions.tsv’ saved [58176133/58176133]



## Import libraries

In [None]:
import os
import pandas as pd
import numpy as np
import nltk
import evaluate
nltk.download('punkt')

from sklearn.model_selection import train_test_split

import torch
from datasets import Dataset, DatasetDict
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer,  AutoModelForSeq2SeqLM
from transformers import T5ForConditionalGeneration, AutoTokenizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# set variable & parameters
PATH_TSV = "/content/quora_duplicate_questions.tsv"
PATH_TRAIN = "/content/train.csv"
PATH_TEST = "/content/test.csv"
PATH_VAL = "/content/val.csv"

MAX_LENGTH = 256
BATCH_SIZE = 16

PREFIX = "paraphrase: "
END_PREFIX = " </s>"
MODEL_CHECKPOINT = "t5-base"
MODEL_REPO = "t5-base-paraphase-question"

## Preprocess dataset

In [None]:
data = pd.read_csv(PATH_TSV, sep="\t")
data = data.loc[:30000]  # .loc slicing is label-based and inclusive, so this keeps 30,001 rows

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30001 entries, 0 to 30000
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            30001 non-null  int64 
 1   qid1          30001 non-null  int64 
 2   qid2          30001 non-null  int64 
 3   question1     30001 non-null  object
 4   question2     30001 non-null  object
 5   is_duplicate  30001 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 1.4+ MB


In [None]:
# keep only the pairs labeled as duplicates (true paraphrases);
# selecting the two text columns on an explicit copy avoids pandas' SettingWithCopyWarning
question_pairs_correct_paraphrased = data.loc[data['is_duplicate'] == 1, ['question1', 'question2']].copy()

# 70% train; the remaining 30% is split 80/20 into validation and test
train, val = train_test_split(question_pairs_correct_paraphrased, test_size=0.3)
val, test = train_test_split(val, test_size=0.2)
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
val.to_csv('val.csv', index=False)
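
The two chained splits give roughly a 70/24/6 division. The sketch below reproduces the arithmetic, assuming the held-out share is rounded up to a whole number of rows; that assumption is consistent with the row counts this notebook prints for the three splits. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, held_out_fraction):
    """Return (kept, held_out) row counts, rounding the held-out part up."""
    n_held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - n_held_out, n_held_out


n_pairs = 7809 + 2678 + 670  # duplicate pairs, summed from the printed split sizes
n_train, n_rest = split_sizes(n_pairs, 0.3)   # first split: train vs. the rest
n_val, n_test = split_sizes(n_rest, 0.2)      # second split: validation vs. test
print(n_train, n_val, n_test)
```

Because no `random_state` is passed to `train_test_split`, the row counts are reproducible but the exact rows in each split change from run to run.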


In [None]:
train = pd.read_csv(PATH_TRAIN)
val = pd.read_csv(PATH_VAL)
test = pd.read_csv(PATH_TEST)
train_data = Dataset.from_pandas(train)
val_data = Dataset.from_pandas(val)
test_data = Dataset.from_pandas(test)


data = DatasetDict()
data['train'] = train_data
data['validation'] = val_data
# data['test'] = test_data
print(data)

DatasetDict({
    train: Dataset({
        features: ['question1', 'question2'],
        num_rows: 7809
    })
    validation: Dataset({
        features: ['question1', 'question2'],
        num_rows: 2678
    })
})


## Tokenize dataset

In [None]:
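def preprocess_function(examples):
    # prepend the task prefix to each source question; the paired question is the target
    inputs = [PREFIX + doc + END_PREFIX for doc in examples["question1"]]
    target = [doc + END_PREFIX for doc in examples["question2"]]
    # tokenize inputs, padding/truncating to MAX_LENGTH
    # (padding="max_length" replaces the deprecated pad_to_max_length=True)
    model_inputs = tokenizer(
        inputs, max_length=MAX_LENGTH,
        padding="max_length", truncation=True
    )

    # tokenize targets; the data collator pads them per batch
    labels = tokenizer(
        text_target=target, max_length=MAX_LENGTH, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs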
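
To show what padding and truncating to a fixed length produce, here is a simplified sketch on toy token ids. It is not the tokenizer's implementation: the ids are made up, the helper `pad_or_truncate` is hypothetical, and truncation here simply cuts the tail, whereas a real tokenizer handles special tokens on its own. The pad id of 0 is T5's pad token id.

```python
PAD_ID = 0  # T5's pad token id


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    # padding: fill the tail with pad ids, masked out with 0
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


ids, mask = pad_or_truncate([8, 15, 23, 1], max_length=6)
print(ids)
print(mask)

long_ids, long_mask = pad_or_truncate([8, 15, 23, 42, 4, 16, 1], max_length=6)
print(long_ids)
print(long_mask)
```

The attention mask is what lets the model ignore the padded positions, so fixed-length padding costs compute but does not change what the model attends to.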

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
model = T5ForConditionalGeneration.from_pretrained(MODEL_CHECKPOINT)


In [None]:
tokenized_dataset = data.map(preprocess_function, batched=True, remove_columns=["question1", "question2"])


In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
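
`DataCollatorForSeq2Seq` pads the label sequences in each batch with -100 by default, a value the loss skips, and `compute_metrics` in this notebook swaps that value back to the pad id before decoding. The sketch below mimics both steps on plain lists; the helper names and the toy ids are hypothetical, and this is an illustration rather than the collator's actual code.

```python
IGNORE_INDEX = -100  # label positions with this value are skipped by the loss
PAD_ID = 0           # T5's pad token id


def pad_labels(batch, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in batch)
    return [labels + [ignore_index] * (longest - len(labels)) for labels in batch]


def restore_pad_ids(batch, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Swap the ignore index back to the pad id so the ids can be decoded."""
    return [[pad_id if t == ignore_index else t for t in labels] for labels in batch]


padded = pad_labels([[11, 12, 1], [21, 1]])
restored = restore_pad_ids(padded)
print(padded)
print(restored)
```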

## Compute Metrics

In [None]:
metrics = evaluate.load("rouge")


In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks ignored label positions; swap it for the pad token id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # ROUGE between the generated and the reference questions
    result = metrics.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # average generated length, counted in non-padding tokens
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
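
For intuition about what `rouge1` measures, here is a simplified unigram-overlap F1 in plain Python. It is a sketch, not the `rouge_score` implementation: it lowercases and splits on whitespace, with no stemming and no special tokenization, so its numbers will not match the library exactly. The function name `rouge1_f1` and the example sentences are made up.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 on lowercased, whitespace-split tokens (simplified)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches)
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "what is the best way to bake a cake",
    "what is the best way to make a cake",
)
print(round(score, 4))
```

A word-overlap metric rewards copying the reference wording, so for paraphrasing a high score signals closeness to one particular reference rather than paraphrase quality as such.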

In [None]:
# Set up training arguments
training_args = Seq2SeqTrainingArguments(
    MODEL_REPO,
    evaluation_strategy="steps",
    eval_steps=500,
    learning_rate=3e-4,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    fp16=True  # T5 can overflow under fp16; bf16=True is an alternative on hardware that supports it
)

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

## Training

In [None]:
trainer.train()

| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 500 | 0.8239 | 1.293208 | 0.6213 | 0.3793 | 0.5917 | 0.5917 | 11.9029 |
| 1000 | 0.9769 | 1.318021 | 0.6197 | 0.377 | 0.5899 | 0.5897 | 11.8786 |
| 1500 | 0.9795 | 1.317559 | 0.6199 | 0.3773 | 0.5898 | 0.5897 | 11.8783 |
| 2000 | 0.9804 | 1.317149 | 0.6202 | 0.3774 | 0.59 | 0.5899 | 11.8727 |
| 2500 | 0.9801 | 1.316763 | 0.6201 | 0.3772 | 0.5899 | 0.5897 | 11.8727 |
| 3000 | 0.9782 | 1.316485 | 0.6201 | 0.3773 | 0.5899 | 0.5897 | 11.8704 |
| 3500 | 0.9713 | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4000 | 0.0 | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4500 | 0.0 | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |




TrainOutput(global_step=4890, training_loss=0.684061642512222, metrics={'train_runtime': 3374.9461, 'train_samples_per_second': 23.138, 'train_steps_per_second': 1.449, 'total_flos': 2.37767608369152e+16, 'train_loss': 0.684061642512222, 'epoch': 10.0})
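
The `global_step=4890` in the output above can be cross-checked from the training setup. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; under those assumptions the numbers line up with the run.

```python
import math

train_rows = 7809   # training rows reported by this notebook
batch_size = 16     # BATCH_SIZE
epochs = 10         # num_train_epochs
eval_every = 500    # eval_steps

# the final, smaller batch still counts as one optimizer step
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
# steps at which an evaluation is triggered
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch, total_steps)
print(eval_points)
```

With evaluation every 500 steps, the last evaluation falls at step 4500, which is why the metrics table stops there even though training runs to step 4890.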

## Evaluate the model on the test set


In [None]:
import torch

In [None]:
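def preprocess_function(examples):
    # test-time variant: only the source questions are tokenized, no labels
    inputs = [PREFIX + doc + END_PREFIX for doc in examples["question1"]]
    # tokenize inputs, padding/truncating to MAX_LENGTH
    # (padding="max_length" replaces the deprecated pad_to_max_length=True)
    model_inputs = tokenizer(
        inputs, max_length=MAX_LENGTH,
        padding="max_length", truncation=True
    )

    return model_inputs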

In [None]:
data['test'] = test_data
print(data)

DatasetDict({
    train: Dataset({
        features: ['question1', 'question2'],
        num_rows: 7809
    })
    validation: Dataset({
        features: ['question1', 'question2'],
        num_rows: 2678
    })
    test: Dataset({
        features: ['question1', 'question2'],
        num_rows: 670
    })
})


In [None]:
test_tokenized_dataset = data["test"]
test_tokenized_dataset = test_tokenized_dataset.map(preprocess_function, batched=True)




In [None]:
# prepare dataloader
test_tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
dataloader = torch.utils.data.DataLoader(test_tokenized_dataset, batch_size=32)

In [None]:
# generate text for each batch
all_predictions = []
for i, batch in enumerate(dataloader):
    # move the batch to the same device as the model
    batch = {key: value.to(model.device) for key, value in batch.items()}
    predictions = model.generate(**batch)
    all_predictions.append(predictions)

# flatten predictions
all_predictions_flattened = [pred for preds in all_predictions for pred in preds]

# tokenize and pad the reference questions
all_references = tokenizer(
    test_tokenized_dataset["question2"], max_length=MAX_LENGTH,
    truncation=True, padding="max_length"
)["input_ids"]

# copy the prediction tensors from GPU to CPU
all_predictions_flattened = [pred.to('cpu') for pred in all_predictions_flattened]

# compute metrics
predictions_labels = [all_predictions_flattened, all_references]
compute_metrics(predictions_labels)


{'rouge1': 0.585,
 'rouge2': 0.3346,
 'rougeL': 0.5538,
 'rougeLsum': 0.5535,
 'gen_len': 12.0373}

## Inference

In [None]:
import torch
from transformers import T5ForConditionalGeneration,T5Tokenizer

In [None]:
def set_seed(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device :", device)
model = model.to(device)

device : cuda


In [None]:
# sentence = "Which course should I take to get started in data science?"
sentence = "What are the ingredients required to bake a perfect cake?"
# sentence = "What is the best possible approach to learn aeronautical engineering?"
# sentence = "Do apples taste better than oranges in general?"

text =  "paraphrase: " + sentence + " </s>"

In [None]:
# a single sentence needs no padding; calling the tokenizer directly
# replaces the deprecated encode_plus(..., pad_to_max_length=True)
encoding = tokenizer(text, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_masks = encoding["attention_mask"].to(device)

In [None]:
# sample with top_k = 50 and top_p = 0.98, returning 5 candidate sequences
beam_outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_masks,
    do_sample=True,
    max_length=128,
    top_k=50,
    top_p=0.98,
    early_stopping=True,
    num_return_sequences=5,
    repetition_penalty=4.9
)
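
To illustrate what `top_k` and `top_p` do during sampling, here is a simplified sketch on a toy next-token distribution. It is not the `transformers` implementation: the helper `top_k_top_p_filter` is hypothetical, the probabilities are made up, and the sketch works on probabilities in a plain dict rather than on logits.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the shortest prefix of them whose
    renormalized cumulative probability reaches top_p. Returns renormalized probs."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / k_total  # probability mass after the top-k cut
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {token: p / kept_total for token, p in kept}


# toy next-token distribution (made-up numbers)
next_token_probs = {"bake": 0.5, "make": 0.3, "cook": 0.15, "fry": 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.8)
print(filtered)

# sample one token from the filtered distribution
rng = random.Random(42)
sampled = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(sampled)
```

Lower values of `top_k` and `top_p` restrict sampling to the most likely tokens and give safer, more repetitive paraphrases; values close to the ones used in this notebook leave most of the distribution available and give more varied output.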

In [None]:
print("\nOriginal Question ::")
print(sentence)
print("\n")
print("Paraphrased Questions :: ")
final_outputs = []
for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    # skip outputs identical to the input and exact repeats
    if sent.lower() != sentence.lower() and sent not in final_outputs:
        final_outputs.append(sent)

for i, final_output in enumerate(final_outputs):
    print("{}: {}".format(i, final_output))


Original Question ::
What are the ingredients required to bake a perfect cake?


Paraphrased Questions :: 
0: What exactly steps need to bake a perfect cake?
1: What are the most important ingredients needed to bake a perfect cake? How can I really make it and how much you use all those sugary treats in your kitchen...
2: What is the exact amount of ingrediente needed for baking a perfect cake?
3: What is the best way to bake a cake just right?
4: What are some preparations required to bake an outstanding cake? Could someone please show me how this worked out.
