# **Fine-tune a RoBERTa Encoder Decoder model trained on MLM for Text Generation**

For a few weeks I investigated different models and alternatives in Huggingface for training a text generation model. We have a short list of products with their descriptions, and our goal is to obtain the name of each product. I ran some experiments with the Transformer model in Tensorflow as well as with the T5 summarizer. Finally, to go deeper into Huggingface transformers, I decided to tackle the problem with a somewhat more complex approach, an encoder-decoder model. Maybe it was not the best option, but I wanted to learn new things about Huggingface transformers.

First, I must admit that a text generation problem is not usually approached with this kind of solution, using encoder models like BERT or RoBERTa. But in this problem we are not going to generate “free text”, so we can simplify our task. We are looking for a subset of words from the product description to compose the product name, and our full vocabulary is present in the input data. From this point of view, we can encode the product description into a vector representation and decode it into a product name. The encoder-decoder is therefore an option worth evaluating.

Our problem can be framed as a sequence-to-sequence problem, where we need to find a mapping from an input sequence (the product description) to an output sequence (the product name). The Huggingface blog post “Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models” gives an in-depth explanation and reports experiments building many encoder-decoder models from BERT or GPT-2 transformer models. I highly recommend reading it.

## Our objective in this notebook

In a previous notebook, we created a custom tokenizer and trained a RoBERTa model. Now, we will use that trained model to build an encoder-decoder model and fine-tune this new model on our dataset.


## Loading the libraries

In [None]:
#Installation
!pip install datasets==1.0.2
!pip install transformers
# In previous versions of the library the Seq2SeqTrainer was not included
#!rm seq2seq_trainer.py
#!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/seq2seq/seq2seq_trainer.py
#!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/legacy/seq2seq/seq2seq_trainer.py
!pip install rouge_score

import datasets
import transformers
import pandas as pd
from datasets import Dataset

#Tokenizer
from transformers import RobertaTokenizerFast

#Encoder-Decoder Model
from transformers import EncoderDecoderModel

#Training
# When using previous version of the library you need the following two lines
#from seq2seq_trainer import Seq2SeqTrainer
#from transformers import TrainingArguments

from transformers import Seq2SeqTrainer
from transformers import Seq2SeqTrainingArguments
from dataclasses import dataclass, field
from typing import Optional

import os



# Loading the dataset

As we mentioned before, our dataset contains around 31,000 items of clothing from a major retailer, each with a long product description and a short product name, our target variable. First, we run an exploratory data analysis and observe that only a small number of rows contain outlier values. The word count per description looks like a left-skewed distribution, with 75% of rows in the range of 50–60 words and a maximum of about 125 words. The target variable contains about 3 to 6 words.
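
As a rough illustration of that word-count analysis, the following sketch counts the words in each description and computes the quartiles. The sample descriptions are made up for the example and are not rows from the real dataset; on the notebook's data the same idea would be applied to the `description` column.

```python
from statistics import quantiles

# Hypothetical descriptions, standing in for the real `description` column
descriptions = [
    "knit midi dress with straps",
    "loose fitting dress with round neckline and long sleeves",
    "nautical cap with peak",
    "stretch top with thin straps",
]

# Number of whitespace-separated words in each description
word_counts = [len(text.split()) for text in descriptions]

# Quartiles of the word-count distribution
q1, median, q3 = quantiles(word_counts, n=4)

print("min/max words:", min(word_counts), max(word_counts))
print("quartiles:", q1, median, q3)
```

Comparing the quartiles with the minimum and maximum is a quick way to spot outlier rows before choosing a maximum sequence length.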

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Define parameters for data location and model folders

In [None]:
#Set the path to the data folder, datafile and output folder and files
root_folder = '/content/drive/My Drive/'
data_folder = os.path.abspath(os.path.join(root_folder, 'datasets/text_gen_product_names'))
model_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names/RoBERTa-FT-MLM'))
output_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names'))
pretrainedmodel_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names/RoBERTaMLM'))
tokenizer_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names/TokRoBERTa'))

# Datafiles names containing training and test data
test_filename='cl_test_descriptions.csv'
datafile= 'product_names_desc_cl_train.csv'
outputfile = 'submission.csv'
datafile_path = os.path.abspath(os.path.join(data_folder,datafile))
testfile_path = os.path.abspath(os.path.join(data_folder,test_filename))
outputfile_path = os.path.abspath(os.path.join(output_folder,outputfile))

Load the datafile with the product descriptions and names:

In [None]:
# Load the dataset from a CSV file
df=pd.read_csv(datafile_path, header=0, usecols=[0,1])
print('Num Examples: ',len(df))
print('Null Values\n', df.isna().sum())

Num Examples:  31593
Null Values
 name           44
description     1
dtype: int64


Remove rows with null values:

In [None]:
df.dropna(inplace=True)
print('Num Examples: ',len(df))

Num Examples:  31548


## Split the data into train and validation dataset

We split the dataset into a training dataset (90%) and a validation dataset (10%). The training examples are drawn by random sampling with a fixed seed, and the remaining rows form the validation dataset.

In [None]:
# Splitting the data into training and validation
# Define the train size: 90% of the data is used for training and the rest for validation
train_size = 0.9
# Sample 90% of the rows from the dataset
train_dataset=df.sample(frac=train_size,random_state = 42)
# Reset the indexes
val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)
print('Length Train dataset: ', len(train_dataset))
print('Length Val dataset: ', len(val_dataset))

Length Train dataset:  28393
Length Val dataset:  3155


In the next cell, we can limit the number of examples used for training in order to reduce cost and time during the experiments. Once the model is ready, we must train on the whole training dataset.

In [None]:
# To limit the training and validation dataset, for testing
max_train=28393
max_val=3155
# Create a Dataset from a pandas dataframe for training and validation
train_data=Dataset.from_pandas(train_dataset[:max_train])
val_data=Dataset.from_pandas(val_dataset[:max_val])

# Encoder-Decoder Architecture

In many machine learning solutions, we can find architectures where the first section of the DNN produces a vector representation of the input and the second section takes that vector as input and generates the expected result. This is basically an encoder-decoder model, and with RNNs it became a very powerful tool for solving NLP tasks.

"*Analogous to RNN-based encoder-decoder models, transformer-based encoder-decoder models consist of an encoder and a decoder which are both stacks of residual attention blocks. The key innovation of transformer-based encoder-decoder models is that such residual attention blocks can process an input sequence of variable length without exhibiting a recurrent structure. Not relying on a recurrent structure allows transformer-based encoder-decoders to be highly parallelizable,…*" [2] "Transformers-based Encoder-Decoder models"

The purpose of this architecture is to find a mapping function between an input sequence and its target output sequence. In the case of transformers, the encoder encodes the input sequence into a sequence of hidden states, and then the decoder takes the target sequence and the encoded hidden states to learn the mapping function. The encoder only runs in the first step; in the following steps the decoder receives the next target token and reuses the hidden states that the encoder produced in step 1. An in-depth explanation can be found in the post mentioned above, [2] "Transformers-based Encoder-Decoder models".
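
The "encode once, decode step by step" idea can be sketched with a toy example. Nothing here is a real model: `encode`, `decode_step` and the rule of copying words longer than four characters are hypothetical stand-ins, chosen only to show that the encoder output is computed a single time and then reused at every decoder step.

```python
BOS, EOS = "<s>", "</s>"

def encode(tokens):
    """Toy encoder: runs once and returns one 'hidden state' per input token."""
    return [(tok, len(tok)) for tok in tokens]

def decode_step(hidden_states, generated):
    """Toy decoder step: reads the encoder states and the tokens generated so far."""
    keep = [tok for tok, length in hidden_states if length > 4]
    position = len(generated) - 1  # generated[0] is the BOS token
    return keep[position] if position < len(keep) else EOS

def generate(tokens, max_len=7):
    hidden_states = encode(tokens)  # the encoder runs a single time
    generated = [BOS]
    decoder_steps = 0
    while len(generated) < max_len:
        # every step reuses the same hidden_states
        next_token = decode_step(hidden_states, generated)
        decoder_steps += 1
        if next_token == EOS:
            break
        generated.append(next_token)
    return generated[1:], decoder_steps

name, steps = generate("knit midi dress with straps".split())
print(name, steps)  # ['dress', 'straps'] 3
```

A real decoder attends over the hidden states instead of applying a fixed rule, but the control flow is the same: one encoder pass, several decoder steps.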

With this idea in mind, we can consider an encoder-decoder model as an encoder-only model, such as BERT, and a decoder-only model, such as GPT-2, both combined to produce a target sequence. At this moment, we have many pretrained models available in Huggingface's model hub, so the first option to evaluate is using these pretrained models to build our encoder-decoder and fine-tune it to our specific task.

In the post "Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models" [1], many combinations are explained and evaluated. In summary, you need to decide how to initialize the models:
- Initialize both the encoder and the decoder from a pretrained encoder-only checkpoint
- Initialize the encoder from an encoder-only checkpoint and the decoder from a decoder-only checkpoint
- Initialize only the encoder or only the decoder from a pretrained encoder-only or decoder-only checkpoint
- Whether or not to share the encoder weights with the decoder
- Which transformer model to use in the encoder and the decoder: it could be a BERT, GPT-2, or RoBERTa model.

In our experiment we will try a RoBERTaShared strategy, where both the encoder and the decoder are RoBERTa based and they share their weights. This combination has shown great results in many tasks.


# Create the encoder-decoder model from a pretrained RoBERTa model

## Load the trained tokenizer on our specific language
As we mentioned previously, we have trained a tokenizer and a RoBERTa model from scratch using Masked Language Modelling, in order to focus our model on our specific task. Now we can configure our encoder-decoder using this pretrained model.

The first step is to load the tokenizer, which we use to convert the input and target text into token ids, the numeric representation that the model expects.

In [None]:
# Loading the RoBERTa Tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_folder)
# Setting the BOS and EOS token
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token

## Setting the model and training parameters

Now it is time to set the model and training parameters. They will be passed to the dataset generator and to the Trainer object in a later section.

In [None]:
TRAIN_BATCH_SIZE = 16   # input batch size for training (default: 64)
VALID_BATCH_SIZE = 4    # input batch size for testing (default: 1000)
TRAIN_EPOCHS = 3       # number of epochs to train (default: 10)
VAL_EPOCHS = 1 
LEARNING_RATE = 1e-4    # learning rate (default: 0.01)
SEED = 42               # random seed (default: 42)
MAX_LEN = 128           # Max length for product description
SUMMARY_LEN = 7         # Max length for product names

## Prepare and create the dataset
In the next step, we need to generate the dataset for our model training. Using the tokenizer loaded, we tokenize the text data, apply the padding technique, and truncate the input and output sequences. Remember that we can define a maximum length for the input data and a different length for the output one. Finally, we define a batch size to split the dataset into batches for training and evaluation.

RoBERTa is a variant of the BERT model, so the expected inputs are similar: the `input_ids` and the `attention_mask`. But RoBERTa does not have a `token_type_ids` parameter like BERT. It would not make sense, because RoBERTa was not trained on Next Sentence Prediction, so you do not need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `</s>`).
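
To make the two inputs concrete, here is a small sketch of what padding and truncation do to a sequence of token ids. The ids are invented. `PAD_ID = 1` matches the `padding_idx=1` shown in the model summary later in this notebook, while ids 0 and 2 for `<s>` and `</s>` are an assumption based on the usual RoBERTa convention. `pad_and_truncate` is a hypothetical helper; the real work is done by the tokenizer call in the next cell.

```python
PAD_ID = 1  # padding id, as in padding_idx=1 of the model summary

def pad_and_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both of length max_length."""
    if len(token_ids) > max_length:
        # keep the closing special token when cutting a long sequence
        token_ids = token_ids[:max_length - 1] + [token_ids[-1]]
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should not attend to
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask

short = [0, 51, 52, 53, 2]                    # <s> w w w </s> (assumed ids)
long = [0, 51, 52, 53, 54, 55, 56, 57, 2]

print(pad_and_truncate(short, 7))
# ([0, 51, 52, 53, 2, 1, 1], [1, 1, 1, 1, 1, 0, 0])
print(pad_and_truncate(long, 7))
# ([0, 51, 52, 53, 54, 55, 2], [1, 1, 1, 1, 1, 1, 1])
```

Both outputs have the same length, which is what allows the examples to be stacked into a batch.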

In [None]:
batch_size=TRAIN_BATCH_SIZE  # batch size used for preprocessing and training
encoder_max_length=MAX_LEN
decoder_max_length=SUMMARY_LEN

def process_data_to_model_inputs(batch):
  # Tokenize the input and target data
  inputs = tokenizer(batch["description"], padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = tokenizer(batch["name"], padding="max_length", truncation=True, max_length=decoder_max_length)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

  return batch
# Preprocessing the training data
train_data = train_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["description", "name"]
)
train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
# Preprocessing the validation data
val_data = val_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["description", "name"]
)
val_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)
# Shuffle the dataset when it is needed
#dataset = dataset.shuffle(seed=42, buffer_size=10, reshuffle_each_iteration=True)


  0%|          | 0/1775 [00:00<?, ?ba/s]

  0%|          | 0/198 [00:00<?, ?ba/s]
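
The line that replaces padding ids with `-100` in `labels` deserves a short illustration. As far as we know, `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions do not contribute to the loss. The sketch below mimics this in plain Python with invented ids and probabilities; `mask_padding` and `mean_nll` are hypothetical helpers, not library functions.

```python
import math

IGNORE_INDEX = -100
PAD_ID = 1

def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace padding ids so that the loss skips those positions."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]

def mean_nll(correct_token_probs, labels):
    """Average negative log-likelihood over the positions that are not ignored.

    correct_token_probs[i] is the probability the model gave to the right token
    at position i.
    """
    kept = [-math.log(p) for p, y in zip(correct_token_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

labels = mask_padding([0, 51, 52, 2, 1, 1, 1])
print(labels)  # [0, 51, 52, 2, -100, -100, -100]

# Invented probabilities: the model is poor at predicting the padding positions
probs = [0.9, 0.5, 0.5, 0.9, 0.01, 0.01, 0.01]
print(round(mean_nll(probs, labels), 4))
```

Without the masking, the three padding positions would dominate the average and the model would be rewarded mainly for predicting padding.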

## Define the RoBERTa Encoder-Decoder model

We build our model on the pretrained model we built in Part 1 of this series. Thanks to Huggingface's libraries and wrappers, creating the model is very simple: we just invoke the `from_encoder_decoder_pretrained` method of the `EncoderDecoderModel` class of the Huggingface transformers library. We pass the folder where we saved our trained RoBERTa model as both the encoder and the decoder checkpoint, and indicate that we want to "tie" the weights of both models, so they share their weights.

**Why tie?**

In [None]:
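# Use the same pretrained RoBERTa folder for the encoder and the decoder,
# and set encoder decoder tying to True so they share their weights
roberta_shared = EncoderDecoderModel.from_encoder_decoder_pretrained(pretrainedmodel_folder, pretrainedmodel_folder, tie_encoder_decoder=True)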
# Show the vocab size to check it has been loaded
print('Vocab Size: ',roberta_shared.config.encoder.vocab_size)

ValueError: ignored

Once the model is built, we need to set a number of parameters, such as the special tokens that mark the beginning of a sentence (bos) and the end of a sentence (eos).
We also have to set the parameters that define how the text is generated.

Among the most important parameters are those related to the **decoding strategy**. Any text generator needs to *define how the next output or token is selected from all the available possibilities*. This kind of architecture tries to model the probability of the next token given the sequence of previous tokens. At each step we obtain a probability for every token in the vocabulary, and we need a method to select one of them.

## Decoding strategies

We are not going to analyze all the possibilities, but we want to mention some of the alternatives that the Huggingface library provides.
 
Our first and most intuitive approach is **greedy search**, where we take the word with the highest probability in our vocabulary at every step. Unfortunately, this strategy tends to produce repetitive patterns and get caught in loops. It is a very deterministic method, and that is not what a text generator should be. It also suffers from the problem of a high-probability token "hiding behind" a low-probability token that precedes it in the sentence, so we can get stuck in a suboptimal solution.

Another approach that tries to mitigate this problem is **beam search**, which maintains multiple candidate paths or sequences at every step. Only the most probable continuations of those paths are kept, which allows us to deterministically approximate the most likely sequence of words. It is not a very creative procedure, but it works well for translation tasks, since translating sentences between languages does not require great creativity. If we want to generate a story or a dialogue, we need to "explore" alternatives and reduce determinism.
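
A toy example helps to see the difference between the two strategies. The table `NEXT` below is a made-up next-token distribution, not the output of our model, and `greedy` and `beam_search` are minimal sketches of the two ideas rather than the library implementations.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far
NEXT = {
    (): {"stretch": 0.5, "cropped": 0.4, "lace": 0.1},
    ("stretch",): {"top": 0.4, "dress": 0.3, "vest": 0.3},
    ("cropped",): {"top": 0.9, "dress": 0.05, "vest": 0.05},
    ("lace",): {"top": 0.3, "dress": 0.6, "vest": 0.1},
}

def greedy(steps):
    """Pick the single most probable token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        tok, p = max(NEXT[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (tok,), logp + math.log(p)
    return seq, math.exp(logp)

def beam_search(steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in NEXT[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    seq, logp = beams[0]
    return seq, math.exp(logp)

print(greedy(2))          # picks ('stretch', 'top'), probability about 0.20
print(beam_search(2, 2))  # picks ('cropped', 'top'), probability about 0.36
```

Greedy search commits to "stretch" and never sees that "top" is almost certain after "cropped". With two beams the second path survives the first step and wins overall.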

**Sampling techniques** are a way to introduce randomness into our text generation process. But with purely random sampling we cannot control how the output is produced, so a "guided" sampling is a better approach. **"A simple, but very powerful sampling scheme, called Top-K sampling. In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words"** [3]. "This has the effect of constraining the generation process to select words that the model itself has deemed to be more sensible in the context".
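
The filtering and redistribution step of Top-K sampling can be written in a few lines. This is a sketch on an invented distribution; `top_k_filter` and `sample` are hypothetical helpers and not the functions used inside the library.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Invented next-token distribution
probs = {"top": 0.5, "dress": 0.3, "vest": 0.15, "cap": 0.04, "coat": 0.01}

filtered = top_k_filter(probs, k=2)
print(filtered)  # only 'top' and 'dress' remain, with probabilities that sum to 1

rng = random.Random(42)  # fixed seed so the draws are reproducible
samples = [sample(filtered, rng) for _ in range(5)]
print(samples)
```

The unlikely tokens can no longer be drawn, while the choice between the remaining ones is still random.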

During our training stage, we apply the beam search strategy with 10 beams, forbid repeated n-grams, and apply repetition and length penalties.

In [None]:
# set special tokens
roberta_shared.config.decoder_start_token_id = tokenizer.bos_token_id                                             
roberta_shared.config.eos_token_id = tokenizer.eos_token_id
roberta_shared.config.pad_token_id = tokenizer.pad_token_id

# sensible parameters for beam search
# set decoding params                               
roberta_shared.config.max_length = SUMMARY_LEN
roberta_shared.config.early_stopping = True
roberta_shared.config.no_repeat_ngram_size = 1
roberta_shared.config.length_penalty = 2.0
roberta_shared.config.repetition_penalty = 3.0
roberta_shared.config.num_beams = 10
roberta_shared.config.vocab_size = roberta_shared.config.encoder.vocab_size
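
It is worth checking what `length_penalty` does. To our understanding, the beam search in the transformers library ranks a finished hypothesis by the sum of its token log-probabilities divided by `length ** length_penalty`. Because log-probabilities are negative, an exponent above zero makes longer hypotheses look better. The sketch below uses made-up log-probabilities and a hypothetical `beam_score` helper to show the effect; it is an approximation of the library's behavior, not its code.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score for a finished beam hypothesis (assumed form)."""
    return sum_logprobs / (length ** length_penalty)

# Invented hypotheses: the longer one has a lower total log-probability
short_hyp = {"sum_logprobs": -2.0, "length": 2}
long_hyp = {"sum_logprobs": -3.0, "length": 4}

winners = {}
for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(short_hyp["sum_logprobs"], short_hyp["length"], penalty)
    long_score = beam_score(long_hyp["sum_logprobs"], long_hyp["length"], penalty)
    winners[penalty] = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.4f} long={long_score:.4f} -> {winners[penalty]}")
```

With no penalty the shorter hypothesis wins, and with the value of 2.0 used above the longer one does. Since our product names are short, this parameter is a good candidate for tuning.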

# Training the encoder-decoder

The `Trainer` component of the Huggingface library will train our new model very easily, in just a few lines of code. The `Trainer` API provides all the capabilities we need to train almost any transformer model, and we do not have to code the training loops at all.

But first we have to create the training arguments. In previous versions of the library we had to define a `TrainingArguments` subclass called `Seq2SeqTrainingArguments` ourselves to include some required parameters. That code is kept below for reference, but the class is now imported from the library:


In [None]:
'''
This section was required in previous versions of the transformer library when using
the seq2seq_trainer.py from the examples notebooks

@dataclass
class Seq2SeqTrainingArguments(TrainingArguments):
    label_smoothing: Optional[float] = field(
        default=0.0, metadata={"help": "The label smoothing epsilon to apply (if not zero)."}
    )
    sortish_sampler: bool = field(default=False, metadata={"help": "Whether to SortishSamler or not."})
    predict_with_generate: bool = field(
        default=False, metadata={"help": "Whether to use generate to calculate generative metrics (ROUGE, BLEU)."}
    )
    adafactor: bool = field(default=False, metadata={"help": "whether to use adafactor"})
    encoder_layerdrop: Optional[float] = field(
        default=None, metadata={"help": "Encoder layer dropout probability. Goes into model.config."}
    )
    decoder_layerdrop: Optional[float] = field(
        default=None, metadata={"help": "Decoder layer dropout probability. Goes into model.config."}
    )
    dropout: Optional[float] = field(default=None, metadata={"help": "Dropout probability. Goes into model.config."})
    attention_dropout: Optional[float] = field(
        default=None, metadata={"help": "Attention dropout probability. Goes into model.config."}
    )
    lr_scheduler: Optional[str] = field(
        default="linear", metadata={"help": f"Which lr scheduler to use."}
    )
'''

## Define the Rouge metric for model evaluation

When training an ML model we need an evaluation metric. We chose the Rouge score, which is used to evaluate translation and summarization tasks. We are framing our problem as a summarization, from the product description to the name, so this metric seems to be the right choice, although there could be other interesting alternatives.

The datasets library provides us with that metric, so we load it and define a `compute_metrics` method where we calculate the metric between the target text (the product name) and the predicted text.


In [None]:
# load rouge for validation
rouge = datasets.load_metric("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
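
To get a feeling for what the Rouge-2 numbers mean, the metric can be approximated by hand as the overlap of bigrams between the prediction and the reference. This is a simplified sketch with plain whitespace tokenization. The `rouge_score` package applies its own tokenization and aggregates over the whole dataset, so its values may differ from this hypothetical `rouge2` helper.

```python
from collections import Counter

def bigrams(text):
    """Count the pairs of consecutive words in a text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2(prediction, reference):
    """Bigram-overlap precision, recall and F-measure for one example."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # bigrams present in both texts
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    fmeasure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, fmeasure

p, r, f = rouge2("stretch top straps trf", "stretch top trf")
print(round(p, 4), round(r, 4), round(f, 4))  # 0.3333 0.5 0.4
```

Only the bigram "stretch top" is shared here, so one of the three predicted bigrams and one of the two reference bigrams match. For names of 3 to 6 words a single wrong word removes a large share of the bigrams, which is worth keeping in mind when reading the scores.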

## Create the Trainer

Now it is time to set the training arguments: batch size, training epochs, how often to save the model, etc. Then we can instantiate a `Seq2SeqTrainer`, a subclass of the `Trainer` object we mentioned, passing the model to train, the training arguments, the metrics computation, and the training and evaluation datasets.


In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir=model_folder,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    #evaluate_during_training=True,
    evaluation_strategy="epoch",
    do_train=True,
    do_eval=True,
    logging_steps=1024,  
    save_steps=2048, 
    warmup_steps=1024,  
    #max_steps=1500, # delete for full training
    num_train_epochs = TRAIN_EPOCHS, #TRAIN_EPOCHS
    overwrite_output_dir=True,
    save_total_limit=1,
    fp16=True, 
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=roberta_shared,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=val_data,
)

Using amp fp16 backend


Now, we start training the model:

In [None]:
# Fine-tune the model, training on the train dataset and evaluating on the validation dataset
trainer.train()

***** Running training *****
  Num examples = 28393
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 888


Epoch,Training Loss,Validation Loss,Rouge2 Precision,Rouge2 Recall,Rouge2 Fmeasure
1,No log,2.935614,0.1217,0.1316,0.1232


***** Running Evaluation *****
  Num examples = 3155
  Batch size = 32


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=888, training_loss=4.123736407305743, metrics={'train_runtime': 1717.5781, 'train_samples_per_second': 16.531, 'train_steps_per_second': 0.517, 'total_flos': 1055475498061824.0, 'train_loss': 4.123736407305743, 'epoch': 1.0})
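
The number of optimization steps in the log can be checked with a little arithmetic: one step per batch, rounded up, times the number of epochs. The helper below is only a sanity check and assumes a single device and no gradient accumulation, as the log reports. It also shows that the logged run, with batch size 32 and one epoch, finished before the 1024 warm-up steps were completed, whereas the configuration in the cell above (batch size 16, 3 epochs) would run well past them.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Optimization steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

WARMUP_STEPS = 1024  # value passed to the training arguments above

logged_run = total_steps(28393, batch_size=32, epochs=1)
configured_run = total_steps(28393, batch_size=16, epochs=3)

print(logged_run, logged_run > WARMUP_STEPS)          # 888 False
print(configured_run, configured_run > WARMUP_STEPS)  # 5325 True
```

When experimenting with short runs it may be worth lowering `warmup_steps`, otherwise the learning rate never reaches its target value.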

Save the encoder-decoder model just trained:

In [None]:
# Save the encoder-decoder model just trained
trainer.save_model(model_folder)

Saving model checkpoint to /content/drive/My Drive/Projects/text_generation_names/RoBERTaMLM
Configuration saved in /content/drive/My Drive/Projects/text_generation_names/RoBERTaMLM/config.json
Model weights saved in /content/drive/My Drive/Projects/text_generation_names/RoBERTaMLM/pytorch_model.bin


# Evaluate the model on the test dataset

Once we have our model trained, we can use it to generate names for our products and check the result of our fine-tuning process on our objective task. 

We load a test dataset, a subset of our original dataset and delete rows containing null values.

In [None]:
# Load the test dataset with the product descriptions
df=pd.read_csv(testfile_path, header=0)
print('Num Examples: ',len(df))
print('Null Values\n', df.isna().sum())
print(df.head(5))

test_data=Dataset.from_pandas(df)
print(test_data)

Num Examples:  1441
Null Values
 description    0
dtype: int64
                                         description
0  knit midi dress with vneckline straps matching...
1  loosefitting dress with round neckline long sl...
2  nautical with peak.this item must returned wit...
3  nautical with peak . adjustable inner strap de...
4  nautical with side button detail.this item mus...
Dataset(features: {'description': Value(dtype='string', id=None)}, num_rows: 1441)


If you need to restore the trained model from a checkpoint, run the next cell, selecting the folder where the checkpoint was saved.

In [None]:
checkpoint_path = os.path.abspath(os.path.join(model_folder,'checkpoint-3072'))
print(checkpoint_path)

Then we load the Tokenizer and the fine-tuned model from a saved version.

In [None]:
#Load the Tokenizer and the fine-tuned model
tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_folder)
model = EncoderDecoderModel.from_pretrained(model_folder)

model.to("cuda")

The following encoder weights were not tied to the decoder ['roberta/pooler']
The following encoder weights were not tied to the decoder ['roberta/pooler']


EncoderDecoderModel(
  (encoder): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(8192, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In order to improve the results, we define functions that generate the text with the default generation settings, with the beam search decoding strategy, and with top-k sampling, and we will apply them and compare the results.


In [None]:
# Generate the text with the default generation settings stored in the model config
def generate_summary(batch):
    # The tokenizer automatically adds the special tokens: <s> text </s>
    # Inputs are padded or truncated to MAX_LEN tokens
    inputs = tokenizer(batch["description"], padding="max_length", truncation=True, max_length=MAX_LEN, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    # Use the fine-tuned model loaded above
    outputs = model.generate(input_ids, attention_mask=attention_mask)

    # All special tokens are removed when decoding
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred"] = output_str

    return batch


In [None]:
# Generate a text using beam search
def generate_summary_beam_search(batch):
    # The tokenizer automatically adds the special tokens: <s> text </s>
    # Inputs are padded or truncated to MAX_LEN tokens
    inputs = tokenizer(batch["description"], padding="max_length", truncation=True, max_length=MAX_LEN, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    # Use the fine-tuned model loaded above
    outputs = model.generate(input_ids, attention_mask=attention_mask,
                                  num_beams=15,
                                  repetition_penalty=3.0, 
                                  length_penalty=2.0, 
                                  num_return_sequences = 1
    )

    # All special tokens are removed when decoding
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred"] = output_str

    return batch

# Generate a text using top-k (and top-p) sampling
def generate_summary_topk(batch):
    # The tokenizer automatically adds the special tokens: <s> text </s>
    # Inputs are padded or truncated to MAX_LEN tokens
    inputs = tokenizer(batch["description"], padding="max_length", truncation=True, max_length=MAX_LEN, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    # Use the fine-tuned model loaded above
    outputs = model.generate(input_ids, attention_mask=attention_mask,
                                  repetition_penalty=3.0, 
                                  length_penalty=2.0, 
                                  num_return_sequences = 1,
                                  do_sample=True,
                                  top_k=50, 
                                  top_p=0.95,
    )

    # All special tokens are removed when decoding
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred"] = output_str

    return batch


Now, we can make predictions for the test dataset using Beam search strategy and top-k sampling technique.

In [None]:
batch_size = TRAIN_BATCH_SIZE

#results = test_data.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["description"])
# Generate predictions using beam search
results = test_data.map(generate_summary_beam_search, batched=True, batch_size=batch_size, remove_columns=["description"])
pred_str_bs = results["pred"]
# Generate predictions using top-k sampling
results = test_data.map(generate_summary_topk, batched=True, batch_size=batch_size, remove_columns=["description"])
pred_str_topk = results["pred"]

#label_str = results["Summary"]


  0%|          | 0/46 [00:00<?, ?ba/s]

  0%|          | 0/46 [00:00<?, ?ba/s]

Now we can compute the Rouge scores of both sets of predictions to check the model's performance on the task.

In [None]:
# NOTE: label_str must hold the reference product names for the same examples;
# the test file loaded above only contains descriptions, so it has to be provided separately
#Calculate the Rouge score for the predictions using beam search
print("ROUGE 1 SCORE: ",rouge.compute(predictions=pred_str_bs, references=label_str, rouge_types=["rouge1"])["rouge1"].mid)
print("ROUGE 2 SCORE: ",rouge.compute(predictions=pred_str_bs, references=label_str, rouge_types=["rouge2"])["rouge2"].mid)
print("ROUGE L SCORE: ",rouge.compute(predictions=pred_str_bs, references=label_str, rouge_types=["rougeL"])["rougeL"].mid)
#Calculate the Rouge score for the predictions using top-k sampling
print("ROUGE 1 SCORE: ",rouge.compute(predictions=pred_str_topk, references=label_str, rouge_types=["rouge1"])["rouge1"].mid)
print("ROUGE 2 SCORE: ",rouge.compute(predictions=pred_str_topk, references=label_str, rouge_types=["rouge2"])["rouge2"].mid)
print("ROUGE L SCORE: ",rouge.compute(predictions=pred_str_topk, references=label_str, rouge_types=["rougeL"])["rougeL"].mid)


When more than one output is generated per example, we need to join them into a single list:

In [None]:
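import numpy as np

# pred_str is the flat list of generated names; this cell assumes a run with
# num_return_sequences=3, so every 3 consecutive items belong to one product
preds=np.reshape(pred_str, (-1, 3))
print('Predictions Shape: ',preds.shape)
predictions = [','.join(p) for p in preds]
print('Num predictions: ', len(predictions),predictions)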

Predictions Shape:  (1441, 3)
Num predictions:  1441 ['lace midi dress trf,lacetrimmed camisole dress,lacetrimmed dress tr', 'pleated dress trf,floral print dress,poplin dress', 'corduroy nautical cap,check nautical cap,checked nautical cap', 'corduroy nautical cap,faux shearling nautical cap,check nautical cap', 'check nautical cap,checked nautical cap,corduroy nautical cap', 'cropped tiedye tshirt trf,fadedeffect tshirt print trf,fadedeffect tshirt slogan trf', 'faux suede coat,coat faux suede,doublefaced faux suede coat', 'full sleeve tshirt,fringed tshirt,balloon sleeve tshirt', 'stretch top,stretch top trf,stretch top straps', 'stretch top trf,stretch top straps trf,stretch top', 'stretch top straps trf,stretch top,stretch top trf', 'stretch top straps trf,stretch top trf,stretch top straps', 'stretch top straps trf,stretch top trf,cropped vest top trf', 'stretch top straps trf,stretch top trf,stretch top straps', 'stretch top,stretch top trf,stretch top straps trf', 'mini dress p

In [None]:
print(predictions)

['lace midi dress trf,lacetrimmed camisole dress,lacetrimmed dress tr', 'pleated dress trf,floral print dress,poplin dress', 'corduroy nautical cap,check nautical cap,checked nautical cap', 'corduroy nautical cap,faux shearling nautical cap,check nautical cap', 'check nautical cap,checked nautical cap,corduroy nautical cap', 'cropped tiedye tshirt trf,fadedeffect tshirt print trf,fadedeffect tshirt slogan trf', 'faux suede coat,coat faux suede,doublefaced faux suede coat', 'full sleeve tshirt,fringed tshirt,balloon sleeve tshirt', 'stretch top,stretch top trf,stretch top straps', 'stretch top trf,stretch top straps trf,stretch top', 'stretch top straps trf,stretch top,stretch top trf', 'stretch top straps trf,stretch top trf,stretch top straps', 'stretch top straps trf,stretch top trf,cropped vest top trf', 'stretch top straps trf,stretch top trf,stretch top straps', 'stretch top,stretch top trf,stretch top straps trf', 'mini dress pockets,dress contrast pockets,ruffled dress', 'stripe

Save the predictions to a file:

In [None]:
final_df = pd.DataFrame({'name':pred_str})
final_df.to_csv(outputfile_path, index=False)
print('Output Files generated for review')

Output Files generated for review
