# **Fine-tune RuPERTa, a Spanish RoBERTa model, for Sentiment Analysis**

**Describe Ruperta**

## Our objective in this notebook

**Describe the model objective**

## Loading the libraries

In [14]:
# Installation
!pip install datasets==1.0.2
!pip install transformers==4.11.3

import datasets
import pandas as pd
import transformers  # imported as a module so that transformers.__version__ works below
from datasets import Dataset

# Tokenizer and models
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

import os, shutil
# Ignore deprecation warnings to keep the notebook output readable
import warnings
warnings.simplefilter("ignore", DeprecationWarning)



In [2]:
transformers.__version__

'4.11.3'

# Loading the dataset

**Describe the dataset**

In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


Define parameters for data location and model folders

In [21]:
#Set the path to the data folder, datafile and output folder and files
root_folder = '/content/drive/My Drive/'
data_folder = os.path.abspath(os.path.join(root_folder, 'datasets/IMDb_reviews_spanish'))
model_folder = os.path.abspath(os.path.join(root_folder, 'Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es'))
output_folder = os.path.abspath(os.path.join(root_folder, 'Projects/transformer_sentiment_analysis_spanish'))
#pretrainedmodel_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names/RoBERTaMLM'))
#tokenizer_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names/TokRoBERTa'))

# Datafiles names containing training and test data
datafile= 'IMDB Dataset SPANISH.csv'
outputfile = 'submission.csv'
datafile_path = os.path.abspath(os.path.join(data_folder,datafile))
outputfile_path = os.path.abspath(os.path.join(output_folder,outputfile))

mymodel_name='RuPERTa_base_sentiment_analysis_es'

Load the datafile with the movie reviews:

In [4]:
# Load the dataset from a CSV file
df=pd.read_csv(datafile_path, header=0, usecols=[2,4],nrows=10000)
print('Num Examples: ',len(df))
print('Null Values\n', df.isna().sum())
print(df.head(5))

Num Examples:  10000
Null Values
 review_es      0
sentimiento    0
dtype: int64
                                           review_es sentimiento
0  Uno de los otros críticos ha mencionado que de...    positivo
1  Una pequeña pequeña producción.La técnica de f...    positivo
2  Pensé que esta era una manera maravillosa de p...    positivo
3  Básicamente, hay una familia donde un niño peq...    negativo
4  El "amor en el tiempo" de Petter Mattei es una...    positivo



Transform the target column to integer values, 1 for positive and 0 for negative:

In [5]:
#Create a dictionary to map the sentiment value to an integer value
sentiments = {"positivo":1, "negativo":0}
df['sentimiento'] = df['sentimiento'].map(sentiments)
#print('Num Examples: ',len(df))
print(df.head(5))

                                           review_es  sentimiento
0  Uno de los otros críticos ha mencionado que de...            1
1  Una pequeña pequeña producción.La técnica de f...            1
2  Pensé que esta era una manera maravillosa de p...            1
3  Básicamente, hay una familia donde un niño peq...            0
4  El "amor en el tiempo" de Petter Mattei es una...            1
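One caveat with the mapping step: `Series.map` with a dictionary turns any value that is missing from the dictionary into `NaN` without raising an error. The sketch below is a plain-Python illustration of a stricter mapping that fails loudly instead; `encode_labels` is a hypothetical helper and not part of the notebook.

```python
# Same mapping as in the notebook cell above
SENTIMENTS = {"positivo": 1, "negativo": 0}

def encode_labels(labels, mapping):
    """Map string labels to integers, failing on any label the mapping does not know."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(f"Unmapped sentiment labels: {unknown}")
    return [mapping[label] for label in labels]

encoded = encode_labels(["positivo", "negativo", "positivo"], SENTIMENTS)
print(encoded)  # [1, 0, 1]
```

With pandas, a comparable check would be to look at `df['sentimiento'].isna().sum()` right after the `map` call.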


Check the target value distribution to detect class imbalance:

In [6]:
#df['sentimiento'].plot(kind='bar', title='Target distribution')
print(df['sentimiento'].value_counts())
#Check the maximum length (in characters) of our text feature
print('Max length: ',max(df.review_es.apply(len)))

1    5028
0    4972
Name: sentimiento, dtype: int64
Max length:  9985


## Split the data into train and validation dataset

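We split the dataset into a training set (90%), a validation set (5%) and a test set (5%). `train_test_split` shuffles the examples and samples them at random: the first call holds out 10% of the data, and the second call divides that held-out portion evenly between validation and test.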

In [7]:
from sklearn.model_selection import train_test_split

# Define our text and label arrays
train_texts = df['review_es'].tolist()
train_labels = df['sentimiento'].tolist()

# Split the data into training, validation and test sets
# First split: keep 90% for training and hold out 10%
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.1)
# Second split: divide the held-out 10% evenly into validation and test sets
val_texts, test_texts, val_labels, test_labels = train_test_split(val_texts, val_labels, test_size=.5)

print('Length Train dataset: ', len(train_texts))
print('Length Val dataset: ', len(val_texts))
print('Length test dataset: ', len(test_texts))


Length Train dataset:  9000
Length Val dataset:  500
Length test dataset:  500
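The sizes printed above follow from the two-stage split: 10% of 10,000 examples is held out, and that held-out part is halved. The sketch below reproduces the arithmetic in plain Python. It is an illustration of the idea, not scikit-learn's implementation, and `two_stage_split` is a hypothetical helper.

```python
import random

def two_stage_split(items, holdout=0.1, seed=42):
    """Shuffle, hold out a fraction, then split the held-out part in half."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout)
    held_out = shuffled[:n_holdout]
    train = shuffled[n_holdout:]
    n_val = len(held_out) // 2
    return train, held_out[:n_val], held_out[n_val:]

train, val, test = two_stage_split(list(range(10000)))
print(len(train), len(val), len(test))  # 9000 500 500
```

In the notebook itself, passing `random_state` to `train_test_split` would play the role of the `seed` argument used here.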


In [None]:
type(train_texts)

list

The next cell optionally limits the number of training and validation examples, which reduces training cost and time during experiments. For the final run, train on the whole training dataset.

In [None]:
'''
# RUN THIS CELL only IF YOU WANT TO LIMIT THE EXAMPLES TO REDUCE TRAINING TIME
# To limit the training and validation dataset, for testing
max_train=5000
max_val=500
train_texts = train_texts[:max_train]
train_labels = train_labels[:max_train]
val_texts = val_texts[:max_val]
val_labels = val_labels[:max_val]
print('Length Train dataset: ', len(train_texts))
print('Length Val dataset: ', len(val_texts))
print('Length test dataset: ', len(test_texts))
print(type(train_texts))
'''

# Prepare the dataset

**Describe the tokenizer and classes for text classification**



## Setting the model and training parameters

Now it is time to set the model and training parameters. They will be used by the tokenizer and passed to the Trainer object in a later section.

In [8]:
TRAIN_BATCH_SIZE = 16   # batch size per device for training
VALID_BATCH_SIZE = 4    # batch size per device for evaluation
TRAIN_EPOCHS = 2        # number of training epochs
#LEARNING_RATE = 1e-4   # learning rate (unused: the Trainer default is kept)
SEED = 42               # random seed
MAX_LEN = 256           # maximum number of tokens per review


## Load the tokenizer pretrained for Spanish


In [9]:
# Loading the RoBERTa Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mrm8488/RuPERTa-base", do_lower_case=True)
# Print the BOS and EOS tokens used by the tokenizer
print(tokenizer.bos_token)
print(tokenizer.eos_token)

<s>
</s>


## Prepare and create the dataset

Now we can simply pass our texts to the tokenizer. We pass `truncation=True` and `padding=True` together with `max_length=MAX_LEN`, so every sequence in a call is truncated to at most `MAX_LEN` tokens and padded to the length of the longest sequence in that call. This allows us to feed batches of sequences into the model at the same time.

In [10]:
# Tokenize the training, validation and test texts
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=MAX_LEN)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=MAX_LEN)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=MAX_LEN)
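To make the effect of truncation and padding concrete, here is a small plain-Python sketch on made-up token ids. It assumes a padding id of 1, which is the usual RoBERTa convention, and it simplifies one detail: a real tokenizer keeps the closing special token when it truncates, while this sketch simply cuts the list. `pad_and_truncate` is a hypothetical helper.

```python
def pad_and_truncate(batch, max_length, pad_id=1):
    """Cut every sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 2]]  # made-up token ids
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])       # [[0, 11, 12, 2, 1], [0, 21, 22, 23, 24]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
```

The attention mask is what lets the model treat padded positions differently from real tokens.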


Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing `torch.utils.data.Dataset` and implementing `__len__` and `__getitem__`.

In [11]:
import torch

class IMDbReviews(torch.utils.data.Dataset):
  ''' Dataset for our IMDb reviews in Spanish
  '''

  def __init__(self, encodings, labels):
      self.encodings = encodings
      self.labels = labels

  def __getitem__(self, idx):
      item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
      item['labels'] = torch.tensor(self.labels[idx])
      return item

  def __len__(self):
      return len(self.labels)

# Create the train dataset
train_dataset = IMDbReviews(train_encodings, train_labels)
# Create the validation dataset
val_dataset = IMDbReviews(val_encodings, val_labels)
# Create the test dataset
test_dataset = IMDbReviews(test_encodings, test_labels)
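To see what a single item of such a dataset looks like, the following sketch mirrors the class above with plain Python lists instead of tensors. `ToyReviews` and the toy encodings are hypothetical and only illustrate the `__len__` / `__getitem__` protocol.

```python
class ToyReviews:
    """Torch-free mirror of the dataset class, for inspection only."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick the idx-th row of every encoding field, then attach the label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

toy_encodings = {
    "input_ids": [[0, 11, 2, 1], [0, 21, 22, 2]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_dataset = ToyReviews(toy_encodings, [1, 0])
print(len(toy_dataset))  # 2
print(toy_dataset[1])    # {'input_ids': [0, 21, 22, 2], 'attention_mask': [1, 1, 1, 1], 'labels': 0}
```

The key name `labels` matters: it is the argument name the sequence classification model expects when it computes the loss.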

## Load and define our RuPERTa-based model

We build our classifier on the pretrained `mrm8488/RuPERTa-base` checkpoint from the Hugging Face Hub. Thanks to Hugging Face's libraries and wrappers, creating the model is very simple: `AutoModelForSequenceClassification.from_pretrained` loads the pretrained RoBERTa encoder and adds a newly initialized classification head on top of it (two labels by default). That is why the loading log below reports that the language modeling head weights were not used and that some weights were newly initialized.


In [12]:
# Load the Ruperta model
model = AutoModelForSequenceClassification.from_pretrained("mrm8488/RuPERTa-base")

Some weights of the model checkpoint at mrm8488/RuPERTa-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at mrm8488/RuPERTa-base and are newly initialized

# Training the model

## Define the metric for model evaluation

We will use accuracy as our metric.

The datasets library provides this metric, so we load it and define a `compute_metrics` function that receives the model logits and the true labels, takes the class with the highest logit as the prediction, and returns the accuracy.


In [15]:
import numpy as np

metric = datasets.load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
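As a sanity check on what `compute_metrics` returns, the same computation can be written without any library. This is a minimal sketch on made-up logits; `accuracy_from_logits` is a hypothetical helper.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Predict the class with the highest logit and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]  # made-up values
toy_labels = [0, 1, 0, 0]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```

Here the predictions are `[0, 1, 1, 0]`, so three of the four examples are classified correctly.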



## Create the Trainer

Now it is time to set the training arguments: batch size, number of training epochs, checkpoint saving strategy, etc. Then we instantiate a `Trainer` object, passing the model to train, the training arguments, the metric computation function, and the training and evaluation datasets.


In [16]:
training_args = TrainingArguments(
    output_dir=model_folder,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    #predict_with_generate=True,
    #evaluate_during_training=True,
    evaluation_strategy="epoch",
    do_train=True,
    do_eval=True,
    logging_steps=512,  
    #save_steps=512, 
    save_strategy="epoch", 
    warmup_steps=500,  
    weight_decay=0.01,
    num_train_epochs = TRAIN_EPOCHS, #TRAIN_EPOCHS
    load_best_model_at_end=True,
    overwrite_output_dir=True,
    save_total_limit=1,
    fp16=True, 
)

# instantiate trainer
trainer = Trainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

Using amp fp16 backend


Now, we start training the model:

In [29]:
# Fine-tune the model: train on the training dataset and evaluate on the validation dataset after each epoch
trainer.train()

***** Running training *****
  Num examples = 9000
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1126


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.178 | 0.420261 | 0.822 |
| 2 | 0.1798 | 0.420261 | 0.822 |


***** Running Evaluation *****
  Num examples = 500
  Batch size = 4


Saving model checkpoint to /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/checkpoint-563
Configuration saved in /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/checkpoint-563/config.json
Model weights saved in /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/checkpoint-563/pytorch_model.bin
tokenizer config file saved in /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/checkpoint-563/tokenizer_config.json
Special tokens file saved in /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/checkpoint-563/special_tokens_map.json
Deleting older checkpoint [/content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/checkpoint-1126] due to args.save_total_limit

TrainOutput(global_step=1126, training_loss=0.1783766873564644, metrics={'train_runtime': 3169.5711, 'train_samples_per_second': 5.679, 'train_steps_per_second': 0.355, 'total_flos': 2367999498240000.0, 'train_loss': 0.1783766873564644, 'epoch': 2.0})
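The step counts in the log can be checked by hand. With 9,000 examples and a batch size of 16, one epoch takes 563 steps because the last batch is only partially filled. This explains both the `checkpoint-563` folder and the 1,126 total optimization steps. The sketch assumes a single device and no gradient accumulation, as reported in the log; `optimization_steps` is a hypothetical helper.

```python
import math

def optimization_steps(num_examples, batch_size, epochs):
    """Steps per epoch and in total, for one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # a partial last batch still counts
    return steps_per_epoch, steps_per_epoch * epochs

steps_per_epoch, total_steps = optimization_steps(9000, 16, 2)
print(steps_per_epoch, total_steps)  # 563 1126
```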

Save the fine-tuned model:

In [30]:
# Save the fine-tuned model
trainer.save_model(model_folder)

Saving model checkpoint to /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es
Configuration saved in /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/config.json
Model weights saved in /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/pytorch_model.bin
tokenizer config file saved in /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/tokenizer_config.json
Special tokens file saved in /content/drive/My Drive/Projects/transformer_sentiment_analysis_spanish/RuPERTa_base_sentiment_analysis_es/special_tokens_map.json


# Evaluate the model on the test dataset

Once the model is trained, we evaluate it on the held-out test dataset to check the result of fine-tuning on our target task, sentiment classification.

The test dataset is the 5% of the original data (500 reviews) that we set aside during the split and never used for training or validation.

In [31]:
# Evaluate on the test set
trainer.evaluate(test_dataset)

***** Running Evaluation *****
  Num examples = 500
  Batch size = 4


{'epoch': 2.0,
 'eval_accuracy': 0.874,
 'eval_loss': 0.3199535310268402,
 'eval_runtime': 15.8467,
 'eval_samples_per_second': 31.552,
 'eval_steps_per_second': 7.888}
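Beyond the aggregate accuracy, it is often useful to turn the two logits the model returns for a review into a label and a confidence. The sketch below does this with a softmax in plain Python. The logits are made up, and the `ID2LABEL` mapping is an assumption that simply inverts the `sentiments` dictionary defined earlier in the notebook.

```python
import math

# Inverse of the notebook's label mapping {"positivo": 1, "negativo": 0}
ID2LABEL = {0: "negativo", 1: "positivo"}

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_prediction(logits):
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

label, prob = decode_prediction([-1.2, 2.3])  # made-up logits for one review
print(label, round(prob, 3))
```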

If you need to **restore the trained model from a checkpoint** run the next cell, selecting the folder where the checkpoint was saved.

In [None]:
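# Use the name of a checkpoint folder that exists in model_folder
# (the training run above saved checkpoint-563)
checkpoint_path = os.path.abspath(os.path.join(model_folder,'checkpoint-563'))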
print(checkpoint_path)

Then we load the tokenizer and the fine-tuned model from the saved version. `trainer.save_model` stored both of them in `model_folder`.

In [None]:
# Load the tokenizer and the fine-tuned classification model
tokenizer = AutoTokenizer.from_pretrained(model_folder)
model = AutoModelForSequenceClassification.from_pretrained(model_folder)

# Move the model to the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

## Check for checkpoint directory

In [9]:
from distutils.dir_util import copy_tree

def copy_model_checkpoint(model_folder, ckpt_folder, dest_folder):
  ''' Copy a checkpoint folder of a HF model from the model folder to the dest folder
      Input:
      - model_folder: folder containing the saved checkpoints
      - ckpt_folder: name or substring of the name of the checkpoint folder to copy
      - dest_folder: folder where the checkpoint folder will be saved
  '''
  # Check that the model folder exists and is a directory
  if os.path.isdir(model_folder):
    # Extract all checkpoint folders in the model folder
    checkd = [d for d in os.listdir(model_folder) if ckpt_folder in d]
    # Check if there is any checkpoint folder
    if checkd:
      # Sort the checkpoint folder names (lexicographic order) and take the first one
      checkd.sort()
      print("Checkpoint folder to copy: ",checkd[0])
      # Copy the checkpoint folder to the destination folder with the same name
      copy_tree(os.path.join(model_folder,checkd[0]), os.path.join(dest_folder,checkd[0]))
      print(" Checkpoint folder copied to ",dest_folder)


In [12]:
ckpt_folder = 'checkpoint-'
test_folder = os.path.abspath(os.path.join(root_folder, 'Projects/test'))
copy_model_checkpoint(model_folder, ckpt_folder, test_folder)

Checkpoint folder to copy:  checkpoint-10000
 Checkpoint folder copied to  /content/drive/My Drive/Projects/test
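Checkpoint folders are named after the training step, and sorting those names as strings does not order them by step: `checkpoint-1000` sorts before `checkpoint-500`. If the goal is to pick the most recent checkpoint, one option is to compare the numeric suffix. This is a sketch under that assumption; `latest_checkpoint` is a hypothetical helper and the folder names are examples.

```python
import re

def latest_checkpoint(folder_names, prefix="checkpoint-"):
    """Return the folder with the highest numeric step, or None if there is none."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    candidates = []
    for name in folder_names:
        match = pattern.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    return max(candidates)[1] if candidates else None

names = ["checkpoint-500", "checkpoint-1000", "checkpoint-1500", "runs"]
print(sorted(names)[0])          # checkpoint-1000 (string order)
print(latest_checkpoint(names))  # checkpoint-1500 (numeric order)
```

In the notebook, the list of names would come from `os.listdir(model_folder)`.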


# Push the model to the Hugging Face Hub

The first step is to make sure your credentials for the Hub are stored locally. If you have access to a terminal, you can run the following command in the virtual environment where you installed 🤗 Transformers:

In [20]:
!transformers-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        
Username: edumunozsala
Password: 
Login successful
Your token: [redacted]

Your token has been saved to /root/.huggingface/token


Next, install git-lfs in the environment used by the notebook:

In [24]:
!sudo apt-get install git-lfs
!git lfs install

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 2s (899 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package git-lfs.
(Reading database ... 155219 files and directories cur

Now we can upload the model to the Hugging Face Hub under your account as `<hfusername>/<model_name>`.

In [32]:
model.push_to_hub(mymodel_name, commit_message = "Temporary Commit",use_temp_dir=True)

Cloning https://huggingface.co/edumunozsala/RuPERTa_base_sentiment_analysis_es into local empty directory.


Download file pytorch_model.bin:   0%|          | 446/481M [00:00<?, ?B/s]

Clean file pytorch_model.bin:   0%|          | 1.00k/481M [00:00<?, ?B/s]

Configuration saved in /tmp/tmp8m5l1ejx/config.json
Model weights saved in /tmp/tmp8m5l1ejx/pytorch_model.bin


Upload file pytorch_model.bin:   0%|          | 3.36k/481M [00:00<?, ?B/s]

To https://huggingface.co/edumunozsala/RuPERTa_base_sentiment_analysis_es
   7201094..d21d36e  main -> main



'https://huggingface.co/edumunozsala/RuPERTa_base_sentiment_analysis_es/commit/d21d36e042b7fdecc90e2d55ff1ee192c2528471'

## References

https://www.analyticsvidhya.com/blog/2021/06/why-and-how-to-use-bert-for-nlp-text-classification/

https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb

Huggingface documentation
https://huggingface.co/transformers/custom_datasets.html?highlight=trainer



In [None]:
# Generate the text without setting a decoding strategy
def generate_summary(batch):
    # Tokenizer will automatically set [BOS] <text> [EOS]
    # cut off at BERT max length 512
    inputs = tokenizer(batch["description"], padding="max_length", truncation=True, max_length=MAX_LEN, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    #outputs = roberta_shared.generate(input_ids, attention_mask=attention_mask)
    outputs = roberta_shared.generate(input_ids, attention_mask=attention_mask)

    # all special tokens including will be removed
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred"] = output_str

    return batch


In [None]:
# Generate text using beam search
def generate_summary_beam_search(batch):
    # Tokenizer will automatically set [BOS] <text> [EOS]
    # cut off at BERT max length 512
    inputs = tokenizer(batch["description"], padding="max_length", truncation=True, max_length=MAX_LEN, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    outputs = roberta_shared.generate(input_ids, attention_mask=attention_mask,
                                  num_beams=15,
                                  repetition_penalty=3.0, 
                                  length_penalty=2.0, 
                                  num_return_sequences = 1
    )

    # all special tokens including will be removed
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred"] = output_str

    return batch

# Generate text using top-k / top-p sampling
def generate_summary_topk(batch):
    # Tokenizer will automatically set [BOS] <text> [EOS]
    # cut off at BERT max length 512
    inputs = tokenizer(batch["description"], padding="max_length", truncation=True, max_length=MAX_LEN, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    outputs = roberta_shared.generate(input_ids, attention_mask=attention_mask,
                                  repetition_penalty=3.0, 
                                  length_penalty=2.0, 
                                  num_return_sequences = 1,
                                  do_sample=True,
                                  top_k=50, 
                                  top_p=0.95,

    )

    # all special tokens including will be removed
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred"] = output_str

    return batch


Now, we can make predictions for the test dataset using Beam search strategy and top-k sampling technique.

In [None]:
batch_size = TRAIN_BATCH_SIZE

#results = test_data.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["description"])
# Generate predictions using beam search
results = test_data.map(generate_summary_beam_search, batched=True, batch_size=batch_size, remove_columns=["description"])
pred_str_bs = results["pred"]
# Generate predictions using top-k sampling
results = test_data.map(generate_summary_topk, batched=True, batch_size=batch_size, remove_columns=["description"])
pred_str_topk = results["pred"]

#label_str = results["Summary"]


  0%|          | 0/91 [00:00<?, ?ba/s]

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


  0%|          | 0/91 [00:00<?, ?ba/s]

Now, we can see some results from our trained model to check its performance on the task.

In [None]:
#Show an example
print("Product Description: ",df['description'][1])
print(" Name using BS: ", pred_str_bs[1])
print(" Name using Top-K Sampling: ", pred_str_topk[1])


Product Description:  loosefitting dress with round neckline long sleeves pleat details buttoned opening back.height model 69.6″
 Name using BS:  flowing dress with pleats
 Name using Top-K Sampling:  printed dress with pleats


In [None]:
#Show an example
print("Product Description: ",df['description'][50])
print(" Name using BS: ", pred_str_bs[50])
print(" Name using Top-K Sampling: ", pred_str_topk[50])


Product Description:  long sleeve knit sweater with round neckline ribbed trims.height model . 74.4″
 Name using BS:  textured sweater with stripes
 Name using Top-K Sampling:  textured sweater with stripes


When more than one output is generated per example, we need to join them into a single list:

In [None]:
import numpy as np

preds=np.reshape(pred_str, (-1, 3))
print('Predictions Shape: ',preds.shape)
predictions = [','.join(p) for p in preds]
print('Num predictions: ', len(predictions),predictions)

Predictions Shape:  (1441, 3)
Num predictions:  1441 ['lace midi dress trf,lacetrimmed camisole dress,lacetrimmed dress tr', 'pleated dress trf,floral print dress,poplin dress', 'corduroy nautical cap,check nautical cap,checked nautical cap', 'corduroy nautical cap,faux shearling nautical cap,check nautical cap', 'check nautical cap,checked nautical cap,corduroy nautical cap', 'cropped tiedye tshirt trf,fadedeffect tshirt print trf,fadedeffect tshirt slogan trf', 'faux suede coat,coat faux suede,doublefaced faux suede coat', 'full sleeve tshirt,fringed tshirt,balloon sleeve tshirt', 'stretch top,stretch top trf,stretch top straps', 'stretch top trf,stretch top straps trf,stretch top', 'stretch top straps trf,stretch top,stretch top trf', 'stretch top straps trf,stretch top trf,stretch top straps', 'stretch top straps trf,stretch top trf,cropped vest top trf', 'stretch top straps trf,stretch top trf,stretch top straps', 'stretch top,stretch top trf,stretch top straps trf', 'mini dress p

Save the predictions to a file:

In [None]:
final_df = pd.DataFrame({'name':pred_str})
final_df.to_csv(outputfile_path, index=False)
print('Output Files generated for review')

Output Files generated for review


In [None]:
# Code to compare results, not used with test dataset
'''
#Calculate the Rouge score for the predictions using beam search
print("ROUGE 1 SCORE: ",rouge.compute(predictions=pred_str_bs, references=label_str, rouge_types=["rouge1"])["rouge1"].mid)
print("ROUGE 2 SCORE: ",rouge.compute(predictions=pred_str_bs, references=label_str, rouge_types=["rouge2"])["rouge2"].mid)
print("ROUGE F SCORE: ",rouge.compute(predictions=pred_str_bs, references=label_str, rouge_types=["rougeL"])["rougeL"].mid)
#Calculate the Rouge score for the predictions using sampling search
print("ROUGE 1 SCORE: ",rouge.compute(predictions=pred_str_topk, references=label_str, rouge_types=["rouge1"])["rouge1"].mid)
print("ROUGE 2 SCORE: ",rouge.compute(predictions=pred_str_topk, references=label_str, rouge_types=["rouge2"])["rouge2"].mid)
print("ROUGE F SCORE: ",rouge.compute(predictions=pred_str_topk, references=label_str, rouge_types=["rougeL"])["rougeL"].mid)
'''

