# Using a Huggingface T5 model to generate product names from longer descriptions

In this notebook we use the text summarization capabilities of the T5 model to generate product names from longer descriptive text. T5 was not trained specifically to generate product names, but summarizing a description into a few words is a reasonable way to approach that task.

We also show how to log the model's metrics and performance during training to the Weights & Biases platform.

## Loading the libraries

In [None]:
!pip install sentencepiece
!pip install transformers -q
!pip install wandb -q


In [None]:
#Show packages version


In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
import os

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

import wandb

Selecting the GPU or CPU device 

In [None]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

## Loading the dataset

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


The variables below hold the data folders and filenames; change them when training on another task.

In [None]:
#Set the path to the data folder, datafile and output folder and files
root_folder = '/content/drive/My Drive/'
data_folder = os.path.abspath(os.path.join(root_folder, 'datasets/text_gen_product_names'))
model_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names/T5Model'))
output_folder = os.path.abspath(os.path.join(root_folder, 'Projects/text_generation_names'))
# Set the filenames for your input and output datasets
test_filename='cl_test_descriptions.csv'
datafile= 'product_names_desc_cl_train.csv'
outputfile = 'submission.csv'
# Set the path to the files
datafile_path = os.path.abspath(os.path.join(data_folder,datafile))
testfile_path = os.path.abspath(os.path.join(data_folder,test_filename))
outputfile_path = os.path.abspath(os.path.join(output_folder,outputfile))

In [None]:
# Set random seeds and deterministic pytorch for reproducibility
torch.manual_seed(42) # pytorch random seed
np.random.seed(42) # numpy random seed
torch.backends.cudnn.deterministic = True


Load the datafile with the product descriptions and names:

In [None]:
# Load the dataset: product name, product description
df=pd.read_csv(datafile_path, header=0, usecols=[0,1])
print('Num Examples: ',len(df))
print('Null Values\n', df.isna().sum())

Num Examples:  31593
Null Values
 name           44
description     1
dtype: int64


Remove rows with null values: 

In [None]:
df.dropna(inplace=True)
print('Num Examples: ',len(df))

Num Examples:  31548


Prepare the data and adjust it to be consumed by the T5 model

In [None]:
# Add the tag summarize to the description column
df.description = 'summarize: ' + df.description
print(df.head())

                                name                                        description
0  towel with metallic thread border  summarize: towel with border with lines metall...
1     technical cargo bermuda shorts  summarize: printed bermuda shorts made technic...
2              body shaping bodysuit  summarize: bodysuit with shapewear effect . th...
3                     satin camisole  summarize: vneck with thin adjustable straps.h...
4   check print puritan collar dress  summarize: puritan collar dress featuring long...


## Split the data into train and validation dataset

In [None]:
# Creation of Dataset and Dataloader
# Defining the train size: 90% of the data will be used for training and the rest for validation.
train_size = 0.9
train_dataset=df.sample(frac=train_size,random_state = 42)
val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

In [None]:
#Only for testing and example
text = train_dataset.name
ctext = train_dataset.description
print(text[0])
print(ctext[0])

women miranda makaroff print
summarize: crop with straight neckline adjustable thin straps . featuring woman miranda makaroff print .


# Create the Tokenizer for T5 model

In [None]:
base_model="t5-base"

In [None]:
# tokenizer for encoding the text
tokenizer = T5Tokenizer.from_pretrained(base_model)


In [None]:
print(tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id)
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token)

Using bos_token, but it is not set yet.


None 1 0
None </s> <pad>


## Setting the model parameters

In [None]:
TRAIN_BATCH_SIZE = 8    # input batch size for training (default: 64)
VALID_BATCH_SIZE = 2    # input batch size for testing (default: 1000)
TRAIN_EPOCHS = 3        # number of epochs to train (default: 10)
VAL_EPOCHS = 1 
LEARNING_RATE = 1e-4    # learning rate (default: 0.01)
SEED = 42               # random seed (default: 42)
MAX_LEN = 150 #256
SUMMARY_LEN = 7 

# Create the Datasets

In [None]:
# Creating a custom dataset for reading the dataframe and loading it into the dataloader to pass it to the neural network at a later stage for finetuning the model and to prepare it for predictions
class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len, generation_only):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.generation_only = generation_only
        if not generation_only:
            self.text = self.data.name

        self.ctext = self.data.description

    def __len__(self):
        return len(self.ctext)

    def __getitem__(self, index):
        ctext = str(self.ctext[index])
        ctext = ' '.join(ctext.split())

        #source = self.tokenizer.batch_encode_plus([ctext], max_length= self.source_len, pad_to_max_length=True,return_tensors='pt')
        source = self.tokenizer.batch_encode_plus([ctext], max_length= self.source_len, padding='max_length', 
                                                  truncation=True, return_tensors='pt')
        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()

        output ={
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long)}
        # Create the labels 
        if not self.generation_only: 
            text = str(self.text[index])
            text = ' '.join(text.split())

            #target = self.tokenizer.batch_encode_plus([text], max_length= self.summ_len, pad_to_max_length=True,return_tensors='pt')
            target = self.tokenizer.batch_encode_plus([text], max_length= self.summ_len, padding='max_length', 
                                                      truncation=True, return_tensors='pt')

            target_ids = target['input_ids'].squeeze()
            target_mask = target['attention_mask'].squeeze()

            output['target_ids']=target_ids.to(dtype=torch.long)
            output['target_ids_y']=target_ids.to(dtype=torch.long)

        return output


In [None]:
# Creating the Training and Validation dataset for further creation of Dataloader
training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN, SUMMARY_LEN, False)
val_set = CustomDataset(val_dataset, tokenizer, MAX_LEN, SUMMARY_LEN, False)

# Defining the parameters for creation of dataloaders
train_params = {
        'batch_size': TRAIN_BATCH_SIZE,
        'shuffle': True,
        'num_workers': 0
        }

val_params = {
        'batch_size': VALID_BATCH_SIZE,
        'shuffle': False,
        'num_workers': 0
        }

# Creation of Dataloaders for training and validation. These are used below in the training and validation stages of the model.
training_loader = DataLoader(training_set, **train_params)
val_loader = DataLoader(val_set, **val_params)

## Testing the Dataloader

In [None]:
it = iter(training_loader)
batch = next(it)
print(batch["source_ids"].shape, batch["target_ids"].shape)

torch.Size([2, 256]) torch.Size([2, 10])


Let's check the shapes of the decoder inputs and labels built from this batch:

In [None]:
y = batch['target_ids'].to(device, dtype = torch.long)
y_ids = y[:,:-1].contiguous() # Original
#print(batch['target_ids'][:,target_ids > 0])
print(y.shape)
#print(y_ids[:,y_ids>0])
print(y_ids.shape)
lm_labels = y[:,1:].clone().detach() # Original
lm_labels[y[:,1:] == tokenizer.pad_token_id] = -100 # Original
print(lm_labels.shape)


torch.Size([2, 10])
torch.Size([2, 9])
torch.Size([2, 9])
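
To make the slicing above concrete, here is a minimal sketch in plain Python lists (no tensors) of how decoder inputs and labels are derived from one padded target sequence. The token ids and the helper names `manual_shift` and `shift_right` are made up for illustration; only the pad id `0` and the ignore value `-100` come from the notebook. The second helper shows an alternative scheme that prepends a start token, so it is a comparison under stated assumptions, not the notebook's method.

```python
PAD_ID = 0            # tokenizer.pad_token_id in the notebook
IGNORE_INDEX = -100   # label value that the loss ignores


def manual_shift(target_ids):
    """Mimic the notebook: inputs drop the last token, labels drop the first."""
    decoder_input_ids = target_ids[:-1]
    labels = [IGNORE_INDEX if t == PAD_ID else t for t in target_ids[1:]]
    return decoder_input_ids, labels


def shift_right(target_ids, start_id=PAD_ID):
    """Alternative: prepend a start token so every target token is a label."""
    decoder_input_ids = [start_id] + target_ids[:-1]
    labels = [IGNORE_INDEX if t == PAD_ID else t for t in target_ids]
    return decoder_input_ids, labels


# Hypothetical ids: three content tokens, EOS (1), then padding.
target = [11706, 3270, 28, 1, 0, 0, 0]

manual_inputs, manual_labels = manual_shift(target)
shifted_inputs, shifted_labels = shift_right(target)

print(manual_inputs, manual_labels)
print(shifted_inputs, shifted_labels)
```

With the manual slicing, the first target token (`11706` here) never appears as a label, whereas the shift-right variant uses every non-padding token as a label. As far as the `transformers` documentation describes, `T5ForConditionalGeneration` applies this shift-right scheme internally when `labels` is passed without `decoder_input_ids`, using the pad token as the decoder start token.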


# Create the T5 model

In [None]:
# Defining the model. We use the t5-base model with a language modeling head on top to generate the summary.
# The model is then sent to the device (GPU or CPU).
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model = model.to(device)


# Create the train function

In [None]:
# Creating the training function. It is called in the training loop, once per epoch.
# The model is put into train mode and then we enumerate over the training loader and pass each batch to the defined network
def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    for _,data in enumerate(loader, 0):
        y = data['target_ids'].to(device, dtype = torch.long)
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)
        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]
        # Register loss in W&B
        #if _%100 == 0:
        #    wandb.log({"Training Loss": loss.item()})

        if _%500==0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
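
The W&B logging lines in `train` are commented out. As a rough sketch of how periodic logging could be wired in, the helper below forwards the loss to a logging function every N steps. The name `make_step_logger` and the fake loss values are assumptions for illustration; a list's `append` stands in for `wandb.log` so the example runs without a W&B account.

```python
def make_step_logger(log_fn, every=100):
    """Build a callback that forwards the loss to `log_fn` every `every` steps."""
    def log_step(step, loss_value):
        if step % every == 0:
            log_fn({"Training Loss": loss_value, "step": step})
    return log_step


# Stand-in for wandb.log: collect the logged dictionaries in a list.
logged = []
log_step = make_step_logger(logged.append, every=100)

# Fake training loop with a made-up, steadily decreasing loss.
for step in range(250):
    fake_loss = 1.0 / (step + 1)
    log_step(step, fake_loss)

print([entry["step"] for entry in logged])
```

In the notebook, one would presumably call `wandb.init(...)` once before training, build the callback with `make_step_logger(wandb.log)`, and call it inside `train` with the batch index and `loss.item()`.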


# Create the validation function

In [None]:
# Validation function. After training we apply the model to the validation dataset to produce summaries.
# We apply a beam search strategy
def validate(epoch, tokenizer, model, device, loader):
    model.eval()
    predictions = []
    actuals = []
    with torch.no_grad():
        for _, data in enumerate(loader, 0):
            y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=SUMMARY_LEN, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
            if _%100==0:
                print(f'Completed {_}')

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals
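
For readers unfamiliar with the beam search used in `generate`, the following toy sketch shows the idea on a hand-written probability table. Everything here (the `NEXT` table, its probabilities, and the `beam_search` helper) is hypothetical, and the procedure is deliberately simplified: it has no length or repetition penalty and is not how `transformers` implements it.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the previous token.
NEXT = {
    "<s>":   {"lace": 0.5, "midi": 0.4, "</s>": 0.1},
    "lace":  {"dress": 0.6, "top": 0.3, "</s>": 0.1},
    "midi":  {"dress": 0.9, "</s>": 0.1},
    "dress": {"</s>": 1.0},
    "top":   {"</s>": 1.0},
}


def beam_search(num_beams=2, max_length=4):
    """Keep the `num_beams` best partial sequences at each step (by log-probability)."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_length):
        # Expand every live beam with every possible next token.
        candidates = []
        for score, seq in beams:
            for token, prob in NEXT[seq[-1]].items():
                candidates.append((score + math.log(prob), seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep the best candidates; those ending in </s> are complete.
        beams = []
        for score, seq in candidates[:num_beams]:
            if seq[-1] == "</s>":
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # Strip <s> and </s>, and convert log-probabilities back to probabilities.
    return [(" ".join(seq[1:-1]), math.exp(score)) for score, seq in finished]


greedy = beam_search(num_beams=1)
wide = beam_search(num_beams=2)
print(greedy)
print(wide)
```

With a single beam the search commits to the locally best first token ("lace") and ends with a sequence of probability 0.30, while two beams also keep "midi" alive and find the higher-probability sequence "midi dress" (0.36).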

# Training the model

In [None]:
# Defining the optimizer that will be used to tune the weights of the network in the training session. 
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

# Log metrics with wandb
#wandb.watch(model, log="all")
# Training loop
print('Initiating Fine-Tuning for the model on our dataset')

for epoch in range(TRAIN_EPOCHS):
        train(epoch, tokenizer, model, device, training_loader, optimizer)


Initiating Fine-Tuning for the model on our dataset
Epoch: 0, Loss:  8.945070266723633
Epoch: 0, Loss:  2.5906875133514404
Epoch: 0, Loss:  1.5396645069122314
Epoch: 0, Loss:  0.7348856925964355
Epoch: 0, Loss:  1.5720382928848267
Epoch: 0, Loss:  2.343564033508301
Epoch: 0, Loss:  1.3179247379302979
Epoch: 0, Loss:  0.8092208504676819
Epoch: 1, Loss:  1.729297161102295
Epoch: 1, Loss:  0.840223491191864
Epoch: 1, Loss:  0.779865562915802
Epoch: 1, Loss:  1.058944821357727
Epoch: 1, Loss:  0.5908939242362976
Epoch: 1, Loss:  0.6256150007247925
Epoch: 1, Loss:  0.778674840927124
Epoch: 1, Loss:  0.8843976855278015
Epoch: 2, Loss:  0.8240967988967896
Epoch: 2, Loss:  0.9908113479614258
Epoch: 2, Loss:  0.6734641194343567
Epoch: 2, Loss:  1.1316100358963013
Epoch: 2, Loss:  0.9343733191490173
Epoch: 2, Loss:  1.347994089126587
Epoch: 2, Loss:  1.0585122108459473
Epoch: 2, Loss:  0.7353636026382446


## Save the trained model

In [None]:
# Save the model
model.save_pretrained(model_folder)

# Validate the model with the validation dataset

In [None]:
# Validation loop: generate the predictions and save them with the actuals in a dataframe.
# The dataframe is saved as a CSV file at outputfile_path
print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')
for epoch in range(VAL_EPOCHS):
    predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
    final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
    final_df.to_csv(outputfile_path)
    print('Output Files generated for review')


Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe
Completed 0
Completed 100
Completed 200
Completed 300
Completed 400
Completed 500
Completed 600
Completed 700
Completed 800
Completed 900
Completed 1000
Completed 1100
Completed 1200
Completed 1300
Completed 1400
Completed 1500
Output Files generated for review


# Test the model 

Once we are happy with the results on the validation dataset, we can run the model on the test dataset, making predictions with a beam search strategy.
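
The notebook does not define a metric for judging the validation results. As one possible way to put a number on them, the sketch below computes exact match and a token-overlap F1 between generated and actual names. The function names and the example pairs are assumptions for illustration, not part of the original workflow.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(generated, actual):
    """Average exact match and token F1 over paired lists of names."""
    pairs = list(zip(generated, actual))
    exact = sum(g.strip().lower() == a.strip().lower() for g, a in pairs) / len(pairs)
    f1 = sum(token_f1(g, a) for g, a in pairs) / len(pairs)
    return {"exact_match": exact, "token_f1": f1}


# Made-up example pairs, in the same order as the two dataframe columns.
example_generated = ["lace dress", "midi dress"]
example_actual = ["lace dress", "knit midi dress"]
scores = evaluate(example_generated, example_actual)
print(scores)
```

Applied to the notebook, `evaluate(predictions, actuals)` would take the two lists returned by `validate`.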

## Load the trained model

If necessary, we can load the previously saved model.

In [None]:
# Reloading the fine-tuned model saved in model_folder.
# The model is then sent to the device (GPU or CPU).
model = T5ForConditionalGeneration.from_pretrained(model_folder)
model = model.to(device)

## Load the test dataset

In [None]:
# Load the test dataset: product descriptions only
df=pd.read_csv(testfile_path, header=0)
print('Num Examples: ',len(df))
print('Null Values\n', df.isna().sum())
df.head(5)

Num Examples:  1441
Null Values
 description    0
dtype: int64


Unnamed: 0,description
0,knit midi dress with vneckline straps matching...
1,loosefitting dress with round neckline long sl...
2,nautical with peak.this item must returned wit...
3,nautical with peak . adjustable inner strap de...
4,nautical with side button detail.this item mus...


Prepare the data for inference: we just need to prepend the prefix `summarize:` to the text.

In [None]:
# Add the tag summarize to the description column
df.description = 'summarize: ' + df.description
print(df.head())

                                         description
0  summarize: knit midi dress with vneckline stra...
1  summarize: loosefitting dress with round neckl...
2  summarize: nautical with peak.this item must r...
3  summarize: nautical with peak . adjustable inn...
4  summarize: nautical with side button detail.th...


Create a Dataset to iterate on the test samples.

In [None]:
# Creating the Test dataset for further creation of Dataloader
test_set = CustomDataset(df, tokenizer, MAX_LEN, SUMMARY_LEN, True)

In [None]:
test_params = {
        'batch_size': VALID_BATCH_SIZE,
        'shuffle': False,
        'num_workers': 0
        }

# Creation of Dataloader for testing.
test_loader = DataLoader(test_set, **test_params)


In [None]:
it = iter(test_loader)
batch = next(it)
print(batch["source_ids"].shape)

torch.Size([2, 256])


Invoke the `generate` method using beam search and return 3 alternative summaries. Here we show the results from an example taken from the test dataset. 

In [None]:
ids = batch['source_ids'].to(device, dtype = torch.long)
mask = batch['source_mask'].to(device, dtype = torch.long)

generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=SUMMARY_LEN, 
                num_beams=3,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                num_return_sequences = 3,
                early_stopping=True
                )


In [None]:
print(generated_ids.shape)
num_outputs=3
generated_ids = generated_ids.view(-1, num_outputs,generated_ids.shape[1])
print(generated_ids.shape)
print(generated_ids)
predictions = []
for batch_gid in generated_ids:
    preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in batch_gid]
    preds=','.join(list(filter(None, preds)))
    predictions.append(preds)
print(predictions)

torch.Size([6, 7])
torch.Size([2, 3, 7])
tensor([[[    0,     3, 11706,  3270,     1,     0,     0],
         [    0,  3270,    28,     3, 11706,  2736,     1],
         [    0,     1,     0,     0,     0,     0,     0]],

        [[    0, 32099,     3, 16091,  3270,     1,     0],
         [    0, 32099,     3, 24698,  3270,     1,     0],
         [    0, 32099,  3270,     1,     0,     0,     0]]], device='cuda:0')
['lace dress,dress with lace detail', 'midi dress,oversized dress,dress']


In [None]:
l=['lace dress', 'dress with lace detail', '']
l=list(filter(None, l))
print(l)
line=','.join(l)
print(line)

['lace dress', 'dress with lace detail']
lace dress,dress with lace detail
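
The reshape-and-join step above can also be written without tensors. This minimal sketch groups a flat list of decoded strings into one comma-separated entry per input, dropping empty candidates. The helper name `group_candidates` is hypothetical, and it assumes that the candidates for each input are contiguous in the flat list, as the `view(-1, num_outputs, ...)` call above relies on.

```python
def group_candidates(decoded, num_outputs, sep=","):
    """Join every `num_outputs` consecutive strings, skipping empty ones."""
    if len(decoded) % num_outputs != 0:
        raise ValueError("length of decoded must be a multiple of num_outputs")
    grouped = []
    for start in range(0, len(decoded), num_outputs):
        chunk = decoded[start:start + num_outputs]
        grouped.append(sep.join(c for c in chunk if c))
    return grouped


# The six decoded candidates from the batch above (2 inputs x 3 outputs).
flat = ["lace dress", "dress with lace detail", "",
        "midi dress", "oversized dress", "dress"]
grouped = group_candidates(flat, num_outputs=3)
print(grouped)
```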


# Create functions for text generation

For testing purposes, we create one function that uses a beam search strategy to produce the output text and another function that uses top-k / top-p sampling.

In [None]:
# Beam  search strategy
def generate_beamsearch(tokenizer, model, device, loader, n_beams, rep_penalty, len_penalty, num_outputs):
    model.eval()
    predictions = []

    with torch.no_grad():
        for _, data in enumerate(loader, 0):
            #y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=SUMMARY_LEN, 
                num_beams=n_beams,
                repetition_penalty=rep_penalty, 
                length_penalty=len_penalty, 
                num_return_sequences = num_outputs,
                early_stopping=True
                )
            # Reshape the output to (batch_size, num_outputs, len)
            generated_ids = generated_ids.view(-1, num_outputs,generated_ids.shape[1])
            #preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            for batch_gid in generated_ids:
                preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in batch_gid]
                preds=','.join(list(filter(None, preds)))
                predictions.append(preds)

            #target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)for t in y]
            if _%100==0:
                print(f'Completed {_}')

            #predictions.append(preds)
            #actuals.extend(target)
    return predictions

# Top-k / top-p sampling strategy
def generate_ksampling(tokenizer, model, device, loader, top_k, top_p, rep_penalty, len_penalty, num_outputs):
    model.eval()
    predictions = []

    with torch.no_grad():
        for _, data in enumerate(loader, 0):
            #y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=SUMMARY_LEN, 
                do_sample=True,
                repetition_penalty=rep_penalty, 
                length_penalty=len_penalty, 
                num_return_sequences = num_outputs,
                top_k=top_k, 
                top_p=top_p,
                early_stopping=True
                )
            # Reshape the output to (batch_size, num_outputs, len)
            generated_ids = generated_ids.view(-1, num_outputs,generated_ids.shape[1])
            for batch_gid in generated_ids:
                preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in batch_gid]
                preds=','.join(list(filter(None, preds)))
                predictions.append(preds)

            #target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)for t in y]
            if _%100==0:
                print(f'Completed {_}')

            #predictions.append(preds)
            #actuals.extend(target)
    return predictions
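
To illustrate what the `top_k` and `top_p` arguments do during sampling, here is a simplified sketch that filters a made-up next-token distribution and then samples from it. The function name, the vocabulary, and the probabilities are hypothetical. The real library works on logits, and its boundary handling may differ, so treat this as an approximation of the idea only.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k tokens, then the smallest prefix reaching top_p mass; renormalize."""
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    # Top-p: walk down the ranking until the renormalized cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token distribution.
probs = {"dress": 0.5, "top": 0.25, "skirt": 0.15, "coat": 0.07, "hat": 0.03}
filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.9)
print(filtered)

# Sample three tokens from the filtered distribution with a fixed seed.
rng = random.Random(42)
samples = rng.choices(list(filtered), weights=list(filtered.values()), k=3)
print(samples)
```

Here top-k removes "hat", and top-p then cuts "coat" because the first three tokens already cover more than 90% of the remaining mass, so sampling only ever picks among "dress", "top", and "skirt".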

Finally, we generate the text names for the test dataset and save it in a CSV file.

In [None]:
# Generation loop: produce the predictions for the test dataset and save them in a dataframe.
# The dataframe is saved as a CSV file at outputfile_path
print('Now generating summaries on our fine tuned model for the test dataset and saving it in a dataframe')
#generate_beamsearch(tokenizer, model, device, loader, n_beams, rep_penalty, len_penalty)
#predictions = generate_beamsearch(tokenizer, model, device, test_loader, 25, 2.5, 1.0, 1)
predictions = generate_ksampling(tokenizer, model, device, test_loader, 50, 0.95, 1.5, 1.0, 3)
final_df = pd.DataFrame({'name':predictions})
final_df.to_csv(outputfile_path, index=False)
print('Output Files generated for review')

Now generating summaries on our fine tuned model for the test dataset and saving it in a dataframe
Completed 0
Completed 100
Completed 200
Completed 300
Completed 400
Completed 500
Completed 600
Completed 700
Output Files generated for review


In [None]:
len(predictions)

1441