# Installing Dependencies

In order to get started, we will install the libraries in `requirements.txt` that we will use to load any pretrained huggingface model.

In [1]:
#!pip install -r requirements.txt

# Preparing data in csv files

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("banking77")

train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]

Using custom data configuration default
Reusing dataset banking77 (/home/azureuser/.cache/huggingface/datasets/banking77/default/1.1.0/aec0289529599d4572d76ab00c8944cb84f88410ad0c9e7da26189d31f62a55b)


In [3]:
import pandas as pd

train_df = pd.DataFrame()
eval_df = pd.DataFrame()


for i in train_dataset:
    train_df = train_df.append(i, ignore_index=True)

for i in eval_dataset:
    eval_df = eval_df.append(i, ignore_index=True)

In [4]:
train_df.head()

Unnamed: 0,label,text
0,11.0,I am still waiting on my card?
1,11.0,What can I do if my card still hasn't arrived ...
2,11.0,I have been waiting over a week. Is the card s...
3,11.0,Can I track my card while it is in the process...
4,11.0,"How do I know if I will get my card, or if it ..."


In [5]:
eval_df.head()

Unnamed: 0,label,text
0,11.0,How do I locate my card?
1,11.0,"I still have not received my new card, I order..."
2,11.0,I ordered a card but it has not arrived. Help ...
3,11.0,Is there a way to know when my card will arrive?
4,11.0,My card has not arrived yet.


In [6]:
train_df.to_csv("../data/train.csv", index=False)
eval_df.to_csv("../data/eval.csv", index=False)

# Experiment Parameters

In [7]:
# Processing Parameters
preprocessing_num_workers = None #The number of processes to use for the preprocessing.
overwrite_cache = True # Overwrite the cached training and evaluation sets.

# Training Parameters
max_train_samples = None #For debugging purposes or quicker training, truncate the number of training examples to this value if set.
max_eval_samples = None #For debugging purposes or quicker training, truncate the number of evaluation examples to this value if set.
model_name = "gpt2"
output_dir = "outputcsvFiles" 

# Load dataset

We will use a small dataset for testing purposes. 

Dataset `banking77` composed of online banking queries annotated with their corresponding intents.

`banking77` dataset provides a very fine-grained set of intents in a banking domain. It comprises 13,083 customer service queries labeled with 77 intents. 

For our purpose, we will ignore the intent label and focus on generating texts from the banking domain.

**In this notebook, the dataset is already saved in csv files. We'll load the dataset from there!**

In [8]:
from datasets import load_dataset

raw_datasets = load_dataset('csv', data_files={'train': '../data/train.csv', 'test': '../data/eval.csv'})

Using custom data configuration default-86eb0fae5e1c7c0e


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/azureuser/.cache/huggingface/datasets/csv/default-86eb0fae5e1c7c0e/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...


0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /home/azureuser/.cache/huggingface/datasets/csv/default-86eb0fae5e1c7c0e/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff. Subsequent calls will reuse this data.


In [9]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 10003
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 3080
    })
})

In [10]:
import random

index = random.sample(range(len(raw_datasets["train"])), 1)
print(raw_datasets["train"][index])

index = random.sample(range(len(raw_datasets["test"])), 1)
print(raw_datasets["test"][index])

OrderedDict([('label', [24.0]), ('text', ['Which countries do you operate in'])])
OrderedDict([('label', [56.0]), ('text', ['I would like to refill my account using SWIFT.'])])


# Preprocess & Tokenize Datasets

In [11]:
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text_column_name = "text"
column_names = raw_datasets["train"].column_names

## Preprocess Dataset & add eos_token 

In [12]:
# Main data processing function that will add eos_token to each text in the dataset
def add_eos_token(examples):
    examples_with_eos = examples
    examples_with_eos[text_column_name] = [x + tokenizer.eos_token for x in examples[text_column_name]]  
    return examples_with_eos

raw_datasets = raw_datasets.map(
    add_eos_token,
    batched=True,
    num_proc=preprocessing_num_workers,
    load_from_cache_file=not overwrite_cache,
    desc=f"Adding eos_token to each example in the dataset",
)

Adding eos_token to each example in the dataset:   0%|          | 0/11 [00:00<?, ?ba/s]

Adding eos_token to each example in the dataset:   0%|          | 0/4 [00:00<?, ?ba/s]

In [13]:
index = random.sample(range(len(raw_datasets["train"])), 1)

print(raw_datasets["train"][index])

OrderedDict([('label', [49.0]), ('text', ['After inputting the wrong pin too many times, can you now help me unblock my pin?<|endoftext|>'])])


## Tokenize dataset using gpt2 tokenizer

In [14]:
def tokenize_function(examples):
    return tokenizer(examples[text_column_name])

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not overwrite_cache,
    desc="Running tokenizer on dataset",
)

Running tokenizer on dataset:   0%|          | 0/11 [00:00<?, ?ba/s]

Running tokenizer on dataset:   0%|          | 0/4 [00:00<?, ?ba/s]

In [15]:
index = random.sample(range(len(raw_datasets["train"])), 1)

print(raw_datasets["train"][index])
print(tokenized_datasets["train"][index])

OrderedDict([('label', [0.0]), ('text', ['Can I activate my card?<|endoftext|>'])])
OrderedDict([('attention_mask', [[1, 1, 1, 1, 1, 1, 1]]), ('input_ids', [[6090, 314, 15155, 616, 2657, 30, 50256]])])


# Concatenate all texts from our dataset and generate chunks of block_size

In [16]:
block_size = tokenizer.model_max_length
if block_size > 1024:
    # The tokenizer picked seems to have a very large `model_max_length`
    block_size = 1024

# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=len(tokenized_datasets["train"]), # if training size is very small, like in our case.
    num_proc=preprocessing_num_workers,
    load_from_cache_file=not overwrite_cache,
    desc=f"Grouping texts in chunks of {block_size}",
)

Grouping texts in chunks of 1024:   0%|          | 0/1 [00:00<?, ?ba/s]

Grouping texts in chunks of 1024:   0%|          | 0/1 [00:00<?, ?ba/s]

In [17]:
print(raw_datasets["train"][0])
print(raw_datasets["train"][1])
print(raw_datasets["train"][2])
print(raw_datasets["train"][3])

{'label': 11.0, 'text': 'I am still waiting on my card?<|endoftext|>'}
{'label': 11.0, 'text': "What can I do if my card still hasn't arrived after 2 weeks?<|endoftext|>"}
{'label': 11.0, 'text': 'I have been waiting over a week. Is the card still coming?<|endoftext|>'}
{'label': 11.0, 'text': 'Can I track my card while it is in the process of delivery?<|endoftext|>'}


In [18]:
print(tokenized_datasets["train"][0])
print(tokenized_datasets["train"][1])
print(tokenized_datasets["train"][2])
print(tokenized_datasets["train"][3])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [40, 716, 991, 4953, 319, 616, 2657, 30, 50256]}
{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [2061, 460, 314, 466, 611, 616, 2657, 991, 5818, 470, 5284, 706, 362, 2745, 30, 50256]}
{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [40, 423, 587, 4953, 625, 257, 1285, 13, 1148, 262, 2657, 991, 2406, 30, 50256]}
{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [6090, 314, 2610, 616, 2657, 981, 340, 318, 287, 262, 1429, 286, 7585, 30, 50256]}


In [19]:
print(lm_datasets["train"][0]['input_ids'][:40])

[40, 716, 991, 4953, 319, 616, 2657, 30, 50256, 2061, 460, 314, 466, 611, 616, 2657, 991, 5818, 470, 5284, 706, 362, 2745, 30, 50256, 40, 423, 587, 4953, 625, 257, 1285, 13, 1148, 262, 2657, 991, 2406, 30, 50256]


# Prepare Training & Evaluation Datasets

<span style="color:red">Recheck script train/eval datasets! It seems training data is split even if test set is provided!</span>

In [20]:
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["test"]

In [21]:
if max_train_samples is not None:
    train_dataset = train_dataset.select(range(max_train_samples))
if max_eval_samples is not None:
    eval_dataset = eval_dataset.select(range(max_eval_samples))

# Set Logging Level

In [22]:
import random
from importlib import reload  # Not needed in Python 2
import logging
reload(logging)
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')

logger = logging.getLogger()

# Log a few random samples from the training set:
#for index in random.sample(range(len(train_dataset)), 3):
    #logger.info(f"Sample {index} of the training set: {train_dataset[index]}. \n")
    #logger.info(f"Sample {index} of the training set shape: {len(train_dataset[index]['input_ids'])}. \n")    

In [23]:
import tensorflow as tf

index = random.sample(range(len(train_dataset)), 1)
example = train_dataset[index]
example = {key: tf.convert_to_tensor(arr, dtype_hint=tf.int64) for key, arr in example.items()}
print(example)

07:44:06 DEBUG:Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
07:44:06 DEBUG:Creating converter from 7 to 5
07:44:06 DEBUG:Creating converter from 5 to 7
07:44:06 DEBUG:Creating converter from 7 to 5
07:44:06 DEBUG:Creating converter from 5 to 7


{'attention_mask': <tf.Tensor: shape=(1, 1024), dtype=int64, numpy=array([[1, 1, 1, ..., 1, 1, 1]])>, 'input_ids': <tf.Tensor: shape=(1, 1024), dtype=int64, numpy=array([[ 307,  284,  779, ..., 1280, 1848,   13]])>, 'labels': <tf.Tensor: shape=(1, 1024), dtype=int64, numpy=array([[ 307,  284,  779, ..., 1280, 1848,   13]])>}


# Check Training Parameters

We can customize the training arguments using training_args if we want, or hypertune some on a seperate validation set (might take a huge amount of time though).

For more arguments, check: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TFTrainingArguments

In [24]:
from transformers import TFTrainingArguments

training_args = TFTrainingArguments(output_dir=output_dir)

num_replicas = training_args.strategy.num_replicas_in_sync
batches_per_epoch = len(train_dataset) // (num_replicas * training_args.per_device_train_batch_size)

{
    "init_lr": training_args.learning_rate,
    "num_replicas": num_replicas,
    "num_train_epochs": training_args.num_train_epochs,
    "per_device_train_batch_size": training_args.per_device_train_batch_size,
    "batches_per_epoch": len(train_dataset) // (num_replicas * training_args.per_device_train_batch_size),
    "num_train_steps": int(training_args.num_train_epochs * batches_per_epoch),
    "num_warmup_steps": training_args.warmup_steps,
    "adam_beta1": training_args.adam_beta1,
    "adam_beta2": training_args.adam_beta2,
    "adam_epsilon": training_args.adam_epsilon,
    "weight_decay_rate": training_args.weight_decay
}


07:44:08 INFO:PyTorch: setting up devices
07:44:08 INFO:The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
07:44:08 INFO:Tensorflow: setting up strategy


{'init_lr': 5e-05,
 'num_replicas': 1,
 'num_train_epochs': 3.0,
 'per_device_train_batch_size': 8,
 'batches_per_epoch': 18,
 'num_train_steps': 54,
 'num_warmup_steps': 0,
 'adam_beta1': 0.9,
 'adam_beta2': 0.999,
 'adam_epsilon': 1e-08,
 'weight_decay_rate': 0.0}

Steps:

* Load Pretrained Model 
* Resize the number of token embeddings in the model to that of the tokenizer
    * Since our model and tokenizer belong to the same model, the number of token embeddings should be the same.
    
* Generate tf.data.Dataset (s) Sample Generator:
    * Reoreder batch randomly.
    * Convert each tokenized text to a tensor.
 
* Define a callback SavePretrainedCallback that will save the model checkpoint at the end of each epoch.

* Define the neural network optimizer from the arguments set in the training_args!

* Define the loss: We are using a dummy loss that will minimize the difference between predicted and real next token.
    * There should be a smarter loss.

* Fit the model over the training dataset & evaluate the model over the eval dataset.

* Log the loss & the perplexity metric of the model.

* Save the final model to the output directory.

In [25]:
import numpy as np
import tensorflow as tf
import math
from functools import partial
from transformers import AutoConfig, TFAutoModelForCausalLM
from transformers import create_optimizer

def sample_generator(dataset, tokenizer):
    # Trim off the last partial batch if present
    sample_ordering = np.random.permutation(len(dataset))
    for sample_idx in sample_ordering:
        example = dataset[int(sample_idx)]
        # Handle dicts with proper padding and conversion to tensor.
        example = {key: tf.convert_to_tensor(arr, dtype_hint=tf.int64) for key, arr in example.items()}
        yield example, example["labels"]  # TF needs some kind of labels, even if we don't use them
    return

# region Helper classes
class SavePretrainedCallback(tf.keras.callbacks.Callback):
    # Hugging Face models have a save_pretrained() method that saves both the weights and the necessary
    # metadata to allow them to be loaded as a pretrained model in future. This is a simple Keras callback
    # that saves the model with this method after each epoch.
    def __init__(self, output_dir, **kwargs):
        super().__init__()
        self.output_dir = output_dir

    def on_epoch_end(self, epoch, logs=None):
        self.model.save_pretrained(self.output_dir)

training_args = TFTrainingArguments(output_dir=output_dir)
#training_args.per_device_train_batch_size = 32

with training_args.strategy.scope():

    config = AutoConfig.from_pretrained(model_name)
    model = TFAutoModelForCausalLM.from_pretrained(model_name, config=config)

    model.resize_token_embeddings(len(tokenizer))

    num_replicas = training_args.strategy.num_replicas_in_sync

    # region TF Dataset preparation
    train_generator = partial(sample_generator, train_dataset, tokenizer)
    train_signature = {
        feature: tf.TensorSpec(shape=(None,), dtype=tf.int64)
        for feature in train_dataset.features
        if feature != "special_tokens_mask"
    }
    train_sig = (train_signature, train_signature["labels"])
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF
    tf_train_dataset = (
        tf.data.Dataset.from_generator(train_generator, output_signature=train_sig)
        .with_options(options)
        .batch(batch_size=num_replicas * training_args.per_device_train_batch_size, drop_remainder=True)
        .repeat(int(training_args.num_train_epochs))
    )
    eval_generator = partial(sample_generator, eval_dataset, tokenizer)
    eval_signature = {
        feature: tf.TensorSpec(shape=(None,), dtype=tf.int64)
        for feature in eval_dataset.features
        if feature != "special_tokens_mask"
    }
    eval_sig = (eval_signature, eval_signature["labels"])
    tf_eval_dataset = (
        tf.data.Dataset.from_generator(eval_generator, output_signature=eval_sig)
        .with_options(options)
        .batch(batch_size=num_replicas * training_args.per_device_eval_batch_size, drop_remainder=True)
        .repeat(int(training_args.num_train_epochs))
    )
    # endregion
    # region Optimizer and loss
    
    batches_per_epoch = len(train_dataset) // (num_replicas * training_args.per_device_train_batch_size)
    # Bias and layernorm weights are automatically excluded from the decay
    optimizer, lr_schedule = create_optimizer(
        init_lr=training_args.learning_rate,
        num_train_steps=int(training_args.num_train_epochs * batches_per_epoch),
        num_warmup_steps=training_args.warmup_steps,
        adam_beta1=training_args.adam_beta1,
        adam_beta2=training_args.adam_beta2,
        adam_epsilon=training_args.adam_epsilon,
        weight_decay_rate=training_args.weight_decay,
    )

    def dummy_loss(y_true, y_pred):
        return tf.reduce_mean(y_pred)

    model.compile(optimizer=optimizer, loss={"loss": dummy_loss})
    # endregion

    # region Training and validation
    logger.info("***** Running training *****")
    logger.info(f"  Num examples = {len(train_dataset)}")
    logger.info(f"  Num Epochs = {training_args.num_train_epochs}")
    logger.info(f"  Instantaneous batch size per device = {training_args.per_device_train_batch_size}")
    logger.info(f"  Total train batch size = {training_args.per_device_train_batch_size * num_replicas}")

    history = model.fit(
        tf_train_dataset,
        validation_data=tf_eval_dataset,
        epochs=int(training_args.num_train_epochs),
        steps_per_epoch=len(train_dataset) // (training_args.per_device_train_batch_size * num_replicas),
        callbacks=[SavePretrainedCallback(output_dir=training_args.output_dir)],
    )
    try:
        train_perplexity = math.exp(history.history["loss"][-1])
    except OverflowError:
        train_perplexity = math.inf
    try:
        validation_perplexity = math.exp(history.history["val_loss"][-1])
    except OverflowError:
        validation_perplexity = math.inf
    logger.info(f"  Final train loss: {history.history['loss'][-1]:.3f}")
    logger.info(f"  Final train perplexity: {train_perplexity:.3f}")
    logger.info(f"  Final validation loss: {history.history['val_loss'][-1]:.3f}")
    logger.info(f"  Final validation perplexity: {validation_perplexity:.3f}")
    # endregion

    if training_args.output_dir is not None:
        model.save_pretrained(training_args.output_dir)

07:44:09 INFO:PyTorch: setting up devices
07:44:09 INFO:The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
07:44:09 INFO:Tensorflow: setting up strategy
07:44:09 INFO:loading weights file https://huggingface.co/gpt2/resolve/main/tf_model.h5 from cache at /home/azureuser/.cache/huggingface/transformers/4029f7287fbd5fa400024f6bbfcfeae9c5f7906ea97afcaaa6348ab7c6a9f351.723d8eaff3b27ece543e768287eefb59290362b8ca3b1c18a759ad391dca295a.h5

If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
07:44:12 INFO:***** Running training *****
07:44:12 INFO:  Num examples = 144
07:44:12 INFO:  Num Epochs = 3.0
07:44:12 INFO:  Instantaneous batch size per device = 8
07:44:12 INFO:  Tot

Epoch 1/3


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f0f53b51d90> is not a module, class, method, function, traceback, frame, or code object


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f0f53b51d90> is not a module, class, method, function, traceback, frame, or code object



Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.








07:57:29 DEBUG:Creating converter from 5 to 3
07:57:33 INFO:Model weights saved in outputcsvFiles/tf_model.h5


Epoch 2/3


08:11:07 INFO:Model weights saved in outputcsvFiles/tf_model.h5


Epoch 3/3


08:23:47 INFO:Model weights saved in outputcsvFiles/tf_model.h5
08:23:47 INFO:  Final train loss: 2.122
08:23:47 INFO:  Final train perplexity: 8.351
08:23:47 INFO:  Final validation loss: 2.103
08:23:47 INFO:  Final validation perplexity: 8.193
08:23:51 INFO:Model weights saved in outputcsvFiles/tf_model.h5


# Use Fine-tuned Model

Now that we have trained our new language model on new data, lets give it a try! We will want to use the path to the directory that the script outputs the model file to, and load it up to see results.

In [26]:
# setup imports to use the model
from transformers import TFGPT2LMHeadModel
from transformers import GPT2Tokenizer

model = TFGPT2LMHeadModel.from_pretrained(output_dir)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

08:23:51 INFO:loading weights file outputcsvFiles/tf_model.h5
08:23:53 DEBUG:Creating converter from 3 to 5

If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [36]:
input_ids = tokenizer.encode("My card", return_tensors='tf')

generated_text_samples = model.generate(
    input_ids, 
    max_length=30,  
    num_return_sequences=5,
    no_repeat_ngram_size=2,
    #repetition_penalty=1.5,
    #top_p=0.92,
    #temperature=.85,
    do_sample=True,
    #top_k=125,
    early_stopping=True
)

#Print output for each sequence generated above
for i, beam in enumerate(generated_text_samples):
  print("{}: {}".format(i + 1,tokenizer.decode(beam, skip_special_tokens=True)))



1: My card has been stolen. I need money. Thank you for the statement from the bank. If I forget that I did a transaction, would you
2: My card gets locked when I go into my account, and I have a few minutes before I need to leave due to unexpected security rules. Can I
3: My card was charged some time ago but it never returned, this will cost again
4: My card is on file now.
5: My card was charged but it was just taken out of my hand!" The guy just said, "Sorry, I thought you were charged. I would
