# Lightweight Fine-Tuning Project

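Choices made for this project:

* PEFT technique: LoRA (rank `r=4`, all other settings left at the PEFT defaults)
* Model: `distilgpt2`, loaded with `AutoModelForCausalLM`
* Evaluation approach: qualitative comparison of the text generated for four banking prompts before and after fine-tuning
* Fine-tuning dataset: the first 5,000 rows of `talkmap/banking-conversation-corpus`, split 90/10 into train and test sets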

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load the pre-trained model (the tokenizer is loaded in the next cell)
model_name = "distilgpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)





In [2]:
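from transformers import pipeline
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 has no pad token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
# Pass the model name (not the `model` object) so the pipeline loads its own
# copy of the weights and stays a clean baseline after LoRA is applied to `model`
orig_generator = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
)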

print(tokenizer.model_max_length)  # Check the maximum length of the tokenizer

Device set to use cpu


1024


Choose some prompts, then evaluate the model-generated responses.

In [3]:
prompts = [
    "I need to open a checking account",
    "I want to apply for a credit card",
    "I want to transfer money to my savings account",
    "I need to setup automatic payments for my utilities",
]

for prompt in prompts:
    print(f"Prompt: {prompt}")
    print(orig_generator(prompt))


Prompt: I need to open a checking account
[{'generated_text': "I need to open a checking account, and I can't find out why. We have this in mind.\n\n\nThe following is an example of how to do this.\nFirst, you have to create a log.logger.logger.logger.logger.logger.logger. In this case, you have to create a new log.logger.logger.logger.logger.logger.logger. In this case, you have to create a log.logger.logger.logger.logger.logger.logger.logger.logger.log.logger.log.log.logger.log.logger.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log.log"}]
Prompt: I want to apply for a credit card
[{'generated_text': 'I want to apply for a credit card.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n

## Prepare the PEFT Model and Dataset for Training

Set up the model for PEFT LoRA training.


In [4]:
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=4,  # To reduce the number of trainable parameters
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()


trainable params: 73,728 || all params: 81,986,304 || trainable%: 0.0899
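
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes that PEFT's default LoRA target for GPT-2-style models is the fused attention projection `c_attn` (768 inputs, 2304 outputs) in each of distilgpt2's 6 transformer blocks. These dimensions are not shown in the notebook and are stated here as assumptions.

```python
# Assumed architecture constants for distilgpt2 (not printed in this notebook)
N_LAYERS = 6       # transformer blocks
D_IN = 768         # hidden size feeding c_attn
D_OUT = 3 * 768    # c_attn emits query, key, and value in one projection


def lora_param_count(r, d_in=D_IN, d_out=D_OUT, n_layers=N_LAYERS):
    """Parameters added by LoRA: A is (r x d_in) and B is (d_out x r) per adapted layer."""
    return n_layers * r * (d_in + d_out)


trainable = lora_param_count(r=4)
total = 81_986_304  # "all params" as reported by print_trainable_parameters()
print(f"trainable params: {trainable:,} || trainable%: {100 * trainable / total:.4f}")
# trainable params: 73,728 || trainable%: 0.0899
```

Under these assumptions, doubling `r` doubles the trainable-parameter count, which is why a small rank keeps the adapter lightweight.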




### Load the Dataset and Preprocess the Data

In [5]:
# Load the banking conversation dataset
from datasets import load_dataset, DatasetDict

# Login using e.g. `huggingface-cli login` to access this dataset
# ds = load_dataset("Josephgflowers/Finance-Instruct-500k", split="train[:5000]")

# Read only the first 5000 entries due to resource limits
ds = load_dataset("talkmap/banking-conversation-corpus", split="train[:5000]")

# split into train and test sets
ds = ds.train_test_split(test_size=0.1)
# explore the dataset
print(ds)

DatasetDict({
    train: Dataset({
        features: ['conversation_id', 'speaker', 'date_time', 'text'],
        num_rows: 4500
    })
    test: Dataset({
        features: ['conversation_id', 'speaker', 'date_time', 'text'],
        num_rows: 500
    })
})


#### Tokenize the dataset

In [6]:
# quick check that things are working

inputs = tokenizer(ds['train'][0]['text'], return_tensors="pt")
inputs['input_ids'].shape
#print(tokenizer.decode(inputs['input_ids'][0]))  # index 0: decode expects a single sequence
outputs = lora_model(**inputs)  # Forward pass with the tokenized inputs
outputs

CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[-33.1717, -31.6326, -33.7443,  ..., -44.9420, -44.7888, -33.9273],
         [-67.7578, -67.2971, -67.5085,  ..., -73.7265, -72.1330, -66.4876],
         [-55.9518, -56.5346, -58.7248,  ..., -64.3963, -62.8929, -57.7981],
         ...,
         [-73.1146, -74.6174, -76.4408,  ..., -79.0709, -80.8036, -73.2944],
         [-79.6601, -83.9914, -87.3437,  ..., -92.5419, -94.3485, -83.1136],
         [-72.7861, -72.3935, -73.0458,  ..., -81.6174, -83.1252, -66.5091]]],
       grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[-8.7863e-01,  2.6339e+00,  7.7920e-01,  ..., -1.2441e+00,
           -1.5730e-01,  1.6261e+00],
          [-1.6237e+00,  2.7957e+00,  1.6042e+00,  ..., -9.6159e-01,
           -1.8298e+00,  2.1775e+00],
          [-2.1820e+00,  2.4550e+00,  1.8944e+00,  ..., -1.0605e+00,
           -2.0144e+00,  1.5937e+00],
          ...,
          [-1.9662e+00,  2.5642e+00,  2.4126e+00,  ..., -1.7261e+00,
        

Define a function to group the *tokenized* text into fixed-size chunks of 128 tokens.

In [7]:
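# Group the tokenized texts into chunks and also copy the input_ids to labels.
# DataCollatorForLanguageModeling(mlm=False) rebuilds labels from input_ids at
# batch time, so the copy made here is effectively a fallback.
from itertools import chain

block_size = 128
def chunk_texts(examples):
    # Concatenate each column (input_ids, attention_mask) across the batch
    concatenated = {}
    for k, v in examples.items():
        concatenated[k] = list(chain.from_iterable(v))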

    total_length = len(concatenated[list(examples.keys())[0]])
    # just drop the small remainder
    total_length = (total_length // block_size) * block_size
    # Split by chunks
    result = {}
    for k, v in concatenated.items():
        result[k] = [v[i : i + block_size] for i in range(0, total_length, block_size)]

    result["labels"] = result["input_ids"].copy()
    return result
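
To see what the grouping does, here is a toy run with a block size of 4 instead of 128. The function is a self-contained restatement of the same idea (named `chunk_tokens` here, a hypothetical name chosen to avoid clashing with the notebook's function), and the token ids are made up for illustration.

```python
from itertools import chain


def chunk_tokens(examples, block_size):
    """Concatenate each column across the batch, then cut it into equal blocks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block
    total_length = (total_length // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


# Three made-up "utterances" with 3, 5, and 2 tokens (10 tokens in total)
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
chunks = chunk_tokens(batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that blocks cross utterance boundaries and the last two tokens are dropped. The same effect, applied per batch with a block size of 128, is why the number of rows changes after this preprocessing step.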

In [10]:
# Do the simple tokenization first and drop the unused features.

tokenized_datasets = {}
for split in ds.keys():
    tokenized_datasets[split] = ds[split].map(
        lambda x: tokenizer(x['text']),
        batched=True,
        remove_columns=["conversation_id", "speaker", "date_time", "text"],
    )


In [11]:
# complete the preprocessing with grouping
preprocessed_ds = {split : tokenized_datasets[split].map(
    chunk_texts,
    batched=True,
) for split in tokenized_datasets.keys()}

Map: 100%|██████████| 4500/4500 [00:00<00:00, 15895.78 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 22331.03 examples/s]


In [12]:
preprocessed_ds

{'train': Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 1006
 }),
 'test': Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 107
 })}

#### Set the Training Arguments and the Trainer

In [13]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./data/lora-bank-gpt2",
    num_train_epochs=1,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    weight_decay=0.01,
    load_best_model_at_end=False,
    push_to_hub=False,
)

In [14]:
# Train
from transformers import Trainer
from transformers import DataCollatorForLanguageModeling

# let the data_collator handle the batching jobs
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=preprocessed_ds['train'],
    eval_dataset=preprocessed_ds['test'],
    data_collator=data_collator,
    processing_class=tokenizer,
)

No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


#### Now train the model. Without a GPU, this will take a long time

In [15]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.7864        | No log          |


TrainOutput(global_step=126, training_loss=3.7301211130051386, metrics={'train_runtime': 135.8309, 'train_samples_per_second': 7.406, 'train_steps_per_second': 0.928, 'total_flos': 32915029229568.0, 'train_loss': 3.7301211130051386, 'epoch': 1.0})
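
The step count in the training output is consistent with the dataset size. The check below assumes the `Trainer` default batch size of 8 per device, a single device, and no gradient accumulation, since none of these are set in `TrainingArguments` above.

```python
import math


def steps_per_epoch(num_examples, batch_size=8):
    """Batches in one pass over the data when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


train_steps = steps_per_epoch(1006)  # rows in the preprocessed train split
eval_steps = steps_per_epoch(107)    # rows in the preprocessed test split
print(train_steps, eval_steps)       # 126 14
```

The 126 training steps match `global_step=126` in the `TrainOutput`.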

In [16]:
# Evaluate the fine-tuned model on the held-out test split
results = trainer.evaluate()
print(results)



{'eval_runtime': 5.3056, 'eval_samples_per_second': 20.167, 'eval_steps_per_second': 2.639, 'epoch': 1.0}
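
The evaluation output above contains no `eval_loss`, which is consistent with the earlier warning that no `label_names` were available for `PeftModel`. Passing `label_names=["labels"]` to `TrainingArguments` is a commonly suggested remedy, though it was not tried in this notebook. Until a validation loss is available, the training loss can be turned into a rough perplexity figure. The sketch below assumes the reported loss is a mean per-token cross-entropy in nats.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)


train_loss = 3.7301211130051386  # training_loss from the TrainOutput above
print(f"train perplexity ~ {perplexity(train_loss):.1f}")
```

This is a training-set figure, so it says nothing about generalization. The same conversion applies to `eval_loss` once the evaluation run reports one.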


###  ⚠️ IMPORTANT ⚠️

Due to workspace storage constraints, do not store the model weights in the workspace directory. Always save them under `/tmp` instead; filling the workspace can cause crashes that are irrecoverable.

In [17]:
# Saving the model
save_path = "/tmp/lora-bank-gpt2"
lora_model.save_pretrained(save_path)
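
Saving only the LoRA adapter keeps the footprint small, because `save_pretrained` on a PEFT model writes the adapter weights rather than the full base model. A back-of-the-envelope estimate, assuming 4-byte float32 weights and ignoring file-format overhead:

```python
BYTES_PER_PARAM = 4  # assumption: weights stored as float32


def size_mib(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate storage for a set of parameters, in MiB."""
    return num_params * bytes_per_param / 2**20


adapter_mib = size_mib(73_728)    # trainable params reported earlier
full_mib = size_mib(81_986_304)   # all params reported earlier
print(f"adapter ~ {adapter_mib:.2f} MiB, full model ~ {full_mib:.0f} MiB")
```

Under these assumptions the adapter is roughly a thousandth of the size of the full model, in line with the 0.0899% trainable share.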

## Performing Inference with a PEFT Model

In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [18]:
from transformers import pipeline

# Load the fine-tuned model for inference
finetuned_model = pipeline("text-generation", model=save_path, tokenizer=tokenizer)

# Generate responses for the same prompts
for prompt in prompts:
    print(f"Prompt: {prompt}")
    print(finetuned_model(prompt))

Device set to use cpu


Prompt: I need to open a checking account
[{'generated_text': 'I need to open a checking account for the current account. We will check that account too. But the account has failed to close.\n\n\n\nWhat is the state of our service?\nThis depends on the service. In your case, we were looking for a new account, but we did not find the new account. We would need to send a check message to the service.\nWhat is the state of our service?\nWe are looking for a new account, but the service has failed to close.\nWhat does the state of our service?\nThe state of our service depends on your service. In your case, we were looking for a new account, but we did not find the new account. We would need to send a check message to the service.\nWhat is the state of our service?\nThe state of our service depends on your service. In your case, we were looking for a new account, but we did not find the new account. We would need to send a check message to the service.\nWhat is the state of our service?\nW

## Conclusion

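Judging from the sample generations, the fine-tuned model stays closer to banking topics than the original model, which suggests that the banking-conversation data shifted it toward this domain. For the checking-account prompt it talks about accounts and service status, whereas the original model drifts into logging code or emits long runs of newlines. The fine-tuned output is still repetitive, and the comparison is qualitative only: the evaluation run reported no validation loss, so there is no held-out metric to back it up.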
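
The comparison between the two models in this notebook is made by eye. One lightweight way to put a number on how repetitive a generation is would be a distinct-n ratio (unique word n-grams divided by total word n-grams). This metric is not part of the original notebook; it is offered as a sketch, and the sample strings below are shortened stand-ins rather than actual model output.

```python
def distinct_n(text, n=2):
    """Share of unique word n-grams in text; lower values mean more repetition."""
    words = text.split()
    ngrams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical snippets, for illustration only
looping = "log.logger log.logger log.logger log.logger log.logger"
varied = "we were looking for a new account but did not find it"
print(distinct_n(looping), distinct_n(varied))  # 0.25 1.0
```

The function could be applied to each `generated_text` returned by the two pipelines. Because `split()` discards whitespace, long runs of newlines are not counted, so the number of non-whitespace words is worth tracking alongside it.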

In [19]:
# For easy reference, the original model's results on the same prompts
# are repeated below
for prompt in prompts:
    print(f"Prompt: {prompt}")
    print(orig_generator(prompt))

Prompt: I need to open a checking account
[{'generated_text': 'I need to open a checking account for this server. You can also use it with the following commands.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'}]
Prompt: I want to apply for a credit card
[{'generated_text': "I want to apply for a credit card, but my first step is to apply for a MasterCard card.\n\n\nI'm currently looking at a MasterCard card, but I'm not sure if the card can be used on my cards.\nCheck out this list of 4 ways to get a MasterCard card.\n1. Check out the Master Card with all 