<a href="https://colab.research.google.com/github/bhaskarachalla/Developing_models_on_own_dataset/blob/master/LoRA(DistilGPT)_FineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [21]:
! pip install -q datasets
! pip install -q bitsandbytes
! pip install -q peft #parameter efficient fine tuning

In [22]:
from typing import Dict, List
from datasets import Dataset, load_dataset, disable_caching
disable_caching()
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
from torch.utils.data import Dataset
from IPython.display import Markdown
import pandas as pd

In [31]:
dataset = load_dataset("fka/awesome-chatgpt-prompts", split = 'train')
print(dataset)

Dataset({
    features: ['act', 'prompt'],
    num_rows: 170
})


In [34]:
prompt_template = """Below is an instruction that describes a task, Write a response that approporiately complete the request. Instruction: {instruction} \n Response: """
answer_template = """{response}"""

# Creating a function to add keys in the dictionary for prompt, answer and whole text

def _add_text(rec):
  instruction = rec["prompt"]
  response = rec["act"]
  if not instruction:
    raise ValueError(f"Expected an instructions in: {rec}")
  if not response:
    raise ValueError(f"Expected a response in: {rec}")

  rec["prompt"] = prompt_template.format(instruction = instruction)
  rec["answer"] = answer_template.format(response = response)
  rec["text"] = rec["prompt"] + rec["answer"]

  return rec

dataset = dataset.map(_add_text)
print(dataset[0])

Map:   0%|          | 0/170 [00:00<?, ? examples/s]

{'act': 'An Ethereum Developer', 'prompt': 'Below is an instruction that describes a task, Write a response that approporiately complete the request. Instruction: Below is an instruction that describes a task, Write a response that approporiately complete the request. Instruction: Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation. \n Response:  \n Response: ', 'answer': 'An Ethereum Developer', 'text': 'Below is an instruction that describes a task, Write a response t

In [35]:
model_id = "distilbert/distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map = "auto",
    load_in_8bit = True,
    torch_dtype = torch.float16
)

model.resize_token_embeddings(len(tokenizer))

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Embedding(50257, 768)

In [36]:
torch.cuda.empty_cache()

A data collator is a utility function used in sequence-to-sequence (seq2seq) algorithms to prepare and format the data for training. Its primary purpose is to take a batch of input data and corresponding target data, and return a formatted batch that can be fed into the seq2seq model.

The data collator typically performs the following tasks:

1. Padding: Pads the input sequences to have the same length, usually the maximum length in the batch.
2. Masking: Creates masks to indicate which parts of the input sequences are padding tokens.
3. Tokenization: Converts the input sequences into token IDs.
4. Batching: Combines the input sequences, masks, and token IDs into a single batch.

The formatted batch is then passed to the seq2seq model for training. The data collator is usually implemented as a Python function or a PyTorch Dataset class.


In [38]:
from functools import partial
import copy
from transformers import DataCollatorForSeq2Seq

MAX_LENGTH = 256

def _preprocess_batch(batch: Dict[str, List]) :
  model_inputs = tokenizer(batch["text"], max_length=MAX_LENGTH, truncation = True, padding = 'max_length')
  model_inputs["labels"] = copy.deepcopy(model_inputs['input_ids'])

  return model_inputs

_preprocessing_function = partial(_preprocess_batch)
encode_small_dataset = dataset.map(
    _preprocessing_function,
    batched = True,
    remove_columns = ["prompt", "act", "prompt", "answer"]
)


processed_dataset = encode_small_dataset.filter(lambda rec: len(rec["input_ids"]) <= MAX_LENGTH)

# SPillting dataset
split_dataset = processed_dataset.train_test_split(test_size = 14, seed = 42)
print(split_dataset)

data_collator = DataCollatorForSeq2Seq(model = model, tokenizer=tokenizer, max_length = MAX_LENGTH, pad_to_multiple_of = 8, padding = "max_length")

Map:   0%|          | 0/170 [00:00<?, ? examples/s]

Filter:   0%|          | 0/170 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 156
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 14
    })
})


In [54]:
from peft import LoraConfig, prepare_model_for_kbit_training # Importing LoraConfig from peft instead of transformers
from transformers import BartForCausalLM

In [56]:
LORA_R = 256
LORA_ALPHA = 512
LORA_DROPOUT = 0.05

# DEFINE LoRA Config

lora_config = LoraConfig(
    r = LORA_R, # the dimnesions low rank matrices
    lora_alpha = LORA_ALPHA, # Scaling factor for the weight matrices
    lora_dropout = LORA_DROPOUT, # dropout probability of the LoRA layer
    bias = "none",
    task_type = "CAUSAL_LM",
    target_modules=['transformer.h.0.attn.c_attn', 'transformer.h.0.attn.c_proj',
                     'transformer.h.0.mlp.c_fc', 'transformer.h.0.mlp.c_proj',
                     'transformer.h.1.attn.c_attn', 'transformer.h.1.attn.c_proj',
                     'transformer.h.1.mlp.c_fc', 'transformer.h.1.mlp.c_proj',
                     'transformer.h.2.attn.c_attn', 'transformer.h.2.attn.c_proj',
                     'transformer.h.2.mlp.c_fc', 'transformer.h.2.mlp.c_proj',
                     'transformer.h.3.attn.c_attn', 'transformer.h.3.attn.c_proj',
                     'transformer.h.3.mlp.c_fc', 'transformer.h.3.mlp.c_proj',
                     'transformer.h.4.attn.c_attn', 'transformer.h.4.attn.c_proj',
                     'transformer.h.4.mlp.c_fc', 'transformer.h.4.mlp.c_proj',
                     'transformer.h.5.attn.c_attn', 'transformer.h.5.attn.c_proj',
                     'transformer.h.5.mlp.c_fc', 'transformer.h.5.mlp.c_proj',]
)

model = prepare_model_for_kbit_training(model)

model = BartForCausalLM.from_pretrained('facebook/bart-base')

#model.print_trainable_parameters()

Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['decoder.embed_tokens.weight', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [57]:
from transformers import TrainingArguments, Trainer
import bitsandbytes
# define the training arguments first.
EPOCHS = 3
LEARNING_RATE = 2e-5
MODEL_SAVE_FOLDER_NAME = "DistilGPT2_LORA"
training_args = TrainingArguments(
                    output_dir=MODEL_SAVE_FOLDER_NAME,
                    overwrite_output_dir=True,
                    fp16=True, #converts to float precision 16 using bitsandbytes
                    per_device_train_batch_size=1,
                    per_device_eval_batch_size=1,
                    learning_rate=LEARNING_RATE,
                    num_train_epochs=EPOCHS,
                    logging_strategy="epoch",
                    evaluation_strategy="epoch",
                    save_strategy="epoch",
)
# training the model
trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=split_dataset['train'],
        eval_dataset=split_dataset["test"],
        data_collator=data_collator,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
# only saves the incremental ðŸ¤— PEFT weights (adapter_model.bin) that were trained, meaning it is super efficient to store, transfer, and load.
trainer.model.save_pretrained(MODEL_SAVE_FOLDER_NAME)
# save the full model and the training arguments
trainer.save_model(MODEL_SAVE_FOLDER_NAME)
trainer.model.config.save_pretrained(MODEL_SAVE_FOLDER_NAME)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss
1,4.838,2.079256
2,2.2621,1.110184
3,1.4802,0.704064


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
Non-default generation parameters: {'early_stopping': True, 'num_beams

In [58]:
# Function to format the response and filter out the instruction from the response.
def postprocess(response):
    messages = response.split("Response:")
    if not messages:
        raise ValueError("Invalid template for prompt. The template should include the term 'Response:'")
    return "".join(messages[1:])
# Prompt for prediction
inference_prompt = "Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation."
# Inference pipeline with the fine-tuned model
inf_pipeline =  pipeline('text-generation', model=trainer.model, tokenizer=tokenizer, max_length=256, trust_remote_code=True)
# Format the prompt using the `prompt_template` and generate response
response = inf_pipeline(prompt_template.format(instruction=inference_prompt))[0]['generated_text']
# postprocess the response
formatted_response = postprocess(response)
formatted_response

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


'    that that that\n\n\n that that an an an instruction instruction instruction about about about act act act that that a a a task task task is is is::: you you you the the the Instruction Instruction Instruction is is describes describes describes as as as and and and,,, Instruction Instruction the the an an you you response response response that that You You You is is You You that that for for for to to to that that " " " to to for for that that you you an an act actporporpor Response Response Response an an Response Response is is an an for foriii#'

In [59]:
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat',"is","sitting","on","the","mat"]]
test = ["on",'the',"mat","is","a","cat"]
score = sentence_bleu(  reference, test)
print(score)

5.5546715329196825e-78


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
