<a href="https://colab.research.google.com/github/VomV/NLP/blob/main/Fine_tuning_Dolly_2_0_with_LoRA_and_Alpaca.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Dolly 2.0 with LoRA

*   Dolly-v2-3b - https://huggingface.co/databricks/dolly-v2-3b
*   LoRA paper - https://arxiv.org/abs/2106.09685
*   Alpaca Cleaned Dataset - https://github.com/gururise/AlpacaDataCleaned





In [1]:
!git clone https://github.com/gururise/AlpacaDataCleaned.git

Cloning into 'AlpacaDataCleaned'...
remote: Enumerating objects: 747, done.
remote: Counting objects: 100% (124/124), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 747 (delta 64), reused 95 (delta 54), pack-reused 623
Receiving objects: 100% (747/747), 76.51 MiB | 6.12 MiB/s, done.
Resolving deltas: 100% (411/411), done.
Updating files: 100% (69/69), done.


In [2]:
ls AlpacaDataCleaned/

DATA_LICENSE                      assets/                  pyproject.toml
LICENSE                           dataset_extensions/      requirements.txt
README.md                         eval/                    schema.json
alpacaModifier.py                 generate_instruction.py  seed_tasks.jsonl
alpaca_data.json                  gui/                     tools/
alpaca_data_cleaned.json          modifierGui.py           utils.py
alpaca_data_cleaned_archive.json  prompt.txt


In [3]:
!pip install "accelerate>=0.12.0" "transformers[torch]==4.25.1"
!pip install -q datasets loralib sentencepiece
!pip -q install git+https://github.com/huggingface/peft.git
!pip -q install bitsandbytes

In [4]:
# Create Instruct Pipeline
import logging
import re

import numpy as np
from transformers import Pipeline, PreTrainedTokenizer

logger = logging.getLogger(__name__)

INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"
INTRO_BLURB = (
    "Below is an instruction that describes a task. Write a response that appropriately completes the request."
)

# This is the prompt that is used for generating responses using an already trained model.  It ends with the response
# key, where the job of the model is to provide the completion that follows it (i.e. the response itself).
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    instruction="{instruction}",
    response_key=RESPONSE_KEY,
)


def get_special_token_id(tokenizer: PreTrainedTokenizer, key: str) -> int:
    """Gets the token ID for a given string that has been added to the tokenizer as a special token.
    When training, we configure the tokenizer so that the sequences like "### Instruction:" and "### End" are
    treated specially and converted to a single, new token.  This retrieves the token ID each of these keys map to.
    Args:
        tokenizer (PreTrainedTokenizer): the tokenizer
        key (str): the key to convert to a single token
    Raises:
        ValueError: if more than one ID was generated
    Returns:
        int: the token ID for the given key
    """
    token_ids = tokenizer.encode(key)
    if len(token_ids) > 1:
        raise ValueError(f"Expected only a single token for '{key}' but found {token_ids}")
    return token_ids[0]


class InstructionTextGenerationPipeline(Pipeline):
    def __init__(
        self, *args, do_sample: bool = True, max_new_tokens: int = 256, top_p: float = 0.92, top_k: int = 0, **kwargs
    ):
        super().__init__(*args, do_sample=do_sample, max_new_tokens=max_new_tokens, top_p=top_p, top_k=top_k, **kwargs)

    def _sanitize_parameters(self, return_instruction_text=False, **generate_kwargs):
        preprocess_params = {}

        # newer versions of the tokenizer configure the response key as a special token.  newer versions still may
        # append a newline to yield a single token.  find whatever token is configured for the response key.
        tokenizer_response_key = next(
            (token for token in self.tokenizer.additional_special_tokens if token.startswith(RESPONSE_KEY)), None
        )

        response_key_token_id = None
        end_key_token_id = None
        if tokenizer_response_key:
            try:
                response_key_token_id = get_special_token_id(self.tokenizer, tokenizer_response_key)
                end_key_token_id = get_special_token_id(self.tokenizer, END_KEY)

                # Ensure generation stops once it generates "### End"
                generate_kwargs["eos_token_id"] = end_key_token_id
            except ValueError:
                pass

        forward_params = generate_kwargs
        postprocess_params = {
            "response_key_token_id": response_key_token_id,
            "end_key_token_id": end_key_token_id,
            "return_instruction_text": return_instruction_text,
        }

        return preprocess_params, forward_params, postprocess_params

    def preprocess(self, instruction_text, **generate_kwargs):
        prompt_text = PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction_text)
        inputs = self.tokenizer(
            prompt_text,
            return_tensors="pt",
        )
        inputs["prompt_text"] = prompt_text
        inputs["instruction_text"] = instruction_text
        return inputs

    def _forward(self, model_inputs, **generate_kwargs):
        input_ids = model_inputs["input_ids"]
        attention_mask = model_inputs.get("attention_mask", None)
        generated_sequence = self.model.generate(
            input_ids=input_ids.to(self.model.device),
            attention_mask=attention_mask,
            pad_token_id=self.tokenizer.pad_token_id,
            **generate_kwargs,
        )[0].cpu()
        instruction_text = model_inputs.pop("instruction_text")
        return {"generated_sequence": generated_sequence, "input_ids": input_ids, "instruction_text": instruction_text}

    def postprocess(self, model_outputs, response_key_token_id, end_key_token_id, return_instruction_text):
        sequence = model_outputs["generated_sequence"]
        instruction_text = model_outputs["instruction_text"]

        # The response will be set to this variable if we can identify it.
        decoded = None

        # If we have token IDs for the response and end, then we can find the tokens and only decode between them.
        if response_key_token_id and end_key_token_id:
            # Find where "### Response:" is first found in the generated tokens.  Considering this is part of the
            # prompt, we should definitely find it.  We will return the tokens found after this token.
            response_pos = None
            response_positions = np.where(sequence == response_key_token_id)[0]
            if len(response_positions) == 0:
                logger.warning(f"Could not find response key {response_key_token_id} in: {sequence}")
            else:
                response_pos = response_positions[0]

            if response_pos:
                # Next find where "### End" is located.  The model has been trained to end its responses with this
                # sequence (or actually, the token ID it maps to, since it is a special token).  We may not find
                # this token, as the response could be truncated.  If we don't find it then just return everything
                # to the end.  Note that even though we set eos_token_id, we still see this token at the end.
                end_pos = None
                end_positions = np.where(sequence == end_key_token_id)[0]
                if len(end_positions) > 0:
                    end_pos = end_positions[0]

                decoded = self.tokenizer.decode(sequence[response_pos + 1 : end_pos]).strip()
        else:
            # Otherwise we'll decode everything and use a regex to find the response and end.

            fully_decoded = self.tokenizer.decode(sequence)

            # The response appears after "### Response:".  The model has been trained to append "### End" at the
            # end.
            m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", fully_decoded, flags=re.DOTALL)

            if m:
                decoded = m.group(1).strip()
            else:
                # The model might not generate the "### End" sequence before reaching the max tokens.  In this case,
                # return everything after "### Response:".
                m = re.search(r"#+\s*Response:\s*(.+)", fully_decoded, flags=re.DOTALL)
                if m:
                    decoded = m.group(1).strip()
                else:
                    logger.warning(f"Failed to find response in:\n{fully_decoded}")

        if return_instruction_text:
            return {"instruction_text": instruction_text, "generated_text": decoded}

        return decoded
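
To make the prompt template concrete, the following minimal sketch re-creates the two-stage formatting from the cell above using only the standard library, then prints the text the model receives at generation time. The sample instruction is an arbitrary illustration.

```python
# Stand-alone re-creation of the generation prompt template from the cell above.
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

# The first .format() call fills in the fixed parts and re-inserts
# "{instruction}" so that it survives as a placeholder.
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    instruction="{instruction}",
    response_key=RESPONSE_KEY,
)

# The second .format() call inserts the user's instruction.
prompt = PROMPT_FOR_GENERATION_FORMAT.format(instruction="Find the capital of Spain.")
print(prompt)
```

The rendered prompt ends with the response key and a newline, so the model's job is to continue from there. Note that this template puts no blank lines between sections, whereas the training prompts built later by `generate_prompt` separate sections with blank lines.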



In [5]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b", padding_side="left")

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", 
                                             device_map="auto",
                                             torch_dtype=torch.bfloat16)

# generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

Downloading (…)okenizer_config.json: 100%|██████████| 450/450 [00:00<00:00, 1.65MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 2.11M/2.11M [00:00<00:00, 5.60MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 228/228 [00:00<00:00, 1.03MB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 819/819 [00:00<00:00, 2.27MB/s]
Downloading pytorch_model.bin:  25%|██▌       | 1.44G/5.68G [00:05<00:19, 221MB/s]

In [None]:
from datasets import load_dataset

data = load_dataset("json", 
                    data_files="./AlpacaDataCleaned/alpaca_data.json")

def generate_prompt(data_point):
    # taken from https://github.com/tloen/alpaca-lora
    if data_point["input"]:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Input:
{data_point["input"]}

### Response:
{data_point["output"]}"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Response:
{data_point["output"]}"""


data = data.map(lambda data_point: {"prompt": tokenizer(generate_prompt(data_point))})

data

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-f46f8869c4dab24f/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-f46f8869c4dab24f/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Map:   0%|          | 0/52002 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'prompt'],
        num_rows: 52002
    })
})
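
The two branches of the prompt builder are easier to see on concrete records. The sketch below is a stand-alone variant that picks the template based on whether the `input` field is non-empty, which is the assumed intent of the Alpaca format. `build_prompt` is a hypothetical name and both records are made up for illustration.

```python
def build_prompt(record: dict) -> str:
    """Render one Alpaca-style training prompt (hypothetical helper, for illustration only)."""
    if record.get("input"):
        # Records that carry extra context get the three-section template.
        return (
            "Below is an instruction that describes a task, paired with an input that provides "
            "further context. Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            f"### Response:\n{record['output']}"
        )
    # Records with an empty "input" field get the shorter two-section template.
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )


# Made-up records in the Alpaca schema (instruction / input / output).
with_input = build_prompt(
    {"instruction": "Translate to French.", "input": "I love my dog", "output": "J'aime mon chien."}
)
without_input = build_prompt(
    {"instruction": "Name a primary color.", "input": "", "output": "Red."}
)

print(with_input)
print("-" * 40)
print(without_input)
```

Only the first record produces an `### Input:` section; the second goes straight from the instruction to the response.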

## Fine-tuning Dolly

In [None]:
import os

# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, AutoModel, AutoConfig, GPTJForCausalLM

from peft import prepare_model_for_int8_training, LoraConfig, get_peft_model


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.


In [None]:
# Training settings used for the run logged below
MICRO_BATCH_SIZE = 4  # per-device batch size; lower it if the GPU runs out of memory
BATCH_SIZE = 128
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 2  # paper uses 3
LEARNING_RATE = 2e-5  
CUTOFF_LEN = 256  
LORA_R = 4
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
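
As a quick sanity check on these settings, the sketch below works out how many optimizer updates they imply. It assumes a single GPU, takes the example count (51,760) from the training log later in this notebook, and assumes that leftover micro-batches at the end of an epoch do not trigger an extra update, which is what the logged 808 total steps suggest.

```python
import math

MICRO_BATCH_SIZE = 4
BATCH_SIZE = 128
EPOCHS = 2
NUM_EXAMPLES = 51760  # rows in the cleaned dataset, as reported in the training log

# Number of micro-batches whose gradients are summed before one optimizer update.
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE

# Micro-batches per epoch (the last one may be smaller than MICRO_BATCH_SIZE).
micro_batches_per_epoch = math.ceil(NUM_EXAMPLES / MICRO_BATCH_SIZE)

# Optimizer updates per epoch, and over the whole run.
updates_per_epoch = micro_batches_per_epoch // GRADIENT_ACCUMULATION_STEPS
total_updates = updates_per_epoch * EPOCHS

print(GRADIENT_ACCUMULATION_STEPS)  # 32
print(micro_batches_per_epoch)      # 12940
print(updates_per_epoch)            # 404
print(total_updates)                # 808
```

With `logging_steps=1`, each row of the training-loss output therefore corresponds to one update over roughly 128 examples.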

In [None]:
model = prepare_model_for_int8_training(model, 
                                        use_gradient_checkpointing=True)

In [None]:
config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
tokenizer.pad_token_id = 0  # id 0 is <|endoftext|> in Dolly's GPT-NeoX vocabulary (not an unk token as in LLaMA), so padding shares the EOS id

data = load_dataset("json", data_files="./AlpacaDataCleaned/alpaca_data_cleaned.json")

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-96bfeeb2eac821b5/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-96bfeeb2eac821b5/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
data = data.shuffle().map(
    lambda data_point: tokenizer(
        generate_prompt(data_point),
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    )
)

Map:   0%|          | 0/51760 [00:00<?, ? examples/s]

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'input_ids', 'attention_mask'],
        num_rows: 51760
    })
})
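
The tokenization step above pads or truncates every example to `CUTOFF_LEN`, and the language-modeling collator used for training later turns padded positions into ignored labels. The pure-Python sketch below mimics that behavior on made-up token IDs. `pad_or_truncate` and `make_labels` are hypothetical helpers, not library functions, and the sketch assumes left padding (as set on the tokenizer earlier) and truncation from the right.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_or_truncate(token_ids, max_length, pad_id, padding_side="left"):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]  # keep the first max_length tokens
    n_pad = max_length - len(ids)
    mask = [1] * len(ids)
    if padding_side == "left":
        return [pad_id] * n_pad + ids, [0] * n_pad + mask
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


def make_labels(input_ids, pad_id):
    """Copy input_ids as labels, replacing every pad_id with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


# Made-up token IDs; a tiny max_length stands in for CUTOFF_LEN = 256.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate(range(10, 18), max_length=5, pad_id=0)

print(short_ids, short_mask)        # [0, 0, 11, 12, 13] [0, 0, 1, 1, 1]
print(make_labels(short_ids, 0))    # [-100, -100, 11, 12, 13]
print(long_ids, long_mask)          # [10, 11, 12, 13, 14] [1, 1, 1, 1, 1]
```

Because the label masking in this sketch keys on the token ID rather than on the attention mask, any real token that shares the padding ID would also be excluded from the loss.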

In [None]:

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=100,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=True,
        logging_steps=1,
        output_dir="lora-dolly",
        save_total_limit=3,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False
trainer.train(resume_from_checkpoint=False)

model.save_pretrained("alpaca-lora-dolly-2.0")

Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: output, input, instruction. If output, input, instruction are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 51760
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 32
  Total optimization steps = 808
  Number of trainable parameters = 1310720
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 1    | 2.6533        |
| 2    | 2.7105        |
| 3    | 2.6417        |
| 4    | 2.6372        |
| 5    | 2.5318        |
| 6    | 2.6893        |
| 7    | 2.6241        |
| 8    | 2.659         |
| 9    | 2.6604        |
| 10   | 2.6859        |


Saving model checkpoint to lora-dolly/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)
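
The `Number of trainable parameters = 1310720` line in the log above can be reproduced by hand. The sketch below assumes that LoRA adapters were attached only to the fused `query_key_value` projection of each transformer layer, and that dolly-v2-3b has a hidden size of 2560 and 32 layers. These architecture figures are assumptions, not values printed anywhere in this notebook, but they do reproduce the logged count.

```python
HIDDEN_SIZE = 2560  # assumed hidden size of dolly-v2-3b
NUM_LAYERS = 32     # assumed number of transformer layers
LORA_R = 4          # rank used in this notebook


def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


# The fused projection maps hidden_size -> 3 * hidden_size (query, key, value).
per_layer = lora_param_count(HIDDEN_SIZE, 3 * HIDDEN_SIZE, LORA_R)
total = per_layer * NUM_LAYERS

print(per_layer)  # 40960
print(total)      # 1310720
```

Doubling `LORA_R` doubles the adapter size, so even a rank of 16 would keep the trainable set at roughly 5 million parameters under the same assumptions.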




In [None]:
generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

In [None]:
generate_text("Look up the boiling point of water.")

'Water boils at 100 ° C (212 ° F) at atmospheric pressure.\n\n\n  \xa0\xa0\xa0\nTo look up the boiling point of water, you can look up the properties of water in a chemistry book or ask a knowledgeable friend for advice.'

In [None]:
generate_text("Find the capital of Spain.")

'The capital of Spain is Madrid. Madrid is the largest city in the country and one of the oldest settlements in Europe, having been inhabited since prehistoric times.\nBelow is the list of the main cities:\n1. Barcelona\n2. Valencia\n3. Seville\n4. Madrid\n5. Bilbao\n6. Bilbao is also known as the 3 cities.\n7. Valencia is also known as the 8 cities because of its eight cities that have been the capital.\n8. Barcelona is also known as the 10 cities and is the capital of the regional capital in Catalonia.\n9. Valencia and Barcelona are connected by a tunnel.\nBelow is a map of the main cities of Spain:\nBelow is the city of Valencia in Spain:\n\n clocks in is 6:45 am and my head is reeling with the knowledge that today is the first day of my first year in college and it has been an exciting morning so far. I just finished getting dressed and just went downstairs to my new class, the 101 Science II course. I find myself overwhelmed with both excitement and nervousness as I sit down at my

In [None]:
generate_text("Translate the following phrase into French: I love my dog")

'I adore mon chien.\n\n### Response:\nJe suis passionnée par mon chien.'
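
The output above contains a second `### Response:` marker, which suggests the model kept generating past its first answer (the training prompts in this notebook do not append `### End`, so the model may never have learned to stop there). One possible workaround is to cut the text at the first prompt marker after generation. The sketch below uses a hypothetical `trim_at_marker` helper, which is not part of the pipeline defined earlier.

```python
import re

# Matches any of the prompt section markers, e.g. "### Response:" or "### End".
MARKER_PATTERN = re.compile(r"#+\s*(Instruction|Response|End)\b")


def trim_at_marker(text: str) -> str:
    """Return the text before the first prompt marker, or all of it if none is found."""
    match = MARKER_PATTERN.search(text)
    if match is None:
        return text.strip()
    return text[: match.start()].strip()


raw = "I adore mon chien.\n\n### Response:\nJe suis passionnée par mon chien."
print(trim_at_marker(raw))  # I adore mon chien.
```

This only cleans up the displayed text. It does not improve the translation itself or shorten generation time.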

In [None]:
generate_text("Given a set of numbers, find the maximum value: Set: {10, 3, 25, 62, 16}")

"The maximum value of the given set of numbers is 16. The numbers are arranged in descending order: 3, 10, 25, 62. And the value of the maximum element of the given set is 16. \n\nThe reason behind this is that the given set is made of smaller elements with increasing values, in descending order. So, the maximum element can be obtained by taking the smallest element first and then finding the maximum value of the rest of the numbers that are lesser than it, in the given order.\n\nIn the given set, the first element, in the order from largest to smallest is 25, the element next to it is 10. So, 25 - 10 = 15. The value of the maximum element in the given set is 16 = 15 + (25 - 10) / 2 = 15 + 15/2 = 16.\n\n�\nThe answer is 16.\n\nLet's look at the given set.\nThe first element is 25. The next element is 10. The remaining elements are 62, 16. \n25 - 10 = 15.\nNext, we find the maximum value of the rest of the numbers in the given set:\n62 - 16 = 46.\nThe value of the maximum element in the