## Finetune Falcon-7b-Instruct

We will use the PEFT library from the Hugging Face ecosystem, together with QLoRA, for more memory-efficient fine-tuning.

## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and TRL to leverage the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops`, which is required to load Falcon models, and `langdetect`, which is used later to keep only English examples.

In [1]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops
!pip install langdetect

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993224 sha256=5a1d1e869b90b62dfd7e21dbce0254d61d28981857d245e1e79ea368ea30acd1
  Stored in directory: /home/ec2-user/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9
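
Because the install commands above use `-U` without pinned versions, results can drift as the libraries evolve. A minimal sketch for recording which versions were actually installed, using only the standard library (the `installed_versions` helper is hypothetical and not part of any of these packages):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

libraries = ["trl", "transformers", "accelerate", "peft",
             "datasets", "bitsandbytes", "einops", "langdetect"]
for name, ver in installed_versions(libraries).items():
    print(f"{name}: {ver or 'not installed'}")
```

Saving this output alongside the results makes it easier to reproduce a run later.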


## Dataset

For our experiment, we will use the Guanaco dataset, which is a clean subset of the OpenAssistant dataset adapted to train general purpose chatbots.

The dataset can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco).

In [5]:
import os
from datasets import load_dataset

os.environ['WANDB_DISABLED'] = 'true'

data = load_dataset("timdettmers/openassistant-guanaco", split="train",  
                    cache_dir=os.path.join(os.environ['PWD'],'hf/data_cache/'))

Downloading and preparing dataset json/timdettmers--openassistant-guanaco to /home/ec2-user/SageMaker/hf/data_cache/timdettmers___json/timdettmers--openassistant-guanaco-6126c710748182cf/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...


Dataset json downloaded and prepared to /home/ec2-user/SageMaker/hf/data_cache/timdettmers___json/timdettmers--openassistant-guanaco-6126c710748182cf/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.


In [6]:
from langdetect import detect

def extract_conversation(text):
    """
    Keep an example only if it is:
    1. Written in English
    2. Non-empty
    Only the first Human & Assistant exchange is returned.
    """
    try:
        if detect(text) != "en":
            return ''
    except Exception:
        # langdetect raises when it cannot identify a language (e.g. empty text)
        return ''

    # Split the text into individual turns and keep the Human/Assistant ones
    statements = [statement.strip() for statement in text.split("###")]
    statements = [s for s in statements if s.startswith(("Human", "Assistant"))]

    if len(statements) < 2:
        return ''

    # Keep only the first two turns and re-join them with the "###" separator
    conversation = " ### ".join(statements[:2])

    # Add "###" back to the beginning of the conversation
    return "### " + conversation


def generate_prompt(data_point):
    # Replace the raw text with its first exchange ('' if the example is filtered out)
    return {"text": extract_conversation(data_point['text'])}

num_rows = data.num_rows  # lower this (e.g. 500) for a quick experiment
new_data = (
    data.filter(lambda example, idx: idx < num_rows, with_indices=True)
    .map(generate_prompt)
    .filter(lambda example: len(example["text"]) > 0)
)

Filter:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9846 [00:00<?, ? examples/s]
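
To make the effect of the filtering step concrete, here is a small standalone sketch of the turn-splitting logic applied to a made-up conversation. It deliberately leaves out the language check, and `first_exchange` is an illustrative name rather than a function from the notebook:

```python
def first_exchange(text):
    """Return the first Human/Assistant pair of a '###'-delimited conversation, or ''."""
    turns = [t.strip() for t in text.split("###")]
    turns = [t for t in turns if t.startswith(("Human", "Assistant"))]
    if len(turns) < 2:
        return ""
    return "### " + " ### ".join(turns[:2])

sample = (
    "### Human: What is LoRA?"
    "### Assistant: A low-rank adapter method."
    "### Human: Thanks!"
    "### Assistant: You're welcome."
)
print(first_exchange(sample))
```

Follow-up turns are dropped, and an example with fewer than two turns comes back as an empty string, which the final `filter` call then removes.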

In [7]:
data['text'][0]

'### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, lead

In [8]:
new_data['text'][0]

'### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research. ### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, lea

In [9]:
data['text'][1]

'### Human: ¿CUales son las etapas del desarrollo y en qué consisten según Piaget?### Assistant: Jean Piaget fue un psicólogo suizo que propuso una teoría sobre el desarrollo cognitivo humano que consta de cuatro etapas:\n\nEtapa sensoriomotora (0-2 años): Durante esta etapa, el niño aprende a través de sus sentidos y movimientos. Descubre que sus acciones pueden tener un impacto en el entorno y comienza a formarse una idea básica de objetividad y continuidad.\n\nEtapa preoperatoria (2-7 años): En esta etapa, el niño comienza a desarrollar un pensamiento simbólico y a comprender que las cosas pueden representar a otras cosas. También comienzan a desarrollar un pensamiento lógico y a comprender conceptos como la causa y el efecto.\n\nEtapa de operaciones concretas (7-12 años): Durante esta etapa, el niño desarrolla un pensamiento lógico y comprende las relaciones causales. Empiezan a comprender que las cosas pueden tener múltiples perspectivas y que los conceptos pueden ser más complejo

In [10]:
new_data['text'][1]

'### Human: Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML? ### Assistant: Sure! Let\'s say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn\'t be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the mode

## Loading the model

In this section we will load the [Falcon 7B instruct model](https://huggingface.co/tiiuae/falcon-7b-instruct), quantize it to 4-bit and attach LoRA adapters to it. Let's get started!
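
As a rough illustration of why 4-bit quantization matters here, the sketch below estimates the memory needed just to hold the weights at different precisions. It assumes roughly 7 billion parameters and ignores activations, optimizer state, and quantization metadata, so treat the numbers as back-of-the-envelope estimates only.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 7e9  # assumption: nominal size of a "7B" model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```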

Load model & tokenizer

In [11]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [12]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Free any model left over from a previous run of this cell
try:
    del model
except NameError:
    pass
torch.cuda.empty_cache()
MODEL_NAME = "tiiuae/falcon-7b-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
    cache_dir=os.path.join(os.environ['PWD'],'hf/model_cache/')
)
print_trainable_parameters(model)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

print(f"Number in billion: {model.num_parameters()/ 1_000_000_000}")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

trainable params: 295768960 || all params: 3608744832 || trainable%: 8.195895630450604


Number in billion: 3.608744832
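
The count of about 3.6 billion may look surprising for a 7B model. A plausible explanation, which the arithmetic below supports, is that `bitsandbytes` stores two 4-bit values per byte, so each quantized linear layer reports half its logical number of elements, while the embeddings and layer norms stay unquantized and are the only tensors with `requires_grad=True` at this point. The architecture numbers used here (hidden size 4544, 32 blocks, vocabulary of 65024, fused QKV width 4672, 4x MLP expansion, no linear biases, tied output embedding) are assumptions based on the published Falcon-7B configuration, not values read from the notebook.

```python
HIDDEN = 4544               # assumption: Falcon-7B hidden size
LAYERS = 32                 # assumption: number of transformer blocks
VOCAB = 65024               # assumption: vocabulary size
QKV_OUT = HIDDEN + 2 * 64   # assumption: multi-query attention, one shared K/V head of size 64

# Linear weights per block: fused QKV, attention output, MLP up, MLP down
linear_per_block = HIDDEN * QKV_OUT + HIDDEN * HIDDEN + 2 * (HIDDEN * 4 * HIDDEN)
linear_total = LAYERS * linear_per_block

# Unquantized tensors: word embeddings plus LayerNorm weight and bias
embeddings = VOCAB * HIDDEN
layer_norms = (LAYERS + 1) * 2 * HIDDEN  # one norm per block plus the final norm
unquantized = embeddings + layer_norms

logical_total = linear_total + unquantized
reported_total = linear_total // 2 + unquantized  # packed: two 4-bit values per byte

print(f"logical parameters : {logical_total:,}")
print(f"reported by numel(): {reported_total:,}")
print(f"unquantized        : {unquantized:,}")
```

Under these assumptions the script reproduces both figures printed above (3,608,744,832 in total and 295,768,960 trainable) and gives a logical size of about 6.9 billion parameters.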


Below we will create the configuration for the LoRA model. According to the QLoRA paper, it is important to consider all linear layers in the transformer block for maximum performance, which here would mean adding the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules in addition to the mixed query key value layer. In the cell below those three entries are commented out, so only `query_key_value` is adapted; uncomment them to target all linear layers.

In [13]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
#         "dense",
#         "dense_h_to_4h",
#         "dense_4h_to_h",
    ]
)
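
To get a feel for how small the adapter is, recall that LoRA adds two low-rank matrices per targeted linear layer, `A` of shape `r x d_in` and `B` of shape `d_out x r`, for `r * (d_in + d_out)` extra parameters. The sketch below applies this to `query_key_value` only, as in the config above; the layer dimensions (input 4544, fused output 4672, 32 blocks) are assumptions based on the published Falcon-7B architecture.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

R = 64             # lora_r from the config above
D_IN = 4544        # assumption: Falcon-7B hidden size
D_OUT = 4672       # assumption: fused query/key/value output width
NUM_BLOCKS = 32    # assumption: number of transformer blocks

per_block = lora_param_count(D_IN, D_OUT, R)
total = NUM_BLOCKS * per_block
print(f"LoRA parameters per block: {per_block:,}")
print(f"LoRA parameters in total : {total:,}")
```

The total, 18,874,368, equals the difference between the parameter count the notebook prints for the quantized base model (3,608,744,832) and the count it prints after the trained adapter is loaded (3,627,619,200).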

## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), which wraps the transformers `Trainer` to make it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [14]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
logging_steps = 10
save_steps = 4*logging_steps
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100  # alternative: new_data.num_rows * 1 // per_device_train_batch_size
warmup_ratio = 0.03
lr_scheduler_type = "cosine"
do_eval = True
eval_steps = 4*logging_steps
evaluation_strategy = 'steps'


training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    report_to='none',
    do_eval=do_eval,
    eval_steps=eval_steps,
    evaluation_strategy=evaluation_strategy,
)
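
The `warmup_ratio` and `lr_scheduler_type` settings combine into a learning-rate curve that ramps up linearly and then decays along a half cosine. The following is a simplified sketch of that shape, not the scheduler code used by `Trainer`; it assumes the warmup lasts 3 steps (3% of 100) and that the rate decays to zero at `max_steps`.

```python
import math

def lr_at(step, max_steps=100, warmup_steps=3, peak_lr=2e-4):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 1, 3, 25, 50, 75, 100):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

With so few warmup steps, almost the entire run is spent on the decaying part of the curve.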

Then finally pass everything to the trainer.

In [15]:
from trl import SFTTrainer

max_seq_length = 512  # cap on tokens per example (instead of tokenizer.model_max_length)
train_testvalid = new_data.train_test_split(test_size=0.3)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_testvalid['train'],
    eval_dataset=train_testvalid['test'],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    # packing=True
)
model.config.use_cache = False



Map:   0%|          | 0/2474 [00:00<?, ? examples/s]

Map:   0%|          | 0/1061 [00:00<?, ? examples/s]

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [16]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        # nn.Module.to() casts the module's parameters in place
        module.to(torch.float32)
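
One way to see why normalization layers benefit from higher precision: half precision (float16) has only 10 mantissa bits, so sums and variances computed inside a norm can lose small contributions. The standard-library sketch below illustrates the rounding effect using `struct`'s half-precision format. It illustrates the number format only and does not involve the model.

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

big, small = 2048.0, 0.5
print("float16:", to_float16(big + small))  # the 0.5 is rounded away
print("float32:", to_float32(big + small))  # the 0.5 survives
```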

## Train the model

Now let's train the model! Simply call `trainer.train()`.

In [17]:
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 40   | 1.5619        | 1.491142        |
| 80   | 1.4961        | 1.472996        |


TrainOutput(global_step=100, training_loss=1.5516366863250732, metrics={'train_runtime': 1468.8156, 'train_samples_per_second': 1.089, 'train_steps_per_second': 0.068, 'total_flos': 1.63781843877888e+16, 'train_loss': 1.5516366863250732, 'epoch': 0.65})
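
The `'epoch': 0.65` in the output can be reproduced from the training arguments. The sketch below assumes a single-GPU run and takes the training-set size (2474 examples) from the `Map` progress line printed when the trainer was created.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 100
num_devices = 1          # assumption: single-GPU run
train_examples = 2474    # from the trainer's Map output for the train split

# Examples consumed per optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch_size * max_steps
epochs = examples_seen / train_examples

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen       : {examples_seen}")
print(f"epochs completed    : {epochs:.2f}")
```

In other words, 100 steps cover about two thirds of the training split, so the model has not yet seen every example once.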

## Save Trained Model

In [18]:
trainer.save_model("trained-model")

## Load Trained Model

In [19]:
from peft import PeftConfig, PeftModel

PEFT_MODEL = "trained-model"

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    cache_dir=os.path.join(os.environ['PWD'],'hf/model_cache/')
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path,
                                         )
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, PEFT_MODEL)

print(f"Model size: {model.get_memory_footprint()/ (1024 ** 3):.2f} GB")
print(f"Number in billion: {model.num_parameters()/ 1_000_000_000}")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model size: 3.71 GB
Number in billion: 3.6276192


## Inference

In [20]:
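import re

def process_prompt(text):
    """Split an example into the prompt fed to the model and the reference answer."""
    # re.DOTALL lets "." match newlines, so multi-line questions are handled
    pattern = r'### Human:(.*?)### Assistant:(.*)'
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("text has no '### Human: ... ### Assistant: ...' exchange")

    human_text = match.group(1).strip()
    answer = match.group(2).strip()
    return f"### Human: {human_text}### Assistant:", answer

In [23]:
%%time
train_index = 5
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
prompt, answer = process_prompt(new_data['text'][train_index])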
print(f"Model prompt: {prompt}")
print(f"Correct Answer: {answer}")

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,  
        max_new_tokens=200,
        temperature=0.1,  # low temperature keeps sampling close to greedy decoding
        top_p=0.75,  # sample only from the top tokens whose probabilities add up to 75%
        top_k=40,  # sample only from the 40 most likely tokens
        repetition_penalty=1.9,  # without a penalty, the output starts to repeat
        do_sample=True, 
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
text_with_prompt = tokenizer.decode(outputs[0], skip_special_tokens=True)
text = text_with_prompt[len(prompt):]
print(f"Model Response: {text}")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Model prompt: ### Human: Listened to Dvorak's "The New World" symphony, liked it much. What composers, not necessarily from Dvorak's time, wrote similar music? Provide a few suggestions, give composer names and their respective works. Also, what Dvorak's other works sound like the "New World"?### Assistant:
Correct Answer: : If you enjoyed Dvorak's "New World" Symphony, here are a few other composers and works you might enjoy:

1. Pyotr Ilyich Tchaikovsky - Symphony No. 5 in E minor, Op. 64
2. Jean Sibelius - Symphony No. 2 in D major, Op. 43
3. Aaron Copland - Appalachian Spring
4. Edward Elgar - Enigma Variations, Op. 36
5. Gustav Mahler - Symphony No. 1 in D major, "Titan"
6. Samuel Barber - Adagio for Strings

Regarding other works by Dvorak that have similar musical characteristics to the "New World" Symphony, here are some suggestions:

1. Dvorak - Symphony No. 8 in G major, Op. 88
2. Dvorak - String Quartet No. 12 in F major, Op. 96 "American"
3. Dvorak - Symphony No. 7 in D min
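
The `top_k` and `top_p` arguments used in the generation call restrict which tokens can be sampled at each step. Below is a simplified sketch of that filtering on a toy four-token distribution. It is not the implementation used by `generate`, and `filter_top_k_top_p` is a hypothetical helper.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalized probability mass reaches top_p. Returns {token index: probability}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they form a distribution again
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

toy_probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(toy_probs, top_k=3, top_p=0.75))
```

In this toy case only the two most likely tokens survive. A low `temperature` such as 0.1 sharpens the distribution before any filtering, so sampling stays close to always picking the most likely token.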

In [24]:
text_with_prompt

'### Human: Listened to Dvorak\'s "The New World" symphony, liked it much. What composers, not necessarily from Dvorak\'s time, wrote similar music? Provide a few suggestions, give composer names and their respective works. Also, what Dvorak\'s other works sound like the "New World"?### Assistant: Here are some composers who write music that sounds similar to Dvorak\'s "The New World":\n\n1. Gustav Mahler - Symphony No. 5 in C Minor\n2. Johann Sebastian Bach - Brandenburg Concertos Nos. 6-9\n3. Ludwig van Beethoven - Symphony No. 9 in D minor\n4. Wolfgang Amadeus Mozart - Symphony No. 40 in G Minor\n5. Johannes Brahms - Symphony No. 1 in E flat major\n6. Franz Joseph Haydn - Symphony No. 94 in B flat major\n7. George Frideric Handel - Messiah\n8. Antonio Vivaldi - The Four Seasons\n9. Johann Christian Reissig - Symphony No. 1 in D minor\n10. Johann Georg Pisendel - Symphony No. 1 in C major\n11. Johann Sebastian Bach - Symphony No. 3 in F'