## Fine-tune Falcon-7b (sharded version) in a Google Colab notebook

Project C - Team 4

## Setup

We use the `accelerate`, `peft`, `transformers`, `datasets` and `trl` libraries, the last for its `SFTTrainer`. We use `bitsandbytes` to quantize the base model to 4-bit (the QLoRA approach), and we install `einops` because loading Falcon models requires it.

In [None]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb




## Dataset load

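We use PubMedQA, a biomedical question-answering dataset with more than 200,000 artificially generated instances. This notebook loads a preprocessed version, `pbaoo2705/processed_dataset_v2`, whose train split contains 5,000 rows.

The original dataset can be found [here](https://huggingface.co/datasets/pubmed_qa).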

In [None]:
from datasets import load_dataset

dataset_name = "pbaoo2705/processed_dataset_v2"
dataset = load_dataset(dataset_name, split='train')
eval_dataset = load_dataset(dataset_name, split='test')

## Loading the model

In this section we load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b) from a sharded bf16 checkpoint, quantize it to 4-bit, and attach parameter-efficient fine-tuning (PEFT) adapters to it. Let's get started!
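
To get a feel for why 4-bit quantization matters on a Colab GPU, the sketch below estimates the memory taken by the weights alone at different precisions. The round figure of 7 billion parameters and the helper name `weight_memory_gib` are assumptions for illustration; the estimate ignores quantization constants, any layers that stay unquantized, activations, and optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bits_per_param / 8 / 2**30


# Assumed round figure for a "7B" model, not an exact parameter count
NUM_PARAMS = 7e9

for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>20}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 16-bit weights alone are close to the memory of a typical free Colab GPU, while the 4-bit weights leave room for activations and adapter training.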

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "cosmin/falcon-7b-sharded-bf16"

#Quantizer initialize
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

#Get model from HuggingFace's transformers library
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

You are using a model of type RefinedWebModel to instantiate a model of type falcon. This is not supported for all configurations of models and can yield errors.



Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

In [None]:
#Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Below we create the PEFT configuration. According to the QLoRA paper, it is important to adapt all linear layers in the transformer block for maximum performance, so a LoRA setup would add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules in addition to the fused query-key-value layer. Printing the model shows these layer names. Note that the configuration cell in this notebook uses p-tuning (`PromptEncoderConfig`) rather than LoRA, so no target modules are set there.

In [None]:
print(model)

FalconForCausalLM(
  (transformer): FalconModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x FalconDecoderLayer(
        (self_attention): FalconAttention(
          (maybe_rotary): FalconRotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): FalconMLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)
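
The layer shapes in this printout are enough to estimate how many trainable parameters LoRA adapters on all four linear layers would add. The sketch below does that arithmetic; the rank of 16 is an assumed value, and the function names are hypothetical helpers, not part of any library.

```python
# (in_features, out_features) of the Linear4bit layers in one decoder block,
# copied from the model printout.
LINEAR_SHAPES = {
    "query_key_value": (4544, 4672),
    "dense": (4544, 4544),
    "dense_h_to_4h": (4544, 18176),
    "dense_4h_to_h": (18176, 4544),
}
NUM_LAYERS = 32  # "(0-31): 32 x FalconDecoderLayer"


def lora_param_count(rank: int) -> int:
    """Trainable parameters if every listed layer gets a LoRA adapter.

    An adapter on an (in, out) linear layer adds two matrices of shapes
    (in, rank) and (rank, out), i.e. rank * (in + out) parameters.
    """
    per_block = sum(rank * (i + o) for i, o in LINEAR_SHAPES.values())
    return per_block * NUM_LAYERS


def base_linear_param_count() -> int:
    """Frozen parameters in the same linear layers of the base model."""
    per_block = sum(i * o for i, o in LINEAR_SHAPES.values())
    return per_block * NUM_LAYERS


rank = 16  # assumed LoRA rank
trainable = lora_param_count(rank)
frozen = base_linear_param_count()
print(f"LoRA parameters:   {trainable:,}")
print(f"Frozen parameters: {frozen:,}")
print(f"Trainable share:   {100 * trainable / frozen:.2f}%")
```

Under this assumption the adapters amount to well under one percent of the frozen linear weights, which is what makes fine-tuning feasible on a single GPU.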


In [None]:
from peft import PromptEncoderConfig

#Use a p-tuning (prompt encoder) config with 20 virtual tokens for fine-tuning this model
peft_config = PromptEncoderConfig(task_type="CAUSAL_LM", num_virtual_tokens=20, encoder_hidden_size=128)


## Loading the trainer

Here we use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), which wraps the transformers `Trainer` to make it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [None]:
from transformers import TrainingArguments

#Arguments needed for training process
output_dir = "falcon-7b-sharded"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
#QLoRA uses the paged AdamW 32-bit optimizer; this run uses Adafactor instead
optim = "adafactor"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"



training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    eval_accumulation_steps=1,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    report_to=None,
    lr_scheduler_type=lr_scheduler_type

)
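
To make these settings concrete, the sketch below works out the effective batch size and how much data the run covers. The single-GPU assumption is ours, and the row count is taken from the dataset printout in this notebook.

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 500
warmup_ratio = 0.03
num_train_rows = 5000  # from the dataset printout in this notebook
num_devices = 1        # assumption: a single Colab GPU

# Samples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Samples consumed over the whole run
samples_seen = effective_batch * max_steps
# Number of passes over the training split
epochs = samples_seen / num_train_rows
# Steps the warmup ratio corresponds to; to our understanding the plain
# "constant" schedule does not apply warmup ("constant_with_warmup" does)
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"samples seen:         {samples_seen}")
print(f"epochs:               {epochs:.1f}")
print(f"warmup steps:         {warmup_steps}")
```

With these values the run makes roughly one and a half passes over the training split.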

In [None]:
#Tokenize a sample conversation to inspect the resulting token ids
#print(dataset['text'][0])
tokens = tokenizer.tokenize("### Human: Can you write a short introduction about the relevance of the term \"monopsony\" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: \"Monopsony\" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.\n\nOverall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.\n\nReferences:\nBivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.### Human: Now explain it to a dog")
print(tokenizer.convert_tokens_to_ids(tokens))

[19468, 6823, 37, 1853, 299, 3131, 241, 1866, 9705, 544, 248, 23335, 275, 248, 1650, 204, 13, 3032, 387, 1799, 100, 13, 272, 17750, 42, 4012, 745, 6432, 3236, 271, 2879, 988, 387, 1799, 424, 272, 248, 17170, 1208, 273, 27421, 5228, 1959, 25, 19468, 12453, 37, 204, 13, 8350, 387, 1799, 100, 13, 11869, 271, 241, 1208, 4352, 881, 629, 304, 736, 532, 12639, 312, 241, 2057, 822, 379, 1506, 25, 529, 17750, 23, 414, 1650, 304, 4235, 5228, 272, 248, 5477, 1208, 23, 881, 241, 988, 387, 1799, 100, 10679, 504, 2589, 1484, 648, 248, 17324, 273, 1660, 3192, 275, 525, 3900, 25, 390, 5312, 275, 241, 988, 387, 1799, 100, 418, 1226, 272, 3089, 17324, 273, 6511, 8093, 4269, 312, 4862, 23, 345, 248, 10679, 504, 1278, 19785, 271, 2638, 17324, 379, 1586, 1286, 1660, 3192, 25, 193, 193, 15770, 1959, 504, 6935, 2879, 988, 387, 1799, 424, 272, 9462, 963, 345, 5543, 273, 2794, 1655, 23, 881, 241, 1218, 1902, 2404, 1873, 241, 2589, 8091, 275, 248, 1208, 204, 19, 45, 419, 594, 204, 17, 22399, 2776, 23, 204, 626,

Finally, we pass everything to the trainer.

In [None]:
!pip install evaluate



In [None]:
from trl import SFTTrainer

max_seq_length = 512
eval_subset = eval_dataset.select(range(200))

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    eval_dataset=eval_subset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments
)



We will also pre-process the model by upcasting the layer norms to float32 for more stable training. First, a quick look at the training dataset:

In [None]:
print(dataset)

Dataset({
    features: ['pubid', 'question', 'context', 'long_answer', 'final_decision', 'text'],
    num_rows: 5000
})


In [None]:
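#Upcast normalization layers (module names containing "norm") to float32
for name, module in trainer.model.named_modules():
    if "norm" in name:
        #nn.Module.to() converts parameters in place, so no reassignment is needed
        module.to(torch.float32)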

## Train the model

Now let's train the model! Simply call `trainer.train()`

In [None]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33mhungnguyennguyen200504[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
trainer.train()

Step,Training Loss,Validation Loss


TypeError: ignored

In [None]:
metrics = trainer.evaluate()
print(metrics)


During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

In [None]:
!pip install huggingface_hub



In [None]:
#Login to huggingface
from huggingface_hub import notebook_login

notebook_login()


In [None]:
#Push model to hub
trainer.push_to_hub("falcon-7b-sharded-adafactor-constant-p-tuning")

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.09k [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/364k [00:00<?, ?B/s]

'https://huggingface.co/hung200504/falcon-7b-sharded/tree/main/'