# Local Fine-Tuning a foundation model for multiple tasks (with QLoRA)
The following notebook is an example of performing QLoRA fine-tuning on an LLM using an instruction-following dataset. It produces the same instruction-following adapter as the one in the amp_adapters_prebuilt directory and the one built by the CML Job "Job for fine-tuning on Instruction Dataset".

Note: This notebook does not distribute fine-tuning across multiple CML Workers. That requires launching the fine-tuning Python scripts with the Hugging Face accelerate CLI. See the implementation README in dsitributed_peft_scripts for a description of launching fine-tuning with accelerate.

## Part 0: Install Dependencies

In [1]:
!pip install -q --no-cache-dir -r requirements.txt

## Part 1: Parameter Efficient Fine-tuning 

### Import the required libraries

In [None]:
import bitsandbytes as bnb
import datasets
import torch
import torch.nn as nn
from peft import get_peft_model, LoraConfig, PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, DataCollatorForLanguageModeling
from trl import SFTTrainer

### Load the tokenizer and base model in quantized mode

In [3]:
base_model = "bigscience/bloom-1b1"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token


# Configuration to load the model in 4bit quantized mode
# bitsandbytes is used for loading the base model we want to fine-tune in nf4 (4-bit normal float as described in the QLoRA paper)
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model, 
    quantization_config=bnb_config,
    device_map='auto',
)
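
To get a feel for why 4-bit loading matters, the following is a rough back-of-the-envelope sketch of the memory needed for the weights alone at different precisions. The parameter count is an assumption (roughly 1.1 billion, inferred from the "1b1" in the model name), and the estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 1.1 billion parameters, inferred from the model name.
approx_params = 1_100_000_000

for label, bits in [("float32", 32), ("float16", 16), ("nf4", 4)]:
    print(f"{label:>8}: {weight_memory_gib(approx_params, bits):.2f} GiB")
```

On the loaded model itself, `model.get_memory_footprint()` from transformers reports the actual footprint in bytes.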

### Get Peft Model with LoRA training configuration
The peft library from Hugging Face provides the get_peft_model function, which wraps the quantized base model in a fine-tunable model object using the LoRA parameters we specify in a LoraConfig.

In [4]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
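
As a rough illustration of what this configuration adds, the sketch below counts the trainable parameters that LoRA attaches to the targeted layers. The architecture numbers are assumptions not stated in this notebook (hidden size 1536, 24 transformer blocks, and a fused query_key_value projection mapping 1536 to 3 * 1536); only `r` and `lora_alpha` come from the config above.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns two low-rank matrices: A with shape (r, d_in) and
    B with shape (d_out, r), so the total is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


# Assumed architecture values for bloom-1b1 (hypothetical, check model.config).
hidden_size = 1536
num_layers = 24

# Values taken from the LoraConfig in this notebook.
r = 16
lora_alpha = 32

per_layer = lora_param_count(hidden_size, 3 * hidden_size, r)
total = per_layer * num_layers
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"LoRA update scaling:       {scaling}")
```

To check the real numbers on the wrapped model, peft offers `model.print_trainable_parameters()`.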

### Get and modify dataset
The dataset we are using for fine-tuning is split up into columns. We need to map each example row into a single string that includes the special tokens we want to use later for prompting. SFTTrainer (from the trl library) trains on these strings during fine-tuning.

In [None]:
# Use only 30% of the dataset
dataset_fraction = 30
data = datasets.load_dataset('teknium/GPTeacher-General-Instruct', split=f'train[:{dataset_fraction}%]')

# Merge function to combine two columns of the dataset to have examples that look like
#<Instruction>: %s
#<Input>: %s
#<Response>: %s
#    or
#<Instruction>: %s
#<Response>: %s
def merge_columns(example):
    if example["input"]:
        prediction_format = """<Instruction>: %s
<Input>: %s
<Response>: %s"""
        example["prediction"] = prediction_format % (example["instruction"], example["input"], example["response"])
    else:
        prediction_format = """<Instruction>: %s
<Response>: %s"""
        example["prediction"] = prediction_format % (example["instruction"], example["response"])
    return example

finetuning_data = data.map(merge_columns)
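
To see what the merged training strings look like without downloading the dataset, here is a small standalone sketch of the same template. The `format_example` helper and the two rows are hypothetical and exist only for illustration; the column names mirror the ones used above.

```python
def format_example(instruction: str, response: str, input_text: str = "") -> str:
    """Build one training string; the <Input> line is omitted when input is empty."""
    lines = [f"<Instruction>: {instruction}"]
    if input_text:
        lines.append(f"<Input>: {input_text}")
    lines.append(f"<Response>: {response}")
    return "\n".join(lines)


# Hypothetical rows, one with an input and one without.
rows = [
    {"instruction": "Translate to French.", "input": "Good morning", "response": "Bonjour"},
    {"instruction": "Name a primary color.", "input": "", "response": "Red"},
]

for row in rows:
    print(format_example(row["instruction"], row["response"], row["input"]))
    print("---")
```

At inference time the prompt follows the same template but stops after `<Response>:`, so the model fills in the response.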

### Set up SFTTrainer for PEFT fine-tuning
Specify all of the fine-tuning options in TrainingArguments (transformers library) and SFTTrainer (trl library).

In [None]:
# TrainingArguments from the huggingface transformers library
training_args = TrainingArguments(
    output_dir="outputs",
    num_train_epochs=1,
    optim="paged_adamw_32bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4 per device
    warmup_ratio=0.03,              # only takes effect with a warmup-aware scheduler such as "constant_with_warmup"
    max_grad_norm=0.3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    lr_scheduler_type="constant",
    disable_tqdm=True,
    report_to='tensorboard',
)

# SFTTrainer from the huggingface trl library
trainer = SFTTrainer(
    model=model,                       # The model we loaded in quantized mode with lora configuration
    train_dataset=finetuning_data,     # The downloaded dataset to use for training
    dataset_text_field = "prediction", # The column which contains examples used for training
    peft_config=lora_config,
    tokenizer=tokenizer,
    packing=True,                      # Pack multiple fine-tuning examples into the context window for the base-model (cutting down on time to fine-tune)
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
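
The idea behind `packing=True` can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not TRL's actual implementation: tokenized examples are concatenated with an end-of-sequence token between them and then cut into fixed-size blocks. The `pack_sequences` helper, the token ids, and the block size are all hypothetical.

```python
from typing import Iterable, List


def pack_sequences(token_seqs: Iterable[List[int]], eos_id: int, block_size: int) -> List[List[int]]:
    """Concatenate tokenized examples, separated by eos_id, and cut into full blocks."""
    buffer: List[int] = []
    for seq in token_seqs:
        buffer.extend(seq)
        buffer.append(eos_id)
    # Keep only complete blocks; the trailing remainder is dropped in this sketch.
    num_blocks = len(buffer) // block_size
    return [buffer[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]


# Hypothetical token ids for four short examples.
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
blocks = pack_sequences(examples, eos_id=0, block_size=4)
print(blocks)
```

Because every block is full, no compute is spent on padding tokens, which is where the time saving comes from.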

### Launch fine-tuning
Fine-tuning takes approximately 14 minutes on a V100 GPU.

In [None]:
trainer.train()

### Save adapter
Save the fine-tuned adapter to a directory on disk so it can be loaded later at inference time.

NOTE: SFTTrainer's save_model() saves the adapter only, not the base model.

In [8]:
trainer.save_model("amp_adapters_custom/bloom1b1-lora-instruct-notebook")
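
A quick way to confirm that only the adapter was written is to list the output directory and check its size. The sketch below uses only the standard library; `dir_size_mb` is a hypothetical helper, and the exact file names inside the directory depend on the installed peft version, so none are assumed here.

```python
import os


def dir_size_mb(path: str) -> float:
    """Total size of all files under path, in MiB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1024**2


adapter_dir = "amp_adapters_custom/bloom1b1-lora-instruct-notebook"
if os.path.isdir(adapter_dir):
    for name in sorted(os.listdir(adapter_dir)):
        print(name)
    print(f"Total size: {dir_size_mb(adapter_dir):.1f} MiB")
else:
    print(f"{adapter_dir} not found; run the save cell first.")
```

The adapter should be far smaller than the base model, since it holds only the LoRA matrices.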

## Part 2: Inference Comparison (Base Model vs Base Model + Adapter)

### Reset CUDA device for inferencing

Remove the previously loaded assets (from Part 1) to free up GPU memory and start from a clean state when loading resources for inference.

In [9]:
del trainer
del model
del tokenizer
import gc
gc.collect()
torch.cuda.empty_cache()

### Load base model and tokenizer

In [10]:
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", return_dict=True, device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")

### Load the fine-tuned adapter for use with the base model

In [11]:
model = PeftModel.from_pretrained(model=model,                                                     # The base model to load fine-tuned adapters with
                                  model_id="amp_adapters_custom/bloom1b1-lora-instruct-notebook",  # The directory path of the fine-tuned adapter built in Part 1
                                  adapter_name="bloom1b1-lora-instruct-notebook",                  # A label for this adapter to enable and disable on demand later
)

### Define an instruction-following test prompt

In [12]:
prompt = """<Instruction>: Classify the following items into two categories: fruits and vegetables.
<Input>: tomato, apple, cucumber, carrot, banana, zucchini, strawberry, cauliflower
<Response>:"""
batch = tokenizer(prompt, return_tensors='pt')
batch = batch.to('cuda')

#### Base Model Response

In [13]:
# Inference with base model only:

with model.disable_adapter():
    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, max_new_tokens=60)
    prompt_length = len(prompt)
    print(tokenizer.decode(output_tokens[0], skip_special_tokens=True)[prompt_length:])

 green, yellow, red, orange, red, yellow, green, blue, yellow, red, orange, red, yellow, green, blue, yellow, red, orange, red, yellow, green, blue, yellow, red, orange, red, yellow, green, blue, yellow,


^ The base model shows no ability to follow the instructions in the prompt.

#### Fine-tuned adapter Response

In [14]:
# Inference with fine-tuned adapter:
model.set_adapter("bloom1b1-lora-instruct-notebook")
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=60)
prompt_length = len(prompt)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True)[prompt_length:])

 Fruits: Tomato, Apple, Cucumber, Carrot, Banana, Zucchini, Strawberry, Cauliflower. Vegetables: Tomato, Apple, Cucumber, Carrot, Banana, Zucchini, Strawberry, Cauliflower


^ This is not a perfect response, but it is a good step towards a usable instruction-following LLM.
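
Both inference cells strip the prompt from the decoded text by character count, which relies on the tokenizer decoding the prompt back to exactly the same string. A possible alternative, sketched here with plain lists standing in for tensors, is to slice at the token level, since a causal LM's generate output begins with the prompt tokens. The `new_token_ids` helper and the ids are hypothetical.

```python
from typing import List, Sequence


def new_token_ids(output_ids: Sequence[int], prompt_ids: Sequence[int]) -> List[int]:
    """Return the ids generated after the prompt, assuming output_ids starts with prompt_ids."""
    n = len(prompt_ids)
    if list(output_ids[:n]) != list(prompt_ids):
        raise ValueError("output does not start with the prompt")
    return list(output_ids[n:])


# Hypothetical token ids standing in for batch["input_ids"][0] and output_tokens[0].
prompt_ids = [5, 6, 7]
output_ids = [5, 6, 7, 42, 43]
print(new_token_ids(output_ids, prompt_ids))
```

In this notebook, the equivalent would be to decode `output_tokens[0][batch["input_ids"].shape[1]:]` instead of slicing the decoded string.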