# Training Mistral-7b on a Single GPU using PEFT LORA with Google Colab (Free Version)
In this notebook, I will show you how to fine-tune Mistral-7b using the recent peft library, with bitsandbytes for loading large models in 4-bit.

The fine-tuning relies on a method called "Low-Rank Adaptation" (LoRA): instead of fine-tuning the entire model, you only have to fine-tune small adapter matrices and load them properly inside the model. After fine-tuning, you can also share your adapters on the 🤗 Hub and load them very easily. Let's get started!
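
To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. It assumes the usual formulation h = Wx + (alpha / r) * B(Ax), with W frozen and B initialised to zero. The matrices, sizes, and helper names are toy values chosen for illustration; none of them come from Mistral-7b or from the peft library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)                # frozen pretrained weights
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes (illustrative assumptions): 3 inputs, 2 outputs, rank 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]       # frozen, shape (2, 3)
A = [[0.5, 0.5, 0.5]]       # trainable, shape (r, 3)
B_init = [[0.0], [0.0]]     # trainable, shape (2, r), starts at zero
x = [1.0, 2.0, 3.0]

# With B at zero the adapter is a no-op, so the output equals W @ x.
y_start = lora_forward(W, A, B_init, x, alpha=2, r=1)

# Once training has moved B away from zero, the output shifts.
B_trained = [[0.01], [-0.02]]
y_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(y_start, y_trained)
```

Only A and B would receive gradients during training, which is why the adapters stay small compared with the full weight matrix.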

## Step 0 - Define some helper functions
1. Enable text wrapping so we don't have to scroll horizontally.
2. Define a wrapper function that passes our query to the model for inference and returns the decoded completion (response).


In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)

Let's define a wrapper function that takes a user question and returns the model's completion.

In [None]:
def get_completion(query: str, model, tokenizer) -> str:
  device = "cuda:0"

  prompt_template = """
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Question:
  {query}

  ### Answer:
  """
  prompt = prompt_template.format(query=query) # this is just plain old string formatting; don't confuse this with PromptTemplate in LangChain

  encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

  model_inputs = encodeds.to(device)


  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
  decoded = tokenizer.batch_decode(generated_ids)
  return decoded[0]  # full decoded sequence, including the prompt
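
Since the decoded string contains the prompt as well as the completion, it can be handy to keep only the answer part. Below is a small sketch of a hypothetical helper, extract_answer, which is not part of the original notebook. It assumes the "### Answer:" marker from the template above and the "<s>" / "</s>" special-token strings of the Mistral tokenizer.

```python
def extract_answer(decoded: str, marker: str = "### Answer:") -> str:
    """Return the text after the last answer marker, without special tokens."""
    # Keep only what follows the marker; fall back to the full text if absent.
    _, found, tail = decoded.rpartition(marker)
    answer = tail if found else decoded
    # Remove the BOS/EOS strings (assumed to be "<s>" and "</s>").
    for token in ("<s>", "</s>"):
        answer = answer.replace(token, "")
    return answer.strip()


# Hypothetical decoded output, shaped like what get_completion returns.
sample = (
    "<s> Below is an instruction that describes a task.\n"
    "  ### Question:\n  What is a bond?\n\n"
    "  ### Answer:\n  A bond is a loan to an issuer.</s>"
)
answer = extract_answer(sample)
print(answer)
```

You could call it as extract_answer(get_completion(...)) when you only want to display the response.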

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

## Step 1 - Install necessary packages
First, install the dependencies below to get started.

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

## Step 2 - Model loading
We'll load the model with the 4-bit NF4 quantization used by QLoRA, through bitsandbytes, to reduce memory usage.


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
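
To get a feel for why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count of about 7.24 billion is an approximation, and the estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so treat it as a lower bound rather than a measurement.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7.24e9  # approximate parameter count of Mistral-7b (assumption)

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # half-precision weights
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights need a quarter of the memory of the 16-bit ones, which is what makes a single small GPU workable.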

Now we specify the model ID and then we load it with our previously defined quantization configuration.

In [None]:
model_id = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

# model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}) # best practice for Colab

Run an inference on the base model. The model does not seem to understand our instruction and gives us a list of questions related to our query.

In [None]:
result = get_completion(query="Will capital gains affect my tax bracket?", model=model, tokenizer=tokenizer)
print(result)

## Step 3 - Load dataset for finetuning

Let's load a finance dataset to fine-tune our model on basic finance knowledge. The cell below loads the full train split of gbharti/finance-alpaca; for the sake of the demo we later train for only a limited number of steps, just to showcase how to use this integration with existing tools in the HF ecosystem.

In [None]:
from datasets import load_dataset

data = load_dataset("gbharti/finance-alpaca", split='train')

# Explore the data
df = data.to_pandas()
df.head(10)

Instruction fine-tuning: format each sample as a "prompt" so the model can better understand the task:
1. Use the generate_prompt function to build a prompt from the instruction, the optional context, and the output
2. Shuffle the dataset
3. Tokenize the dataset

In [None]:
def generate_prompt(data_row):
    """Generate the training text from a prompt header, the task instruction, optional context, and the answer.

    :param data_row: dict: Data point with 'instruction', 'input', and 'output' keys
    :return: str: Formatted prompt text
    """
    # Samples with additional context info.
    if data_row['input']:
        text = 'Below is an instruction that describes a task, paired with an input that provides' \
               ' further context. Write a response that appropriately completes the request.\n\n'
        text += f'### Instruction:\n{data_row["instruction"]}\n\n'
        text += f'### Context:\n{data_row["input"]}\n\n'
        text += f'### Response:\n{data_row["output"]}'

    # Samples without additional context
    else:
        text = 'Below is an instruction that describes a task. Write a response that ' \
               'appropriately completes the request.\n\n'
        text += f'### Instruction:\n{data_row["instruction"]}\n\n'
        text += f'### Response:\n{data_row["output"]}'
    return text

# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in data]
data = data.add_column("prompt", text_column)
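
Here is a quick standalone check of the two prompt shapes. The function build_prompt is a condensed restatement of generate_prompt written for this sketch, and the two rows are made-up examples rather than samples from the dataset.

```python
def build_prompt(row: dict) -> str:
    """Condensed restatement of generate_prompt, for a standalone check."""
    if row["input"]:
        header = ("Below is an instruction that describes a task, paired with an input "
                  "that provides further context. Write a response that appropriately "
                  "completes the request.\n\n")
        context = f'### Context:\n{row["input"]}\n\n'
    else:
        header = ("Below is an instruction that describes a task. Write a response that "
                  "appropriately completes the request.\n\n")
        context = ""
    return (header
            + f'### Instruction:\n{row["instruction"]}\n\n'
            + context
            + f'### Response:\n{row["output"]}')


# Hypothetical rows with the same keys as the finance dataset.
row_with_context = {
    "instruction": "Summarize the statement.",
    "input": "Revenue grew 10% year over year.",
    "output": "Revenue increased by a tenth.",
}
row_without_context = {
    "instruction": "What is a dividend?",
    "input": "",
    "output": "A payment a company makes to its shareholders.",
}

prompt_a = build_prompt(row_with_context)
prompt_b = build_prompt(row_without_context)
print(prompt_a)
print("-" * 40)
print(prompt_b)
```

Rows with an empty input field skip the "### Context:" section entirely, so the model sees two slightly different layouts during training.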

In [None]:
str(data['prompt'])

In [None]:
print(data[10000]['prompt'])  # inspect one formatted prompt (the column already holds the generated text)

We'll need to tokenize our data so the model can process it.


In [None]:
data = data.shuffle(seed=42)  # Shuffle dataset here
data = data.map(lambda samples: tokenizer(samples["prompt"]), batched=True) # tokenize all rows in the data of the prompt column

Split dataset into 90% for training and 10% for testing

In [None]:
data = data.train_test_split(test_size=0.1)
train_data = data["train"]
test_data = data["test"]
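
The shuffle-then-split logic can be sketched in plain Python as follows. This is only an analogue for illustration and not how the datasets library implements it; the helper name and the toy rows are assumptions.

```python
import random


def shuffle_and_split(rows, test_size=0.1, seed=42):
    """Shuffle rows reproducibly, then hold out a fraction for testing."""
    rows = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(rows)      # seeded shuffle, repeatable
    n_test = round(len(rows) * test_size)  # number of held-out rows
    return rows[n_test:], rows[:n_test]


# Toy "dataset" of 100 row ids.
toy_rows = range(100)
train_rows, test_rows = shuffle_and_split(toy_rows)
train_again, test_again = shuffle_and_split(toy_rows)

print(len(train_rows), len(test_rows))
```

Because the seed is fixed, the same rows end up in the test set on every run. In datasets, train_test_split accepts a seed argument for the same purpose; the call above does not pass one, so its split can differ between runs.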

In [None]:
print(test_data)

## Step 4 - Apply LoRA
Here comes the magic with peft! Let's wrap the model in a PeftModel and specify that we are going to use low-rank adapters (LoRA), using the get_peft_model utility function and the prepare_model_for_kbit_training method from PEFT.

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable() # recompute activations during the backward pass instead of storing them, trading extra compute for lower memory use
model = prepare_model_for_kbit_training(model)

In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj","o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

peft_model = get_peft_model(model, lora_config)
print_trainable_parameters(peft_model)
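
As a sanity check on the printed number, the trainable parameter count can be estimated by hand: each adapted linear layer of shape (in, out) adds r * (in + out) parameters. The sketch below assumes the published Mistral-7b dimensions (hidden size 4096, 8 key/value heads of size 128, 32 layers); these values are not stated in this notebook, so verify them against the model config.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r, in), B is (out, r)."""
    return r * (in_features + out_features)


# Assumed Mistral-7b dimensions (check model.config before relying on them).
HIDDEN = 4096      # hidden_size
KV_DIM = 8 * 128   # num_key_value_heads * head_dim
LAYERS = 32        # num_hidden_layers
R = 8              # rank used in lora_config

# (in_features, out_features) of each targeted projection.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_param_count(i, o, R) for i, o in shapes.values())
total = per_layer * LAYERS
print(f"per layer: {per_layer:,} | total: {total:,}")
```

Under these assumptions the adapters add about 6.8 million trainable parameters. Note that on a 4-bit model the "all params" figure printed by the helper may undercount, since the quantized weights are stored packed, so the reported percentage is likely somewhat inflated.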

Add the adapter to the model:

In [None]:
model.add_adapter(lora_config, adapter_name="lora_adapter")

## Step 5 - Run the training!

In [None]:
from huggingface_hub import notebook_login
notebook_login()

Setting the training arguments:
* For demo purposes, we train for only a small number of steps (max_steps=100 in the configuration below), just to showcase how to use this integration with existing tools in the HF ecosystem.

In [None]:
!pip install -q trl==0.12.0

In [None]:
# Reload the model and place it entirely on GPU 0 to avoid the error "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" when resuming training
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

In [None]:
# code using SFTTrainer
import transformers

from trl import SFTTrainer, SFTConfig

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

# https://huggingface.co/docs/trl/en/sft_trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="prompt",
    peft_config=lora_config,
    args=SFTConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=1,
        max_steps=100,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        save_strategy="steps",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
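
To see what these arguments imply, the sketch below works out the effective batch size, the number of samples seen, and the learning rate at a few steps. It assumes a single GPU and a linear schedule with warmup, which I believe is the Trainer default when lr_scheduler_type is not set. The function is an illustrative re-implementation, not the library's code.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values copied from the SFTConfig above.
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
MAX_STEPS = 100
WARMUP_STEPS = 1
BASE_LR = 2e-4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM  # single GPU assumed
samples_seen = effective_batch * MAX_STEPS

print(f"effective batch size: {effective_batch}")
print(f"samples seen in training: {samples_seen}")
for step in (0, 1, 50, 100):
    lr = linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, MAX_STEPS)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

With these settings the run touches only a few hundred samples, which is a small fraction of the training split, so expect a demo-quality adapter rather than a fully trained one.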

Start the training

In [None]:
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Share adapters on the 🤗 Hub

In [None]:
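trainer.model.push_to_hub("mistral_7b_finance_finetuned_test")  # push the PEFT-wrapped model held by the trainer so that only the adapter weights are uploaded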
tokenizer.push_to_hub("mistral_7b_finance_finetuned_test")

## Step 6 - Evaluate the model qualitatively: run an inference!



In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

Load the adapters directly from the Hub using the code below.

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

peft_model_id = "sampurnr/mistral_7b_finance_finetuned_test"
config = PeftConfig.from_pretrained(peft_model_id)
# Request 4-bit loading through a BitsAndBytesConfig; the bare load_in_4bit=True keyword is deprecated
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, quantization_config=BitsAndBytesConfig(load_in_4bit=True), device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)

You can then use the trained model loaded from the 🤗 Hub for inference, just as you usually would in transformers.

In [None]:
result = get_completion(query="Will capital gains affect my tax bracket?", model=model, tokenizer=tokenizer)
print(result)