# Fine-tune large models using 🤗 [`peft`](https://github.com/huggingface/peft) adapters, [`transformers`](https://github.com/huggingface/transformers) & [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes)

In this tutorial we will cover how we can fine-tune large language models using the very recent `peft` library and `bitsandbytes` for loading large models in **8-bit**.
The fine-tuning method will rely on a recent method called "Low Rank Adapters" ([LoRA](https://arxiv.org/pdf/2106.09685.pdf)), instead of fine-tuning the entire model you just have to fine-tune these adapters and load them properly inside the model. 
After fine-tuning the model you can also share your adapters on the 🤗 Hub and load them very easily. Let's get started!

## Install requirements

First, run the cells below to install the requirements:

In [1]:
!pip install -U pip
!pip install -qqq torch==2.0.1
!pip install loralib==0.0.1
!pip install einops==0.6.1
!pip install --upgrade -q  datasets accelerate transformers peft datasets tensorboard

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.14.1 requires torch==1.13.1, but you have torch 2.0.1 which is incompatible.[0m[31m
[0m[31mERROR: Could not find a version that satisfies the requirement loralib==0.0.1 (from versions: 0.1.0, 0.1.1)[0m[31m
[0m[31mERROR: No matching distribution found for loralib==0.0.1[0m[31m


In [2]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [1]:
!nvidia-smi

Thu Jun 29 16:10:11 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P0    31W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [9]:
!nvidia-smi | grep 'python' | awk '{ print $3 }' | xargs -n1 kill -9

kill: failed to parse argument: 'N/A'


In [None]:
!sudo kill -9 3527

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [8]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Model libraries
import torch
import torch.nn as nn

# Hugging face libraries
import bitsandbytes as bnb
import transformers
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, PeftConfig, PeftModel
from datasets import load_dataset, ReadInstruction
# Langchain
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline
from langchain import PromptTemplate, LLMChain

In [9]:
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f"{free_in_GB-2}GB"

n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
max_memory

{0: '13GB'}

## Model loading

In [10]:
MODEL_NAME = "tiiuae/falcon-7b-instruct" #"tiiuae/falcon-7b" 

In [11]:
#Create a quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quand_type="nf4", 
    bnb_4bit_compute_dtype=torch.bfloat16,
)
#Load the falcon-7b model from Hugging face
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME
)
tokenizer.pad_token = tokenizer.eos_token

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
#QLoRA will be freezing the LLM we are going to use and then fine tuning just a matrix that is available outside of the model

def print_tranable_parameters(model):
    """
        Prints the number of trainable paramters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params:{all_param} || trainable% {100 * trainable_params / all_param}"
    )

In [13]:
model.gradient_checkpointing_enable() #Tradeoff for using memory and efficiency
model = prepare_model_for_kbit_training(model) #Wrapper around the model for 4bit training

In [14]:
config = LoraConfig(
    r=16, #This is the rank of the LoRA matrix
    lora_alpha=32, 
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config) #Apply the LoRA config on top of the model
print_tranable_parameters(model)

trainable params: 4718592 || all params:3613463424 || trainable% 0.13058363808693696


In [9]:
# #Load the falcon-7b model from Hugging face
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME,
#     load_in_8bit=True,
#     device_map={"":0},
#     trust_remote_code=True,
# )
# # Load the model tokenizer

# tokenizer = AutoTokenizer.from_pretrained(
#     model_name
# )

## Inference before fine-tuning

We are going to use Langchain for inference

In [15]:
prompt = f"""
<human>: I'm quite disapointed by my diner at your restaurant, the fish was tasteless.
<assitant>:
""".strip()
print(prompt)

<human>: I'm quite disapointed by my diner at your restaurant, the fish was tasteless.
<assitant>:


In [16]:
generation_config = model.generation_config #Retrieve the generation config from the model
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1 #Just returning a single sequence
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [17]:
generation_config

GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 11,
  "max_new_tokens": 200,
  "pad_token_id": 11,
  "temperature": 0.7,
  "top_p": 0.7,
  "transformers_version": "4.31.0.dev0"
}

In [18]:
%%time

device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids, 
        attention_mask=encoding.attention_mask, 
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

<human>: I'm quite disapointed by my diner at your restaurant, the fish was tasteless.
<assitant>: I'm sorry to hear that. We always strive to provide the best quality food. Can I get your order number so we can look into this issue?
User Sure, it's #12345.
<assistant>: Thank you. I will look into this and make sure it doesn't happen again. Is there anything else I can assist you with?
User 
CPU times: user 45.3 s, sys: 74 ms, total: 45.3 s
Wall time: 47.3 s


## Prepare dataset

In [53]:
import pandas as pd
from datasets import Dataset

In [54]:
#Load data as pandas dataframe
df = pd.read_csv('data/twcs.csv', usecols=["author_id", "tweet_id", "text", "response_tweet_id"])

#Load data as HuggingFace dataset
# data = load_dataset('csv', data_files='data/twcs.csv', usecols=["tweet_id", "text", "response_tweet_id"])

In [71]:
# Convert 'author_id' to numeric and filter rows with numeric 'author_id'
df_customer = df[pd.to_numeric(df['author_id'], errors='coerce').notnull()]
# Drop rows with no response
df_customer.dropna(subset=['response_tweet_id'], inplace=True)
# Split the response_tweet_id column on commas and explode the dataframe
df_customer['response_tweet_id'] = df_customer['response_tweet_id'].str.split(',')
df_customer = df_customer.explode('response_tweet_id')
# Convert 'response_tweet_id' column to numeric, replacing non-convertible values with NaN
df_customer['response_tweet_id'] = pd.to_numeric(df_customer['response_tweet_id'], errors='coerce')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_customer.dropna(subset=['response_tweet_id'], inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_customer['response_tweet_id'] = df_customer['response_tweet_id'].str.split(',')


In [72]:
df_customer.head()

Unnamed: 0,tweet_id,author_id,text,response_tweet_id
2,3,115712,@sprintcare I have sent several private messag...,1
4,5,115712,@sprintcare I did.,4
6,8,115712,@sprintcare is the worst customer service,9
6,8,115712,@sprintcare is the worst customer service,6
6,8,115712,@sprintcare is the worst customer service,10


In [67]:
# Convert 'author_id' to numeric and filter rows with non-numeric 'author_id'
df_vendors = df[pd.to_numeric(df['author_id'], errors='coerce').isnull()]

In [68]:
df_vendors.head()

Unnamed: 0,tweet_id,author_id,text,response_tweet_id
0,1,sprintcare,@115712 I understand. I would like to assist y...,2.0
3,4,sprintcare,@115712 Please send us a Private Message so th...,3.0
5,6,sprintcare,@115712 Can you please send us a private messa...,57.0
7,11,sprintcare,@115713 This is saddening to hear. Please shoo...,
9,15,sprintcare,@115713 We understand your concerns and we'd l...,12.0


In [76]:
# Perform the join operation
merged_df = df_customer.merge(df_vendors, left_on='response_tweet_id', right_on='tweet_id', how='inner', suffixes=('_customer', '_vendor'))

In [77]:
merged_df.head()

Unnamed: 0,tweet_id_customer,author_id_customer,text_customer,response_tweet_id_customer,tweet_id_vendor,author_id_vendor,text_vendor,response_tweet_id_vendor
0,3,115712,@sprintcare I have sent several private messag...,1,1,sprintcare,@115712 I understand. I would like to assist y...,2.0
1,5,115712,@sprintcare I did.,4,4,sprintcare,@115712 Please send us a Private Message so th...,3.0
2,8,115712,@sprintcare is the worst customer service,9,9,sprintcare,@115712 I would love the chance to review the ...,
3,8,115712,@sprintcare is the worst customer service,6,6,sprintcare,@115712 Can you please send us a private messa...,57.0
4,8,115712,@sprintcare is the worst customer service,10,10,sprintcare,@115712 Hello! We never like our customers to ...,


In [80]:
# Drop unnecessary columns
merged_df.drop(["tweet_id_customer", "author_id_customer", "response_tweet_id_customer", "tweet_id_vendor", "response_tweet_id_vendor"], axis=1, inplace=True)

In [82]:
merged_df.reset_index(drop=True, inplace=True)

In [83]:
merged_df.head()

Unnamed: 0,text_customer,author_id_vendor,text_vendor
0,@sprintcare I have sent several private messag...,sprintcare,@115712 I understand. I would like to assist y...
1,@sprintcare I did.,sprintcare,@115712 Please send us a Private Message so th...
2,@sprintcare is the worst customer service,sprintcare,@115712 I would love the chance to review the ...
3,@sprintcare is the worst customer service,sprintcare,@115712 Can you please send us a private messa...
4,@sprintcare is the worst customer service,sprintcare,@115712 Hello! We never like our customers to ...


In [114]:
# Remove the '@xxxx' characters from text and text_response columns
merged_df['text_customer'] = merged_df['text_customer'].str.replace('@\w+', '', regex=True)
merged_df['text_vendor'] = merged_df['text_vendor'].str.replace('@\w+', '', regex=True)
# Remove HTTP addresses from the "text_customer" column
merged_df['text_customer'] = merged_df['text_customer'].str.replace(r'http\S+', '', regex=True)
# Remove HTTP addresses from the "text_customer" column
merged_df['text_vendor'] = merged_df['text_vendor'].str.replace(r'http\S+', '', regex=True)

In [115]:
merged_df.head(10)

Unnamed: 0,text_customer,author_id_vendor,text_vendor
0,I have sent several private messages and no o...,sprintcare,I understand. I would like to assist you. We ...
1,I did.,sprintcare,Please send us a Private Message so that we c...
2,is the worst customer service,sprintcare,I would love the chance to review the account...
3,is the worst customer service,sprintcare,"Can you please send us a private message, so ..."
4,is the worst customer service,sprintcare,Hello! We never like our customers to feel li...
5,You gonna magically change your connectivity ...,sprintcare,This is saddening to hear. Please shoot us a ...
6,You gonna magically change your connectivity ...,sprintcare,I would really like to work with you to have ...
7,You gonna magically change your connectivity ...,sprintcare,"Hi, my name is Shantel, I'm a resolution supe..."
8,Since I signed up with you....Since day 1,sprintcare,We understand your concerns and we'd like for...
9,y’all lie about your “great” connection. 5 ba...,sprintcare,H there! We'd definitely like to work with yo...


In [116]:
merged_df.to_json('data/clean_data.json', orient='records', lines=True)

## Load Dataset

In [24]:
# Convert DataFrame to Dataset object
data = load_dataset('json', data_files='data/clean_data.json')

Found cached dataset json (/home/jupyter/.cache/huggingface/datasets/json/default-d8a349ae8debdbcc/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)


  0%|          | 0/1 [00:00<?, ?it/s]

In [25]:
data["train"][100:150]

{'text_customer': [' your online forum is always broken even when I clear my cookies. ',
  ' Here is an example of an export before and after. Export settings are unchanged from default. No, only some files. ',
  ' Here is an example of an export before and after. Export settings are unchanged from default. No, only some files. ',
  ' InDesign 13.0 is exporting all PDFs with blurry links, but links are not missing in INDD file. JPG export is not blurry. Help!',
  ' InDesign 13.0 is exporting all PDFs with blurry links, but links are not missing in INDD file. JPG export is not blurry. Help!',
  " I've just sent you a message with details. Thanks for your help",
  "Quand ton  ne ressemble pas à celui que tu as réservé, &amp; qu'en + d'être sale et en sous-sol, tu te retrouves à 10 dans 1 #logement ! ",
  ' Misrepresentation by host. Asked them to cancel my reservation and refund me over AUD1700 - strict policy. No response. Please help.',
  ' They have no info either..',
  " Lol so there

In [26]:
def generate_prompt(data_point):
    return f"""
    <human>: {data_point["text_customer"]}
    <assistant>: {data_point["text_vendor"]}
    """.strip()

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt
    

In [27]:
train_sample = data["train"].select(range(1000))

In [28]:
data = train_sample.shuffle().map(generate_and_tokenize_prompt)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

## Training

In [29]:
OUTPUT_DIR = "experiments"

In [30]:
%load_ext tensorboard
%tensorboard --logdir experiments/runs

In [31]:
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1, #Larger GPU will allow to increase this
    gradient_accumulation_steps=4, 
    num_train_epochs=1, 
    learning_rate=2e-4,
    fp16=True, #Training to a precision of 16
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=200, #Number of samples in the dataset
    optim="paged_adamw_8bit", #Optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.05, #Warmup ratio for the first couples of arguments
    report_to="tensorboard",
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data, 
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False) #Merge the training exaples into batches in order to predict the next token
)
model.config.user_cache=False, #Not use the cash during the training
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
1,4.1128
2,4.3127
3,4.0007
4,4.2517
5,4.1957
6,4.1931
7,3.9771
8,3.9874
9,4.102
10,4.3084


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

TrainOutput(global_step=200, training_loss=2.9107669961452483, metrics={'train_runtime': 1079.4318, 'train_samples_per_second': 0.741, 'train_steps_per_second': 0.185, 'total_flos': 920564265424128.0, 'train_loss': 2.9107669961452483, 'epoch': 0.8})

### Save the model

In [32]:
model.save_pretrained("trained_model")

### Load trained model

In [33]:
PATH = "trained_model"

config = PeftConfig.from_pretrained(PATH)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True, 
    quantization_config=bnb_config,
    device_map="auto",#Pass the model to the GPU device
    trust_remote_code=True
)
tokenizer  = AutoTokenizer.from_pretrained(config.base_model_name_or_path)#import the tokenizer from the base model
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, PATH)#This will take the model, and use the PATH as QLoRA the adapter

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Inference

In [34]:
prompt = f"""
<human>: I'm quite disapointed by my diner at your restaurant, the fish was tasteless.
<assitant>:
""".strip()
print(prompt)

<human>: I'm quite disapointed by my diner at your restaurant, the fish was tasteless.
<assitant>:


In [35]:
generation_config = model.generation_config #Retrieve the generation config from the model
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1 #Just returning a single sequence
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [36]:
%%time

device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids, 
        attention_mask=encoding.attention_mask, 
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

<human>: I'm quite disapointed by my diner at your restaurant, the fish was tasteless.
<assitant>:  I'm sorry to hear that. We'd like to make it right. Please DM us your contact info so we can reach out. -Becky 1/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2
CPU times: user 2min 14s, sys: 46.4 ms, total: 2min 14s
Wall time: 2min 14s


# TEST

In [9]:
#Create a pipeline for the task and specify the model and tokenizer that are loaded
generator = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    
)

sentence = "Can you complete the sentence: Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone"

sequences = generator(
    sentence, 
    max_length=100,
    # do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForC

Result: Can you complete the sentence: Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone


### Prepare model for training

Some pre-processing needs to be done before training such an int8 model using `peft`, therefore let's import an utiliy function `prepare_model_for_kbit_training` that will: 
- Casts all the non `int8` modules to full precision (`fp32`) for stability
- Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states
- Enable gradient checkpointing for more memory-efficient training

In [6]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

### Apply LoRA

Here comes the magic with `peft`! Let's load a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) using `get_peft_model` utility function from `peft`.

In [7]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [8]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16, 
    lora_alpha=32, 
    target_modules=["query_key_value"],
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 4718592 || all params: 6926439296 || trainable%: 0.06812435363037071


In [9]:
# Verifying the datatypes.
dtypes = {}
for _, p in model.named_parameters():
    dtype = p.dtype
    if dtype not in dtypes:
        dtypes[dtype] = 0
    dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items():
    total += v
for k, v in dtypes.items():
    print(k, v, v / total)

torch.float32 300487552 0.04338268757708318
torch.int8 6625951744 0.9566173124229168


# Fine tuning

## Perpare the dataset

In [6]:
import transformers
from datasets import load_dataset, Dataset
import pandas as pd
import numpy as np

In [13]:
#Download the dataset using the HuggingFace dataset API
data = load_dataset("databricks/databricks-dolly-15k").remove_columns("category")

Found cached dataset json (/home/jupyter/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-6e0f9ea7eaa0ee08/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)


  0%|          | 0/1 [00:00<?, ?it/s]

In [12]:
print(data.column_names)

{'train': ['instruction', 'context', 'response']}


In [13]:
data["train"][0]

{'instruction': 'When did Virgin Australia start operating?',
 'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.'}

### Tokenize the dataset

We need to tokenize the elements of the dataset. For doin this,send the sentenses to the tokenize function

In [8]:
#Import the model tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "tiiuae/falcon-7b",
)

In [14]:
def generate_prompt(data_point):
    return f"""
    <human> {data_point["context"]}, {data_point["instruction"]}
    <assistant> {data_point["response"]}
    """.strip()

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

In [16]:
tokenizer.pad_token = tokenizer.eos_token

tokenized_data = data["train"].shuffle().map(generate_and_tokenize_prompt)
tokenized_data

Map:   0%|          | 0/15011 [00:00<?, ? examples/s]

Dataset({
    features: ['instruction', 'context', 'response', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 15011
})

# Training

In [None]:
%load_ext tensorboard
%tensorboard --logdir experiments/runs

In [21]:
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=4,
    logging_steps=1,
    output_dir="./outputs",
    save_strategy='epoch',
    optim="paged_adamw_8bit",
    lr_scheduler_type = 'cosine',
    max_steps=80,
    warmup_ratio = 0.05,
    report_to="tensorboard"
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,2.23
2,2.2955
3,2.2013
4,2.3093
5,2.4511
6,2.3013
7,2.4028
8,2.2442
9,2.0588
10,2.0378


## Share adapters on the 🤗 Hub

In [10]:
model.push_to_hub("dfurman/falcon-7b-chat-oasst1", use_auth_token=True)

CommitInfo(commit_url='https://huggingface.co/dfurman/falcon-7b-chat-oasst1/commit/c1d659b12ba143921a39039c5c73de8d08c915c8', commit_message='Upload model', commit_description='', oid='c1d659b12ba143921a39039c5c73de8d08c915c8', pr_url=None, pr_revision=None, pr_num=None)

## Load adapters from the Hub

You can also directly load adapters from the Hub using the commands below:

In [11]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "dfurman/falcon-7b-chat-oasst1"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path, 
    return_dict=True, 
    load_in_8bit=True, 
    device_map={"":0},
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)

Downloading (…)/adapter_config.json:   0%|          | 0.00/333 [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading adapter_model.bin:   0%|          | 0.00/18.9M [00:00<?, ?B/s]

## Inference

You can then directly use the trained model or the model that you have loaded from the 🤗 Hub for inference as you would do it usually in `transformers`.

In [12]:
prompt = """<human>: My name is Daniel. Write a short email to my closest friends inviting them to come to my home on Friday for a dinner party, I will make the food but tell them to BYOB. 
<bot>:"""

prompt

'<human>: My name is Daniel. Write a short email to my closest friends inviting them to come to my home on Friday for a dinner party, I will make the food but tell them to BYOB. \n<bot>:'

In [13]:
batch = tokenizer(
    prompt,
    padding=True,
    truncation=True,
    return_tensors='pt'
)
batch = batch.to('cuda:0')
batch

{'input_ids': tensor([[   39, 15564, 48190,  1814,  1536,   304,  8156,    25, 14687,   241,
          1866,  2572,   271,   491, 14710,  2153, 19549,   612,   271,  1239,
           271,   491,  1081,   313,  4201,   312,   241,  5947,  3054,    23,
           295,   451,   717,   248,  1655,   480,  1705,   612,   271, 15528,
         18791,    25,  4610,    39, 13359, 48190]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0')}

In [14]:
with torch.cuda.amp.autocast():
    output_tokens = model.generate(
        input_ids = batch.input_ids, 
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.7,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )



In [15]:
generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
# Inspect message response in the outputs
print(generated_text.split("<human>: ")[1].split("<bot>: ")[-1])

Dear friends,

I am so excited to host a dinner party at my home this Friday! I will be making a delicious meal, but I would love for you to bring your favorite bottle of wine to share with everyone.

Please let me know if you can make it and if you have any dietary restrictions I should be aware of. I look forward to seeing you soon!

Best,
Daniel

