# QLoRA - How to Fine-tune an LLM on a Single GPU

[https://www.youtube.com/watch?v=XpoKB3usmKc](https://www.youtube.com/watch?v=XpoKB3usmKc)

---

The dataset consists of YouTube comments and replies. The goal is to fine-tune the model so that it responds to YouTube comments.

In [2]:
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes



In [3]:
import transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import load_dataset



In [4]:
from huggingface_hub import login

login()


In [5]:
# load quantized model
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")


In [6]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)


In [7]:
# baseline performance of model when responding to youtube comments

model.eval()

# Example prompt i.e. youtube comment
comment = "Great content, thanks a lot it was good"
prompt = f'''[INST] {comment} [/INST]'''

# tokenize prompt
inputs = tokenizer(prompt, return_tensors="pt")

# generate output
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] Great content, thanks a lot it was good [/INST] I'm glad you found the content helpful! If you have any specific questions or topics you'd like me to cover in the future, feel free to ask. I'm here to help. Have a great day!</s>


In [8]:
# prompt engineering part

instructions_string = f"""ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''

prompt = prompt_template(comment)
print(prompt)

[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Great content, thanks a lot it was good 
[/INST]


In [9]:
# query the non-fine-tuned model again, this time with the engineered prompt

# tokenize new prompt
inputs = tokenizer(prompt, return_tensors="pt")

# generate output
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Great content, thanks a lot it was good 
[/INST] Thank you for your kind words! I'm glad you found the content helpful. If you have any specific questions or topics you'd like me to cover in more detail, just let me know and I'll be happy to help –ShawGPT.</s>
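The decoded output above echoes the whole prompt before the model's reply. A minimal sketch of how the reply alone could be pulled out, assuming the `[INST] ... [/INST]` layout and the `</s>` end marker seen in the printed output. The helper name `extract_reply` and the shortened example string are hypothetical, not part of the notebook.

```python
def extract_reply(decoded: str) -> str:
    """Return only the text generated after the final [/INST] tag."""
    # Everything up to the last [/INST] is the echoed prompt.
    _, _, reply = decoded.rpartition("[/INST]")
    # Drop the end-of-sequence marker if the model emitted one.
    reply = reply.replace("</s>", "")
    return reply.strip()


# Shortened, hypothetical stand-in for a decoded generation.
decoded = "<s> [INST] Great content, thanks a lot it was good [/INST] Thank you for your kind words! –ShawGPT.</s>"
print(extract_reply(decoded))
```

If no `[/INST]` tag is present, `rpartition` leaves the whole string in the last slot, so the function simply returns the stripped input.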


## Model finetuning with QLoRA

In [11]:
model.train() # put the model in training mode (dropout modules are active)

# enable gradient checkpointing: saves memory by recomputing activations during the backward pass instead of storing them all
model.gradient_checkpointing_enable()

# prepare the quantized model for training (this freezes the base model weights)
model = prepare_model_for_kbit_training(model)

# NOTE: the base model weights are stored in 4-bit, while the LoRA adapters added in the next step are trained in higher precision
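To make the "frozen base, trainable adapter" idea concrete, here is a toy sketch of a LoRA forward pass in plain Python. It is not how PEFT implements it. The dimensions, the helper names `matvec` and `lora_forward`, and the random values are all hypothetical. It assumes the usual LoRA formulation, where the output is the frozen projection plus a low-rank update scaled by `alpha / r`, and where `B` starts at zero so the adapter initially changes nothing.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W_frozen, A, B, alpha, r):
    """Frozen base projection plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W_frozen, x)          # frozen weights, never updated
    update = matvec(B, matvec(A, x))    # only A and B would be trained
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, r, alpha = 6, 2, 8                                               # toy sizes
W_frozen = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]   # r x d, small random init
B = [[0.0] * r for _ in range(d)]                                   # d x r, zero init
x = [random.uniform(-1, 1) for _ in range(d)]

y = lora_forward(x, W_frozen, A, B, alpha, r)
print("adapter is a no-op at init:", y == matvec(W_frozen, x))
print("trainable values:", r * d + d * r, "vs frozen values:", d * d)
```

In the real model the frozen matrix is the 4-bit quantized `q_proj` weight, and only the two small matrices receive gradients.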

### Set up LoRA config

In [12]:
# LoRA config
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# LoRA trainable version of model
model = get_peft_model(model, config)

# trainable parameter count
model.print_trainable_parameters()

trainable params: 2,097,152 || all params: 264,507,392 || trainable%: 0.7929
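The numbers printed above can be reproduced by hand. The sketch below assumes the public Mistral-7B shapes (hidden size 4096, 32 decoder layers, vocabulary of 32,000 tokens) and that `q_proj` is a square 4096 x 4096 projection. The breakdown of "all params" is one decomposition that matches the reported total. It assumes the GPTQ-packed linear weights are held as buffers and therefore not counted as parameters.

```python
hidden_size = 4096    # assumed: Mistral-7B hidden dimension
num_layers = 32       # assumed: number of decoder layers
vocab_size = 32_000   # assumed: vocabulary size
r = 8                 # LoRA rank from the config above

# q_proj maps hidden_size -> hidden_size, so each layer adds
# A (r x hidden_size) and B (hidden_size x r).
lora_params = num_layers * (r * hidden_size + hidden_size * r)

# Assumption: quantized linear weights are buffers and are not counted,
# leaving the embeddings, the LM head and the normalization weights.
embedding_params = 2 * vocab_size * hidden_size      # input embeddings + LM head
norm_params = (2 * num_layers + 1) * hidden_size     # two norms per layer + final norm
total_params = lora_params + embedding_params + norm_params

print(f"trainable params: {lora_params:,}")
print(f"all params: {total_params:,}")
print(f"trainable%: {100 * lora_params / total_params:.4f}")
```

This also explains why "all params" is far below 7 billion: the reported percentage is relative to the counted parameters only, not to the full model size.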


### Dataset

The dataset is a series of real comments and replies from the video author's channel. It is small (50 training examples), and each example is already formatted with the prompt template from the prompt engineering step above.

In [13]:
# load dataset
data = load_dataset("shawhin/shawgpt-youtube-comments")



Generating train split:   0%|          | 0/50 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/9 [00:00<?, ? examples/s]

In [14]:
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["example"]

    # tokenize and truncate text (truncating from the left keeps the end of each example, i.e. the reply)
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

# tokenize training and validation datasets
tokenized_data = data.map(tokenize_function, batched=True)


In [15]:
# setting pad token
tokenizer.pad_token = tokenizer.eos_token
# data collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
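To show roughly what a causal language modeling collator with `mlm=False` produces, here is a simplified pure-Python sketch. It is an illustration, not the library code. The function name `collate_causal_lm`, right-side padding, and the toy token ids are assumptions. The pad id of 2 is taken from the `eos_token_id` reported in the generation log. Labels are a copy of the inputs with padding positions set to -100 so the loss ignores them.

```python
PAD_ID = 2            # pad token = EOS token, id 2 as reported in the generation log
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips


def collate_causal_lm(batch, pad_id=PAD_ID):
    """Pad variable-length sequences and build labels for causal LM training."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad           # assumption: pad on the right
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Every position equal to the pad id is ignored in the loss.
        labels.append([IGNORE_INDEX if t == pad_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Toy token ids; the first sequence ends with a real EOS token (id 2).
batch = collate_causal_lm([[1, 5, 6, 2], [1, 7]])
for key, value in batch.items():
    print(key, value)
```

One side effect of reusing EOS as the pad token is visible in the first row: the genuine end-of-sequence token is also masked out of the labels, because it cannot be told apart from padding.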

## Finetuning

### Hyperparameters

In [17]:
# hyperparameters
lr = 2e-4
batch_size = 4
num_epochs = 10

# define training arguments
training_args = transformers.TrainingArguments(
    output_dir= "shawgpt-ft",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    fp16=True, # NOTE: mixed-precision (16-bit) training; only the LoRA adapter weights are updated, the quantized base model stays frozen
    optim="paged_adamw_8bit", # paged optimizer: 8-bit AdamW whose optimizer states can be paged between GPU and CPU memory to absorb memory spikes
)
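A quick back-of-the-envelope check of what these settings imply, using the 50 training examples reported when the dataset was loaded. This is a rough sketch: the exact number of optimizer steps depends on how the installed `transformers` version handles the last, incomplete accumulation cycle.

```python
import math

train_examples = 50    # size of the train split
batch_size = 4         # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_epochs = 10

# Gradients from grad_accum micro-batches are summed before each optimizer step.
effective_batch = batch_size * grad_accum

# Number of micro-batches the dataloader yields per epoch (last one is partial).
micro_batches_per_epoch = math.ceil(train_examples / batch_size)

# Complete accumulation cycles that fit into one epoch.
full_cycles_per_epoch = micro_batches_per_epoch // grad_accum

print("effective batch size:", effective_batch)
print("micro-batches per epoch:", micro_batches_per_epoch)
print("full accumulation cycles per epoch:", full_cycles_per_epoch)
print("approximate optimizer steps in total:", full_cycles_per_epoch * num_epochs)
```

With so few optimizer steps overall, `warmup_steps=2` still covers a noticeable share of training.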

In [18]:
# configure trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    args=training_args,
    data_collator=data_collator
)


# train model
model.config.use_cache = False  # disable the KV cache during training (it conflicts with gradient checkpointing and triggers warnings)
trainer.train()

# re-enable the KV cache for inference
model.config.use_cache = True





| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | 4.5913 | 3.958954 |
| 1 | 4.0357 | 3.427283 |
| 2 | 3.4564 | 2.973857 |
| 4 | 2.6469 | 2.283207 |
| 5 | 2.3161 | 2.077054 |
| 6 | 2.0471 | 1.890489 |
| 8 | 1.7862 | 1.714866 |
| 9 | 1.2471 | 1.71034 |
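The losses above are easier to interpret as perplexities. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language modeling, and converts the first and last validation losses from the table.

```python
import math

# Validation losses copied from the first and last logged epochs.
first_val_loss = 3.958954
last_val_loss = 1.71034

# Perplexity is the exponential of the mean per-token cross-entropy.
first_ppl = math.exp(first_val_loss)
last_ppl = math.exp(last_val_loss)

print(f"validation perplexity at first epoch: {first_ppl:.2f}")
print(f"validation perplexity at last epoch:  {last_ppl:.2f}")
```

In the last logged epoch the training loss (1.2471) falls well below the validation loss, while the validation loss has almost stopped improving. With only 50 training examples this may be an early sign of overfitting.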




In [20]:
# push to hub: for a PEFT model this uploads only the LoRA adapter (adapter_config.json and adapter_model.safetensors), not the base model weights


hf_name = 'benjaminzwhite'
model_id = hf_name + "/" + "shawgpt-ft"

model.push_to_hub(model_id)
trainer.push_to_hub(model_id)

README.md:   0%|          | 0.00/1.95k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/benjaminzwhite/shawgpt-ft/commit/caa5dd06ad57da04f4b4d2a138c158c884cd0d84', commit_message='benjaminzwhite/shawgpt-ft', commit_description='', oid='caa5dd06ad57da04f4b4d2a138c158c884cd0d84', pr_url=None, pr_revision=None, pr_num=None)

In [21]:
# practice loading the model from the hub
# only the LoRA adapter is stored on the hub, so the quantized base model is loaded first and the adapter is applied on top


from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

config = PeftConfig.from_pretrained("benjaminzwhite/shawgpt-ft")
model = PeftModel.from_pretrained(model, "benjaminzwhite/shawgpt-ft")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)



adapter_config.json:   0%|          | 0.00/649 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/8.40M [00:00<?, ?B/s]
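The size of `adapter_model.safetensors` is consistent with the file containing only the LoRA weights. A small sanity check, assuming the adapter is saved in 32-bit floats (4 bytes per value) and that the download bar reports decimal megabytes:

```python
trainable_params = 2_097_152   # from print_trainable_parameters() earlier
bytes_per_param = 4            # assumption: adapter saved as float32

adapter_bytes = trainable_params * bytes_per_param
adapter_megabytes = adapter_bytes / 1e6

print(f"{adapter_bytes:,} bytes, about {adapter_megabytes:.2f} MB")
```

The small gap to the 8.40M shown in the download bar would be explained by the file header and metadata. For comparison, the quantized base model download earlier was 4.16G.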

In [26]:
# testing the finetuned model

instructions_string = f"""ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""
prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''

comment = "Nice video but can you please do a video about LLMs ? Thanks"

prompt = prompt_template(comment)

In [27]:
prompt

"[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.\n\nPlease respond to the following comment.\n \nNice video but can you please do a video about LLMs ? Thanks \n[/INST]"

In [28]:
model.eval()

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Nice video but can you please do a video about LLMs ? Thanks 
[/INST]

Thanks for the feedback!

I'd be happy to do a video on LLMs (Large Language Models) in the future. I'll make sure to cover the differences between them and the models I've discussed in this video.

–ShawGPT</s>
