<a href="https://colab.research.google.com/github/yashsawant22/Fine-Tuning-Llama/blob/main/QLora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U accelerate bitsandbytes datasets peft transformers tokenizers


# Preparing Dataset

Data quality matters a great deal. For instruction tuning, the dataset should consist of question-answer pairs.

To prepare the data for training, we format it as OpenAI's ChatML, because it has been adopted by many recent model releases and might become the new standard.

In [None]:
#example of a ChatML-formatted dialogue
"""
<|im_start|>system
You are an AI assistant. User will you give you a task. Your goal is to
complete the task as faithfully as you can. While performing the task
think step-by-step and justify your steps.<|im_end|>
<|im_start|>user
Premise: A man is inline skating in front of a wooden bench. Hypothesis:
A man is having fun skating in front of a bench. .Choose the correct
answer: Given the premise, can we conclude the hypothesis?
Select from: a). yes b). it is not possible to tell c). no<|im_end|>
<|im_start|>assistant
b). it is not possible to tell Justification: Although the man is inline
skating in front of the wooden bench, we cannot conclude whether he is
having fun or not, as his emotions are not explicitly mentioned.<|im_end|>

"""

#The above example will then be tokenized, batched, and fed into the training algorithm

In [None]:
## We load a pre-made dataset from Hugging Face

from datasets import load_dataset
dataset = load_dataset("OpenAssistant/oasst_top1_2023-08-25")

In [None]:
dataset

Once loaded, the dataset is already divided into a training split (13k entries) and a test split (700 entries).


In [None]:
# Print first entry
print(dataset["train"][0]["text"])

The text is already in ChatML, so no reformatting is needed. We only have to tell the tokenizer and the model that the strings <|im_start|> and <|im_end|> are tokens that must not be split, and that <|im_end|> is a special token (eos, "end-of-sequence") marking the end of an answer by the model. Without it, the model would keep generating and never stop. How to integrate these tokens with base models such as llama2 and Mistral is covered further down.
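
As a rough illustration of why the end marker matters, the sketch below renders a conversation as ChatML and cuts a raw completion at the first <|im_end|>. It is plain Python and involves no tokenizer or model; the helper names `to_chatml` and `cut_at_eos` and the sample strings are made up for this example and are not part of any library.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"

def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as one ChatML string."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

def cut_at_eos(completion):
    """Keep only the text before the first end marker (hypothetical post-processing)."""
    return completion.split(IM_END, 1)[0].strip()

prompt = to_chatml(
    [{"role": "user", "content": "Name one good habit."}],
    add_generation_prompt=True,
)

# A made-up completion from a model that ran past the end of its answer
raw = "Reading every day.<|im_end|>\n<|im_start|>user\nAnd another?"
answer = cut_at_eos(raw)

print(prompt)
print(answer)
```

During real generation the cut happens earlier: once <|im_end|> is registered as the eos token, generation stops as soon as the model emits it.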

In [None]:
from datasets import load_dataset

dataset = load_dataset("Open-Orca/OpenOrca")
dataset = dataset["train"].train_test_split(test_size=0.1)

## We need to convert this dataset to ChatML format

In [None]:
def format_conversation(row):
    template="<|im_start|>system\n{sys}<|im_end|>\n<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n{a}<|im_end|>"

    conversation=template.format(
        sys=row["system_prompt"],
        q=row["question"],
        a=row["response"],
    )

    return {"text": conversation}

'''import os
dataset = dataset.map(
    format_conversation,
    remove_columns=dataset["train"].column_names, # remove all columns; only "text" will be left
    num_proc=os.cpu_count()  # multithreaded
)'''

# To create dataset using a book

To dive deeper into the nuances of dataset creation, let’s consider a case where we want to train an AI to mirror the voice and personality of a renowned figure. I chose to turn the autobiography of the famous American chef Anthony Bourdain into a dataset. In “Kitchen Confidential” he vividly describes all the craziness in the kitchen and in the mind of a chef.

This process involves transforming the narrative of Bourdain’s book into an engaging dialogue, much like a back-and-forth interview that captures his spirit.

Steps required:

1. Converting the book to text.
2. Paragraph analysis and segmentation: Once the book is in text form, we segment it into paragraphs. Short paragraphs are merged, and longer ones are split to ensure that each segment can stand on its own while still contributing to the overall storyline.
3. Generating interview questions: For each paragraph, we construct an artificial interview scenario where an LLM plays the role of an interviewer, generating questions that elicit responses naturally fitting the given paragraph from the book. The goal is to stimulate an insightful dialogue, giving the impression that Bourdain himself is answering questions about his life and experiences.
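
The segmentation step can be sketched roughly as follows. This is a minimal illustration, not the exact logic used in the cells further down: the function name `build_passages`, the character-based limits, and the sample text are assumptions for this example. Over-long paragraphs are cut at a fixed character count, which can split a sentence mid-word.

```python
def build_passages(text, min_len=300, max_len=2000):
    """Merge short paragraphs and split long ones into passages of at most max_len characters."""
    passages = []
    current = ""
    for par in text.split("\n"):
        par = par.strip()
        if not par:
            continue
        # Cut an over-long paragraph into pieces of at most max_len characters
        pieces = [par[i:i + max_len] for i in range(0, len(par), max_len)]
        for piece in pieces:
            if current and len(current) + 1 + len(piece) > max_len:
                # Adding this piece would exceed the limit: close the passage first
                passages.append(current)
                current = piece
            else:
                current = f"{current} {piece}".strip()
            if len(current) >= min_len:
                # Long enough to stand on its own
                passages.append(current)
                current = ""
    if current:
        # Keep whatever is left at the end of the text
        passages.append(current)
    return passages

# Tiny limits so the behaviour is visible on a toy input
sample = "Short one.\nAnother short one.\n" + "x" * 50
demo = build_passages(sample, min_len=30, max_len=40)
for p in demo:
    print(len(p), p[:30])
```

The two short lines end up merged into one passage, while the 50-character line is split into two pieces.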

In [None]:
!pip install evaluate


In [None]:
import transformers
import evaluate
import torch
import json
import random
from tqdm import tqdm
from datasets import load_dataset
import argparse

def read_file(fn):
	with open(fn) as f:
		data = f.read()
	return data

def write_pretty_json(file_path, data):
    with open(file_path, "w") as write_file:
        json.dump(data, write_file, indent=4)
    print(f"wrote {file_path}")


In [None]:
model_path="teknium/OpenHermes-2-Mistral-7B"
input_file="/content/sample_data/Atomic_Habits.txt"

file_content=read_file(input_file)
chapters=file_content.split("\n\n")
paragraphs=file_content.split("\n")
passage_minlen=300
passage_maxlen=2000
outputfn=input_file.split(".")[0]+"_interview.json"

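passages=[]
for chap in chapters:
	passage=""
	for par in chap.split("\n"):
		# keep merging while the passage is too short, or while it does not end in a period and is still below the maximum
		if len(passage)<passage_minlen or (not passage.endswith(".") and len(passage)<passage_maxlen):
			passage+="\n" + par
		else:
			passages.append(passage.strip().replace("\n", " "))
			passage=par
	# flush what is left of the chapter so its last passage is not dropped
	if passage.strip():
		passages.append(passage.strip().replace("\n", " "))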


In [None]:
with open("/content/sample_data/Atomic_Habits.txt") as f:
    file_content = f.read()

# Step 1: Define chunk size (number of characters per chunk)
chunk_size = 2000  # You can adjust this to your preferred chunk length

# Step 2: Break the file content into equal-length chunks
passages = [file_content[i:i + chunk_size] for i in range(0, len(file_content), chunk_size)]

# Step 3: Print the number of chunks and optionally preview the first few
print(f"Number of chunks: {len(passages)}")

# Optionally print the first 200 characters of each chunk for verification
for i, chunk in enumerate(passages[:5]):  # Print first 5 chunks for sanity check
    print(f"Chunk {i+1}: {chunk[:200]}...")


In [None]:
from getpass import getpass
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import login

# Log in with a Hugging Face access token entered at runtime (never hard-code tokens in the notebook)
login(token=getpass("Hugging Face token: "))

# Load the model and tokenizer
#model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_auth_token=True)
#tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_auth_token=True)


In [None]:

# The model that you want to train from the Hugging Face hub



# Ask Open Hermes
prompt_template="""<|im_start|>system
You are an expert interviewer who interviews the writer of the book Atomic Habits, which aims to teach discipline and an efficient way to build good habits.
You formulate questions based on quotes from the Book. Below is one
such quote. Formulate a question that the quote would be the perfect answer to.
The question should be short and directed at the author of the Book
like in an interview. The question is short. Remember, make the question as
short as possible. Do not give away the answer in your question.
Also: If possible, ask for motivations, feelings, and perceptions rather than
events or facts.

Here is some context that might help you formulate the question regarding the quote:
{ctx}
<|im_end|>
<|im_start|>user
Quote:
{par}<|im_end|>
<|im_start|>assistant
Question:"""

'''
Eg:
{
  "question": "Why you choose to share your experiences and insights from
    your career in the restaurant industry despite the angry or wanting
    to horrify the dining public?",
  "answer": "I'm not spilling my guts about everything I've seen, learned
    and done in my long and checkered career as dishwasher, prep drone,
    fry cook, grillardin, saucier, sous-chef and chef because I'm angry
    at the business, or because I want to horrify the dining public. I'd
    still like to be a chef, too, when this thing comes out, as this life
    is the only life I really know. If I need a favor at four o'clock in
    the morning, whether it's a quick loan, a shoulder to cry on, a sleeping
    pill, bail money, or just someone to pick me up in a car in a bad
    neighborhood in the driving rain, I'm definitely not calling up a fellow
    writer. I'm calling my sous-chef, or a former sous-chef, or my saucier,
    someone I work with or have worked with over the last twenty-plus years."
},
{
  "question": "Why do you feel more comfortable sharing the \"dark recesses\"
    of the restaurant underbelly instead of writing about your personal
    experiences outside of the culinary world?",
  "answer": "No, I want to tell you about the dark recesses of the restaurant
    underbelly-a subculture whose centuries-old militaristic hierarchy and
    ethos of 'rum, buggery and the lash' make for a mix of unwavering order
    and nerve-shattering chaos-because I find it all quite comfortable, like
    a nice warm bath. I can move around easily in this life. I speak the
    language. In the small, incestuous community of chefs and cooks in New
    York City, I know the people, and in my kitchen, I know how to behave
    (as opposed to in real life, where I'm on shakier ground). I want the
     professionals who read this to enjoy it for what it is: a straight look
    at a life many of us have lived and breathed for most of our days and
    nights to the exclusion of 'normal' social interaction. Never having had
    a Friday or Saturday night off, always working holidays, being busiest
    when the rest of the world is just getting out of work, makes for a
    sometimes peculiar world-view, which I hope my fellow chefs and cooks
    will recognize. The restaurant lifers who read this may or may not like
    what I'm doing. But they'll know I'm not lying."
}'''

prompts=[]
for i,p in enumerate(passages):
	if i==0:
		continue
	prompt=prompt_template.format(par=passages[i], ctx=passages[i-1])
	prompts.append(prompt)

prompts_generator=(p for p in prompts)	# pipeline needs a generator, not a list

#print(f"{len(chapters)} chapters")
#print(f"{len(paragraphs)} paragraphs")
#print(f"{len(passages)} passages")



#model_path = "teknium/OpenHermes-2-Mistral-7B"

pipeline = transformers.pipeline(
		"text-generation",
		model=model_path,
		torch_dtype=torch.bfloat16,
		device_map="auto",
	)

pipeline.tokenizer.add_special_tokens({"pad_token":"<pad>"})
pipeline.model.resize_token_embeddings(len(pipeline.tokenizer))
pipeline.model.config.pad_token_id = pipeline.tokenizer.pad_token_id

gen_config = {
    "temperature": 0.7,
    "top_p": 0.1,
    "repetition_penalty": 1.18,
    "top_k": 40,
	"do_sample": True,
	"num_return_sequences": 1,
	"eos_token_id": pipeline.tokenizer.eos_token_id,
	"max_new_tokens": 50,
}


In [None]:

results={
	"model": model_path,
	"input_file": input_file,
	"gen_config": gen_config,
	"passage_minlen": passage_minlen,
	"passage_maxlen": passage_maxlen,
	"num_passages": len(passages),
	"template": prompt_template,
	"interview": []
}

for i, out in enumerate(tqdm(pipeline(prompts_generator, batch_size=2, **gen_config),total=len(prompts))):
	question=out[0]["generated_text"][len(prompts[i]):].strip()
	answer=passages[i+1]

	results["interview"].append({"question": question, "answer": answer})

	write_pretty_json(outputfn,results)

In [None]:
import replicate

def llama2(prompt, temperature=0.0, input_print=True):
  output = replicate.run(
    "meta/llama-2-7b-chat",
    input={
        "prompt": prompt,
        "max_tokens": 2048,
        "temperature": temperature})
  return "".join(output)

In [None]:
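import os
from getpass import getpass
from IPython.display import Markdown, display

REPLICATE_API_TOKEN = getpass()

os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN

prompt = "What's the best way to build a habit?"
output = llama2(prompt)
display(Markdown(output))  # render the reply as Markdown in the notebook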


In [None]:
## Convert again to ChatML format
from transformers import AutoTokenizer

# Initialize the tokenizer with the correct model path
model_path = "teknium/OpenHermes-2-Mistral-7B"  # Use your model path

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

interview_fn="/content/sample_data/Atomic_Habits_interview.json"
dataset = load_dataset('json', data_files=interview_fn, field='interview')
dataset=dataset["train"].train_test_split(test_size=0.1)

# chatML template, from https://huggingface.co/docs/transformers/main/chat_templating
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

def format_interview(conv):
    messages = [
        {"role": "user", "content": conv["question"]},
        {"role": "assistant", "content": conv["answer"]}
    ]
    chat=tokenizer.apply_chat_template(messages, tokenize=False).strip()
    return {"text": chat}



In [None]:
def format_conversation(example):
    # Use 'question' and 'answer' (the interview data has no 'system_prompt')
    question = example.get("question", "")
    answer = example.get("answer", "")

    # Combine them into a single ChatML conversation; the closing <|im_end|> is the eos token the model learns to stop on
    conversation = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}<|im_end|>"

    return {"text": conversation}

In [None]:
dataset = dataset.map(
    format_conversation,
    remove_columns=dataset["train"].column_names
)

# Next step: Tokenize

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

modelpath="teknium/OpenHermes-2-Mistral-7B"

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    torch_dtype=torch.bfloat16,
)



In [None]:
# Load (slow) Tokenizer, fast tokenizer sometimes ignores added tokens
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)

# Add tokens <|im_start|> and <|im_end|>, latter is special eos token
tokenizer.pad_token = "</s>"
tokenizer.add_tokens(["<|im_start|>"])
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))
model.resize_token_embeddings(len(tokenizer))
model.config.eos_token_id = tokenizer.eos_token_id

Since we are not training all the parameters but only a subset, we have to add the LoRA adapters to the model using huggingface peft. Make sure to use peft >= 0.6, otherwise 1) get_peft_model will be very slow and 2) training will fail with Mistral.

In [None]:
# Add LoRA adapters to model
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules = ['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'],
    lora_dropout=0.1,
    bias="none",
    modules_to_save = ["lm_head", "embed_tokens"],        # needed because we added new tokens to tokenizer/model
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
model.config.use_cache = False

LoRA rank r: Determines the size of the low-rank matrices. The higher the rank, the more parameters you train and the bigger your adapter files will be. Usually a number between 8 and 128. The maximum possible value, i.e. training all parameters, would be 4096 for llama2-7b and Mistral (=hidden_size in config.json) and would defeat the purpose of adding adapters. The QLoRA paper suggests 64 for Guanaco (Open Assistant dataset), which works well for me.

target_modules: Another suggestion/finding of the QLoRA authors in their paper:

> we find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers are required to match full finetuning performance

modules_to_save: Specifies the modules apart from the LoRA layers to be set as trainable and saved in the final checkpoint. Since we added the ChatML tags as tokens to the vocabulary, we need to train and save the linear layer lm_head and the embedding matrix embed_tokens too. This will be relevant for merging the adapter back into the base model later.
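
To get a feel for how the rank drives adapter size, here is a small back-of-the-envelope calculation. A LoRA adapter on a weight of shape d_out x d_in adds two matrices with r * (d_in + d_out) parameters in total. The projection shapes and the layer count below are assumptions for a Mistral-7B-style model; check config.json of your own model before relying on them. The fully trained lm_head and embed_tokens from modules_to_save are not included in the count.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed (d_in, d_out) shapes of the linear layers in one transformer block
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
num_layers = 32  # assumed number of transformer blocks

def total_lora_params(r):
    """Adapter parameters over all blocks when every listed layer gets an adapter."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
    return per_layer * num_layers

for r in (8, 64, 128):
    print(f"r={r}: {total_lora_params(r):,} adapter parameters")
```

Under these assumptions the adapter size grows linearly with r, so doubling the rank doubles both the trainable adapter parameters and the size of the adapter files.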


In [None]:
## Prepare data for Training
def tokenize(element):
    return tokenizer(
        element["text"],
        truncation=True,
        max_length=2048,
        add_special_tokens=False,
    )

dataset_tokenized = dataset.map(
    tokenize,
    batched=True,
    num_proc=os.cpu_count()   # multithreaded
    #remove_columns=["text"]     # don't need the strings anymore, we have tokens from here on
)

In [None]:
dataset_tokenized

# Batching

The Hugging Face trainer expects a collator function to transform a list of samples into a dictionary holding a batch of padded

- input_ids (tokenized text),
- labels (target text, same as input_ids),
- and attention_masks (tensor of zeros and ones).

We will adopt a simplified version of the DataCollatorForCausalLM from the QLoRA repository for this purpose.

In [None]:
# collate function - to transform list of dictionaries [ {input_ids: [123, ..]}, {.. ] to single batch dictionary { input_ids: [..], labels: [..], attention_mask: [..] }
def collate(elements):
    tokenlist=[e["input_ids"] for e in elements]
    tokens_maxlen=max([len(t) for t in tokenlist])  # length of longest input

    input_ids,labels,attention_masks = [],[],[]
    for tokens in tokenlist:
        # how many pad tokens to add for this sample
        pad_len=tokens_maxlen-len(tokens)

        # pad input_ids with pad_token, labels with ignore_index (-100) and set attention_mask 1 where content, otherwise 0
        input_ids.append( tokens + [tokenizer.pad_token_id]*pad_len )
        labels.append( tokens + [-100]*pad_len )
        attention_masks.append( [1]*len(tokens) + [0]*pad_len )

    batch={
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.tensor(attention_masks)
    }
    return batch

In [None]:
# training Hyperparameters

bs=8        # batch size
ga_steps=1  # gradient acc. steps
epochs=5
steps_per_epoch=len(dataset_tokenized["train"])//(bs*ga_steps)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch,  # eval and save once per epoch
    save_steps=steps_per_epoch,
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",
    learning_rate=0.0002,
    group_by_length=True,
    fp16=True,
    ddp_find_unused_parameters=False,    # needed for training with accelerate
)

batch size: As high as possible to increase speed. Consumes VRAM, reduce if OOM.

gradient_accumulation_steps: Increases effective batch size without consuming additional VRAM but makes training slower. The effective batch size is batch_size * gradient_accumulation_steps.

steps_per_epoch: If your dataset has 80 samples and your effective batch size is 8 (e.g. batch_size 8 and gradient_accumulation_steps 1) you will process your entire dataset in 10 steps (=1 epoch).
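
The arithmetic above can be checked with a few lines of Python. This is only a sketch for a single device; the function names are made up for this example, and the floor division mirrors the steps_per_epoch line in the training cell.

```python
def effective_batch_size(batch_size, grad_acc_steps=1):
    """Samples that contribute to one optimizer step on a single device."""
    return batch_size * grad_acc_steps

def steps_per_epoch(num_samples, batch_size, grad_acc_steps=1):
    """Optimizer steps needed to see the whole dataset once (incomplete last step dropped)."""
    return num_samples // effective_batch_size(batch_size, grad_acc_steps)

# 80 samples with batch size 8 and no gradient accumulation
print(steps_per_epoch(80, 8, 1))

# Same effective batch size with less VRAM: batch size 4, accumulate over 2 steps
print(steps_per_epoch(80, 4, 2))
```

Both configurations give the same number of steps per epoch; the second trades speed for lower memory use.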

num_train_epochs: How many epochs to train depends on your dataset. Ideally the loss on your eval split will tell you when to stop training and which checkpoint is the best - but training Guanaco for example results in increasing eval_loss after epoch 2 already, indicating overfitting to the training set, even though the model improves in quality. More on this and an official reply by the QLoRA authors on github and in one of my previous stories.

To sum up: you will simply have to see which checkpoint performs best for your specific task. Usually, 3-4 epochs is a good start.

learning_rate: We will use the default learning rate suggested by the QLoRA authors, 0.0002 for a 7B (or 13B) model. For models with more parameters, lower learning rates are suggested: 0.0001 for models with 33B and 65B parameters.


In [None]:
# Training

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    args=args,
)

trainer.train()

In [None]:
model.save_pretrained("Savianto/qlora-mistral", safe_serialization=True, max_shard_size='4GB')
tokenizer.save_pretrained("Savianto/qlora-mistral")

## Merge LoRA adapters with base model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_path="teknium/OpenHermes-2-Mistral-7B"    # input: base model
adapter_path="out/checkpoint-130"     # input: adapters
save_to="Savianto/qlora-mistral-v2"    # out: merged model ready for inference

base_model = AutoModelForCausalLM.from_pretrained(
    base_path,
    return_dict=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(base_path)

# Add/set tokens (same 5 lines of code we used before training)
tokenizer.pad_token = "</s>"
tokenizer.add_tokens(["<|im_start|>"])
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))
base_model.resize_token_embeddings(len(tokenizer))
base_model.config.eos_token_id = tokenizer.eos_token_id

# Load LoRA adapter and merge
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()

model.save_pretrained(save_to, safe_serialization=True, max_shard_size='4GB')
tokenizer.save_pretrained(save_to)

# Troubleshooting

## OOM

If you encounter an Out of Memory (OOM) error:

- Consider reducing the batch size.
- Shorten training samples by cutting down on context length (max_length in tokenize()).

## Training too Slow

If training seems sluggish:

- Increase batch size.
- Use multiple GPUs, bought or rented (on runpod for example). The code provided here is ready for accelerate and can be used to train in multi-GPU settings: simply launch with `accelerate launch qlora.py` instead of `python qlora.py`.

## Bad Quality of the Final Model

The quality of your model is a reflection of your dataset’s quality. To improve model quality:

- Ensure your dataset is rich and relevant.
- Tune hyperparameters: learning_rate, epochs, rank r, lora_alpha.

# Wrap-Up

- Understand what you are doing. There are excellent training tools like axolotl which allow you to focus on dataset creation rather than writing your own padding function. Still, a solid grasp of the underlying mechanisms is invaluable. This knowledge empowers you to navigate complexities and troubleshoot with confidence.
- Incremental Approach: Begin with a basic example using a small dataset. Gradually scale up and adjust parameters incrementally to uncover their impact on model performance.
- Emphasize Data Quality: High-quality data is the cornerstone of effective training. Be innovative and diligent in assembling your dataset.


In [None]:
model = AutoModelForCausalLM.from_pretrained("Savianto/qlora-mistral")
tokenizer = AutoTokenizer.from_pretrained("Savianto/qlora-mistral")

In [None]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    tokenizer=tokenizer,
)

In [None]:
# Your question or prompt
prompt = "What is the capital of France?"

# Tokenize the input (convert to input IDs)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate a response from the model (on CPU)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=50, do_sample=True, temperature=0.7)

# Decode the generated tokens back to text
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Print the response
print(response)

In [None]:
from google.colab import drive
drive.mount('/content/drive')