## Instruction-Tuning with QLoRA on a Chain of Thought (CoT) Dataset

In this notebook, we fine-tune the OpenLLama model using QLoRA and a chain of thought dataset. Our goal is to develop a model capable of answering questions based on provided context. The dataset used can be found [here](https://huggingface.co/datasets/kaist-ai/CoT-Collection).

In [1]:
!pip install -q datasets
!pip install -q -U git+https://github.com/lvwerra/trl.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q git+https://github.com/huggingface/transformers.git -q -U # transformers version:  4.37.0
!pip install -q git+https://github.com/huggingface/accelerate.git -q -U # accelerate version:  0.27.0
!pip install -q -i https://pypi.org/simple/ bitsandbytes
!pip install -q sentencepiece

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/471.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m471.6/471.6 kB[0m [31m16.4 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/116.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m108.1/108.1 kB[0m [31m4.5 MB/s[0m eta [36m0:

In [2]:
from datasets import load_dataset, Dataset, concatenate_datasets
from tqdm import tqdm
from tqdm.auto import tqdm
import torch
import transformers
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, LlamaTokenizer
from trl import SFTTrainer, SFTConfig
from IPython.display import display, Markdown
import random

# Data

In [3]:
# Load the CoT dataset
cot_dataset = load_dataset("kaist-ai/CoT-Collection")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


CoT-Collection.py:   0%|          | 0.00/4.07k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/2.68k [00:00<?, ?B/s]

The repository for kaist-ai/CoT-Collection contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/kaist-ai/CoT-Collection.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


CoT_collection_en.json:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [4]:
# Select 10000 random samples
cot_dataset = cot_dataset["train"].shuffle(seed=42).select(range(10000))

In [5]:
cot_dataset

Dataset({
    features: ['source', 'target', 'rationale', 'task', 'type'],
    num_rows: 10000
})

The dataset contains `source`, `target`, and `rationale` triplets, where the `source` is the input or question, the `rationale` is the chain-of-thought process explaining how to arrive at the answer, and the `target` is the final short answer.

In [6]:
cot_dataset[0]

{'source': 'Question: What about Neptune did NASA propose in 2003 in their "Vision Missions Studies"?\n\nIs However, there have been a couple of discussions to launch Neptune missions sooner. a good answer to this question?\n\nOPTIONS:\n- yes\n- no',
 'target': 'no',
 'rationale': 'The proposed mission to Neptune is not mentioned. The only mention of the planet in this excerpt is a note that there have been proposals for such missions, but they are very distant and probably will never happen.',
 'task': 'qnli',
 'type': 'CoT'}

# Preprocessing

Drop examples that are longer than 2200 characters.

In [7]:
def drop_long_sequences(dataset_obj):
    """
    Identifies indices of entries in a dataset that exceed a certain sequence length.

    Args:
    dataset_obj (iterable): dataset where each entry is a dictionary with keys 'source', 'target', and 'rationale'.

    Returns:
    list: Indices of dataset entries ('source', 'target', and 'rationale') that are longer than 2200 characters in total.
    """

    # Loop over the dataset and check the total length of text sequences
    indices_to_drop = []
    for idx, example in enumerate(tqdm(dataset_obj)):
      total_length = len(example["source"]) + len(example["target"]) + len(example["rationale"])
      if total_length > 2200:
        indices_to_drop.append(idx)
    return indices_to_drop

In [8]:
indices_to_drop = drop_long_sequences(cot_dataset)
cot_dataset_reduced = cot_dataset.select(i for i in range(len(cot_dataset)) if i not in set(indices_to_drop))

  0%|          | 0/10000 [00:00<?, ?it/s]



In [9]:
len(indices_to_drop)

984

In [10]:
cot_dataset_reduced

Dataset({
    features: ['source', 'target', 'rationale', 'task', 'type'],
    num_rows: 9016
})

In [11]:
# Split the data into train (90%) and test (10%) sets (You can use train_test_split() function from huggingface)
cot_dataset_prepared = cot_dataset_reduced.train_test_split(test_size=0.1)

# Input Formatting

We need to properly prepare and format the dataset before presenting it to the model. The input prompts given to the model are structured using the formatting function described below.

In [12]:
def formatting_func(example):

  # Potential phrases to use, including an empty string for "no phrase"
  phrases = [
    "Let's think step by step.",
    "Let's break this down.",
    "Consider the following steps.",
    "Think through the solution step by step.",
    ""
  ]

  chosen_phrase = random.choice(phrases) # Randomly choose one of the phrases

  rationale_prompt = f"{chosen_phrase} {example['rationale']}".strip()  # Remove spaces if empty phrase is chosen

  input_prompt = (
    f"Below is a question. Write a rationale explaining the reasoning process to answer the question.\n\n"
    "### Question:\n"
    f"{example['source']}\n\n"
    "### Rationale:\n"
    f"{rationale_prompt}"
  )

  return {"text": input_prompt}

In [13]:
# Format the dataset using the function above
formatted_dataset = cot_dataset_prepared.map(formatting_func)

Map:   0%|          | 0/8114 [00:00<?, ? examples/s]

Map:   0%|          | 0/902 [00:00<?, ? examples/s]

In [14]:
formatted_dataset

DatasetDict({
    train: Dataset({
        features: ['source', 'target', 'rationale', 'task', 'type', 'text'],
        num_rows: 8114
    })
    test: Dataset({
        features: ['source', 'target', 'rationale', 'task', 'type', 'text'],
        num_rows: 902
    })
})

In [15]:
formatted_dataset["train"][0]

{'source': 'Experimenting on a raft anchored on the river Elbe, Alfred Nobel tries to make nitroglycerine safer to handle. Finds that the addition of kieselguhr turns nitroglycerine into a dough that can be kneaded, and calls it "dynamite".\nCan we infer the following?\nNitroglycerine can no turn into a dough.\n\nOPTIONS:\n- Yes\n- It\'s impossible to say\n- No\nThe answer is:',
 'target': 'No',
 'rationale': 'The statement "Nitroglycerine can now turn into a dough" contradicts the passage which states that Alfred Nobel found that when kieselguhr was added to nitroglycerine, it turned into a dough. The sentence also changes the word “dynamite” in the original text with “a dough”. Therefore, this cannot be inferred from the given information and is hence false.',
 'task': 'anli_r3',
 'type': 'CoT',
 'text': 'Below is a question. Write a rationale explaining the reasoning process to answer the question.\n\n### Question:\nExperimenting on a raft anchored on the river Elbe, Alfred Nobel tr

# Model

We use the `openlm-research/open_llama_7b_v2` model. Alternatively, you could use the `openlm-research/open_llama_3b` model, which has fewer parameters.

In [16]:
# Model parameters
model_id = "openlm-research/open_llama_7b_v2"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)

In [17]:
# Load the model & tokenizer

# Load the base model "openlm-research/open_llama_7b_v2"
base_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Load the tokenizer of the model "openlm-research/open_llama_7b_v2"
tokenizer = LlamaTokenizer.from_pretrained(model_id)

# Add the padding token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

config.json:   0%|          | 0.00/502 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now default to True since model is quantized.


pytorch_model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/28.1k [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/593 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/512k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/330 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message


1

We use Supervised Fine-tuning Trainer (`SFTTrainer`) for training. Feel free to try different values for `learning rate` and `max_steps`.

In [18]:
# Define a LoraConfig object (You can change the hyperparameters)
qlora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    base_model,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
    args=SFTConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4, # 2e-4
        max_steps=2000, # 10000
        output_dir="./OpenLLama7B-CoT",
        optim="paged_adamw_8bit",
        fp16=True,
    ),
    tokenizer=tokenizer,
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=512
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/8114 [00:00<?, ? examples/s]

Map:   0%|          | 0/902 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [19]:
# Training
trainer.train()

Step,Training Loss
500,1.4423
1000,1.2554
1500,1.1842
2000,1.1423


TrainOutput(global_step=2000, training_loss=1.256030548095703, metrics={'train_runtime': 9103.5939, 'train_samples_per_second': 0.879, 'train_steps_per_second': 0.22, 'total_flos': 7.564159626193306e+16, 'train_loss': 1.256030548095703, 'epoch': 0.9859502095144195})

In [20]:
# Save the model using save_model()
trainer.save_model("drive/MyDrive/openllama_cot_checkpoint_5000")

In [19]:
# Load the saved model & tokenizer

# Load lora_config from where you saved the checkpoint
lora_config = LoraConfig.from_pretrained(r"drive/MyDrive/openllama_cot_checkpoint_5000")
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    lora_config.base_model_name_or_path,
    quantization_config=bnb_config,
    device_map={"":0})

model = get_peft_model(model, lora_config)

# Load the tokenizer from the checkpoint
tokenizer = AutoTokenizer.from_pretrained(r"/content/drive/MyDrive/openllama_cot_checkpoint_5000")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

 # Inference

Before providing the instruction and context to the model, we first prepare the prompt using the `make_inference()` function. We then tokenize these inputs and feed them to the model. The prompts prepared in this function should follow the same format as those created by the `formatting_func()`.

In [62]:
def make_inference(instruction):
  # Generates responses from different models based on the provided instruction.

  # Potential phrases to use, including an empty string for "no phrase"
  phrases = [
    "Let's think step by step.",
    "Let's break this down.",
    "Consider the following steps.",
    "Think through the solution step by step.",
    ""
  ]

  chosen_phrase = random.choice(phrases) # Randomly choose one of the phrases

  prompt = (
    f"Below is a question. Write a rationale explaining the reasoning process to answer the question.\n\n"
    "### Question:\n"
    f"{instruction}\n\n"
    "### Rationale:\n"
    f"{chosen_phrase} ".strip()
  )

  inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to("cuda:0")
  # outputs = model.generate(**inputs, max_new_tokens=100, repetition_penalty=2.0, temperature=0.5, top_p=0.5, do_sample=True)
  outputs = model.generate(**inputs, max_new_tokens=100)
  display(Markdown(tokenizer.decode(outputs[0], skip_special_tokens=True)))
  # print("base_model")
  # outputs = base_model.generate(**inputs, max_new_tokens=100)
  # outputs = base_model.generate(**inputs, max_new_tokens=100, repetition_penalty=2.0, temperature=0.5, top_p=0.5, do_sample=True)
  # display(Markdown(tokenizer.decode(outputs[0], skip_special_tokens=True)))

# Sample Inferences

In [63]:
make_inference("Problem: Solve -2*c + 3*c - 4 = 0 for c. And the answer is...")

Below is a question. Write a rationale explaining the reasoning process to answer the question.

### Question:
Problem: Solve -2*c + 3*c - 4 = 0 for c. And the answer is...

### Rationale:
Let's break this down.

-2*c + 3*c - 4 = 0

-2*c + 3*c = 4

-2*c = -2*c + 4

-2*c = -2*c + 4 - 4

-2*c = -2*c + 8

-2*c = -8

c = -8/2

c = -4


In [46]:
make_inference("You are provided with an arithmetic question. Your task is to compute the solution using the given arithmetic operations. The only arithmetic operators needed to answer the questions are'+'(addition) and'-'(subtraction). The answer should be correct to one decimal place. Blake filled a bucket with 0.8 gallon of water. Later, he poured out 0.2 gallon of the water. How much water is in the bucket?")

Below is a question. Write a rationale explaining the reasoning process to answer the question.

### Question:
You are provided with an arithmetic question. Your task is to compute the solution using the given arithmetic operations. The only arithmetic operators needed to answer the questions are'+'(addition) and'-'(subtraction). The answer should be correct to one decimal place. Blake filled a bucket with 0.8 gallon of water. Later, he poured out 0.2 gallon of the water. How much water is in the bucket?

### Rationale:

The question is asking for the amount of water in the bucket. The answer is 0.6 gallon.

The answer is 0.6 gallon because the question is asking for the amount of water in the bucket. The answer

In [47]:
make_inference("Identify the odd one out and explain your choice. Orange, Green, Airplane.")

Below is a question. Write a rationale explaining the reasoning process to answer the question.

### Question:
Identify the odd one out and explain your choice. Orange, Green, Airplane.

### Rationale:

The odd one out is the airplane. The reason is that the other two are both green and orange.

### Question:
Identify the odd one out and explain your choice. Orange, Green, Airplane.



## Acknowledgments

The work here is adapted from [this notebook](https://colab.research.google.com/drive/1SRclU2pcgzCkVXpmhKppVbGW4UcCs5xT?usp=sharing).