<a href="https://colab.research.google.com/github/duanzhihua/-transformer-english2chinese-/blob/main/Fine_Tuning_Llama_3_2_to_Think_Like_DeepSeek_R1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*More details in this article: [Fine-Tuning Your LLM to "Think" Like DeepSeek R1, on Your Computer](https://kaitchup.substack.com/p/fine-tuning-your-llm-to-think-like-r1)*

This notebook shows how to fine-tune Llama 3.2 to "think" like DeepSeek-R1. It reconfigures the tokenizer to use special "think" tokens, then fine-tunes the model on data generated by R1.


# Installation

In [None]:
!pip install --upgrade transformers bitsandbytes peft accelerate datasets trl flash_attn

Collecting transformers
  Downloading transformers-4.48.2-py3-none-any.whl.metadata (44 kB)
     44.4/44.4 kB 1.9 MB/s eta 0:00:00
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.1-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting accelerate
  Downloading accelerate-1.3.0-py3-none-any.whl.metadata (19 kB)
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting trl
  Downloading trl-0.14.0-py3-none-any.whl.metadata (12 kB)
Collecting flash_attn
  Downloading flash_attn-2.7.4.post1.tar.gz (6.0 MB)
     6.0/6.0 MB 74.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp

# Configuration

In [None]:
from datasets import load_dataset
import torch, multiprocessing, sys
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig
from trl import SFTConfig, SFTTrainer


compute_dtype = torch.bfloat16
attn_implementation = 'flash_attention_2'

# Add the Special "think" Tokens to the Tokenizer

In [None]:
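# Tokenizer of the Instruct model; its chat template is applied during preprocessing
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.pad_token_id = 128004
tokenizer.padding_side = 'right'

# Intended to repurpose token IDs 128011 and 128012 for the "think" tags.
# Caveat: with fast tokenizers, `tokenizer.vocab` may return a freshly built dict,
# so these assignments may not change how text is tokenized. Check with
# tokenizer.convert_tokens_to_ids('<think>') before relying on them.
tokenizer.vocab[128011] = '<think>'
tokenizer.vocab[128012] = '</think>'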

# Preprocessing the Dataset

In [None]:
ds = load_dataset("cognitivecomputations/dolphin-r1", 'reasoning-deepseek', split='train[:30000]').train_test_split(test_size=0.1)

# Add the assistant's reasoning and answer to the messages column
def process(row):
  assistant_message = "<think>"+row['reasoning']+"</think>\n\n"+row['answer']
  row['messages'].append({'role': 'assistant', 'content': assistant_message})
  # Render the whole conversation as a single training string
  row['text'] = tokenizer.apply_chat_template(row['messages'], tokenize=False)
  return row

ds['train'] = ds['train'].map(
    process,
    num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False,
)

ds['test'] = ds['test'].map(
    process,
    num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False,
)

Map (num_proc=12):   0%|          | 0/27000 [00:00<?, ? examples/s]

Map (num_proc=12):   0%|          | 0/3000 [00:00<?, ? examples/s]
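
To make the transformation in `process` concrete, here is a minimal sketch on a single toy row. The row contents and the `render_chat` helper are hypothetical stand-ins: the notebook itself calls `tokenizer.apply_chat_template`, whose Llama 3.2 output looks different. Only the way the assistant message is assembled mirrors the cell above.

```python
# Toy stand-in for one dataset row; field names mirror those used in process().
row = {
    "messages": [{"role": "user", "content": "What is 2 + 3?"}],
    "reasoning": "Adding 2 and 3 gives 5.",
    "answer": "5",
}


def build_assistant_message(reasoning: str, answer: str) -> str:
    """Wrap the reasoning trace in think tags and append the final answer."""
    return "<think>" + reasoning + "</think>\n\n" + answer


def render_chat(messages: list) -> str:
    """Hypothetical, simplified chat template (NOT the Llama 3.2 template)."""
    return "".join(f"[{m['role']}]\n{m['content']}\n" for m in messages)


# Append the assistant turn, then flatten the conversation into one string.
messages = row["messages"] + [
    {
        "role": "assistant",
        "content": build_assistant_message(row["reasoning"], row["answer"]),
    }
]
text = render_chat(messages)
print(text)
```

The model is then trained on `text`, so it learns to emit the reasoning between the think tags before the answer.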

# Fine-Tuning Code

A single function that performs full, LoRA, or QLoRA fine-tuning:

In [None]:


def fine_tune(model_name, batch_size=1, gradient_accumulation_steps=32, LoRA=False, QLoRA=False):


  if QLoRA:
    bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=compute_dtype,
            bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
              model_name, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
    )
    model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={'use_reentrant':True})
  else:
    model = AutoModelForCausalLM.from_pretrained(
              model_name, device_map={"": 0}, torch_dtype=compute_dtype, #attn_implementation=attn_implementation
    )
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant':True})



  if LoRA or QLoRA:
    peft_config = LoraConfig(
            lora_alpha=16,
            lora_dropout=0.05,
            r=16,
            bias="none",
            task_type="CAUSAL_LM",
            target_modules= ['k_proj', 'o_proj','q_proj', 'v_proj', 'up_proj', 'down_proj', 'gate_proj'],
            modules_to_save=["lm_head","embed_tokens"],
    )
  else:
    peft_config = None

  if LoRA:
    output_dir = "./LoRA/"
  elif QLoRA:
    output_dir = "./QLoRA/"
  else:
    output_dir = "./FFT/"

  training_arguments = SFTConfig(
          output_dir=output_dir,
          eval_strategy="steps",
          do_eval=True,
          optim="adamw_8bit",
          per_device_train_batch_size=batch_size,
          gradient_accumulation_steps=gradient_accumulation_steps,
          per_device_eval_batch_size=batch_size,
          log_level="debug",
          save_strategy="epoch",
          logging_steps=25,
          learning_rate=1e-5,
          bf16 = True,
          eval_steps=25,
          num_train_epochs=1,
          warmup_ratio=0.1,
          lr_scheduler_type="linear",
          dataset_text_field="text",
          max_seq_length=1024,
          report_to='none'
  )

  trainer = SFTTrainer(
          model=model,
          train_dataset=ds['train'],
          eval_dataset=ds['test'],
          peft_config=peft_config,
          processing_class=tokenizer,
          args=training_arguments,
  )

  #--code by Unsloth: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=pCqnaKmlO1U9

  gpu_stats = torch.cuda.get_device_properties(0)
  start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
  max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
  print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
  print(f"{start_gpu_memory} GB of memory reserved.")

  trainer_ = trainer.train()


  used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
  used_memory_for_trainer = round(used_memory - start_gpu_memory, 3)
  used_percentage = round(used_memory / max_memory * 100, 3)
  trainer_percentage = round(used_memory_for_trainer / max_memory * 100, 3)
  print(f"{trainer_.metrics['train_runtime']} seconds used for training.")
  print(f"{round(trainer_.metrics['train_runtime']/60, 2)} minutes used for training.")
  print(f"Peak reserved memory = {used_memory} GB.")
  print(f"Peak reserved memory for training = {used_memory_for_trainer} GB.")
  print(f"Peak reserved memory % of max memory = {used_percentage} %.")
  print(f"Peak reserved memory for training % of max memory = {trainer_percentage} %.")
  print("-----")
  #----
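
The `LoraConfig` in this function trains more than the low-rank adapters, because `modules_to_save` keeps full trainable copies of `lm_head` and `embed_tokens`. The sketch below estimates the resulting number of trainable parameters. The model dimensions are assumptions about Llama 3.2 3B that the notebook does not state; with them, the estimate matches the 812,318,720 trainable parameters reported in the training log of the LoRA example.

```python
# Assumed Llama 3.2 3B dimensions (not stated in the notebook).
HIDDEN = 3072          # hidden size
INTERMEDIATE = 8192    # MLP intermediate size
NUM_LAYERS = 28        # decoder layers
KV_DIM = 1024          # 8 key/value heads x head dimension 128
VOCAB = 128256         # vocabulary size
RANK = 16              # r in LoraConfig

# (in_features, out_features) of each module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds matrix A (rank x in) and matrix B (out x rank)."""
    return rank * (in_features + out_features)


# Low-rank adapters on every targeted projection of every layer.
adapter_params = NUM_LAYERS * sum(
    lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values()
)
# Full trainable copies of embed_tokens and lm_head (modules_to_save).
saved_module_params = 2 * VOCAB * HIDDEN
total = adapter_params + saved_module_params

print(f"adapters:        {adapter_params:,}")
print(f"modules_to_save: {saved_module_params:,}")
print(f"total trainable: {total:,}")
```

Under these assumptions, roughly 97% of the trainable parameters come from the two saved modules rather than from the adapters, which is worth keeping in mind when reading the memory figures.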

## Example with LoRA Fine-Tuning

In [None]:
fine_tune("meta-llama/Llama-3.2-3B", batch_size=2, gradient_accumulation_steps=16, LoRA=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Map:   0%|          | 0/27000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Using auto half precision backend
Currently training with a batch size of: 2
***** Running training *****
  Num examples = 27,000
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 16
  Total optimization steps = 843
  Number of trainable parameters = 812,318,720


GPU = NVIDIA L4. Max memory = 22.161 GB.
8.322 GB of memory reserved.


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


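| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25 | 1.612 | 1.622329 |
| 50 | 1.6144 | 1.597241 |
| 75 | 1.5465 | 1.499028 |
| 100 | 1.428 | 1.360236 |
| 125 | 1.2903 | 1.257095 |
| 150 | 1.2314 | 1.209658 |
| 175 | 1.1908 | 1.181761 |
| 200 | 1.1766 | 1.162012 |
| 225 | 1.1444 | 1.148058 |
| 250 | 1.1409 | 1.136617 |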
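
The step numbers in this table count optimizer updates, not individual batches. As a rough sketch of where the logged totals come from, the arithmetic below recomputes the effective batch size and the number of optimization steps from the arguments of the `fine_tune` call. It assumes a single GPU and that a trailing partial accumulation group is dropped, which is consistent with the 843 steps in the log. The warmup computation assumes the warmup length is the ceiling of `warmup_ratio` times the total number of steps.

```python
import math

num_examples = 27_000     # size of the training split
per_device_batch = 2      # batch_size passed to fine_tune
grad_accum = 16           # gradient_accumulation_steps passed to fine_tune
warmup_ratio = 0.1        # from SFTConfig

# Examples contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum

# Batches per epoch, then optimizer updates per epoch (one epoch in this run).
batches_per_epoch = math.ceil(num_examples / per_device_batch)
optimizer_steps = batches_per_epoch // grad_accum

# Assumed rule for the warmup length (see the lead-in).
warmup_steps = math.ceil(optimizer_steps * warmup_ratio)

print(effective_batch, batches_per_epoch, optimizer_steps, warmup_steps)
```

With a warmup of this length, the learning rate is still ramping up during the first few rows of the table, which may explain why the loss barely moves before step 75.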



***** Running Evaluation *****
  Num examples = 3000
  Batch size = 2

44953.7426 seconds used for training.
749.23 minutes used for training.
Peak reserved memory = 16.846 GB.
Peak reserved memory for training = 8.524 GB.
Peak reserved memory % of max memory = 76.016 %.
Peak reserved memory for training % of max memory = 38.464 %.
-----
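
The memory figures printed here can be recomputed from three numbers: the memory reserved before training, the peak reserved memory, and the total GPU memory. The sketch below repeats the arithmetic of the reporting code with the logged values. The `memory_report` helper is a hypothetical name introduced for illustration.

```python
def memory_report(start_gb: float, peak_gb: float, total_gb: float) -> dict:
    """Reproduce the memory arithmetic used in the fine-tuning function."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


# Values taken from the log: 8.322 GB reserved before training,
# 16.846 GB peak, 22.161 GB total on the NVIDIA L4.
report = memory_report(start_gb=8.322, peak_gb=16.846, total_gb=22.161)
print(report)
```

Note that the 8.322 GB reserved before training already includes the loaded model, so the "for training" figure covers only what the training loop adds on top of it.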


# Testing the Adapter

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

compute_dtype = torch.bfloat16
attn_implementation = 'flash_attention_2'

# For an unknown reason, SFTTrainer does not save the tokenizer with our custom tokens, so we rebuild it here
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.pad_token_id = 128004

tokenizer.vocab[128011] = '<think>'
tokenizer.vocab[128012] = '</think>'
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map={"": 0},
    attn_implementation=attn_implementation,
    torch_dtype=torch.bfloat16,
)

# Note: fine_tune() above writes LoRA checkpoints to "./LoRA/"; adjust this path to match your run
model = PeftModel.from_pretrained(model, "./LoRA_R1/checkpoint-843/")

In [None]:
prompt = [{'role':'system', 'content':"You are a helpful assistant and you know a lot about rabbits. Think before answering!"},
    {'role':'user', 'content':"What is the maximum number of carrots a rabbit can eat, theoritically, in a day?"}
    ]

prompt = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
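# The chat template already adds <|begin_of_text|>; skipping special tokens here avoids
# the duplicate BOS token that is visible in the recorded output
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, add_special_tokens=False).to('cuda')
# temperature only has an effect when sampling is enabled
output = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=1024)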
print(tokenizer.decode(output[0], skip_special_tokens=False))
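
Since the fine-tuned model writes its reasoning between think tags before the answer, it can be convenient to separate the two after decoding. The helper below is a minimal sketch, not part of the original notebook. The `split_reasoning` name and the sample string are hypothetical, and the helper assumes the tags appear as plain text in the decoded output.

```python
import re


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer) from a completion that may contain think tags."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        # No closed think block, e.g. generation stopped at max_new_tokens
        # while the model was still reasoning.
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


# Hypothetical completion, used only to exercise the helper.
sample = "<think>Step one.\nStep two.</think>\n\nFinal answer."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

When the closing tag is missing, the helper returns an empty reasoning string and the raw text, so a truncated generation is easy to detect.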

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 31 Jan 2025

You are a helpful assistant and you know a lot about rabbits. Think before answering!<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the maximum number of carrots a rabbit can eat, theoritically, in a day?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<think>Okay, so I need to figure out the maximum number of carrots a rabbit can eat in a day. Hmm, let's think about this. A rabbit's diet typically consists of grass and carrots. Carrots are a major part of their diet, but how much? 

First, rabbits are herbivores, so they eat mostly plants. Carrots are a root vegetable that's high in fiber and vitamins. Let me recall some rabbit facts. On average, a rabbit can eat about 1-2 pounds of food per day. So, if a rabbit eats mostly carrots, how many pounds of carrots per day? 

Let me break this down step by step. Let's assume that a