# Fine-tuning DeepSeek R1 Distilled Llama 8B for AutoWare with Unsloth



In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install bitsandbytes unsloth_zoo


## Choose a Base Model

We use DeepSeek R1 Distilled Llama 8B as our base model. It offers a good balance between performance and efficiency for autonomous driving applications.

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
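The banner above reports a Tesla T4 with 14.741 GB of usable memory. A rough, weights-only estimate hints at why `load_in_4bit = True` helps on this GPU. The sketch below assumes about 8 billion parameters, treats the reported memory figure as GiB, and ignores activations, optimizer state, quantization constants, and any layers kept in higher precision, so the numbers are only indicative.

```python
# Rough weights-only memory estimate (assumptions: ~8e9 parameters,
# GPU memory figure interpreted as GiB, no overhead of any kind).
NUM_PARAMS = 8_000_000_000
GPU_MEMORY_GIB = 14.741  # value printed in the Unsloth banner
GIB = 2 ** 30


def weights_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


fp16_gib = weights_gib(NUM_PARAMS, 16)
int4_gib = weights_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {int4_gib:.1f} GiB")
print(f"16-bit weights fit on the T4: {fp16_gib < GPU_MEMORY_GIB}")
```

Under these assumptions the 16-bit weights alone would already exceed the T4's memory, while the 4-bit weights leave room for LoRA adapters and activations.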



## Inference before fine-tuning

In [3]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an autonomous driving expert with deep knowledge of AutoWare, ROS, and perception systems.
Please answer the following question about autonomous vehicles.

### Question:
{}

### Response:
<think>{}"""

In [6]:
# English: "Which data fields does the AccelerationCommand message contain?"
question = "AccelerationCommand消息中包含哪些数据字段？"

FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
好的，我现在要回答关于AccelerationCommand消息中包含哪些数据字段的问题。首先，我需要回想一下相关的知识点。Autonomous driving系统中，控制消息是关键的一部分，尤其是像AccelerationCommand这样的消息，它通常用于传递车辆的加速度指令。

我记得在ROS中，消息类型通常由一个清晰的结构体定义，包含多个字段。AccelerationCommand可能类似于其他控制指令，比如速度和转向指令，所以它可能包含加速度值、时间戳、是否启用制动等信息。

接下来，我应该查阅一下具体的消息定义。假设AccelerationCommand的定义包括加速度值、时间戳、是否有制动器启用、是否有牵引力限制、是否有速度限制、是否有转向辅助以及是否有紧急制动等字段。这些字段应该有助于车辆根据环境做出适当的反应，比如在制动时减速，加速度过大时限制牵引力，或者在紧急情况下立即制动。

然后，我需要将这些字段逐一列出，并解释每个字段的作用。比如，加速度值用于指示车辆需要的加速度，时间戳用于记录消息的发发时间，制动器启用与否则车辆能否使用制动系统，这样在需要时可以及时减速。

最后，确保所有列出的字段都是基于AccelerationCommand消息的定义，可能还需要参考相关的开发文档或代码来确认每个字段的存在和作用。如果有不确定的地方，可能需要进一步的资料查证或与开发者沟通，以确保信息的准确性。

总结一下，我的思考过程是：回忆相关知识，查阅消息定义，列出字段，解释每个字段的作用，并确保信息准确无误。
</think>

AccelerationCommand 消息中包含以下数据字段：

1. **加速度值 (acceleration)**：表示车辆需要的加速度，单位通常为 m/s²。
2. **时间戳 (timestamp)**：消息的发发时间戳，用于标识指令的时间。
3. **制动器启用 (brake)**：布尔值，表示是否启用制动系统。
4. **牵引力限制 (throttle)**：布尔值，表示是否限制牵引力以防止过快加速。
5. **速度限制 (speed_limit)**：车辆的最大允许速度，单位为 m/s。
6. **转向辅助 (steering_assist)**：布尔值，表示是否启用转向辅助功能。
7. **紧急制动 

## Prepare Dataset

We'll prepare an AutoWare-focused dataset to train our model. This dataset will include questions and answers related to autonomous driving and AutoWare specifically.

In [7]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are an autonomous driving expert with deep knowledge of AutoWare, ROS, and perception systems.
Please answer the following question about autonomous vehicles.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

### Important Notice

It's crucial to add the EOS (End of Sequence) token at the end of each training dataset entry, otherwise you may encounter infinite generations.

In [8]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples.get("input", [""] * len(instructions))  # Use empty string if no input
    outputs = examples["output"]

    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Combine instruction and input if input exists
        question = f"{instruction}\n{input_text}" if input_text else instruction
        # Empty CoT for now - in production you might want to generate these
        cot = ""
        text = train_prompt_style.format(question, cot, output) + EOS_TOKEN
        texts.append(text)

    return {
        "text": texts,
    }

In [9]:
import json
import os
from datasets import Dataset

# Option 1: Upload a dataset file directly (for Google Colab)
try:
    # Import inside the try block so the fallback below also works outside Colab
    from google.colab import files

    print("Please upload your AutoWare dataset in Alpaca format (JSON file)")
    uploaded = files.upload()
    json_file_name = list(uploaded.keys())[0]
    print(f"Successfully uploaded {json_file_name}")
except Exception as e:
    print(f"Upload failed or not running in Colab: {e}")
    json_file_name = "autoware_dataset.json"  # Default fallback name

# Option 2: Load from a local file if upload failed or we're not in Colab
if not os.path.exists(json_file_name):
    # Create a small sample dataset if no file is found
    print("No dataset file found. Creating a minimal sample dataset for demonstration.")
    sample_data = [
        {
            "instruction": "Explain the role of the planning module in AutoWare",
            "input": "",
            "output": "The planning module in AutoWare is responsible for determining the optimal path and driving behavior of the autonomous vehicle."
        },
        {
            "instruction": "How does AutoWare handle sensor fusion?",
            "input": "Consider LiDAR, camera, and radar integration",
            "output": "AutoWare handles sensor fusion through a sophisticated multi-level approach."
        }
    ]

    # Save the sample dataset
    with open(json_file_name, 'w') as f:
        json.dump(sample_data, f)
    print(f"Created sample dataset in {json_file_name}")

# Load the dataset from the JSON file
try:
    with open(json_file_name, 'r') as f:
        data = json.load(f)
    print(f"Successfully loaded dataset with {len(data)} examples")

    # Convert to Dataset format
    dataset = Dataset.from_dict({
        "instruction": [item["instruction"] for item in data],
        "input": [item.get("input", "") for item in data],  # Handle missing inputs
        "output": [item["output"] for item in data]
    })

    # Print dataset column names to verify
    print(f"Dataset columns: {dataset.column_names}")
    print(f"First example instruction: {dataset['instruction'][0][:100]}...")

except Exception as e:
    print(f"Error loading dataset: {e}")
    raise

Please upload your AutoWare dataset in Alpaca format (JSON file)


Saving datasets-1743318077377-alpaca-2025-04-08.json to datasets-1743318077377-alpaca-2025-04-08.json
Successfully uploaded datasets-1743318077377-alpaca-2025-04-08.json
Successfully loaded dataset with 5239 examples
Dataset columns: ['instruction', 'input', 'output']
First example instruction: 使用/api/local/command/acceleration接口发送加速度命令需要满足哪些操作模式条件？...


For the exported model to behave like a custom chatbot in `Ollama` and `llama.cpp`, every training example must follow the same instruction and response structure, so we apply the formatting function to the whole dataset.

In [10]:
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset["text"][0]  # Show first formatted example

Map:   0%|          | 0/5239 [00:00<?, ? examples/s]

'Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are an autonomous driving expert with deep knowledge of AutoWare, ROS, and perception systems.\nPlease answer the following question about autonomous vehicles.\n\n### Question:\n使用/api/local/command/acceleration接口发送加速度命令需要满足哪些操作模式条件？\n\n### Response:\n<think>\n\n</think>\n<think>首先需要明确使用该接口的核心目的是发送标准化加速度指令，这需要与系统的控制架构相匹配。关键在于理解该接口的设计定位：它专用于本地操作场景下的实时控制，因此对操作模式有特定限制。\n\n接着分析操作模式的匹配条件。由于加速度指令直接影响设备运动状态，必须确保系统当前处于允许手动控制的模式。这里的"手动控制"并非字面意义，而是指系统配置中专门划分的本地指令优先级模式，该模式确保外部指令能够绕过常规决策层直接作用于执行端。\n\n然后考虑模式的具体配置标准。系统手册中明确定义了不同操作模式的功能边界，只有符合手册中"本地手动控制"分类的模式才能激活此接口的指令通道。这涉及到权限验证、安全校验等多个子系统的协同状态，确保指令传输既实时又可靠。\n\n此外需要关注接口的技术实现方式。采用实时流方法意味着系统必须维持低延迟的数据通道，这对操作模式下的网络配置、资源分配都有严格要求。同时加速
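The formatted example above contains an empty `<think>\n\n</think>` pair followed by a second `<think>` block, which suggests that the dataset's `output` field already embeds its own reasoning. One possible way to avoid the duplicated tags, sketched here under that assumption, is to split the embedded reasoning from the final answer before filling the template. `split_cot` is a hypothetical helper and is not part of Unsloth or of this notebook.

```python
import re

# Matches an optional leading <think>...</think> block and the text after it.
THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*(.*)$", re.DOTALL)


def split_cot(output):
    """Return (reasoning, answer); reasoning is "" when no <think> block leads the text."""
    match = THINK_RE.match(output)
    if match is None:
        return "", output.strip()
    return match.group(1).strip(), match.group(2).strip()


cot, answer = split_cot("<think>step 1, step 2</think>\nfinal answer")
print(repr(cot), repr(answer))
```

Inside `formatting_prompts_func`, calling `cot, answer = split_cot(output)` and passing `cot` and `answer` to `train_prompt_style.format(...)` could replace the hard-coded `cot = ""`, so that each training example carries exactly one `<think>` block.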

## Train the model
First we attach LoRA adapters to the model, then we train it with Hugging Face TRL's `SFTTrainer`.

In [11]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [12]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        # num_train_epochs = 1, # For longer training runs!
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/5239 [00:00<?, ? examples/s]
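With `per_device_train_batch_size = 2`, `gradient_accumulation_steps = 4`, and `max_steps = 60`, this run only sees part of the 5,239-example dataset. The short calculation below makes that explicit; the single-GPU count is taken from the Unsloth banner, and the variable names are illustrative.

```python
# Values copied from the TrainingArguments above and the dataset size reported earlier.
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 5239

effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples

print(effective_batch, examples_seen, f"{epoch_fraction:.1%}")
# prints: 8 480 9.2%
```

For a full pass over the data, `num_train_epochs = 1` can be set instead of `max_steps`, as the commented-out line in the configuration hints.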

In [13]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,239 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!
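The banner above reports 41,943,040 trainable parameters. That figure can be reproduced from the LoRA settings (`r = 16`, seven target modules, 32 layers) if we assume the usual Llama 3.1 8B layer shapes. Those shapes (hidden size 4096, MLP size 14336, key/value width 1024) are not printed anywhere in this notebook, so treat them as an assumption.

```python
# Assumed Llama-3.1-8B layer shapes (not shown in the notebook output).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024      # 8 key/value heads x 128 dimensions per head
NUM_LAYERS = 32
RANK = 16          # the `r` passed to get_peft_model

# (in_features, out_features) of every LoRA target module in one layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank)
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")
# prints: 41,943,040
```

Because the count scales linearly with `r`, doubling the rank to 32 would double the number of trainable parameters under the same assumptions.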


| Step | Training Loss |
|------|---------------|
| 1 | 2.5929 |
| 2 | 2.6006 |
| 3 | 2.4486 |
| 4 | 2.6224 |
| 5 | 2.2785 |
| 6 | 2.2016 |
| 7 | 2.0506 |
| 8 | 1.7673 |
| 9 | 1.9282 |
| 10 | 1.8518 |


## Inference after fine-tuning

Let's run inference with the same question again and compare the result.

In [14]:
print(question)

AccelerationCommand消息中包含哪些数据字段？


In [15]:
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>

</think>
<think>首先需要明确问题的核心是分析AccelerationCommand消息的数据结构。消息的定义由两个部分组成：消息类型声明和字段列表。观察消息类型定义部分，发现其采用标准的ROS消息规范，使用std_msgs/Float64的数据类型。

接着分析字段列表，第一条字段为command，类型为std_msgs/Float64，通常用于存储具体的加速指令值。第二条字段为timestamp，同样采用std_msgs/Float64的数据类型，用于记录命令的时间戳。两个字段均使用相同的数据类型，表明它们的数据类型设计一致性。

然后需要验证数据类型的兼容性。消息类型声明中的两个字段均采用std_msgs/Float64，这是ROS中常用的浮点数数据类型，具备数值计算和通信适配性。由于两个字段使用相同的数据类型，系统在数据存储和传输过程中可以保持数据格式的一致性，避免因类型不匹配导致的数据解析错误。

最后需要确认数据结构的完整性。消息类型的字段列表包含两个数据字段，分别对应加速值和时间戳两种核心信息。这种结构设计既满足消息传输的基本需求，又通过标准数据类型实现了数据的可扩展性和兼容性。</think>
AccelerationCommand消息中包含以下两个数据字段：
1. command：类型为std_msgs/Float64，用于存储具体的加速指令值
2. timestamp：类型为std_msgs/Float64，用于记录命令的时间戳

两个字段均采用std_msgs/Float64的数据类型，表明它们的数据类型设计一致，确保数据存储和传输时的兼容性。<｜end▁of▁sentence｜>
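The response above still carries its `<think>` reasoning and the `<｜end▁of▁sentence｜>` marker. When only the final answer is needed, a small post-processing step can strip both. `final_answer` below is a hypothetical helper, and it assumes the answer is whatever follows the last `</think>` tag.

```python
EOS_MARKER = "<｜end▁of▁sentence｜>"  # marker as it appears in the output above


def final_answer(generated, eos_token=EOS_MARKER):
    """Return the text after the last </think> tag with the EOS marker removed."""
    answer = generated.rsplit("</think>", 1)[-1]
    return answer.replace(eos_token, "").strip()


raw = "<think>\n\n</think>\n<think>reasoning</think>\nThe answer." + EOS_MARKER
print(final_answer(raw))
# prints: The answer.
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop the EOS marker during decoding.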


In [16]:
# Merge the LoRA weights into the base model
merged_model = model.merge_and_unload()

# Create the output directory
output_dir = "autoware_model_full"
os.makedirs(output_dir, exist_ok=True)

# Save the merged model
merged_model.save_pretrained(output_dir, safe_serialization=True)
# Save the tokenizer
tokenizer.save_pretrained(output_dir)

print(f"Fully merged model saved to: {output_dir}")



Fully merged model saved to: autoware_model_full


## Upload Model to Hugging Face (optional)



Now, let's save our fine-tuned model and upload it to Hugging Face.

### Save the fine-tuned model to GGUF format

Choose the llama.cpp GGUF quantization you prefer by setting the corresponding `if` to `True`.

In [None]:
from google.colab import userdata

HUGGINGFACE_TOKEN = userdata.get('HUGGINGFACE_TOKEN')

In [None]:
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("autoware_model", tokenizer,)

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("autoware_model_f16", tokenizer, quantization_method = "f16")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("autoware_model", tokenizer, quantization_method = "q4_k_m")

### Push the model to HuggingFace

Create a model repository on Hugging Face if you haven't done so already.

In [None]:
from huggingface_hub import create_repo
create_repo("your-username/autoware-model", token=HUGGINGFACE_TOKEN, exist_ok=True)

In [None]:
model.push_to_hub_gguf("your-username/autoware-model", tokenizer, token=HUGGINGFACE_TOKEN)

### Run the Hugging Face model with Ollama

```bash
ollama run hf.co/your-username/autoware-model:q8_0
```

### Ollama inference

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/your-username/autoware-model:q8_0",
  "messages": [
    { "role": "user", "content": "What are the main components of the AutoWare perception stack and how do they interact with each other?" }
  ]
}'
```
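The same chat request can be issued from Python using only the standard library. This is a minimal sketch: the model name is a placeholder assumed to match the tag used in the `ollama run` command above, `build_chat_request` and `ask` are hypothetical helper names, and `"stream": False` asks Ollama for a single JSON reply instead of a stream.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL_NAME = "hf.co/your-username/autoware-model:q8_0"  # placeholder, adjust to your repo


def build_chat_request(question, model=MODEL_NAME):
    """Build a POST request carrying a single-turn chat payload."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,  # one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send the question to a locally running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(question)) as response:
        return json.loads(response.read())["message"]["content"]
```

Calling `ask("What does the planning module do?")` requires the Ollama server to be running locally with the model already pulled.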

## Dataset Format for Creating Your Own Data

The training data should be in Alpaca JSON format as shown below:

```json
[
  {
    "instruction": "Explain the role of the planning module in AutoWare",
    "input": "",
    "output": "The planning module in AutoWare is responsible for determining the optimal path and driving behavior..."
  },
  {
    "instruction": "How does AutoWare handle sensor fusion?",
    "input": "Consider LiDAR, camera, and radar integration",
    "output": "AutoWare handles sensor fusion through a sophisticated multi-level approach..."
  }
]
```

Save this in a file named `autoware_dataset.json` to use with this notebook.
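Before training on your own file, it can help to check that every record has the fields the loading cell expects. The validator below is a minimal sketch; `validate_alpaca` is a hypothetical helper, and its rules (non-empty `instruction` and `output`, optional string `input`) are inferred from how this notebook reads the data.

```python
import json

REQUIRED_KEYS = ("instruction", "output")


def validate_alpaca(records):
    """Return a list of problems found; an empty list means the data looks usable."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for index, item in enumerate(records):
        if not isinstance(item, dict):
            problems.append(f"record {index}: not an object")
            continue
        for key in REQUIRED_KEYS:
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {index}: missing or empty '{key}'")
        # "input" is optional, but must be a string when present
        if not isinstance(item.get("input", ""), str):
            problems.append(f"record {index}: 'input' must be a string")
    return problems


sample = json.loads(
    '[{"instruction": "Q", "input": "", "output": "A"}, {"instruction": "Q only"}]'
)
print(validate_alpaca(sample))
# prints: ["record 1: missing or empty 'output'"]
```

For a real file, load it with `json.load(open("autoware_dataset.json", encoding="utf-8"))` and pass the result to `validate_alpaca`.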