## Install Packages

In [1]:
# Install (or upgrade) every required package in a single step
!pip install -U unsloth transformers datasets accelerate bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-
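
Because the install log above is long, it can be easier to confirm what actually got installed by querying package metadata directly. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the installed packages.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment
            versions[name] = None
    return versions


# Report the packages this notebook relies on
required = ["unsloth", "transformers", "datasets", "accelerate", "bitsandbytes"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'not installed'}")
```

If any entry prints `not installed`, rerun the install cell and restart the runtime before continuing.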

## Base Model Loading + LoRA Adapter

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
from huggingface_hub import notebook_login
notebook_login()
# Paste your Hugging Face access token when prompted


In [4]:
from huggingface_hub import snapshot_download

# Local directory (on the mounted Drive) where the model will be stored
model_dir = "/content/drive/MyDrive/285_project/DeepSeek-R1-Distill-Llama-8B"

# Download the model snapshot and save it to that directory
snapshot_download(repo_id="unsloth/DeepSeek-R1-Distill-Llama-8B", local_dir=model_dir, local_dir_use_symlinks=False)

print(f"Model saved to: {model_dir}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/959 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

Model saved to: /content/drive/MyDrive/285_project/DeepSeek-R1-Distill-Llama-8B
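
As a sanity check on the download, the four shard sizes shown in the progress output can be summed and compared with what an 8B-parameter model in 16-bit precision should occupy. This is a rough back-of-the-envelope sketch: it assumes the `G` in the progress bars means 10^9 bytes and that every weight is stored in 2 bytes, neither of which is stated in the output itself.

```python
# Shard sizes in GB, copied from the download progress output above
shard_sizes_gb = [4.98, 5.00, 4.92, 1.17]

total_gb = sum(shard_sizes_gb)

# Assumption: weights are stored as 16-bit values (2 bytes per parameter)
bytes_per_param = 2
approx_params_billion = total_gb / bytes_per_param

print(f"Total weight files: {total_gb:.2f} GB")
print(f"Implied parameter count: about {approx_params_billion:.1f} billion")
```

The implied count of roughly 8 billion parameters is consistent with the model name, which suggests all four shards were fetched at their full size.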


In [5]:
import torch
from unsloth import FastLanguageModel
from transformers import TrainingArguments

MODEL_NAME = "/content/drive/MyDrive/285_project/DeepSeek-R1-Distill-Llama-8B"
compute_dtype = torch.float16  # FP16 weights, no quantization

# Load the base model in full FP16 precision
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=4096,
    dtype=compute_dtype,
    load_in_4bit=False,  # 4-bit quantization disabled
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [6]:
from peft import PeftModel

# Path to the LoRA adapter (make sure it is the checkpoint saved by your earlier LoRA training run)
lora_model_path = "/content/drive/MyDrive/285_project/outputs/checkpoint-3171"  # your LoRA adapter path

# Load the LoRA adapter weights on top of the base model
model = PeftModel.from_pretrained(model, lora_model_path)

# Move the model to the GPU if one is available
model.to("cuda" if torch.cuda.is_available() else "cpu")

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear
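
The printed module tree shows why the adapter is lightweight: `q_proj` keeps its frozen 4096 x 4096 base layer and adds two small matrices, `lora_A` (4096 to 16) and `lora_B` (16 to 4096). The arithmetic can be sketched as follows; the helper name `lora_param_count` is hypothetical, and the shapes are taken from the output above.

```python
def lora_param_count(in_features, out_features, rank):
    """Number of trainable weights in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


in_features, out_features, rank = 4096, 4096, 16  # shapes from the q_proj printout

base_params = in_features * out_features          # frozen base layer, bias=False
adapter_params = lora_param_count(in_features, out_features, rank)

print(base_params)                                # 16777216
print(adapter_params)                             # 131072
print(f"{adapter_params / base_params:.2%}")      # 0.78%
```

For this one projection, the adapter adds under 1% of the base layer's weight count.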

## Model Inference

In [7]:
FastLanguageModel.for_inference(model)

In [26]:
prompt_style = """Below is an instruction that describes a task, paired with relevant background information.
Your task is to generate a well-reasoned and structured response.

Before providing an answer, carefully analyze the question and construct a logical, step-by-step reasoning process.
Make sure to justify your conclusions based on medical knowledge and clinical evidence.

### Instruction:
You are an advanced medical AI assistant with expertise in clinical reasoning, diagnostics, and treatment planning.
Your goal is to analyze the given medical question and provide a well-structured, evidence-based response.
You can assume you are helping a doctor.

### Question:
{}

### Response:
<think>{}"""

In [27]:
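# Test question in Chinese. English: "Doctor, I've been feeling extremely tired lately and I'm losing weight. I'm also often thirsty. What illness could this be?"
question = '医生，我最近总是觉得特别疲劳，而且体重也在下降。我还时常感到口渴，这可能是什么病？'  # or input()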

In [28]:
input_text = prompt_style.format(question, '')
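
The positional `{}` placeholders work, but they make it easy to swap the two arguments by accident. One possible alternative is to use named fields and wrap the formatting in a small helper. The sketch below uses a shortened, hypothetical template (`SHORT_TEMPLATE`) and a hypothetical helper (`build_prompt`) so that it runs on its own; it is not the notebook's full `prompt_style`.

```python
# Shortened stand-in for the notebook's prompt template, with named fields
SHORT_TEMPLATE = """### Question:
{question}

### Response:
<think>{reasoning}"""


def build_prompt(question, reasoning=""):
    """Fill the template; an empty reasoning leaves the <think> tag open for the model."""
    return SHORT_TEMPLATE.format(question=question.strip(), reasoning=reasoning)


prompt = build_prompt("  What could cause fatigue, weight loss, and thirst?  ")
print(prompt)
```

Leaving `reasoning` empty ends the prompt at the open `<think>` tag, which is what cues the model to begin with its chain of thought.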

In [29]:
inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

In [30]:
output = model.generate(
    **inputs,
    max_new_tokens=5000,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
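
To make the `top_k` and `top_p` settings more concrete, here is a simplified pure-Python illustration of how the two filters narrow the candidate tokens before sampling. It is a sketch of the general idea, not the library's implementation: the function name `top_k_top_p_filter` is hypothetical, and it works on a plain list of probabilities rather than logits.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return {token_index: probability} for the tokens that survive top-k, then top-p."""
    # Top-k: keep only the k most probable tokens, then renormalize
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    ranked = [(index, p / k_total) for index, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept = []
    cumulative = 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a distribution to sample from
    p_total = sum(p for _, p in kept)
    return {index: p / p_total for index, p in kept}


toy_probs = [0.5, 0.3, 0.15, 0.05]  # toy distribution over four tokens
print(top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8))
```

With these toy values, top-k drops the least likely token and top-p then keeps only the first two, so sampling would choose between tokens 0 and 1.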

In [31]:
response = tokenizer.batch_decode(output, skip_special_tokens=False)

In [32]:
print(response[0].split('### Response:')[1])  # chain of thought and answer


<think>
Okay, so I'm feeling really tired all the time and I've lost some weight recently. That's definitely not normal. Let's think about what might be causing this. Hmm, fatigue and weight loss could mean a lot of things, but they often point to something going on with the body's metabolism or hormones. Maybe there's an issue with how my body is processing energy. 

Now, what about this constant feeling of being thirsty? It's like my mouth is dry all the time. That's a bit odd. It's not just thirst; it's more like a constant need to drink water. That makes me think of diabetes insipidus, where there's a problem with the kidneys and hormones that control water balance. But wait, diabetes insipidus usually doesn't cause weight loss or fatigue. So maybe it's something else. 

Let's think about hyperthyroidism. It's a condition where the thyroid gland is overactive, and it can definitely cause fatigue and weight loss because the body is burning through energy at an increased rate. Plus,
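
The printed text mixes the model's reasoning with its final answer. If the two need to be handled separately, the response can be split on the think tags. The sketch below assumes the model closes its reasoning with `</think>`; the output shown above is cut off before any closing tag, so that tag is an assumption, and `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split generated text into (reasoning, answer) around the think tags."""
    start = text.find(open_tag)
    if start == -1:
        # No reasoning block at all: treat everything as the answer
        return "", text.strip()

    body = text[start + len(open_tag):]
    reasoning, found, answer = body.partition(close_tag)
    if not found:
        # Generation stopped before the closing tag: reasoning only, no answer yet
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


sample = "<think>\nFatigue plus thirst suggests checking glucose.\n</think>\nOrder a fasting glucose test."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

An empty answer from this helper is a quick signal that generation ended inside the reasoning block.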

## Merging Model

In [34]:
new_model_local = "/content/drive/MyDrive/285_project/DeepSeek-R1-Medical-CoT"

# 1. Save the LoRA adapter (lightweight weights)
model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)

# 2. Merge the LoRA adapter into the base model and save the full model
model.save_pretrained_merged(new_model_local, tokenizer, save_method="merged_16bit")

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 32.24 out of 52.96 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


  3%|▎         | 1/32 [00:00<00:06,  5.07it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:33<00:00,  1.04s/it]


Unsloth: Saving tokenizer... Done.
Done.
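
After the merge finishes, it can be worth confirming that the output directory really contains a full model rather than only adapter files. The sketch below uses only the standard library and a hypothetical helper, `summarize_saved_model`. It assumes the merged model is written as a `config.json` plus one or more `*.safetensors` weight files, which matches the layout of the downloaded base model but is not confirmed by the log above.

```python
from pathlib import Path


def summarize_saved_model(model_dir):
    """Report whether a config exists and how large the safetensors weight files are."""
    root = Path(model_dir)
    weight_files = sorted(p.name for p in root.glob("*.safetensors"))
    total_bytes = sum((root / name).stat().st_size for name in weight_files)
    return {
        "has_config": (root / "config.json").is_file(),
        "weight_files": weight_files,
        "total_gb": total_bytes / 1e9,
    }


# Example call; a path that does not exist simply yields an empty summary
print(summarize_saved_model("/content/drive/MyDrive/285_project/DeepSeek-R1-Medical-CoT"))
```

A total of only tens or hundreds of megabytes would suggest that the directory holds just the adapter and not the merged weights.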
