# Unsloth Fine-tuning DeepSeek R1 Distilled Llama 8B

This notebook demonstrates how to fine-tune `DeepSeek-R1-Distill-Llama-8B` with Unsloth on a medical dataset.

## Why do we need LLM fine-tuning?

Fine-tuning adapts an existing pretrained model to a particular task or domain, so that it performs better on that task than the general-purpose model does and is more useful in real-world applications.

In [2]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install bitsandbytes unsloth_zoo


## Choose a Base Model

1. Choose a model that aligns with your use case
2. Assess your storage, compute capacity, and dataset
3. Select a model and its parameters
4. Choose between base and instruct models
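
For the second step, a back-of-the-envelope estimate of weight storage can help decide whether a model fits on the available GPU. The sketch below is only a rough lower bound: it counts weights alone (no activations, optimizer state, or tensors kept at higher precision), and the parameter count of 8 billion is an assumed round number for an "8B" model, not a figure taken from this notebook.

```python
def estimate_weight_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 8 billion parameters for an "8B" model.
APPROX_PARAMS = 8e9

for bits in (16, 8, 4):
    gb = estimate_weight_gb(APPROX_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions, 4-bit weights need about a quarter of the memory of 16-bit weights, which is why `load_in_4bit = True` is used in the loading cell.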

In [3]:
!pip freeze >unsloth_requirement.txt

In [4]:
from unsloth import FastLanguageModel
import torch
import os
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

save_directory = "./saved_models/"
os.makedirs(save_directory, exist_ok=True)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Save the model weights to the specified path
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
print(f"Model weights saved to {save_directory}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.6: Fast Llama patching. Transformers: 4.53.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Model weights saved to ./saved_models/


In [5]:
# from google.colab import drive
# drive.mount('/content/drive')

In [6]:
# !ls drive/MyDrive

In [7]:
# !cp -r saved_models/ drive/MyDrive

In [8]:
# !ls -lh saved_models/

In [9]:
# !du -h --max-depth=1 /root/.cache/huggingface/hub

In [10]:
!ls /root/.cache/huggingface/hub/models--unsloth--deepseek-r1-distill-llama-8b-unsloth-bnb-4bit/snapshots/ -l

total 4
drwxr-xr-x 2 root root 4096 Jul 21 14:09 99d24119ec73dd12a306bf48ccc69bfe7a343848


In [11]:
!ls /root/.cache/huggingface/hub

models--unslothai--1
models--unslothai--colabpro
models--unslothai--repeat
models--unslothai--vram-24
models--unsloth--deepseek-r1-distill-llama-8b-unsloth-bnb-4bit


## Inference before fine-tuning

In [12]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""

In [13]:
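# English: "A patient with acute appendicitis has been ill for 5 days; the abdominal pain has eased
# slightly but the fever persists, and examination finds a tender mass in the right lower abdomen.
# How should this be managed?"
question = "一个患有急性阑尾炎的病人已经发病5天，腹痛稍有减轻但仍然发热，在体检时发现右下腹有压痛的包块，此时应如何处理？"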


FastLanguageModel.for_inference(model)  # Must switch to inference mode first
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)  # Baseline inference, before any fine-tuning
print(response[0].split("### Response:")[1])


<think>
好，我现在要处理一个关于急性阑尾炎的病例。患者已经患病五天，腹痛有所减轻，但仍然发热。在体检时，发现右下腹有压痛的包块。

首先，急性阑尾炎通常表现为右下腹痛、发热、发热、腹泻等症状。患者已经有五天的病程，说明炎症已经持续了一段时间。腹痛虽然有所缓解，但仍然存在，可能意味着炎症正在逐渐缓解，但仍未完全消退。

接下来，体检发现右下腹有压痛的包块。这可能提示存在一个可触摸的结块。包块的存在可能意味着阑尾炎已经形成了一个脓肿，或者是阑尾的炎症结构。包块的压痛可能反映出周围组织的不适或炎症情况。

考虑到患者仍有发热，腹痛缓解但不完全，包块的存在，可能需要进一步的评估和治疗。首先，应进行腹部超声检查，以确定包块的性质。超声可以帮助判断包块是否是脓肿、结石或其他结构。此外，检查是否有膈下游离气泡，排除子肱疝的可能。

如果包块是脓肿，可能需要引流治疗。引流可以有效缓解感染，减轻症状。同时，患者应继续使用抗生素治疗，以覆盖可能的细菌感染。考虑到病原体的敏感性，可能需要根据局部抗生素敏感结果调整药物选择。

此外，患者应保持current output management，确保排便通畅，避免便秘或腹泻带来的问题。治疗过程中，应密切监测患者的症状变化，如腹痛加重、发热升高或包块增大等情况，及时调整治疗计划。

在处理过程中，应与外科团队保持沟通，根据超声结果决定是否需要进一步的手术干预。例如，如果包块无法通过引流治疗解决，或者诊断为膈下游离气泡，可能需要进行手术治疗。

总的来说，处理这种情况需要综合考虑患者的症状、检查结果和治疗效果，逐步调整治疗方案，确保患者病情得到有效控制。
</think>

在处理急性阑尾炎患者的情况时，以下是逐步的处理思路和建议：

1. **初步评估**：
   - **病史**：患者已有急性阑尾炎5天，腹痛缓解但仍有发热。
   - **体征**：右下腹压痛包块，可能提示脓肿或炎症结块。

2. **进一步检查**：
   - **腹部超声**：确定包块性质，如脓肿或结石，检查膈下游离气泡。
   - **血常规、血培养、CRP**：评估感染程度和细菌类型。

3. **治疗选择**：
   - **抗生素**：根据敏感结果选择药物，通常为广谱抗生素。
   - **引流治疗**：如果诊断为脓肿，考虑外科引流。
   - **支

## Prepare Dataset

The medical reasoning dataset [FreedomIntelligence/medical-o1-reasoning-SFT](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT/) will be used to train the selected model.

In [14]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

### Important Notice

It's crucial to append the EOS (End of Sequence) token to every training example. Without it, the model never sees where a response should stop and may generate indefinitely.
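
As a small illustration of this rule, the helper below appends an end-of-sequence marker only when it is missing, so applying it twice does no harm. The function name `with_eos` and the `"<eos>"` string are placeholders invented for this sketch; in the notebook the real value comes from `tokenizer.eos_token`.

```python
def with_eos(text: str, eos_token: str) -> str:
    """Return text ending in eos_token, appending it only if it is missing."""
    return text if text.endswith(eos_token) else text + eos_token


# Stand-in value for illustration; in the notebook use tokenizer.eos_token.
DEMO_EOS = "<eos>"

example = with_eos("### Response:\n<think>\nreasoning\n</think>\nanswer", DEMO_EOS)
print(example.endswith(DEMO_EOS))
```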

In [15]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


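def formatting_prompts_func(examples):
    """Format a batch (dict of column lists) into a single "text" column."""
    questions = examples["Question"]
    cots = examples["Complex_CoT"]
    responses = examples["Response"]
    texts = []
    # Avoid naming a loop variable `input`, which would shadow the builtin.
    for question_text, cot, response_text in zip(questions, cots, responses):
        text = train_prompt_style.format(question_text, cot, response_text) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }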

In [16]:
from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", 'zh', split = "train[:30%]", trust_remote_code=True)
print(dataset.column_names)

README.md: 0.00B [00:00, ?B/s]

medical_o1_sft_Chinese.json:   0%|          | 0.00/50.6M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/20171 [00:00<?, ? examples/s]

['Question', 'Complex_CoT', 'Response']


`SFTTrainer` is configured below to read a single `text` column (`dataset_text_field = "text"`), so we map the three source columns (`Question`, `Complex_CoT`, `Response`) into one formatted `text` string per example.

In [17]:
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset["text"][0]

Map:   0%|          | 0/6051 [00:00<?, ? examples/s]

'Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.\nPlease answer the following medical question.\n\n### Question:\n根据描述，一个1岁的孩子在夏季头皮出现多处小结节，长期不愈合，且现在疮大如梅，溃破流脓，口不收敛，头皮下有空洞，患处皮肤增厚。这种病症在中医中诊断为什么病？\n\n### Response:\n<think>\n这个小孩子在夏天头皮上长了些小结节，一直都没好，后来变成了脓包，流了好多脓。想想夏天那么热，可能和湿热有关。才一岁的小孩，免疫力本来就不强，夏天的湿热没准就侵袭了身体。\n\n用中医的角度来看，出现小结节、再加上长期不愈合，这些症状让我想到了头疮。小孩子最容易得这些皮肤病，主要因为湿热在体表郁结。\n\n但再看看，头皮下还有空洞，这可能不止是简单的头疮。看起来病情挺严重的，也许是脓肿没治好。这样的情况中医中有时候叫做禿疮或者湿疮，也可能是另一种情况。\n\n等一下，头皮上的空洞和皮肤增厚更像是疾病已经深入到头皮下，这是不是说明有可能是流注或瘰疬？这些名字常描述头部或颈部的严重感染，特别是有化脓不愈合，又形成通道或空洞的情况。\n\n仔细想想，我怎么感觉这些症状更贴近瘰疬的表现？尤其考虑到孩子的年纪和夏天发生的季节性因素，湿热可能是主因，但可能也有火毒或者痰湿造成的滞留。\n\n回到基本

## Train the model
Now let's use Hugging Face TRL's `SFTTrainer`.

In [18]:
FastLanguageModel.for_training(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0): LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((409

In [19]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.7.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
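
The number of trainable parameters that the training banner reports (41,943,040) can be reproduced by hand. A LoRA adapter of rank `r` on a linear layer with `in_features` inputs and `out_features` outputs adds two matrices with `r * in_features` and `out_features * r` entries. The sketch below applies that formula to the layer shapes printed above, with `r = 16` and 32 decoder layers; the helper name `lora_param_count` is made up for illustration.

```python
# (in_features, out_features) for each targeted projection,
# copied from the printed LlamaDecoderLayer above.
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_param_count(shapes: dict, rank: int, num_layers: int) -> int:
    """Count LoRA parameters: rank * (in + out) per adapted layer, no bias."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


print(lora_param_count(TARGET_SHAPES, rank=16, num_layers=32))  # 41943040
```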


In [20]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2, # Number of processes used to preprocess the dataset
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 600,
        # num_train_epochs = 1, # For longer training runs!
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # "none" disables all reporting integrations; set to e.g. "wandb" to enable logging
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/6051 [00:00<?, ? examples/s]
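
The training arguments above determine how much data each optimizer step sees and how the learning rate evolves. The sketch below works through that arithmetic for this configuration (6,051 examples, batch size 2, 4 accumulation steps, 600 steps, 5 warmup steps). The `linear_lr` function assumes the usual shape of a linear scheduler, a linear warmup from zero followed by a linear decay to zero; it is a hand-written approximation for illustration, not code taken from the trainer.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus


def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


batch = effective_batch_size(per_device=2, grad_accum=4)
examples_seen = batch * 600
print(f"effective batch size: {batch}")
print(f"examples seen in 600 steps: {examples_seen} "
      f"({examples_seen / 6051:.2f} of one pass over 6051 examples)")
for step in (0, 5, 300, 600):
    print(f"step {step:>3}: lr = {linear_lr(step, 2e-4, 5, 600):.2e}")
```

With `max_steps = 600` the run covers less than one full pass over the 6,051 examples; switching to `num_train_epochs` instead would train on all of them.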

In [21]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 6,051 | Num Epochs = 1 | Total steps = 600
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.3172
2,2.3224
3,2.4429
4,2.373
5,2.1732
6,2.249
7,1.998
8,1.8898
9,1.7397
10,1.7865


KeyboardInterrupt: 

## Inference after fine-tuning

Let's run inference on the same question again and compare the result.

In [22]:
print(question)

一个患有急性阑尾炎的病人已经发病5天，腹痛稍有减轻但仍然发热，在体检时发现右下腹有压痛的包块，此时应如何处理？


In [23]:
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference! Check that the fine-tuned model can run inference
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
患者已经发病5天了，腹痛稍微减轻，但仍然发烧。体检发现右下腹有压痛的包块，这让我立刻想到急性阑尾炎。嗯，急性阑尾炎的患者如果有包块，通常需要手术治疗。可是，手术对患者来说是个大问题，尤其是在已经有发烧的情况下，手术风险会增加。

我觉得应该先考虑其他的治疗方法，比如抗生素。因为发烧可能是因为细菌感染引起的，抗生素可以帮助控制发烧和治疗感染。这样做的话，可以减少手术的必要性，尤其是在感染风险更高的情况下。

不过，包块的存在也不能忽视。包块可能会引起腹痛，或者有其他并发症，比如感染。所以，虽然抗生素可以帮助控制发烧，但包块依然需要处理。

这时候，我开始犹豫，手术风险太高，抗生素似乎是个不错的选择，但包块也不能完全忽视。也许可以先用抗生素来稳定病情，然后再考虑手术的可能性。

等一下，考虑到患者已经发烧，手术风险确实增加了。手术可能会更复杂，术后恢复也可能更困难。所以，手术可能不是首选的方法。

然后，我想到，包块是否可以通过其他方式处理？比如说，是否有非手术的方法能减轻包块带来的不适？如果能的话，可能就不用马上考虑手术了。

不过，急性阑尾炎的包块，通常是需要手术的。毕竟，包块在腹腔内，无法通过外部手段有效处理。所以，非手术治疗可能不太现实。

我又回到最初的想法，既然手术风险很高，尤其在有发烧的情况下，可能应该先考虑抗生素来控制病情，然后再决定是否手术。

这样一来，患者可以先通过抗生素来缓解发烧，病情稳定下来后，再考虑手术的可能性。这样做既能减少手术风险，又能控制病情。

不过，这样的策略可能会延误手术的时间，包块可能会继续发炎，导致更严重的问题。所以，我觉得这不是最好的选择。

再想想，手术的风险虽然高，但如果不手术，病情可能会恶化，甚至导致更严重的并发症。

嗯，经过一番思考，我觉得即便手术风险很高，也必须面对现实，考虑手术的必要性。

所以，结论是：患者必须立即手术，虽然风险很高，但这是治疗急性阑尾炎包块的唯一选择。
</think>
对于患有急性阑尾炎且腹部出现压痛的包块的病人，手术通常是必要的。虽然手术风险较高，但包块的存在可能导致更严重的并发症，如感染和腹痛加重。因此，在病人有发烧的情况下，手术的风险可能会更高，但仍需尽快处理以避免病情恶化。因此，建议尽快进行手术治疗。<｜end▁of▁sentence｜>


In [24]:
# To keep the test fast, evaluate on the last 30 samples only
test_dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", 'zh', split = "train[-30:]", trust_remote_code=True)
test_questions = test_dataset["Question"]
test_references = test_dataset["Response"]  # Reference answers
print(test_questions[0])
print('-'*50)
test_references[0]

一名25岁的初孕妇在孕38周时入院，出现头晕、头痛、胸闷1周。其血压为170/100mmHg，尿检测结果显示尿蛋白（++），下肢存在水肿（++），估计胎儿重约3200克，胎心率为140次/分。根据这些症状和检查结果，这名孕妇最可能被诊断为何种妊娠病症？
--------------------------------------------------


'根据您描述的症状和检查结果，这位孕妇很有可能被诊断为子痫前期（pre-eclampsia）。子痫前期是妊娠期特有的综合征，通常在怀孕20周以后出现，典型症状包括高血压、尿蛋白和水肿。\n\n在您的描述中，这名孕妇在第38周时血压高达170/100 mmHg，并且尿检显示尿蛋白（++）。此外，她还出现了头晕、头痛、胸闷和下肢水肿，这些都是子痫前期的典型表现。这些症状排除了其他可能的妊娠期高血压问题，如妊娠期高血压病（没有尿蛋白）和慢性高血压（通常在怀孕前发现）。\n\n因此，综合考虑所有症状和检验结果，她最符合子痫前期的诊断标准。因此，我建议尽快进行进一步的医学评估和处理，以保障母婴的健康。'

In [32]:

# Generate model predictions, keeping only the text after the </think> block
# and dropping the trailing <｜end▁of▁sentence｜> token
model_predictions = []
for question in test_questions:
    inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=1200,
        use_cache=True,
    )
    response = tokenizer.batch_decode(outputs)[0].split("### Response:")[1]
    # Use [-1] so a generation cut off before </think> does not raise IndexError
    response = response.split("</think>")[-1].strip()
    response = response.replace("<｜end▁of▁sentence｜>", "")
    model_predictions.append(response)


KeyboardInterrupt: 
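
The post-processing in the loop above can be pulled into a small helper that is easy to test without a GPU. This is a sketch under a stated design choice: if the generation stops before emitting `</think>` (for example because it hit `max_new_tokens`), the helper returns an empty string rather than treating unfinished reasoning as the answer. The name `extract_final_answer` is hypothetical.

```python
EOS_MARKER = "<｜end▁of▁sentence｜>"


def extract_final_answer(decoded: str, eos_marker: str = EOS_MARKER) -> str:
    """Return the text after </think>, without the EOS marker.

    Returns "" when the reasoning block was never closed.
    """
    after_prompt = decoded.split("### Response:")[-1]
    _reasoning, separator, answer = after_prompt.partition("</think>")
    if not separator:  # no closing tag, so there is no final answer
        return ""
    return answer.replace(eos_marker, "").strip()


sample = "### Question:\nq\n\n### Response:\n<think>\nsteps\n</think>\nfinal answer" + EOS_MARKER
print(extract_final_answer(sample))  # final answer
```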

In [26]:
model_predictions[0]

'根据您提供的症状和检查结果，这名25岁的孕妇最可能被诊断为高血压肾病。高血压肾病在怀孕晚期尤其常见，表现包括血压显著升高、尿蛋白增高、下肢水肿以及胎心率的增加。这些症状与高血压肾病的典型表现一致。建议进一步的监测和管理，以确保母婴的安全。'

In [27]:
len(model_predictions)

3

In [28]:
test_references[1]

'在符合国务院药品监督管理部门规定的条件下，通过集中规模化栽培养殖和严格质量控制获得批准文号的药物主要是中药材。这类药物的生产涉及大面积的种植过程，并且需要在每一批次中确保质量一致性和安全性。国家对中药材的质量管理非常严格，确保其有效性和安全性，使得中药材能够在市场上合法销售和使用。同时，这种管理方式能够保证中药材的生产稳定、供应充足，也为行业规范化和现代化发展提供了保障。'

In [29]:
!pip install nltk rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [30]:
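from nltk.translate.bleu_score import sentence_bleu
# Inference was slow and was interrupted, so BLEU is computed only over the
# predictions generated so far (zip stops at the shorter of the two lists)
bleu_scores = []
for pred, ref in zip(model_predictions[0:31], test_references[0:31]):
    # Tokenize the reference and the prediction.
    # Note: str.split() splits on whitespace; Chinese text contains few spaces,
    # so each answer becomes only a handful of very long tokens.
    ref_tokens = [ref.split()]
    pred_tokens = pred.split()
    # Compute the unigram BLEU score
    bleu_score = sentence_bleu(ref_tokens, pred_tokens,weights=(1,0,0,0))
    bleu_scores.append(bleu_score)

# Compute the average BLEU score
average_bleu = sum(bleu_scores) / len(bleu_scores)
print(f"Average BLEU Score: {average_bleu}")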


Average BLEU Score: 0.16666666666666666


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
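
Because the answers are Chinese and contain almost no whitespace, whitespace tokenization gives BLEU very little to compare. One possible alternative, sketched here from scratch rather than with `nltk`, is a character-level BLEU-1: clipped unigram precision over characters, multiplied by the standard brevity penalty. The function name `char_bleu1` is made up for this example, and character-level scoring is an assumption about what would suit this data, not something the notebook itself evaluates.

```python
import math
from collections import Counter


def char_bleu1(prediction: str, reference: str) -> float:
    """Character-level BLEU-1: clipped unigram precision times brevity penalty."""
    pred_chars = [c for c in prediction if not c.isspace()]
    ref_chars = [c for c in reference if not c.isspace()]
    if not pred_chars or not ref_chars:
        return 0.0
    pred_counts = Counter(pred_chars)
    ref_counts = Counter(ref_chars)
    # Each predicted character is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[c]) for c, n in pred_counts.items())
    precision = clipped / len(pred_chars)
    # Penalize predictions that are shorter than the reference.
    if len(pred_chars) > len(ref_chars):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_chars) / len(pred_chars))
    return brevity_penalty * precision


# Whitespace splitting leaves a Chinese sentence as a single token:
print(len("子痫前期是妊娠期特有的综合征".split()))  # 1
print(char_bleu1("子痫前期", "子痫前期"))  # 1.0
```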


The following section is optional; you do not need to upload the model to Hugging Face.

## Upload Model to Hugging Face

Now, let's save our fine-tuned model and upload it to Hugging Face.

### Save the fine-tuned model to GGUF format

Choose the llama.cpp GGUF quantization you prefer by setting the corresponding `if` to `True`.

In [None]:
# from google.colab import userdata

# HUGGINGFACE_TOKEN = userdata.get('HUGGINGFACE_TOKEN')

In [None]:
# # Save to 8bit Q8_0
# if True: model.save_pretrained_gguf("model", tokenizer,)

# # Save to 16bit GGUF
# if False: model.save_pretrained_gguf("model_f16", tokenizer, quantization_method = "f16")

# # Save to q4_k_m GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

### Push the model to Hugging Face

Create a model repository for your model if you haven't done so already.

In [None]:
# from huggingface_hub import create_repo
# create_repo("wyang14/medical-model", token=HUGGINGFACE_TOKEN, exist_ok=True)

In [None]:
# model.push_to_hub_gguf("wyang14/medical-model", tokenizer, token = HUGGINGFACE_TOKEN)

### Run the Hugging Face model with Ollama

```bash
ollama run hf.co/{username}/{repository}:{quantization}
```

### Ollama inference

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/{username}/{repository}:{quantization}",
  "messages": [
    { "role": "user", "content": "一个患有急性阑尾炎的病人已经发病5天，腹痛稍有减轻但仍然发热，在体检时发现右下腹有压痛的包块，此时应如何处理？" }
  ]
}'
```
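
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes an Ollama server is listening on `localhost:11434` and that, with `"stream": false`, the reply is a single JSON object whose `message.content` field holds the answer; both the helper names and that response shape are assumptions to verify against your Ollama version.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, question: str) -> bytes:
    """Encode a single-turn chat request as UTF-8 JSON."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def chat(model: str, question: str, url: str = OLLAMA_CHAT_URL) -> str:
    """POST the request and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=build_chat_payload(model, question),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["message"]["content"]  # assumed non-streaming response shape


# Usage (requires a running Ollama server and a pulled model):
# print(chat("hf.co/{username}/{repository}:{quantization}", question))
```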