# Finetuning a model

These four models will be finetuned with synthetic data based on real fact-checked claims:
- "unsloth/mistral-7b-bnb-4bit"
- "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
- "unsloth/Meta-Llama-3.1-8B-bnb-4bit"
- "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"

We start by installing Unsloth. If you work on Colab, use the code below.

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

In [None]:
from unsloth import FastLanguageModel

model_id = "unsloth/mistral-7b-bnb-4bit"
#model_id = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
#model_id = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"
#model_id = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"

max_seq_length = 4096

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

==((====))==  Unsloth 2025.11.3: Fast Mistral patching. Transformers: 4.57.1.
   \\   /|    NVIDIA GeForce RTX 5060 Ti. Num GPUs = 1. Max memory: 15.928 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Next, the dataset of 12000 samples is loaded from a JSONL file.

In [None]:
import json
from datasets import Dataset

# Load the training data
training_data = []
with open("socratic_questions_GPTOSS3000.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        training_data.append(json.loads(line))

ds = Dataset.from_list(training_data)
ds[0]

{'claim': '"Arizona officials caught changing ballots, have been arrested."',
 'summary': 'The claim states that Arizona officials were caught changing ballots and have been arrested. It is a qualitative, absolute assertion with no cited source or methodology; the geography (Arizona) is clear, but the time frame of the alleged arrests is not provided.',
 'alerts': ['qualitative claim',
  'source/methodology missing',
  'time period missing'],
 'url_used': False,
 'question': 'What kinds of evidence would be needed to determine whether any Arizona officials were actually arrested for altering ballots, and how might the absence of a time frame affect that search?',
 'history': ['How might the spread of an unverified claim about election officials influence public trust, and what responsibilities do journalists have when encountering such statements?',
  'Which perspectives or stakeholders are missing from the current description, and how could including them change the way we evaluate th
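
Before building prompts, it can help to check that every record carries the fields used later in the notebook. The sketch below is not part of the original pipeline: the helper name `missing_fields` and the choice of required fields are assumptions based on the sample record above, with `history` treated as optional.

```python
# Hypothetical helper: report which expected fields a record lacks.
REQUIRED_FIELDS = ("claim", "summary", "alerts", "question")  # "history" is treated as optional

def missing_fields(record: dict) -> list[str]:
    """Return the required field names that are not present in `record`."""
    return [name for name in REQUIRED_FIELDS if name not in record]

# Made-up record that deliberately lacks the "question" field.
sample = {
    "claim": "Example claim.",
    "summary": "Example summary.",
    "alerts": ["time period missing"],
}
print(missing_fields(sample))
```

Running this over `training_data` before calling `Dataset.from_list` would surface incomplete lines early instead of failing inside the mapping step.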

We will use the Alpaca prompt template to structure the input and output.

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # EOS token of the loaded model, e.g. </s> for Mistral or <|end_of_text|> for Llama 3.1 base

All the context fields are inserted into the template and joined into a single string. The `question` field is the target output the model should learn to produce.

In [None]:
def to_alpaca_text(examples):
    texts = []

    for claim, summary, alerts, history, question in zip(
        examples["claim"],
        examples["summary"],
        examples["alerts"],
        examples.get("history", [[]]*len(examples["claim"])),
        examples["question"],
    ):
        # 1) instruction: fixed goal of the model
        instruction = (
            "Generate a reflective Socratic questions to challenge the student’s reasoning, surface blind spots, and encourage deeper reflection, rather than accepting the AI’s output at face value."
            "Differentiate the question from previous questions asked about the claim."
        )

        # 2) input: merge your fields into one context string
        alerts_str = ", ".join(alerts) if isinstance(alerts, list) else str(alerts)
        history_str = "\n- " + "\n- ".join(history) if isinstance(history, list) and len(history) else ""

        input_text = (
            f"[[The claim is {claim}.]]\n"
            f"[[This is all the information known about the claim: {summary}.]]\n"
            f"[[These alerts point out omissions: {alerts_str}.]]\n"
            f"[[Take into account previous questions asked: {history_str}]]"
        )

        # 3) output: your label
        output_text = question

        # final alpaca-formatted training text
        text = alpaca_prompt.format(instruction, input_text, output_text) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

dataset = ds.map(to_alpaca_text, batched=True, remove_columns=ds.column_names)

# check one
print(dataset[0]["text"][:4000])

Map: 100%|██████████| 12000/12000 [00:00<00:00, 88086.48 examples/s]

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Generate a reflective Socratic questions to challenge the student’s reasoning, surface blind spots, and encourage deeper reflection, rather than accepting the AI’s output at face value.Differentiate the question from previous questions asked about the claim.

### Input:
[[The claim is "Arizona officials caught changing ballots, have been arrested.".]]
[[This is all the information known about the claim: The claim states that Arizona officials were caught changing ballots and have been arrested. It is a qualitative, absolute assertion with no cited source or methodology; the geography (Arizona) is clear, but the time frame of the alleged arrests is not provided..]]
[[These alerts point out omissions: qualitative claim, source/methodology missing, time period missing.]]
[[Take into account previous questions asked:
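
The same formatting can be applied to a single record, which is useful for building an inference prompt that matches the training text. This is a minimal sketch, assuming the field layout shown above; `build_input_text` is a hypothetical helper and not part of the notebook.

```python
def build_input_text(claim: str, summary: str, alerts: list[str], history: list[str]) -> str:
    """Merge one record's fields into the bracketed context string used for training."""
    alerts_str = ", ".join(alerts)
    # Previous questions become a bulleted list; an empty history yields an empty string.
    history_str = ("\n- " + "\n- ".join(history)) if history else ""
    return (
        f"[[The claim is {claim}.]]\n"
        f"[[This is all the information known about the claim: {summary}.]]\n"
        f"[[These alerts point out omissions: {alerts_str}.]]\n"
        f"[[Take into account previous questions asked: {history_str}]]"
    )

# Made-up values, only to show the resulting layout.
example = build_input_text(
    claim='"Example claim"',
    summary="Example summary",
    alerts=["time period missing"],
    history=["First question?", "Second question?"],
)
print(example)
```

Passing the returned string as the second argument of `alpaca_prompt.format(...)`, with an empty third argument, gives a prompt in the same layout the model saw during training.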




The LoRA settings follow https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/tutorial-how-to-finetune-llama-3-and-use-in-ollama:

- The rank is set to 32 to increase accuracy. Since the task is not that complicated and always the same, a higher rank is not necessary.
- `lora_alpha` is set to double the rank, as recommended.
- Only the low-rank matrices are trained, so `bias` is "none".
- `lora_dropout` is a regularization parameter to prevent overfitting; it is set to 0 here.
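
To make the rank and alpha choice concrete, the sketch below computes the LoRA scaling factor (`lora_alpha / r`) and a rough count of trainable adapter parameters. The layer shapes are assumptions taken from the published Mistral-7B configuration (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers). They are not stated in this notebook and may not match the model actually loaded.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * (d_in + d_out)

r, lora_alpha = 32, 64
scaling = lora_alpha / r  # the low-rank update is multiplied by alpha / r

# Assumed Mistral-7B shapes (not taken from this notebook).
hidden, kv_dim, mlp, n_layers = 4096, 1024, 14336, 32
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}
per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * n_layers
print(f"scaling={scaling}, trainable LoRA parameters={total:,}")
```

Because the count grows linearly with `r`, doubling the rank doubles the number of trainable parameters, which is one reason not to raise it further for a narrow task.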

In [None]:
from unsloth import FastLanguageModel

# Add LoRA adapters
target_modules = [
    "q_proj","k_proj","v_proj","o_proj",
    "gate_proj","up_proj","down_proj",
]

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=target_modules,
    lora_alpha=32*2,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Unsloth 2025.11.3 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.


In [10]:
from unsloth import is_bfloat16_supported

# test if bfloat16 is supported
print(is_bfloat16_supported())

True


I split the data into a training and a test set to evaluate the results. Most of these parameter settings are derived from examples. The instruct models are trained for 1 epoch and the base models for 2 epochs.

In [None]:
from trl import SFTTrainer, SFTConfig
from unsloth import unsloth_train, is_bfloat16_supported

# split the dataset in a train and test set
new_dataset = dataset.train_test_split(test_size = 0.01)

# train the model
trainer = SFTTrainer(
    model = model,
    train_dataset = new_dataset["train"],
    eval_dataset = new_dataset["test"],
    tokenizer = tokenizer,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 1,
    packing=False,
    args= SFTConfig(
        output_dir="outputs",
        seed=3407,

        # Effective batch size = 2 * 4 = 8
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps = 5,

        # Use epochs instead of max_steps
        num_train_epochs=1,

        # more precise tuning than default 2e-4
        learning_rate=1e-4,
        lr_scheduler_type="linear",

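        # use bfloat16 when the GPU supports it, otherwise fall back to float16
        bf16=is_bfloat16_supported(),
        fp16=not is_bfloat16_supported(),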

        optim="adamw_8bit",
        weight_decay=0.001,

        logging_steps=10,
        save_steps=200,
        #save_total_limit=2,

        # evaluate every 25 steps
        per_device_eval_batch_size = 2,
        eval_accumulation_steps = 4,
        eval_strategy = "steps",
        eval_steps = 25,
    ),
)

#trainer.train()
unsloth_train(trainer)
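
As a rough check on how long a run takes, the number of optimizer steps can be estimated from the settings above. This is a sketch for a single GPU. The exact figure reported by the trainer may differ slightly depending on how the library rounds partial batches, and `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(n_train: int, per_device_batch: int, grad_accum: int, epochs: int = 1) -> int:
    """Approximate number of optimizer updates on a single GPU."""
    batches_per_epoch = math.ceil(n_train / per_device_batch)
    return math.ceil(batches_per_epoch / grad_accum) * epochs

# 12000 samples with a 1% test split leaves 11880 training samples.
steps = optimizer_steps(n_train=11880, per_device_batch=2, grad_accum=4, epochs=1)
# eval_steps=25 and save_steps=200 are the values from the SFTConfig above.
print(steps, "optimizer steps,", steps // 25, "evaluations,", steps // 200, "checkpoints")
```

For the base models, which are trained for 2 epochs, the same call with `epochs=2` doubles the step count.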

### Inference

In [None]:
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)
messages = [
    {"role": "user", "content": "The claim is 'Arizona officials caught changing ballots, have been arrested.'.\nThis is all the information known about the claim: The claim asserts that Arizona officials were caught changing ballots and have been arrested. It is a qualitative, absolute statement with no cited source or methodological detail. Geography (Arizona) is specified, but the time frame is not, leading to alerts about the claim\'s qualitative nature, missing source/methodology, and missing time period..\n"},
]

# Instruction for inference, matching the training instruction
instruction_inference = "Generate a reflective Socratic questions to challenge the student’s reasoning, surface blind spots, and encourage deeper reflection, rather than accepting the AI’s output at face value.Differentiate the question from previous questions asked about the claim."

# Extract the user content
user_content = messages[0]["content"]

# Format the prompt using the alpaca_prompt template, leaving the response empty for generation
input_prompt = alpaca_prompt.format(instruction_inference, user_content, "")

# Tokenize the manually formatted prompt
inputs = tokenizer(
    input_prompt,
    return_tensors="pt",
    return_attention_mask=True
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs.input_ids, attention_mask = inputs.attention_mask, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Save the LoRA adapters.

In [11]:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

('lora_model\\tokenizer_config.json',
 'lora_model\\special_tokens_map.json',
 'lora_model\\chat_template.jinja',
 'lora_model\\tokenizer.json')

### Load model first before Inference

In [8]:
from unsloth import FastLanguageModel

if True: # Load model first before Inference (Set to False if it is already loaded)
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
pass


# Instruction for inference, matching the training instruction
instruction_inference = "Generate a reflective Socratic questions to challenge the student’s reasoning, surface blind spots, and encourage deeper reflection, rather than accepting the AI’s output at face value.Differentiate the question from previous questions asked about the claim."

# Extract the user content
user_content = messages[0]["content"]

# Format the prompt using the alpaca_prompt template, leaving the response empty for generation
input_prompt = alpaca_prompt.format(instruction_inference, user_content, "")

inputs = tokenizer(
    input_prompt,
    return_tensors="pt",
    return_attention_mask=True
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs.input_ids, attention_mask = inputs.attention_mask, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

==((====))==  Unsloth 2025.11.3: Fast Mistral patching. Transformers: 4.57.1.
   \\   /|    NVIDIA GeForce RTX 5060 Ti. Num GPUs = 1. Max memory: 15.928 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Will load lora_model as a legacy tokenizer.


What evidence would be needed to determine whether any Arizona officials have actually been arrested for altering ballots, and where might that evidence be found?</s>
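
The streamed answer ends with the end-of-sequence marker `</s>`. When the output is decoded to a string instead of streamed, the prompt and the marker can be stripped in plain Python. This is a minimal sketch: `extract_response` is a hypothetical helper, and the marker strings are assumptions taken from the template and the output above.

```python
def extract_response(decoded: str, eos_token: str = "</s>") -> str:
    """Return the text after the last '### Response:' header, without the EOS marker."""
    marker = "### Response:"
    # Keep only what follows the response header, if the prompt is still present.
    answer = decoded.rsplit(marker, 1)[-1]
    # Cut at the first EOS marker, if any, and trim surrounding whitespace.
    return answer.split(eos_token, 1)[0].strip()

# Shortened stand-in for a decoded generation.
decoded = "### Instruction:\n...\n\n### Response:\nWhat evidence would be needed?</s>"
print(extract_response(decoded))
```

For a Llama model the `eos_token` argument would need to be set to that tokenizer's own end-of-sequence string.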


### Exporting to Ollama

Before you do this, you need to:
- install uv
- download the Windows-compatible files from https://github.com/ggml-org/llama.cpp/releases
- copy these files to the llama.cpp folder
- copy and rename llama-quantize.exe so Unsloth can find it (it looks for the Unix-style name without the extension); from within the llama.cpp folder, run: copy llama-quantize.exe llama-quantize
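
The copy step can also be done from Python, which avoids switching to a terminal. This is a sketch under the assumption that the release files sit in a folder named `llama.cpp` next to the notebook; `add_extensionless_copy` is a hypothetical helper.

```python
import shutil
from pathlib import Path

def add_extensionless_copy(folder: str, exe_name: str = "llama-quantize.exe") -> Path:
    """Copy `exe_name` inside `folder` to the same name without the .exe suffix."""
    source = Path(folder) / exe_name
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found; check the llama.cpp folder")
    target = source.with_suffix("")  # llama-quantize.exe -> llama-quantize
    shutil.copy2(source, target)
    return target

# Example call, assuming the release files were extracted to ./llama.cpp:
# add_extensionless_copy("llama.cpp")
```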

In [None]:
from dotenv import load_dotenv
load_dotenv(dotenv_path=".env", override=True)

# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

In [2]:
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3) # Wait for a few seconds for Ollama to load!

If you work on Colab, mount your Google Drive and move all the generated files to it.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Create base folder and subfolder using model_id
!mkdir -p "/content/drive/MyDrive/FineTunedModels/$model_id"

# Move generated artifacts into the model-specific folder
!mv model       "/content/drive/MyDrive/FineTunedModels/$model_id"
!mv outputs     "/content/drive/MyDrive/FineTunedModels/$model_id"
!mv lora_model  "/content/drive/MyDrive/FineTunedModels/$model_id"
!mv wandb       "/content/drive/MyDrive/FineTunedModels/$model_id"
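
The `!mv` commands report an error for any folder that does not exist, for instance if no `wandb` folder was created during training. A Python alternative that skips missing folders is sketched below; `move_existing` is a hypothetical helper, and the Drive path in the example call is the one used above.

```python
import shutil
from pathlib import Path

def move_existing(names: list[str], destination: str) -> list[str]:
    """Move each path in `names` that exists into `destination`; return the ones moved."""
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in names:
        if Path(name).exists():
            shutil.move(name, str(dest / Path(name).name))
            moved.append(name)
    return moved

# Example call, assuming Drive is mounted at /content/drive and model_id is defined:
# move_existing(["model", "outputs", "lora_model", "wandb"],
#               f"/content/drive/MyDrive/FineTunedModels/{model_id}")
```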

Finally, load the model in Ollama. This step should be done on your local PC.

In [6]:
!ollama create mistral7b-q4km -f Modelfile-mistral7b-q4km

gathering model components
Error: 400 Bad Request: invalid model name


In [11]:
!ollama list

NAME        ID              SIZE      MODIFIED     
qwen3:4b    e55aed6fe643    2.5 GB    2 months ago    
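
The Modelfile passed to `ollama create` is not shown in this notebook. The sketch below generates one possible version, assuming a local GGUF file and the Alpaca layout used for training. The GGUF path is hypothetical, and mapping the instruction to `{{ .System }}` and the claim context to `{{ .Prompt }}` is a design assumption, not the author's confirmed setup.

```python
def make_modelfile(gguf_path: str, stop_token: str = "</s>") -> str:
    """Build Modelfile text that points Ollama at a local GGUF file."""
    # Alpaca layout; {{ .System }} and {{ .Prompt }} are Ollama template variables.
    template = (
        "Below is an instruction that describes a task, paired with an input that "
        "provides further context. Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{{ .System }}\n\n"
        "### Input:\n{{ .Prompt }}\n\n"
        "### Response:\n"
    )
    return (
        f"FROM {gguf_path}\n"
        f'TEMPLATE """{template}"""\n'
        f'PARAMETER stop "{stop_token}"\n'
    )

# Hypothetical path; use the GGUF file actually written by save_pretrained_gguf.
print(make_modelfile("./model/model-q4_k_m.gguf"))
```

The printed text can be saved to a file and passed to `ollama create` with `-f`. For a Llama model the stop token would need to be changed to that model's end-of-sequence string.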
