See these tutorials:
https://www.youtube.com/watch?v=W_xh6qNSfAQ
https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-and-run-llms/tutorial-how-to-finetune-llama-3-and-use-in-ollama

Some models:
4-bit pre-quantized models that Unsloth supports, for 4x faster downloading and no OOMs.
- "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
- "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
- "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
- "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
- "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
- "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
- "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
- "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
- "unsloth/Phi-3-medium-4k-instruct",
- "unsloth/gemma-2-9b-bnb-4bit",
- "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
 More models at https://huggingface.co/unsloth

In [1]:
import os
# 1) Disable Unsloth auto torch.compile
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"   # Unsloth docs confirm this flag

# 2) Shorten all Inductor/Triton/temp paths (Windows-safe)
os.environ["TORCHINDUCTOR_CACHE_DIR"] = r"C:\ti"
os.environ["TRITON_CACHE_DIR"]        = r"C:\triton"

from unsloth import FastLanguageModel

model_id = 'unsloth/Meta-Llama-3.1-8B-bnb-4bit'

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm
W1122 15:15:35.053000 13892 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


🦥 Unsloth Zoo will now patch everything to make training faster!


  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"{DEVICE_TYPE_TORCH}:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA GeForce RTX 5060 Ti. Num GPUs = 1. Max memory: 15.928 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Load the data from the JSONL file

In [2]:
import json
from datasets import Dataset

# Load the training data
training_data = []
with open("../Data/socratic_questions_GPTOSS3000.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        training_data.append(json.loads(line))

ds = Dataset.from_list(training_data)
ds[0]



{'claim': '"Arizona officials caught changing ballots, have been arrested."',
 'summary': "The claim asserts that Arizona officials were caught changing ballots and have been arrested. It is a qualitative, absolute statement with no cited source or methodological detail. Geography (Arizona) is specified, but the time frame is not, leading to alerts about the claim's qualitative nature, missing source/methodology, and missing time period.",
 'alerts': ['qualitative claim',
  'source/methodology missing',
  'time period missing',
  'geography present'],
 'url_used': False,
 'question': 'What specific evidence or sources support the assertion that Arizona officials were caught changing ballots and arrested, and how can that evidence be independently verified?'}
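The loading loop above assumes that every line of the file is valid JSON. As a hedged variant, the sketch below skips blank lines and reports the line number of a malformed record. The helper name `read_jsonl` is hypothetical and is not part of `datasets`.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate empty or trailing blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Point at the offending line instead of failing anonymously.
                raise ValueError(f"{path}: invalid JSON on line {line_no}: {err}") from err
    return records
```

The result can be passed to `Dataset.from_list(...)` exactly like `training_data` above.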

One issue is that this dataset has multiple columns. For Ollama and llama.cpp to function like a custom assistant, the data must have only two columns: a prompt column and an output column.

The template needs a prompt (`{INPUT}`) field and an `{OUTPUT}` field. Here the claim and its summary are combined into the prompt, and the `question` column is used as the output.

In [3]:
from unsloth import to_sharegpt
dataset_simple = to_sharegpt(
    ds,
    merged_prompt = "[[The claim is {claim}.\n]][[This is all the information know about the claim: {summary}.\n]]",
    output_column_name = "question",
)
dataset_simple[0]

Merging columns: 100%|██████████| 12000/12000 [00:00<00:00, 206607.48 examples/s]
Converting to ShareGPT: 100%|██████████| 12000/12000 [00:00<00:00, 242764.56 examples/s]


{'conversations': [{'from': 'human',
   'value': 'The claim is "Arizona officials caught changing ballots, have been arrested.".\nThis is all the information know about the claim: The claim asserts that Arizona officials were caught changing ballots and have been arrested. It is a qualitative, absolute statement with no cited source or methodological detail. Geography (Arizona) is specified, but the time frame is not, leading to alerts about the claim\'s qualitative nature, missing source/methodology, and missing time period..\n'},
  {'from': 'gpt',
   'value': 'What specific evidence or sources support the assertion that Arizona officials were caught changing ballots and arrested, and how can that evidence be independently verified?'}]}
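In the `merged_prompt` string, each `[[...]]` section is optional: it is dropped when the column it refers to is empty. The function below is a simplified stand-in written for illustration, not Unsloth's implementation, and `render_merged_prompt` is a hypothetical name.

```python
import re

_SECTION = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)   # optional [[...]] sections
_FIELD = re.compile(r"\{(\w+)\}")                    # {column} placeholders


def render_merged_prompt(template, row):
    """Fill {column} placeholders; drop [[...]] sections whose columns are empty."""
    pieces = _SECTION.split(template)  # odd indices hold the [[...]] bodies
    out = []
    for i, piece in enumerate(pieces):
        optional = i % 2 == 1
        columns = _FIELD.findall(piece)
        if optional and any(row.get(c) in (None, "") for c in columns):
            continue  # skip an optional section with a missing value
        out.append(_FIELD.sub(lambda m: str(row[m.group(1)]), piece))
    return "".join(out)


template = "[[The claim is {claim}.\n]][[Known details: {summary}.\n]]"
print(render_merged_prompt(template, {"claim": "X", "summary": ""}))
```

With an empty `summary`, only the first section is rendered, so rows with missing values do not produce dangling text.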

Finally, use `standardize_sharegpt` to convert the ShareGPT-style `from`/`value` entries (tagged `human` and `gpt`) to the OpenAI / Hugging Face style, which uses `role`/`content` entries tagged `user` and `assistant`.

In [4]:
from unsloth import standardize_sharegpt
dataset_standard = standardize_sharegpt(dataset_simple)
dataset_standard[0]

Unsloth: Standardizing formats (num_proc=12): 100%|██████████| 12000/12000 [00:13<00:00, 896.13 examples/s]


{'conversations': [{'content': 'The claim is "Arizona officials caught changing ballots, have been arrested.".\nThis is all the information know about the claim: The claim asserts that Arizona officials were caught changing ballots and have been arrested. It is a qualitative, absolute statement with no cited source or methodological detail. Geography (Arizona) is specified, but the time frame is not, leading to alerts about the claim\'s qualitative nature, missing source/methodology, and missing time period..\n',
   'role': 'user'},
  {'content': 'What specific evidence or sources support the assertion that Arizona officials were caught changing ballots and arrested, and how can that evidence be independently verified?',
   'role': 'assistant'}]}
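Conceptually, the conversion above renames the keys `from`/`value` to `role`/`content` and maps the speaker tags `human` and `gpt` to `user` and `assistant`. A minimal sketch of that mapping, for illustration only (the helper `to_role_content` is hypothetical and this is not how Unsloth implements it):

```python
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def to_role_content(conversation):
    """Convert ShareGPT-style turns to role/content turns."""
    return [
        # Unknown speaker tags are passed through unchanged.
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]


sharegpt = [
    {"from": "human", "value": "The claim is ..."},
    {"from": "gpt", "value": "What specific evidence ...?"},
]
print(to_role_content(sharegpt))
```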

Next, apply a chat template. It must contain an `{INPUT}` and an `{OUTPUT}` placeholder.

In [5]:
chat_template = """Below describes some details about a fact-checked claim.
Ask a critical socratic question that would help to critically analyse all that is known, up until now, about the claim.
>>> Claim and details:
{INPUT}
>>> Critical socratic question:
{OUTPUT}"""

from unsloth import apply_chat_template
dataset = apply_chat_template(
    dataset_standard,
    tokenizer = tokenizer,
    chat_template = chat_template,
    # default_system_message = "You are a helpful assistant", << [OPTIONAL]
)

Unsloth: We automatically added an EOS token to stop endless generations.
Map: 100%|██████████| 12000/12000 [00:00<00:00, 33105.93 examples/s]
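To see roughly what one training example looks like after templating, the sketch below substitutes a user/assistant pair into the `{INPUT}` and `{OUTPUT}` slots and appends an end-of-sequence marker, as the log line above mentions. It is a plain-string approximation: the real `apply_chat_template` also configures the tokenizer's chat template, and the `<|end_of_text|>` string is assumed from the generation output later in this notebook. The helper `render_example` is hypothetical.

```python
def render_example(template, conversation, eos_token="<|end_of_text|>"):
    """Fill {INPUT}/{OUTPUT} with the first user and assistant turns, then add EOS."""
    user = next(t["content"] for t in conversation if t["role"] == "user")
    assistant = next(t["content"] for t in conversation if t["role"] == "assistant")
    # str.replace is used instead of str.format so literal braces in the data are safe.
    text = template.replace("{INPUT}", user).replace("{OUTPUT}", assistant)
    return text + eos_token


demo_template = ">>> Claim and details:\n{INPUT}\n>>> Critical socratic question:\n{OUTPUT}"
demo_conversation = [
    {"role": "user", "content": "The claim is ..."},
    {"role": "assistant", "content": "What evidence supports it?"},
]
print(render_example(demo_template, demo_conversation))
```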


In [6]:
from unsloth import FastLanguageModel

# Add LoRA adapters (include embed_tokens + lm_head if using base model)
target_modules = [
    "q_proj","k_proj","v_proj","o_proj",
    "gate_proj","up_proj","down_proj",
] 

model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=target_modules,
    lora_alpha=64*2,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Unsloth 2025.11.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
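As a sanity check on the adapter size: a rank-`r` LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` trainable weights. The sketch below applies that to the seven target modules, assuming the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); those dimensions are not stated in this notebook. The total agrees with the `Trainable parameters = 167,772,160` line in the trainer banner of the training step.

```python
HIDDEN, MLP, LAYERS = 4096, 14336, 32   # assumed Llama 3.1 8B config values
KV_DIM = 8 * 128                        # 8 key/value heads x head dim 128 (assumed)

# (in_features, out_features) for each LoRA target module
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(r, shapes=SHAPES, layers=LAYERS):
    """Total LoRA weights: A is (r x in) and B is (out x r) for each module."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * layers


print(lora_param_count(r=64))   # 167772160
print(64 * 2 / 64)              # LoRA scaling lora_alpha / r = 2.0
```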


Train the model

In [7]:
from trl import SFTTrainer, SFTConfig

# Pre-tokenize the dataset (faster training)
def tokenize_fn(batch):
    out = tokenizer(
        batch["text"],
        truncation=True,
        max_length=2048,
        padding=False,
    )
    out["labels"] = out["input_ids"]
    return out
tok_ds = dataset.map(tokenize_fn, batched=True, num_proc=None)


trainer = SFTTrainer(  # supervised fine-tuning trainer
    model = model,
    train_dataset = tok_ds,
    tokenizer = tokenizer,
    dataset_text_field = None,
    max_seq_length = 2048,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 1e-4,  # with a learning rate of 2e-4 training overshoots
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer.train()

Map: 100%|██████████| 12000/12000 [00:02<00:00, 4365.37 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 12,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 167,772,160 of 8,198,033,408 (2.05% trained)


Unsloth: Will smartly offload gradients to save VRAM!
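The banner numbers can be cross-checked by hand. With a per-device batch of 2, 4 gradient accumulation steps and one GPU, each optimizer step consumes 8 examples, so 60 steps cover only a small fraction of the 12,000 examples. A short worked calculation:

```python
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 12_000

effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples

print(effective_batch)   # 8
print(examples_seen)     # 480
print(epoch_fraction)    # 0.04
```

This matches the `'epoch': 0.04` value reported in the `TrainOutput`; a full pass over the data at this batch size would take 12,000 / 8 = 1,500 steps.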


Step,Training Loss
1,3.0409
2,2.952
3,2.9187
4,2.5694
5,2.4572
6,2.1927
7,1.7821
8,1.7146
9,1.613
10,1.3821


TrainOutput(global_step=60, training_loss=1.287584732969602, metrics={'train_runtime': 211.784, 'train_samples_per_second': 2.266, 'train_steps_per_second': 0.283, 'total_flos': 5205218912403456.0, 'train_loss': 1.287584732969602, 'epoch': 0.04})
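For reference, the `linear` scheduler with `warmup_steps = 5` ramps the learning rate from 0 to the peak over the first 5 steps and then decays it linearly to 0 at step 60. The function below re-implements that shape in plain Python as a sketch; it mirrors the usual linear-warmup, linear-decay rule and is not a call into `transformers`. The name `linear_lr` is hypothetical.

```python
def linear_lr(step, peak_lr=1e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` optimizer updates under linear warmup + decay."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: shrink linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(step, linear_lr(step))
```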

### Inference

In [None]:
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)
messages = [
    {"role": "user", "content": "The claim is 'Arizona officials caught changing ballots, have been arrested.'.\nThis is all the information know about the claim: The claim asserts that Arizona officials were caught changing ballots and have been arrested. It is a qualitative, absolute statement with no cited source or methodological detail. Geography (Arizona) is specified, but the time frame is not, leading to alerts about the claim\'s qualitative nature, missing source/methodology, and missing time period..\n"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

NameError: name 'FastLanguageModel' is not defined

Save the LoRA adapters

In [10]:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

('lora_model\\tokenizer_config.json',
 'lora_model\\special_tokens_map.json',
 'lora_model\\chat_template.jinja',
 'lora_model\\tokenizer.json')

Load the saved adapters before inference (set the flag to False if the model is already loaded)

In [1]:
if True: # Load model first before Inference (Set to False if it is already loaded)
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
pass


messages = [
    {"role": "user", "content": "The claim is 'Arizona officials caught changing ballots, have been arrested.'.\nThis is all the information know about the claim: The claim asserts that Arizona officials were caught changing ballots and have been arrested. It is a qualitative, absolute statement with no cited source or methodological detail. Geography (Arizona) is specified, but the time frame is not, leading to alerts about the claim\'s qualitative nature, missing source/methodology, and missing time period..\n"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm
W1122 15:43:25.105000 5704 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


🦥 Unsloth Zoo will now patch everything to make training faster!


  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"{DEVICE_TYPE_TORCH}:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA GeForce RTX 5060 Ti. Num GPUs = 1. Max memory: 15.928 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.11.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


What assumptions are being made about the meaning of "caught" and "changing ballots" that might lead to different interpretations of the claim?<|end_of_text|>
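The attention-mask warning in the log above appears because the pad token and the EOS token share an id, so the mask cannot be inferred from the ids alone. For a single unpadded prompt this is usually harmless, but for batched prompts the mask has to be built from the true sequence lengths. A minimal pure-Python sketch of left-padding with an explicit mask (the helper `pad_left` and the token ids are made up for illustration):

```python
def pad_left(sequences, pad_id):
    """Left-pad token id lists to equal length and return (ids, attention_mask)."""
    width = max(len(seq) for seq in sequences)
    ids, mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        ids.append([pad_id] * n_pad + list(seq))
        mask.append([0] * n_pad + [1] * len(seq))   # 1 = real token, 0 = padding
    return ids, mask


EOS = 2  # hypothetical id, used here as both pad and EOS
ids, mask = pad_left([[5, 6, 7, EOS], [8, EOS]], pad_id=EOS)
print(ids)    # [[5, 6, 7, 2], [2, 2, 8, 2]]
print(mask)   # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

In the second row the trailing `2` is a real EOS token while the leading `2`s are padding; only the mask tells them apart. With a Hugging Face tokenizer, the same information is normally obtained from the tokenizer's `attention_mask` output and passed to `model.generate(..., attention_mask=...)`.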


### Exporting to Ollama

Before you do this, you need to install llama.cpp.
- On a Windows computer, run: `winget install llama.cpp`
- Next, check where Windows installed llama.cpp: `where llama-quantize.exe`
- Copy `llama-quantize.exe` to the `llama.cpp` folder in your project folder

If the `llama.cpp` folder already exists, first run in PowerShell:
`if (Test-Path ".\llama.cpp") {Remove-Item ".\llama.cpp" -Recurse -Force}`
Next, copy the `llama-quantize.exe` file to `llama.cpp`.

`q4_k_m` doesn't work with the code below. Run the `f16` option instead and convert the result to `q4_k_m` manually in PowerShell:
- `# Pick your 16-bit GGUF file name:`
- `$IN  = "model\unsloth.BF16.gguf"   # adjust to your actual file`
- `$OUT = "model\unsloth.q4_k_m.gguf"`

- `.\llama-quantize.exe "$IN" "$OUT" q4_k_m`
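The same conversion can be scripted from Python. The sketch below only assembles and runs the `llama-quantize <input> <output> <type>` command shown above. The helper names and the default executable name are assumptions, and the file names in the usage line are the placeholder ones from the PowerShell snippet.

```python
import subprocess
from pathlib import Path


def build_quantize_cmd(src, dst, qtype="q4_k_m", exe="llama-quantize.exe"):
    """Return the argument list for llama-quantize: input, output, type."""
    return [str(exe), str(Path(src)), str(Path(dst)), qtype]


def quantize_gguf(src, dst, qtype="q4_k_m", exe="llama-quantize.exe"):
    """Run llama-quantize; raise if the input is missing or the tool fails."""
    if not Path(src).is_file():
        raise FileNotFoundError(f"16-bit GGUF not found: {src}")
    # check=True turns a non-zero exit code into CalledProcessError.
    subprocess.run(build_quantize_cmd(src, dst, qtype, exe), check=True)
    return Path(dst)
```

Usage on Windows might look like `quantize_gguf(r"model\unsloth.BF16.gguf", r"model\unsloth.q4_k_m.gguf", exe=r".\llama-quantize.exe")`; adjust the paths to your actual files.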

In [1]:
from dotenv import load_dotenv
load_dotenv(dotenv_path=".env", override=True)

# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

NameError: name 'model' is not defined

In [3]:
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3) # Wait for a few seconds for Ollama to load!
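A fixed `time.sleep(3)` can be too short on a slow machine. One alternative, sketched below, is to poll the server's TCP port until it accepts connections. It assumes Ollama is listening on its default address `127.0.0.1:11434`; the helper `wait_for_port` is hypothetical.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=11434, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is up; close it right away.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not ready yet, retry shortly
    return False
```

Calling `wait_for_port()` right after `subprocess.Popen(["ollama", "serve"])` would replace the fixed sleep.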

Finally, load the model into Ollama

In [5]:
!ollama create unsloth_llama_q4_k_m -f ./model/Modelfile_q4_k_m

gathering model components

In [2]:
!ollama list

NAME                           ID              SIZE      MODIFIED     
unsloth_llama_q4_k_m:latest    354540974481    4.9 GB    41 hours ago    
unsloth_model:latest           4cf99a75cee6    8.5 GB    41 hours ago    
llama3:8b                      365c0bd3c000    4.7 GB    10 days ago     
qwen3:4b                       e55aed6fe643    2.5 GB    2 weeks ago     
