See this Colab notebook to try out the final trained model:

https://colab.research.google.com/drive/1tA6aTU_PtH5Ihh9un_jb2IHY9rHNwKcv?usp=sharing

### Unsloth

In [1]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 16000 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.588 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [2]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 4,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True, # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.2.15 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
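As a sanity check on the adapter size, here is a minimal sketch that counts the LoRA parameters by hand. It assumes the published Llama-3.1-8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), which are not stated in this notebook. Each adapted linear layer adds two small matrices, so it contributes `r * (in_features + out_features)` parameters.

```python
# Assumed Llama-3.1-8B dimensions (not stated in the notebook itself).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads x 128 head dim
NUM_LAYERS = 32

# Values taken from the get_peft_model call above.
RANK = 16
LORA_ALPHA = 4

# (in_features, out_features) for each targeted projection.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / RANK  # standard LoRA scaling when use_rslora = False

print(f"{trainable:,} trainable parameters")  # 41,943,040
print(f"LoRA scaling factor = {scaling}")     # 0.25
```

Under these assumptions the total matches the `Number of trainable parameters = 41,943,040` line that the trainer prints later, which is roughly half a percent of a model with about 8 billion parameters.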


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation-style finetunes. Llama-3 renders multi-turn conversations as shown below:

```txt
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
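To make the layout concrete, here is a simplified sketch that renders a list of messages into this format by hand. `render_llama3` is a hypothetical helper written for illustration, not part of Unsloth or Transformers. The real `llama-3.1` template does more: for example, it injects the "Cutting Knowledge Date" lines into the system turn, as the rendered `text` field shows further down.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Render role/content messages into the Llama-3 layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave an open assistant header so the model writes the next turn.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

example = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(example)
print(rendered)
```

For actual training and inference, use `tokenizer.apply_chat_template`, which applies the full template.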

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [3]:
import json
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    conversations = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(conversation, tokenize = False, add_generation_prompt = False) 
        for conversation in conversations
    ]
    return { "text" : texts, }

from datasets import Dataset

# Load the list of {"conversations": [...]} records and close the file afterwards.
with open('./data/article_formatter_dataset.json') as f:
    dataset = Dataset.from_list(json.load(f))


In [4]:
print(dataset[0])

{'conversations': [{'content': 'You are a helpful assistant who formats articles given to you by the user.', 'role': 'system'}, {'content': 'Format the following article for me: \n\n<beginning of article>He helps the soul to approach with confidence, and yet with reverence with filial fear, and yet with an emboldened faith with zeal and importunity, and yet with humble submission with lively hope, and yet with self-denial. --David Clarkson 1622 to 1686 Imagine Esther. She stands at the door to the kings inner court. She has not been invited into his presence. She hesitates, knowing the fateful step to follow. She may wear the royal robes as a Persian queen, yet she is far more a prized possession than a beloved wife. In fact, the very circumstances of her ascent to the palace were designed to put queens, and all women in the empire, in their place. Her predecessor had refused the kings summons when he desired to show off her beauty at a royal feast. In response, the king deposed his qu

In [5]:
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/873 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [6]:
dataset[5]

{'conversations': [{'content': 'You are a helpful assistant who formats articles given to you by the user.',
   'role': 'system'},
  {'content': 'Format the following article for me: \n\n<beginning of article>Statistics of servants of God who do not end well can be very discouraging. Dr. J. Robert Clinton conducted a now-famous study of how leaders in Scripture finished their tenures. Of those about whom he was able to get sufficient data, Clinton determined that about one in three finished well. Commenting on the contemporary scene, he suggested that the ratio is probably even worse. But if what the Bible says about Gods keeping power is true Philippians 1 6https ref.lyPhil201.6esv?tbiblia 2 Timothy 1 12, it does not have to be so. I want to focus on two issues that hinder people from finishing well. The first is disappointment over what they have experienced. When talking to Christian leaders, I have often found that, once you get below the surface, there is deep-seated anger or disc

In [7]:
dataset[5]["conversations"]

[{'content': 'You are a helpful assistant who formats articles given to you by the user.',
  'role': 'system'},
 {'content': 'Format the following article for me: \n\n<beginning of article>Statistics of servants of God who do not end well can be very discouraging. Dr. J. Robert Clinton conducted a now-famous study of how leaders in Scripture finished their tenures. Of those about whom he was able to get sufficient data, Clinton determined that about one in three finished well. Commenting on the contemporary scene, he suggested that the ratio is probably even worse. But if what the Bible says about Gods keeping power is true Philippians 1 6https ref.lyPhil201.6esv?tbiblia 2 Timothy 1 12, it does not have to be so. I want to focus on two issues that hinder people from finishing well. The first is disappointment over what they have experienced. When talking to Christian leaders, I have often found that, once you get below the surface, there is deep-seated anger or discontentment over disa

In [8]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a helpful assistant who formats articles given to you by the user.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nFormat the following article for me: \n\n<beginning of article>Statistics of servants of God who do not end well can be very discouraging. Dr. J. Robert Clinton conducted a now-famous study of how leaders in Scripture finished their tenures. Of those about whom he was able to get sufficient data, Clinton determined that about one in three finished well. Commenting on the contemporary scene, he suggested that the ratio is probably even worse. But if what the Bible says about Gods keeping power is true Philippians 1 6https ref.lyPhil201.6esv?tbiblia 2 Timothy 1 12, it does not have to be so. I want to focus on two issues that hinder people from finishing well. The first is disappointment over what they have experienced. When talking 

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). Here we train for two full epochs (`num_train_epochs = 2`); for a quick test run, set `max_steps` to a small number instead. We also support TRL's `DPOTrainer`!

In [9]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 2, # Two full passes over the dataset.
        # max_steps = 2, # Uncomment for a quick test run instead of full epochs.
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3408,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Converting train dataset to ChatML (num_proc=2):   0%|          | 0/873 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/873 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/873 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/873 [00:00<?, ? examples/s]

We also use Unsloth's `train_on_responses_only` function to compute the loss only on the assistant outputs and ignore the user's inputs.

In [10]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/873 [00:00<?, ? examples/s]

We verify masking is actually done:

In [11]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a helpful assistant who formats articles given to you by the user.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nFormat the following article for me: \n\n<beginning of article>Statistics of servants of God who do not end well can be very discouraging. Dr. J. Robert Clinton conducted a now-famous study of how leaders in Scripture finished their tenures. Of those about whom he was able to get sufficient data, Clinton determined that about one in three finished well. Commenting on the contemporary scene, he suggested that the ratio is probably even worse. But if what the Bible says about Gods keeping power is true Philippians 1 6https ref.lyPhil201.6esv?tbiblia 2 Timothy 1 12, it does not have to be so. I want to focus on two issues that hinder people from finishing well. The first is disappointment over what they have experienc

In [12]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

We can see the System and Instruction prompts are successfully masked!
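The idea behind the masking can be sketched in plain Python. This is an illustration on toy token ids, not Unsloth's actual implementation: every label starts as `-100` (the value the loss ignores), and only the tokens between an assistant header and the next user header are copied back in. `find_all` and `mask_non_responses` are hypothetical helper names.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def find_all(seq, pattern):
    """Return every index where `pattern` occurs as a sub-list of `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response header."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)  # first token after the assistant header
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)  # stop at the next user header
        labels[begin:end] = input_ids[begin:end]
    return labels

# Toy ids: 1 = user header, 2 = assistant header, 9 = end of turn.
toy_ids = [0, 1, 11, 12, 9, 2, 21, 22, 9, 1, 13, 9, 2, 23, 9]
labels = mask_non_responses(toy_ids, instruction_ids=[1], response_ids=[2])
print(labels)
```

In the toy example, only the two assistant turns (`21, 22, 9` and `23, 9`) keep their labels; the system, user, and header tokens are all `-100`.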

In [13]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")


GPU = NVIDIA GeForce RTX 3090. Max memory = 23.588 GB.
5.516 GB of memory reserved.


In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 873 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 218
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,0.2157
2,0.1897
3,0.2142
4,0.1962
5,0.1782
6,0.1792
7,0.1702
8,0.1838
9,0.1488
10,0.1449
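The `Total batch size = 8` and `Total steps = 218` values in the banner follow from the training arguments. The sketch below reproduces them; rounding the optimizer steps per epoch down is an assumption chosen because it matches the logged total, and the exact rule may differ between `transformers` versions.

```python
num_examples = 873
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_epochs = 2
num_gpus = 1

# Examples consumed per optimizer step.
total_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Micro-batches per epoch: the last, smaller batch is kept (ceiling division).
micro_batches_per_epoch = -(-num_examples // (per_device_batch_size * num_gpus))

# Assumption: optimizer steps per epoch round down, which reproduces the log.
steps_per_epoch = micro_batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_epochs

print(f"Total batch size = {total_batch_size}")  # 8
print(f"Total steps = {total_steps}")            # 218
```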


In [15]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

7054.8858 seconds used for training.
117.58 minutes used for training.
Peak reserved memory = 19.672 GB.
Peak reserved memory for training = 14.156 GB.
Peak reserved memory % of max memory = 83.398 %.
Peak reserved memory for training % of max memory = 60.014 %.
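For a rough sense of throughput, the logged runtime can be combined with the step count from the training banner. This is simple arithmetic on the numbers printed above, not an additional measurement.

```python
train_runtime_s = 7054.8858  # from the timing output above
total_steps = 218            # from the training banner
num_examples = 873
num_epochs = 2

seconds_per_step = train_runtime_s / total_steps
examples_per_second = num_examples * num_epochs / train_runtime_s

print(f"{seconds_per_step:.2f} seconds per optimizer step")  # 32.36
print(f"{examples_per_second:.3f} examples per second")
```

At roughly half a minute per step on the RTX 3090, most of the cost likely comes from the long article sequences rather than the adapter size.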


<a name="Inference"></a>
### Inference
Let's run the model! You can change the system prompt and the article in the user message; `add_generation_prompt = True` leaves the assistant turn open for the model to fill in.

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
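As a rough illustration of what these two settings do, the sketch below applies temperature scaling and then min-p filtering to a toy set of logits: any token whose probability is below `min_p` times the top token's probability is dropped, and the rest are renormalized. `min_p_filter` is a hypothetical helper, and applying temperature before the filter is an assumption of this sketch; the library's own sampling code may differ in detail.

```python
import math
import random

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature and min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)  # cutoff scales with the top token's probability
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

toy_logits = [2.0, 1.0, 0.0, -3.0]
filtered = min_p_filter(toy_logits, temperature=1.5, min_p=0.1)

# Sample the next token id from the surviving candidates.
rng = random.Random(0)
next_token = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(filtered, next_token)
```

A higher temperature flattens the distribution, while the min-p cutoff removes the unlikely tail, so the two settings are usually tuned together.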

In [16]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant who formats articles given to you by the user."
    },
    {
        "role": "user",
        "content": "Format the following article for me: \n\n<beginning of article>He helps the soul to approach with confidence, and yet with reverence with filial fear, and yet with an emboldened faith with zeal and importunity, and yet with humble submission with lively hope, and yet with self-denial. --David Clarkson 1622 to 1686 Imagine Esther. She stands at the door to the kings inner court. She has not been invited into his presence. She hesitates, knowing the fateful step to follow. She may wear the royal robes as a Persian queen, yet she is far more a prized possession than a beloved wife. In fact, the very circumstances of her ascent to the palace were designed to put queens, and all women in the empire, in their place. Her predecessor had refused the kings summons when he desired to show off her beauty at a royal feast. In response, the king deposed his queen and launched an empire-wide search for a new one. Esther, an orphaned Jew under the care of her uncle Mordecai, had won the pageant. She may be a queen, but she is far from his peer -- and not even his only woman. Esther now stands at a crossroads between one likely death or another. The king has been tricked into issuing an irreversible edict against the Jews, not knowing his new queen is Jewish. Her uncle has pled she help her people and warns she too may die if the decree endures. At the same time, death may await if she approaches the king unbidden. Known by all, the royal dictate stipulates, If any man or woman goes to the king inside the inner court without being called, there is but one law -- to be put to death. That is, with one exception. Uninvited visitors will be assumed dead on arrival except the one to whom the king holds out the golden scepter so that he may live Esther 4 11. 
Even though Esther questions whether she might not presently be in his good favor As for me, I have not been called to come in to the king these thirty days, verse 11, she embraces the risk and puts her hand to the door, knowing, as she has said to her uncle, If I perish, I perish verse 17. She enters. Today, some 2,500 years later, we still celebrate Esthers courage. Faced with such uncertainties and possible death on two sides, she took action that might rescue others, rather than waiting passively for her own fate. But mark this those who claim Christ do not stand in Esthers uninvited, uncertain place when we dare to approach the inner court of heaven. Even though our Kings majesty far outstrips the Persian king of kings over 127 provinces from India to Ethiopia, we approach his throne to make our requests with a stunning confidence. Esther was not wrong to proceed with caution, yet we draw near to a far higher throne and do so with boldness, knowing that, in Christ, the God of heaven already has extended to us his golden scepter. Christian prayer invites a striking mingling of the utmost reverence with the deepest confidence. First, reverence -- and nothing less than reverence -- befits our drawing near to the very throne of heaven, the seat of God Almighty, the all-seeing, all-just, and all-powerful. We need the rebalancing of the Holy Spirit, the whole Christ, and wholesome fellowship. As for that Persian ruler, so-called king of kings, so great were the riches of his royal glory and the splendor and pomp of his greatness that he could make a show of them for 180 days Esther 1 4. Known to history as Xerxes the Great, his majesty far outshone that of David and Solomon, and even Nebuchadnezzar and Cyrus. He was the wealthiest and most powerful man alive, more so than any who had lived to that point in time, surpassing not only his peers but his predecessors. Ponder such a king, and then compare his glory to that of God Almighty. 
The Majesty who is the great King above all gods Psalm 95 3 is orders of magnitude greater than any Persian ruler. God Almighty deserves far greater reverence upon drawing near. One might dare approach Xerxes, as Esther did, and hope to land his favor. Should we not suspect, then, fear and uncertainty in approaching heavens throne that would dwarf that? Yet in Christ, the true king of kings -- and great high priest -- we are summoned to approach the throne with confidence. While Esther advanced toward the sovereign of Persia with feminine courage, all who are in Christ, men and women, Jew and Gentile, draw near to God Almighty, as exhorts Hebrews, with confidence. Let us then with confidence draw near to the throne of grace, that we may receive mercy and find grace to help in time of need. Hebrews 4 16 In Jesus, we come not just to the Seat of Heaven, with its unsurpassed height and greatness, but we come to the throne of grace. It is indeed a throne. The Sovereign over all the world and its history sits in omnipotence. His Majesty far outshines the glories of ancient Persia and Greece and Rome and all human glories, past and present, combined. And still, Hebrews bids us come into his presence. How then might we, weak and inadequate as we are, muster sufficient reverence to approach such a dignitary? We are not left to ourselves, but we have a Helper who is God himself. According to Thomas Boston 1676 to 1732, Gods own Holy Spirit works in us a holy reverence of God, to whom we pray, which is necessary in acceptable prayer. By this view he strikes us with a holy dread and awe of the majesty of God Complete Works, 11 62. And we come to a throne of grace. Can we justly doubt the Fathers favor toward his own Son? The boldness with which we come is not confidence in ourselves, our merit, our dessert, our worth. It is confidence in Jesus, his person, his sonship, his acceptance, his priesthood, his merit, his worth. And here too, we have a Helper. 
Gods own Spirit, says Boston, works in us this holy confidence This is it that makes prayer an ease to a troubled heart, the Spirit exciting in us holy confidence in God as a Father. If you ask, How will I get this mingling right? Is it not beyond me to have both a reverence worthy of God and a confidence worthy of Christ? Yes, it is beyond us. Which is why the Helper, dwelling in us, is so vital in prayer. He helps us in our weakness Romans 8 26. He works in us reverence, works in us boldness, and keeps working in us their mingling. And with such help, Joseph Hall 1574 to 1656 is so bold as to say, Good prayers never come weeping home. I am sure I shall receive either what I ask or what I should ask. As we pray, day in and day out, we lean on the Spirit to feed both holy reverence and holy confidence in our hearts. And with his help, we seek to keep the whole story of the whole Christ before us his self-emptying in becoming human, his self-humbling in going to the cross, and his power and glory in rising from the dead and ascending to heaven and being seated at Gods right hand. For whole prayers, we need the whole Christ creator God and fellow man. And we need his whole story condescension in the incarnation and cross, and exaltation up from the grave and up to heaven and up to the throne. We need his majesty and meekness, which make us both reverent Our Father in heaven, hallowed be your name . . . and confident . . . in Jesuss name we pray. Doubtless, our human lives and church lives are complex. In our finitude and sin, we often emphasize one truth to the detriment of another. In both our prayer closets and corporate petitions, we tip toward imbalances -- too casual or too timid -- and need the rebalancing of the Holy Spirit, the whole Christ, and wholesome fellowship. And so along this journey, God manages this mingling in us, this holy blend of awe before his Majesty and audacity in our Messiah. So, what became of Esthers daring entrance? 
When the king saw Queen Esther standing in the court, she won favor in his sight, and he held out to Esther the golden scepter that was in his hand. Then Esther approached and touched the tip of the scepter. And the king said to her, What is it, Queen Esther? What is your request? It shall be given you, even to the half of my kingdom. Esther 5 2 to 3 So too we come and keep coming -- daily in prayer, weekly in corporate worship. Gods Spirit gives us reverence with confidence, humility with boldness, awe with audacity. In light of Gods might, we approach him with holy fear in light of his mercy, we come with expectant delight. In Christ, we enter not only assured that we already have the Kings favor but knowing that our Fathers kingdom far surpasses all others. And it is our Fathers good pleasure to give it all, with Christ, to his church. Knowing our lowliness and Christs worthiness, we neither grovel nor saunter into the presence of God. And we do not go home flippant or weeping. In Christ, we will receive what we ask or what we should have asked. Thank you, Holy Spirit.\n\n<end of article>"
    },
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 8000, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a helpful assistant who formats articles given to you by the user.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nFormat the following article for me: \n\n<beginning of article>He helps the soul to approach with confidence, and yet with reverence with filial fear, and yet with an emboldened faith with zeal and importunity, and yet with humble submission with lively hope, and yet with self-denial. --David Clarkson 1622 to 1686 Imagine Esther. She stands at the door to the kings inner court. She has not been invited into his presence. She hesitates, knowing the fateful step to follow. She may wear the royal robes as a Persian queen, yet she is far more a prized possession than a beloved wife. In fact, the very circumstances of her ascent to the palace were designed to put queens, and all women in the empire, in their place. Her predecessor had 

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [17]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant who formats articles given to you by the user."
    },
    {
        "role": "user",
        "content": "Format the following article for me: \n\n<beginning of article>He helps the soul to approach with confidence, and yet with reverence with filial fear, and yet with an emboldened faith with zeal and importunity, and yet with humble submission with lively hope, and yet with self-denial. --David Clarkson 1622 to 1686 Imagine Esther. She stands at the door to the kings inner court. She has not been invited into his presence. She hesitates, knowing the fateful step to follow. She may wear the royal robes as a Persian queen, yet she is far more a prized possession than a beloved wife. In fact, the very circumstances of her ascent to the palace were designed to put queens, and all women in the empire, in their place. Her predecessor had refused the kings summons when he desired to show off her beauty at a royal feast. In response, the king deposed his queen and launched an empire-wide search for a new one. Esther, an orphaned Jew under the care of her uncle Mordecai, had won the pageant. She may be a queen, but she is far from his peer -- and not even his only woman. Esther now stands at a crossroads between one likely death or another. The king has been tricked into issuing an irreversible edict against the Jews, not knowing his new queen is Jewish. Her uncle has pled she help her people and warns she too may die if the decree endures. At the same time, death may await if she approaches the king unbidden. Known by all, the royal dictate stipulates, If any man or woman goes to the king inside the inner court without being called, there is but one law -- to be put to death. That is, with one exception. Uninvited visitors will be assumed dead on arrival except the one to whom the king holds out the golden scepter so that he may live Esther 4 11. 
Even though Esther questions whether she might not presently be in his good favor As for me, I have not been called to come in to the king these thirty days, verse 11, she embraces the risk and puts her hand to the door, knowing, as she has said to her uncle, If I perish, I perish verse 17. She enters. Today, some 2,500 years later, we still celebrate Esthers courage. Faced with such uncertainties and possible death on two sides, she took action that might rescue others, rather than waiting passively for her own fate. But mark this those who claim Christ do not stand in Esthers uninvited, uncertain place when we dare to approach the inner court of heaven. Even though our Kings majesty far outstrips the Persian king of kings over 127 provinces from India to Ethiopia, we approach his throne to make our requests with a stunning confidence. Esther was not wrong to proceed with caution, yet we draw near to a far higher throne and do so with boldness, knowing that, in Christ, the God of heaven already has extended to us his golden scepter. Christian prayer invites a striking mingling of the utmost reverence with the deepest confidence. First, reverence -- and nothing less than reverence -- befits our drawing near to the very throne of heaven, the seat of God Almighty, the all-seeing, all-just, and all-powerful. We need the rebalancing of the Holy Spirit, the whole Christ, and wholesome fellowship. As for that Persian ruler, so-called king of kings, so great were the riches of his royal glory and the splendor and pomp of his greatness that he could make a show of them for 180 days Esther 1 4. Known to history as Xerxes the Great, his majesty far outshone that of David and Solomon, and even Nebuchadnezzar and Cyrus. He was the wealthiest and most powerful man alive, more so than any who had lived to that point in time, surpassing not only his peers but his predecessors. Ponder such a king, and then compare his glory to that of God Almighty. 
The Majesty who is the great King above all gods Psalm 95 3 is orders of magnitude greater than any Persian ruler. God Almighty deserves far greater reverence upon drawing near. One might dare approach Xerxes, as Esther did, and hope to land his favor. Should we not suspect, then, fear and uncertainty in approaching heavens throne that would dwarf that? Yet in Christ, the true king of kings -- and great high priest -- we are summoned to approach the throne with confidence. While Esther advanced toward the sovereign of Persia with feminine courage, all who are in Christ, men and women, Jew and Gentile, draw near to God Almighty, as exhorts Hebrews, with confidence. Let us then with confidence draw near to the throne of grace, that we may receive mercy and find grace to help in time of need. Hebrews 4 16 In Jesus, we come not just to the Seat of Heaven, with its unsurpassed height and greatness, but we come to the throne of grace. It is indeed a throne. The Sovereign over all the world and its history sits in omnipotence. His Majesty far outshines the glories of ancient Persia and Greece and Rome and all human glories, past and present, combined. And still, Hebrews bids us come into his presence. How then might we, weak and inadequate as we are, muster sufficient reverence to approach such a dignitary? We are not left to ourselves, but we have a Helper who is God himself. According to Thomas Boston 1676 to 1732, Gods own Holy Spirit works in us a holy reverence of God, to whom we pray, which is necessary in acceptable prayer. By this view he strikes us with a holy dread and awe of the majesty of God Complete Works, 11 62. And we come to a throne of grace. Can we justly doubt the Fathers favor toward his own Son? The boldness with which we come is not confidence in ourselves, our merit, our dessert, our worth. It is confidence in Jesus, his person, his sonship, his acceptance, his priesthood, his merit, his worth. And here too, we have a Helper. 
Gods own Spirit, says Boston, works in us this holy confidence This is it that makes prayer an ease to a troubled heart, the Spirit exciting in us holy confidence in God as a Father. If you ask, How will I get this mingling right? Is it not beyond me to have both a reverence worthy of God and a confidence worthy of Christ? Yes, it is beyond us. Which is why the Helper, dwelling in us, is so vital in prayer. He helps us in our weakness Romans 8 26. He works in us reverence, works in us boldness, and keeps working in us their mingling. And with such help, Joseph Hall 1574 to 1656 is so bold as to say, Good prayers never come weeping home. I am sure I shall receive either what I ask or what I should ask. As we pray, day in and day out, we lean on the Spirit to feed both holy reverence and holy confidence in our hearts. And with his help, we seek to keep the whole story of the whole Christ before us his self-emptying in becoming human, his self-humbling in going to the cross, and his power and glory in rising from the dead and ascending to heaven and being seated at Gods right hand. For whole prayers, we need the whole Christ creator God and fellow man. And we need his whole story condescension in the incarnation and cross, and exaltation up from the grave and up to heaven and up to the throne. We need his majesty and meekness, which make us both reverent Our Father in heaven, hallowed be your name . . . and confident . . . in Jesuss name we pray. Doubtless, our human lives and church lives are complex. In our finitude and sin, we often emphasize one truth to the detriment of another. In both our prayer closets and corporate petitions, we tip toward imbalances -- too casual or too timid -- and need the rebalancing of the Holy Spirit, the whole Christ, and wholesome fellowship. And so along this journey, God manages this mingling in us, this holy blend of awe before his Majesty and audacity in our Messiah. So, what became of Esthers daring entrance? 
When the king saw Queen Esther standing in the court, she won favor in his sight, and he held out to Esther the golden scepter that was in his hand. Then Esther approached and touched the tip of the scepter. And the king said to her, What is it, Queen Esther? What is your request? It shall be given you, even to the half of my kingdom. Esther 5 2 to 3 So too we come and keep coming -- daily in prayer, weekly in corporate worship. Gods Spirit gives us reverence with confidence, humility with boldness, awe with audacity. In light of Gods might, we approach him with holy fear in light of his mercy, we come with expectant delight. In Christ, we enter not only assured that we already have the Kings favor but knowing that our Fathers kingdom far surpasses all others. And it is our Fathers good pleasure to give it all, with Christ, to his church. Knowing our lowliness and Christs worthiness, we neither grovel nor saunter into the presence of God. And we do not go home flippant or weeping. In Christ, we will receive what we ask or what we should have asked. Thank you, Holy Spirit.\n\n<end of article>"
    },
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True, )
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 8000,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

# Courage Before the Throne

> _He helps the soul to approach with confidence, and yet with reverence; with filial fear, and yet with an emboldened faith; with zeal and importunity, and yet with humble submission; with lively hope, and yet with self-denial._ --David Clarkson (1622-1686)

Imagine Esther. She stands at the door to the king's inner court. She has not been invited into his presence. She hesitates, knowing the fateful step to follow. She may wear the royal robes as a Persian queen, yet she is far more a prized possession than a beloved wife. In fact, the very circumstances of her ascent to the palace were designed to put queens, and all women in the empire, in their place.

Her predecessor had refused the king's summons when he desired to show off her beauty at a royal feast. In response, the king deposed his queen and launched an empire-wide search for a new one. Esther, an orphaned Jew under the care of her uncle Mordecai, had won the pageant. She may be a queen, but she 

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
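
As a quick sanity check after a local `save_pretrained`, the sketch below looks for the files a PEFT LoRA save normally writes: `adapter_config.json` plus `adapter_model.safetensors` (or `adapter_model.bin` with older settings). The helper names `describe_adapter_dir` and `looks_like_adapter` are hypothetical and not part of Unsloth or PEFT; they only inspect file names and sizes, and assume default save settings.

```python
from pathlib import Path

# File names PEFT normally writes for LoRA adapters
# (assumption: default save settings were used).
ADAPTER_CONFIG = "adapter_config.json"
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")


def describe_adapter_dir(path):
    """Report whether `path` contains the usual LoRA adapter files."""
    root = Path(path)
    weights = [root / name for name in ADAPTER_WEIGHTS if (root / name).is_file()]
    return {
        "has_config": (root / ADAPTER_CONFIG).is_file(),
        "weight_files": [w.name for w in weights],
        # Total size of the adapter weights in megabytes.
        "weights_mb": round(sum(w.stat().st_size for w in weights) / 1e6, 2),
    }


def looks_like_adapter(path):
    """True if the directory has both a config and at least one weight file."""
    report = describe_adapter_dir(path)
    return report["has_config"] and bool(report["weight_files"])
```

Calling `describe_adapter_dir("lora_model")` after the local save should show a weight file far smaller than the base model, which is a reminder that only the adapters were written.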

In [18]:
from getpass import getpass  # Does not echo the token into the notebook output
TOKEN = getpass("Enter your Hugging Face token: ")

In [19]:
# model.save_pretrained("lora_model")  # Local saving
# tokenizer.save_pretrained("lora_model")
model.push_to_hub("erat-verbum/transcript-formatter", token = TOKEN) # Online saving
tokenizer.push_to_hub("erat-verbum/transcript-formatter", token = TOKEN) # Online saving

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/erat-verbum/transcript-formatter


No files have been modified since last commit. Skipping to prevent empty commit.


Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [20]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

You are probably thinking of the Eiffel Tower, in Paris (capital of France). It is a famous iron lattice tower that is 324 meters (1,063 feet) tall. It was completed in 1889 and took more than 2 years to build.<|eot_id|>


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [21]:
if False:
    # NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
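
Rather than pasting a token into the notebook, one option is to read it from the environment and fall back to a hidden prompt. The sketch below assumes the token is stored in the `HF_TOKEN` environment variable; the helper name `get_hf_token` is hypothetical and not part of Unsloth or `huggingface_hub`.

```python
import os
from getpass import getpass


def get_hf_token(env_var="HF_TOKEN", prompt="Enter your Hugging Face token: "):
    """Return a token from the environment, or prompt for it without echoing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed value, so it is not stored in cell output.
        token = getpass(prompt).strip()
    if not token:
        raise ValueError("No Hugging Face token provided.")
    return token
```

The result can then be passed as `token = get_hf_token()` to `push_to_hub_merged` in place of the empty string.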

In [22]:
# Merge to 16bit
if True: model.save_pretrained_merged("transcript-formatter", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("transcript-formatter", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 38.81 out of 62.59 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 88%|████████▊ | 28/32 [00:00<00:00, 52.26it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:01<00:00, 19.02it/s]


Unsloth: Saving tokenizer... Done.
Done.


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively. We clone `llama.cpp` and save to `q8_0` by default, and all other methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
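
To get a feel for how these choices affect file size, here is a rough estimator. The bits-per-weight figures are an assumption derived from the llama.cpp block layouts (for example, `q8_0` stores 32 int8 values plus one fp16 scale per block, which is 8.5 bits per weight). Real GGUF files also carry metadata and may use other types for some tensors, and `high_fraction` is a value you supply yourself, so treat the output as an approximation only. The function name `estimate_gguf_gb` is hypothetical.

```python
# Approximate bits per weight for a few llama.cpp tensor types
# (assumption: computed from the block layouts, ignoring file metadata).
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q6_k": 6.5625,
    "q5_k": 5.5,
    "q4_k": 4.5,
}


def estimate_gguf_gb(n_params, base_type, high_type=None, high_fraction=0.0):
    """Estimate file size in GB for a model with `n_params` weights.

    `high_fraction` of the weights are stored as `high_type` (e.g. "q6_k"),
    and the rest as `base_type` (e.g. "q4_k").
    """
    if high_type is None:
        high_fraction = 0.0
    if not 0.0 <= high_fraction <= 1.0:
        raise ValueError("high_fraction must be between 0 and 1")
    bits = n_params * (1.0 - high_fraction) * BITS_PER_WEIGHT[base_type]
    if high_type is not None:
        bits += n_params * high_fraction * BITS_PER_WEIGHT[high_type]
    return bits / 8 / 1e9


# Compare a few options for a model with roughly 8 billion parameters.
for label, kwargs in [
    ("f16", {"base_type": "f16"}),
    ("q8_0", {"base_type": "q8_0"}),
    ("q4_k with 10% q6_k", {"base_type": "q4_k", "high_type": "q6_k", "high_fraction": 0.1}),
]:
    print(f"{label}: ~{estimate_gguf_gb(8e9, **kwargs):.2f} GB")
```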

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [23]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("transcript-formatter", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("erat-verbum/transcript-formatter", tokenizer, quantization_method = "q4_k_m", token = TOKEN)

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
