This Colab finetunes to the OG-Tiro Llama3dumbbabytiro repository.

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes

#high RAM not necessary for 8b models, maybe for 70b?
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # automatically does RoPE Scaling internally, can choose any value
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
#load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.
load_in_8bit = True

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit", # useless
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit", #
] # More models at https://huggingface.co/unsloth


model, tokenizer = FastLanguageModel.from_pretrained(
    #model_name = "unsloth/llama-3-8b-bnb-4bit",
    model_name = "unsloth/llama-3-8b", # 8 Bit version
    #model_name = "unsloth/llama-3-70b-bnb-4bit",
    #model_name = "unsloth/llama-3-70b",
    max_seq_length = max_seq_length,
    dtype = dtype,
    #load_in_4bit = load_in_4bit, # i guess this must be changed to load in 8 bit?
    #load_in_8bit = load_in_8bit
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

print("Model and tokenizer successfully loaded.")
print("Model architecture:", model)

Unsloth: You passed in `unsloth/llama-3-8b` and `load_in_4bit = True`.
We shall load `unsloth/llama-3-8b-bnb-4bit` for 4x faster loading.


config.json:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/464 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<a name="Data"></a>
### Data Prep

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [4]:
tiro_prompt = """
### Instruction:
{instruction}

### Here is the original:
{question}

### Answer:
{answer}

### Extra Information:
{extra}

### User Input:
{input}

### Expected Response:
{output}
"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    questions = examples["question"]
    answers = examples["answer"]
    extras = examples["extra"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []

    for instruction, question, answer, extra, input, output in zip(instructions, questions, answers, extras, inputs, outputs):
        # Fill the template with data from the dataset
        text = tiro_prompt.format(
            instruction=instruction,
            question=question,
            answer=answer,
            extra=extra,
            input=input,
            output=output
        ) + EOS_TOKEN  # Add EOS token to mark the end of the text
        texts.append(text)

    return {"text": texts}
pass

from datasets import load_dataset
dataset = load_dataset("OG-Tiro/Finetune_Evaluate_Answer", token="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", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Downloading readme:   0%|          | 0.00/4.56k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.63M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1451 [00:00<?, ? examples/s]

Map:   0%|          | 0/1451 [00:00<?, ? examples/s]

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # will override epochs if max steps is given
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

  self.pid = os.fork()


Map (num_proc=2):   0%|          | 0/1451 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [6]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
5.594 GB of memory reserved.


In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,451 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.3519
2,2.3241
3,2.2911
4,2.2904
5,2.1407
6,1.9906
7,1.6169
8,1.4319
9,1.3796
10,0.9846


In [8]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

119.1881 seconds used for training.
1.99 minutes used for training.
Peak reserved memory = 7.027 GB.
Peak reserved memory for training = 1.433 GB.
Peak reserved memory % of max memory = 17.761 %.
Peak reserved memory for training % of max memory = 3.622 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
# Define the detailed Tiro-specific prompt
tiro_prompt = """
### Instruction:
{instruction}

### Question:
{question}

### Answer:
{answer}

### Extra :
{extra}

### User Input:
{input}

### Expected Response:
{output}
"""

# Sample data for the prompt
sample_data = {
    "instruction": "You provide helpful feedback on my answer to a question. You are given the original question and answer as well as my answer. It's not essential that I replicate the exact wording of the original answer, but rather that I deliver a response which evidences my comprehension of the question. If my answer shows that I have understood the concept, mark it as correct. Use a tone that would be used by two students quizzing each other avoid sounding strict.\n\nPlease answer in JSON format. The answer should have 2 properties:\n    - \"response\": If I answered incorrectly, provide the correct answer, and provide some additional background information. Keep it concise. Also ask me if i have more follow-up questions regarding this.\nIf I answered correctly, give only one word of affirmation like correct, super, awesome. say nothing else.\n    - \"correct\": This is a boolean, i.e. \"true\" or \"false\" and indicates whether I answered the question correctly (true) or incorrectly (false).\n\n Here is the original:\n    Original question: {question}\n    Correct answer: {answer}\n    Additional helpful information: {extra}",
    "question": "Zu welchem Ergebnis führen die Zottenatrophie, Kryptenhyperplasie und der Verlust des Bürstensaums bei der glutensensitiven Enteropathie?",
    "answer": "Malabsorption",
    "extra": "Makroskopische Befunde - Endoskopie:↳ Schleimhaut: Atrophie↳ Duodenale Kerckring-Falten: Wellige Muscheloberfläche („scalloping“)↳ Pflastersteinrelief (eingekerbte Mukosa durch Furchen und Risse)Makroskopische Befunde :↳ Zottenatrophie | Kryptenhyperplasie | Bürstensaumverlust↳ Erhöhte Zahl an intraepithelialen Lymphozyten (IEL) pro 100 EnterozytenMögliche Symptome einer ZöliakeAllgemeinAntriebslosigkeit | Müdigkeit | AppetitlosigkeitIntestinal allgemeinStuhlveränderungen | Abdominalle BeschwerdenIntestinal malabsorptivGewichtsverlust | Gedeihstörung | VitaminmangelzeichenExtraintestinal psychiatrischWesensänderung | Übellaunigkeit | KonzentrationsstörungExtraintestinal neurologischAtaxie | NeuropathieExtraintestinal hepatologischFettleber | TransaminasenerhöhungExtraintestinal dermatologischDermatitis herpetiformis DuhringExtraintestinal weitereIgA-Mangel",
    "input": "Ähm, dass man viel mehr Hunger auf Big Macs hat.",
    "output": ""  # Leave output blank for generation

}

# Ensure the model is ready for inference
FastLanguageModel.for_inference(model)

# Prepare the input for the model
inputs = tokenizer(
    [
        tiro_prompt.format(
            instruction=sample_data["instruction"],
            question=sample_data["question"],
            answer=sample_data["answer"],
            extra=sample_data["extra"],
            input=sample_data["input"],
            output=sample_data["output"]
        )
    ], return_tensors="pt").to("cuda")  # Ensure tensors are on GPU if available

# Generate output using the model
outputs = model.generate(**inputs, max_new_tokens=150, use_cache=True)  # Adjust max_new_tokens if needed for more complex responses
decoded_responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)  # Decode tokens to strings

# Print the responses
for response in decoded_responses:
    print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



### Instruction:
You provide helpful feedback on my answer to a question. You are given the original question and answer as well as my answer. It's not essential that I replicate the exact wording of the original answer, but rather that I deliver a response which evidences my comprehension of the question. If my answer shows that I have understood the concept, mark it as correct. Use a tone that would be used by two students quizzing each other avoid sounding strict.

Please answer in JSON format. The answer should have 2 properties:
    - "response": If I answered incorrectly, provide the correct answer, and provide some additional background information. Keep it concise. Also ask me if i have more follow-up questions regarding this.
If I answered correctly, give only one word of affirmation like correct, super, awesome. say nothing else.
    - "correct": This is a boolean, i.e. "true" or "false" and indicates whether I answered the question correctly (true) or incorrectly (false).



 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [None]:
# Define the detailed Tiro-specific prompt
tiro_prompt = """
### Instruction:
{instruction}

### Question:
{question}

### Correct Answer:
{answer}

### Extra Information:
{extra}

### User Input:
{input}

### Expected Response:
{output}
"""

# Sample data for the prompt
sample_data = {
    "instruction": "You provide helpful feedback on my answer to a question. You are given the original question and answer as well as my answer. It's not essential that I replicate the exact wording of the original answer, but rather that I deliver a response which evidences my comprehension of the question. If my answer shows that I have understood the concept, mark it as correct. Use a tone that would be used by two students quizzing each other avoid sounding strict.",
    "question": "What is the role of beta-blockers in the management of hypertension?",
    "answer": "Beta-blockers reduce blood pressure by decreasing cardiac output and can be used as part of the treatment regimen.",
    "extra": "Beta-blockers are typically considered when patients have concomitant heart failure or tachycardia.",
    "input": "My doctor mentioned adding a beta-blocker to my treatment.",
    "output": ""  # Leave output blank for generation
}

# Ensure the model is ready for inference
FastLanguageModel.for_inference(model)

# Prepare the input for the model using the custom Tiro prompt
inputs = tokenizer(
    [
        tiro_prompt.format(
            instruction=sample_data["instruction"],
            question=sample_data["question"],
            answer=sample_data["answer"],
            extra=sample_data["extra"],
            input=sample_data["input"],
            output=sample_data["output"]
        )
    ], return_tensors="pt").to("cuda")  # Ensure tensors are on GPU if available

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)

# Use TextStreamer for continuous token by token generation
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)  # Adjust max_new_tokens if needed for more complex responses

# Print the generated tokens as they are created
for token in text_streamer.stream():
    print(token, end='')


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>
### Instruction:
You provide helpful feedback on my answer to a question. You are given the original question and answer as well as my answer. It's not essential that I replicate the exact wording of the original answer, but rather that I deliver a response which evidences my comprehension of the question. If my answer shows that I have understood the concept, mark it as correct. Use a tone that would be used by two students quizzing each other avoid sounding strict.

### Question:
What is the role of beta-blockers in the management of hypertension?

### Correct Answer:
Beta-blockers reduce blood pressure by decreasing cardiac output and can be used as part of the treatment regimen.

### Extra Information:
Beta-blockers are typically considered when patients have concomitant heart failure or tachycardia.

### User Input:
My doctor mentioned adding a beta-blocker to my treatment.

### Expected Response:

### Correct Response:
Correct. Beta-blockers are often used in co

AttributeError: 'TextStreamer' object has no attribute 'stream'

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [9]:
!huggingface-cli login #hier muss der huggingface WRITE token rein sonst gehts net


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [12]:
model.save_pretrained("_model") # Local saving
tokenizer.save_pretrained("Llama3dumbbabytiro")
model.push_to_hub("OG-Tiro/Llama3dumbbabytiro", token = "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") # Online saving
tokenizer.push_to_hub("OG-Tiro/Llama3dumbbabytiro", token = "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") # Online saving



Saved model to https://huggingface.co/OG-Tiro/Llama3dumbbabytiro


Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [13]:
#skip this, this is only for local interference
if True:  # This should be set to True if you want to execute this block
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="Llama3dumbbabytiro",  # Updated to your model's name
        max_seq_length=4096,  # Example value, set this as per your model's training config
        dtype=None,  # Adjust based on your hardware capabilities
        load_in_4bit=False,  # Set according to your model's quantization settings
    )
    FastLanguageModel.for_inference(model)

# Update this to your actual prompt format
tiro_prompt = """
### Instruction:
{instruction}

### Question:
{question}

### Correct Answer:
{answer}

### Extra Information:
{extra}

### User Input:
{input}

### Expected Response:
{output}
"""

# Example data to format the prompt
sample_data = {
    "instruction": "Identify the landmark based on the description.",
    "question": "What is a famous tall tower in Paris?",
    "answer": "Eiffel Tower",
    "extra": "It was constructed as the entrance to the 1889 World's Fair.",
    "input": "",
    "output": ""  # Leave output blank for generation
}

inputs = tokenizer(
    [
        tiro_prompt.format(**sample_data)
    ], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
decoded_responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for response in decoded_responses:
    print(response)

Unsloth: You passed in `unsloth/llama-3-8b-bnb-4bit` which is a 4bit model, yet you set
`load_in_4bit = False`. We shall load `unsloth/llama-3-8b` instead.


config.json:   0%|          | 0.00/698 [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



### Instruction:
Identify the landmark based on the description.

### Question:
What is a famous tall tower in Paris?

### Correct Answer:
Eiffel Tower

### Extra Information:
It was constructed as the entrance to the 1889 World's Fair.

### User Input:


### Expected Response:

Please answer in JSON format. The answer should have 2 properties:
    - "response": If you answered incorrectly, give some background information on the correct answer. Keep it concise.
If you answered correctly, give only one word of affirmation like correct, super, awesome. say nothing else.
    - "correct":


### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [15]:
# Merge to 16bit
if True: model.save_pretrained_merged("Llama3dumbbabytiro", tokenizer, save_method = "merged_16bit",)
if True: model.push_to_hub_merged("Llama3dumbbabytiro", tokenizer, save_method = "merged_16bit", token = "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") #highest available quality

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 62.01 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:46<00:00,  1.46s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 62.02 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:44<00:00,  1.38s/it]


Unsloth: Saving to organization with address OG-Tiro/Llama3dumbbabytiro
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving to organization with address OG-Tiro/Llama3dumbbabytiro
Unsloth: Uploading all files... Please wait...


model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

Done.
Saved merged model to https://huggingface.co/None/Llama3dumbbabytiro


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).