To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

Read our **[Qwen3 Guide](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.8.0+cu126)
    Python  3.12.9 (you have 3.12.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.10: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add LoRA adapters so we only need to update a small fraction of all parameters (0.52% in this run, according to the training log further down)!
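To see where the trainable-parameter count comes from, here is a rough sketch that counts LoRA weights by hand. It assumes the publicly documented Llama 3.1 8B shapes (hidden size 4096, 8 key/value heads of dimension 128, MLP size 14336, 32 layers) and the settings used in this notebook (`r = 16`, all seven projection matrices). The helper name `lora_param_count` is made up for illustration; each adapted matrix adds `r * (in_features + out_features)` weights.

```python
# Assumed Llama 3.1 8B shapes (from the public model config, not from this notebook)
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) of every projection listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Count LoRA weights: matrices A (rank x in) and B (out x rank) per module."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


trainable = lora_param_count(16)
total = 8_072_204_288  # total parameter count reported in this notebook's training log
print(f"{trainable:,} trainable parameters ({trainable / total:.2%} of {total:,})")
```

Doubling `r` doubles the number of trainable weights, so the rank is the main knob for trading adapter capacity against memory.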

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.8.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<a name="Data"></a>
### Data Prep
The code below loads the `neuralwork/fashion-style-instruct` dataset and formats it with the Alpaca prompt template, the format used by the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Alpaca.ipynb)

For text completions like novel writing, try this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb).

In [None]:
!pip install datasets transformers --quiet
!pip install -U datasets



In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
from huggingface_hub import login
import os

# Paste your own Hugging Face token here
login(token="#######")

# Load the tokenizer (must match the model you are using)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Meta-Llama-3.1-8B")

# Alpaca-style prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# EOS_TOKEN must be appended to every example
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    # Use one shared instruction for the fashion recommendation task
    instruction = "Provide fashion recommendations based on the user's requirements and preferences."

    inputs = examples["input"]  # The user's requirements and preferences
    outputs = examples["completion"]  # The fashion recommendation response

    texts = []
    for input_text, output in zip(inputs, outputs):
        # Fill the Alpaca template and append EOS_TOKEN
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

# Load the dataset
dataset = load_dataset("neuralwork/fashion-style-instruct", split="train")

# Apply the formatting function
dataset = dataset.map(formatting_prompts_func, batched=True)

# Drop the columns that are no longer needed, keeping only "text"
dataset = dataset.remove_columns(["input", "completion", "context"])

# Inspect the result (print one example)
print("=" * 50)
print("Sample formatted data:")
print("=" * 50)
print(dataset[0]["text"])
print("=" * 50)

# Optional explicit tokenization (training normally handles this automatically)
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding=False,  # Padding is usually done later in the DataLoader
    )

# Optional: pre-tokenize up front (most training scripts do this automatically)
# tokenized_dataset = dataset.map(tokenize_function, batched=True)

print(f"Dataset size: {len(dataset)}")
print("Dataset is ready for training!")


Sample formatted data:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Provide fashion recommendations based on the user's requirements and preferences.

### Input:
I'm a tall, athletic man with broad shoulders and a narrow waist. I prefer sharp, tailored suits that highlight my V-shaped torso.

### Response:
Outfit Combination 1:
- Top: Fitted white linen shirt
- Bottom: Slim-fit beige chinos
- Shoe: Brown leather loafers
- Accessories: Brown woven belt, aviator sunglasses

Outfit Combination 2:
- Top: Light blue oxford button-down shirt
- Bottom: Navy blue tailored trousers
- Shoe: Tan leather brogues
- Accessories: Navy blue patterned pocket square, silver wristwatch

Outfit Combination 3:
- Top: Light gray tailored blazer
- Bottom: Dark wash denim jeans
- Shoe: White canvas sneakers
- Accessories: Black leather belt, silver pendant necklace

Outfit Combina

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the `max_steps` limit. We also support TRL's `DPOTrainer`!
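As a rough back-of-the-envelope check on what "60 steps" means, the sketch below works out the effective batch size and how much of the dataset those steps cover. The numbers (batch size 2, gradient accumulation 4, one GPU, 3,193 training examples) are taken from the training cell and log in this notebook; the exact steps-per-epoch figure the trainer reports may differ slightly depending on how it rounds the last partial batch.

```python
import math

num_examples = 3193               # size of the train split
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60

# One optimizer step consumes this many examples
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Examples seen during the short 60-step run
examples_seen = effective_batch * max_steps

# Approximate number of optimizer steps needed for one full pass over the data
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} "
      f"({examples_seen / num_examples:.0%} of the dataset)")
print(f"Approximate steps for one epoch: {steps_per_epoch}")
```

So the quick run only touches a fraction of the data, and a full epoch would take several times longer on the same hardware.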

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Make sure these variables were defined in the earlier cells:
# - model: the model to fine-tune
# - tokenizer: the matching tokenizer
# - dataset: the formatted training dataset
# - max_seq_length: the maximum sequence length


max_seq_length = 2048  # Or adjust to your needs

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # Must match the dataset column name
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # False is usually the better choice for a recommendation task
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Lower to 1 if the GPU runs out of memory
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # Adjust to the dataset size
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Set to "wandb" to log to Weights & Biases
        save_steps=20,  # Save a checkpoint every 20 steps
        save_total_limit=3,  # Keep at most 3 checkpoints
        dataloader_pin_memory=False,  # Recommended to be False on Colab
    ),
    ),
)



In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.881 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 3,193 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.4145
2,1.359
3,1.3934
4,1.3376
5,1.3296
6,1.2912
7,1.1788
8,1.1704
9,1.1048
10,0.9902


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1136.6109 seconds used for training.
18.94 minutes used for training.
Peak reserved memory = 7.771 GB.
Peak reserved memory for training = 0.89 GB.
Peak reserved memory % of max memory = 52.717 %.
Peak reserved memory for training % of max memory = 6.038 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!



In [None]:
# Use the alpaca_prompt template defined earlier
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Enable fast inference mode
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# Test case 1: basic fashion recommendation
print("=" * 60)
print("Test 1: Basic fashion recommendation")
print("=" * 60)

test_input_1 = "I'm looking for a casual outfit for weekend brunch. I prefer comfortable clothes in neutral colors."

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Provide fashion recommendations based on the user's requirements and preferences.", # instruction
        test_input_1, # input
        "", # output - leave this blank for generation!
    )
], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,
    temperature=0.7,  # Add some creativity
    do_sample=True,   # Enable sampling
    top_p=0.9        # nucleus sampling
)

result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(result[0])

# Test case 2: occasion-specific recommendation
print("\n" + "=" * 60)
print("Test 2: Occasion-specific recommendation")
print("=" * 60)

test_input_2 = "I have a job interview at a tech company next week. I'm a woman in my late 20s and want to look professional but not too formal."

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Provide fashion recommendations based on the user's requirements and preferences.",
        test_input_2,
        "",
    )
], return_tensors="pt").to("cuda")

# Use a TextStreamer to display the generation as it happens
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Generating...")
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=150,
    temperature=0.6,
    do_sample=True,
    top_p=0.85
)

# Test case 3: seasonal recommendation
print("\n" + "=" * 60)
print("Test 3: Seasonal recommendation")
print("=" * 60)

test_input_3 = "What should I wear for a summer wedding? I'm a guest and it's an outdoor ceremony in the afternoon."

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Provide fashion recommendations based on the user's requirements and preferences.",
        test_input_3,
        "",
    )
], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    use_cache=True,
    temperature=0.8,
    do_sample=True,
    repetition_penalty=1.1  # Reduce repetition
)

result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(result[0])

# Test case 4: budget-conscious recommendation
print("\n" + "=" * 60)
print("Test 4: Budget-conscious recommendation")
print("=" * 60)

test_input_4 = "I'm a college student on a tight budget. I need versatile pieces that can work for both classes and going out with friends."

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Provide fashion recommendations based on the user's requirements and preferences.",
        test_input_4,
        "",
    )
], return_tensors="pt").to("cuda")

# Generate and display the output as it streams
text_streamer = TextStreamer(tokenizer, skip_special_tokens=True)
print("Generating...")
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=140,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.9
)

# Functional test: check that the model understands the instruction
print("\n" + "=" * 60)
print("Functional test: checking the model's comprehension")
print("=" * 60)

def test_model_response(user_input, description):
    print(f"\n{description}")
    print("-" * 40)

    inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Provide fashion recommendations based on the user's requirements and preferences.",
            user_input,
            "",
        )
    ], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.6,
        do_sample=True,
        use_cache=True
    )

    result = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(result)
    return result

# Several test cases
test_cases = [
    ("I love vintage style and bold colors. What should I wear to an art gallery opening?", "Art style test"),
    ("I'm going hiking this weekend but want to look cute in photos. Any suggestions?", "Outdoor activity test"),
    ("I need work-from-home outfits that are comfortable but look good on video calls.", "Work-from-home test")
]

for user_input, description in test_cases:
    test_model_response(user_input, description)

print("\n" + "=" * 60)
print("Tests complete!")
print("=" * 60)

Test 1: Basic fashion recommendation
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Provide fashion recommendations based on the user's requirements and preferences.

### Input:
I'm looking for a casual outfit for weekend brunch. I prefer comfortable clothes in neutral colors.

### Response:
Outfit 1:
- Top: A lightweight, loose-fitting t-shirt in a neutral color like white or light gray.
- Bottom: A pair of relaxed-fit jeans in a medium wash.
- Shoes: Slip-on canvas sneakers in a neutral color like black or brown.
- Accessories: A baseball cap in a fun pattern or color to add a playful touch.

Outfit 2:
- Top: A flowy, off-the-shoulder blouse in a soft, neutral color like blush or beige.
- Bottom: A pair of high-waisted wide-leg pants in a solid color like khaki or olive green.
-

Test 2: Occasion-specific recommendation
Generating...
Below is an instruction that describes a task, paired with an input that provides 

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!
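The decoded outputs above repeat the whole prompt before the answer. If you only want the generated recommendation, one possible approach is to split the decoded text on the `### Response:` marker from the Alpaca template. The helper below is a small illustrative sketch (the name `extract_response` is not part of Unsloth or Transformers); it also trims any follow-on `### Instruction:` block in case the model keeps generating past its answer.

```python
RESPONSE_MARKER = "### Response:"
NEXT_BLOCK_MARKER = "### Instruction:"


def extract_response(decoded_text: str) -> str:
    """Return only the text generated after the first response marker."""
    _, found, tail = decoded_text.partition(RESPONSE_MARKER)
    if not found:
        # No marker present: return the text unchanged (minus outer whitespace)
        return decoded_text.strip()
    # If generation ran on into a new prompt block, drop that continuation
    response, _, _ = tail.partition(NEXT_BLOCK_MARKER)
    return response.strip()


sample = (
    "### Instruction:\nProvide fashion recommendations.\n\n"
    "### Input:\nCasual brunch outfit.\n\n"
    "### Response:\nOutfit 1:\n- Top: White t-shirt\n\n"
    "### Instruction:\nSomething else"
)
print(extract_response(sample))
```

With the notebook's variables this would be used as `extract_response(result[0])` after `tokenizer.batch_decode`.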

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [None]:
if True:
    model.push_to_hub_gguf(
        "username/recft_unsloth-Meta-Llama-3.1-8B-2", # Change "username" to your Hugging Face username!
        tokenizer,
        quantization_method = ["q4_k_m"], # Other options: "q8_0", "q5_k_m"
        token = "hf_########", # Your Hugging Face token
    )

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.0G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.16 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 44%|████▍     | 14/32 [00:01<00:01, 11.38it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [03:17<00:00,  6.17s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving floraliuya/recft_unsloth-Meta-Llama-3.1-8B-2/pytorch_model-00001-of-00004.bin...
Unsloth: Saving floraliuya/recft_unsloth-Meta-Llama-3.1-8B-2/pytorch_model-00002-of-00004.bin...
Unsloth: Saving floraliuya/recft_unsloth-Meta-Llama-3.1-8B-2/pytorch_model-00003-of-00004.bin...
Unsloth: Saving floraliuya/recft_unsloth-Meta-Llama-3.1-8B-2/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at floraliuya/recft_unsloth-Meta-Llama-3.1-8B-2 into f16 GGUF format.
The output location will be /content/floraliuya/recft_unsloth-Meta-Llama-3.1-8B-2/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: recft_unsloth-Meta-Llama-3.1-8B-2
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64

  ...-Llama-3.1-8B-2/unsloth.Q4_K_M.gguf:   0%|          | 16.7MB / 4.92GB            

Saved GGUF to https://huggingface.co/floraliuya/recft_unsloth-Meta-Llama-3.1-8B-2


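The cell below runs inference with the model that is still in memory. If you want to use the LoRA adapters we just saved instead, first reload them from the `lora_model` folder with `FastLanguageModel.from_pretrained`: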
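A minimal sketch of reloading the saved adapters with Unsloth might look like the following. It assumes the adapters were saved to `lora_model` as above and reuses the loading arguments from the start of this notebook; the wrapper function `load_saved_adapters` is just an illustrative name, and the import is done inside it because Unsloth needs a CUDA GPU to load.

```python
def load_saved_adapters(adapter_dir="lora_model", max_seq_length=2048):
    """Reload the LoRA adapters saved with save_pretrained and prepare them for inference."""
    # Imported lazily so this definition can be read without a GPU runtime
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_dir,      # folder (or Hub repo) holding the saved adapters
        max_seq_length=max_seq_length,
        dtype=None,                  # auto-detect, as in the loading cell
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)  # Enable faster inference
    return model, tokenizer
```

Calling `model, tokenizer = load_saved_adapters()` in a fresh session would then let the inference cell run against the fine-tuned adapters.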

In [None]:
# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is a famous tall tower in Paris?

### Input:


### Response:
The Eiffel Tower is one of the most famous and iconic landmarks in Paris, France. Standing at a height of 324 meters (1,063 feet), it is the tallest building in Paris and one of the most recognizable structures in the world. Designed by Gustave Eiffel for the 1889 World's Fair, the Eiffel Tower has become a symbol of Paris and a popular tourist attraction.

### Instruction:
Provide fashion recommendations based on the user's requirements and preferences.

### Input:
I'm a tall, slender man with an athletic build. I prefer modern, minimalist styles that emphasize clean lines and a streamlined silhouette


You can also use Hugging Face PEFT's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.
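For reference, a sketch of that fallback path is shown below. It assumes `peft`, `transformers` and `bitsandbytes` are installed and that the adapters live in `lora_model`; the function name `load_adapters_with_peft` is hypothetical, and loading in 4-bit through `BitsAndBytesConfig` is one reasonable choice here rather than the only one.

```python
def load_adapters_with_peft(adapter_dir="lora_model"):
    """Load the saved LoRA adapters without Unsloth (slower; use only as a fallback)."""
    # Imported lazily so the definition itself has no heavy dependencies
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer, BitsAndBytesConfig

    model = AutoPeftModelForCausalLM.from_pretrained(
        adapter_dir,
        # Assumption: quantize the base model to 4-bit to fit on a small GPU
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    )
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    return model, tokenizer
```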

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
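The upload cell earlier uses `push_to_hub_gguf`; the local variant can be sketched in the same way. The wrapper below is only illustrative (the name `save_gguf_locally` and the `"model"` output folder are assumptions), and it expects the fine-tuned Unsloth `model` and `tokenizer` from this notebook.

```python
def save_gguf_locally(model, tokenizer, output_dir="model", quantization_method="q4_k_m"):
    """Export the fine-tuned model to a GGUF file on local disk."""
    # save_pretrained_gguf is the local counterpart of push_to_hub_gguf
    model.save_pretrained_gguf(
        output_dir,
        tokenizer,
        quantization_method=quantization_method,
    )
```

For example, `save_gguf_locally(model, tokenizer, quantization_method="q8_0")` would pick the faster `q8_0` conversion instead of `q4_k_m`.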

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

Now use the exported GGUF file (in the run above, `unsloth.F16.gguf` or `unsloth.Q4_K_M.gguf`) in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, find any bugs, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

