To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Read our **[Qwen3 Guide](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
!pip install datasets transformers --quiet
!pip install -U "datasets==4.3.0"

Collecting datasets==4.3.0
  Downloading datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting pyarrow>=21.0.0 (from datasets==4.3.0)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Downloading datasets-4.3.0-py3-none-any.whl (506 kB)
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m506.8/506.8 kB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (47.7 MB)
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m47.7/47.7 MB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pyarrow, datasets
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 18.1.0
    Uninstalling pyarrow-18.1.0:
      Successfully uninstalled pyarrow-18.1.0
  Attempting uninstall: dat

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct", # For the demo
    # Alternative: unsloth/Meta-Llama-3.1-8B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.8.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.8.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.c

model.safetensors:   0%|          | 0.00/1.10G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.11.3 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
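
To see where the small trainable fraction comes from, here is a rough sketch that counts the LoRA weights by hand. The layer shapes below are assumptions about `unsloth/Llama-3.2-1B-Instruct` (they are not printed anywhere in this notebook, so check the model's `config.json`), and `lora_param_count` is a hypothetical helper, not part of Unsloth or PEFT. Each adapted linear layer gains two small matrices, so it adds `r * (in_features + out_features)` weights.

```python
# Assumed Llama-3.2-1B shapes (not stated in this notebook; verify in config.json).
HIDDEN = 2048          # hidden_size
INTERMEDIATE = 8192    # intermediate_size (MLP width)
KV_DIM = 512           # num_key_value_heads (8) * head_dim (64)
NUM_LAYERS = 16        # matches "patched 16 layers" in the log above
RANK = 16              # the r value passed to get_peft_model

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Count LoRA weights: A is (rank x in) and B is (out x rank) per module."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"{trainable:,} trainable LoRA parameters")
```

Under these assumed shapes the count agrees with the `Trainable parameters` figure that the training banner reports later in this notebook. Doubling `r` doubles the number of adapter weights.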


<a name="Data"></a>
### Data Prep
We format our data with the Alpaca prompt template, as used by the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the 52K-example original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). The code below applies that template to the `neuralwork/fashion-style-instruct` dataset. You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Alpaca.ipynb)

For text completions like novel writing, try this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb).

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
from huggingface_hub import login
import os

# Paste your own Hugging Face token here
login(token="####")

# Load the tokenizer (depends on the model you use)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")  # or unsloth/Meta-Llama-3.1-8B

# Alpaca-format prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    # Define one shared instruction for the fashion recommendation system
    instruction = "Provide fashion recommendations based on the user's requirements and preferences."

    inputs = examples["input"]  # The user's requirements and preferences
    outputs = examples["completion"]  # The fashion recommendation response

    texts = []
    for input_text, output in zip(inputs, outputs):
        # Use the Alpaca format and make sure to append EOS_TOKEN
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

# Load the dataset
dataset = load_dataset("neuralwork/fashion-style-instruct", split="train")

# Apply the formatting function
dataset = dataset.map(formatting_prompts_func, batched=True)

# Remove unneeded columns, keeping only the "text" column
dataset = dataset.remove_columns(["input", "completion", "context"])

# Check the result (print one example)
print("=" * 50)
print("Sample formatted data:")
print("=" * 50)
print(dataset[0]["text"])
print("=" * 50)

# If you need further tokenization (usually handled automatically during training)
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding=False,  # Padding is usually done in the DataLoader
    )

# Optional: pre-tokenize if needed (most training scripts handle this automatically)
# tokenized_dataset = dataset.map(tokenize_function, batched=True)

print(f"Dataset size: {len(dataset)}")
print("Dataset is ready for training!")

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

README.md:   0%|          | 0.00/882 [00:00<?, ?B/s]

data/train-00000-of-00001-9b0ae8e510f95a(…):   0%|          | 0.00/2.64M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3193 [00:00<?, ? examples/s]

Map:   0%|          | 0/3193 [00:00<?, ? examples/s]

Sample formatted data:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Provide fashion recommendations based on the user's requirements and preferences.

### Input:
I'm a tall, athletic man with broad shoulders and a narrow waist. I prefer sharp, tailored suits that highlight my V-shaped torso.

### Response:
Outfit Combination 1:
- Top: Fitted white linen shirt
- Bottom: Slim-fit beige chinos
- Shoe: Brown leather loafers
- Accessories: Brown woven belt, aviator sunglasses

Outfit Combination 2:
- Top: Light blue oxford button-down shirt
- Bottom: Navy blue tailored trousers
- Shoe: Tan leather brogues
- Accessories: Navy blue patterned pocket square, silver wristwatch

Outfit Combination 3:
- Top: Light gray tailored blazer
- Bottom: Dark wash denim jeans
- Shoe: White canvas sneakers
- Accessories: Black leather belt, silver pendant necklace

Outfit Combina

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the `max_steps` limit (in `TrainingArguments`, the default `max_steps=-1` means no step limit). We also support TRL's `DPOTrainer`!
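
To get a feel for how far 60 steps goes, the sketch below estimates how much of the dataset they cover. It uses the values that appear in this notebook (3,193 examples, a per-device batch size of 2, 4 gradient accumulation steps, 1 GPU). The two helper functions are hypothetical and only approximate what the trainer does; the exact step count per epoch can differ by one depending on how the last partial batch is handled.

```python
import math


def effective_batch_size(per_device_batch, grad_accum, num_gpus=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum * num_gpus


def steps_per_epoch(num_examples, per_device_batch, grad_accum, num_gpus=1):
    """Approximate optimizer steps needed to see every example once."""
    batch = effective_batch_size(per_device_batch, grad_accum, num_gpus)
    return math.ceil(num_examples / batch)


def fraction_of_epoch(max_steps, num_examples, per_device_batch, grad_accum, num_gpus=1):
    """Share of the dataset covered after max_steps optimizer steps."""
    batch = effective_batch_size(per_device_batch, grad_accum, num_gpus)
    return max_steps * batch / num_examples


full_epoch_steps = steps_per_epoch(3193, 2, 4)
covered = fraction_of_epoch(60, 3193, 2, 4)
print(f"One epoch is about {full_epoch_steps} steps; 60 steps cover {covered:.1%} of the data.")
```

So the 60-step demo run sees only a fraction of the dataset, which is why a full epoch takes several times longer.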

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Make sure these variables were defined in earlier cells:
# - model: the model to fine-tune
# - tokenizer: the matching tokenizer
# - dataset: the formatted training dataset
# - max_seq_length: the maximum sequence length


max_seq_length = 2048  # Or adjust as needed

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # Must match the dataset column name
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # False usually works better for recommendation systems
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Lower to 1 if GPU memory runs out
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # Adjust based on dataset size
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Change to "wandb" to log with wandb
        save_steps=20,  # Save a checkpoint every 20 steps
        save_total_limit=3,  # Keep at most 3 checkpoints
        dataloader_pin_memory=False,  # Recommended False in Colab
    ),
)


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/3193 [00:00<?, ? examples/s]

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
1.203 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 3,193 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.6994
2,1.6757
3,1.665
4,1.581
5,1.6588
6,1.6439
7,1.4855
8,1.5216
9,1.4595
10,1.2999


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

267.0187 seconds used for training.
4.45 minutes used for training.
Peak reserved memory = 2.641 GB.
Peak reserved memory for training = 1.438 GB.
Peak reserved memory % of max memory = 17.916 %.
Peak reserved memory for training % of max memory = 9.755 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!



In [None]:
# Use the alpaca_prompt template defined earlier
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Enable fast inference mode
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# Test case 1: basic fashion recommendation
print("=" * 60)
print("Test 1: Basic fashion recommendation")
print("=" * 60)

test_input_1 = "I'm looking for a casual outfit for weekend brunch. I prefer comfortable clothes in neutral colors."

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Provide fashion recommendations based on the user's requirements and preferences.", # instruction
        test_input_1, # input
        "", # output - leave this blank for generation!
    )
], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,
    temperature=0.7,  # Add some creativity
    do_sample=True,   # Enable sampling
    top_p=0.9        # nucleus sampling
)

result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(result[0])

# Test case 2: occasion-specific recommendation
print("\n" + "=" * 60)
print("Test 2: Occasion-specific recommendation")
print("=" * 60)

test_input_2 = "I have a job interview at a tech company next week. I'm a woman in my late 20s and want to look professional but not too formal."

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Provide fashion recommendations based on the user's requirements and preferences.",
        test_input_2,
        "",
    )
], return_tensors="pt").to("cuda")

# Use TextStreamer to display generation in real time
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Generating...")
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=150,
    temperature=0.6,
    do_sample=True,
    top_p=0.85
)

# Test case 3: seasonal recommendation
print("\n" + "=" * 60)
print("Test 3: Seasonal recommendation")
print("=" * 60)

test_input_3 = "What should I wear for a summer wedding? I'm a guest and it's an outdoor ceremony in the afternoon."

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Provide fashion recommendations based on the user's requirements and preferences.",
        test_input_3,
        "",
    )
], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    use_cache=True,
    temperature=0.8,
    do_sample=True,
    repetition_penalty=1.1  # Reduce repetition
)

result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(result[0])

# Test case 4: budget-conscious recommendation
print("\n" + "=" * 60)
print("Test 4: Budget-conscious recommendation")
print("=" * 60)

test_input_4 = "I'm a college student on a tight budget. I need versatile pieces that can work for both classes and going out with friends."

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Provide fashion recommendations based on the user's requirements and preferences.",
        test_input_4,
        "",
    )
], return_tensors="pt").to("cuda")

# Generate and display in real time
text_streamer = TextStreamer(tokenizer, skip_special_tokens=True)
print("Generating...")
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=140,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.9
)

# Functional test: check whether the model understands the instruction correctly
print("\n" + "=" * 60)
print("Functional test: model comprehension check")
print("=" * 60)

def test_model_response(user_input, description):
    print(f"\n{description}")
    print("-" * 40)

    inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Provide fashion recommendations based on the user's requirements and preferences.",
            user_input,
            "",
        )
    ], return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.6,
        do_sample=True,
        use_cache=True
    )

    result = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(result)
    return result

# Multiple test cases
test_cases = [
    ("I love vintage style and bold colors. What should I wear to an art gallery opening?", "Art style test"),
    ("I'm going hiking this weekend but want to look cute in photos. Any suggestions?", "Outdoor activity test"),
    ("I need work-from-home outfits that are comfortable but look good on video calls.", "Work-from-home test")
]

for user_input, description in test_cases:
    test_model_response(user_input, description)

print("\n" + "=" * 60)
print("Tests complete!")
print("=" * 60)

Test 1: Basic fashion recommendation
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Provide fashion recommendations based on the user's requirements and preferences.

### Input:
I'm looking for a casual outfit for weekend brunch. I prefer comfortable clothes in neutral colors.

### Response:
Outfit 1:
- Top: A lightweight, white button-up shirt with a relaxed fit. Look for one with a bit of texture or pattern for added interest.
- Bottom: A pair of dark-washed jeans or a pair of straight-leg trousers in a neutral color like navy or black. These will provide a versatile and casual look.
- Shoe: Opt for a pair of white sneakers for a comfortable and trendy touch.
- Accessories: Add a light, colorful scarf to tie in with the brunch vibe. Finish the look with a simple leather bracelet and a pair of round sunglasses.

Outfit 2:
- Top: A lightweight

Test 2: Occasion-specific

 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")


('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/tokenizer.json')

In [None]:
import os
print(os.listdir("lora_model"))


['README.md', 'tokenizer.json', 'adapter_model.safetensors', 'tokenizer_config.json', 'special_tokens_map.json', 'adapter_config.json', 'chat_template.jinja']
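
Before reloading or uploading the adapters, it can be worth confirming that the save actually produced the adapter files shown in the listing above. The sketch below is a minimal check using only the standard library; `missing_adapter_files` is a hypothetical helper, and treating `adapter_config.json` and `adapter_model.safetensors` as the two required files is an assumption based on that listing.

```python
from pathlib import Path

# File names taken from the directory listing above; treating both as
# required is an assumption made for this sketch.
REQUIRED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(directory):
    """Return the required adapter files that are absent from `directory`."""
    folder = Path(directory)
    return [name for name in REQUIRED_ADAPTER_FILES if not (folder / name).is_file()]


missing = missing_adapter_files("lora_model")
if missing:
    print("Missing adapter files:", ", ".join(missing))
else:
    print("lora_model contains the LoRA adapter files.")
```

An empty result means the folder holds the adapter weights and config. Remember that this folder still contains only the adapters, not the full model.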


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
import gc, torch

del trainer    # if it exists
del dataset    # if it exists
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()


In [None]:
# This may not work on school machines; consider pushing to Hugging Face from your own local computer or Colab Pro at home
if True:
    model.push_to_hub_gguf(
        "username/repo name", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m"], #, "q8_0", "q5_k_m",
        token = "######",# Change hf to yours!
    )
    # model.push_to_hub("your_name/lora_model", token = "...") # Online saving
    # tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

You can also use Hugging Face's `AutoPeftModelForCausalLM` (from the `peft` library). Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
