To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (eg for Llama.cpp).

[NEW] Llama-3.1 8b, 70b & 405b are trained on a crazy 15 trillion tokens with 128K long context lengths!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [1]:
%%capture
# Install the `unsloth` package and specify version 0.0.28.post2 for `xformers`
!pip install unsloth "xformers==0.0.28.post2"

# Uninstall any existing version of `unsloth` to avoid conflicts
# Install the latest nightly version directly from the GitHub repository
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Note: The `%%capture` command suppresses output from these installations to keep the notebook clean.
#       Remove `%%capture` if you wish to view installation logs and messages.


* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

In [2]:
from unsloth import FastLanguageModel
import torch

# Set the maximum sequence length for the language model
max_seq_length = 2048  # Choose any value as needed; RoPE Scaling will be auto-managed internally

# Data type setting for model loading
# If set to None, the model will auto-detect the appropriate data type.
# Float16 is recommended for Tesla T4, V100, while Bfloat16 is better for Ampere and newer architectures.
dtype = None

# Flag to use 4-bit quantization for reducing memory usage and speeding up model performance
load_in_4bit = True  # Set to False if full precision is desired, though it will use more memory

# Pre-defined list of 4-bit quantized models compatible with unsloth, which enables faster downloads and avoids Out-Of-Memory (OOM) errors
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 model with 15 trillion tokens, optimized for 2x speed
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4-bit version available for 405 billion parameter model
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral model, 12B parameters, 2x speed
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3, optimized for 2x speed
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 model with 2x speed
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma model, optimized for 2x speed
]  # For a complete list of models, refer to https://huggingface.co/unsloth

# Load the specified model and tokenizer with the defined configurations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",  # Specify the model name to load
    max_seq_length=max_seq_length,           # Apply max sequence length
    dtype=dtype,                             # Set the data type for the model
    load_in_4bit=load_in_4bit,               # Use 4-bit quantization
    # token="hf_...",  # Uncomment and insert token if using gated models like meta-llama/Llama-2-7b-hf
)

# Print statements to confirm model and tokenizer loading
print("Model and tokenizer loaded successfully!")
print("Model name:", model.config.name_or_path)
print("Max sequence length:", max_seq_length)
print("Data type:", dtype if dtype else "Auto-detected")
print("4-bit quantization:", load_in_4bit)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers, TRL and unsloth via:
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`


Model and tokenizer loaded successfully!
Model name: unsloth/meta-llama-3.1-8b-bnb-4bit
Max sequence length: 2048
Data type: Auto-detected
4-bit quantization: True


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
# Configure the model with Parameter-Efficient Fine-Tuning (PEFT) for enhanced performance on limited resources
model = FastLanguageModel.get_peft_model(
    model,  # Base model to apply PEFT settings on
    r=16,   # LoRA rank parameter: controls low-rank adaptation capacity. Higher values (like 8, 16, 32) increase capacity
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # Specify which modules to apply LoRA to

    lora_alpha=16,  # Scaling factor for the LoRA adaptation. Higher values generally increase the impact of adapted weights
    lora_dropout=0, # LoRA dropout: set to 0 for optimized performance without dropout

    bias="none",  # Bias handling mode. "none" option is optimized for resource efficiency

    # Memory optimization options:
    # - "unsloth" is a custom option for reduced VRAM, enabling larger batch sizes or longer context windows
    use_gradient_checkpointing="unsloth", # Option to use "unsloth" gradient checkpointing to reduce memory usage on longer contexts

    random_state=3407,  # Seed for random processes to ensure reproducible results

    use_rslora=False,  # Toggle to enable Rank-Stabilized LoRA (RS-LoRA), which may improve performance stability for certain tasks

    loftq_config=None, # Config setting for LoftQ if further quantization or optimizations are needed
)

# Print statements to confirm the PEFT configuration on the model
print("PEFT configuration applied successfully!")
print("LoRA rank (r):", 16)
print("Target modules:", ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"])
print("LoRA alpha:", 16)
print("LoRA dropout:", 0)
print("Bias setting:", "none")
print("Random state for reproducibility:", 3407)


Unsloth 2024.10.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


PEFT configuration applied successfully!
LoRA rank (r): 16
Target modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
LoRA alpha: 16
LoRA dropout: 0
Bias setting: none
Random state for reproducibility: 3407


<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of 52K of the original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [4]:
# Define a text template for the Alpaca model prompt.
# This prompt includes placeholders for instructions, inputs, and the response format.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Retrieve the EOS (End of Sentence) token from the tokenizer
EOS_TOKEN = tokenizer.eos_token  # Ensures the model generation stops at the appropriate end token

# Define a function to format prompts for each example in the dataset
def formatting_prompts_func(examples):
    instructions = examples["instruction"]  # List of instructions
    inputs = examples["input"]              # List of input contexts
    outputs = examples["output"]            # List of desired outputs

    texts = []  # Initialize list to store formatted texts

    # Loop through each instruction, input, and output to format them according to the prompt template
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Format the prompt and append the EOS token to mark the end of generation
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)  # Add the formatted text to the list

    # Return a dictionary with the key "text" mapped to the list of formatted prompts
    return {"text": texts}

# Load the Alpaca-cleaned dataset for training
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")  # Load the specified dataset

# Apply the formatting function to the dataset in a batched manner
dataset = dataset.map(formatting_prompts_func, batched=True)

# Print statements to confirm successful formatting
print("Dataset loaded and formatting function applied successfully!")
print("Sample formatted text:", dataset[0]["text"])  # Display the first formatted sample


README.md:   0%|          | 0.00/11.6k [00:00<?, ?B/s]

alpaca_data_cleaned.json:   0%|          | 0.00/44.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/51760 [00:00<?, ? examples/s]

Map:   0%|          | 0/51760 [00:00<?, ? examples/s]

Dataset loaded and formatting function applied successfully!
Sample formatted text: Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supp

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Instantiate the SFTTrainer for fine-tuning the model
trainer = SFTTrainer(
    model=model,                 # Pass the model to be fine-tuned
    tokenizer=tokenizer,         # Define the tokenizer to be used
    train_dataset=dataset,       # Training dataset that contains formatted prompts
    dataset_text_field="text",   # Specify the field in the dataset where the text is stored
    max_seq_length=max_seq_length,  # Set maximum sequence length for training samples
    dataset_num_proc=2,          # Number of processes for data loading and preprocessing
    packing=False,               # If True, enables sequence packing to improve efficiency for short sequences

    # Training arguments for fine-tuning
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Batch size per device (e.g., GPU) during training
        gradient_accumulation_steps=4,  # Accumulate gradients over multiple batches before updating weights
        warmup_steps=5,                 # Number of steps for learning rate warmup
        # num_train_epochs=1,           # Uncomment to set a single epoch for a full training run
        max_steps=60,                   # Total number of training steps (overrides num_train_epochs if set)
        learning_rate=2e-4,             # Initial learning rate for training
        fp16=not is_bfloat16_supported(),  # Enable FP16 if BF16 is not supported
        bf16=is_bfloat16_supported(),      # Enable BF16 if supported by the GPU
        logging_steps=1,                # Log metrics every step
        optim="adamw_8bit",             # Optimizer with 8-bit Adam for memory efficiency
        weight_decay=0.01,              # Regularization for weights
        lr_scheduler_type="linear",     # Learning rate scheduler type
        seed=3407,                      # Set seed for reproducibility
        output_dir="outputs",           # Directory to save model checkpoints and logs
        report_to="none",               # Set to WandB or others if using logging services
    ),
)

# Print statements to confirm trainer setup
print("SFTTrainer initialized with the following settings:")
print("Batch size per device:", 2)
print("Gradient accumulation steps:", 4)
print("Max steps:", 60)
print("Learning rate:", 2e-4)
print("FP16 enabled:", not is_bfloat16_supported())
print("BF16 enabled:", is_bfloat16_supported())
print("Optimizer:", "adamw_8bit")
print("Output directory:", "outputs")
print("Training setup complete. Ready to start training!")


Map (num_proc=2):   0%|          | 0/51760 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


SFTTrainer initialized with the following settings:
Batch size per device: 2
Gradient accumulation steps: 4
Max steps: 60
Learning rate: 0.0002
FP16 enabled: True
BF16 enabled: False
Optimizer: adamw_8bit
Output directory: outputs
Training setup complete. Ready to start training!


In [6]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.984 GB of memory reserved.


In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers, TRL and Unsloth!
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`


Step,Training Loss
1,1.8176
2,2.3042
3,1.6892
4,1.9384
5,1.657
6,1.6218
7,1.1873
8,1.2638
9,1.1013
10,1.1896


In [8]:
if torch.cuda.is_available():
    # Define maximum GPU memory
    max_memory = torch.cuda.get_device_properties(0).total_memory / 1024 / 1024 / 1024  # Convert bytes to GB

    # Store the initial GPU memory usage before training starts
    start_gpu_memory = torch.cuda.memory_reserved() / 1024 / 1024 / 1024  # Initial GPU memory in GB

    # Calculate memory usage statistics after training
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)  # Peak reserved memory in GB
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)  # Memory used specifically for LoRA training in GB
    used_percentage = round(used_memory / max_memory * 100, 3)  # Percentage of peak memory usage relative to max memory
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)  # Percentage of LoRA-specific memory usage

    # Print runtime and memory statistics
    print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
    print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory % of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
else:
    print("CUDA device not available. Memory and time stats are not displayed.")


464.3659 seconds used for training.
7.74 minutes used for training.
Peak reserved memory = 7.922 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 53.715 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [9]:
# Assume `alpaca_prompt` is defined from the previous code snippet

# Enable faster inference mode on the model
FastLanguageModel.for_inference(model)  # Activates 2x faster inference

# Prepare input text using the `alpaca_prompt` format, specifying instruction and input context
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Continue the fibonacci sequence.",  # Instruction for model
            "1, 1, 2, 3, 5, 8",  # Input context to start the sequence
            ""  # Leave output blank to prompt the model to generate the continuation
        )
    ],
    return_tensors="pt"  # Return PyTorch tensors for model compatibility
).to("cuda")  # Move input tensors to GPU for faster processing

# Generate the model's response with specified parameters
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

# Decode the generated output tokens to a readable format
decoded_outputs = tokenizer.batch_decode(outputs)

# Print the decoded outputs to see the generated sequence continuation
print("Generated sequence continuation:")
print(decoded_outputs)


Generated sequence continuation:
['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonacci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025']


 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [10]:
# Assume `alpaca_prompt` is defined from the previous code snippet

# Enable faster inference mode for the model
FastLanguageModel.for_inference(model)  # Enables native 2x faster inference mode

# Prepare the input sequence using the `alpaca_prompt` format with instruction and context
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Continue the fibonacci sequence.",  # Instruction for the model
            "1, 1, 2, 3, 5, 8",  # Input context to help the model start the sequence
            ""  # Leave output blank for the model to generate the continuation
        )
    ],
    return_tensors="pt"  # Output tensors in PyTorch format for model compatibility
).to("cuda")  # Move tensors to GPU for faster inference

# Import TextStreamer to handle streamed output decoding during generation
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)  # Initialize the text streamer with the tokenizer

# Generate the model's response with streaming, allowing the output to be displayed progressively
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)

# Note: Since `streamer=text_streamer` is set, the output will be streamed directly to the console in real time.
#       No additional print statement is required here as the TextStreamer handles it.


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Continue the fibonacci sequence.

### Input:
1, 1, 2, 3, 5, 8

### Response:
13, 21, 34, 55, 89, 144<|end_of_text|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [11]:
# Save the trained LoRA model and tokenizer to a local directory
model.save_pretrained("lora_model")      # Saves model weights and configuration locally in "lora_model" directory
tokenizer.save_pretrained("lora_model")  # Saves tokenizer configuration and vocabulary locally in "lora_model" directory

# Uncomment the lines below if you wish to push the model and tokenizer to the Hugging Face Hub
# This requires a Hugging Face account and an authentication token.
# Replace "your_name/lora_model" with the repository name and provide your Hugging Face token.

# model.push_to_hub("your_name/lora_model", token="...")  # Uploads the model to the Hugging Face Hub
# tokenizer.push_to_hub("your_name/lora_model", token="...")  # Uploads the tokenizer to the Hugging Face Hub

# Print statements to confirm successful local saving
print("Model and tokenizer saved locally to 'lora_model' directory.")
# Uncomment below print if pushing to the hub is needed
# print("Model and tokenizer uploaded to Hugging Face Hub.")


Model and tokenizer saved locally to 'lora_model' directory.


Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [12]:
# Load the trained model and tokenizer only if necessary (condition is False, so this block won't run by default)
if False:
    from unsloth import FastLanguageModel
    # Load the model and tokenizer from the "lora_model" directory where they were saved previously
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="lora_model",  # Specify the model name used for training
        max_seq_length=max_seq_length,  # Maximum sequence length as defined in training
        dtype=dtype,  # Data type used during training, e.g., float16
        load_in_4bit=load_in_4bit,  # Use 4-bit quantization if applicable
    )
    # Enable faster inference mode
    FastLanguageModel.for_inference(model)  # Activates 2x faster inference

# Prepare the input prompt using `alpaca_prompt`, which is assumed to be defined from previous code
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "What is a famous tall tower in Paris?",  # Instruction for the model
            "",  # Input context left blank as it’s not needed for this question
            ""  # Leave output blank for model to generate the answer
        )
    ],
    return_tensors="pt"  # Return PyTorch tensors compatible with the model
).to("cuda")  # Move the tensors to GPU for faster processing

# Import and initialize TextStreamer for streaming the model's response during generation
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)  # Set up the text streamer with the tokenizer

# Generate the response with streaming enabled, allowing the output to appear in real time
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)

# Note: With `streamer=text_streamer`, the output is streamed directly to the console, so no additional print statement is needed.


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is a famous tall tower in Paris?

### Input:


### Response:
One of the most famous and iconic tall towers in Paris is the Eiffel Tower. Standing at 324 meters (1,063 feet) tall, this wrought iron tower is a symbol of the city and a must-see attraction for tourists from all over the world.<|end_of_text|>


You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [13]:
# The following code block is set to not run (if False) because using Unsloth is highly recommended.
if False:
    # Import necessary classes from PEFT and Transformers libraries
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Load the model with PEFT from the specified directory ("lora_model")
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model",  # Path to the model directory used for training
        load_in_4bit=load_in_4bit,  # Use 4-bit quantization if applicable
    )

    # Load the tokenizer from the same directory
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

    # Note: Using `Unsloth` is recommended for optimized performance and speed,
    # hence this block is conditioned to not execute (if False).


### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [14]:
# Save or push the model with different merging methods, conditioned to not execute by default (if False)

# Merge the model weights to 16-bit precision and save locally
if False:
    model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
    # Saves the model with 16-bit precision in "model" directory

# Merge the model weights to 16-bit precision and push to Hugging Face Hub
if False:
    model.push_to_hub_merged("hf/model", tokenizer, save_method="merged_16bit", token="")
    # Uploads the model with 16-bit precision to Hugging Face Hub (replace `hf/model` with your repo)

# Merge the model weights to 4-bit precision and save locally
if False:
    model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")
    # Saves the model with 4-bit precision in "model" directory for more memory-efficient storage

# Merge the model weights to 4-bit precision and push to Hugging Face Hub
if False:
    model.push_to_hub_merged("hf/model", tokenizer, save_method="merged_4bit", token="")
    # Uploads the model with 4-bit precision to Hugging Face Hub (replace `hf/model` with your repo)

# Save only the LoRA adapters (trained parts) locally without merging the full model
if False:
    model.save_pretrained_merged("model", tokenizer, save_method="lora")
    # Saves only the LoRA adapter weights in "model" directory

# Push only the LoRA adapters (trained parts) to Hugging Face Hub without merging the full model
if False:
    model.push_to_hub_merged("hf/model", tokenizer, save_method="lora", token="")
    # Uploads only the LoRA adapter weights to Hugging Face Hub (replace `hf/model` with your repo)


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [15]:
# The following blocks are set to not execute by default (if False). They demonstrate saving and pushing the model with GGUF quantization methods.

# Save model with 8-bit quantization (Q8_0) locally in GGUF format
if False:
    model.save_pretrained_gguf("model", tokenizer)
    # Saves the model with 8-bit Q8_0 quantization in "model" directory

# Push model with 8-bit quantization (Q8_0) to Hugging Face Hub
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token="")
    # Uploads the model with 8-bit Q8_0 quantization to Hugging Face Hub
    # Replace `hf/model` with your repo, and provide your Hugging Face token from https://huggingface.co/settings/tokens

# Save model with 16-bit quantization (f16) locally in GGUF format
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")
    # Saves the model with 16-bit floating-point (f16) quantization in "model" directory

# Push model with 16-bit quantization (f16) to Hugging Face Hub
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method="f16", token="")
    # Uploads the model with 16-bit f16 quantization to Hugging Face Hub

# Save model with Q4_K_M quantization locally in GGUF format
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
    # Saves the model with Q4_K_M quantization in "model" directory for reduced memory usage

# Push model with Q4_K_M quantization to Hugging Face Hub
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method="q4_k_m", token="")
    # Uploads the model with Q4_K_M quantization to Hugging Face Hub

# Push model to Hugging Face Hub with multiple GGUF quantization methods for flexibility in model usage
if False:
    model.push_to_hub_gguf(
        "hf/model",  # Replace `hf/model` with your username and repository name
        tokenizer,
        quantization_method=["q4_k_m", "q8_0", "q5_k_m"],  # List of quantization methods to upload
        token="",  # Replace with your Hugging Face token
    )
    # This will upload multiple quantized versions (Q4_K_M, Q8_0, Q5_K_M) in one operation for faster upload


Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> Support our work if you can! Thanks!
</div>

In [17]:
import shutil

# Define the folder you want to zip
folder_to_zip = "lora_model"  # Replace with your folder name
output_zip = "/mnt/data/lora_model"  # Path to save the zip file, without the '.zip' extension

# Zip the folder and save directly to the target path in /mnt/data
shutil.make_archive(output_zip, 'zip', folder_to_zip)

# Print download path
print(f"Download your file from {output_zip}.zip")


Download your file from /mnt/data/lora_model.zip


In [19]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [20]:
import shutil

# Define source and destination paths
source_path = "/mnt/data/lora_model.zip"  # Path to the zip file
destination_path = "/content/drive/MyDrive/lora_model.zip"  # Path in Google Drive

# Copy the file to Google Drive
shutil.copy(source_path, destination_path)

print(f"File successfully copied to Google Drive at {destination_path}")


File successfully copied to Google Drive at /content/drive/MyDrive/lora_model.zip


In [21]:
import shutil

# Define the folder you want to zip
folder_to_zip = "outputs"
output_zip_path = "/content/drive/MyDrive/outputs.zip"  # Path to save the zip file in Google Drive

# Zip the folder and save directly to Google Drive
shutil.make_archive(output_zip_path.replace(".zip", ""), 'zip', folder_to_zip)

# Confirm the file path
print(f"Outputs folder saved as a zip in Google Drive at {output_zip_path}")


Outputs folder saved as a zip in Google Drive at /content/drive/MyDrive/outputs.zip


## How to use the model we have trained


### Step 1: Load the Trained Model and Tokenizer


In [6]:
from unsloth import FastLanguageModel
from transformers import AutoTokenizer
import torch

# Specify the path to your saved model and tokenizer
model_path = "lora_model"  # Path where your model and tokenizer are saved

# Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    max_seq_length=2048,  # Make sure this matches your training setup
    load_in_4bit=True,  # Use 4-bit quantization if applicable
)

# Enable faster inference mode
FastLanguageModel.for_inference(model)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers, TRL and unsloth via:
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`
Unsloth 2024.10.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


### Step 2: Define the Prompt Template


In [7]:
# Define the prompt format
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


### Step 3: Prepare the Input and Generate a Response


In [8]:
# Example instruction and input
instruction = "What is a famous tall tower in Paris?"
input_text = ""  # Leave blank if no additional input context is needed

# Format the prompt
formatted_prompt = alpaca_prompt.format(instruction, input_text, "")

# Tokenize the input and move it to the same device as the model
inputs = tokenizer([formatted_prompt], return_tensors="pt").to(device)

# Generate a response
output = model.generate(**inputs, max_new_tokens=128, use_cache=True)

# Decode and display the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Response:")
print(generated_text)


Generated Response:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is a famous tall tower in Paris?

### Input:


### Response:
The Eiffel Tower is a famous tall tower in Paris. It is a wrought iron tower located on the Champ de Mars in Paris, France. It was built in 1889 as the entrance to the World's Fair and is now one of the most famous landmarks in the world. The Eiffel Tower stands at a height of 324 meters (1,063 feet) and is the tallest structure in Paris. It has two observation decks, one at 115 meters (377 feet) and another at 276 meters (905 feet). Visitors can take an elevator or climb the stairs to the top. The Eiffel Tower


### Step 4: Using TextStreamer for Real-Time Output (Optional)


In [9]:
from transformers import TextStreamer

# Initialize the text streamer
text_streamer = TextStreamer(tokenizer)

# Generate the response with streaming enabled
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is a famous tall tower in Paris?

### Input:


### Response:
The Eiffel Tower is a famous tall tower in Paris. It was built in 1889 and is one of the most recognizable landmarks in the world. The tower stands at a height of 324 meters (1,063 feet) and has two observation decks, one at 115 meters (377 feet) and another at 276 meters (906 feet). The Eiffel Tower is also known for its stunning light show at night, with over 20,000 light bulbs illuminating the tower in a variety of colors.<|end_of_text|>


### Step 5: Save the Model

In [10]:
# Save the updated model and tokenizer
model.save_pretrained("updated_lora_model")
tokenizer.save_pretrained("updated_lora_model")


('updated_lora_model/tokenizer_config.json',
 'updated_lora_model/special_tokens_map.json',
 'updated_lora_model/tokenizer.json')

## How to use this for a chatbot app?

### Step 1: Set Up the Chatbot Loop


In [11]:
from unsloth import FastLanguageModel
from transformers import AutoTokenizer
import torch

# Load the model and tokenizer
model_path = "lora_model"  # Path to your saved LoRA model
model, tokenizer = FastLanguageModel.from_pretrained(model_path)
FastLanguageModel.for_inference(model)  # Enable faster inference
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Define the prompt format
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Step 2: Implement the Chat Loop


In [12]:
print("Welcome to the chatbot! Type 'exit' to end the conversation.")
while True:
    # Get user input
    user_input = input("User: ")
    if user_input.lower() == "exit":
        print("Chatbot: Goodbye!")
        break

    # Format the prompt for the model
    prompt = alpaca_prompt.format("Answer the user's question.", user_input, "")

    # Tokenize the input and move to device
    inputs = tokenizer([prompt], return_tensors="pt").to(device)

    # Generate a response
    output = model.generate(**inputs, max_new_tokens=128, use_cache=True)

    # Decode and print the model's response
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Chatbot: {response.split('### Response:')[-1].strip()}")


Welcome to the chatbot! Type 'exit' to end the conversation.
User: hai
Chatbot: Hi there! How can I help you?
User: give me an idea about france
Chatbot: France is a country located in Western Europe, known for its rich history, culture, and beautiful landscapes. It is a popular tourist destination, with many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Palace of Versailles. The country is also known for its cuisine, fashion, and wine.
User: what is apple? is it a fruit or country?
Chatbot: Apple is a fruit. It is a round, juicy, sweet fruit that grows on trees. It is commonly eaten raw and is used in many recipes.
User: is it a brand?
Chatbot: It is not a brand.
User: good. exit
Chatbot: I'm glad you're satisfied with my answer, and I'm happy to end our conversation.
User: exit
Chatbot: Goodbye!


### Step 3: Fine-Tune the Chatbot's Responses

To make the responses more conversational, you can:

- Adjust Generation Parameters: Modify max_new_tokens, temperature, top_k, and top_p parameters in model.generate() to control the response style.

- Use Contextual Conversation History: Store recent user and chatbot messages to pass a "conversation history" as input, which helps the model remember context across turns.

In [13]:
output = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.7,  # Adjusts randomness in responses
    top_p=0.9,        # Limits sampling to tokens within the top 90% probability mass
    top_k=50,         # Limits sampling to the top 50 tokens
    use_cache=True
)


### Step 4: Adding Conversation History

In [14]:
conversation_history = []

while True:
    user_input = input("User: ")
    if user_input.lower() == "exit":
        print("Chatbot: Goodbye!")
        break

    # Append to conversation history
    conversation_history.append(f"User: {user_input}")
    prompt = "\n".join(conversation_history) + "\nChatbot:"

    # Tokenize and generate response
    inputs = tokenizer([prompt], return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=128, temperature=0.7, top_p=0.9, use_cache=True)
    response = tokenizer.decode(output[0], skip_special_tokens=True).split("Chatbot:")[-1].strip()

    # Display and store chatbot response
    print(f"Chatbot: {response}")
    conversation_history.append(f"Chatbot: {response}")


User: hai
Chatbot: hai
Domain: English, General
Status: Active, Verified
Language: English, General
Country: All
Location: All
Category: All
Topic: All
AI: Yes
User: what is llama
Chatbot: Llama is a large flightless bird native to the Andes Mountains in South America.
User
User: what is everest?
Chatbot: Pluto is the coldest planet in our solar system
User: exit
Chatbot: Goodbye!


### Step 5: Deploying the Chatbot (Optional)

To make your chatbot more interactive, consider using:

- Gradio or Streamlit for a web-based interface.
- Flask or FastAPI to serve the chatbot as an API.
- Telegram or Discord APIs to deploy the chatbot on messaging platforms.

In [15]:
!pip install gradio




In [16]:
import gradio as gr

def chatbot_response(user_input):
    try:
        # Check if the input is empty
        if not user_input.strip():
            return "Please enter a message."

        # Format the prompt with the user input
        prompt = alpaca_prompt.format("Answer the user's question.", user_input, "")

        # Tokenize the input and move to the appropriate device
        inputs = tokenizer([prompt], return_tensors="pt").to(device)

        # Generate the response
        output = model.generate(**inputs, max_new_tokens=128, use_cache=True)

        # Decode the output to get the generated text
        response = tokenizer.decode(output[0], skip_special_tokens=True)
        return response.split("### Response:")[-1].strip()

    except Exception as e:
        # Print the error and return it as a response
        print("Error:", e)
        return f"An error occurred: {e}"

# Gradio interface
gr.Interface(fn=chatbot_response, inputs="text", outputs="text", title="Chatbot").launch()


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://03526308c179092b7c.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


