### News

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [None]:
# from unsloth import FastLanguageModel
# import torch
# max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
# dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
# load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# # 4bit pre quantized models we support for 4x faster downloading + no OOMs.
# fourbit_models = [
#     "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
#     "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
#     "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
#     "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
#     "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
#     "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
#     "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
#     "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
#     "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
#     "unsloth/Phi-3-medium-4k-instruct",
#     "unsloth/gemma-2-9b-bnb-4bit",
#     "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
# ] # More models at https://huggingface.co/unsloth

# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name = "unsloth/Meta-Llama-3.1-8B",
#     max_seq_length = max_seq_length,
#     dtype = dtype,
#     load_in_4bit = load_in_4bit,
#     # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
# )

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.18: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
# model = FastLanguageModel.get_peft_model(
#     model,
#     r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
#     target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
#                       "gate_proj", "up_proj", "down_proj",],
#     lora_alpha = 16,
#     lora_dropout = 0, # Supports any, but = 0 is optimized
#     bias = "none",    # Supports any, but = "none" is optimized
#     # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
#     use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
#     random_state = 3407,
#     use_rslora = False,  # We support rank stabilized LoRA
#     loftq_config = None, # And LoftQ
# )

NameError: name 'model' is not defined

<a name="Data"></a>
### Data Prep
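The stock Unsloth notebook uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). The cell below replaces that step with custom data prep: it loads a local JSON file of incorrect math statements and formats each entry as an instruction-response pair. You can swap in your own data prep the same way.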

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!
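As a minimal sketch of the EOS note above, the helper below appends an end-of-sequence marker to each training example. `append_eos` is a hypothetical helper, and the literal `<|end_of_text|>` string is only a placeholder; in the notebook you would pass `tokenizer.eos_token` once the tokenizer has been loaded.

```python
def append_eos(examples, eos_token):
    """Return copies of the examples with eos_token appended to each "text" field."""
    result = []
    for ex in examples:
        text = ex["text"]
        # Avoid doubling the marker if it is already present
        if not text.endswith(eos_token):
            text = text + eos_token
        result.append({**ex, "text": text})
    return result

# Placeholder EOS string for illustration only; use tokenizer.eos_token in practice.
EOS_TOKEN = "<|end_of_text|>"

examples = [{"text": "Instruction: Fix this incorrect math meme: 7 + 3 × 2 = 20\n Response: Nope, it’s 13."}]
with_eos = append_eos(examples, EOS_TOKEN)
print(with_eos[0]["text"].endswith(EOS_TOKEN))
```

The helper returns new dictionaries rather than mutating the input, so it can be applied again safely if the cell is re-run.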

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
import json
from datasets import Dataset

# Load your JSON (adjust path as needed)
with open("similar_wrong_pemdas_dataset.json", "r") as f:
    data = json.load(f)

# Format for Unsloth: simple instruction-output pairs
formatted_data = []
for item in data:
    prompt = f"Fix this incorrect math meme: {item['Incorrect Math Statement']}"
    response = f"Nope, it’s {item['Correct Answer']}. {item['Explanation'].replace('.', '')} "
    formatted_data.append({"text": f"Instruction: {prompt}\n Response: {response}"})

# Convert to Hugging Face Dataset
dataset = Dataset.from_list(formatted_data)
print(dataset[0])

{'text': 'Instruction: Fix this incorrect math meme: 7 + 3 × 2 = 20\n Response: Nope, it’s 13. Multiplication first: 3 × 2 = 6, then 7 + 6 = 13 '}
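The model only ever sees this exact `Instruction: ...\n Response: ...` layout during training, so inference prompts should reuse it character for character. One hedged way to keep the two in sync is a single shared template. The names `PROMPT_TEMPLATE`, `build_prompt`, `build_training_text`, and `extract_response` are hypothetical helpers for illustration, not part of Unsloth or TRL.

```python
# Single source of truth for the prompt layout used in training and inference
PROMPT_TEMPLATE = "Instruction: Fix this incorrect math meme: {meme}\n Response: "
RESPONSE_MARKER = "Response: "

def build_prompt(meme):
    """Prompt given to the model at inference time."""
    return PROMPT_TEMPLATE.format(meme=meme)

def build_training_text(meme, response):
    """Full training example: the inference prompt followed by the target response."""
    return build_prompt(meme) + response

def extract_response(generated_text):
    """Return the text after the first response marker, or the whole text if it is missing."""
    _, found, tail = generated_text.partition(RESPONSE_MARKER)
    return tail.strip() if found else generated_text.strip()

text = build_training_text("7 + 3 × 2 = 20", "Nope, it’s 13.")
print(extract_response(text))  # Nope, it’s 13.
```

Building training text from `build_prompt` guarantees that every training example starts with exactly the string the model will later be prompted with.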


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The cell below trains for 5 epochs via `num_train_epochs=5`, which comes to 25 optimizer steps on this 20-example dataset. For a quicker smoke test you can cap the run with `max_steps` instead (the stock Unsloth notebook uses 60 steps). We also support TRL's `DPOTrainer`!
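To see how epochs translate into optimizer steps, here is a small sketch using this notebook's settings (20 examples, per-device batch size 1, gradient accumulation 4, 5 epochs, 1 GPU). `total_optimizer_steps` is a hypothetical helper; it assumes a partial final batch is rounded up, which does not matter here because 20 divides evenly by 4, but the trainer's exact rounding may differ between library versions.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                          num_epochs, num_gpus=1):
    """Return (effective batch size, total optimizer steps) for a training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    # Assumption: a partial final batch still counts as one step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs

effective_batch, steps = total_optimizer_steps(
    num_examples=20, per_device_batch_size=1, grad_accum_steps=4, num_epochs=5
)
print(effective_batch, steps)  # 4 25
```

With `logging_steps=5`, a 25-step run produces five logged loss values.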

In [None]:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

# Clear GPU memory
torch.cuda.empty_cache()

# Load Llama-3.1-8B-Instruct with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",  # Instruction-tuned Llama 3.1, 8B parameters
    max_seq_length=256,  # Memory-efficient length
    load_in_4bit=True,
    device_map={"": torch.cuda.current_device()}
)

# Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=16,
    lora_dropout=0.1,  # Unsloth's fast patching is optimized for dropout = 0; other values incur a performance hit
    bias="none"
)

# Training arguments with 5 epochs
training_args = TrainingArguments(
    output_dir="math_meme_fixer_llama2",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    learning_rate=2e-5,
    fp16=True,
    logging_steps=5,
    save_steps=5,
    gradient_checkpointing=True,
    report_to="none",

)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=256,
    args=training_args
)

# Train
trainer.train()

# Save
model.save_pretrained("math_meme_fixer_llama2")
tokenizer.save_pretrained("math_meme_fixer_llama2")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.18: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.3.18 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


Converting train dataset to ChatML (num_proc=2):   0%|          | 0/20 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/20 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/20 [00:00<?, ? examples/s]

Truncating train dataset (num_proc=2):   0%|          | 0/20 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 20 | Num Epochs = 5 | Total steps = 25
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 3,407,872/8,000,000,000 (0.04% trained)


Step,Training Loss
5,2.504
10,2.4678
15,2.4329
20,2.4059
25,2.3902


('math_meme_fixer_llama2/tokenizer_config.json',
 'math_meme_fixer_llama2/special_tokens_map.json',
 'math_meme_fixer_llama2/tokenizer.json')
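The trainable-parameter count in the training banner (3,407,872) can be reproduced by hand. Each LoRA adapter on a linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` parameters. The sketch below assumes the standard Llama 3.1 8B shapes (hidden size 4096, 32 decoder layers, 8 key/value heads of dimension 128); `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, layer_shapes, num_layers):
    """Count LoRA parameters: each adapted (d_in, d_out) layer adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers

HIDDEN = 4096        # Assumed Llama 3.1 8B hidden size
KV_DIM = 8 * 128     # Assumed 8 key/value heads x head dimension 128 (grouped-query attention)
NUM_LAYERS = 32      # Assumed number of decoder layers

# target_modules=["q_proj", "v_proj"] with r=8, as in the training cell
shapes = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV_DIM),  # v_proj
]
total = lora_param_count(r=8, layer_shapes=shapes, num_layers=NUM_LAYERS)
print(f"{total:,}")  # 3,407,872
```

Against roughly 8 billion base parameters that is about 0.04%, matching the banner.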

In [None]:
import torch

# Check memory usage before freeing
print(f"Memory allocated before delete: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Memory reserved before delete: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

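# Deleting the model and trainer is what actually releases their memory.
# The deletions are left commented out because the Gradio cell below still uses `model`,
# so the "after delete" numbers only reflect the cache being cleared.
# del model
# del trainer  # Omit this line if you don’t use a trainer object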

# Clear the GPU cache
torch.cuda.empty_cache()

# Confirm memory is freed
print(f"Memory allocated after delete: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Memory reserved after delete: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

Memory allocated before delete: 6.10 GB
Memory reserved before delete: 6.49 GB
Memory allocated after delete: 6.10 GB
Memory reserved after delete: 6.14 GB
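The allocated figure is unchanged above because the model is still referenced. If you do want to release it, a minimal sketch is to drop every reference, run the garbage collector, and only then clear the CUDA cache. `free_gpu_memory` is a hypothetical helper; the `torch` import is guarded so the snippet also runs on a machine without PyTorch or a GPU.

```python
import gc

def free_gpu_memory(namespace, names=("model", "trainer")):
    """Delete the named objects from a namespace, then release cached GPU memory."""
    removed = []
    for name in names:
        if name in namespace:
            del namespace[name]  # Drop the reference so the tensors can be collected
            removed.append(name)
    gc.collect()  # Collect objects that are no longer reachable
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # Return cached blocks to the driver
    except ImportError:
        pass  # PyTorch not installed; nothing to clear
    return removed

# Demonstration with a plain dict standing in for the notebook's globals()
fake_namespace = {"model": object(), "dataset": [1, 2, 3]}
print(free_gpu_memory(fake_namespace))  # ['model']
```

In the notebook you would call `free_gpu_memory(globals())`. The trainer holds its own reference to the model, so both names need to go before the memory can be reclaimed.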


In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.984 GB of memory reserved.


In [None]:
# from unsloth import FastLanguageModel
# import torch

# # Clear GPU memory
# torch.cuda.empty_cache()

# # Load the fine-tuned model
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name="math_meme_fixer_llama2",
#     max_seq_length=256,
#     dtype=torch.float16,
#     load_in_4bit=True,
#     device_map={"": torch.cuda.current_device()}
# )
# model = model.to("cuda")
# FastLanguageModel.for_inference(model)

# # Test function
# def fix_meme(meme):
#     prompt = f"### Instruction: Fix this incorrect math meme: {meme}\n### Response: "
#     inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
#     outputs = model.generate(**inputs, max_new_tokens=30, temperature=0.5)
#     full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
#     print(f"Raw output: {full_response}")
#     try:
#         result = full_response.split("### Response: ")[-1]
#     except IndexError:
#         result = full_response
#     return result

# # Test examples
# test_memes = [
#     "4 + 15 * 5 = 35? "
# ]

# # Run tests
# for meme in test_memes:
#     result = fix_meme(meme)


==((====))==  Unsloth 2025.3.18: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Raw output: ### Instruction: Fix this incorrect math meme: 4 + 15 * 5 = 35? 
### Response: 4 + 15 * 5 = 79. In order to solve this, you need to follow the order of operations (PEMDAS):
Input: 4 + 15 * 5 = 35? 
Output: 4 + 15 * 5 = 79. In order to solve this, you need to follow the order of operations (PEMDAS):



In [None]:
!pip install huggingface_hub



In [None]:
!pip install pyngrok
!pip install gradio

Collecting pyngrok
  Downloading pyngrok-7.2.3-py3-none-any.whl.metadata (8.7 kB)
Downloading pyngrok-7.2.3-py3-none-any.whl (23 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.3
Collecting gradio
  Downloading gradio-5.23.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.8.0 (from gradio)
  Downloading gradio_client-1.8.0-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.

In [None]:
# import torch
# import gradio as gr
# from unsloth import FastLanguageModel

# # Load your fine-tuned model and tokenizer
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name="math_meme_fixer_llama2",
#     max_seq_length=256,
#     dtype=torch.float16,
#     load_in_4bit=True,
#     device_map={"": torch.cuda.current_device()}
# )
# model = model.to("cuda")
# FastLanguageModel.for_inference(model)

# def test_model(model, tokenizer, test_cases):
#     model.eval()
#     results = []

#     for test in test_cases:
#         input_text = f"### Instruction: Fix this incorrect math meme: {test}\n### Response: "
#         inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
#         max_length = len(inputs["input_ids"][0]) + 100

#         with torch.no_grad():
#             outputs = model.generate(
#                 **inputs,
#                 max_length=max_length,
#                 eos_token_id=tokenizer.eos_token_id,
#                 pad_token_id=tokenizer.pad_token_id,
#                 temperature=0.5
#             )

#         corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
#         if "### Response: " in corrected_text:
#             corrected_text = corrected_text.split("### Response: ")[-1].strip()

#         results.append((test, corrected_text))

#     return results

# def solve_math_problems(input_cases):
#     test_cases = [case.strip() for case in input_cases.split("\n") if case.strip()]  # Clean input
#     results = test_model(model, tokenizer, test_cases)

#     output = ""
#     for incorrect, corrected in results:
#         output += f"Incorrect: {incorrect}\nCorrected: {corrected}\n\n"

#     return output if output else "Please enter a math meme to fix!"

# # Custom CSS for orange theme
# custom_css = """
# body {
#     font-family: 'Arial', sans-serif;
#     background-color: #fff3e6; /* Light orange background */
# }
# .gradio-container {
#     max-width: 800px;
#     margin: 0 auto;
#     padding: 20px;
# }
# h1 {
#     color: #ff6200; /* Orange title */
#     text-align: center;
#     font-size: 2.5em;
#     margin-bottom: 20px;
# }
# textarea, .output-text {
#     border: 2px solid #ff6200 !important;
#     border-radius: 10px;
#     padding: 10px;
#     font-size: 1.1em;
# }
# button {
#     background-color: #ff6200 !important; /* Orange button */
#     color: white !important;
#     border: none !important;
#     border-radius: 10px;
#     padding: 12px 20px;
#     font-size: 1.2em;
#     transition: background-color 0.3s;
# }
# button:hover {
#     background-color: #e65c00 !important; /* Darker orange on hover */
# }
# .description {
#     color: #333;
#     text-align: center;
#     font-size: 1.2em;
#     margin-bottom: 20px;
# }
# .output-text {
#     background-color: #fff;
#     border: 2px solid #ff6200;
#     border-radius: 10px;
#     padding: 15px;
#     white-space: pre-wrap;
# }
# """

# # Create Gradio interface with button

KeyboardInterrupt: 

In [None]:
import torch
import gradio as gr
from unsloth import FastLanguageModel

# Assume 'model' and 'tokenizer' are already loaded and fine-tuned earlier in the notebook
# e.g., from your training code:
# model, tokenizer = FastLanguageModel.from_pretrained(...)
# model = model.to("cuda")
# FastLanguageModel.for_inference(model)
# [training code...]
# model.save_pretrained("math_meme_fixer_llama2")
# tokenizer.save_pretrained("math_meme_fixer_llama2")

# No need to reload if already in memory; just ensure it's set for inference
# If you haven't done this yet, uncomment the line below
# FastLanguageModel.for_inference(model)

def test_model(model, tokenizer, test_cases):
    model.eval()
    results = []

    for test in test_cases:
        # Match the training prompt format exactly ("Instruction: ...\n Response: ...")
        input_text = f"Instruction: Fix this incorrect math meme: {test}\n Response: "
        inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=400,  # Budget for the response only, excluding the prompt
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
                do_sample=True,  # temperature is ignored unless sampling is enabled
                temperature=0.5
            )

        # Decode and keep only the text after the response marker
        corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        if "Response: " in corrected_text:
            corrected_text = corrected_text.split("Response: ")[-1].strip()

        results.append((test, corrected_text))

    return results

def solve_math_problems(input_cases):
    test_cases = [case.strip() for case in input_cases.split("\n") if case.strip()]  # Clean empty lines
    results = test_model(model, tokenizer, test_cases)

    output = ""
    for incorrect, corrected in results:
        output += f"Incorrect: {incorrect}\nCorrected: {corrected}\n\n"

    return output

# Custom CSS for orange theme (defined first so it can be passed to gr.Interface)
custom_css = """
body {
    font-family: 'Arial', sans-serif;
    background-color: #fff3e6; /* Light orange background */
}
.gradio-container {
    max-width: 800px;
    margin: 0 auto;
    padding: 20px;
}
h1 {
    color: #ff6200; /* Orange title */
    text-align: center;
    font-size: 2.5em;
    margin-bottom: 20px;
}
textarea, .output-text {
    border: 2px solid #ff6200 !important;
    border-radius: 10px;
    padding: 10px;
    font-size: 1.1em;
}
button {
    background-color: #ff6200 !important; /* Orange button */
    color: white !important;
    border: none !important;
    border-radius: 10px;
    padding: 12px 20px;
    font-size: 1.2em;
    transition: background-color 0.3s;
}
button:hover {
    background-color: #e65c00 !important; /* Darker orange on hover */
}
.description {
    color: #333;
    text-align: center;
    font-size: 1.2em;
    margin-bottom: 20px;
}
.output-text {
    background-color: #fff;
    border: 2px solid #ff6200;
    border-radius: 10px;
    padding: 15px;
    white-space: pre-wrap;
}
"""

# Create Gradio interface with a Submit button.
# live=True is omitted so the model runs only on Submit, not on every keystroke.
iface = gr.Interface(
    fn=solve_math_problems,
    inputs=gr.Textbox(
        lines=10,
        placeholder="Enter math memes here, each on a new line (e.g., '4 + 15 * 5 = 35?')",
    ),
    outputs="text",
    title="Math Meme Fixer",
    description="Enter incorrect math memes, and get the corrected versions from the fine-tuned LLaMA model.",
    css=custom_css,  # Pass the CSS at construction so it is applied before launch()
)

# Launch the interface
iface.launch(share=True)  # Public URL in Colab

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://389eb6560875ecc1ad.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
