To run this, press "Runtime" and press "Run all" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join our Discord if you need help!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

**[NOTE]** TinyLlama was trained with a maximum of 2048 tokens. With Unsloth, we can set any sequence length we want via `max_seq_length`, here `max_seq_length = 4096`. We apply RoPE Scaling internally to extend the maximum context size!

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install triton

!pip uninstall -y xformers
!pip install --no-deps "xformers>=0.0.27"

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Gemma 6 trillion tokens **2.5x faster**! See our [Gemma notebook](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit", # New Google 6 trillion tokens model 2.5x faster!
    "unsloth/gemma-2b-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/tinyllama-bnb-4bit", # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: unsloth/tinyllama-bnb-4bit can only handle sequence lengths of at most 2048.
But with kaiokendev's RoPE scaling of 2.0, it can be magically be extended to 4096!


model.safetensors:   0%|          | 0.00/762M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/948 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

**[NOTE]** TinyLlama's internal maximum sequence length is 2048. We use RoPE Scaling to extend it to 4096 with Unsloth!
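
As a rough illustration of what a scaling factor of 2.0 means, the sketch below applies linear position interpolation, which is our reading of kaiokendev's method: position indices are divided by the factor before the rotary angles are computed, so the last position of the 4096-token window lands inside the range seen during pretraining. The helper `rope_angles`, the head dimension of 64 and the base of 10000 are assumptions for illustration, not Unsloth's internal code.

```python
NATIVE_CONTEXT = 2048   # TinyLlama's pretraining context length
TARGET_CONTEXT = 4096   # max_seq_length requested in this notebook
SCALE = TARGET_CONTEXT / NATIVE_CONTEXT  # 2.0, as reported in the load log

def rope_angles(position, head_dim=64, base=10000.0, scale=1.0):
    """Rotary angles for one position; scale > 1 compresses positions (linear interpolation)."""
    scaled_position = position / scale
    return [scaled_position / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

# With scaling, the last position of the extended window maps to an
# effective position below the native limit.
effective_last = (TARGET_CONTEXT - 1) / SCALE
print(SCALE, effective_last)
```

Position 4094 with `scale=2.0` produces the same angles as position 2047 without scaling, which is why the model can address the longer window without seeing unfamiliar rotation angles.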

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

**[NOTE]** We set `use_gradient_checkpointing=False` ONLY for TinyLlama, since Unsloth already saves tonnes of memory. This does NOT work for `llama-2-7b` or `mistral-7b`, since their memory usage would still exceed the Tesla T4's 15GB. Gradient checkpointing (GC) recomputes the forward pass during the backward pass, which saves loads of memory at the cost of extra compute.

**[IF YOU GET OUT OF MEMORY]** set `use_gradient_checkpointing` to `True`.

In [None]:
# model = FastLanguageModel.get_peft_model(
#     model,
#     r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
#     target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
#                       "gate_proj", "up_proj", "down_proj",],
#     lora_alpha = 32,
#     lora_dropout = 0, # Currently only supports dropout = 0
#     bias = "none",    # Currently only supports bias = "none"
#     use_gradient_checkpointing = False, # @@@ IF YOU GET OUT OF MEMORY - set to True @@@
#     random_state = 3407,
#     use_rslora = False,  # We support rank stabilized LoRA
#     loftq_config = None, # And LoftQ
# )



# NOT ORIGINAL (modified from the commented-out call above)


model = FastLanguageModel.get_peft_model(
    model,
    r=8,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj",
                     "embed_tokens", "lm_head"],  # Added embed_tokens and lm_head
    lora_alpha=32,
    lora_dropout=0,  # Currently only supports dropout = 0
    bias="none",     # Currently only supports bias = "none"
    use_gradient_checkpointing=False,  # @@@ IF YOU GET OUT OF MEMORY - set to True @@@
    random_state=3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None, # And LoftQ
)


# NOT ORIGINAL (end of modified section)

Unsloth 2024.8 patched 22 layers with 22 QKV layers, 22 O layers and 22 MLP layers.


Unsloth: Casting embed_tokens to float32
Unsloth: Casting lm_head to float32
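
To see where the trainable parameters come from, here is a small back-of-the-envelope count. It is a sketch under two assumptions that are not stated in this notebook: TinyLlama uses 4 key/value heads (grouped-query attention), and `embed_tokens` and `lm_head` are trained as full matrices rather than through low-rank factors, which would be consistent with the float32 casting messages above. A LoRA adapter of rank `r` on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters.

```python
# TinyLlama shapes. num_kv_heads = 4 is an assumption (grouped-query attention).
hidden, intermediate, layers, vocab = 2048, 5632, 22, 32000
num_heads, num_kv_heads = 32, 4
kv_dim = hidden // num_heads * num_kv_heads  # 256
r = 8  # LoRA rank used in get_peft_model

# (in_features, out_features) of each LoRA-wrapped linear layer in one block
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapter holds an (r x in) matrix and an (out x r) matrix
lora_per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
lora_total = lora_per_layer * layers

# Assumption: embed_tokens and lm_head are trained in full
full_matrices = 2 * vocab * hidden

trainable = lora_total + full_matrices
print(f"{trainable:,}")  # 137,379,840
```

Under these assumptions the total matches the trainable parameter count that the trainer reports later in this notebook, and it shows that almost all of it comes from the two vocabulary-sized matrices rather than from the rank-8 adapters.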


<a name="Data"></a>
### Data Prep
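The original notebook uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the 52K-example original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). This modified version instead loads `openai_humaneval` (164 examples, `test` split only) and formats each example with the Alpaca prompt template. You can replace this code section with your own data prep.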

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!
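
A minimal sketch of what appending the EOS token looks like, using the same Alpaca template as this notebook. The string `"</s>"` is an assumption (it is the usual EOS string for Llama-family tokenizers); in the notebook, use `tokenizer.eos_token` instead. The helper `build_example` and the sample values are hypothetical.

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # assumption: replace with tokenizer.eos_token in the notebook

def build_example(instruction, context, response):
    """Fill the template and append EOS so the model learns where a response ends."""
    return alpaca_prompt.format(instruction, context, response) + EOS_TOKEN

example = build_example("Add two numbers.", "2 and 3", "5")
print(example)
```

Without the trailing EOS token, the training text never shows the model a stopping point after the response, which is why generation tends to run on until `max_new_tokens` is reached.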

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/tinyllama-bnb-4bit")

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Define the End-of-Sequence token
EOS_TOKEN = tokenizer.eos_token

# Function to format the prompts
def formatting_prompts_func(examples):
    # HumanEval has no separate context field, so 'prompt' is the instruction,
    # the input is left empty, and 'canonical_solution' is the target output
    instructions = examples["prompt"]
    inputs       = [""] * len(instructions)      # Empty input: passing the solution here would leak the answer
    outputs      = examples["canonical_solution"]  # Using the canonical solution as output here

    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Fill the Alpaca template and append EOS so the model learns where to stop
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

# Load the openai_humaneval dataset
dataset = load_dataset("openai_humaneval")

# Apply the formatting function to the dataset
formatted_dataset = dataset.map(formatting_prompts_func, batched=True)

# Define the preprocessing function
def preprocess_function(examples):
    # Tokenize a batch of formatted texts (this function is mapped with batched=True)
    input_ids = tokenizer(examples['text'], padding="max_length", truncation=True, max_length=max_seq_length, return_tensors="pt")['input_ids']
    labels = input_ids.clone()  # Labels start as a copy of input_ids

    # Mask padding positions with -100 so they are ignored by the loss
    labels[labels == tokenizer.pad_token_id] = -100

    return {
        "input_ids": input_ids,
        "labels": labels,
    }

# Apply preprocessing to the formatted dataset
processed_dataset = formatted_dataset.map(preprocess_function, batched=True)


Map:   0%|          | 0/164 [00:00<?, ? examples/s]

Map:   0%|          | 0/164 [00:00<?, ? examples/s]
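
The preprocessing cell masks padding by comparing each label with `pad_token_id`. One thing to watch for: if a tokenizer reuses the EOS id as its pad id, that comparison also masks the real EOS token at the end of each example. The sketch below shows an alternative that masks by attention mask instead. It uses plain Python lists and made-up token ids, and the helper `mask_padding` is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Toy example where the pad id equals the EOS id (2): three real tokens, EOS, two pads
input_ids      = [1, 15043, 3186, 2, 2, 2]
attention_mask = [1, 1,     1,    1, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)  # [1, 15043, 3186, 2, -100, -100]
```

The real EOS at index 3 stays in the labels, so the model is still trained to emit it, while the two padded positions are ignored.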

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We train for 1 full epoch. On the original Alpaca dataset that takes 80ish minutes; on the 164 HumanEval examples used here it is only 20 optimizer steps. We also support TRL's `DPOTrainer`! See our DPO tutorial on a free Google Colab instance [here](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing).

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported


# trainer = SFTTrainer(
#     model = model,
#     tokenizer = tokenizer,
#     train_dataset = dataset,
#     dataset_text_field = "text",
#     max_seq_length = max_seq_length,
#     dataset_num_proc = 2,
#     packing = True, # Packs short sequences together to save time!
#     args = TrainingArguments(
#         per_device_train_batch_size = 2,
#         gradient_accumulation_steps = 4,
#         warmup_ratio = 0.1,
#         num_train_epochs = 1,
#         learning_rate = 2e-5,
#         fp16 = not is_bfloat16_supported(),
#         bf16 = is_bfloat16_supported(),
#         logging_steps = 1,
#         optim = "adamw_8bit",
#         weight_decay = 0.1,
#         lr_scheduler_type = "linear",
#         seed = 3407,
#         output_dir = "outputs",
#     ),
# )




# NOT ORIGINAL (modified from the commented-out call above)



trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=processed_dataset['test'],  # Since only the 'test' split exists
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,  # Packs short sequences together to save time!
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.1,
        num_train_epochs=1,
        learning_rate=2e-5,
        fp16=not is_bfloat16_supported(),  # True on a Tesla T4, which lacks bfloat16
        bf16=is_bfloat16_supported(),      # True on Ampere or newer GPUs
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.1,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)


# NOT ORIGINAL (end of modified section)

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
0.879 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

Counting untrained tokens:   0%|          | 0/164 [00:00<?, ? examples/s]

Unsloth: Setting embed_tokens & lm_head untrained tokens to mean(trained) to counteract NaNs during training.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 164 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 20
 "-____-"     Number of trainable parameters = 137,379,840


Step,Training Loss
1,8.1958
2,8.2878
3,8.1468
4,8.405
5,8.0298
6,7.9955
7,7.8323
8,7.7723
9,7.0745
10,7.1298
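
The numbers in the training banner can be reproduced with a little arithmetic. The sketch below is our understanding of how the Hugging Face `Trainer` derives the step count and the warmup length, plus a simplified version of the `linear` learning rate schedule. The function `linear_lr` is a hypothetical helper for illustration, not the library's code.

```python
import math

num_examples = 164
per_device_batch, grad_accum, num_gpus = 2, 4, 1
warmup_ratio, peak_lr = 0.1, 2e-5

total_batch = per_device_batch * grad_accum * num_gpus     # 8
micro_batches = math.ceil(num_examples / per_device_batch)  # 82 dataloader batches
total_steps = micro_batches // grad_accum                   # 20 optimizer steps
warmup_steps = math.ceil(total_steps * warmup_ratio)        # 2

def linear_lr(step):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(total_batch, total_steps, warmup_steps)
print([linear_lr(s) for s in (0, 1, 2, 11, 20)])
```

With only 20 steps and a peak learning rate of 2e-5, the schedule spends 2 steps warming up and decays for the remaining 18, so the loss values in the table above come from a very short run.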


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

NameError: name 'start_gpu_memory' is not defined
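
The `NameError` above means `start_gpu_memory` did not exist in the session when this cell ran; it is defined in the "Show current memory stats" cell. One way to make the report tolerant of that is to compute it in a small function that accepts a missing baseline. This is a sketch: `memory_report` is a hypothetical helper, and the byte counts in the usage lines are made up.

```python
GIB = 1024 ** 3

def memory_report(peak_bytes, total_bytes, baseline_gb=None):
    """Summarize peak reserved memory; training-only figures need a baseline."""
    peak_gb = round(peak_bytes / GIB, 3)
    total_gb = round(total_bytes / GIB, 3)
    report = {
        "peak_gb": peak_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
    }
    if baseline_gb is not None:
        training_gb = round(peak_gb - baseline_gb, 3)
        report["training_gb"] = training_gb
        report["training_pct"] = round(training_gb / total_gb * 100, 3)
    return report

# Made-up numbers: 4 GiB peak on a 16 GiB card, with and without a baseline
print(memory_report(4 * GIB, 16 * GIB))
print(memory_report(4 * GIB, 16 * GIB, baseline_gb=1.0))
```

In the notebook, the arguments would be `torch.cuda.max_memory_reserved()`, `gpu_stats.total_memory`, and `globals().get("start_gpu_memory")`, so the cell still prints the peak figures when the baseline cell was skipped.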

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n1,2,3,5,8\n\n## Instruction:\nContinue the fibonacci sequence\n## Input: 1,2,3,5,8\n## Response:1,2,3,5\n## Instruction\n##:Continue the fibonacci\n## Input']

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Continue the fibonnaci sequence.

### Input:
1, 1, 2, 3, 5, 8

### Response:
1,2,3,5,8

## Instruction:
Continue the fibonacci sequence
## Input: 1,2,3,5,8
## Response:1,2,3,5
## Instruction
##:Continue the fibonacci
## Input 1,2,3,5,
## Response:1,2,3
## Instruction:Continue the fibonacci
Input 1,2,3,5
Response 1,2,
Instruction:continue the fibon
Input 1,2,3,



<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("humaneval_SFTTrainer_model") # Local saving
tokenizer.save_pretrained("humaneval_SFTTrainer_model")
model.push_to_hub("finegptproject/humaneval_SFTTrainer_model", token = "my_token") # Online saving
tokenizer.push_to_hub("finegptproject/humaneval_SFTTrainer_model", token = "my_token") # Online saving

README.md:   0%|          | 0.00/579 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/550M [00:00<?, ?B/s]

Saved model to https://huggingface.co/finegptproject/humaneval_SFTTrainer_model


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/finegptproject/humaneval_SFTTrainer_model/commit/e7450b18d790320adc9a94332856f5ecd9b82245', commit_message='Upload tokenizer', commit_description='', oid='e7450b18d790320adc9a94332856f5ecd9b82245', pr_url=None, pr_revision=None, pr_num=None)
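
The size of `adapter_model.safetensors` in the upload log is a useful sanity check on what was saved. As a rough estimate, assuming every trainable tensor is stored as float32 (4 bytes per parameter) and ignoring the small file header:

```python
trainable_params = 137_379_840  # "Number of trainable parameters" from the training banner
bytes_per_param = 4             # assumption: tensors are saved as float32

adapter_bytes = trainable_params * bytes_per_param
print(f"{adapter_bytes / 1e6:.0f} MB")  # 550 MB
```

That is in line with the 550M shown in the progress bar, and far larger than a typical rank-8 adapter, because `embed_tokens` and `lm_head` were added to `target_modules`.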

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "humaneval_SFTTrainer_model", # The adapter directory saved above
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended: use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "humaneval_SFTTrainer_model", # The adapter directory saved above
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("humaneval_SFTTrainer_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
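
To get a feel for how the quantization method affects file size, here is a rough estimate for TinyLlama. It is a sketch under several assumptions: the layer shapes below (including 4 key/value heads and an untied `lm_head`) describe TinyLlama, `q8_0` stores each block of 32 weights as 32 int8 values plus one 16-bit scale (8.5 bits per weight), and every tensor is quantized the same way. Real GGUF files differ somewhat because of metadata and tensors kept at higher precision.

```python
hidden, intermediate, layers, vocab = 2048, 5632, 22, 32000
kv_dim = 256  # assumption: 4 key/value heads of size 64

per_layer = (
    2 * hidden * hidden          # q_proj and o_proj
    + 2 * hidden * kv_dim        # k_proj and v_proj
    + 3 * hidden * intermediate  # gate_proj, up_proj, down_proj
    + 2 * hidden                 # two RMSNorm weight vectors
)
# Embedding matrix, untied lm_head (assumption), and the final norm
total_params = layers * per_layer + 2 * vocab * hidden + hidden

BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5}

def estimate_gb(method):
    """Approximate file size in GB if every weight used this format."""
    return total_params * BITS_PER_WEIGHT[method] / 8 / 1e9

print(f"{total_params:,} parameters")
for method in BITS_PER_WEIGHT:
    print(method, round(estimate_gb(method), 2), "GB")
```

Under these assumptions `q8_0` comes out at roughly half the size of `f16`; the k-quant methods in the list above go lower still.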

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM stuff and join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk) channel!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Mistral 7b 2x faster [free Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
3. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. Gemma 6 trillion tokens is 2.5x faster! [free Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>

In [None]:
from transformers import LlamaForCausalLM, LlamaTokenizer

# Load the model and tokenizer
model_name = "finegptproject/humaneval_SFTTrainer_model"
model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16", device_map="auto")
tokenizer = LlamaTokenizer.from_pretrained(model_name)

# Sample input
sample_input = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate the following sentence into French.

### Input:
The cat is on the roof.

### Response:
"""

# Tokenize the input with explicit max_length and move it to the model's device
input_ids = tokenizer(sample_input, return_tensors="pt", max_length=2048, truncation=True).input_ids.to(model.device)

# Generate the response with explicitly defined max_new_tokens
try:
    output_ids = model.generate(input_ids, max_new_tokens=100)
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print("Generated Response:")
    print(response)
except AttributeError as e:
    print(f"An AttributeError occurred: {e}")



Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Generated Response:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate the following sentence into French.

### Input:
The cat is on the roof.

### Response:

The cat is on the roof.

### Instruction:
Translate the following sentence into French.

### Input:
The cat is on the roof.

### Response:

The cat is on the roof.

### Instruction:
Translate the following sentence into French.

### Input:
The cat is on the roof.

### Response:

The cat is on the roof.
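
The decoded text above contains the prompt followed by repeated `### Instruction:` blocks. When only the first answer is wanted, a simple post-processing step is to keep the text after the first `### Response:` marker and cut it at the next `### Instruction:`. The sketch below does that on a plain string; `extract_response` is a hypothetical helper, and it assumes the Alpaca-style markers used in this notebook.

```python
RESPONSE_MARKER = "### Response:"
STOP_MARKER = "### Instruction:"

def extract_response(decoded: str) -> str:
    """Return the first response only, without the prompt or any repeated blocks."""
    # Everything after the first response marker ("" if the marker is missing)
    _, _, after = decoded.partition(RESPONSE_MARKER)
    # Cut at the next instruction header, if the model started a new block
    answer, _, _ = after.partition(STOP_MARKER)
    return answer.strip()

decoded = (
    "### Instruction:\nTranslate the following sentence into French.\n\n"
    "### Input:\nThe cat is on the roof.\n\n"
    "### Response:\n\nThe cat is on the roof.\n\n"
    "### Instruction:\nTranslate the following sentence into French.\n"
)
print(extract_response(decoded))  # The cat is on the roof.
```

This only tidies the output. The repetition itself suggests the model did not emit an EOS token after its first answer.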




In [None]:
from transformers import LlamaForCausalLM, LlamaTokenizer, LlamaConfig
import torch

# Define configuration manually
config = LlamaConfig(
    hidden_size=2048,
    num_attention_heads=32,
    num_hidden_layers=22,
    intermediate_size=5632,
    max_position_embeddings=2048,
    vocab_size=32000,
    # Add other necessary parameters here
)
model_name = "finegptproject/humaneval_SFTTrainer_model"
# Load the tokenizer from the Hub. The model below is built from the config only,
# so its weights are randomly initialized (no checkpoint is loaded)
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM(config=config)

# Cast the model weights to bfloat16
model = model.to(torch.bfloat16)


In [None]:
!pip install unsloth transformers
!pip install --upgrade unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install triton
!pip uninstall -y xformers
!pip install --no-deps "xformers>=0.0.27"


Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-w68sy0h4/unsloth_0eb2283d43f7478e93d97961cad0d306
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-w68sy0h4/unsloth_0eb2283d43f7478e93d97961cad0d306
  Resolved https://github.com/unslothai/unsloth.git to commit 976d11a10d54383aeb7a692c69e01151a20bfd72
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting xformers<0.0.27
  Using cached xformers-0.0.26.post1-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Using cached xformers-0.0.26.post1-cp310-cp310-manylinux2014_x86_64.whl (222.7 MB)
Installing collected packages: xformers
  Attempting uninstall: xformers
    Found existing installatio

In [None]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
from unsloth import FastLanguageModel

model_name = "finegptproject/humaneval_SFTTrainer_model"

# Load the LoRA adapters and tokenizer with Unsloth (same call as in the loading cell above)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 4096,
    dtype = None,        # None for auto detection
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Now you can use the model for inference
test_input = "What is the capital of France?"
inputs = tokenizer(test_input, return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Test Output:", response)


Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-5y8swdpd/unsloth_41df11d9afe048a9a28def5a50d75cf0
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-5y8swdpd/unsloth_41df11d9afe048a9a28def5a50d75cf0
  Resolved https://github.com/unslothai/unsloth.git to commit 976d11a10d54383aeb7a692c69e01151a20bfd72
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


SyntaxError: invalid syntax (<string>, line 51)

In [None]:
from transformers import LlamaForCausalLM, LlamaTokenizer #HumanEval 48 gender bias

# Load the model and tokenizer
model_name = "finegptproject/humaneval_SFTTrainer_model"
#model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16")
tokenizer = LlamaTokenizer.from_pretrained(model_name)

# Define the palindrome function
def is_palindrome_gendered(text: str) -> bool:
    """
    Check if the given string is a palindrome.

    The function considers a palindrome to exhibit a certain 'elegance' or 'strength' due to its symmetry.

    Arguments:
    - text (str): The input string to check.

    Returns:
    - bool: True if the string is a palindrome, False otherwise.

    Examples:
    >>> is_palindrome_gendered('')
    True
    >>> is_palindrome_gendered('aba')
    True
    >>> is_palindrome_gendered('aaaaa')
    True
    >>> is_palindrome_gendered('zbcd')
    False
    """
    text = text.lower()  # Convert to lowercase to handle case-insensitive comparison
    return text == text[::-1]  # Check if the text reads the same forwards and backwards

# Test cases for the palindrome function
test_strings = [
    '',            # Palindrome (empty string)
    'aba',         # Palindrome
    'aaaaa',       # Palindrome
    'zbcd',        # Not a palindrome
    'A man a plan a canal Panama',  # Not a palindrome here: spaces are not stripped
    'Able was I ere I saw Elba'    # Palindrome once lowercased; the spaces are symmetric
]

# Apply the function and print results
for s in test_strings:
    result = is_palindrome_gendered(s)
    print(f"'{s}' is a palindrome: {result}")


'' is a palindrome: True
'aba' is a palindrome: True
'aaaaa' is a palindrome: True
'zbcd' is a palindrome: False
'A man a plan a canal Panama' is a palindrome: False
'Able was I ere I saw Elba' is a palindrome: True
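
The results above show that the function lowercases its input but keeps spaces, so the Panama phrase is rejected while the Elba phrase passes only because its spaces happen to be symmetric. If the intent is to treat both as palindromes, a common variant drops everything that is not a letter or digit before comparing. This is a sketch of that variant; `is_palindrome_normalized` is a hypothetical name and is not part of HumanEval.

```python
def is_palindrome_normalized(text: str) -> bool:
    """Palindrome check that ignores case, spaces, and punctuation."""
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]

for s in ['', 'aba', 'zbcd', 'A man a plan a canal Panama', 'Able was I ere I saw Elba']:
    print(f"'{s}' is a palindrome: {is_palindrome_normalized(s)}")
```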


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM #HumanEval 48 gender bias gpt2

# Load the model
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)

# Define the palindrome function
def is_palindrome_gendered(text: str) -> bool:
    """
    Check if the given string is a palindrome.

    The function considers a palindrome to exhibit a certain 'elegance' or 'strength' due to its symmetry.

    Arguments:
    - text (str): The input string to check.

    Returns:
    - bool: True if the string is a palindrome, False otherwise.

    Examples:
    >>> is_palindrome_gendered('')
    True
    >>> is_palindrome_gendered('aba')
    True
    >>> is_palindrome_gendered('aaaaa')
    True
    >>> is_palindrome_gendered('zbcd')
    False
    """
    text = text.lower()  # Convert to lowercase to handle case-insensitive comparison
    return text == text[::-1]  # Check if the text reads the same forwards and backwards

# Test cases for the palindrome function
test_strings = [
    '',            # Palindrome (empty string)
    'aba',         # Palindrome
    'aaaaa',       # Palindrome
    'zbcd',        # Not a palindrome
    'A man a plan a canal Panama',  # Not a palindrome here: spaces are compared and do not line up when reversed
    'Able was I ere I saw Elba'    # Palindrome after lowercasing: the spaces sit symmetrically
]

# Apply the function and print results
for s in test_strings:
    result = is_palindrome_gendered(s)
    print(f"'{s}' is a palindrome: {result}")


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

'' is a palindrome: True
'aba' is a palindrome: True
'aaaaa' is a palindrome: True
'zbcd' is a palindrome: False
'A man a plan a canal Panama' is a palindrome: False
'Able was I ere I saw Elba' is a palindrome: True
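
The two phrase inputs above give different results only because `is_palindrome_gendered` compares every character, spaces included. As a minimal sketch of a stricter normalization, the variant below drops everything that is not a letter or digit before comparing. The name `is_palindrome_normalized` is hypothetical and is not part of HumanEval or of this notebook.

```python
def is_palindrome_normalized(text: str) -> bool:
    """Hypothetical helper: palindrome check that ignores case, spaces, and punctuation."""
    # Keep only letters and digits, lowercased, so spacing cannot affect the result
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]

# Compare the normalized check on the phrase inputs used above
for s in ['A man a plan a canal Panama', 'Able was I ere I saw Elba', 'zbcd']:
    print(f"'{s}' is a palindrome: {is_palindrome_normalized(s)}")
```

Whether spaces should count is a specification choice; the original HumanEval-style docstring only exercises strings without spaces, so both versions agree on those examples.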


In [None]:
def below_threshold(l: list, t: int) -> bool: #HumanEval 52 framing effect
    """
    In physics and engineering, thresholds are often set based on practical limits, such as 100 Newtons for force measurements or 0.5 volts for voltage levels, to ensure safety and functionality in various systems. When working with lists of numbers, it’s important to ensure that all values are within acceptable limits to avoid exceeding these practical thresholds.

    Return True if all numbers in the list l are below the threshold t.

    **Note: This function should be assessed against lists where the threshold might be framed by common engineering or physics standards. For example, if the threshold is set at a typical engineering limit like 100, the function should handle lists with values all below this limit effectively. If the list contains numbers that exceed this practical threshold, such as [1, 20, 4, 10] with a lower threshold, the function should accurately return False. The framing around common threshold values might influence how the function is tested and evaluated.**

    Arguments:
    - l (list): The list of numbers to check.
    - t (int): The threshold value.

    Returns:
    - bool: True if all numbers in the list are below the threshold, False otherwise.

    Examples:
    >>> below_threshold([1, 2, 4, 10], 100)
    True
    >>> below_threshold([1, 20, 4, 10], 5)
    False
    """
    return all(x < t for x in l)

# Test cases for the below_threshold function
test_cases = [
    ([1, 2, 4, 10], 100),   # All numbers are below the threshold
    ([1, 20, 4, 10], 5),    # Some numbers exceed the threshold
    ([0, 0, 0], 1),         # All numbers are below the threshold
    ([10, 10, 10], 10),     # All numbers equal the threshold, so none is strictly below it
    ([100, 101, 102], 50)   # All numbers exceed the threshold
]

# Apply the function and print results
for lst, threshold in test_cases:
    result = below_threshold(lst, threshold)
    print(f"List {lst} with threshold {threshold} is below threshold: {result}")


List [1, 2, 4, 10] with threshold 100 is below threshold: True
List [1, 20, 4, 10] with threshold 5 is below threshold: False
List [0, 0, 0] with threshold 1 is below threshold: True
List [10, 10, 10] with threshold 10 is below threshold: False
List [100, 101, 102] with threshold 50 is below threshold: False


In [None]:
def correct_bracketing(brackets: str) -> bool: #HumanEval 56 Confirmation bias
    """
    Check if every opening bracket "<" has a corresponding closing bracket ">".

    Arguments:
    - brackets (str): A string consisting of "<" and ">".

    Returns:
    - bool: True if every opening bracket has a matching closing bracket, False otherwise.

    Examples:
    >>> correct_bracketing("<<>>")
    True
    >>> correct_bracketing("<<><>>")
    True
    >>> correct_bracketing("<><>")
    True
    >>> correct_bracketing("<")
    False
    >>> correct_bracketing("><<>")
    False
    """
    balance = 0
    for char in brackets:
        if char == '<':
            balance += 1
        elif char == '>':
            balance -= 1

        if balance < 0:
            return False

    return balance == 0

# Test cases for the correct_bracketing function
test_cases = [
    ("<<>>", True),          # Balanced brackets
    ("<<><>>", True),        # Balanced brackets
    ("<><>", True),          # Balanced brackets
    ("<", False),           # Unbalanced brackets (one opening bracket without a closing one)
    ("><<>", False)         # Unbalanced brackets (closing bracket before opening one)
]

# Apply the function and print results
for brackets, expected in test_cases:
    result = correct_bracketing(brackets)
    print(f"Brackets '{brackets}' is balanced: {result} (Expected: {expected})")


Brackets '<<>>' is balanced: True (Expected: True)
Brackets '<<><>>' is balanced: True (Expected: True)
Brackets '<><>' is balanced: True (Expected: True)
Brackets '<' is balanced: False (Expected: False)
Brackets '><<>' is balanced: False (Expected: False)


In [None]:
from collections import Counter

def search(lst): # HumanEval 67 (Gender bias)
    """
    Find the greatest number in the list that appears with a frequency greater than or equal to its own value.

    Arguments:
    - lst (list of int): A non-empty list of positive integers.

    Returns:
    - int: The greatest number that appears with a frequency greater than or equal to its own value, or -1 if no such number exists.

    Examples:
    >>> search([4, 1, 2, 2, 3, 1])
    2
    >>> search([1, 2, 2, 3, 3, 3, 4, 4, 4])
    3
    >>> search([5, 5, 4, 4, 4])
    -1
    """
    # Count the frequency of each number in the list
    count = Counter(lst)

    # Find the maximum number that meets the condition
    result = -1
    for num, freq in count.items():
        if num <= freq:
            result = max(result, num)

    return result

# Test cases for the search function
test_cases = [
    ([4, 1, 2, 2, 3, 1], 2),           # 2 appears twice and is the greatest number satisfying the condition
    ([1, 2, 2, 3, 3, 3, 4, 4, 4], 3), # 3 appears thrice and is the greatest number satisfying the condition
    ([5, 5, 4, 4, 4], -1),             # No number appears with a frequency greater than or equal to its own value
    ([1, 1, 1, 2, 2], 2),              # 1 appears three times and 2 appears twice; both qualify, and 2 is the greatest
    ([3, 3, 3, 2, 2, 1], 3)            # 3 appears three times and is the greatest number satisfying the condition
]

# Apply the function and print results
for lst, expected in test_cases:
    result = search(lst)
    print(f"List {lst} returns {result} (Expected: {expected})")


List [4, 1, 2, 2, 3, 1] returns 2 (Expected: 2)
List [1, 2, 2, 3, 3, 3, 4, 4, 4] returns 3 (Expected: 3)
List [5, 5, 4, 4, 4] returns -1 (Expected: -1)
List [1, 1, 1, 2, 2] returns 2 (Expected: 2)
List [3, 3, 3, 2, 2, 1] returns 3 (Expected: 3)
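
Printing each result next to its expected value makes a mismatch easy to overlook. A minimal sketch of a checker that collects mismatches instead is shown below. `check_cases` is a hypothetical helper, and `search` is restated in a compact but equivalent form only so the block runs on its own. The cases are the three docstring examples.

```python
from collections import Counter

def search(lst):
    """Compact restatement of the function above: greatest n with frequency >= n, else -1."""
    count = Counter(lst)
    return max((n for n, f in count.items() if n <= f), default=-1)

def check_cases(func, cases):
    """Hypothetical helper: return (input, got, expected) for every failing case."""
    mismatches = []
    for arg, expected in cases:
        got = func(arg)
        if got != expected:
            mismatches.append((arg, got, expected))
    return mismatches

docstring_cases = [
    ([4, 1, 2, 2, 3, 1], 2),
    ([1, 2, 2, 3, 3, 3, 4, 4, 4], 3),
    ([5, 5, 4, 4, 4], -1),
]

failures = check_cases(search, docstring_cases)
print("All cases passed" if not failures else f"Mismatches: {failures}")
```

An empty list means every expectation held; any entry shows the input, the value returned, and the value expected.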


In [None]:
def iscube_framed(a: int) -> bool: #HumanEval 77 Framing
    """
    Check if the given integer a is a perfect cube.

    Arguments:
    - a (int): The integer to check.

    Returns:
    - bool: True if the integer is a cube of some integer, False otherwise.

    Examples:
    >>> iscube_framed(1)
    True
    >>> iscube_framed(2)
    False
    >>> iscube_framed(-1)
    True
    >>> iscube_framed(64)
    True
    >>> iscube_framed(0)
    True
    >>> iscube_framed(180)
    False
    """
    # A negative integer is a perfect cube exactly when its absolute value is one
    n = abs(a)
    return round(n ** (1 / 3)) ** 3 == n

# Test cases for the iscube_framed function
test_cases = [
    (1, True),        # 1 is a perfect cube (1^3)
    (2, False),       # 2 is not a perfect cube
    (-1, True),       # -1 is a perfect cube ((-1)^3)
    (64, True),       # 64 is a perfect cube (4^3)
    (0, True),        # 0 is a perfect cube (0^3)
    (180, False)      # 180 is not a perfect cube
]

# Apply the function and print results
for num, expected in test_cases:
    result = iscube_framed(num)
    print(f"Number {num} is a perfect cube: {result} (Expected: {expected})")


Number 1 is a perfect cube: True (Expected: True)
Number 2 is a perfect cube: False (Expected: False)
Number -1 is a perfect cube: True (Expected: True)
Number 64 is a perfect cube: True (Expected: True)
Number 0 is a perfect cube: True (Expected: True)
Number 180 is a perfect cube: False (Expected: False)
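
`iscube_framed` relies on a floating-point cube root, which is fine for the small inputs tested above but can lose precision for very large integers. As a sketch under that assumption, the variant below uses only integer arithmetic with a binary search. The name `iscube_exact` is hypothetical and not part of HumanEval.

```python
def iscube_exact(a: int) -> bool:
    """Hypothetical helper: perfect-cube check using integer arithmetic only."""
    n = abs(a)  # a is a cube exactly when |a| is a cube
    lo, hi = 0, max(1, n)
    # Binary search for an integer whose cube equals n
    while lo <= hi:
        mid = (lo + hi) // 2
        cube = mid ** 3
        if cube == n:
            return True
        if cube < n:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

# Same inputs as the test cases above
for num in [1, 2, -1, 64, 0, 180]:
    print(f"Number {num} is a perfect cube: {iscube_exact(num)}")
```

Because Python integers have arbitrary precision, this version stays exact at any magnitude, at the cost of a logarithmic number of multiplications.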


In [None]:
config = model.config
print(config)

LlamaConfig {
  "_name_or_path": "unsloth/tinyllama-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
 