# Finetuning LLaMA 3 8B / Gemma 2B with QLoRA (Quantized Low-Rank Adaptation) for parameter-efficient finetuning
Uses the Unsloth package to speed up the finetuning process.

In [1]:
# %%capture
# # Installs Unsloth, Xformers (Flash Attention) and all other packages!
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install --no-deps xformers trl peft accelerate bitsandbytes

In [2]:
import torch
major_version, minor_version = torch.cuda.get_device_capability()
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
!pip install triton transformers
!pip install -U datasets
!pip install --pre -U xformers  # this takes some time


# restart the kernel after running this cell

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-rrkw1bmi/unsloth_110e1b2e12544f889c22e08ec5385cf2
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-rrkw1bmi/unsloth_110e1b2e12544f889c22e08ec5385cf2

  Resolved https://github.com/unslothai/unsloth.git to commit 27fa021a7bb959a53667dd4e7cdb9598c207aa0d
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done


* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

In [3]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 8048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.27.dev792. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


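We add LoRA adapters later, in the training section (`FastLanguageModel.get_peft_model`), so that only a small fraction of all parameters is updated. The original Unsloth notebook quotes 1 to 10%; this run trains 41,943,040 parameters, as the training log shows.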

<a name="Data"></a>
### Data Prep
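The original Unsloth notebook uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the 52K-example original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). This notebook replaces that step with its own data prep: two pickled question/answer DataFrames are filtered, sampled per language, merged, and then formatted with the Alpaca prompt template.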

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [4]:
# Load dataset
import pandas as pd
import os

# df_evaluated = pd.read_pickle("/content/0122_10000_evaluated.pkl")
df_evaluated = pd.read_pickle(os.path.join(os.getcwd(), "0122_10000_evaluated.pkl"))
# df_news = pd.read_pickle("/content/mixtral_integrated_df.pkl")
# df_news = pd.read_pickle("/content/3_new_days_mixtral_integrated_df.pkl")
df_news = pd.read_pickle(os.path.join(os.getcwd(), "3_new_days_mixtral_integrated_df.pkl"))
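# Drop news rows where the LLM call failed
df_news = df_news[df_news['answer'] != 'Error: LLM call failed']
# Keep only evaluated rows with an accuracy score above 4.5
df_evaluated = df_evaluated[df_evaluated["accuracy"] > 4.5]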
# df_evaluated.head(3)
# df_news.head(3)

In [5]:
import pandas as pd


# Number of rows to sample per language
split_language = 1000

# Sample `split_language` rows for every language in the data.
# Note: sample() raises a ValueError if a language has fewer than `split_language` rows.
language_dataframes = {
    lang: df_evaluated[df_evaluated["language"] == lang].sample(split_language, random_state=42)
    for lang in df_evaluated["language"].unique()
}

# Per-language DataFrames (a KeyError is raised if a language is missing from the data)
df_finetuning_en = language_dataframes["en"]  # English
df_finetuning_it = language_dataframes["it"]  # Italian
df_finetuning_es = language_dataframes["es"]  # Spanish
df_finetuning_fr = language_dataframes["fr"]  # French

# Print DataFrame shapes
print(df_finetuning_en.shape)
print(df_finetuning_it.shape)
print(df_finetuning_es.shape)
print(df_finetuning_fr.shape)

df_finetuning = pd.concat([df_finetuning_en, df_finetuning_it, df_finetuning_es, df_finetuning_fr], ignore_index=True)


(1000, 5)
(1000, 5)
(1000, 5)
(1000, 5)
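
The per-language sampling above assumes that every language has at least `split_language` rows; `DataFrame.sample` raises a `ValueError` otherwise. A minimal sketch of a more forgiving variant follows. The toy data and the helper name `sample_per_language` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def sample_per_language(df, n, seed=42):
    """Sample up to `n` rows per language (hypothetical helper)."""
    parts = [
        # Cap the sample size at the group size so small groups do not fail
        group.sample(min(n, len(group)), random_state=seed)
        for _, group in df.groupby("language")
    ]
    return pd.concat(parts, ignore_index=True)

# Toy data: 4 English rows, 2 Italian rows, 1 French row
toy = pd.DataFrame({
    "question": [f"q{i}" for i in range(7)],
    "answer": [f"a{i}" for i in range(7)],
    "language": ["en"] * 4 + ["it"] * 2 + ["fr"],
})

balanced = sample_per_language(toy, n=2)
print(balanced["language"].value_counts().to_dict())
```

With `n=2`, English is down-sampled while French keeps its single row, so the result is as balanced as the data allows.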


In [6]:
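# Tag every news row as Italian; assign() returns a new DataFrame and avoids
# pandas' SettingWithCopyWarning on the filtered frame
df_news = df_news.assign(language='it')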


In [7]:
# Merge the sampled and news DataFrames on 'question', 'answer' and 'language' (outer join keeps rows from both)
merged_df = pd.merge(df_finetuning, df_news, on=['question', 'answer', 'language'], how='outer')
# merged_df.head(2)

In [8]:
# Shuffle the rows, then rename the columns to the Alpaca field names
merged_df = merged_df.sample(frac=1).reset_index(drop=True)
df_finetuning = merged_df.rename(columns = {"question": "instruction", "answer": "output" , "language": "input"})
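
To illustrate what the outer merge and the rename do, here is a small sketch on toy data (all values are invented for illustration). Rows from both frames are kept, and a column that exists on only one side, such as `accuracy`, is filled with `NaN` for the other side's rows.

```python
import pandas as pd

evaluated = pd.DataFrame({
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "language": ["en", "it"],
    "accuracy": [4.8, 5.0],
})
news = pd.DataFrame({
    "question": ["q3"],
    "answer": ["a3"],
    "language": ["it"],
})

# Outer join on the shared key columns: no row from either frame is dropped
merged = pd.merge(evaluated, news, on=["question", "answer", "language"], how="outer")

# Map the columns to the three Alpaca fields used by the prompt template
alpaca = merged.rename(columns={"question": "instruction", "answer": "output", "language": "input"})
print(alpaca)
```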

In [9]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context (if present). Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
# dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
# dataset = dataset.map(formatting_prompts_func, batched = True,)
from datasets import Dataset
dataset = Dataset.from_pandas(df_finetuning)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/4171 [00:00<?, ? examples/s]
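
To make the effect of `formatting_prompts_func` concrete, the sketch below formats a single record without `datasets` or a tokenizer. It assumes the EOS string is `<|end_of_text|>`, which is the token visible in the generation outputs of this notebook; the sample record is invented.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context (if present). Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Assumption: stands in for tokenizer.eos_token of the Llama 3 base model
EOS_TOKEN = "<|end_of_text|>"

def format_batch(examples):
    """Build one training text per record and append the EOS token."""
    texts = [
        ALPACA_PROMPT.format(instruction, input_text, output) + EOS_TOKEN
        for instruction, input_text, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

batch = {
    "instruction": ["Traduci 'ciao' in inglese."],
    "input": ["it"],   # this notebook stores the language code in the input field
    "output": ["hello"],
}
text = format_batch(batch)["text"][0]
print(text)
```

The EOS token at the end of each text is what teaches the model to stop after the response.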

### Inference before training

In [10]:
#test the model

def prompt_inference(prmpt):
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
    inputs = tokenizer(
    [
        prmpt
    ], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True)
    return tokenizer.batch_decode(outputs)[0].split("### Response:")[-1]
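
`prompt_inference` returns everything after the last `### Response:` marker, which still includes the EOS token when the model stops on its own. A small sketch of a post-processing helper that also strips that token is shown below; the name `extract_response` and the sample string are hypothetical.

```python
def extract_response(decoded, marker="### Response:", eos="<|end_of_text|>"):
    """Return the generated answer without the prompt and without the EOS token."""
    answer = decoded.split(marker)[-1]
    # Drop the EOS token and anything the model may have produced after it
    answer = answer.split(eos)[0]
    return answer.strip()

decoded = (
    "<|begin_of_text|>Below is an instruction that describes a task.\n\n"
    "### Instruction:\nContinue the sequence.\n\n"
    "### Response:\n13, 21, 34<|end_of_text|>"
)
print(extract_response(decoded))  # prints: 13, 21, 34
```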

In [11]:
prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Cos'la serie di fibonacci, restituisci per 10 elementi commentandoli testualmente

### Response:"""

print("result")
prompt_inference(prmpt=prompt)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


result


" \n```python\n# Cos'la serie di fibonacci, restituisci per 10 elementi commentandoli testualmente\ndef fibonacci(n):\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n    else:\n        return fibonacci(n-1) + fibonacci(n-2)\n\nfor i in range(10):\n    print(fibonacci(i))\n```<|end_of_text|>"

In [12]:
df_test_it_before_training = df_finetuning[df_finetuning['input']=='it']['instruction'].iloc[0:10]

In [13]:
for idx, instruction in df_test_it_before_training.items():
  print(f"[**]Prompt {idx+1}")
  print(instruction)
  print("[*]result:")
  # Build the prompt with the same Alpaca template that is used for training
  inputs = tokenizer(
    [
        alpaca_prompt.format(
            instruction, # instruction
            "it", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
  outputs = model.generate(**inputs, max_new_tokens = 200, use_cache = True)
  outputs = tokenizer.batch_decode(outputs)
  # Keep only the text generated after the "Response:" marker
  result = outputs[0].split("Response:")[-1].strip()

  print(result)
  print("\n")  # Add a newline for better readability

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[**]Prompt 1
Write a Lua script that generates random maze patterns using Prim's algorithm.
[*]result:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


```lua
function generate_maze(n, m)
    local maze = {}

    for i = 1, n do
        maze[i] = {}
        for j = 1, m do
            maze[i][j] = {}
        end
    end

    -- fill in with random values

    return maze
end

function prim(maze, start)
    local frontier = {}
    local discovered = {}
    local visited = {}

    table.insert(frontier, start)

    while #frontier > 0 do
        local current = table.remove(frontier)

        discovered[current] = true

        for neighbor, value in pairs(maze[current]) do
            if not discovered[neighbor] then
                table.insert(frontier, neighbor)
            end
        end
    end

    return visited
end

local maze = generate_maze(5, 5)

local start = 1
local end = 5

prim(maze, start


[**]Prompt 6
Write a PHP script that takes user input from a web form and stores it in a MySQL database.
[*]result:
<|end_of_text|>


[**]Prompt 9
Vedi una barca piena di gente. Non è affondata, ma se guardi di nuovo non vedi una so

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


It was the Titanic.<|end_of_text|>


[**]Prompt 14
Considerati più titoli e contenuti degli articoli, riassumili e integrali in un unico testo.
        Input: Title: Mitsotakis: 'L'immigrazione può risolvere carenza di manodopera' 
 Content: ATENE - "Non temiamo il termine integrazione; siamo una società aperta che ha dimostrato, in passato, la volontà di accogliere coloro che cercano di integrarsi nella società greca. Rendere la loro vita permanente qui nel tempo è un passo naturale". Lo ha dichiarato il premier greco, Kyriakos Mitsotakis, durante una conferenza ad Atene dal nome 'Soluzioni europee alla sfida comune della migrazione', alla presenza del vicepresidente della Commissione europea, Margaritis Schinas, e della commissaria europea per gli Affari interni, Ylva Johansson. Lo riporta il sito di Kathimerini. "Nel 2023 abbiamo gestito i flussi migratori in modo più efficace rispetto a molti dei nostri partner", ha ricordato Mitsotakis, sottolineando che "la migrazione non è neces

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


La politica migratoria la detta l'Europa. 
Non si placa la polemica politica suscitata dall'intesa del governo con il partito indipendentista JuntsXCat per la cessione alla Catalogna delle competenze "integrali" in materia di immigrazione, come contropartita al via libera a tre decreti chiave. Accordo che ha provocato frizioni del partito dell'ex presidente catalano Carles Puigdemont con i soci repubblicani di Erc al governo della Generalitat, tenuti all'oscuro. Il ministro di Presidenza, Giustizia e Rapporti con il parlamento, Felix Bolanos, ha avvertito oggi che "la politica migratoria è una politica europea" e che "gli orientamenti vengono dall'Europa", in dichiarazioni ai cronisti dopo un incontro con il presidente dell'alto


[**]Prompt 18
Un treno lascia New York City a 60 mph mentre un altro treno lascia Los Angeles viaggiando a 80 mph. Quante banane ci sono su ogni treno?
[*]result:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


There are 120 bananas on each train.

### Explanation:
There are 120 bananas on each train. The first train is traveling at 60 miles per hour. The second train is traveling at 80 miles per hour. To find the number of bananas on each train, we must multiply the number of miles each train travels by the number of bananas on each train. The first train travels 60 miles, so there are 60 bananas on that train. The second train travels 80 miles, so there are 80 bananas on that train. To find the total number of bananas on each train, we must add the number of bananas on each train. The first train has 60 bananas, and the second train has 80 bananas. Therefore, there are 120 bananas on each train.

### Comment:
The response is correct, but it does not explain how the number of bananas on each train was calculated. The response should explain how the number of bananas on each train was calculated.

###


[**]Prompt 26
Un parco a tema sta progettando di costruire nuove montagne russe con un cos

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


it
<|end_of_text|>


[**]Prompt 29
Parte dell'ala di un uccello, questo palindromo aiuta con la stabilità del volo.
[*]result:
<|end_of_text|>


[**]Prompt 33
Data la domanda: Kendall ha chiesto a Tracy se voleva andare a vedere Lady Gaga; decise che non avrebbe accettato un no come risposta. Dato il contesto: cosa ha fatto Kendall? Possibili risposte: ha venduto i biglietti per Lady Gaga, ha deciso che avrebbe accettato solo una risposta sì, è andata a vedere Lady Gaga da sola
La risposta è:
[*]result:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


ha venduto i biglietti per Lady Gaga
<|end_of_text|>


[**]Prompt 35
Domanda: scegli l'opzione in linea con il buon senso per rispondere alla domanda. Domanda: John gioca a scacchi con il suo compagno di stanza. Fanno una mossa ciascuno tra una lezione e l'altra. Dove è molto probabilmente sistemato il suo set degli scacchi? Opzioni: A. Canada B. armadio C. dormitorio D. salotto E. cassetto
Risposta:
[*]result:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


The set of chess is in the room of the dormitory.

<|end_of_text|>


[**]Prompt 36
Informazioni: - Il tenente generale Ali Muhammad Jan Aurakzai (lingua urdu:), è un ufficiale generale in pensione di grado a tre stelle dell'esercito pakistano che ha servito come comandante del corpo dell'XI corpo e comandante principale del comando occidentale. In qualità di comandante, ha comandato tutte le risorse militari di combattimento e ha supervisionato il dispiegamento pacifico dell'XI Corpo nelle aree settentrionali e nelle aree tribali ad amministrazione federale (FATA). Aurakzai era il principale generale dell'esercito che guidò le forze combattenti del Pakistan in risposta all'invasione americana dell'Afghanistan in seguito agli attacchi terroristici negli Stati Uniti. Dopo essersi ritirato dall'esercito, è stato nominato governatore del Khyber-Pakhtunkhwa del Pakistan, dal maggio 2006 fino alle sue dimissioni nel gennaio 2008. - Una repubblica islamica è il nome dato a diversi stati in pa

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for one full epoch (`num_train_epochs = 1`); for a quick test, comment that out and set `max_steps = 60` instead. We also support TRL's `DPOTrainer`!

In [14]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
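
The number of trainable parameters reported in the training log can be reproduced by hand. For each adapted linear layer with input size `d_in` and output size `d_out`, LoRA adds `r * (d_in + d_out)` weights. The sketch below assumes the published Llama 3 8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of size 128, 32 layers); these values are not stated in this notebook.

```python
R = 16                 # LoRA rank, as passed to get_peft_model
HIDDEN = 4096          # assumed Llama 3 8B hidden size
INTERMEDIATE = 14336   # assumed MLP (feed-forward) size
KV_DIM = 8 * 128       # assumed: 8 key/value heads of dimension 128
NUM_LAYERS = 32

# (d_in, d_out) of every targeted projection in one transformer layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # prints: 41,943,040
```

Under these assumptions the result matches the "Number of trainable parameters" line printed when training starts, which is roughly 0.5% of an 8B-parameter model.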


In [15]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 50,
        # max_steps = 60,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        # fp32 = True,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/4171 [00:00<?, ? examples/s]

In [16]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.168 GB.
7.0 GB of memory reserved.


In [17]:
# install the latest version of torch and xFormers
# ! pip install torch xformers

In [18]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,171 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 1,042
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 10 | 1.87 |
| 20 | 1.4789 |
| 30 | 1.2115 |
| 40 | 1.1019 |
| 50 | 0.9549 |
| 60 | 1.1407 |
| 70 | 1.0523 |
| 80 | 0.942 |
| 90 | 1.1569 |
| 100 | 1.1629 |

In [26]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

4140.891 seconds used for training.
69.01 minutes used for training.
Peak reserved memory = 12.959 GB.
Peak reserved memory for training = 5.959 GB.
Peak reserved memory % of max memory = 58.458 %.
Peak reserved memory for training % of max memory = 26.881 %.
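
The figures above can be cross-checked with a few lines of arithmetic. The sketch below takes the printed values as inputs and derives the memory percentages plus two throughput numbers that the notebook does not print. Because the cell divides bytes by 1024 three times, its "GB" values are strictly GiB.

```python
# Values copied from the outputs in this notebook
max_memory = 22.168        # total GPU memory
start_reserved = 7.0       # reserved before training
peak_reserved = 12.959     # peak reserved after training
train_runtime_s = 4140.891
total_steps = 1042
num_examples = 4171

training_memory = round(peak_reserved - start_reserved, 3)
peak_pct = round(peak_reserved / max_memory * 100, 3)
training_pct = round(training_memory / max_memory * 100, 3)
print(training_memory, peak_pct, training_pct)

# Throughput derived from the runtime (not printed by the notebook)
print(f"{train_runtime_s / total_steps:.2f} s per optimizer step")
print(f"{num_examples / train_runtime_s:.2f} examples per second")
```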


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [29]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 250, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context (if present). Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\nThe Fibonacci sequence continues as: 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267444597, 433494437, 701408733, 1134903170, 1836311903, 29712151<|end_of_text|>']

In [31]:
for idx, instruction in df_test_it_before_training.items():
  print(f"[**]Prompt {idx+1}")
  print(instruction)
  print("[*]result:")
  # Same prompts as before training, so the outputs can be compared directly
  inputs = tokenizer(
    [
        alpaca_prompt.format(
            instruction, # instruction
            "it", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
  outputs = model.generate(**inputs, max_new_tokens = 500, use_cache = True)
  outputs = tokenizer.batch_decode(outputs)
  # Keep only the text generated after the "Response:" marker
  result = outputs[0].split("Response:")[-1].strip()

  print(result)
  print("\n")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[**]Prompt 1
Write a Lua script that generates random maze patterns using Prim's algorithm.
[*]result:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


```lua
-- Prim's Algorithm for generating random maze patterns in Lua

local function generate_maze(width, height)
    local maze = {}
    local visited = {}

    for y = 1, height do
        maze[y] = {}
        for x = 1, width do
            maze[y][x] = false
            visited[y][x] = false
        end
    end

    local start_x, start_y = math.random(width), math.random(height)
    visited[start_y][start_x] = true
    maze[start_y][start_x] = true

    local queue = {{start_x, start_y}}

    while #queue > 0 do
        local current = table.remove(queue)
        local x, y = current[1], current[2]

        for _, neighbor in ipairs({{x - 1, y}, {x + 1, y}, {x, y - 1}, {x, y + 1}}) do
            local nx, ny = neighbor[1], neighbor[2]

            if nx >= 1 and nx <= width and ny >= 1 and ny <= height and not visited[ny][nx] then
                visited[ny][nx] = true
                maze[ny][nx] = true

                table.insert(queue, {nx, ny})
            end
        end


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


To create a PHP script that takes user input from a web form and stores it in a MySQL database, you need to have the following things set up:

1. A web server with PHP support.
2. A MySQL database with a table named "users" containing columns for ID, name, email, and password.

Here's a simple example of how you can do this:

```php
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>PHP Form to MySQL</title>
</head>
<body>

<form action="store.php" method="post">
    Name: <input type="text" name="name"><br>
    Email: <input type="email" name="email"><br>
    Password: <input type="password" name="password"><br>
    <input type="submit" value="Submit">
</form>

</body>
</html>
```

This is a simple HTML form that accepts user input for name, email, and password. When the user submits the form, it sends the data to a PHP script called "store.php".

Now let's create the "store.php" file:

```php
<?php
// Database credentials
$servername = "localhost";
$usernam

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Tutti sono in mare.<|end_of_text|>


[**]Prompt 14
Considerati più titoli e contenuti degli articoli, riassumili e integrali in un unico testo.
        Input: Title: Mitsotakis: 'L'immigrazione può risolvere carenza di manodopera' 
 Content: ATENE - "Non temiamo il termine integrazione; siamo una società aperta che ha dimostrato, in passato, la volontà di accogliere coloro che cercano di integrarsi nella società greca. Rendere la loro vita permanente qui nel tempo è un passo naturale". Lo ha dichiarato il premier greco, Kyriakos Mitsotakis, durante una conferenza ad Atene dal nome 'Soluzioni europee alla sfida comune della migrazione', alla presenza del vicepresidente della Commissione europea, Margaritis Schinas, e della commissaria europea per gli Affari interni, Ylva Johansson. Lo riporta il sito di Kathimerini. "Nel 2023 abbiamo gestito i flussi migratori in modo più efficace rispetto a molti dei nostri partner", ha ricordato Mitsotakis, sottolineando che "la migrazione non è neces

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Kyriakos Mitsotakis, premier greco, ha dichiarato che l'immigrazione può risolvere la carenza di manodopera in Grecia, sottolineando la volontà della società greca di accogliere coloro che cercano di integrarsi nella società. Ha inoltre affermato che la Grecia può fare da apripista in questo processo. Il parlamento greco ha recentemente votato una legge che concede un permesso di residenza e di lavoro di tre anni agli immigrati irregolari che rispettino determinati parametri.

Matteo Salvini, vicepremier italiano, ha annunciato che non si candiderà alle prossime elezioni europee, continuando a fare il ministro. Ha anche anticipato che il generale Roberto Vannacci è ancora uno dei suoi obiettivi per le europee. Tuttavia, Vannacci ha chiarito che l'ultima parola spetta ancora a lui e che non ha mai detto di essersi candidato.

Il governo spagnolo ha sottolineato che la politica migratoria è una politica europea e che gli orientamenti vengono dall'Europa. Il ministro di Presidenza, Giusti

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Ci sono 0 banane su ogni treno.<|end_of_text|>


[**]Prompt 26
Un parco a tema sta progettando di costruire nuove montagne russe con un costo di costruzione stimato di 10 milioni di dollari. Si prevede di attirare altri 200.000 visitatori all'anno grazie alla nuova attrazione. Il prezzo medio del biglietto è di $ 50 e si prevede che le spese operative del parco aumenteranno del 15% dopo la costruzione delle montagne russe. Calcola il periodo di recupero dell'investimento e fornisci una motivazione dettagliata.
[*]result:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Per calcolare il periodo di recupero dell'investimento, dobbiamo prima calcolare il ricavo annuale e le spese operative. Successivamente, divideremo il costo di costruzione stimato per il ricavo annuale per scoprire quante volte il parco a tema deve vendere i biglietti per recuperare l'investimento.

Ricavo annuale = numero di visitatori * prezzo medio del biglietto
Ricavo annuale = 200.000 visitatori * $ 50/biglietto
Ricavo annuale = $ 10.000.000

Spese operative = spese operative iniziali + incremento di spese operative
Spese operative = $ 10.000.000 * 0,15
Spese operative = $ 1.500.000

Periodo di recupero = Costo di costruzione / Ricavo annuale - Spese operative
Periodo di recupero = $ 10.000.000 / $ 10.000.000 - $ 1.500.000
Periodo di recupero = 8,33 anni

Il periodo di recupero dell'investimento è di circa 8,33 anni.<|end_of_text|>


[**]Prompt 29
Parte dell'ala di un uccello, questo palindromo aiuta con la stabilità del volo.
[*]result:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


ala<|end_of_text|>


[**]Prompt 33
Data la domanda: Kendall ha chiesto a Tracy se voleva andare a vedere Lady Gaga; decise che non avrebbe accettato un no come risposta. Dato il contesto: cosa ha fatto Kendall? Possibili risposte: ha venduto i biglietti per Lady Gaga, ha deciso che avrebbe accettato solo una risposta sì, è andata a vedere Lady Gaga da sola
La risposta è:
[*]result:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Ha deciso che avrebbe accettato solo una risposta sì.<|end_of_text|>


[**]Prompt 35
Domanda: scegli l'opzione in linea con il buon senso per rispondere alla domanda. Domanda: John gioca a scacchi con il suo compagno di stanza. Fanno una mossa ciascuno tra una lezione e l'altra. Dove è molto probabilmente sistemato il suo set degli scacchi? Opzioni: A. Canada B. armadio C. dormitorio D. salotto E. cassetto
Risposta:
[*]result:


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


C. dormitorio

Spiegazione: John gioca a scacchi con il suo compagno di stanza, che significa che probabilmente hanno un set di scacchi nella stanza dove dormono, che è il dormitorio. Le altre opzioni non sono molto probabili perché il Canada è un paese, l'armadio è un posto dove si mettono le vestiti, il salotto è una stanza per ricevere ospiti e il cassetto è un piccolo contenitore per oggetti.<|end_of_text|>


[**]Prompt 36
Informazioni: - Il tenente generale Ali Muhammad Jan Aurakzai (lingua urdu:), è un ufficiale generale in pensione di grado a tre stelle dell'esercito pakistano che ha servito come comandante del corpo dell'XI corpo e comandante principale del comando occidentale. In qualità di comandante, ha comandato tutte le risorse militari di combattimento e ha supervisionato il dispiegamento pacifico dell'XI Corpo nelle aree settentrionali e nelle aree tribali ad amministrazione federale (FATA). Aurakzai era il principale generale dell'esercito che guidò le forze combattenti

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [21]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context (if present). Write a response that appropriately completes the request.

### Instruction:
Continue the fibonnaci sequence.

### Input:
1, 1, 2, 3, 5, 8

### Response:
13, 21, 34, 55, 89, 144<|end_of_text|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [32]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')
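
As a quick sanity check after saving, you can list what actually landed in the output folder. The helper below is only a minimal sketch and is not part of Unsloth: `list_saved_files` and `missing_files` are hypothetical names, and the `adapter_config.json` entry in the usage comment assumes the usual PEFT adapter layout.

```python
from pathlib import Path


def list_saved_files(folder):
    """Return {relative_path: size_in_bytes} for every file under `folder`."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"{folder!r} is not a directory")
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def missing_files(folder, expected):
    """Return the names in `expected` that are absent from `folder`."""
    present = set(list_saved_files(folder))
    return [name for name in expected if name not in present]


# Usage (assumed file names; adjust to what your save actually produced):
# expected = ["adapter_config.json", "tokenizer_config.json", "tokenizer.json"]
# print(missing_files("lora_model", expected))
```

An empty list from `missing_files` means every expected file is present; it says nothing about whether the weights themselves are valid.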

In [6]:
# #free memory 
# # import torch
# # torch.cuda.empty_cache()
# !pip install numba
# from numba import cuda
# device = cuda.get_current_device()
# device.reset()


To load the LoRA adapters we just saved for inference, run the cell below with its condition set to `True`:

In [4]:
if True:
    max_seq_length = 8048 # Choose any! We auto support RoPE Scaling internally!
    dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.27.dev792. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [5]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context (if present). Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass
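
To see what the formatting function produces without loading a tokenizer, here is a minimal standalone sketch. It reuses the same Alpaca template, but `format_batch` is a hypothetical stand-in for `formatting_prompts_func`, and the EOS string `<|end_of_text|>` is assumed from the generation outputs shown earlier rather than read from `tokenizer.eos_token`.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context (if present). Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def format_batch(examples, eos_token):
    """Fill the template for each row of a column-oriented batch and append EOS."""
    texts = []
    rows = zip(examples["instruction"], examples["input"], examples["output"])
    for instruction, context, output in rows:
        texts.append(ALPACA_PROMPT.format(instruction, context, output) + eos_token)
    return {"text": texts}


# A one-row batch in the same column layout that `datasets` passes with batched=True.
batch = {
    "instruction": ["Continue the Fibonacci sequence."],
    "input": ["1, 1, 2, 3, 5, 8"],
    "output": ["13, 21, 34"],
}
formatted = format_batch(batch, eos_token="<|end_of_text|>")
print(formatted["text"][0])
```

The trailing EOS token is what teaches the model where a response ends; at inference time the output slot is left empty so the model fills it in.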

In [6]:


# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context (if present). Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is a famous tall tower in Paris?\n\n### Input:\n\n\n### Response:\nThe Eiffel Tower<|end_of_text|>']
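
The decoded string contains the whole prompt plus special tokens. If you only want the answer, one option is to pass `skip_special_tokens=True` to `tokenizer.batch_decode`; another is to post-process the string yourself. The sketch below does the latter with plain string operations. `extract_response` is a hypothetical helper, and the special-token strings are taken from the output above.

```python
def extract_response(
    decoded,
    marker="### Response:",
    special_tokens=("<|begin_of_text|>", "<|end_of_text|>"),
):
    """Return the text after the first `marker`, without special tokens."""
    _, found, tail = decoded.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in decoded text")
    for token in special_tokens:
        tail = tail.replace(token, "")
    return tail.strip()


decoded = (
    "<|begin_of_text|>Below is an instruction that describes a task, paired "
    "with an input that provides further context (if present). Write a "
    "response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat is a famous tall tower in Paris?\n\n"
    "### Input:\n\n\n### Response:\nThe Eiffel Tower<|end_of_text|>"
)
print(extract_response(decoded))
```

This splits on the first occurrence of the marker, so it assumes the instruction and input do not themselves contain `### Response:`.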

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since downloading `4bit` models is not supported, and Unsloth's **inference is 2x faster**.

In [24]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [7]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if True: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... Done.
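
The `if True:` / `if False:` toggles are easy to mis-set. As an alternative, here is a minimal sketch of a single wrapper that validates the save method before calling the same two methods used in the cell above. `export_model` and `VALID_SAVE_METHODS` are hypothetical names; only `save_pretrained_merged`, `push_to_hub_merged`, and the three `save_method` values come from this notebook.

```python
VALID_SAVE_METHODS = ("merged_16bit", "merged_4bit", "lora")


def export_model(model, tokenizer, target, save_method="lora", push=False, token=None):
    """Save locally or push to the Hub using one of the known save methods.

    `target` is a local folder when push=False, or a Hub repo id when push=True.
    """
    if save_method not in VALID_SAVE_METHODS:
        raise ValueError(
            f"unknown save_method {save_method!r}; expected one of {VALID_SAVE_METHODS}"
        )
    if push:
        if not token:
            raise ValueError("a Hugging Face token is required when push=True")
        model.push_to_hub_merged(target, tokenizer, save_method=save_method, token=token)
    else:
        model.save_pretrained_merged(target, tokenizer, save_method=save_method)
    return save_method


# Usage with the objects from this notebook:
# export_model(model, tokenizer, "model", save_method="lora")
```

A typo such as `"merged_16"` now fails immediately with a clear message instead of partway through a long save.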


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. All quantization methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
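
To get a feel for how the quantization method affects file size, a back-of-the-envelope estimate is parameters times bits per weight. The sketch below is only a rough guide: the `q8_0` figure follows from its block layout, the k-quant figures are approximate averages that vary by model, the parameter count is an assumed round number, and GGUF metadata overhead is ignored. `estimate_gguf_size_gb` is a hypothetical helper.

```python
# Approximate bits per weight. q8_0 follows from its block layout
# (32 int8 weights plus one fp16 scale = 34 bytes per 32 weights);
# the k-quant values are rough averages and differ between models.
APPROX_BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q5_k_m": 5.7,
    "q4_k_m": 4.85,
}


def estimate_gguf_size_gb(n_params, quantization_method):
    """Estimate the GGUF file size in GB (10**9 bytes), ignoring metadata."""
    try:
        bits = APPROX_BITS_PER_WEIGHT[quantization_method]
    except KeyError:
        raise ValueError(f"no estimate for {quantization_method!r}") from None
    return n_params * bits / 8 / 1e9


N_PARAMS = 8_000_000_000  # assumed round figure for an 8B-parameter model
for method in APPROX_BITS_PER_WEIGHT:
    print(f"{method:>7}: ~{estimate_gguf_size_gb(N_PARAMS, method):.1f} GB")
```

Keep in mind that the conversion first merges the weights to 16bit, so peak memory during export can be well above the final file size.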

In [8]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 40.1 out of 60.46 RAM for saving.


 34%|███▍      | 11/32 [00:00<00:01, 17.47it/s]


make: Entering directory '/teamspace/studios/this_studio/llama.cpp'
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 9.4

  0%|          | 0/32 [00:00<?, ?it/s]


OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import shutil
shutil.copy('./model-unsloth.Q4_K_M.gguf', '/content/drive/MyDrive/')

# files.upload({'model-unsloth.Q4_K_M.gguf': '/content/drive/MyDrive/models/model-unsloth.Q4_K_M.gguf'})
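
Copying a multi-gigabyte file to a mounted drive can fail quietly if the source is missing or the destination folder does not exist. The following is a minimal sketch of a more defensive copy; `copy_model_file` is a hypothetical helper, and the paths in the usage comment simply mirror the cell above.

```python
import shutil
from pathlib import Path


def copy_model_file(src, dest_dir):
    """Copy `src` into `dest_dir`, creating the folder and checking the size."""
    src_path = Path(src)
    if not src_path.is_file():
        raise FileNotFoundError(f"source file not found: {src_path}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create missing folders
    copied = Path(shutil.copy2(src_path, dest))  # returns the new file's path
    if copied.stat().st_size != src_path.stat().st_size:
        raise OSError(f"size mismatch after copying to {copied}")
    return copied


# Usage (paths as in the cell above):
# copy_model_file("./model-unsloth.Q4_K_M.gguf", "/content/drive/MyDrive/")
```

The size comparison is a cheap check only; it does not verify the file contents byte for byte.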

Now, use the `model-unsloth.gguf` file or the `model-unsloth.Q4_K_M.gguf` file in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).
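
Before loading the file elsewhere, it can be worth confirming that the export finished and produced a real GGUF file, especially after an interrupted run. The sketch below reads only the fixed-size header. It assumes the GGUF version 2 or later layout (magic bytes `GGUF`, a little-endian `uint32` version, then two `uint64` counts); `read_gguf_header` is a hypothetical helper, not part of `llama.cpp` or Unsloth.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return version, tensor count and metadata count from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path!r} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 followed by two uint64 values
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", header[4:])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Usage (file name as in the copy cell above):
# print(read_gguf_header("./model-unsloth.Q4_K_M.gguf"))
```

A `ValueError` here usually means the file is truncated or is not a GGUF file at all; a valid header does not guarantee that the tensor data after it is complete.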

And we're done! If you have any questions about Unsloth, find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk) channel!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>