In [8]:
#%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
#!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
#!pip install --no-deps xformers trl peft accelerate bitsandbytes

<a name="Data"></a>
### Data Prep
The original template uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). The cells below replace that step with custom Slovenian conversation data loaded from JSON files. You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).
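
As a rough illustration of what "training on completions only" means (this is not TRL's actual implementation): tokens that belong to the prompt get the label `-100`, which PyTorch's cross-entropy loss ignores by default, so only the response tokens contribute to the loss. The token IDs and the helper name `mask_prompt_labels` below are hypothetical.

```python
# Sketch only: toy token IDs, not a real tokenizer or TRL's data collator.
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, prompt_len):
    """Return labels in which the first prompt_len positions are ignored."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must lie within the sequence")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Hypothetical example: 4 prompt tokens followed by 3 response tokens.
input_ids = [101, 7, 8, 9, 42, 43, 2]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 42, 43, 2]
```

In practice the prompt length is found by locating the response marker in the tokenized text, which is what the linked TRL documentation covers.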

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [9]:
#!unzip model_finetune_fincance.zip
#!unzip model_finetune_alter.zip
#!unzip model_finetune_slotech.zip

In [10]:
# NOTE: Change these paths to match your directory layout.

import json

# Each file holds a list of two-message conversations (see the assert in the next cell).
with open("./model_finetune_slotech.json", "r", encoding="utf-8") as f:
    data_slotech = json.load(f)
with open("./model_finetune_fincance.json", "r", encoding="utf-8") as f:
    data_fincance = json.load(f)
with open("./model_finetune_alter.json", "r", encoding="utf-8") as f:
    data_alter = json.load(f)

In [11]:
data = []
data.extend(data_slotech)
data.extend(data_fincance)
data.extend(data_alter)

In [12]:
import pandas as pd
d = []
for x in data:
    assert len(x)==2
    instruction = "This is conversation in slovenian language. Input will be in slovenian language. Response should also be in slovenian language."
    d.append({
        x[0]["role"]:x[0]["content"],
        x[1]["role"]:x[1]["content"],
        "instruction":instruction
    })

df = pd.DataFrame(d)
df = df.rename(columns={'user':'input', 'assistant':'output', "instruction":"instruction"})
df

Unnamed: 0,input,output,instruction
0,Vroče k hudič....kaj storiti ? novo kišto ?,"ja odpri, poglej če je vent sploh vkloplen (a ...",This is conversation in slovenian language. In...
1,Vroče k hudič....kaj storiti ? novo kišto ?,"Rzdri napajalc, ven vzemi vent, poglej če je z...",This is conversation in slovenian language. In...
2,Vroče k hudič....kaj storiti ? novo kišto ?,"Ja, poglej in ga menjaj. Pa ne laufaj računal...",This is conversation in slovenian language. In...
3,sam iz kasnih banalnih stvari folk dela proble...,"@user ma prav. Pač ne šlatej, če je vroče. Amp...",This is conversation in slovenian language. In...
4,"Odpri napajalc, zamenji vent in drugic malo po...",DR_M : pač bolje preventiva kot potem kurativa...,This is conversation in slovenian language. In...
...,...,...,...
444799,"Vem, samo ravno tako si moraš vzet nekaj časa....",za urejanje je drugače full bolj uporaben je p...,This is conversation in slovenian language. In...
444800,"Vem, samo ravno tako si moraš vzet nekaj časa....","ksz: izšel je update za 6xxx,7xxx,8xxx in 9xx...",This is conversation in slovenian language. In...
444801,"Vem, samo ravno tako si moraš vzet nekaj časa....","Hmm, na podnapisi.net sem opazil, da imajo nek...",This is conversation in slovenian language. In...
444802,"Vem, samo ravno tako si moraš vzet nekaj časa....",Ali imate prav nastavljeno kodiranje podnapiso...,This is conversation in slovenian language. In...


In [13]:
df.iloc[0]["input"],df.iloc[0]["output"]

('Vroče k hudič....kaj storiti ? novo kišto ?',
 'ja odpri, poglej če je vent sploh vkloplen (a si kak čudne zadeve s škatlo delu?  ), lahko da je šow vent... če napajalc in vse ostalo dela, zamenji vent pa bo.  hehe, men se je včeri vent v enem testnem napajalcu skuru... menda zato, ker je dubu 60V...')

In [14]:
import sklearn
import sklearn.model_selection
df_train, df_test = sklearn.model_selection.train_test_split(df,train_size=0.01,random_state=42, shuffle=True)

In [15]:
print("TRAIN")
display(df_train)
print("TEST")
display(df_test)

TRAIN


Unnamed: 0,input,output,instruction
333998,Novo pri Telemachu II.. Nadaljujemo temo. http...,Tudi jaz sem imel težave v tujini. Isto je bil...,This is conversation in slovenian language. In...
390025,"Zakaj se ne bom vozil? Če mulota vprašam, bo r...",Kaj maš pa za šraufat?,This is conversation in slovenian language. In...
285622,Novo pri Izimobilu. Spam zaželjen,Hmm...na podpori so še aktivni ...,This is conversation in slovenian language. In...
44760,"Iščem nekoga, ki bi mi naredil nekaj kopij. Ne...",Ponavadi je tipka za aktivacijo novega daljinc...,This is conversation in slovenian language. In...
92122,"se oproščam, da se nisem že včeraj javil, bi s...",A pr men (sva skoraj soseda) ti za 32EUR proda...,This is conversation in slovenian language. In...
...,...,...,...
259178,@assistant je napisal(a): wtf jap,"a bomo odprli oglasno desko rekreacije :) evo,...",This is conversation in slovenian language. In...
365838,Kako sem jaz na hitro računal bi blo okol 1000...,pa saj menjaš samo tiste ventile ki so hin. Os...,This is conversation in slovenian language. In...
131932,Na enem pcju virtualko in potem rdp :),Zgleda da bo tole še najbolj pametno - tudi za...,This is conversation in slovenian language. In...
146867,Ti TV podpira 4k? :),ne,This is conversation in slovenian language. In...


TEST


Unnamed: 0,input,output,instruction
245758,V Ambrusu so odstranili barikade. neverjetno. ...,"policija ščiti rome pre večino, ko pa romi kog...",This is conversation in slovenian language. In...
68827,A so žarnice z žarilno nitko sploh še na voljo...,"Beri tole temo. Odsvetujem LED svetilko, ker ...",This is conversation in slovenian language. In...
138732,"Torej kakšne so dejansko slabosti, če bi se pr...",A stanje na računu Unicredit je pa kaj bolj?,This is conversation in slovenian language. In...
133298,Prejel sem še eno zanimivo ponudbo od chemets....,Seveda je med skeniranjem potrebno objekt na m...,This is conversation in slovenian language. In...
222000,[@assistant] > [@user] > > [@assistant] > > Za...,Zaključek je briljanten :).,This is conversation in slovenian language. In...
...,...,...,...
15372,"Sploh se men to cudno zdi, ker je strojni inzi...","Webdizajnerjev je preveč, da. Res dobrih webdi...",This is conversation in slovenian language. In...
98538,sem koncal solo za avtomehanika tako da nebo p...,Eni grejo pač na hard way..,This is conversation in slovenian language. In...
143252,Pravkar sem z Mbills na Aliexpresu kupil en iz...,Jup. Dolar se je v zadnjih nekaj tednih okrepi...,This is conversation in slovenian language. In...
347860,"Čiščenje turbo polnilnika. Pozdravljeni, kupil...",Bencinar nima dpf-ja,This is conversation in slovenian language. In...
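
The split sizes follow from `train_size=0.01`. A small arithmetic sketch, assuming scikit-learn floors the training count and assigns the remainder to the test set when only `train_size` is given, and taking the total row count as the sum of the two dataset sizes this notebook reports further down (4,448 and 440,356):

```python
import math

n_rows = 4448 + 440356  # total examples: train + test sizes reported by the notebook
train_size = 0.01       # fraction passed to train_test_split

n_train = math.floor(train_size * n_rows)
n_test = n_rows - n_train
print(n_train, n_test)  # 4448 440356
```

Only about 1% of the data is used for training here. Raising `train_size` increases the number of training examples, and with it the number of optimizer steps per epoch.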


In [16]:
## Model preparation

In [17]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit", # New Google 6 trillion tokens model 2.5x faster!
    "unsloth/gemma-2b-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


config.json:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Mistral patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors:   0%|          | 0.00/4.14G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/137k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/560 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [18]:
import datasets
import pyarrow
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }


dataset_train = datasets.Dataset(pyarrow.Table.from_pandas(df_train))
dataset_train = dataset_train.map(formatting_prompts_func, batched = True,)

dataset_test = datasets.Dataset(pyarrow.Table.from_pandas(df_test))
dataset_test = dataset_test.map(formatting_prompts_func, batched = True,)

#from datasets import load_dataset
#dataset = load_dataset("leo009/alpaca-cleaned-zh-cn", split = "train")
#dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/4448 [00:00<?, ? examples/s]

Map:   0%|          | 0/440356 [00:00<?, ? examples/s]
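
To make the effect of `formatting_prompts_func` concrete, here is a standalone sketch that renders a single training example. The template text is copied from the cell above; the literal `"</s>"` stands in for `tokenizer.eos_token` (an assumption, check your tokenizer), and `format_example` is a hypothetical helper name.

```python
# Same template as the notebook's alpaca_prompt.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # assumption: stand-in for tokenizer.eos_token


def format_example(instruction, user_input, output):
    """Fill the template and append EOS so the model learns where to stop."""
    return ALPACA_PROMPT.format(instruction, user_input, output) + EOS_TOKEN


text = format_example(
    "This is conversation in slovenian language. Input will be in slovenian language. Response should also be in slovenian language.",
    "Ti TV podpira 4k? :)",
    "ne",
)
print(text)
```

The rendered string ends with the response followed directly by the EOS token, which is what teaches the model to terminate its answer.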

In [19]:
print(type(dataset_train))
dataset_train

<class 'datasets.arrow_dataset.Dataset'>


Dataset({
    features: ['input', 'output', 'instruction', '__index_level_0__', 'text'],
    num_rows: 4448
})

In [20]:
print(type(dataset_test))
dataset_test

<class 'datasets.arrow_dataset.Dataset'>


Dataset({
    features: ['input', 'output', 'instruction', '__index_level_0__', 'text'],
    num_rows: 440356
})

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [21]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
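
The number of trainable parameters can be checked by hand. The sketch below assumes the published Mistral-7B layer dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these come from the model's public config, not from this notebook. Each adapted weight matrix gains two low-rank factors, so LoRA adds `r * (in_features + out_features)` parameters per matrix.

```python
# Assumed Mistral-7B dimensions (from the published model config, not from this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
NUM_LAYERS = 32
R = 16            # LoRA rank used in the cell above

# (in_features, out_features) of each targeted projection.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r) per adapted matrix.
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

This matches the trainable-parameter count Unsloth prints when training starts. Against a base model of roughly 7 billion parameters, that is well under 1% for this particular configuration.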


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The original template runs 60 steps to speed things up; this run instead sets `num_train_epochs = 1` for a full pass over the training set and leaves `max_steps` commented out. We also support TRL's `DPOTrainer`!
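
A quick arithmetic sketch of how many optimizer steps one epoch takes with the settings used in this notebook's training cell (4,448 training examples, batch size 2, 4 gradient accumulation steps, 1 GPU). How a partial final batch is rounded can differ between library versions; here the division is exact, so it does not matter.

```python
import math

num_examples = 4448
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
num_train_epochs = 1

# One optimizer step consumes this many examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
print(effective_batch, total_steps)  # 8 556
```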

In [22]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_train,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        #max_steps = 60, # Set num_train_epochs = 1 for full training runs
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

  self.pid = os.fork()


Map (num_proc=2):   0%|          | 0/4448 [00:00<?, ? examples/s]

In [23]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
4.52 GB of memory reserved.


In [24]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,448 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 556
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.9698
2,3.0724
3,2.7023
4,2.62
5,2.4992
6,2.4849
7,2.712
8,2.3815
9,1.8592
10,2.0539


RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
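
The error message itself suggests `CUDA_LAUNCH_BLOCKING=1`. A minimal sketch of setting it from Python: the variable has to be in the environment before CUDA is initialized, so this belongs at the very top of a freshly restarted runtime, before `torch` or `unsloth` is imported. After an illegal memory access the CUDA context of the process is generally unusable, which would explain why later GPU cells in this run fail with the same error until the runtime is restarted.

```python
import os

# Must run before torch/unsloth are imported in a fresh runtime.
# Synchronous kernel launches make the traceback point at the failing operation.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down, so this setting is best used only while reproducing the failure.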


In [25]:
print(1)

1


In [26]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

NameError: name 'trainer_stats' is not defined

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [27]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "This is conversation in slovenian language. Input will be in slovenian language. Response should also be in slovenian language.", # instruction
        "Vroče k hudič....kaj storiti ? novo kišto ?", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
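
When the call succeeds, `tokenizer.batch_decode(outputs)` returns the full prompt followed by the completion, because `generate` returns the input tokens plus the newly generated ones for decoder-only models. A sketch of pulling out just the response follows; the helper name `extract_response`, the sample decoded string, and the `<s>`/`</s>` markers are assumptions for illustration, not actual model output.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded, eos_token="</s>"):
    """Return the text after the last response marker, without the EOS token."""
    _, found, tail = decoded.rpartition(RESPONSE_MARKER)
    if not found:
        raise ValueError("response marker not found in decoded text")
    return tail.replace(eos_token, "").strip()


# Hypothetical decoded string shaped like the Alpaca prompt used in this notebook.
decoded = (
    "<s> Below is an instruction...\n\n"
    "### Instruction:\n...\n\n"
    "### Input:\nTi TV podpira 4k? :)\n\n"
    "### Response:\nne</s>"
)
print(extract_response(decoded))  # ne
```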


You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the whole output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "This is conversation in slovenian language. Input will be in slovenian language. Response should also be in slovenian language.", # instruction
        "Vroče k hudič....kaj storiti ? novo kišto ?", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.model',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

To load the LoRA adapters we just saved for inference, keep the `if True:` guard below (set it to `False` to skip reloading):

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "", # instruction
        "", # input
        "", # output - leave this blank for generation!
    ),
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True)
tokenizer.batch_decode(outputs)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if True:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if True: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if True: model.push_to_hub_merged("leo009/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if True: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if True: model.push_to_hub_merged("leo009/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if True: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if True: model.push_to_hub_merged("leo009/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
import os

# Read the Hugging Face token from the environment instead of hard-coding a secret.
hf_token = os.environ.get("HF_TOKEN", "")

# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
if True: model.push_to_hub_gguf("leo009/mistral-7b-v3", tokenizer, token = hf_token)

# Save to 16bit GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if True: model.push_to_hub_gguf("leo009/mistral-7b-v3", tokenizer, quantization_method = "f16", token = hf_token)

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("leo009/mistral-7b-v3", tokenizer, quantization_method = "q4_k_m", token = hf_token)
# Also push a q5_k_m GGUF
if True: model.push_to_hub_gguf("leo009/mistral-7b-v3", tokenizer, quantization_method = "q5_k_m", token = hf_token)

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).