To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the [installation instructions](https://docs.unsloth.ai/get-started/installing-+-updating) in our docs.

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [3]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

#### Text Completion / Raw Text Training
This is a community notebook collaboration with [Mithex].

We train on `Tiny Stories` (link [here](https://huggingface.co/datasets/roneneldan/TinyStories)) which is a collection of small stories. For example:
```
Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun.
Beep was a healthy car because he always had good fuel....
```
Instead of `Alpaca`'s question-answer format, you only need one column: the `"text"` column. This means you can finetune on any dataset and let your model act as a text completion model, for example for novel writing.
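As a rough illustration of the one-column layout, the sketch below turns a blob of raw text into records that carry a single `"text"` key. The helper name `make_text_records` and the blank-line splitting rule are assumptions made for this example; they are not part of Unsloth or `datasets`.

```python
def make_text_records(raw_text):
    """Split raw text on blank lines and wrap each chunk as {"text": ...}."""
    chunks = [chunk.strip() for chunk in raw_text.split("\n\n")]
    # Drop empty chunks so every record holds real text.
    return [{"text": chunk} for chunk in chunks if chunk]

raw = (
    "Once upon a time, there was a little car named Beep.\n\n"
    "Beep loved to go fast and play in the sun."
)
records = make_text_records(raw)
print(records)
```

A list shaped like this can then be handed to `datasets.Dataset.from_list(records)` to get a dataset with one `"text"` column.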


In [4]:
# Run this to disable CCE since it is not supported for CPT.
# Keep the comment on its own line: %env would otherwise store it as part of the value.
%env UNSLOTH_RETURN_LOGITS=1

env: UNSLOTH_RETURN_LOGITS=1


In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3", # "unsloth/mistral-7b" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.6: Fast Mistral patching. Transformers: 4.53.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

We also add `embed_tokens` and `lm_head` so the model can learn out-of-distribution data.

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",

                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2025.7.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
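To make the "1 to 10%" figure concrete, here is a small arithmetic sketch of how many parameters the configuration above trains. The layer shapes (hidden size 4096, MLP width 14336, key/value width 1024, vocabulary 32768, 32 layers) are assumptions based on the standard Mistral 7B v0.3 layout, and the sketch assumes `embed_tokens` and `lm_head` are trained as full matrices rather than low-rank adapters.

```python
# Assumed Mistral 7B v0.3 shapes; RANK matches r = 128 in the cell above.
HIDDEN, MLP, KV, VOCAB, LAYERS, RANK = 4096, 14336, 1024, 32768, 32, 128

# (in_features, out_features) of each LoRA-wrapped projection in one layer.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)

adapter_params = LAYERS * sum(
    lora_params(i, o, RANK) for i, o in projections.values()
)
# embed_tokens and lm_head, each VOCAB x HIDDEN, assumed fully trainable.
full_matrix_params = 2 * VOCAB * HIDDEN
trainable = adapter_params + full_matrix_params

print(f"{adapter_params:,} + {full_matrix_params:,} = {trainable:,}")
# prints "335,544,320 + 268,435,456 = 603,979,776"
```

Under these assumptions the total matches the `Trainable parameters = 603,979,776` figure that the trainer banner reports later in this notebook.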


<a name="Data"></a>
### Data Prep
We now load the `OnalBusra/Asansor` dataset from Hugging Face, a set of Turkish elevator-support instruction/response pairs. The code requests the first 2,500 rows of the train split; the split holds 1,790 examples, so all of them are used. We must append `EOS_TOKEN` (that is, `tokenizer.eos_token`) to every example, or else the model's generation will go on forever.

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb)

In [5]:
from datasets import load_dataset
dataset = load_dataset("OnalBusra/Asansor", split = "train[:2500]")
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    return { "text" : [example + EOS_TOKEN for example in examples["text"]] }
dataset = dataset.map(formatting_prompts_func, batched = True,)

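The batched formatting step can be tried without downloading anything. This sketch runs the same function on a hand-made batch; the literal `"</s>"` stands in for `tokenizer.eos_token` and is an assumption for the example (it is the end-of-sequence marker visible in the printed samples of this notebook).

```python
EOS_TOKEN = "</s>"  # assumed stand-in for tokenizer.eos_token

def formatting_prompts_func(examples):
    # With batched=True, `examples` maps column names to lists of values.
    return {"text": [example + EOS_TOKEN for example in examples["text"]]}

batch = {"text": ["First story.", "Second story."]}
formatted = formatting_prompts_func(batch)
print(formatted["text"])
```

Every string ends with the end-of-sequence marker, which is what teaches the model where a completion should stop.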

Print out the first 5 examples from the dataset:

In [6]:
for row in dataset[:5]["text"]:
    print("=========================")
    print(row)

### Instruction:
Asansör kabinindeki havalandırma çalışmıyor, ne yapmalıyım?

### Response:
Kabin havalandırması düzenli bakım gerektiren bir ekipmandır. Havalandırma çalışmıyorsa, önce sigortaları kontrol edin. Sorun devam ederse, Butkon teknik servisiyle iletişime geçerek profesyonel destek talep edin. Konforunuz için bu sorunun hızlıca çözülmesi önemlidir.</s>
### Instruction:
Kabin içindeki kapasite etiketi silinmiş, ne yapmalıyım?

### Response:
Bu acil bir durum. Kullanıcı güvenliği için kapasite etiketi derhal yenilenmeli. Lütfen teknik servisi arayın, Butkon ekibi en kısa sürede müdahale edecektir.</s>
### Instruction:
Asansör kabinindeki acil durum butonuna bastığımda kimse cevap vermiyor. Ne yapmalıyım?

### Response:
Öncelikle sakin olun. Butonun çalıştığından emin olmak için bir kez daha basın. Eğer yanıt alınmazsa, cep telefonunuzla 112 Acil veya Sen Butkon teknik destek hattını (belgelerde belirtilen numara) arayın. Kabin içindeki talimatların yanında iletişim bilgileri d

<a name="Train"></a>
### Continued Pretraining
Now let's use Unsloth's `UnslothTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for one full epoch (`num_train_epochs = 1`, which is 112 optimizer steps on this dataset). For a quicker smoke test, set `max_steps` to a small number such as 20 instead.

Also set `embedding_learning_rate` 2x to 10x smaller than `learning_rate` to make continual pretraining work! Here it is `5e-6` against a `learning_rate` of `5e-5`.

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        warmup_ratio = 0.1,
        num_train_epochs = 1,

        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,

        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/1790 [00:00<?, ? examples/s]
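The step count follows directly from the batch settings above. The sketch below reproduces that arithmetic; the ceiling-based rounding for the warmup length is an assumption about how the ratio is turned into whole steps.

```python
import math

num_examples = 1790
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1
num_train_epochs = 1
warmup_ratio = 0.1

# Examples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Micro-batches per epoch, then optimizer steps per epoch (a partial group still counts).
micro_batches = math.ceil(num_examples / (per_device_train_batch_size * num_gpus))
steps_per_epoch = math.ceil(micro_batches / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounding rule assumed

print(effective_batch, total_steps, warmup_steps)
```

The effective batch size of 16 and the 112 total steps agree with the banner the trainer prints when training starts.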

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
6.367 GB of memory reserved.


In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,790 | Num Epochs = 1 | Total steps = 112
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 603,979,776 of 7,852,003,328 (7.69% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|---|---|
| 1 | 0.7255 |
| 2 | 0.6742 |
| 3 | 0.6754 |
| 4 | 0.6245 |
| 5 | 0.2354 |
| 6 | 0.0536 |
| 7 | 0.0254 |
| 8 | 0.0216 |
| 9 | 0.0193 |
| 10 | 0.0092 |


In [1]:
# Mount Google Drive (if you are using Colab)
from google.colab import drive
drive.mount('/content/drive')

# Save directory (example)
save_path = "/content/drive/MyDrive/Butkon_UnsLoTh_Model"

# Save the model
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)


Mounted at /content/drive


NameError: name 'model' is not defined

In [2]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from google.colab import drive

drive.mount('/content/drive')

# Adjust these paths to your own setup
drive_lora_path  = "/content/drive/MyDrive/Butkon_UnsLoTh_Model"  # LoRA weights live here
drive_merged_path = "/content/drive/MyDrive/Butkon_UnsLoTh_Model_merged"  # Destination for the merged model

# Load the base model (must be the same base model used for training)
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/mistral-7b-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Load the LoRA adapter and merge it
lora_model = PeftModel.from_pretrained(
    base_model,
    drive_lora_path,
    torch_dtype=torch.float16
)

merged_model = lora_model.merge_and_unload()  # merge_and_unload is the key step here

# Save the model
os.makedirs(drive_merged_path, exist_ok=True)
merged_model.save_pretrained(
    drive_merged_path,
    safe_serialization=True
)

# Save the tokenizer as well
tokenizer = AutoTokenizer.from_pretrained(
    "unsloth/mistral-7b-v0.3",
    trust_remote_code=True
)
tokenizer.save_pretrained(drive_merged_path)

# Message: "Model merged successfully and saved to the '<path>' folder."
print(f"✅ Model başarıyla birleştirildi ve '{drive_merged_path}' klasörüne kaydedildi.")


Mounted at /content/drive


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



✅ Model başarıyla birleştirildi ve '/content/drive/MyDrive/Butkon_UnsLoTh_Model_merged' klasörüne kaydedildi.
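Conceptually, merging folds each low-rank update back into the frozen weight, `W' = W + scale * (B @ A)`. The sketch below shows that arithmetic on tiny matrices in plain Python. It is not PEFT's implementation; the helper names are hypothetical, and the `alpha / sqrt(rank)` scale is the rank-stabilized LoRA convention that `use_rslora = True` selects.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank, use_rslora=True):
    # rsLoRA scales by alpha / sqrt(rank); classic LoRA uses alpha / rank.
    scale = alpha / math.sqrt(rank) if use_rslora else alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x rank) @ (rank x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # out = 2, in = 3
lora_a = [[1.0, 2.0, 3.0]]                   # rank = 1, in = 3
lora_b = [[0.5], [-0.5]]                     # out = 2, rank = 1

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # prints [[2.0, 2.0, 3.0], [-1.0, -1.0, -3.0]]
```

After the merge, the result is an ordinary weight matrix with no adapter attached, which is why the merged folder can be converted like any other Hugging Face checkpoint.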


In [9]:
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && mkdir build && cd build && cmake .. && cmake --build .


Cloning into 'llama.cpp'...
remote: Enumerating objects: 57001, done.
remote: Counting objects: 100% (200/200), done.
remote: Compressing objects: 100% (154/154), done.
remote: Total 57001 (delta 137), reused 47 (delta 46), pack-reused 56801 (from 3)
Receiving objects: 100% (57001/57001), 136.09 MiB | 14.53 MiB/s, done.
Resolving deltas: 100% (41255/41255), done.
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMA

In [3]:
#!ls -lh asansor_finetuned_merged/*.gguf
!ls -lh /content/drive/MyDrive/Butkon_Merged_Complete_Model/*.gguf

ls: cannot access '/content/drive/MyDrive/Butkon_Merged_Complete_Model/*.gguf': No such file or directory


In [10]:
!cd llama.cpp && python3 convert_hf_to_gguf.py /content/drive/MyDrive/Butkon_UnsLoTh_Model_merged --outtype f16


INFO:hf-to-gguf:Loading model: Butkon_UnsLoTh_Model_merged
INFO:hf-to-gguf:Model architecture: MistralForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00003.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {4096, 32768}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {4096, 1024}
INFO:hf-

In [11]:
!cd llama.cpp && ./build/bin/llama-quantize \
  /content/drive/MyDrive/Butkon_UnsLoTh_Model_merged/Butkon_UnsLoTh_Model_merged-7.2B-F16.gguf \
  /content/drive/MyDrive/Butkon_UnsLoTh_Model_merged/butkon_unsloth_model.q4_0.gguf \
  q4_0

main: build = 5952 (92204260)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/drive/MyDrive/Butkon_UnsLoTh_Model_merged/Butkon_UnsLoTh_Model_merged-7.2B-F16.gguf' to '/content/drive/MyDrive/Butkon_UnsLoTh_Model_merged/butkon_unsloth_model.q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 37 key-value pairs and 291 tensors from /content/drive/MyDrive/Butkon_UnsLoTh_Model_merged/Butkon_UnsLoTh_Model_merged-7.2B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Butkon_UnsLoTh_Model_merged
llama_model_loader: - kv   3:                         general.size_l
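For intuition about what 4-bit block quantization does to the weights, here is a simplified sketch of symmetric quantization over blocks of 32 values. It is written in the spirit of `q4_0` but is not llama.cpp's actual code: the function names are hypothetical and the real format differs in how it picks the scale and packs the codes.

```python
BLOCK_SIZE = 32  # q4_0 groups weights into blocks of 32

def quantize_block(values):
    # One shared scale per block; codes are signed 4-bit integers in [-8, 7].
    amax = max(abs(v) for v in values)
    scale = amax / 7 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

# A deterministic toy block of 32 "weights" between -0.32 and 0.31.
block = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# Storage: 32 four-bit codes (16 bytes) plus one 16-bit scale (2 bytes).
bytes_per_block = BLOCK_SIZE * 4 // 8 + 2
bits_per_weight = bytes_per_block * 8 / BLOCK_SIZE

print(f"max round-trip error: {max_error:.5f}")
print(f"bits per weight: {bits_per_weight}")
```

At 4.5 bits per weight instead of 16, the quantized file is expected to be a bit more than a quarter of the F16 size, at the cost of a small rounding error per weight.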

In [13]:
!pip install llama-cpp-python


Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.14.tar.gz (51.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.0/51.0 MB 38.4 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.0 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.3.14-cp311-cp311-linux_x86_64.whl size=4295719 sha256=1b1439264a0f85dffc

In [22]:
from llama_cpp import Llama
import gradio as gr
import traceback
import os
import torch

MODEL_PATH = "/content/drive/MyDrive/Butkon_UnsLoTh_Model_merged/butkon_unsloth_model.q4_0.gguf"

try:
    llm = Llama(
        model_path=MODEL_PATH,
        n_ctx=2048,
        n_threads=os.cpu_count(),
        n_gpu_layers=-1 if torch.cuda.is_available() else 0,
        chat_format="chatml",
    )
    print("✅ Llama modeli başarıyla yüklendi.")  # "Llama model loaded successfully."
except Exception:
    print("💥 Model yüklenirken hata oluştu!")  # "Error while loading the model!"
    traceback.print_exc()
    raise

def generate_response(user_input, chat_history):
    try:
        messages = [{"role": "system", "content": "Butkon asansör firmasının yapay zeka destek asistanısın."}]

        for human_msg, assistant_msg in chat_history:
            messages.append({"role": "user", "content": human_msg})
            messages.append({"role": "assistant", "content": assistant_msg})

        messages.append({"role": "user", "content": user_input})

        response = llm.create_chat_completion(
            messages=messages,
            temperature=0.7,
            top_p=0.9,
            max_tokens=200,
        )

        # 🛠 Print the raw response for debugging
        print("🔁 Model cevabı:", response)

        # Try to extract the answer text
        if isinstance(response, dict) and "choices" in response:
            message = response["choices"][0].get("message", {})
            answer = (message.get("content") or "").strip()
        else:
            answer = "⚠️ Beklenmedik model yanıt formatı."  # "Unexpected model response format."

        return "", chat_history + [[user_input, answer]]

    except Exception as e:
        print("💥 Yanıt üretilirken hata oluştu!")
        traceback.print_exc()
        return "", chat_history + [[user_input, f"⚠️ Hata oluştu: {str(e)}"]]

with gr.Blocks() as demo:
    gr.Markdown("## 🛗 Butkon Asansör Destek Chatbot")
    chatbot = gr.Chatbot(height=400)
    user_input = gr.Textbox(placeholder="Sorunuzu yazın...", label="Mesaj")
    send_btn = gr.Button("Gönder")
    clear_btn = gr.Button("Temizle")

    user_input.submit(generate_response, [user_input, chatbot], [user_input, chatbot])
    send_btn.click(generate_response, [user_input, chatbot], [user_input, chatbot])
    clear_btn.click(lambda: [], None, chatbot)

try:
    demo.launch(server_name="0.0.0.0", server_port=7875, share=True)
except Exception as e:
    print("💥 Gradio başlatılırken hata oluştu!")
    traceback.print_exc()
    raise e


llama_model_loader: loaded meta data with 37 key-value pairs and 291 tensors from /content/drive/MyDrive/Butkon_UnsLoTh_Model_merged/butkon_unsloth_model.q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Butkon_UnsLoTh_Model_merged
llama_model_loader: - kv   3:                         general.size_label str              = 7.2B
llama_model_loader: - kv   4:                   general.base_model.count u32              = 1
llama_model_loader: - kv   5:                  general.base_model.0.name str              = Mistral 7b v0.3 Bnb 4bit
llama_model_loader: - kv   6:               general.base_model.0.version str           

✅ Llama modeli başarıyla yüklendi.
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://e966d4e68c938d3cd1.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
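The training examples printed earlier use an `### Instruction:` / `### Response:` layout, while the chatbot above talks to the model through the `chatml` chat format. If the answers look off, one option worth trying is to prompt in the same layout as the training data. The helpers below are a hypothetical sketch of that idea, with the exact whitespace assumed from the printed samples.

```python
def format_prompt(question):
    # Mirror the layout of the training examples; exact spacing is assumed.
    return f"### Instruction:\n{question.strip()}\n\n### Response:\n"

def extract_answer(generated_text):
    # Keep only the text before the model starts a new instruction block.
    return generated_text.split("### Instruction:")[0].strip()

prompt = format_prompt("Asansör kabinindeki havalandırma çalışmıyor, ne yapmalıyım?")
print(prompt)
```

With llama-cpp-python, a prompt built this way could be sent through `llm.create_completion(prompt, max_tokens=200, stop=["### Instruction:"])`, reading the text from `["choices"][0]["text"]` instead of going through `create_chat_completion`.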


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2780.9282 seconds used for training.
46.35 minutes used for training.
Peak reserved memory = 11.432 GB.
Peak reserved memory for training = 5.065 GB.
Peak reserved memory % of max memory = 77.516 %.
Peak reserved memory for training % of max memory = 34.344 %.


<a name="Inference"></a>
### Inference
Let's run the model!

First, we check whether the model follows the style of its training data and writes a story within the distribution of "Tiny Stories", i.e. most likely a story fit for bedtime.

We use the prompt "Once upon a time, in a galaxy, far far away," since it is normally associated with Star Wars.

In [None]:
from transformers import TextIteratorStreamer
from threading import Thread
text_streamer = TextIteratorStreamer(tokenizer)
import textwrap
max_print_width = 100

# Before running inference, call `FastLanguageModel.for_inference` first

FastLanguageModel.for_inference(model)

inputs = tokenizer(
[
    "Once upon a time, in a galaxy, far far away,"
]*1, return_tensors = "pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 256,
    use_cache = True,
)
thread = Thread(target = model.generate, kwargs = generation_kwargs)
thread.start()

# Print streamed tokens, breaking the line once it grows past max_print_width
length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width = max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end = "")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end = "")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>Once upon a time, in a galaxy, far faraway, there was a little girl named Lily. She loved to 
play with her toys and explore the universe. One day, she found a big, shiny rock. She picked it up and 
put it in her pocket.

Lily went to play with her friends, but she forgot about the rock. When she 
came back home, she realized that she had lost the rock. She was very sad and started to cry.

Her mom 
saw her crying and asked her what was wrong. Lily told her about the rock and how she lost it. Her mom 
said, "Don't worry, we can find it again." They went back to the place where Lily found the rock and 
searched for it. After a while, they found the rock and Lily was very happy. She learned that it's 
important to take care of her things and not to lose them. The end.</s>

And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

