### Text Completion / Raw Text Training
This is a community notebook collaboration with [Mithex].

We train on `Tiny Stories` (link [here](https://huggingface.co/datasets/roneneldan/TinyStories)) which is a collection of small stories. For example:
```
Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun.
Beep was a healthy car because he always had good fuel....
```
Unlike `Alpaca`'s question-answer format, raw text training needs only one column: the `"text"` column. This means you can finetune on any text dataset and have your model act as a text completion model, for example for novel writing.
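
To make the single-column format concrete, here is a minimal sketch in plain Python that packs your own raw text into `{"text": ...}` records. The helper name `to_text_records` and the `max_chars` budget are hypothetical choices for illustration; they are not part of Unsloth or this notebook.

```python
def to_text_records(raw_text, max_chars=2000):
    """Split raw text on blank lines and pack paragraphs into {"text": ...} records.

    A record is closed once adding the next paragraph would exceed max_chars.
    A single paragraph longer than max_chars is kept whole in its own record.
    """
    records, current = [], ""
    for paragraph in raw_text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty pieces produced by runs of blank lines
        if current and len(current) + len(paragraph) + 2 > max_chars:
            records.append({"text": current})
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        records.append({"text": current})
    return records


sample = "Chapter one.\n\nThe car drove fast.\n\n\n\nChapter two."
records = to_text_records(sample, max_chars=40)
for record in records:
    print(repr(record["text"]))
```

A list like `records` could then be turned into a Hugging Face dataset with `datasets.Dataset.from_list(records)` and used in place of Tiny Stories.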

---

To run this, press "Runtime" and press "Run all" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (eg for Llama.cpp).

In [None]:
%%capture
# Install unsloth library along with xformers version 0.0.28.post2
!pip install unsloth "xformers==0.0.28.post2"

# Uninstall any existing version of unsloth to avoid conflicts
# Then, reinstall the latest nightly version directly from the GitHub repository
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Note: Using %%capture suppresses output to keep the notebook output clean.
#       Remove %%capture if you want to view installation logs and messages.


* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen, Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Llama-3 15 trillion tokens **2x faster**! See our [Llama-3 notebook](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing)
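
As a rough illustration of the RoPE scaling bullet above: kaiokendev's method is commonly described as linear position interpolation, where position indices are compressed so that a longer sequence fits into the position range the model was trained on. The sketch below only shows that arithmetic. The function names and the trained context length of 4096 are assumptions for illustration, not Unsloth's implementation.

```python
def rope_scaling_factor(max_seq_length, trained_context_length):
    """Linear factor by which positions are compressed; 1.0 means no scaling is needed."""
    return max(1.0, max_seq_length / trained_context_length)


def interpolated_positions(seq_len, factor):
    """Position indices after linear interpolation (each index divided by the factor)."""
    return [i / factor for i in range(seq_len)]


# Hypothetical numbers: a model trained on 4096 positions, asked to handle 8192
factor = rope_scaling_factor(max_seq_length=8192, trained_context_length=4096)
positions = interpolated_positions(seq_len=8, factor=factor)
print(factor, positions)
```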

In [None]:
from unsloth import FastLanguageModel
import torch

# Set model configuration parameters
max_seq_length = 2048  # Define maximum sequence length, RoPE Scaling is auto-supported
dtype = None           # Set to None for auto-detection. Use Float16 for Tesla T4/V100, Bfloat16 for Ampere+
load_in_4bit = True    # Enable 4-bit quantization for reduced memory usage; can be set to False for full precision

# List of 4-bit pre-quantized models for faster downloading and reduced memory footprint
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # Mistral v3 model, optimized for 2x faster inference
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 model with 15 trillion tokens, optimized for 2x speed
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 model, optimized for 2x faster inference
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma model, optimized for 2.2x faster inference
]  # Additional models available at https://huggingface.co/unsloth

# Load the specified model and tokenizer with 4-bit quantization settings
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.3",  # Specify the model to load, replace with 16-bit model if needed
    max_seq_length=max_seq_length,         # Apply max sequence length as defined above
    dtype=dtype,                           # Data type for model loading
    load_in_4bit=load_in_4bit,             # Use 4-bit quantization for efficient memory usage
    # token="hf_...",  # Uncomment and add token if using gated models like Llama-2-7b-hf
)

# Confirm the model and tokenizer are loaded successfully
print("Model and tokenizer loaded with the following configuration:")
print("Model Name: unsloth/mistral-7b-v0.3")
print(f"Max Sequence Length: {max_seq_length}")
print(f"Data Type: {dtype if dtype else 'Auto-detected'}")
print(f"4-bit Quantization Enabled: {load_in_4bit}")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers, TRL and unsloth via:
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`


Model and tokenizer loaded with the following configuration:
Model Name: unsloth/mistral-7b-v0.3
Max Sequence Length: 2048
Data Type: Auto-detected
4-bit Quantization Enabled: True
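
A back-of-the-envelope estimate suggests why 4-bit loading matters on this GPU. The sketch below assumes roughly 7.25 billion parameters for the model (an approximate figure assumed here, not taken from the log) and counts weights only, ignoring quantization constants, activations, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


ASSUMED_PARAMS = 7.25e9   # assumption: approximate parameter count of a 7B-class model
T4_MEMORY_GIB = 14.748    # "Max memory" reported in the startup log above

four_bit = weight_memory_gib(ASSUMED_PARAMS, 4)
sixteen_bit = weight_memory_gib(ASSUMED_PARAMS, 16)

print(f"4-bit weights:  ~{four_bit:.1f} GiB")
print(f"16-bit weights: ~{sixteen_bit:.1f} GiB of {T4_MEMORY_GIB} GiB available")
```

Under these assumptions, 16-bit weights alone would take most of the T4's memory, leaving little room for training, while 4-bit weights leave most of it free.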


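We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

The configuration below also adds `lm_head` to the target modules so the model can learn out-of-distribution data. `embed_tokens` can be added to `target_modules` in the same way; it is not included in the list below.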
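
The number of trainable parameters can be worked out by hand. The sketch below counts LoRA parameters for the five projection modules targeted in this notebook and adds `lm_head`. It rests on two assumptions: the layer sizes are those of Mistral-7B v0.3 (hidden size 4096, MLP size 14336, key/value projection width 1024, vocabulary 32768, 32 layers), and `lm_head` is trained in full rather than through a low-rank adapter. Under these assumptions the total agrees with the 415,236,096 trainable parameters reported in the training log of this notebook.

```python
# Assumed Mistral-7B v0.3 dimensions (not read from the model object)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024
VOCAB = 32768
NUM_LAYERS = 32
RANK = 128  # r in the PEFT configuration


def lora_params(in_features, out_features, rank):
    """A LoRA adapter adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = {
    "v_proj": lora_params(HIDDEN, KV_DIM, RANK),
    "o_proj": lora_params(HIDDEN, HIDDEN, RANK),
    "gate_proj": lora_params(HIDDEN, INTERMEDIATE, RANK),
    "up_proj": lora_params(HIDDEN, INTERMEDIATE, RANK),
    "down_proj": lora_params(INTERMEDIATE, HIDDEN, RANK),
}

adapter_total = NUM_LAYERS * sum(per_layer.values())
lm_head_total = VOCAB * HIDDEN  # assumption: lm_head is trained as a full matrix
trainable_total = adapter_total + lm_head_total

print(f"{trainable_total:,}")  # prints 415,236,096
```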

In [None]:
# Print all module names in MistralModel to confirm their availability
# for name, module in model.named_modules():
#     print(name)


In [None]:
# Configure the model with Parameter-Efficient Fine-Tuning (PEFT) settings using confirmed modules
model = FastLanguageModel.get_peft_model(
    model=model,  # The base model to apply PEFT on
    r=128,  # LoRA rank parameter for adaptation capacity, recommended values: 8, 16, 32, 64, 128
    target_modules=[
        "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head"
    ],  # Only include modules confirmed from your model structure

    lora_alpha=32,  # Scaling factor for LoRA adaptation, increasing impact of LoRA-modified weights
    lora_dropout=0,  # LoRA dropout, set to 0 for optimized performance without dropout

    bias="none",  # Bias handling mode, "none" is optimized for resource efficiency

    # Memory optimization options:
    # - "unsloth" mode enables 30% less VRAM usage and allows 2x larger batch sizes
    use_gradient_checkpointing="unsloth",  # Use "unsloth" gradient checkpointing to reduce memory usage for long contexts

    random_state=3407,  # Seed for random processes to ensure reproducibility

    use_rslora=True,  # Enable Rank-Stabilized LoRA (RS-LoRA) for improved training stability

    loftq_config=None,  # Leave as None unless specific LoftQ quantization configuration is needed
)

# Print confirmation of settings
print("PEFT configuration successfully applied with the following modules:")
print("Target modules:", ["v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head"])


Unsloth: Offloading output_embeddings to disk to save VRAM


  offloaded_W = torch.load(filename, map_location = "cpu", mmap = True)
Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2024.10.7 patched 32 layers with 0 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Training lm_head in mixed precision to save VRAM
PEFT configuration successfully applied with the following modules:
Target modules: ['v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'lm_head']
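
The configuration above combines `r=128`, `lora_alpha=32`, and `use_rslora=True`. With rank-stabilized LoRA, PEFT scales the adapter output by `lora_alpha / sqrt(r)` instead of the standard `lora_alpha / r`, which matters at high ranks. The helper below is a hypothetical illustration of that arithmetic, not a PEFT function.

```python
import math


def lora_scaling(lora_alpha, r, use_rslora):
    """Scaling applied to the adapter output: alpha/sqrt(r) with rsLoRA, alpha/r otherwise."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


standard = lora_scaling(lora_alpha=32, r=128, use_rslora=False)
rank_stabilized = lora_scaling(lora_alpha=32, r=128, use_rslora=True)

print(f"standard LoRA scaling: {standard}")
print(f"rsLoRA scaling:        {rank_stabilized:.3f}")
```

At this rank the standard scaling would shrink the adapter contribution to a quarter, whereas the rank-stabilized variant keeps it close to 2.8.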


<a name="Data"></a>
### Data Prep
We now use the Tiny Stories dataset from https://huggingface.co/datasets/roneneldan/TinyStories. We only sample the first 2500 rows to speed training up. We must append `EOS_TOKEN` (that is, `tokenizer.eos_token`) to every example, or else the model's generation will go on forever.
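
When a formatting function is applied more than once (for example after re-running cells on an already mapped dataset), the EOS token can end up appended twice. A small sketch of a batched function that avoids this is shown below. The name `append_eos` is hypothetical, and `"</s>"` stands in for `tokenizer.eos_token`, which is what you would pass in the notebook.

```python
def append_eos(batch, eos_token="</s>"):
    """Batched map function: add eos_token to each text unless it is already there."""
    return {
        "text": [
            text if text.endswith(eos_token) else text + eos_token
            for text in batch["text"]
        ]
    }


# A batch in the dict-of-lists shape that `dataset.map(..., batched=True)` passes in
batch = {"text": ["Once upon a time.", "The end.</s>"]}
once = append_eos(batch)
twice = append_eos(once)
print(once["text"])
```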

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

In [None]:
from datasets import load_dataset

# Load a subset of the "TinyStories" dataset, taking only the first 2500 training samples
dataset = load_dataset("roneneldan/TinyStories", split="train[:2500]")

# Define the End Of Sequence (EOS) token from the tokenizer
EOS_TOKEN = tokenizer.eos_token  # This token marks the end of each training example

# Define a function to format prompts by appending the EOS token to each text example
def formatting_prompts_func(examples):
    # Append EOS_TOKEN to each text in the examples batch
    return {"text": [example + EOS_TOKEN for example in examples["text"]]}

# Apply the formatting function to the dataset in a batched manner
dataset = dataset.map(formatting_prompts_func, batched=True)

# Print a sample to confirm the EOS token is added correctly
print("Sample formatted data:", dataset[0]["text"])



Sample formatted data: One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.</s>


Print out 5 stories from `Tiny Stories`

In [None]:
for row in dataset[:5]["text"]:
    print("=========================")
    print(row)

One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.</s>
Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.

One day, Beep was driving in the park when he saw a big tree. The tree had many leaves that were fallin

<a name="Train"></a>
### Continued Pretraining
Now let's use Unsloth's `UnslothTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The configuration below trains for one full epoch (`num_train_epochs=1`). To speed things up for a quick test, set `max_steps` to a small number such as 20 instead.

Also set `embedding_learning_rate` to be 2x to 10x smaller than `learning_rate` to make continual pretraining work!
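
To see how the batch settings and the cosine schedule interact, the sketch below reproduces the step count and the shape of the learning-rate curve in plain Python. It assumes the floor-division convention for optimizer steps per epoch (which agrees with the 156 total steps in the training log of this notebook) and assumes that the warmup length is rounded up. The function names are hypothetical.

```python
import math


def optimizer_steps_per_epoch(num_examples, per_device_batch, grad_accum):
    """Micro-batches per epoch, floor-divided by the gradient accumulation steps."""
    micro_batches = math.ceil(num_examples / per_device_batch)
    return micro_batches // grad_accum


def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = optimizer_steps_per_epoch(num_examples=2500, per_device_batch=2, grad_accum=8)
warmup_steps = math.ceil(0.1 * total_steps)  # assumption: warmup_ratio rounds up

for step in (0, warmup_steps, total_steps // 2, total_steps):
    main_lr = cosine_lr(step, total_steps, warmup_steps, base_lr=5e-5)
    embed_lr = cosine_lr(step, total_steps, warmup_steps, base_lr=5e-6)
    print(f"step {step:3d}: lr = {main_lr:.2e}, embedding lr = {embed_lr:.2e}")
```

Both learning rates follow the same curve; the embedding rate simply stays ten times smaller throughout.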

In [None]:
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

# Initialize the UnslothTrainer with the model, tokenizer, and training dataset
trainer = UnslothTrainer(
    model=model,               # The model to fine-tune
    tokenizer=tokenizer,       # Tokenizer associated with the model
    train_dataset=dataset,     # Training dataset with formatted prompts
    dataset_text_field="text", # Field in the dataset containing text data
    max_seq_length=max_seq_length,  # Maximum sequence length
    dataset_num_proc=8,        # Number of processes used to tokenize the dataset

    # Define training arguments for UnslothTrainer
    args=UnslothTrainingArguments(
        per_device_train_batch_size=2,     # Batch size per device during training
        gradient_accumulation_steps=8,     # Accumulate gradients over multiple steps

        warmup_ratio=0.1,                  # Ratio of steps for warmup
        num_train_epochs=1,                # Number of training epochs

        learning_rate=5e-5,                # Base learning rate
        embedding_learning_rate=5e-6,      # Learning rate for embedding layer

        fp16=not is_bfloat16_supported(),  # Use FP16 if BF16 is not supported
        bf16=is_bfloat16_supported(),      # Use BF16 if supported by the device
        logging_steps=1,                   # Log metrics every step
        optim="adamw_8bit",                # Optimizer using 8-bit Adam for memory efficiency
        weight_decay=0.00,                 # Weight decay regularization
        lr_scheduler_type="cosine",        # Cosine learning rate scheduler
        seed=3407,                         # Seed for reproducibility
        output_dir="outputs",              # Directory to save model checkpoints and logs
        report_to="none",                  # Set to WandB or other if logging to external services
    ),
)

# Print confirmation of training setup
print("UnslothTrainer initialized with the following configuration:")
print("Batch size per device:", 2)
print("Gradient accumulation steps:", 8)
print("Learning rate:", 5e-5)
print("Embedding learning rate:", 5e-6)
print("Scheduler type:", "cosine")
print("Output directory:", "outputs")
print("FP16 enabled:", not is_bfloat16_supported())
print("BF16 enabled:", is_bfloat16_supported())
print("Logging steps:", 1)


UnslothTrainer initialized with the following configuration:
Batch size per device: 2
Gradient accumulation steps: 8
Learning rate: 5e-05
Embedding learning rate: 5e-06
Scheduler type: cosine
Output directory: outputs
FP16 enabled: True
BF16 enabled: False
Logging steps: 1


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.922 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 156
 "-____-"     Number of trainable parameters = 415,236,096


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers, TRL and Unsloth!
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`
Unsloth: Setting lr = 5.00e-06 instead of 5.00e-05 for lm_head.


Step,Training Loss
1,1.437
2,1.4719
3,1.4675
4,1.3156
5,1.4162
6,1.373
7,1.4207
8,1.1691
9,1.1698
10,1.2815
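
One way to read the loss values above is to convert them to perplexity, treating the logged loss as a mean per-token cross-entropy in nats (an assumption; the log itself warns about a gradient accumulation issue in this library version, which can affect the reported values). The sketch also smooths the curve with a simple moving average; the helper name is hypothetical.

```python
import math

# The ten loss values shown in the log above
logged_losses = [1.437, 1.4719, 1.4675, 1.3156, 1.4162, 1.373, 1.4207, 1.1691, 1.1698, 1.2815]

# Perplexity is the exponential of the mean cross-entropy
perplexities = [math.exp(loss) for loss in logged_losses]


def moving_average(values, window):
    """Mean of each consecutive run of `window` values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


smoothed = moving_average(logged_losses, window=5)
print(f"perplexity at step 1:  {perplexities[0]:.2f}")
print(f"perplexity at step 10: {perplexities[-1]:.2f}")
print(f"smoothed loss, first vs last window: {smoothed[0]:.3f} -> {smoothed[-1]:.3f}")
```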


In [None]:
if torch.cuda.is_available():
    # Total GPU memory in GB
    max_memory = torch.cuda.get_device_properties(0).total_memory / 1024 / 1024 / 1024
    # Reuse start_gpu_memory from the memory-stats cell that ran before training.
    # Re-reading torch.cuda.memory_reserved() here, after training, returns roughly the
    # peak value, so the training-only figure would come out as -0.0.

    # Calculate memory usage statistics
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)  # Peak reserved memory in GB
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)  # Memory used specifically for LoRA training in GB
    used_percentage = round(used_memory / max_memory * 100, 3)  # Percentage of peak memory usage relative to max memory
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)  # Percentage of LoRA-specific memory usage

    # Display time and memory statistics
    print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
    print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory % of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
else:
    print("CUDA device not available. Memory and time stats are not displayed.")


2711.3101 seconds used for training.
45.19 minutes used for training.
Peak reserved memory = 10.119 GB.
Peak reserved memory for training = -0.0 GB.
Peak reserved memory % of max memory = 68.612 %.
Peak reserved memory for training % of max memory = -0.0 %.
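
The `-0.0 GB` figures in the output above arise when the baseline is read after training, at which point the reserved memory already equals the peak. Using the numbers recorded in this notebook instead (5.922 GB reserved before training, 10.119 GB peak, 14.748 GB total), the training-only share can be worked out directly. The helper below is a hypothetical sketch; because its inputs are rounded, the percentages are approximate.

```python
def training_memory_stats(start_gb, peak_gb, total_gb):
    """Memory attributable to training: peak reserved minus the pre-training baseline."""
    training_gb = round(peak_gb - start_gb, 3)
    training_pct = round(training_gb / total_gb * 100, 3)
    return training_gb, training_pct


# Values taken from the recorded outputs of this notebook
training_gb, training_pct = training_memory_stats(
    start_gb=5.922, peak_gb=10.119, total_gb=14.748
)
print(f"Peak reserved memory for training = {training_gb} GB.")
print(f"Peak reserved memory for training % of max memory = {training_pct} %.")
```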


<a name="Inference"></a>
### Inference
Let's run the model!

We first check whether the model follows the style and writes a story within the distribution of "Tiny Stories", i.e. one that would most likely suit a bedtime story.

We select "Once upon a time, in a galaxy, far far away," since it normally is associated with Star Wars.

In [None]:
from transformers import TextIteratorStreamer
import textwrap
from unsloth import FastLanguageModel

# Enable inference mode
FastLanguageModel.for_inference(model)  # This enables optimized settings for inference

# Initialize the text streamer with the tokenizer
text_streamer = TextIteratorStreamer(tokenizer)

# Define maximum print width for text wrapping
max_print_width = 100

# Prepare the input text and tokenize it for the model
input_text = "Once upon a time, in a galaxy, far far away,"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")  # Move to GPU

# Extract input_ids for generation to avoid accessing any other tensor properties
input_ids = inputs["input_ids"]

# Verify the shape of the input_ids tensor before generation
print(f"Input tensor shape: {input_ids.shape}")

# Set generation arguments with input_ids only
generation_kwargs = dict(
    input_ids=input_ids,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)

# Generate text. This call blocks until generation finishes; the streamer queues the decoded chunks
output = model.generate(**generation_kwargs)

# Print the queued chunks with word wrapping
length = 0  # Track the length of printed text on the current line
for j, new_text in enumerate(text_streamer):
    if j == 0:
        # Wrap and print the first chunk of text with a defined width
        wrapped_text = textwrap.wrap(new_text, width=max_print_width)
        length = len(wrapped_text[-1])  # Update the length with the last line's length
        wrapped_text = "\n".join(wrapped_text)  # Join wrapped lines with newline
        print(wrapped_text, end="")  # Print the wrapped text
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0  # Reset line length if it exceeds max_print_width
            print()  # Print newline
        print(new_text, end="")  # Print the new text as it streams


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Input tensor shape: torch.Size([1, 14])
<s> Once upon a time, in a galaxy, far faraway, there was a little girl named Lily. She loved to 
play with her toys and explore the universe. One day, she found a shiny rock and decided to take it 
with her.

As she was walking, she saw a big, scary monster. The monster said, "Give me your rock, or I 
will hurt you!" Lily was scared, but she didn't want to give up her rock. She said, "No, this is my 
rock, and I won't give it to you!"

The monster got angry and tried to grab the rock from Lily. But Lily 
was brave and held on tight. The monster gave up and walked away. Lily was happy that she didn't lose 
her rock and learned that it's important to stand up for what you believe in.</s>
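
In the cell above, `model.generate` runs to completion before the loop over the streamer starts, so the text appears all at once. For output that appears while generation is still running, the usual pattern is to run generation in a background thread and consume the streamer in the main thread. The sketch below illustrates that pattern with the standard library only: `ChunkStreamer` and `fake_generate` are toy stand-ins for an iterator-style streamer and for `model.generate`, not real library classes.

```python
import queue
import threading
import time


class ChunkStreamer:
    """Toy iterator-style streamer: a queue of text chunks plus a stop sentinel."""

    _STOP = object()

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, chunk):
        self._queue.put(chunk)

    def end(self):
        self._queue.put(self._STOP)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer supplies a chunk
        if item is self._STOP:
            raise StopIteration
        return item


def fake_generate(streamer, chunks, delay=0.01):
    """Stand-in for model.generate: emits chunks over time, then signals the end."""
    for chunk in chunks:
        time.sleep(delay)
        streamer.put(chunk)
    streamer.end()


streamer = ChunkStreamer()
worker = threading.Thread(
    target=fake_generate, args=(streamer, ["Once ", "upon ", "a ", "time"])
)
worker.start()

received = []
for chunk in streamer:  # chunks are printed as soon as the worker produces them
    received.append(chunk)
    print(chunk, end="", flush=True)
worker.join()
print()
```

With the real model, the same shape would apply: the generation call becomes the thread target and the streamer is iterated in the main thread.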

Example 2

In [None]:
from transformers import TextIteratorStreamer
import textwrap
from unsloth import FastLanguageModel

# Enable inference mode for Unsloth models
FastLanguageModel.for_inference(model)

# Initialize the text streamer with the tokenizer
text_streamer = TextIteratorStreamer(tokenizer)

# Define maximum print width for text wrapping
max_print_width = 100

# New prompt for the model
input_text = "In the heart of the enchanted forest, a mysterious light began to glow,"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")  # Move to GPU

# Extract input_ids for generation to avoid accessing any other tensor properties
input_ids = inputs["input_ids"]

# Verify the shape of the input_ids tensor before generation
print(f"Input tensor shape: {input_ids.shape}")

# Set generation arguments with input_ids only
generation_kwargs = dict(
    input_ids=input_ids,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
)

# Generate text. This call blocks until generation finishes; the streamer queues the decoded chunks
output = model.generate(**generation_kwargs)

# Print the queued chunks with word wrapping
length = 0  # Track the length of printed text on the current line
for j, new_text in enumerate(text_streamer):
    if j == 0:
        # Wrap and print the first chunk of text with a defined width
        wrapped_text = textwrap.wrap(new_text, width=max_print_width)
        length = len(wrapped_text[-1])  # Update the length with the last line's length
        wrapped_text = "\n".join(wrapped_text)  # Join wrapped lines with newline
        print(wrapped_text, end="")  # Print the wrapped text
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0  # Reset line length if it exceeds max_print_width
            print()  # Print newline
        print(new_text, end="")  # Print the new text as it streams


Input tensor shape: torch.Size([1, 18])
<s> In the heart of the enchanted forest, a mysterious light began toglow, illuminating the trees 
and the ground. A little girl named Lily was walking through the forest when she saw the light. She 
stopped to take a closer look.

"What is that?" she asked.

A small voice answered, "It's a magical light. 
It can make anything you wish for come true."

Lily was amazed. She wished for a new toy, but the 
light said, "No, you must wish for something more important."

Lily thought for a moment and then said, 
"I wish for peace in the world."

The light glowed brighter and brighter until it filled the forest 
with a warm, peaceful glow. Lily smiled and thanked the light. She knew that her wish had come true.</s>

#### Save this model into Google Drive


In [None]:
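# Mount Google Drive first so that /content/drive exists
from google.colab import drive
drive.mount('/content/drive')

# Define the path to save the model
save_path = "/content/drive/My Drive/My_Model"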

# Save the model and tokenizer to the specified path
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model and tokenizer saved to {save_path}")


Model and tokenizer saved to /content/drive/My Drive/My_Model


#### Save it as a zip file into Google Drive


In [None]:
from google.colab import drive
import shutil
import os

# Step 1: Mount Google Drive
drive.mount('/content/drive')

# Step 2: Define the path to save the model
model_folder = "/content/My_Model"
model_zip_path = "/content/drive/My Drive/My_Model.zip"  # Path where zip will be saved in Google Drive

# Save the model and tokenizer to the specified folder
os.makedirs(model_folder, exist_ok=True)
model.save_pretrained(model_folder)
tokenizer.save_pretrained(model_folder)

# Step 3: Zip the folder and move to Google Drive
shutil.make_archive("/content/My_Model", 'zip', model_folder)
shutil.move("/content/My_Model.zip", model_zip_path)

print(f"Model and tokenizer saved as a zip file to {model_zip_path}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Model and tokenizer saved as a zip file to /content/drive/My Drive/My_Model.zip
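
Before relying on the archive, it can be worth checking that it opens cleanly and contains the expected files. The sketch below does this in a temporary directory with two empty placeholder files; the file names are illustrative stand-ins, not a claim about what `save_pretrained` writes.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a stand-in model folder with placeholder files (names are illustrative)
    model_folder = os.path.join(workdir, "My_Model")
    os.makedirs(model_folder)
    for name in ("adapter_config.json", "tokenizer_config.json"):
        with open(os.path.join(model_folder, name), "w") as handle:
            handle.write("{}")

    # Zip the folder the same way as the cell above
    archive_path = shutil.make_archive(os.path.join(workdir, "My_Model"), "zip", model_folder)

    # Verify the archive: list the files and run a CRC check on every member
    with zipfile.ZipFile(archive_path) as archive:
        archived = sorted(n for n in archive.namelist() if not n.endswith("/"))
        first_bad_file = archive.testzip()  # None when every member passes the check

print("Files in archive:", archived)
print("Archive intact:", first_bad_file is None)
```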


### Gradio-based interactive chatbot interface

##### Load the model

In [None]:
# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 2: Define the model path
model_path = "/content/drive/My Drive/My_Model"  # Update with your actual path

# Step 3: Load the model and tokenizer
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

# Load the model and tokenizer from the specified path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    max_seq_length=2048,  # Adjust as needed
    dtype=None,           # None auto-detects float16 or bfloat16
    load_in_4bit=True,    # Adjust based on your saved model settings
)

# Enable inference mode
FastLanguageModel.for_inference(model)

print("Model and tokenizer loaded successfully.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
==((====))==  Unsloth 2024.10.7: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Will load /content/drive/My Drive/My_Model as a legacy tokenizer.
Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers, TRL and unsloth via:
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`
Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2024.10.7 patched 32 layers with 0 QKV layers, 32 O layers and 32 MLP layers.


Model and tokenizer loaded successfully.


In [None]:
from transformers import TextStreamer

def chatbot_response(user_input, max_new_tokens=128):
    try:
        # Tokenize the input
        inputs = tokenizer(user_input, return_tensors="pt").to("cuda")
        input_ids = inputs["input_ids"]

        # Set up TextStreamer for real-time output streaming
        text_streamer = TextStreamer(tokenizer)

        # Generate output with the streamer
        model.generate(
            input_ids=input_ids,
            max_new_tokens=max_new_tokens,
            streamer=text_streamer,
            use_cache=True
        )

        # Since TextStreamer automatically prints the response as it streams,
        # you may want to capture the full response for Gradio to return
        # (TextStreamer currently does not have built-in return support)
        return "Response is being streamed in real-time."

    except Exception as e:
        # Print the error with traceback for debugging
        print("Error during response generation:", e)
        return f"An error occurred: {str(e)}"
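
The function above returns a placeholder string because `TextStreamer` prints to standard output rather than handing text back to the caller. One way to return text incrementally to a UI is a generator that yields the response accumulated so far; Gradio's `Interface` accepts generator functions and refreshes the output with each yielded value. The sketch below shows only that accumulation pattern, with a hypothetical helper and hard-coded chunks in place of model output.

```python
def accumulate_chunks(chunks):
    """Yield the response-so-far after each new chunk arrives."""
    response = ""
    for chunk in chunks:
        response += chunk
        yield response


# Hard-coded chunks stand in for text arriving from a streamer
updates = list(accumulate_chunks(["Once ", "upon ", "a time"]))
for update in updates:
    print(repr(update))
```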


##### Load from a directory

In [None]:
import torch

# Check CUDA availability
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))

# Test loading model
try:
    model_path = "/content/drive/My Drive/My_Model"  # Adjust to your actual path
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path,
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=True
    )
    FastLanguageModel.for_inference(model)
    print("Model and tokenizer loaded successfully.")
except Exception as e:
    print("Error loading model:", e)


CUDA available: True
CUDA device: Tesla T4
==((====))==  Unsloth 2024.10.7: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Will load /content/drive/My Drive/My_Model as a legacy tokenizer.


Model and tokenizer loaded successfully.


In [None]:
!pip install gradio




In [None]:
import gradio as gr
from threading import Thread
from transformers import TextIteratorStreamer
from unsloth import FastLanguageModel

# Enable inference mode for Unsloth models
FastLanguageModel.for_inference(model)

# Define the chatbot response function
def chatbot_response(user_input, max_new_tokens=128):
    # Tokenize the input
    inputs = tokenizer(user_input, return_tensors="pt").to("cuda")

    # TextIteratorStreamer can be iterated over; TextStreamer only prints to stdout
    text_streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)

    # Run generation in a background thread so chunks can be consumed as they arrive
    generation_kwargs = dict(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=int(max_new_tokens),  # Slider values may arrive as floats
        streamer=text_streamer,
        use_cache=True,
    )
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    # Yield the accumulated response so Gradio updates the output as text streams in
    response = ""
    for new_text in text_streamer:
        response += new_text
        yield response
    thread.join()

# Set up Gradio interface with a textbox for user input and output
interface = gr.Interface(
    fn=chatbot_response,
    inputs=[
        gr.Textbox(label="Enter your prompt here", placeholder="Type a story or question..."),
        gr.Slider(32, 256, value=128, step=1, label="Max New Tokens"),
    ],
    outputs="text",
    title="Interactive AI Chatbot",
    description="Type a prompt to interact with the AI model and watch the responses in real-time."
)

# Launch the Gradio interface (debug=True keeps the cell running and shows errors)
interface.launch(debug=True)


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://5932878cf1b7a07d8f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


<s>hai, I am a little girl. I like to play with my toys. I have a big box of toys. I have a doll, a car, a ball, a bear, and a book. I like to play with my toys in my room.

One day, I want to play with my toys in the living room. I ask my mom, "Can I play with my toys in the living room?" My mom says, "Yes, you can play with your toys in the living room, but you have to be careful. Do not make a mess."

I go to the living room with my toys


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/queueing.py", line 624, in process_events
    response = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/route_utils.py", line 323, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 2018, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1567, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 8

<s>exit()

# 1. 100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://250365ded50450f1f1.gradio.live
Killing tunnel 127.0.0.1:7861 <> https://2e5ab25a3dba0d4206.gradio.live
Killing tunnel 127.0.0.1:7862 <> https://5932878cf1b7a07d8f.gradio.live
Rerunning server... use `close()` to stop if you need to change `launch()` parameters.
----
Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://5932878cf1b7a07d8f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
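
`TextStreamer` only prints tokens to stdout and cannot be iterated, so it is not a way to collect the response for the Gradio output box. If you want the text to appear in the web UI as it is generated, one option is `TextIteratorStreamer` from `transformers`, consumed from a generator function while `model.generate` runs in a background thread. The sketch below is a minimal, untested-on-this-notebook illustration: `accumulate_stream` and `streaming_chatbot_response` are hypothetical names, and passing `model` and `tokenizer` as arguments is an assumption made to keep the example self-contained.

```python
from threading import Thread


def accumulate_stream(chunks):
    """Yield the running text after each chunk.

    Gradio updates a text output each time a generator function yields,
    so every yielded value should be the full text so far, not just the new piece.
    """
    text = ""
    for chunk in chunks:
        text += chunk
        yield text


def streaming_chatbot_response(user_input, max_new_tokens, model, tokenizer):
    # Hypothetical variant of chatbot_response (assumption, not the notebook's code).
    # Imported lazily so this block can be loaded without transformers installed.
    from transformers import TextIteratorStreamer

    inputs = tokenizer(user_input, return_tensors="pt").to("cuda")
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    # generate() blocks until it finishes, so run it in a background thread
    # and read decoded text pieces from the streamer in this thread.
    generation_kwargs = dict(
        input_ids=inputs["input_ids"],
        max_new_tokens=int(max_new_tokens),
        streamer=streamer,
        use_cache=True,
    )
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    yield from accumulate_stream(streamer)
    thread.join()


# The accumulation step on its own, with plain strings standing in for streamed text:
print(list(accumulate_stream(["Once ", "upon ", "a time"])))
# → ['Once ', 'Once upon ', 'Once upon a time']
```

In the notebook, `model` and `tokenizer` are already globals, so the two extra parameters could be dropped and the generator function passed directly as `fn=` to `gr.Interface`.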
