<a href="https://colab.research.google.com/github/Vinooj/llm-fine_tuning-experiments/blob/main/ascii_art_completion_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Completion finetuning using unsloth

This notebook uses Unsloth to finetune a model for a completion task.
In this example we finetune the Llama 3.2 3B base model to generate ASCII art. I recommend Unsloth over using the Hugging Face libraries alone because it requires less memory and trains faster.

Adapted from the Unsloth notebooks. If something is broken, check:
https://unsloth.ai/

In [None]:
%%capture
# --torch-backend=auto selects the appropriate PyTorch index at runtime by inspecting
# the installed CUDA driver version. It is a uv option (plain pip does not accept it),
# so the install goes through uv here.
# Manual alternative: !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install uv
!uv pip install --system vllm torch torchvision torchaudio --torch-backend=auto

# Install core packages without dependencies (to avoid version conflicts)
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl

# Install specific triton version without dependencies
!pip install triton==2.1.0 --no-deps

# Install unsloth-related packages
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install --no-deps unsloth

# Install remaining packages with dependencies (these are generally stable)
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

In [None]:
import torch
import triton
import unsloth
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Triton: {triton.__version__}")
print("All packages installed successfully!")

### Load base model

In [None]:
# Use Unsloth's FastLanguageModel instead of Hugging Face's
# AutoModelForCausalLM.from_pretrained. It relies on optimized kernels, efficient
# memory management, and automatic precision handling to improve training speed.
from unsloth import FastLanguageModel
import torch
from google.colab import userdata

# Sets the maximum number of tokens that this instance of the model and its
# tokenizer will be configured to handle during finetuning and subsequent
# inference.
max_seq_length = 2048


# dtype = None tells Unsloth to pick the most suitable data type (precision)
# automatically based on the available hardware, such as your GPU. Unsloth
# uses faster, more memory-efficient types such as bfloat16 or float16 when
# the hardware supports them.
dtype = None


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-3B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit = False,
    token=userdata.get('HF_TOKEN')
)
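Passing `dtype = None` defers the precision choice to Unsloth. The sketch below mimics that decision in plain Python, assuming the rule is "bfloat16 when the GPU supports it, otherwise float16". `resolve_dtype` is a hypothetical helper written for illustration and is not part of Unsloth's API.

```python
def resolve_dtype(requested, bf16_supported):
    """Hypothetical helper: return the dtype name a loader would likely use."""
    if requested is not None:
        # An explicit request always wins over auto-detection.
        return requested
    # Assumed auto rule: prefer bfloat16, fall back to float16.
    return "bfloat16" if bf16_supported else "float16"


print(resolve_dtype(None, bf16_supported=True))
print(resolve_dtype(None, bf16_supported=False))
print(resolve_dtype("float32", bf16_supported=True))
```

On a real runtime, `torch.cuda.is_bf16_supported()` reports whether bfloat16 is available on the current GPU.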

In [None]:
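# Keep spaces exactly as generated. Cleanup would remove spaces before
# punctuation when decoding, which distorts ASCII art.
tokenizer.clean_up_tokenization_spaces = False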

### Add lora to base model and patch with Unsloth

In [None]:
# More info about parameters: https://huggingface.co/docs/peft/v0.11.0/en/package_reference/lora#peft.LoraConfig
target_modules =  ["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"]

# Set to True when adding special tokens, so that lm_head is trained as well
train_embeddings = False

if train_embeddings:
  target_modules = target_modules + ["lm_head"]

# PEFT stands for "Parameter-Efficient Fine-Tuning"; Unsloth integrates with PEFT methods such as LoRA and QLoRA.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,              # Rank of the LoRA adapters. 16 is a common value that balances expressiveness
                         # with parameter efficiency. A higher rank adds adapter parameters, allowing
                         # more complex changes but also slightly raising the risk of overfitting.
    target_modules = target_modules,  # Modules of the LLM that receive LoRA weights
    lora_alpha = 16,     # Scales the adapter output (higher means more influence relative to the base model).
                         # Setting it equal to r, so that lora_alpha / r = 1, is a commonly recommended choice.
    lora_dropout = 0,    # Dropout on the adapters (a regularization technique). Tutorials often use 0.05,
                         # but Unsloth recommends 0.
    bias = "none",       # "none" is the optimized setting; it saves VRAM (GPU memory) and improves training efficiency.
    use_gradient_checkpointing = "unsloth", # "unsloth" reduces VRAM usage, which helps with very long contexts.
    random_state = 3407,
    use_rslora = False,  # Rank-Stabilized LoRA scales the adapter output by lora_alpha / sqrt(r)
                         # instead of lora_alpha / r. Hugging Face reports that it can work better;
                         # it is disabled here.
    loftq_config = None, # LoftQ initialization is not used
)
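To get a feel for what `r`, `lora_alpha` and `use_rslora` imply, the sketch below counts the adapter parameters and computes the scaling factor in plain Python. The layer shapes are assumptions taken from the public Llama 3.2 3B configuration (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128, 28 layers); verify them against the model config before relying on the numbers. `lora_param_count` and `lora_scaling` are illustrative helpers, not PEFT or Unsloth functions.

```python
import math

# Assumed Llama 3.2 3B shapes (check the model's config.json).
HIDDEN = 3072
INTERMEDIATE = 8192
KV_DIM = 8 * 128   # key/value heads * head dimension
NUM_LAYERS = 28

# (in_features, out_features) of each targeted linear layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


def lora_scaling(lora_alpha, r, use_rslora=False):
    """Factor applied to the adapter output before it is added to the base output."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


print(lora_param_count(16))                    # 24313856
print(lora_scaling(16, 16))                    # 1.0
print(lora_scaling(16, 16, use_rslora=True))   # 4.0
```

Under these assumptions, roughly 24 million weights are trained, which is less than 1% of the base model.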

In [None]:
empty_prompt = """
{ascii_art}
"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func_no_prompt(examples):
  ascii_art_samples = examples["ascii"]
  training_prompts = []
  for ascii_art in ascii_art_samples:
      training_prompt = empty_prompt.format(ascii_art=ascii_art) + EOS_TOKEN
      training_prompts.append(training_prompt)
  return { "text" : training_prompts, }


from datasets import load_dataset
dataset = load_dataset("pookie3000/ascii-cats", split = "train")
dataset = dataset.map(formatting_prompts_func_no_prompt, batched = True)
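The formatting step can be checked without loading the model or the dataset. The sketch below applies the same template to a toy batch. The cat drawing is made up, and the EOS string is an assumption (in the notebook it comes from `tokenizer.eos_token`).

```python
EOS_TOKEN = "<|end_of_text|>"  # assumed value; the notebook reads it from the tokenizer

empty_prompt = """
{ascii_art}
"""


def format_batch(examples):
    """Wrap each ASCII drawing in the template and append the EOS token."""
    texts = [
        empty_prompt.format(ascii_art=art) + EOS_TOKEN
        for art in examples["ascii"]
    ]
    return {"text": texts}


# Toy batch in the same column layout the dataset uses ("ascii").
toy_batch = {"ascii": [" /\\_/\\\n( o.o )\n > ^ <"]}
formatted = format_batch(toy_batch)
print(repr(formatted["text"][0]))
```

Each training sample starts with a newline and ends with a newline followed by the EOS token. The EOS token is what teaches the model to stop once the drawing is complete.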

### Visualize dataset

In [None]:
for i, sample in enumerate(dataset):
    print(f"\n------ Sample {i + 1} ----")
    print(sample["text"])
    if i > 2:
      break

In [None]:
from trl import SFTTrainer, SFTConfig
# from transformers import TrainingArguments
# from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 1, # Increase to mimic a larger batch size without using more VRAM
        warmup_steps = 5,
        num_train_epochs = 2, # Number of full passes over the dataset
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Set to "wandb" or another tracker to log metrics
    ),
)
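The batch size, gradient accumulation, epoch count, warmup and linear schedule together determine how many optimizer steps run and what learning rate each step sees. The sketch below works this out in plain Python. The dataset size of 200 is a made-up placeholder, and the step count is an approximation (the trainer's exact rounding may differ when sizes do not divide evenly). Both helpers are illustrative and not part of TRL.

```python
import math


def optimizer_steps(num_samples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate number of optimizer updates for a full run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


def linear_lr(step, total_steps, base_lr=2e-4, warmup_steps=5):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical dataset of 200 samples with the settings used above.
effective_batch, total_steps = optimizer_steps(200, per_device_batch=1, grad_accum=1, epochs=2)
print(effective_batch, total_steps)
for step in (0, 5, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

With a batch size of 1 and no accumulation, every sample triggers its own weight update, which is noisy. Raising `gradient_accumulation_steps` averages the gradients of several samples before each update.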

In [None]:
trainer_stats = trainer.train()

### Inference

In [None]:
from transformers import TextStreamer

def generate_ascii_art(model):
    FastLanguageModel.for_inference(model)
    inputs = tokenizer("", return_tensors = "pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationMixin
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationConfig
    # The streamer prints text as it is generated, so the returned tensor of token IDs is not needed.
    _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 100)

In [None]:
for _ in range(3):
  generate_ascii_art(model)

## Saving

### Save lora adapter

This is useful both for inference and for loading the model again later.

In [None]:
model.push_to_hub(
    "vinooj/Llama-3.2-3B-ascii-cats-lora",
    tokenizer,
    token = userdata.get('HF_WRITE_TOKEN')
)

### Merge model with lora weights and save to gguf

You can then run inference locally with Ollama or llama.cpp.

##### Popular quantization methods

- **q4_k_m**  
  4-bit quantization. Low memory. Many models in the Ollama library default to this or a similar 4-bit quantization.
- **q8_0**  
  8-bit quantization. Medium memory.
- **f16**  
  16-bit floating point. Many models are already stored in 16-bit, in which case no quantization happens.
- **not_quantized**  
  Often the same as f16.
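A rough size estimate for each method follows from its bits per weight. In the sketch below, the q8_0 figure is derived from its block layout (32 int8 weights plus one 16-bit scale). The q4_k_m figure and the parameter count are assumptions, so treat the results as ballpark values only.

```python
# q8_0 stores blocks of 32 int8 weights plus one float16 scale per block.
Q8_0_BITS_PER_WEIGHT = (32 * 8 + 16) / 32

BITS_PER_WEIGHT = {
    "q4_k_m": 4.85,                 # assumed average; the exact value depends on the model
    "q8_0": Q8_0_BITS_PER_WEIGHT,   # 8.5
    "f16": 16.0,
}

NUM_PARAMS = 3.2e9  # assumed approximate parameter count for Llama 3.2 3B


def approx_size_gb(num_params, bits_per_weight):
    """Estimated file size in gigabytes (10^9 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{approx_size_gb(NUM_PARAMS, bpw):.1f} GB")
```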

In [None]:
model.push_to_hub_gguf(
    "vinooj/Llama-3.2-3B-ascii-cats-lora-q4_k_m-GGUF",
    tokenizer,
    quantization_method="q4_k_m",
    token = userdata.get('HF_WRITE_TOKEN')
)

In [None]:
from unsloth import FastLanguageModel
import torch
from google.colab import userdata

### Load model and saved lora adapters
Use this if you want to continue finetuning or run inference with the model in safetensors format.

In [None]:

from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="vinooj/Llama-3.2-3B-ascii-cats-lora",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
    token=userdata.get('HF_TOKEN')
)


def generate_ascii_art(model):
    FastLanguageModel.for_inference(model)
    inputs = tokenizer("", return_tensors = "pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationMixin
    # https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationConfig
    # The streamer prints text as it is generated, so the returned tensor of token IDs is not needed.
    _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 100)


In [None]:
generate_ascii_art(model)