## 🚀 Day 11/15 — Fine-Tuning with Unsloth AI

# **Project: Fine-Tuning Orpheus Urdu TTS to adopt the voice of legendary orator Zia Mohiuddin**

## Project Goal

This notebook demonstrates how to fine-tune `orpheus-urdu-tts` to replicate the iconic voice and oratorical style of the legendary Pakistani orator, Zia Mohiuddin.

## Dataset

We'll use the `muhammadsaadgondal/urdu-tts` dataset, which contains recordings of Zia Mohiuddin's speech with the corresponding text, originally compiled for ASR tasks.

---

### 👋🏻 About Me

Hi, I'm **Aasher Kamal** — a Generative & Agentic AI developer passionate about building intelligent systems with LLMs.

I have started a **15-day challenge** to master fine-tuning using the open-source **Unsloth AI** framework. This journey will cover everything from LoRA and QLoRA to reinforcement learning, vision, and TTS fine-tuning — all hands-on, all open-source.

I'll be documenting my learnings, experiments, and challenges daily.

---

### 🌐 Connect with Me

- [LinkedIn](https://www.linkedin.com/in/aasher-kamal/)
- [GitHub](https://github.com/aasherkamal216)
- [X (Twitter)](https://x.com/Aasher_Kamal)
- [Facebook](https://www.facebook.com/aasher.kamal)
- [Website](https://aasherkamal.framer.website/)

Let’s build and learn together! 💡

---

### 📖 Acknowledgements

- The dataset used for fine-tuning this model is based on the work of [muhammadsaadgondal](https://huggingface.co/muhammadsaadgondal)

### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install snac

In [None]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mahwizzzz/orpheus-urdu-tts",
    max_seq_length= 2048,
    dtype = None,
    load_in_4bit = False
)

### LoRA model

In [11]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 64,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2025.8.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
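
As a sanity check on the LoRA configuration above, the adapter size can be estimated by hand: each adapted linear layer with `in` inputs and `out` outputs gains `r * (in + out)` parameters. The sketch below assumes the base model uses Llama-3.2-3B-style layer shapes (hidden size 3072, 8 key/value heads of dimension 128, MLP width 8192). Those dimensions are an assumption rather than something stated in this notebook, but under them the total matches the 97,255,424 trainable parameters reported in the training log.

```python
# Assumed layer shapes (Llama-3.2-3B-style backbone); not taken from this notebook.
HIDDEN = 3072
KV_DIM = 8 * 128     # 8 key/value heads x head dimension 128
MLP = 8192
NUM_LAYERS = 28      # matches "patched 28 layers" in the Unsloth output
R = 64               # LoRA rank used in get_peft_model

# (in_features, out_features) for each target module
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, r):
    # LoRA adds two matrices per layer: A with shape (r, in) and B with shape (out, r)
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(per_layer, total)
```

Because `lora_alpha` equals `r` here, the LoRA scaling factor `lora_alpha / r` is 1.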


### Loading and preprocessing the dataset


In [None]:
from datasets import load_dataset
from google.colab import userdata

hf_token = userdata.get('HF_TOKEN')

dataset = load_dataset("muhammadsaadgondal/urdu-tts", split = "train", token=hf_token)

In [None]:
import locale
import torchaudio.transforms as T
import os
import torch
from snac import SNAC

locale.getpreferredencoding = lambda: "UTF-8"

# Check if the audio column exists and get the sampling rate
if "audio" not in dataset.features or "array" not in dataset[0]["audio"]:
    raise ValueError("The 'audio' column with waveform arrays is missing from the dataset.")
ds_sample_rate = dataset[0]["audio"]["sampling_rate"]

print(f"\nOriginal dataset sampling rate: {ds_sample_rate} Hz")
print("Will be resampled to 24000 Hz for the model.")

# Initialize the SNAC audio tokenizer model
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")

def tokenise_audio(waveform):
  """Converts a raw audio waveform into a sequence of SNAC audio tokens."""
  waveform = torch.from_numpy(waveform).unsqueeze(0)
  waveform = waveform.to(dtype=torch.float32)
  resample_transform = T.Resample(orig_freq=ds_sample_rate, new_freq=24000)
  waveform = resample_transform(waveform)

  waveform = waveform.unsqueeze(0).to("cuda")

  # Generate the codes from SNAC
  with torch.inference_mode():
    codes = snac_model.encode(waveform)

  # Interleave the codes from the 3 codebooks into a single sequence
  all_codes = []
  for i in range(codes[0].shape[1]):
    all_codes.append(codes[0][0][i].item()+128266)
    all_codes.append(codes[1][0][2*i].item()+128266+4096)
    all_codes.append(codes[2][0][4*i].item()+128266+(2*4096))
    all_codes.append(codes[2][0][(4*i)+1].item()+128266+(3*4096))
    all_codes.append(codes[1][0][(2*i)+1].item()+128266+(4*4096))
    all_codes.append(codes[2][0][(4*i)+2].item()+128266+(5*4096))
    all_codes.append(codes[2][0][(4*i)+3].item()+128266+(6*4096))

  return all_codes

def add_codes(example):
    """Applies the audio tokenization to each example in the dataset."""
    codes_list = None
    try:
        answer_audio = example.get("audio")
        if answer_audio and "array" in answer_audio:
            audio_array = answer_audio["array"]
            codes_list = tokenise_audio(audio_array)
    except Exception as e:
        print(f"Skipping row due to error during audio tokenization: {e}")
    example["codes_list"] = codes_list
    return example

print("\nTokenizing audio waveforms...")
dataset = dataset.map(add_codes, remove_columns=["audio"])

# Define special tokens for formatting the final input sequence
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1
end_of_speech = tokeniser_length + 2
start_of_human = tokeniser_length + 3
end_of_human = tokeniser_length + 4
start_of_ai = tokeniser_length + 5
end_of_ai =  tokeniser_length + 6
pad_token = tokeniser_length + 7
audio_tokens_start = tokeniser_length + 10

# Filter out any examples that failed during tokenization
dataset = dataset.filter(lambda x: x["codes_list"] is not None)
dataset = dataset.filter(lambda x: len(x["codes_list"]) > 0)
print(f"Number of samples after filtering bad audio: {len(dataset)}")

def remove_duplicate_frames(example):
    """Drops a 7-token frame when its first token repeats that of the previously kept frame."""
    vals = example["codes_list"]
    # tokenise_audio always emits whole 7-token frames; as a safeguard,
    # drop any trailing partial frame so only complete frames are compared.
    vals = vals[:len(vals) - (len(vals) % 7)]

    result = vals[:7]
    for i in range(7, len(vals), 7):
        # Compare only the first token of each frame with the last kept frame's first token
        if vals[i] != result[-7]:
            result.extend(vals[i:i+7])
    example["codes_list"] = result
    return example

print("Removing duplicate audio frames...")
dataset = dataset.map(remove_duplicate_frames)

def create_input_ids(example):
    """Formats the text and audio tokens into the final conversational sequence."""
    # The dataset is single-speaker, so only the 'text' field is needed.
    text_prompt = example["text"]

    text_ids = tokenizer.encode(text_prompt, add_special_tokens=True)
    text_ids.append(end_of_text)

    example["text_tokens"] = text_ids

    input_ids = (
        [start_of_human]
        + example["text_tokens"]
        + [end_of_human]
        + [start_of_ai]
        + [start_of_speech]
        + example["codes_list"]
        + [end_of_speech]
        + [end_of_ai]
    )
    example["input_ids"] = input_ids
    example["labels"] = input_ids
    example["attention_mask"] = [1] * len(input_ids)
    return example

print("Creating final input sequences...")
dataset = dataset.map(create_input_ids, remove_columns=["text", "codes_list", "filename"])

# The final cleanup step
columns_to_keep = ["input_ids", "labels", "attention_mask"]
columns_to_remove = [col for col in dataset.column_names if col not in columns_to_keep]
dataset = dataset.remove_columns(columns_to_remove)

print("\nPreprocessing complete! The dataset is ready for training.")
print("Final dataset features:", dataset.features)
print("First example's input_ids length:", len(dataset[0]['input_ids']))
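
To make the preprocessing above easier to follow, here is a small self-contained sketch of the 7-tokens-per-frame layout, using made-up code values instead of real SNAC output. The helper name `interleave_frame` and the toy numbers are illustrative assumptions; the offsets (`128266` plus a multiple of `4096` per slot) are the ones used in `tokenise_audio`, and the last loop applies the same rule as `remove_duplicate_frames`.

```python
AUDIO_START = 128266   # tokeniser_length (128256) + 10
CODEBOOK = 4096        # each of the 7 slots gets its own band of 4096 IDs

def interleave_frame(c0, c1, c2, i):
    """Builds the 7 token IDs for frame i from three code lists (coarse to fine)."""
    slots = [
        c0[i],
        c1[2 * i],
        c2[4 * i],
        c2[4 * i + 1],
        c1[2 * i + 1],
        c2[4 * i + 2],
        c2[4 * i + 3],
    ]
    # Slot k is shifted by k * 4096 so IDs from different slots never collide
    return [code + AUDIO_START + k * CODEBOOK for k, code in enumerate(slots)]

# Toy codes for 2 frames: 2 coarse, 4 mid and 8 fine values (made up)
c0 = [5, 5]
c1 = [10, 11, 12, 13]
c2 = [20, 21, 22, 23, 24, 25, 26, 27]

tokens = []
for i in range(len(c0)):
    tokens.extend(interleave_frame(c0, c1, c2, i))

print(len(tokens))   # 2 frames x 7 tokens
print(tokens[:7])

# Same rule as remove_duplicate_frames: drop a frame whose first token repeats
kept = tokens[:7]
for j in range(7, len(tokens), 7):
    if tokens[j] != kept[-7]:
        kept.extend(tokens[j:j + 7])
print(len(kept))     # the second frame starts with the same coarse code, so it is dropped
```

Note that the de-duplication compares only the coarsest code, so two frames that differ in their finer codes are still merged.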

### Run Inference Before Fine-Tuning to check the model's performance

In [30]:
import torch
from IPython.display import display, Audio
from typing import List, Tuple, Optional

def generate_audio_from_prompts(
    prompts: List[str],
    model,
    tokenizer,
    snac_model,
) -> List[Tuple[str, Optional[Audio]]]:
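    """
    Generates audio from a list of text prompts using the given Orpheus model.

    Returns a list of (prompt, Audio widget) tuples; the widget is None when
    no valid audio tokens were generated for that prompt.
    """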
    # 1. Set up models for inference
    FastLanguageModel.for_inference(model)
    snac_model.to("cpu")

    # 2. Define special tokens and manually construct the correct input for each prompt
    start_of_human = 128259
    end_of_human = 128260
    start_of_ai = 128261
    start_of_speech = 128257
    end_of_text = 128009
    pad_token = 128263

    sequences_to_pad = []
    for prompt in prompts:
        text_ids = tokenizer.encode(prompt, add_special_tokens=True)
        # This is the required format to trigger the model's speech generation
        full_sequence = [start_of_human] + text_ids + [end_of_text, end_of_human, start_of_ai, start_of_speech]
        sequences_to_pad.append({"input_ids": torch.tensor(full_sequence)})

    # 3. Use the tokenizer to correctly pad the batch
    tokenizer.padding_side = "left"
    tokenizer.pad_token_id = pad_token
    inputs = tokenizer.pad(
        sequences_to_pad,
        padding=True,
        return_tensors="pt",
    ).to("cuda")

    # 4. Generate the audio tokens from the model
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=1200,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        num_return_sequences=1,
        eos_token_id=128258,  # End of Speech token
        use_cache=True,
    )

    # 5. Decode the generated tokens back to audio waveforms
    end_of_speech_token = 128258
    audio_token_start_id = 128266

    processed_rows = []
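    # With left padding every row shares the same prompt width, and generate()
    # returns prompt + continuation, so slice at the padded prompt length.
    prompt_len = inputs["input_ids"].shape[1]
    for generated_sequence in generated_ids:
        generated_part = generated_sequence[prompt_len:]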

        # Filter to keep only valid audio tokens, preventing errors
        valid_audio_tokens = generated_part[generated_part >= audio_token_start_id]
        valid_audio_tokens = valid_audio_tokens[valid_audio_tokens != end_of_speech_token]
        processed_rows.append(valid_audio_tokens)

    code_lists = []
    for row in processed_rows:
        row_length = row.size(0)
        if row_length == 0:
            code_lists.append([])
            continue

        new_length = (row_length // 7) * 7
        trimmed_row = row[:new_length]
        trimmed_row = [t.item() - audio_token_start_id for t in trimmed_row]
        code_lists.append(trimmed_row)

    def redistribute_codes(code_list):
      if not code_list:
        return torch.tensor([])

      layer_1, layer_2, layer_3 = [], [], []
      for i in range(len(code_list) // 7):
        frame = code_list[i*7 : (i+1)*7]
        layer_1.append(frame[0])
        layer_2.append(frame[1] - 4096)
        layer_3.append(frame[2] - (2*4096))
        layer_3.append(frame[3] - (3*4096))
        layer_2.append(frame[4] - (4*4096))
        layer_3.append(frame[5] - (5*4096))
        layer_3.append(frame[6] - (6*4096))

      codes = [torch.tensor(layer_1).unsqueeze(0),
               torch.tensor(layer_2).unsqueeze(0),
               torch.tensor(layer_3).unsqueeze(0)]

      audio_hat = snac_model.decode(codes)
      return audio_hat

    audio_samples = [redistribute_codes(code_list) for code_list in code_lists]

    # 6. Package the results
    final_results = []
    for i, samples in enumerate(audio_samples):
        if samples.numel() > 0:
            audio_widget = Audio(samples.detach().squeeze().cpu().numpy(), rate=24000)
            final_results.append((prompts[i], audio_widget))
        else:
            final_results.append((prompts[i], None))

    # Clean up GPU memory
    del generated_ids, inputs, audio_samples, sequences_to_pad
    torch.cuda.empty_cache()

    return final_results
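
The decoding half of `generate_audio_from_prompts` can be checked without a GPU. The sketch below mirrors step 5 and `redistribute_codes` on plain Python lists: it drops non-audio IDs, trims to whole frames, and splits each 7-token frame back into the three SNAC layers. The function name `split_into_layers` and the sample IDs are hypothetical and only meant to illustrate the bookkeeping.

```python
AUDIO_START = 128266
CODEBOOK = 4096

def split_into_layers(generated_ids):
    """Turns a flat list of generated token IDs into three per-layer code lists."""
    # Keep only audio tokens; special tokens such as end-of-speech (128258)
    # fall below the audio start ID and are filtered out here.
    audio = [t - AUDIO_START for t in generated_ids if t >= AUDIO_START]
    # Discard a trailing partial frame
    audio = audio[: (len(audio) // 7) * 7]

    layer_1, layer_2, layer_3 = [], [], []
    for start in range(0, len(audio), 7):
        frame = audio[start:start + 7]
        layer_1.append(frame[0])
        layer_2.append(frame[1] - 1 * CODEBOOK)
        layer_3.append(frame[2] - 2 * CODEBOOK)
        layer_3.append(frame[3] - 3 * CODEBOOK)
        layer_2.append(frame[4] - 4 * CODEBOOK)
        layer_3.append(frame[5] - 5 * CODEBOOK)
        layer_3.append(frame[6] - 6 * CODEBOOK)
    return layer_1, layer_2, layer_3

# One full frame, two stray audio tokens, then end-of-speech (made-up values)
sample = [128271, 132372, 136478, 140575, 144661, 148768, 152865,
          128300, 132400, 128258]
print(split_into_layers(sample))
```

Each frame contributes 1, 2 and 4 codes to the three layers respectively, which is why the lists grow at different rates.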

In [9]:
new_prompts = [
    "یہی وہ آواز ہے جس نے اردو بولنے والوں کی نسلوں کو مسحور کیا ہے۔",
]

# Call the function to get the generated audio
generated_audio_results = generate_audio_from_prompts(
    prompts=new_prompts,
    model=model,
    tokenizer=tokenizer,
    snac_model=snac_model,
)

for prompt, audio_widget in generated_audio_results:
    print(f"\nPrompt: {prompt}")
    if audio_widget:
        display(audio_widget)
    else:
        print("-> Audio generation failed for this prompt.")

# Clean up to save RAM
del generated_audio_results


Prompt: یہی وہ آواز ہے جس نے اردو بولنے والوں کی نسلوں کو مسحور کیا ہے۔


<a name="Train"></a>
### Train the model

In [None]:
from transformers import TrainingArguments, Trainer
trainer = Trainer(
    model = model,
    train_dataset = dataset,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 2,
        learning_rate = 1e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        save_steps = 30,
        output_dir = "outputs",
        report_to = "none", 
    ),
)

In [13]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.161 GB.
6.641 GB of memory reserved.


In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 353 | Num Epochs = 2 | Total steps = 178
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 97,255,424 of 3,398,122,496 (2.86% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,5.1807
2,5.1266
3,5.0477
4,5.0095
5,5.1038
6,5.1375
7,5.0829
8,5.0481
9,4.8712
10,4.9551
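
The `Total steps = 178` figure in the log follows from the training arguments. A rough sketch of the arithmetic is below. It assumes the trainer counts a trailing partial accumulation group as a full optimizer step (this is consistent with the logged value, though the exact rounding can differ between `transformers` versions), and it models the `linear` schedule as warmup followed by linear decay to zero.

```python
import math

num_examples = 353        # from the training log
batch_size = 1            # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
epochs = 2                # num_train_epochs
warmup_steps = 5
peak_lr = 1e-4

# Assumption: a trailing partial accumulation group still counts as one step
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
total_steps = steps_per_epoch * epochs

def linear_lr(step):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps)
for s in (1, 5, 89, 178):
    print(s, linear_lr(s))
```

With only 5 warmup steps, the learning rate reaches its peak almost immediately and then decays for the rest of the run.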


In [15]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

272.9458 seconds used for training.
4.55 minutes used for training.
Peak reserved memory = 7.605 GB.
Peak reserved memory for training = 0.964 GB.
Peak reserved memory % of max memory = 34.317 %.
Peak reserved memory for training % of max memory = 4.35 %.


<a name="Inference"></a>
### Inference after fine-tuning



In [36]:
prompts = [
"میں بہت خوش ہوں۔",
"آج مجھے یہ بتاتے ہوئے بہت خوشی ہو رہی ہے کہ میں نے یہ کام مکمل کر لیا ہے۔"
]

generated_audio_results = generate_audio_from_prompts(
    prompts=prompts,
    model=model,
    tokenizer=tokenizer,
    snac_model=snac_model,
)

for prompt, audio_widget in generated_audio_results:
    print(f"\nPrompt: {prompt}")
    if audio_widget:
        display(audio_widget)
    else:
        print("-> Audio generation failed for this prompt.")

# Clean up to save RAM
del generated_audio_results


Prompt: میں بہت خوش ہوں۔



Prompt: آج مجھے یہ بتاتے ہوئے بہت خوشی ہو رہی ہے کہ میں نے یہ کام مکمل کر لیا ہے۔


### Upload the merged model to Hugging Face

In [None]:
model.push_to_hub_merged("Aasher/Orpheus_Urdu_TTS_FineTuned", tokenizer, save_method = "merged_16bit", token = hf_token)