<a href="https://colab.research.google.com/github/cgshft/unsloth-tts-finetuning/blob/main/Orpheus_(3B)_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our [documentation](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**NEW** Unsloth now supports training the new **gpt-oss** model from OpenAI! You can start finetuning gpt-oss for free with our **[Colab notebook](https://x.com/UnslothAI/status/1953896997867729075)**!

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [6]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install snac

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

Thank you to [Etherll](https://huggingface.co/Etherll) for creating this notebook!

In [28]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
    # Qwen3 new models
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    # Other very popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/orpheus-3b-0.1-ft",
    max_seq_length= 2048, # Choose any for long context!
    dtype = None, # Select None for auto detection
    load_in_4bit = False, # Select True for 4bit which reduces memory usage
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.8.8: Fast Llama patching. Transformers: 4.55.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
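
As a rough illustration of where a figure like "1 to 10%" comes from: a LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` trainable weights. The sketch below applies that formula to the seven target modules used in the next cell. The layer shapes (hidden size 3072, MLP size 8192, key/value width 1024, 28 layers) and the 3.3B base size are assumptions based on a Llama-3.2-3B-style architecture, not values read from this checkpoint, so treat the result as an estimate only.

```python
def lora_param_count(r, shapes, n_layers):
    """Trainable LoRA weights: each adapted layer adds A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers

# Assumed Llama-3.2-3B-style shapes as (d_in, d_out); not read from the checkpoint.
HIDDEN, MLP, KV = 3072, 8192, 1024
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

trainable = lora_param_count(r=64, shapes=shapes, n_layers=28)
base_params = 3.3e9  # assumed approximate size of the base model
print(f"{trainable:,} trainable LoRA weights")
print(f"{100 * trainable / base_params:.1f}% of the assumed base size")
```

Doubling `r` doubles the adapter size, which is why the rank is the main knob for trading memory against capacity.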

In [27]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)


Mount Google Drive

In [8]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


Extract .wavs

In [9]:
# !mkdir /content/wavs
# !tar -xvzf /content/drive/MyDrive/Coding/TTS/wavs.tar.gz -C /content

Streaming output truncated to the last 5000 lines.
wavs/common_voice_eo_18092949.wav
wavs/common_voice_eo_17985744.wav
wavs/common_voice_eo_20731551.wav
wavs/common_voice_eo_20677382.wav
wavs/common_voice_eo_17908199.wav
wavs/common_voice_eo_19844002.wav
wavs/common_voice_eo_20885801.wav
wavs/common_voice_eo_20720086.wav
wavs/common_voice_eo_17900346.wav
wavs/common_voice_eo_21216690.wav
wavs/common_voice_eo_19069674.wav
wavs/common_voice_eo_20878233.wav
wavs/common_voice_eo_20847868.wav
wavs/common_voice_eo_17907325.wav
wavs/common_voice_eo_21225983.wav
wavs/common_voice_eo_20708159.wav
wavs/common_voice_eo_19543659.wav
wavs/common_voice_eo_20495843.wav
wavs/common_voice_eo_19543664.wav
wavs/common_voice_eo_20689897.wav
wavs/common_voice_eo_17926437.wav
wavs/common_voice_eo_20424971.wav
wavs/common_voice_eo_21485266.wav
wavs/common_voice_eo_18415456.wav
wavs/common_voice_eo_20434329.wav
wavs/common_voice_eo_20878246.wav
wavs/common_voice_eo_19327691.wav
wavs/common_voice

<a name="Data"></a>
### Data Prep  

OLD: We will use the `MrDragonFox/Elise` dataset, which is designed for training TTS models. Ensure that your dataset follows the required format: **text, audio** for single-speaker models or **source, text, audio** for multi-speaker models. You can modify this section to accommodate your own dataset, but maintaining the correct structure is essential for optimal training.
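
A minimal sketch of the two layouts described above, assuming each row is a plain dict. The helper name `build_text_prompt`, the speaker label, and the file path are hypothetical; the helper mirrors the rule used later in this notebook, where a `source` field, if present, is prepended to the text as a speaker label.

```python
def build_text_prompt(example):
    """Return the text the model is conditioned on for one dataset row."""
    missing = {"text", "audio"} - example.keys()
    if missing:
        raise KeyError(f"row is missing required fields: {sorted(missing)}")
    if "source" in example:  # multi-speaker: prefix the speaker name
        return f"{example['source']}: {example['text']}"
    return example["text"]  # single-speaker: text only

# Hypothetical rows; the audio value would normally be decoded audio, not a path.
single = {"text": "Ĝi estis vaporboato.", "audio": "wavs/a.wav"}
multi = {"source": "speaker_1", "text": "Ĝi estis vaporboato.", "audio": "wavs/a.wav"}
print(build_text_prompt(single))
print(build_text_prompt(multi))
```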

In [5]:
##### OLD #####
# from datasets import load_dataset
# dataset = load_dataset("MrDragonFox/Elise", split = "train")

Load Mozilla Common Voice Dataset from .csv instead
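
The cell below expects a pipe-separated metadata file with a header row. As a small stdlib-only illustration of that layout (the notebook itself uses pandas), the sketch parses an in-memory sample and prepends a base path to each audio path. The two sample rows reuse file names from this dataset; the column names `audio_file` and `text` are assumed from the rename step in the cell, and `read_metadata` is a hypothetical helper.

```python
import csv
import io
import os

sample = (
    "audio_file|text\n"
    "wavs/common_voice_eo_20560866.wav|Dumtempe li edziĝis kaj iĝis patro.\n"
    "wavs/common_voice_eo_20105659.wav|Ĝi estis vaporboato.\n"
)

def read_metadata(handle, base_path=""):
    """Parse pipe-separated rows into dicts with 'audio' and 'text' keys."""
    reader = csv.DictReader(handle, delimiter="|")
    rows = []
    for row in reader:
        rows.append({
            "audio": os.path.join(base_path, row["audio_file"]),
            "text": row["text"],
        })
    return rows

rows = read_metadata(io.StringIO(sample), base_path="/content/")
for row in rows:
    print(row["audio"], "->", row["text"])
```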

In [4]:
import pandas as pd
import os
from datasets import load_dataset, Audio

# --- Configuration ---
# The path to your input CSV file (pipe-separated).
# Here it points to a Google Drive folder; adjust it to where your file lives.
csv_filename = '/content/drive/MyDrive/Coding/TTS/metadata_full.csv'

# The desired name for your output Parquet file.
output_filename = 'esperanto_tts_dataset.parquet'

# The absolute path to the folder containing your 'wavs' directory.
# For Google Colab, this is typically '/content/'. Set to '' if not needed.
base_path = '/content/'

# --- Script ---

# 1. Check if the input file exists before proceeding.
if not os.path.exists(csv_filename):
    print(f"Error: The file '{csv_filename}' was not found.")
    print("Please check that the path is correct and that the file is accessible from this runtime.")
else:
    try:
        # 2. Read the pipe-separated CSV file into a Pandas DataFrame.
        #    - `sep='|'` tells pandas to use the pipe character as the delimiter.
        #    - `header=0` specifies that the first row of the CSV is the header.
        print(f"Reading data from '{csv_filename}'...")
        df = pd.read_csv(csv_filename, sep='|', header=0)

        # Rename the 'audio_file' column to 'audio' to match the target schema.
        if 'audio_file' in df.columns:
            df.rename(columns={'audio_file': 'audio'}, inplace=True)

        # Prepend the base path to the audio file paths if one is provided.
        # This fixes pathing issues in environments like Google Colab.
        if base_path:
            print(f"Prepending '{base_path}' to all audio file paths...")
            df['audio'] = df['audio'].apply(lambda path: os.path.join(base_path, path))

        # 3. (Optional but Recommended) Verify the first few rows of the DataFrame.
        #    This helps confirm that the data was loaded and paths were updated correctly.
        print("\n--- Data Preview (First 5 Rows) ---")
        display(df.head())

        # 4. (Optional but Recommended) Check the schema and data types.
        #    This shows the structure of your table.
        print("\n--- DataFrame Info ---")
        df.info()

        # 5. Save the DataFrame to a Parquet file.
        #    - `engine='pyarrow'` is the standard engine for writing Parquet files.
        #    - `index=False` prevents pandas from writing the DataFrame index as a column.
        print(f"\nWriting data to '{output_filename}'...")
        df.to_parquet(output_filename, engine='pyarrow', index=False)

        print(f"\n✅ Success! Your Parquet file '{output_filename}' has been created.")

        # --- Verification Step ---
        # This section compares your dataset's schema with the original Hugging Face dataset.
        print("\n--- Verifying Schema Against MrDragonFox/Elise ---")
        try:
            print("Loading original dataset from Hugging Face (this may take a moment)...")
            # Load the reference dataset from Hugging Face
            hf_dataset = load_dataset("MrDragonFox/Elise", split="train")

            # Convert both the local and remote datasets to pandas DataFrames for easy comparison
            local_df = pd.read_parquet(output_filename)
            hf_df = hf_dataset.to_pandas()

            # Compare column names
            local_cols = sorted(list(local_df.columns))
            hf_cols = sorted(list(hf_df.columns))

            print(f"\nYour file's columns: {local_cols}")
            print(f"Target dataset's columns: {hf_cols}")

            if local_cols == hf_cols:
                print("✅ Column names match.")

                # Compare data types (dtypes)
                if local_df.dtypes.equals(hf_df.dtypes):
                    print("✅ Data types match.")
                    print("\n🎉 Schema verification successful! Your dataset structure matches the target.")
                else:
                    print("⚠️ Warning: Data types do not match exactly.")
                    print("This might be okay if they are compatible.")
                    print("\nYour dtypes:\n", local_df.dtypes)
                    print("\nTarget dtypes:\n", hf_df.dtypes)
            else:
                print("❌ Error: Column names DO NOT match!")
                print(f"Columns in your file but not in target: {set(local_cols) - set(hf_cols)}")
                print(f"Columns in target but not in your file: {set(hf_cols) - set(local_cols)}")

            # --- Final Loading Step ---
            print("\n--- Loading Your Local Parquet as a Hugging Face Dataset ---")

            # Load your local parquet file into a Dataset object.
            dataset = load_dataset('parquet', data_files={'train': output_filename}, split='train')

            # Cast the 'audio' column to the 'Audio' feature.
            dataset = dataset.cast_column("audio", Audio(sampling_rate=22050))

            print("\n✅ Your local dataset is now loaded and ready for training!")
            print("Preview of the first item (notice the 'audio' field is now processed):")
            print(dataset[0])


        except ImportError:
            print("\nSkipping schema verification. To enable this check, please install the 'datasets' library:")
            print("pip install datasets")
        except Exception as e:
            print(f"\nAn error occurred during schema verification: {e}")

    except Exception as e:
        print(f"\nAn error occurred: {e}")
        print("Please check that your CSV file is formatted correctly (filepath|transcription).")

Reading data from '/content/drive/MyDrive/Coding/TTS/metadata_full.csv'...
Prepending '/content/' to all audio file paths...

--- Data Preview (First 5 Rows) ---


Unnamed: 0,audio,text
0,/content/wavs/common_voice_eo_20917469.wav,Iom post iom tiu tamen akiris la respekton de ...
1,/content/wavs/common_voice_eo_20560866.wav,Dumtempe li edziĝis kaj iĝis patro.
2,/content/wavs/common_voice_eo_20260609.wav,"Numero unua: Pipo, tabakujo kaj alumetujo; por..."
3,/content/wavs/common_voice_eo_17899644.wav,Ameno diablon ne forpelas.
4,/content/wavs/common_voice_eo_20105659.wav,Ĝi estis vaporboato.



--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7478 entries, 0 to 7477
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   audio   7478 non-null   object
 1   text    7478 non-null   object
dtypes: object(2)
memory usage: 117.0+ KB

Writing data to 'esperanto_tts_dataset.parquet'...

✅ Success! Your Parquet file 'esperanto_tts_dataset.parquet' has been created.

--- Verifying Schema Against MrDragonFox/Elise ---
Loading original dataset from Hugging Face (this may take a moment)...

Your file's columns: ['audio', 'text']
Target dataset's columns: ['audio', 'text']
✅ Column names match.
✅ Data types match.

🎉 Schema verification successful! Your dataset structure matches the target.

--- Loading Your Local Parquet as a Hugging Face Dataset ---


Generating train split: 0 examples [00:00, ? examples/s]


✅ Your local dataset is now loaded and ready for training!
Preview of the first item (notice the 'audio' field is now processed):
{'audio': {'path': '/content/wavs/common_voice_eo_20917469.wav', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -3.05175781e-05, -3.05175781e-05,  0.00000000e+00]), 'sampling_rate': 22050}, 'text': 'Iom post iom tiu tamen akiris la respekton de la kapitano.'}


In [5]:
#@title Tokenization Function

import locale
import torchaudio.transforms as T
import os
import torch
from snac import SNAC
locale.getpreferredencoding = lambda: "UTF-8"
# Read the sampling rate once. Indexing dataset[0] inside tokenise_audio would decode the first clip on every call.
ds_sample_rate = dataset[0]["audio"]["sampling_rate"]

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")
def tokenise_audio(waveform):
  waveform = torch.from_numpy(waveform).unsqueeze(0)
  waveform = waveform.to(dtype=torch.float32)
  resample_transform = T.Resample(orig_freq=ds_sample_rate, new_freq=24000) # SNAC expects 24 kHz input
  waveform = resample_transform(waveform)

  waveform = waveform.unsqueeze(0).to("cuda")

  #generate the codes from snac
  with torch.inference_mode():
    codes = snac_model.encode(waveform)

  all_codes = []
  for i in range(codes[0].shape[1]):
    all_codes.append(codes[0][0][i].item()+128266)
    all_codes.append(codes[1][0][2*i].item()+128266+4096)
    all_codes.append(codes[2][0][4*i].item()+128266+(2*4096))
    all_codes.append(codes[2][0][(4*i)+1].item()+128266+(3*4096))
    all_codes.append(codes[1][0][(2*i)+1].item()+128266+(4*4096))
    all_codes.append(codes[2][0][(4*i)+2].item()+128266+(5*4096))
    all_codes.append(codes[2][0][(4*i)+3].item()+128266+(6*4096))


  return all_codes

def add_codes(example):
    # Always initialize codes_list to None
    codes_list = None

    try:
        answer_audio = example.get("audio")
        # If there's a valid audio array, tokenise it
        if answer_audio and "array" in answer_audio:
            audio_array = answer_audio["array"]
            codes_list = tokenise_audio(audio_array)
    except Exception as e:
        print(f"Skipping row due to error: {e}")
        # Keep codes_list as None if we fail
    example["codes_list"] = codes_list

    return example

dataset = dataset.map(add_codes, remove_columns=["audio"])

tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009

start_of_speech = tokeniser_length + 1
end_of_speech = tokeniser_length + 2

start_of_human = tokeniser_length + 3
end_of_human = tokeniser_length + 4

start_of_ai = tokeniser_length + 5
end_of_ai =  tokeniser_length + 6
pad_token = tokeniser_length + 7

audio_tokens_start = tokeniser_length + 10

dataset = dataset.filter(lambda x: x["codes_list"] is not None)
dataset = dataset.filter(lambda x: len(x["codes_list"]) > 0)

def remove_duplicate_frames(example):
    vals = example["codes_list"]
    if len(vals) % 7 != 0:
        raise ValueError("Input list length must be divisible by 7")

    result = vals[:7]

    removed_frames = 0

    for i in range(7, len(vals), 7):
        current_first = vals[i]
        previous_first = result[-7]

        if current_first != previous_first:
            result.extend(vals[i:i+7])
        else:
            removed_frames += 1

    example["codes_list"] = result

    return example

dataset = dataset.map(remove_duplicate_frames)

tok_info = '''*** HERE you can modify the text prompt
If you are training a multi-speaker model (e.g., canopylabs/orpheus-3b-0.1-ft),
ensure that the dataset includes a "source" field and format the input accordingly:
- Single-speaker: f"{example['text']}"
- Multi-speaker: f"{example['source']}: {example['text']}"
'''
print(tok_info)

def create_input_ids(example):
    # Determine whether to include the source field
    text_prompt = f"{example['source']}: {example['text']}" if "source" in example else example["text"]

    text_ids = tokenizer.encode(text_prompt, add_special_tokens=True)
    text_ids.append(end_of_text)

    example["text_tokens"] = text_ids
    input_ids = (
        [start_of_human]
        + example["text_tokens"]
        + [end_of_human]
        + [start_of_ai]
        + [start_of_speech]
        + example["codes_list"]
        + [end_of_speech]
        + [end_of_ai]
    )
    example["input_ids"] = input_ids
    example["labels"] = input_ids
    example["attention_mask"] = [1] * len(input_ids)

    return example


dataset = dataset.map(create_input_ids, remove_columns=["text", "codes_list"])
columns_to_keep = ["input_ids", "labels", "attention_mask"]
columns_to_remove = [col for col in dataset.column_names if col not in columns_to_keep]

dataset = dataset.remove_columns(columns_to_remove)



Map:   0%|          | 0/7478 [00:00<?, ? examples/s]

Filter:   0%|          | 0/7478 [00:00<?, ? examples/s]

Filter:   0%|          | 0/7478 [00:00<?, ? examples/s]

Map:   0%|          | 0/7478 [00:00<?, ? examples/s]

*** HERE you can modify the text prompt
If you are training a multi-speaker model (e.g., canopylabs/orpheus-3b-0.1-ft),
ensure that the dataset includes a "source" field and format the input accordingly:
- Single-speaker: f"{example['text']}"
- Multi-speaker: f"{example['source']}: {example['text']}"



Map:   0%|          | 0/7478 [00:00<?, ? examples/s]
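
The tokenization cell flattens SNAC's three code layers into groups of seven tokens: one coarse code, two mid-level codes, and four fine codes per frame, each slot shifted by its own multiple of 4096 so that the slots occupy disjoint ID ranges starting at 128266. The following pure-Python sketch reproduces that layout on toy integers and checks that it can be inverted, the way the inference cell does. The function names and sample codes are illustrative only.

```python
AUDIO_BASE = 128266   # first audio token ID (tokeniser_length + 10)
CODEBOOK = 4096       # offset step between the 7 slots
# (layer, index within the frame) for each of the 7 slots, in emission order
SLOTS = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (2, 2), (2, 3)]
PER_FRAME = [1, 2, 4]  # codes per frame in layers 0, 1, 2

def interleave(layers):
    """Flatten [coarse, mid, fine] code lists into 7-token frames."""
    tokens = []
    for frame in range(len(layers[0])):
        for slot, (layer, k) in enumerate(SLOTS):
            code = layers[layer][PER_FRAME[layer] * frame + k]
            tokens.append(code + AUDIO_BASE + slot * CODEBOOK)
    return tokens

def deinterleave(tokens):
    """Invert interleave(): recover the three code lists."""
    layers = [[], [], []]
    for i, token in enumerate(tokens):
        slot = i % 7
        layer, _ = SLOTS[slot]
        layers[layer].append(token - AUDIO_BASE - slot * CODEBOOK)
    return layers

# Two toy frames: 2 coarse, 4 mid, and 8 fine codes.
toy = [[5, 6], [10, 11, 12, 13], [20, 21, 22, 23, 24, 25, 26, 27]]
flat = interleave(toy)
print(flat[:7])
print(deinterleave(flat) == toy)
```

Because every slot has its own ID range, the position of a token within a frame can be recovered from its value alone, which makes malformed generations easier to spot.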

RAM Clear

In [25]:
# import torch

# torch.cuda.empty_cache()

# import gc

# # Delete large variables you are no longer using
# del model
# del trainer
# del dataset

# # Run garbage collection
# gc.collect()

2396

In [23]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

import re

def get_system_ram_usage():
    try:
        # Read memory information from /proc/meminfo
        with open('/proc/meminfo', 'r') as f:
            meminfo = f.read()

        # Use regular expressions to find the relevant values
        mem_total_kb = int(re.search(r'MemTotal:\s+(\d+)', meminfo).group(1))
        mem_available_kb = int(re.search(r'MemAvailable:\s+(\d+)', meminfo).group(1))

        # Calculate used memory
        mem_used_kb = mem_total_kb - mem_available_kb

        # Convert kilobytes to gigabytes
        total_ram_gb = round(mem_total_kb / (1024**2), 2)
        used_ram_gb = round(mem_used_kb / (1024**2), 2)
        free_ram_gb = round(mem_available_kb / (1024**2), 2)

        # Calculate percentage
        percent_used = round((mem_used_kb / mem_total_kb) * 100, 2)

        print("System RAM Usage (from /proc/meminfo):")
        print(f"Total: {total_ram_gb} GB")
        print(f"Used: {used_ram_gb} GB")
        print(f"Free (Available): {free_ram_gb} GB")
        print(f"Percentage Used: {percent_used}%")

    except FileNotFoundError:
        print("Could not find /proc/meminfo. This method only works on Linux-based systems.")
    except Exception as e:
        print(f"An error occurred: {e}")

# Run the function to get the stats
get_system_ram_usage()

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
25.645 GB of memory reserved.
System RAM Usage (from /proc/meminfo):
Total: 83.48 GB
Used: 76.35 GB
Free (Available): 7.13 GB
Percentage Used: 91.46%


<a name="Train"></a>
### Train the model
Now let's use Hugging Face's `Trainer`! More docs here: [Transformers docs](https://huggingface.co/docs/transformers/main_classes/trainer). The cell below trains for one full epoch (`num_train_epochs = 1`). For a quick test run instead, uncomment `max_steps` and set it to a small number of steps.

**Note:** Using a `per_device_train_batch_size` > 1 may lead to errors in a multi-GPU setup. To avoid issues, ensure `CUDA_VISIBLE_DEVICES` is set to a single GPU (e.g., `CUDA_VISIBLE_DEVICES=0`).
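
It can help to check how many optimizer steps one epoch actually contains before choosing `warmup_steps` and `save_steps`. The sketch below does that arithmetic for the values used in this notebook (7478 examples before filtering, batch size 96, 2 accumulation steps, one GPU). The helper name is hypothetical, and the rounding only approximates how the trainer counts steps.

```python
import math

def optimizer_steps_per_epoch(n_examples, batch_size, grad_accum, n_gpus=1):
    """Approximate number of optimizer updates in one pass over the data."""
    batches = math.ceil(n_examples / (batch_size * n_gpus))
    return max(math.ceil(batches / grad_accum), 1)

steps_per_epoch = optimizer_steps_per_epoch(n_examples=7478, batch_size=96, grad_accum=2)
warmup_steps, save_steps = 100, 100  # values from the training cell
print(f"about {steps_per_epoch} optimizer steps per epoch")
if warmup_steps >= steps_per_epoch:
    print("warmup_steps covers the whole run, so the peak learning rate is never reached")
if save_steps > steps_per_epoch:
    print("save_steps exceeds the run length, so no intermediate checkpoint is written")
```

If either message appears, consider a smaller batch size, fewer warmup steps, or a shorter save interval for a one-epoch run.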

In [26]:


from transformers import TrainingArguments,Trainer,DataCollatorForSeq2Seq
tokenizer.pad_token_id = 128263
data_collator = DataCollatorForSeq2Seq(
  tokenizer,
  pad_to_multiple_of = 8,
  padding = True,
)

trainer = Trainer(
    model = model,
    train_dataset = dataset,
    args = TrainingArguments(
        per_device_train_batch_size = 96,
        gradient_accumulation_steps = 2,
        warmup_steps = 100,
        num_train_epochs = 1, # Set this for 1 full training run.
        # max_steps = 5,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
        save_strategy="steps", # Add this to save checkpoints
        save_steps=100, # Add this to specify how often to save checkpoints
        save_total_limit=2, # Optional: Limit the number of checkpoints to save
    ),
    data_collator=data_collator,
)
# To resume from a checkpoint, uncomment the line below and replace "path/to/checkpoint"
# with the actual path to your checkpoint directory.
# trainer_stats = trainer.train(resume_from_checkpoint="path/to/checkpoint")
trainer_stats = trainer.train()


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the prompts.
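
Before generation, the inference cell wraps each tokenized prompt in the same control tokens used during training (start of human, then the text, then end of text and end of human) and left-pads shorter prompts so that a batch lines up on the right. Here is a list-based sketch of that step, with made-up text token IDs standing in for real tokenizer output and a hypothetical helper name:

```python
START_OF_HUMAN = 128259
END_OF_TEXT = 128009
END_OF_HUMAN = 128260
PAD = 128263

def wrap_and_pad(batch_text_ids):
    """Wrap each prompt in control tokens, then left-pad to a common length."""
    wrapped = [[START_OF_HUMAN] + ids + [END_OF_TEXT, END_OF_HUMAN]
               for ids in batch_text_ids]
    width = max(len(ids) for ids in wrapped)
    input_ids, attention_mask = [], []
    for ids in wrapped:
        n_pad = width - len(ids)
        input_ids.append([PAD] * n_pad + ids)
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask

# Hypothetical token IDs; a real tokenizer would produce these from text.
ids, mask = wrap_and_pad([[128000, 11, 12, 13], [128000, 21]])
for row, m in zip(ids, mask):
    print(row, m)
```

Padding goes on the left because generation continues from the right-hand end of each row.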



In [None]:
# prompts = [
#     "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
# ]
prompts = [
    # "Ĉi tiu esprimo povus esti dirata ankaŭ male: Diris Zamenhof al Klara"
    "Ĉiu lingvo daŭre ŝanĝiĝas, dum ĝi estas parolata!"
]
lang = "eo"
dataset_length = "10h"
steps = "60st"
epochs = ""

chosen_voice = None # None for single-speaker

In [None]:
#@title Run Inference


FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Moving snac_model cuda to cpu
snac_model.to("cpu")

prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=128258,
     use_cache = True
  )
token_to_find = 128257
token_to_remove = 128258

token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
    cropped_tensor = generated_ids

mask = cropped_tensor != token_to_remove

processed_rows = []

for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

code_lists = []

for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = row[:new_length]
    trimmed_row = [t - 128266 for t in trimmed_row]
    code_lists.append(trimmed_row)


def redistribute_codes(code_list):
  layer_1 = []
  layer_2 = []
  layer_3 = []
  for i in range((len(code_list)+1)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes]
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)

import os
from scipy.io.wavfile import write as write_wav
from IPython.display import display, Audio
import re

# Specify the path to save your audio files
output_path = "generated_audio"

# Create the directory if it doesn't already exist
os.makedirs(output_path, exist_ok=True)

# Define the sample rate
sample_rate = 24000

if len(prompts) != len(my_samples):
  raise Exception("Number of prompts and samples do not match")
else:
  for i in range(len(my_samples)):
    prompt_text = prompts[i]
    samples = my_samples[i]

    print(f"Processing prompt: '{prompt_text}'")

    # Prepare the audio data
    audio_data = samples.detach().squeeze().to("cpu").numpy()

    # --- NEW: Sanitize the prompt to create a valid filename ---
    # This joins all words of the prompt with underscores and removes characters that are invalid in filenames
    safe_prompt_text = "_".join(prompt_text.split())
    safe_prompt_text = re.sub(r'[\\/*?:"<>|]', "", safe_prompt_text)

    # --- NEW: Construct the filename using your variables ---
    filename = f"{safe_prompt_text}_orpheus_{lang}_{dataset_length}_{steps}.wav"
    file_path = os.path.join(output_path, filename)

    # Save the audio data to the specified path
    write_wav(file_path, sample_rate, audio_data)
    print(f"✅ Audio saved to: {file_path}")

    # You can still display the audio in the notebook as before
    display(Audio(audio_data, rate=sample_rate))
    print("-" * 30)

# Clean up to save RAM
del my_samples, samples, audio_data
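
The post-processing in the cell above can be summarised on a plain list: keep only what follows the last start-of-speech token, drop end-of-speech tokens, trim to a whole number of 7-token frames, and subtract the audio offset. The sketch below assumes a single generated sequence; the helper name is hypothetical and the sample IDs are invented for illustration.

```python
START_OF_SPEECH = 128257
END_OF_SPEECH = 128258
AUDIO_BASE = 128266

def extract_audio_codes(generated):
    """Turn one generated ID sequence into offset-free audio codes."""
    # Crop to the tokens after the last start-of-speech marker, if any.
    if START_OF_SPEECH in generated:
        last = len(generated) - 1 - generated[::-1].index(START_OF_SPEECH)
        generated = generated[last + 1:]
    # Remove end-of-speech tokens and any incomplete trailing frame.
    kept = [t for t in generated if t != END_OF_SPEECH]
    kept = kept[: (len(kept) // 7) * 7]
    return [t - AUDIO_BASE for t in kept]

# Invented sequence: prompt tokens, start of speech, 9 audio tokens, end of speech.
sequence = [128259, 42, 128260, 128257] + [AUDIO_BASE + i for i in range(9)] + [128258]
print(extract_audio_codes(sequence))
```

The two leftover tokens of the incomplete second frame are discarded, since the decoder needs complete frames.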


In [None]:
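# Alternative save step that names files by index (output_0.wav, ...).
# Run it instead of the save step in the previous cell, not after it: that cell deletes my_samples when it finishes.
# Import libraries for saving the audio file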
import os
from scipy.io.wavfile import write as write_wav
from IPython.display import display, Audio

# --- NEW: Specify the path to save your audio files ---
output_path = "generated_audio" # You can change "generated_audio" to any folder name

# --- NEW: Create the directory if it doesn't already exist ---
os.makedirs(output_path, exist_ok=True)

# Define the sample rate
sample_rate = 24000

if len(prompts) != len(my_samples):
  raise Exception("Number of prompts and samples do not match")
else:
  for i in range(len(my_samples)):
    prompt_text = prompts[i]
    samples = my_samples[i]

    print(f"Processing prompt: '{prompt_text}'")

    # Prepare the audio data
    audio_data = samples.detach().squeeze().to("cpu").numpy()

    # --- NEW: Define the full file path for the output WAV file ---
    # We'll name the file based on its index, e.g., "output_0.wav"
    file_path = os.path.join(output_path, f"output_{i}.wav")

    # --- NEW: Save the audio data to the specified path ---
    write_wav(file_path, sample_rate, audio_data)
    print(f"✅ Audio saved to: {file_path}")

    # You can still display the audio in the notebook as before
    display(Audio(audio_data, rate=sample_rate))
    print("-" * 30) # Adds a separator for clarity

# Clean up to save RAM
del my_samples, samples, audio_data

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

### Saving to float16

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


And we're done! If you have any questions about Unsloth, find any bugs, need help, or want to join projects and keep up with the latest LLM developments, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

