# Fine-tuning the unsloth/orpheus-3b model on the kazakh-corpus2 dataset

### Installing Dependencies
This cell installs the libraries needed for fine-tuning. It checks whether the notebook is running in Google Colab and installs accordingly:
*   unsloth: the main fine-tuning library.
*   bitsandbytes, accelerate, xformers, etc.: libraries for memory-efficient and faster training.
*   snac: the audio codec used at inference time to decode generated tokens into a waveform.

In [2]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth
!pip install snac

In [3]:
# Log in to the Hugging Face Hub
from huggingface_hub import login
login()


### Model Setup
Initializes the base model using FastLanguageModel from unsloth.
*   dtype = None: allows automatic precision detection.
*   load_in_4bit = False: disables 4-bit quantization (which would save VRAM but may reduce precision).

In [None]:
from unsloth import FastLanguageModel
import torch
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/orpheus-3b-0.1-ft",
    max_seq_length= 2048, # Choose any for long context!
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



### Applying LoRA Adapters
Applies Low-Rank Adaptation (LoRA) to the model using get_peft_model():
*   r = 64: rank of the LoRA adaptation matrices.
*   target_modules: the transformer projections to adapt (attention and feed-forward).
*   lora_alpha = 64: scaling factor for the LoRA update; with use_rslora = False the update is scaled by lora_alpha / r, which is 1 here.
*   lora_dropout = 0: dropout on the LoRA branch, disabled here.

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
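As a rough check on how many parameters these settings train, the sketch below counts LoRA weights per adapted projection: each adapter adds r × (in_features + out_features) weights. The 3072-wide hidden size and the 28 layers appear in this notebook's logs; the 1024-wide k_proj/v_proj outputs and the 8192-wide MLP are assumptions taken from the Llama-3.2-3B configuration, and the helper names are illustrative.

```python
# Sketch: count LoRA parameters for r = 64 on the seven target modules.
# Shapes are (in_features, out_features). The k/v and MLP widths are
# assumptions (Llama-3.2-3B config), not values printed by this notebook.
R = 64
NUM_LAYERS = 28
TOTAL_PARAMS = 3_398_122_496  # total reported in the training log

MODULE_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}

def lora_params(in_features, out_features, r=R):
    # lora_A has r x in_features weights, lora_B has out_features x r weights.
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o) for i, o in MODULE_SHAPES.values())
trainable = per_layer * NUM_LAYERS
share = 100 * trainable / TOTAL_PARAMS

print(trainable)        # 97255424
print(f"{share:.2f}%")  # 2.86%
```

Under these assumptions the count agrees with the "Trainable parameters" figure that Unsloth prints when training starts.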

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.5.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


### Dataset Loading and Chunking
Defines a preprocessing function, split_long_rows_batch(), that splits input sequences longer than 2048 tokens into chunks of at most 2048 tokens, so that no example exceeds the model's maximum sequence length.

This notebook uses an already tokenized dataset, "adilet11/kazakh-corpus2-preprocessed". If you start from a raw dataset, you need to preprocess it first.

In [None]:
from datasets import load_dataset
import numpy as np

MAX_LEN = 2048

def split_long_rows_batch(batch):
    out_input_ids, out_labels, out_masks = [], [], []
    for ids in batch["input_ids"]:               # ids is already a list[int]
        for i in range(0, len(ids), MAX_LEN):
            chunk = ids[i : i + MAX_LEN]
            if chunk:
                out_input_ids.append(chunk)
                out_labels.append(chunk)
                out_masks.append([1] * len(chunk))
    return {
        "input_ids":      out_input_ids,
        "labels":         out_labels,
        "attention_mask": out_masks,
    }

raw_ds = load_dataset(
    "adilet11/kazakh-corpus2-preprocessed",
    split="train",
    streaming=False,
)

dataset = raw_ds.map(
    split_long_rows_batch,
    batched=True,
    batch_size=1000,
    num_proc=4,                # run in parallel
    remove_columns=raw_ds.column_names,
)

print(dataset[0]["input_ids"][:10], len(dataset[0]["input_ids"]))     # ≤ 2048


[128259, 128000, 17394, 119793, 7094, 17721, 58317, 1506, 142, 96] 706
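To see what the chunking step does, here is a self-contained toy run of the same slicing logic with a hypothetical limit of 4 tokens instead of 2048. The function name and the sample rows are made up for illustration.

```python
# Toy version of the chunking logic, with a small limit so the effect is visible.
TOY_MAX_LEN = 4  # hypothetical; the notebook uses 2048

def split_rows(rows, max_len=TOY_MAX_LEN):
    chunks = []
    for ids in rows:
        # range() yields nothing for an empty row, so empty rows are dropped.
        for start in range(0, len(ids), max_len):
            chunks.append(ids[start:start + max_len])
    return chunks

rows = [[1, 2, 3], [10, 11, 12, 13, 14, 15, 16, 17, 18], []]
chunks = split_rows(rows)
print(chunks)  # [[1, 2, 3], [10, 11, 12, 13], [14, 15, 16, 17], [18]]
```

Short rows pass through unchanged and long rows multiply. In this run the 264,952 raw rows became 265,123 rows (the sum of the three split sizes printed in the next section), a net increase of 171 rows.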


### Dataset Splitting
Splits the dataset into:

*   90% training set
*   5% validation set
*   5% test set

The validation set is used for periodic evaluation during training; the test set is held out for the final evaluation.

In [None]:
train_val = dataset.train_test_split(test_size=0.10, seed=42)   # 90 % train
train_ds  = train_val["train"]            # 90 %
tmp_ds    = train_val["test"]             # leftover 10 %

val_test  = tmp_ds.train_test_split(test_size=0.50, seed=42)
val_ds    = val_test["train"]             # 5 %
test_ds   = val_test["test"]              # 5 %

print(train_ds.num_rows, val_ds.num_rows, test_ds.num_rows)


238610 13256 13257
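The printed sizes can be reproduced with plain arithmetic. The sketch below assumes the test share is rounded up and the remainder goes to the train side; that rounding rule is an assumption about `train_test_split`, although it does reproduce the numbers above. The function name is illustrative.

```python
import math

def split_sizes(n_rows, test_size):
    # Assumed rounding: test share rounded up, the rest goes to train.
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

total_rows = 238610 + 13256 + 13257       # rows after chunking
n_train, n_tmp = split_sizes(total_rows, 0.10)
n_val, n_test = split_sizes(n_tmp, 0.50)  # val is the "train" half of the second split
print(n_train, n_val, n_test)             # 238610 13256 13257
```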


<a name="Train"></a>
### Train the model
Now let's use the Hugging Face `Trainer`! More docs here: [Transformers docs](https://huggingface.co/docs/transformers/main_classes/trainer). We set `num_train_epochs=2` for a full run and leave `max_steps` unset.

**Note:** Using a per_device_train_batch_size > 1 may lead to errors in a multi-GPU setup. To avoid issues, make sure CUDA_VISIBLE_DEVICES is set to a single GPU (e.g., CUDA_VISIBLE_DEVICES=0).

In [None]:
# Install library
!pip install wandb --upgrade

# Setting up Wandb
!wandb login

import os

os.environ["WANDB_PROJECT"] = "orpheus_3B_fine-tuning_kz_v5.0"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33madikmath11[0m ([33madikmath11-inn-lab[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
import torch
from typing import List

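# Compare against None explicitly so that a valid pad id of 0 is not skipped.
pad_id    = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id   # 128004
label_pad = -100                                               # ignored by the loss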

def tts_collator(batch: List[dict]):
    # let's find the length of the longest example in this batch
    max_len = max(len(ex["input_ids"]) for ex in batch)

    input_ids, labels, attn = [], [], []
    for ex in batch:
        seq_len = len(ex["input_ids"])
        pad_len = max_len - seq_len

        # ------- input_ids ---------
        input_ids.append(ex["input_ids"] + [pad_id] * pad_len)

        # ------- attention_mask ----
        attn.append(ex.get("attention_mask", [1]*seq_len) + [0] * pad_len)

        # ------- labels ------------
        # either take the ready ex["labels"], or copy the input_ids
        lbl = ex.get("labels", ex["input_ids"]).copy()
        labels.append(lbl + [label_pad] * pad_len)

    return {
        "input_ids":      torch.tensor(input_ids, dtype=torch.long),
        "labels":         torch.tensor(labels,    dtype=torch.long),
        "attention_mask": torch.tensor(attn,      dtype=torch.long),
    }
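
The padding logic is easier to inspect without tensors. The following sketch applies the same right-padding rules to a toy batch using plain lists; the pad id 0 and the sample token ids are placeholders, not values from the tokenizer.

```python
PAD_ID = 0        # placeholder; the notebook uses the tokenizer's pad id
LABEL_PAD = -100  # positions with this label are ignored by the loss

def pad_batch(batch):
    # Pad every example on the right to the longest example in the batch.
    max_len = max(len(ex["input_ids"]) for ex in batch)
    out = {"input_ids": [], "labels": [], "attention_mask": []}
    for ex in batch:
        seq_len = len(ex["input_ids"])
        pad_len = max_len - seq_len
        out["input_ids"].append(ex["input_ids"] + [PAD_ID] * pad_len)
        out["labels"].append(ex["input_ids"] + [LABEL_PAD] * pad_len)
        out["attention_mask"].append([1] * seq_len + [0] * pad_len)
    return out

toy = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
print(toy["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(toy["labels"])          # [[5, 6, 7], [8, -100, -100]]
print(toy["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Padding positions carry an attention mask of 0 and a label of -100, so they neither influence attention nor contribute to the loss.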

In [None]:
from transformers import TrainingArguments, Trainer
from unsloth import is_bfloat16_supported


trainer = Trainer(
    model = model,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=tts_collator,
    args = TrainingArguments(
        output_dir="outputs/run_a100",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,   # eff batch 16
        learning_rate       = 1e-4,
        warmup_ratio        = 0.03,
        num_train_epochs    = 2,           # ≈30 k steps
        bf16                = True,
        fp16                = False,
        optim               = "adamw_8bit",
        weight_decay        = 0.0,
        lr_scheduler_type   = "linear",
        logging_steps       = 25,
        save_steps          = 2000,
        save_total_limit    = 3,
        eval_strategy = "steps",
        eval_steps          = 2000,
        report_to           = ["wandb"],
        load_best_model_at_end = True,
        seed                = 3407,
    ),
)
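
The step count noted in the comment above follows from the dataset size and the batch settings. The sketch below reproduces it, assuming that a partial accumulation group at the end of an epoch is not counted as an optimizer step and that warmup steps are rounded up; both rounding rules are assumptions about `Trainer` internals rather than something this notebook states.

```python
import math

num_examples = 238_610   # training rows printed after the split
per_device_batch = 2
grad_accum = 8
epochs = 2
warmup_ratio = 0.03

effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(num_examples / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum   # assumed: rounded down
total_steps = epochs * updates_per_epoch
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed: rounded up

print(effective_batch)  # 16
print(total_steps)      # 29826
print(warmup_steps)     # 895
```

The total of 29,826 matches the value Unsloth reports when training starts. With save_steps = 2000, the last checkpoints fall at steps 26000 and 28000, followed by the final one at 29826.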

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
7.15 GB of memory reserved.


In [None]:
import wandb
run = wandb.init()
artifact = run.use_artifact('adikmath11-inn-lab/orpheus_3B_fine-tuning_kz_v5.0/model-fefi38k6:v4', type='model')
artifact_dir = artifact.download()

[34m[1mwandb[0m: Currently logged in as: [33madikmath11[0m ([33madikmath11-inn-lab[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: Downloading large artifact model-fefi38k6:v4, 561.05MB. 8 files... 
[34m[1mwandb[0m:   8 of 8 files downloaded.  
Done. 0:0:24.7 (22.7MB/s)


In [None]:
trainer_stats = trainer.train(resume_from_checkpoint=artifact_dir)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 238,610 | Num Epochs = 2 | Total steps = 29,826
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 97,255,424/3,398,122,496 (2.86% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 26000 | 4.1155 | 4.099775 |
| 28000 | 4.0343 | 4.099735 |


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
[34m[1mwandb[0m: Adding directory to artifact (./outputs/run_a100/checkpoint-26000)... Done. 1.3s
[34m[1mwandb[0m: Adding directory to artifact (./outputs/run_a100/checkpoint-28000)... Done. 1.3s
[34m[1mwandb[0m: Adding directory to artifact (./outputs/run_a100/checkpoint-29826)... Done. 1.4s


In [None]:
test_metrics = trainer.evaluate(test_ds, metric_key_prefix="test")
print(test_metrics)


{'test_loss': 4.113007545471191, 'test_runtime': 1421.2347, 'test_samples_per_second': 9.328, 'test_steps_per_second': 1.167, 'epoch': 1.999991618121621}
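
If it helps to read the test loss on another scale, a mean cross-entropy loss (in nats) converts to perplexity by exponentiation. This is a generic conversion, not a metric computed by the notebook, and the variable names are illustrative.

```python
import math

test_loss = 4.113007545471191     # value printed above
perplexity = math.exp(test_loss)  # perplexity = e^(mean cross-entropy)
print(round(perplexity, 1))       # 61.1
```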


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

19107.3822 seconds used for training.
318.46 minutes used for training.
Peak reserved memory = 17.314 GB.
Peak reserved memory for training = 10.164 GB.
Peak reserved memory % of max memory = 43.77 %.
Peak reserved memory for training % of max memory = 25.695 %.


### Inference
Let's run the model!

In [None]:
prompts = [
    "Қандай керемет күн! Менің жүрегім қуанышпен соғып тұр. Сүйікті адамымның күлкісі - менің өмірімдегі ең әдемі дыбыс. Менің оған деген махаббатым шексіз, теңіздей терең!",
]

chosen_voice = None # None for single-speaker

In [None]:
#@title Run Inference
from snac import SNAC

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Keep snac_model on the CPU
snac_model.to("cpu")

prompts_ = [(f"{chosen_voice}: " + p) if  chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1500,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=128258,
     use_cache = True
  )
token_to_find = 128257
token_to_remove = 128258

token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
    cropped_tensor = generated_ids

mask = cropped_tensor != token_to_remove

processed_rows = []

for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

code_lists = []

for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = row[:new_length]
    trimmed_row = [t - 128266 for t in trimmed_row]
    code_lists.append(trimmed_row)


def redistribute_codes(code_list):
  layer_1 = []
  layer_2 = []
  layer_3 = []
  for i in range((len(code_list)+1)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes]
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
  raise Exception("Number of prompts and samples do not match")
else:
  for i in range(len(my_samples)):
    print(prompts[i])
    samples = my_samples[i]
    display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
# Clean up to save RAM
del my_samples,samples

Қандай керемет күн! Менің жүрегім қуанышпен соғып тұр. Сүйікті адамымның күлкісі - менің өмірімдегі ең әдемі дыбыс. Менің оған деген махаббатым шексіз, теңіздей терең!
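
The prompt layout built in the cell above can be sketched with plain lists: each tokenized prompt is wrapped with a start-of-human token in front and end-of-text and end-of-human tokens behind, then left-padded to the longest prompt in the batch. The special token ids are the ones used in the cell; the text token ids and the function names below are made up.

```python
START_OF_HUMAN = 128259
END_OF_TEXT, END_OF_HUMAN = 128009, 128260
PAD = 128263

def frame(text_ids):
    # Wrap tokenized text with the special tokens used at inference.
    return [START_OF_HUMAN] + text_ids + [END_OF_TEXT, END_OF_HUMAN]

def left_pad(seqs):
    # Pad on the left and build the matching attention masks.
    width = max(len(s) for s in seqs)
    ids = [[PAD] * (width - len(s)) + s for s in seqs]
    mask = [[0] * (width - len(s)) + [1] * len(s) for s in seqs]
    return ids, mask

ids, mask = left_pad([frame([11, 12, 13]), frame([21])])
print(ids[1])   # [128263, 128263, 128259, 21, 128009, 128260]
print(mask[1])  # [0, 0, 1, 1, 1, 1]
```

Left padding keeps the real tokens adjacent to the position where generation starts, which is why the cell pads at the front instead of the back.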


### Saving and loading fine-tuned models

In [None]:
model.save_pretrained("/content/drive/MyDrive/orpheus_3B_fine-tuning_kz_v2.0/orpheus_3B_fine-tuned_kz_model_v2.0")  # Local saving
tokenizer.save_pretrained("/content/drive/MyDrive/orpheus_3B_fine-tuning_kz_v2.0/orpheus_3B_fine-tuned_kz_tokenizer_v2.0")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('/content/drive/MyDrive/orpheus_3B_fine-tuning_kz_v5.0/orpheus_3B_fine-tuned_kz_tokenizer_v5.0/tokenizer_config.json',
 '/content/drive/MyDrive/orpheus_3B_fine-tuning_kz_v5.0/orpheus_3B_fine-tuned_kz_tokenizer_v5.0/special_tokens_map.json',
 '/content/drive/MyDrive/orpheus_3B_fine-tuning_kz_v5.0/orpheus_3B_fine-tuned_kz_tokenizer_v5.0/tokenizer.json')

In [7]:
model.push_to_hub("adilet11/orpheus_3B_fine-tuned_kz_v2.0", token = "") # Online saving
tokenizer.push_to_hub("adilet11/orpheus_3B_fine-tuned_kz_v2.0", token = "") # Online saving


Saved model to https://huggingface.co/adilet11/orpheus_3B_fine-tuned_kz_v2.0



## Running inference with the fine-tuned model

In [4]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "/content/drive/MyDrive/orpheus_3B_fine-tuning_kz_v2.0/orpheus_3B_fine-tuned_kz_model_v2.0",   # directory containing adapter_config.json and adapter_model.safetensors
    max_seq_length = 4096,
    dtype          = torch.bfloat16 if is_bfloat16_supported() else torch.float16,
    load_in_4bit   = True
)
FastLanguageModel.for_inference(model)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.8: Fast Llama patching. Transformers: 4.52.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Unsloth 2025.5.8 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(156940, 3072, padding_idx=128004)
        (layers): ModuleList(
          (0): LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear(
      

In [5]:
prompts = [
    "Оу! Бұл не?! Керемет! Мұндай ғажайыпты ешқашан көрген емеспін! Қалай мүмкін? Бұл шынымен де болуы мүмкін бе?",
]

chosen_voice = None # None for single-speaker

In [6]:
#@title Run Inference
from snac import SNAC

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")

# Keep snac_model on the CPU
snac_model.to("cpu")

prompts_ = [(f"{chosen_voice}: " + p) if  chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=128258,
     use_cache = True
  )
token_to_find = 128257
token_to_remove = 128258

token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
    cropped_tensor = generated_ids

mask = cropped_tensor != token_to_remove

processed_rows = []

for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

code_lists = []

for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = row[:new_length]
    trimmed_row = [t - 128266 for t in trimmed_row]
    code_lists.append(trimmed_row)


def redistribute_codes(code_list):
  layer_1 = []
  layer_2 = []
  layer_3 = []
  for i in range((len(code_list)+1)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes]
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
  raise Exception("Number of prompts and samples do not match")
else:
  for i in range(len(my_samples)):
    print(prompts[i])
    samples = my_samples[i]
    display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
# Clean up to save RAM
del my_samples,samples


Оу! Бұл не?! Керемет! Мұндай ғажайыпты ешқашан көрген емеспін! Қалай мүмкін? Бұл шынымен де болуы мүмкін бе?
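
The decoding step in the cell above groups the generated audio tokens into frames of seven and routes them to three SNAC codebook layers (1, 2, and 4 codes per frame), removing a position-dependent offset of 4096 per slot. The sketch below round-trips one toy frame with plain lists; the `interleave` helper and the sample codes are hypothetical and exist only to check the inverse mapping.

```python
AUDIO_OFFSET = 128266  # subtracted from generated token ids in the cell above
SLOT = 4096            # per-position offset within a 7-token frame
# Layer (0-based) that each of the 7 frame positions belongs to.
LAYER_OF_POSITION = [0, 1, 2, 2, 1, 2, 2]

def deinterleave(token_ids):
    layers = [[], [], []]
    usable = len(token_ids) - len(token_ids) % 7  # drop an incomplete frame
    for idx in range(usable):
        pos = idx % 7
        code = token_ids[idx] - AUDIO_OFFSET - pos * SLOT
        layers[LAYER_OF_POSITION[pos]].append(code)
    return layers

def interleave(l1, l2, l3):
    # Hypothetical inverse for a single frame, used only for the check below.
    codes = [l1[0], l2[0], l3[0], l3[1], l2[1], l3[2], l3[3]]
    return [c + AUDIO_OFFSET + pos * SLOT for pos, c in enumerate(codes)]

tokens = interleave([7], [100, 200], [1, 2, 3, 4])
print(deinterleave(tokens))  # [[7], [100, 200], [1, 2, 3, 4]]
```

In the notebook, the three resulting lists are turned into tensors and passed to `snac_model.decode` to obtain the waveform.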
