🎯 Fine-tune Wav2Vec2-base on Kurmanci

| Step | Task                                           |
| ---- | ---------------------------------------------- |
| 1️⃣  | Preprocess Mozilla `train.tsv` for Kurmanci    |
| 2️⃣  | Convert to HuggingFace `datasets.Dataset`      |
| 3️⃣  | Prepare tokenizer (use dummy CTC tokenizer)    |
| 4️⃣  | Configure Trainer + TrainingArguments          |
| 5️⃣  | Fine-tune Wav2Vec2-base with CTC loss          |
| 6️⃣  | Evaluate on `dev.tsv` and compare with Whisper |


✅ Pipeline Summary
Preload tokenizer & processor

Load + filter train.tsv (1%) & dev.tsv

Preprocess audio and text

Prepare model on GPU (.to("cuda"))

Fine-tune with validation steps

Evaluate WER on test.tsv after training




Cell 1: Install & Import Dependencies
This ensures all required Hugging Face and audio tools are ready.

In [None]:
#✅ Cell 1: Install & Import + Setup Root Paths
import os, re, json, torch
import pandas as pd
import librosa, torchaudio
import numpy as np
from pathlib import Path

from datasets import Dataset, Audio
from transformers import (
    Wav2Vec2ForCTC, Wav2Vec2Processor,
    Wav2Vec2FeatureExtractor, Wav2Vec2CTCTokenizer,
    TrainingArguments, Trainer,
)
from torch.nn.utils.rnn import pad_sequence
import evaluate

jiwer = evaluate.load("wer")

# Setup paths
PROJECT_ROOT = Path(__file__).resolve().parents[1] if "__file__" in globals() else Path.cwd().parents[1]
DATA_DIR = PROJECT_ROOT / "data" / "kurmanci"
MODELS_DIR = PROJECT_ROOT / "models"
OUTPUT_DIR = MODELS_DIR / "wav2vec2-kurmanci-final"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


In [None]:
#✅ Cell 2: Load Kurmanci Dataset (1%)
train_path = "../data/kurmanci/train.tsv"
dev_path = "../data/kurmanci/dev.tsv"
clips_path = "../data/kurmanci/clips"

train_df = pd.read_csv(train_path, sep="\t").dropna(subset=["path", "sentence"])
dev_df = pd.read_csv(dev_path, sep="\t").dropna(subset=["path", "sentence"])
train_df = train_df.sample(frac=0.01, random_state=42)

train_df["path"] = train_df["path"].apply(lambda x: os.path.join(clips_path, x))
dev_df["path"] = dev_df["path"].apply(lambda x: os.path.join(clips_path, x))

train_ds = Dataset.from_pandas(train_df[["path", "sentence"]])
dev_ds = Dataset.from_pandas(dev_df[["path", "sentence"]])

train_ds = train_ds.cast_column("path", Audio(sampling_rate=16000))
dev_ds = dev_ds.cast_column("path", Audio(sampling_rate=16000))


In [18]:
train_ds["path"]

[{'path': '../data/kurmanci/clips\\common_voice_kmr_35060090.mp3',
  'array': array([ 1.27329258e-11,  2.91038305e-11,  1.45519152e-11, ...,
          3.83282409e-20, -2.89579389e-20,  2.30187821e-20], shape=(43200,)),
  'sampling_rate': 16000},
 {'path': '../data/kurmanci/clips\\common_voice_kmr_35064883.mp3',
  'array': array([-7.27595761e-11,  1.45519152e-11,  1.45519152e-10, ...,
          5.04088291e-07, -1.02555339e-06, -6.17599312e-07], shape=(46656,)),
  'sampling_rate': 16000},
 {'path': '../data/kurmanci/clips\\common_voice_kmr_35249487.mp3',
  'array': array([ 2.18278728e-11,  1.09139364e-11, -1.09139364e-11, ...,
          2.05973083e-06,  1.18813768e-05, -1.21401281e-05], shape=(35136,)),
  'sampling_rate': 16000},
 {'path': '../data/kurmanci/clips\\common_voice_kmr_25313893.mp3',
  'array': array([3.63797881e-12, 1.81898940e-12, 1.09139364e-11, ...,
         1.41435885e-06, 4.42025339e-06, 1.39267922e-06], shape=(79488,)),
  'sampling_rate': 16000},
 {'path': '../data/kur

In [None]:
#✅ Cell 3: Build Character-Level Tokenizer and feature extractor
vocab_set = set()
for s in train_df["sentence"]:
    vocab_set.update(list(s.lower()))

vocab_list = sorted(list(vocab_set - {" "}))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict["|"] = len(vocab_dict)
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

vocab_path = PROJECT_ROOT / "kurmanci_vocab.json"
with vocab_path.open("w", encoding="utf-8") as f:
    json.dump(vocab_dict, f)

tokenizer = Wav2Vec2CTCTokenizer(
    str(vocab_path), unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000,
    padding_value=0.0, do_normalize=True,
    return_attention_mask=True
)

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

processor

Wav2Vec2Processor:
- feature_extractor: Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}

- tokenizer: Wav2Vec2CTCTokenizer(name_or_path='', vocab_size=40, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '[UNK]', 'pad_token': '[PAD]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	38: AddedToken("[UNK]", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	39: AddedToken("[PAD]", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	40: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	41: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False,

Cell 4: Preprocess the dataset (audio → input + label)

In [None]:
#✅ Cell 4: Preprocess Audio/Text
def preprocess(batch):
    audio = batch["path"]
    inputs = processor(audio["array"], sampling_rate=16000, return_attention_mask=True)
    with processor.as_target_processor():
        labels = processor(batch["sentence"]).input_ids
    return {
        "input_values": inputs.input_values[0],
        "attention_mask": inputs.attention_mask[0],
        "labels": labels,
    }

train_ds = train_ds.map(preprocess, remove_columns=train_ds.column_names)
dev_ds = dev_ds.map(preprocess, remove_columns=dev_ds.column_names)


Map: 100%|██████████| 53/53 [00:00<00:00, 352.79 examples/s]
Map: 100%|██████████| 3973/3973 [00:14<00:00, 265.64 examples/s]


In [25]:
(train_ds["labels"])

[[38, 31, 37, 19, 10, 7, 33, 13, 19, 14, 15, 34, 18, 4],
 [38, 28, 37, 7, 14, 17, 14, 19, 9, 37, 19, 10, 7, 35, 30, 10, 4],
 [38, 27, 37, 8, 34, 37, 27, 6, 17, 6, 37, 30, 10, 5],
 [38,
  14,
  37,
  24,
  10,
  23,
  37,
  28,
  6,
  19,
  37,
  29,
  10,
  7,
  6,
  25,
  10,
  16,
  10,
  37,
  31,
  6,
  19,
  14,
  24,
  25,
  34,
  37,
  19,
  10,
  13,
  6,
  25,
  14,
  30,
  10,
  37,
  16,
  14,
  23,
  14,
  19,
  4],
 [38, 6, 23, 33, 37, 25, 10, 37, 32, 14, 37, 30, 10, 5],
 [38, 14, 23, 10, 27, 10, 4],
 [38, 28, 37, 16, 6, 23, 37, 9, 14, 16, 14, 19],
 [38, 20, 18, 37, 9, 14, 37, 9, 10, 23, 24, 33, 37, 9, 10, 37, 7, 35, 5],
 [38, 6, 19, 37, 7, 14, 15, 6, 23, 25, 4],
 [38,
  10,
  23,
  37,
  9,
  20,
  23,
  2,
  37,
  8,
  34,
  23,
  6,
  19,
  37,
  35,
  37,
  12,
  26,
  19,
  9,
  34,
  37,
  25,
  33,
  19,
  37,
  13,
  10,
  28,
  6,
  23,
  6,
  37,
  28,
  6,
  19],
 [38, 28, 37, 33, 37, 7, 14, 7, 10, 23, 24, 14, 27, 34, 19, 10, 4],
 [38, 33, 11, 6, 37, 18, 10, 37,

In [27]:
train_ds

Dataset({
    features: ['input_values', 'attention_mask', 'labels'],
    num_rows: 53
})

In [None]:
#✅ Cell 5: Dynamic Padding Collator
class CustomCTCCollator:
    def __init__(self, processor):
        self.processor = processor
        self.pad_token_id = processor.tokenizer.pad_token_id

    def __call__(self, batch):
        input_values = [torch.tensor(b["input_values"]) for b in batch]
        attention_mask = [torch.tensor(b["attention_mask"]) for b in batch]
        labels = [torch.tensor(b["labels"]) for b in batch]

        return {
            "input_values": pad_sequence(input_values, batch_first=True, padding_value=0.0),
            "attention_mask": pad_sequence(attention_mask, batch_first=True, padding_value=0),
            "labels": pad_sequence(labels, batch_first=True, padding_value=self.pad_token_id),
        }

data_collator = CustomCTCCollator(processor)

Cell 6 – TrainingArguments

In [None]:
#✅ Cell 6: Load Pretrained Wav2Vec2 Base
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=len(processor.tokenizer),
    cache_dir=MODELS_DIR / "wav2vec2-base"
).to("cuda")


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Cell 7 – Trainer Setup

In [None]:
#✅ Cell 7: Trainer Configuration
training_args = TrainingArguments(
    output_dir=str(OUTPUT_DIR),
    group_by_length=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    evaluation_strategy="epoch",
    num_train_epochs=2,
    fp16=True,
    save_steps=500,
    save_total_limit=1,
    logging_steps=25,
    learning_rate=6e-5,
    warmup_steps=50,
    weight_decay=0.005,
    report_to="none",
    push_to_hub=False
)

In [41]:
train_ds

Dataset({
    features: ['input_values', 'attention_mask', 'labels'],
    num_rows: 53
})

In [None]:
#✅ Cell 8: Metrics & Trainer Setup
wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred_str = processor.batch_decode(pred_ids)
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    label_str = processor.batch_decode(label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    tokenizer=processor.feature_extractor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


  trainer = Trainer(


Cell 9: Start training

In [43]:
trainer.train()
print("🎉 Training complete!")

metrics = trainer.evaluate()
print("✅ Evaluation metrics:", metrics)

trainer.save_model("./wav2vec2-kurmanci-final")
processor.save_pretrained("./wav2vec2-kurmanci-final")


Epoch,Training Loss,Validation Loss,Wer
1,1113.5423,5068.975586,1.000804


🎉 Training complete!


✅ Evaluation metrics: {'eval_loss': 5070.03662109375, 'eval_wer': 1.0010153572788425, 'eval_runtime': 2207.1141, 'eval_samples_per_second': 1.8, 'eval_steps_per_second': 0.225, 'epoch': 1.8888888888888888}


[]