# Baseline Evaluation

This notebook evaluates the baseline model.

The baseline model uses *ASR* (the `whisper-tiny` model) to predict the transcript of the speech and then uses `multilingual-e5-small` to encode both the **true transcript** and the **predicted transcript**.
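
The comparison step itself is not shown in this section. As a minimal sketch, assuming the two encodings are compared by cosine similarity after attention-masked mean pooling (the pooling used in the `multilingual-e5-small` model card), the computation could look like the following. The toy arrays stand in for real encoder outputs, and `mean_pool` and `cosine_similarity` are hypothetical helper names.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings, ignoring padded positions.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

# Toy stand-ins for encoder outputs: 1 sample, 3 token positions, 2 dimensions.
true_hidden = np.array([[[1.0, 0.0], [1.0, 0.0], [9.0, 9.0]]])
pred_hidden = np.array([[[0.0, 1.0], [0.0, 1.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])  # the last position is padding and must be ignored

true_vec = mean_pool(true_hidden, mask)
pred_vec = mean_pool(pred_hidden, mask)
similarity = cosine_similarity(true_vec, pred_vec)
```

Because every sequence is padded to a fixed length, the mask matters: without it, padding positions would pull all sentence vectors toward each other.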

In [1]:
import torch
import os, re, string, unicodedata
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer, AutoProcessor, pipeline
import librosa
import numpy as np

## Dataset Loading

- We use the [HParl: Hellenic Parliamentary Speech Corpus](https://inventory.clarin.gr/corpus/1602), which contains 120 hours of recorded speech along with transcriptions.
- To access it, we use the [`hparl`](https://huggingface.co/datasets/ddamianos/hparl) dataset hosted on **HuggingFace**.

In [39]:
# Base directory for the dataset
base = "/mnt/h/"

# HuggingFace caches
os.environ["HF_DATASETS_CACHE"] = f"{base}/datasets"
os.environ["HF_HOME"] = base
os.environ["TRANSFORMERS_CACHE"] = f"{base}/models"

# Load the dataset from the HuggingFace Hub into the dataset cache
orig_ds = load_dataset('ddamianos/hparl',
                       cache_dir=os.environ["HF_DATASETS_CACHE"])
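
Hugging Face libraries generally resolve their default cache locations when they are first imported, so environment variables set after `import datasets` or `import transformers` may not change those defaults (the explicit `cache_dir` argument above still applies). A minimal sketch of setting them beforehand, assuming the same `/mnt/h/` base directory; `configure_hf_cache` is a hypothetical helper, and recent `transformers` releases deprecate `TRANSFORMERS_CACHE` in favor of `HF_HOME`:

```python
import os

def configure_hf_cache(base: str) -> dict:
    """Point the Hugging Face caches at `base`.

    Intended to be called before importing datasets/transformers.
    """
    base = base.rstrip("/")  # avoid a double slash in the joined paths
    paths = {
        "HF_HOME": base,
        "HF_DATASETS_CACHE": f"{base}/datasets",
    }
    os.environ.update(paths)
    return paths

paths = configure_hf_cache("/mnt/h/")
# from datasets import load_dataset  # import only after the variables are set
```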

In [40]:
# Keep only the 'sentence' and 'audio' columns in the test split
keep_cols = ['sentence', 'audio']
cols_to_remove = [c for c in orig_ds['test'].column_names if c not in keep_cols]
if cols_to_remove:
    orig_ds['test'] = orig_ds['test'].remove_columns(cols_to_remove)

# Display first 10 rows from test dataset
orig_ds['test'].flatten().to_pandas().head(10)

|   | sentence | audio.array | audio.sampling_rate |
|---|---|---|---|
| 0 | [UNK] λυθει μεχρι το τελος του χρονου ετσι στο... | [-0.0021362305, 0.04498291, 0.07507324, 0.1015... | 16000 |
| 1 | [UNK] που εγινε αναφορα για την [UNK] | [-0.07055664, -0.041168213, -0.0050354004, 0.0... | 16000 |
| 2 | [UNK] η τροποποιηση του αρθρου εβδομηνταδυο το... | [-0.0076293945, 0.012207031, 0.028289795, 0.01... | 16000 |
| 3 | [UNK] που εχουν συναφθει πριν την εναρξη ισχυο... | [-0.06365967, 0.052703857, 0.016601562, -0.049... | 16000 |
| 4 | εχουν εφαρμογη τα [UNK] | [0.01171875, 0.020080566, 0.022888184, 0.02709... | 16000 |
| 5 | [UNK] στον κωδικα φορολογιας εισοδηματος | [-0.07522583, -0.086364746, -0.120910645, -0.1... | 16000 |
| 6 | του κωδικα φπα και τα εισιτηρια των θεατρικων ... | [-0.019500732, -0.032409668, -0.03302002, -0.0... | 16000 |
| 7 | [UNK] αν δεν το καναμε τωρα θα επρεπε να παει ... | [0.00076293945, 0.002960205, 0.0026855469, 0.0... | 16000 |
| 8 | οποτε θα ηταν ενα ζητημα για τους ανθρωπους πο... | [-0.0063476562, -0.006591797, -0.004211426, -0... | 16000 |
| 9 | [UNK] απο την αρχη του ετους δινουμε μια δυνατ... | [-0.029876709, 0.019592285, 0.018554688, -0.03... | 16000 |


## Preprocessing
We must preprocess the audio and the text so that they are ready for consumption by the model.

In [67]:
# Initialize tokenizer and ASR pipeline
e5_tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-small')
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-tiny", device=0)

Device set to use cuda:0


### Text preprocessing
1. Remove `[UNK]` tokens
2. Normalize text

In [43]:
def preprocess_sentence(examples):
    """
    Preprocess the sentence column (batch):
    1. Remove [UNK] tokens
    2. Remove punctuation
    3. Normalize text (lowercase, whitespace, unicode normalization)
    4. Tokenize with multilingual-e5-small tokenizer
    """
    texts = examples['sentence']
    
    # Process all texts
    processed_texts = []
    for text in texts:
        # Remove [UNK] tokens
        text = re.sub(r'\[UNK\]', '', text)
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Unicode normalization (NFKD)
        text = unicodedata.normalize('NFKD', text)
        # Convert to lowercase
        text = text.lower()
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        processed_texts.append(text)
    
    # Tokenize batch with multilingual-e5-small
    tokens = e5_tokenizer(processed_texts, padding='max_length', max_length=128, truncation=True, return_tensors='pt')
    
    examples['sentence'] = processed_texts
    examples['input_ids'] = tokens['input_ids']
    examples['attention_mask'] = tokens['attention_mask']
    
    return examples

In [46]:
# Test the functions on a sample
sample = orig_ds['test'][3:4].copy()
print("Text Before:", sample['sentence'])
processed_text = preprocess_sentence(sample)
print("Text After:", processed_text['sentence'])
print("Input IDs shape:", processed_text['input_ids'].shape)
print("Attention Mask shape:", processed_text['attention_mask'].shape)

Text Before: ['[UNK] που εχουν συναφθει πριν την εναρξη ισχυος του παροντος κωδικα ']
Text After: ['που εχουν συναφθει πριν την εναρξη ισχυος του παροντος κωδικα']
Input IDs shape: torch.Size([1, 128])
Attention Mask shape: torch.Size([1, 128])
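
NFKD normalization decomposes accented characters into a base letter plus combining marks, but it does not remove the marks. The reference sentences shown here are unaccented and lowercase, so text that does carry accents, capitals, or punctuation (ASR output, for example) may need an extra step before the two are comparable. A minimal sketch under that assumption; `strip_accents_and_normalize` is a hypothetical helper, not part of the original pipeline:

```python
import re
import string
import unicodedata

def strip_accents_and_normalize(text: str) -> str:
    """Lowercase, drop combining accents and ASCII punctuation, collapse whitespace."""
    # Decompose characters such as an accented omicron into base letter + combining accent
    text = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (Unicode category Mn)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Remove ASCII punctuation, mirroring preprocess_sentence
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

normalized = strip_accents_and_normalize(" Αυτό πρώτο κομμάτι, ")
```

Like `preprocess_sentence`, this removes only ASCII punctuation; any other punctuation characters would need to be handled separately.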


### Audio Preprocessing
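1. Feed `whisper-tiny` audio sampled at $16\,\text{kHz}$ (the rows displayed above already report a sampling rate of 16000, and the code below does not resample)
2. Run the `whisper-tiny` ASR pipeline to obtain the predicted transcript
3. Tokenize the predicted transcript with `multilingual-e5-small`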


In [64]:
def preprocess_audio(examples):
    """
    Preprocess the audio column (batch):
    1. Transcribe audio in Greek using ASR pipeline
    2. Tokenize transcriptions with e5
    """
    audio_dicts = examples['audio']
    
    # Transcribe each audio sample
    predicted_transcriptions = []
    for audio_dict in audio_dicts:
        # Extract array and convert to numpy
        array = audio_dict['array']
        if not isinstance(array, np.ndarray):
            array = np.array(array)
        
        # Transcribe using ASR pipeline in Greek
        result = asr_pipeline(array, generate_kwargs={"language": "el"})
        predicted_transcriptions.append(result['text'])
    
    # Tokenize transcriptions with e5
    tokens = e5_tokenizer(predicted_transcriptions, 
                         padding='max_length', 
                         max_length=128, 
                         truncation=True, 
                         return_tensors='pt')
    
    examples['predicted_transcription'] = predicted_transcriptions
    examples['predicted_input_ids'] = tokens['input_ids']
    examples['predicted_attention_mask'] = tokens['attention_mask']
    
    return examples

In [71]:
# Test the combined preprocessing function on a small batch
# (preprocess_batch is defined in the "Mapping Dataset" cell below; run that cell first)
print("Testing combined preprocessing on a batch...")
test_batch = orig_ds['test'][0:2].copy()

result = preprocess_batch(test_batch)

print("\n=== Ground Truth Transcriptions ===")
for i, sent in enumerate(result['sentence']):
    print(f"Sample {i}: {sent}")

print("\n=== Predicted Transcriptions ===")
for i, pred in enumerate(result['predicted_transcription']):
    print(f"Sample {i}: {pred}")

print("\n=== Token Shapes ===")
print(f"GT input_ids shape: {result['input_ids'].shape}")
print(f"Predicted input_ids shape: {result['predicted_input_ids'].shape}")
print(f"GT attention_mask shape: {result['attention_mask'].shape}")
print(f"Predicted attention_mask shape: {result['predicted_attention_mask'].shape}")

Testing combined preprocessing on a batch...

=== Ground Truth Transcriptions ===
Sample 0: λυθει μεχρι το τελος του χρονου ετσι στο πρωτο κομματι
Sample 1: που εγινε αναφορα για την

=== Predicted Transcriptions ===
Sample 0:  Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι, Αυτό πρώτο κομμάτι,
Sample 1:  Εγώ που έγινε για να 
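
Sample 0 above shows a repetition loop, where the same phrase is emitted over and over. One possible post-processing heuristic, offered as an assumption rather than as part of the original pipeline, is to collapse immediately repeated word n-grams before encoding. `collapse_repeats` and its `max_n` parameter are hypothetical names:

```python
def collapse_repeats(text: str, max_n: int = 6) -> str:
    """Collapse back-to-back repetitions of any word n-gram with n <= max_n."""
    words = text.split()
    for n in range(1, max_n + 1):
        out = []
        i = 0
        while i < len(words):
            chunk = words[i:i + n]
            if len(chunk) == n and words[i + n:i + 2 * n] == chunk:
                # Keep one copy of the n-gram and skip every immediate repeat
                j = i + n
                while words[j:j + n] == chunk:
                    j += n
                out.extend(chunk)
                i = j
            else:
                out.append(words[i])
                i += 1
        words = out
    return " ".join(words)

phrase = "Αυτό πρώτο κομμάτι,"
looped = " ".join([phrase] * 5)  # mimics the looping output of Sample 0
collapsed = collapse_repeats(looped)
```

The heuristic also collapses legitimate repetitions (a word spoken twice in a row), so it trades some fidelity for robustness against loops.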

### Mapping Dataset

In [72]:
def preprocess_batch(examples):
    """
    Combined preprocessing for text and audio:
    - Text: ground truth transcription, tokenized with e5
    - Audio: predicted transcription from ASR pipeline, tokenized with e5
    """
    examples = preprocess_sentence(examples)
    examples = preprocess_audio(examples)
    return examples

In [73]:
# Apply preprocessing to the test split
print("Processing test split...")
preprocessed_ds = orig_ds['test'].map(preprocess_batch,
                                       batched=True,
                                       batch_size=8)

print("Preprocessing complete!")
print(f"New dataset columns: {preprocessed_ds.column_names}")

Processing test split...


Map:   0%|          | 8/8679 [00:11<3:28:54,  1.45s/ examples]


KeyboardInterrupt: 
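
The progress bar above reports roughly 1.45 s per example, with each clip sent to the pipeline individually; the `batch_size=8` passed to `map` only controls how many rows reach `preprocess_batch` at once. One possible speed-up, assuming the `batch_size` call argument that `transformers` pipelines accept, is to pass the whole list of clips in a single call. In this sketch `transcribe_batch` is a hypothetical helper and `asr` is any callable with the pipeline's calling convention; a stub is used so the example runs without downloading a model.

```python
import numpy as np

def transcribe_batch(asr, audio_dicts, batch_size=8, language="el"):
    """Transcribe a list of {'array', 'sampling_rate'} dicts in one pipeline call."""
    inputs = [
        {"raw": np.asarray(a["array"], dtype=np.float32),
         "sampling_rate": a["sampling_rate"]}
        for a in audio_dicts
    ]
    outputs = asr(inputs, batch_size=batch_size,
                  generate_kwargs={"language": language})
    return [out["text"] for out in outputs]

# Stub standing in for the real ASR pipeline (assumption, for illustration only)
def fake_asr(inputs, batch_size=1, generate_kwargs=None):
    return [{"text": f"clip with {len(x['raw'])} samples"} for x in inputs]

clips = [{"array": [0.0, 0.1], "sampling_rate": 16000},
         {"array": [0.0, 0.1, 0.2], "sampling_rate": 16000}]
texts = transcribe_batch(fake_asr, clips)
```

Whether batching actually helps depends on the hardware and on how much the clip lengths vary, so it is worth timing on a small subset first.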