## **ASR TRANSCRIPTION - INFERENCE PIPELINE**

This notebook cell prepares audio from the `sartifyllc/Sartify_ITU_Zindi_Testdataset` (test split) and runs an Automatic Speech Recognition (ASR) pipeline using the Hugging Face model `EYEDOL/SALAMA_NEWMEDT`.

## **What it does**
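- Loads the dataset (test split) and casts the `audio` column with `Audio(sampling_rate=16000, decode=False)`, which exposes each clip as raw encoded bytes. Because decoding is disabled, no resampling happens at this stage; each WAV keeps the sampling rate of its source file.  
- Normalizes each clip to a fixed RMS level (-23 dB by default) and clips samples to [-1.0, 1.0]. The stereo → mono mixdown helper (`to_mono_if_needed`) is used only in the single-clip timing cell, not in the preparation loop.  
- Writes WAV files into `waves/` using a consistent on-disk format (PCM_16, the `soundfile` default for WAV).  
- Initializes a Hugging Face `transformers.pipeline` for `"automatic-speech-recognition"`.  
- Iterates over the WAV files in dataset order and transcribes each one, collecting the results in a list.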
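
A stereo → mono mixdown followed by RMS normalization and 16-bit quantization can be sketched on an in-memory array. This is a minimal illustration, assuming float samples in [-1.0, 1.0] shaped `(frames, channels)` as `soundfile.read` returns them. The helper names `mix_to_mono`, `rms_normalize`, and `to_pcm16` are hypothetical and not part of the notebook, and the 32767 scaling is an assumption that may differ from the rounding `soundfile` applies when it writes PCM_16.

```python
import numpy as np

def mix_to_mono(samples: np.ndarray) -> np.ndarray:
    """Average the channels of a (frames, channels) array; pass 1-D input through."""
    samples = np.asarray(samples, dtype=np.float64)
    return samples if samples.ndim == 1 else samples.mean(axis=1)

def rms_normalize(samples: np.ndarray, target_db: float = -23.0) -> np.ndarray:
    """Scale so the RMS equals 10**(target_db / 20), then clip to [-1, 1]."""
    rms = np.sqrt(np.mean(samples ** 2))
    if rms > 0:
        samples = samples * (10 ** (target_db / 20) / rms)
    return np.clip(samples, -1.0, 1.0)

def to_pcm16(samples: np.ndarray) -> np.ndarray:
    """Quantize float samples in [-1, 1] to signed 16-bit integers (assumed scaling)."""
    return np.round(samples * 32767.0).astype(np.int16)

# Synthetic 1-second stereo tone at 16 kHz (hypothetical test signal)
t = np.arange(16000) / 16000.0
stereo = np.stack(
    [0.5 * np.sin(2 * np.pi * 440 * t), 0.25 * np.sin(2 * np.pi * 440 * t)],
    axis=1,
)

mono = rms_normalize(mix_to_mono(stereo))
pcm = to_pcm16(mono)
measured_rms = float(np.sqrt(np.mean(mono ** 2)))
print(mono.shape, pcm.dtype, round(measured_rms, 4))
```

Averaging the channels before normalizing keeps the result independent of channel order, which is one way to make the mixdown deterministic.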

## Outputs
- `waves/` — deterministic WAV files ready for transcription.  
- `results` — list of transcription records (`{"filename": ..., "text": ...}`) ready to convert to a DataFrame and save as CSV.

## Caveats
- The fixed preprocessing steps reduce nondeterminism but do **not** guarantee bit-for-bit identical outputs across different hardware, CUDA/cuDNN/driver versions, or GPUs (especially with FP16).  
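
One common way to narrow the remaining run-to-run variation is to seed the random number generators and request deterministic cuDNN kernels before loading the model. The sketch below is an assumption, not something the notebook does: `seed_everything` is a hypothetical helper, and it does not remove the hardware-level differences described above.

```python
import os
import random

def seed_everything(seed: int = 0) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs when those libraries are installed."""
    random.seed(seed)
    # Only affects Python subprocesses started after this point
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        # Ask cuDNN for deterministic kernels and disable autotuning
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```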

## Quick run instructions
1. (Optional) Ensure required packages are installed.  
2. Run all cells; this prepares the WAVs in `waves/` and produces `results`.  
3. Convert `results` to a DataFrame and save to CSV if you need a submission file.

---


In [1]:
## LIBRARY INSTALLATION
!pip install -q transformers accelerate noisereduce
!pip install -q pydub datasets soundfile pandas torch torchaudio

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m95.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m72.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m44.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m31.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 MB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━

## REQUIREMENTS DISPLAY

In [2]:
import transformers, accelerate, datasets, soundfile, pandas, torch, torchaudio, huggingface_hub
import importlib.metadata
import pandas as pd
import re
import soundfile as sf
import numpy as np
from time import perf_counter
from transformers import pipeline

print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("datasets:", datasets.__version__)
print("soundfile:", sf.__version__)
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("huggingface_hub:", huggingface_hub.__version__)

print("pydub:", importlib.metadata.version("pydub"))
print("noisereduce:", importlib.metadata.version("noisereduce"))


2025-09-23 10:02:06.727577: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1758621726.930119      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1758621726.989852      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


transformers: 4.52.4
accelerate: 1.8.1
datasets: 3.6.0
soundfile: 0.13.1
pandas: 2.2.3
numpy: 1.26.4
torch: 2.6.0+cu124
torchaudio: 2.6.0+cu124
huggingface_hub: 0.33.1
pydub: 0.25.1
noisereduce: 3.0.3


In [3]:
%%capture captured_output 
%%time
import io
import os
import numpy as np
import pandas as pd
import soundfile as sf
import torch
import tqdm
import noisereduce as nr
from datasets import load_dataset, Audio
from transformers import pipeline

HF_MODEL_ID = "EYEDOL/SALAMA_NEWMEDT"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    
def normalize_audio(audio, target_lufs=-23.0):
    """
    Normalize an audio signal to a target loudness level (approximate).

    This function scales the provided audio array so its RMS matches the
    linear amplitude corresponding to `target_lufs` dB. The output is
    clipped to the [-1.0, 1.0] range.

    Parameters
    ----------
    audio : numpy.ndarray
        1-D (mono) or 2-D (multi-channel) numpy array representing audio samples.
        The function assumes floating-point samples (e.g. float32) in the range
        [-1.0, 1.0] or another consistent amplitude scale.
    target_lufs : float, optional
        Target loudness in decibels (dB). The parameter name uses "lufs" for
        familiarity, but the implementation actually converts dB to a linear
        RMS target via `10**(target_lufs/20)`. Default is -23.0.

    Returns
    -------
    numpy.ndarray
        The scaled and clipped audio array (same shape as input), clipped to
        [-1.0, 1.0].

    Notes
    -----
    - This is a simple RMS-based scaling, not a perceptual LUFS/EBU R128
      compliant loudness normalizer. It approximates loudness by scaling RMS.
    - If `audio` is multi-channel, consider manually mixing channels to mono
      before calling this function if you want to preserve a specific channel
      strategy; this implementation uses the array as-is.
    - For silent input (RMS == 0) the function returns the input unchanged.
    - The function does not change the dtype of the input array.

    Example
    -------
    >>> import numpy as np
    >>> x = np.array([0.1, -0.1, 0.2, -0.2], dtype=np.float32)
    >>> y = normalize_audio(x, target_lufs=-20.0)
    >>> y.shape
    (4,)
    """
    rms = np.sqrt(np.mean(audio**2))
    if rms > 0:
        target_rms = 10**(target_lufs/20)
        audio = audio * (target_rms / rms)
    return np.clip(audio, -1.0, 1.0)


print(f"Using device: {device}")
print(f"Loading Whisper model from Hugging Face: {HF_MODEL_ID}")

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=HF_MODEL_ID,
    torch_dtype=torch_dtype,
    device=device,
)
print("Whisper model pipeline loaded successfully.")

print("Loading dataset...")
ds = load_dataset('sartifyllc/Sartify_ITU_Zindi_Testdataset', split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16000, decode=False))
print("Dataset loaded.")
display(ds)

os.makedirs("waves", exist_ok=True)
audio_paths = []

print("Preparing audio files...")
for item in tqdm.tqdm(ds):
    audio_bytes = item['audio']['bytes']
    record_id = item['record_id']

    out_path = f"waves/{record_id}.wav"
    audio_array, sr = sf.read(io.BytesIO(audio_bytes))

    audio_array = normalize_audio(audio_array)

    sf.write(out_path, audio_array, sr)
    
    audio_paths.append({
        "path": out_path,
        "filename": item['filename']
    })

print(f"{len(audio_paths)} audio files prepared for transcription.")
print("Starting transcription...")
results = []

for audio_item in tqdm.tqdm(audio_paths):
    try:
        transcription = asr_pipeline(audio_item['path'])

        results.append({
            "filename": audio_item['filename'],
            "text": transcription['text']
        })
    except Exception as e:
        print(f"Error transcribing file {audio_item['filename']}: {e}")
        results.append({
            "filename": audio_item['filename'],
            "text": f"ERROR: {e}"
        })


print("Transcription complete.")

Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
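
The warning above suggests feeding the pipeline more than one file per call. A minimal sketch of that idea follows, assuming the pipeline is called with a list of paths plus a `batch_size` argument and returns one `{"text": ...}` dict per path. The helpers `chunked` and `transcribe_in_batches` are hypothetical, and `fake_asr` is only a stand-in so the example runs without a model.

```python
from typing import Callable, Dict, Iterator, List

def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def transcribe_in_batches(
    asr: Callable, items: List[dict], batch_size: int = 8
) -> List[Dict[str, str]]:
    """Transcribe {"path", "filename"} records batch by batch, in input order."""
    results = []
    for batch in chunked(items, batch_size):
        paths = [item["path"] for item in batch]
        try:
            outputs = asr(paths, batch_size=batch_size)
            texts = [out["text"] for out in outputs]
        except Exception as exc:
            # Mirror the notebook's behaviour: record the error as the text
            texts = [f"ERROR: {exc}"] * len(batch)
        results.extend(
            {"filename": item["filename"], "text": text}
            for item, text in zip(batch, texts)
        )
    return results

def fake_asr(paths, batch_size=1):
    """Stand-in for the real pipeline (assumption, for demonstration only)."""
    return [{"text": f"transcript of {p}"} for p in paths]

demo_items = [{"path": f"waves/{i}.wav", "filename": f"{i}.wav"} for i in range(5)]
demo_results = transcribe_in_batches(fake_asr, demo_items, batch_size=2)
print(demo_results[0])
```

With the real pipeline, `asr_pipeline` would take the place of `fake_asr`. Note that a failure inside a batch marks every file in that batch as an error, so a smaller batch size or a per-file retry may be preferable.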


In [4]:
print("Creating submission file...")
submission_df = pd.DataFrame(results)

submission_df['text'] = submission_df['text'].replace('', ' ', regex=False)

output_filename = f'huggingface_whisper_{HF_MODEL_ID.split("/")[-1]}_submission.csv'
submission_df.to_csv(output_filename, index=False)

print(f"Submission file '{output_filename}' created successfully.")
display(submission_df.head())

Creating submission file...
Submission file 'huggingface_whisper_SALAMA_NEWMEDT_submission.csv' created successfully.


                                   filename                                               text
0  451f6d89-9b85-46c3-ad8d-bfcb1c9a4e8f.wav  upana baina ya mita sitini na nne na sabini na...
1  507e10f8-0b2b-4bc0-9b69-94f96d907fb6.wav  mwanadamu Vibila ya sehemu hivi Sabato ilifany...
2  576fe4af-849e-4354-9320-368678330425.wav  "Mke wangu ambadilika sana, tabia zimebadilika...
3  afc27663-229a-42b8-b47d-72ac6dfda574.wav  Kuna wataalamu wengi wa sayansi hivyo bado tun...
4  fbb7dd49-c974-4c57-ae67-1270b20859e2.wav   Zamani ziliitwa Mataifa ya kati kwa upande mmoja


## **POST PROCESSING AND CLEANUP FOR FINAL SUBMISSION**

In [5]:
%%time


def preprocess_swahili_text(text):
    """
    Cleans and preprocesses Swahili text from ASR.

    This function performs the following steps:
    1. Removes leading/trailing whitespace and quotation marks.
    2. Converts the entire text to lowercase.
    3. Removes all punctuation.
    4. Removes common Swahili filler words (e.g., 'eee', 'aaa').
    5. Uses regular expressions to collapse any word or phrase that is
       repeated consecutively (two or more times in a row) into one occurrence.
    6. If the text is empty after cleaning, it replaces it with the letter 'n'.

    Args:
        text (str): The raw text string to be cleaned.

    Returns:
        str: The cleaned and preprocessed text string.
    """
    # Ensure input is a string
    cleaned_text = str(text).strip().strip('"')

    # Step 2: Convert all letters to lowercase
    cleaned_text = cleaned_text.lower()

    # Step 3: Remove all punctuation
    # This regex removes anything that is not a word character or whitespace
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)

    # Step 4: Remove filler words such as 'eee' and 'aaa'
    filler_words = ['eee', 'aaa', 'mmh', 'mh', 'uhm', 'basi']
    # Create a regex pattern to match whole words only
    word_boundary_fillers = [r'\b' + re.escape(word) + r'\b' for word in filler_words]
    filler_pattern = '|'.join(word_boundary_fillers)
    cleaned_text = re.sub(filler_pattern, '', cleaned_text)

    # Step 5: Remove consecutively repeated words or phrases
    # The regex matches a word or phrase (\b.+?\b) followed by one or more
    # repetitions of that same text (\s+\1\b)+. The trailing \b stops a repeat
    # from matching only the prefix of a longer word, so "na nani" is kept
    # intact instead of being collapsed to "nani".
    for _ in range(3):  # Run a few times to handle nested repeats
        cleaned_text = re.sub(r'(\b.+?\b)(\s+\1\b)+', r'\1', cleaned_text, flags=re.IGNORECASE)

    # Clean up extra whitespace that may result from removals
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    # Step 6: If nothing is left after cleaning, return the letter 'n'
    if not cleaned_text:
        return 'n'

    return cleaned_text

def main():
    """
    Main function to run the data cleaning process.
    """
    # --- Instructions ---
    # 1. Make sure the input CSV is in the same directory as this notebook,
    #    or set `input_filename` to the full path of the file.
    # 2. Set `output_filename` to your desired output file name.
    # 3. Ensure your columns are named 'filename' and 'text'. If not,
    #    change 'text' in the line: df['text'] = df['text'].apply(...)

    input_filename = 'huggingface_whisper_SALAMA_NEWMEDT_submission.csv'
    output_filename = 'FINAL_SUBMISSION.csv'

    try:
        # Read the CSV file into a pandas DataFrame
        df = pd.read_csv(input_filename)

        # Ensure the 'text' column is of string type; fill missing values first
        # so that NaN does not turn into the literal string 'nan'
        df['text'] = df['text'].fillna('').astype(str)

        print("Original Data Sample:")
        # Keep a copy of the original text for comparison before overwriting
        original_text_sample = df['text'].head().copy()
        print(df.head())
        print("\n" + "="*30 + "\n")

        # Apply the preprocessing function to the 'text' column, overwriting it
        df['text'] = df['text'].apply(preprocess_swahili_text)

        print("Cleaned Data Sample (Original vs. Cleaned):")
        # Display original and cleaned text for comparison
        comparison_df = pd.DataFrame({
            'original_text': original_text_sample,
            'cleaned_text': df['text'].head()
        })
        print(comparison_df)
        print("\n" + "="*30 + "\n")

        # Save the cleaned data (with the updated 'text' column) to a new CSV file
        df.to_csv(output_filename, index=False, encoding='utf-8')

        print("Successfully processed the file.")
        print(f"Cleaned data has been saved to '{output_filename}'")

    except FileNotFoundError:
        print(f"Error: The file '{input_filename}' was not found.")
        print("Please make sure the file name is correct and it's in the right directory.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    main()


Original Data Sample:
                                   filename  \
0  451f6d89-9b85-46c3-ad8d-bfcb1c9a4e8f.wav   
1  507e10f8-0b2b-4bc0-9b69-94f96d907fb6.wav   
2  576fe4af-849e-4354-9320-368678330425.wav   
3  afc27663-229a-42b8-b47d-72ac6dfda574.wav   
4  fbb7dd49-c974-4c57-ae67-1270b20859e2.wav   

                                                text  
0  upana baina ya mita sitini na nne na sabini na...  
1  mwanadamu Vibila ya sehemu hivi Sabato ilifany...  
2  "Mke wangu ambadilika sana, tabia zimebadilika...  
3  Kuna wataalamu wengi wa sayansi hivyo bado tun...  
4   Zamani ziliitwa Mataifa ya kati kwa upande mmoja  


Cleaned Data Sample (Original vs. Cleaned):
                                       original_text  \
0  upana baina ya mita sitini na nne na sabini na...   
1  mwanadamu Vibila ya sehemu hivi Sabato ilifany...   
2  "Mke wangu ambadilika sana, tabia zimebadilika...   
3  Kuna wataalamu wengi wa sayansi hivyo bado tun...   
4   Zamani ziliitwa Mataifa ya kati kwa
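
The repeat-collapsing step is the least obvious part of the cleaning function, so a few isolated cases may help. The sketch below is an illustration only: `collapse_repeats` is a hypothetical helper, the sample phrases are made up, and the pattern is written here with a trailing `\b` so that a repeat cannot match just the prefix of a longer word.

```python
import re

# Word or phrase followed by one or more whole-word repetitions of itself
REPEAT_PATTERN = re.compile(r'(\b.+?\b)(\s+\1\b)+', flags=re.IGNORECASE)

def collapse_repeats(text: str, passes: int = 3) -> str:
    """Collapse consecutive duplicate words or phrases into one occurrence."""
    for _ in range(passes):
        text = REPEAT_PATTERN.sub(r'\1', text)
    # Tidy whitespace the same way the cleaning function does
    return re.sub(r'\s+', ' ', text).strip()

print(collapse_repeats("asante asante asante sana"))         # asante sana
print(collapse_repeats("ndiyo kabisa ndiyo kabisa rafiki"))  # ndiyo kabisa rafiki
print(collapse_repeats("na nani"))                           # na nani
```

The last case shows why the word boundary matters: without it, the `na` at the start of `nani` would count as a repeat of the preceding `na`, and the phrase would shrink to `nani`.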

### CHECK MODEL COMPUTE PERFORMANCE (RTFx) ON ONE AUDIO CLIP

In [6]:
# --------- USER SETTINGS (edit these) ----------
WAV_PATH = "/kaggle/working/waves/022a2423-9dbd-4385-9e42-0ba030c0e7da.wav"          
USE_CUDA_IF_AVAILABLE = True
# -----------------------------------------------

def to_mono_if_needed(arr):
    arr = np.asarray(arr)
    if arr.ndim == 1:
        return arr
    return arr.mean(axis=1)

# choose device for transformers pipeline
use_cuda = USE_CUDA_IF_AVAILABLE and torch.cuda.is_available()
device_arg = 0 if use_cuda else -1
torch_dtype = torch.float16 if use_cuda else torch.float32

print(f"Loading ASR pipeline '{HF_MODEL_ID}' on {'cuda' if use_cuda else 'cpu'}...")
asr = pipeline(
    "automatic-speech-recognition",
    model=HF_MODEL_ID,
    torch_dtype=torch_dtype,
    device=device_arg
)
print("Pipeline ready.\n")

# load wav and convert to mono if needed
audio, sr = sf.read(WAV_PATH)
audio = to_mono_if_needed(audio)
clip_length_s = len(audio) / sr
print(f"Loaded WAV: {WAV_PATH}")
print(f"  sampling_rate: {sr}, clip_length: {clip_length_s:.3f} s\n")

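# The ASR pipeline accepts in-memory audio as a dict with "raw" and
# "sampling_rate" keys; a copy is passed on each call in case the pipeline
# removes keys from the dict it receives.
asr_input = {"raw": audio.astype(np.float32), "sampling_rate": sr}

# warm-up (not timed) to avoid first-run overhead
try:
    _ = asr(dict(asr_input))
except (TypeError, ValueError):
    # fallback: pipeline might require a path
    _ = asr(WAV_PATH)

# timed transcription (ensure GPU ops finish before/after)
if use_cuda:
    torch.cuda.synchronize()
t0 = perf_counter()
try:
    out = asr(dict(asr_input))            # prefer passing array (avoids disk I/O)
except (TypeError, ValueError):
    out = asr(WAV_PATH)                   # fallback to path
if use_cuda:
    torch.cuda.synchronize()
t1 = perf_counter()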

decode_time_s = t1 - t0
rtfx = clip_length_s / decode_time_s if decode_time_s > 0 else float("inf")

# extract text safely
if isinstance(out, dict):
    text = out.get("text", "")
elif isinstance(out, str):
    text = out
else:
    text = str(out)

# print results
print("=== TRANSCRIPTION RESULT ===")
print(f"Clip length (s):   {clip_length_s:.3f}")
print(f"Decode time (s):   {decode_time_s:.3f}")
print(f"RTFx (clip_len / decode_time): {rtfx:.3f}\n")
print("Transcript:\n")
print(text or "<empty>")


Loading ASR pipeline 'EYEDOL/SALAMA_NEWMEDT' on cuda...


Device set to use cuda:0


Pipeline ready.

Loaded WAV: /kaggle/working/waves/022a2423-9dbd-4385-9e42-0ba030c0e7da.wav
  sampling_rate: 16000, clip_length: 6.840 s





=== TRANSCRIPTION RESULT ===
Clip length (s):   6.840
Decode time (s):   2.533
RTFx (clip_len / decode_time): 2.700

Transcript:

Lakini athari za ghasia za magenge zimekuwa mbaya sana kwangu mimi
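
The RTFx figure above is the clip length divided by the decode time, so values above 1 mean the model transcribes faster than real time. The arithmetic can be checked with a short sketch; `real_time_factor` and `time_call` are hypothetical helper names, and the two numbers are taken from the output above.

```python
from time import perf_counter

def real_time_factor(clip_length_s: float, decode_time_s: float) -> float:
    """Inverse real-time factor (RTFx): seconds of audio per second of compute."""
    if decode_time_s <= 0:
        return float("inf")
    return clip_length_s / decode_time_s

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - start

# Clip length and decode time reported in the run above
print(f"{real_time_factor(6.840, 2.533):.3f}")  # 2.700
```

A single clip gives only a rough estimate; averaging over many clips of varied length would give a steadier figure.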
