# Audio Exploration with Faster-Whisper

This notebook demonstrates the audio processing pipeline for kairo-core, focusing on:
1. Loading audio files
2. Transcribing with faster-whisper
3. Basic text processing and formatting

## Prerequisites
- Audio files in `data/sample/` directory
- faster-whisper installed via `uv pip install faster-whisper`

## 1. Setup and Imports

In [1]:
import json
from pathlib import Path
from typing import Dict, List, Optional
import warnings

warnings.filterwarnings("ignore")

# Audio processing
from faster_whisper import WhisperModel

# Data handling
import pandas as pd
from datetime import timedelta

# Visualization
from IPython.display import Audio, display
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

print("✅ All imports successful!")

✅ All imports successful!


## 2. Configuration

In [2]:
# Project paths
PROJECT_ROOT = Path.cwd().parent  # Assuming notebook is in notebooks/
DATA_DIR = PROJECT_ROOT / "data"
SAMPLE_DIR = DATA_DIR / "sample"
RAW_DIR = DATA_DIR / "raw"

# Create directories if they don't exist
SAMPLE_DIR.mkdir(parents=True, exist_ok=True)
RAW_DIR.mkdir(parents=True, exist_ok=True)

# Whisper configuration
WHISPER_CONFIG = {
    "model_size": "base",  # tiny, base, small, medium, large-v2, large-v3
    "device": "cpu",  # or "cuda" for GPU
    "compute_type": "int8",  # int8, float16, float32
    "num_workers": 1,
    "download_root": str(PROJECT_ROOT / "models" / "whisper"),
}

# Audio processing settings
AUDIO_CONFIG = {
    "language": "en",  # None for auto-detection
    "task": "transcribe",  # or "translate"
    "beam_size": 5,
    "best_of": 5,
    "patience": 1,
    "length_penalty": 1,
    "temperature": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    "compression_ratio_threshold": 2.4,
    "log_prob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "word_timestamps": True,
    "vad_filter": True,
    "vad_parameters": {
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "max_speech_duration_s": float("inf"),
        "min_silence_duration_ms": 2000,
        "window_size_samples": 1024,
        "speech_pad_ms": 400,
    },
}

print(f"📁 Project root: {PROJECT_ROOT}")
print(f"📁 Sample directory: {SAMPLE_DIR}")
print(f"🎯 Whisper model: {WHISPER_CONFIG['model_size']}")
print(f"🖥️  Device: {WHISPER_CONFIG['device']}")

📁 Project root: /Users/eason.lim/kairo-core
📁 Sample directory: /Users/eason.lim/kairo-core/data/sample
🎯 Whisper model: base
🖥️  Device: cpu


## 3. Sample Audio File Management

In [3]:
# Function to download sample audio files
def download_sample_audio():
    """Download sample audio files for testing."""
    import urllib.request
    import urllib.error

    # Create a proper User-Agent header to avoid 406 errors
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    samples = [
        {
            "name": "jfk_speech.wav",
            "url": "https://github.com/mozilla/DeepSpeech/raw/master/audio/2830-3980-0043.wav",
            "description": "JFK speech sample (16kHz)",
        },
        {
            "name": "ted_talk_sample.wav",
            "url": "https://github.com/mozilla/DeepSpeech/raw/master/audio/4507-16021-0012.wav",
            "description": "TED talk sample (16kHz)",
        },
        {
            "name": "female_speaker.wav",
            "url": "https://github.com/mozilla/DeepSpeech/raw/master/audio/8455-210777-0068.wav",
            "description": "Female speaker sample (16kHz)",
        },
    ]

    for sample in samples:
        filepath = SAMPLE_DIR / sample["name"]
        if not filepath.exists():
            print(f"📥 Downloading {sample['name']}...")
            try:
                # Create request with headers
                request = urllib.request.Request(sample["url"], headers=headers)
                response = urllib.request.urlopen(request)

                # Write to file
                with open(filepath, "wb") as f:
                    f.write(response.read())

                print(f"✅ Downloaded: {sample['description']}")
            except urllib.error.HTTPError as e:
                print(f"❌ HTTP Error {e.code} for {sample['name']}: {e.reason}")
            except Exception as e:
                print(f"❌ Failed to download {sample['name']}: {e}")
        else:
            print(f"✓ Already exists: {sample['name']}")

    # Alternative: Create a simple test audio using numpy/scipy if available
    try:
        import numpy as np
        from scipy.io import wavfile

        # Generate a test tone
        test_file = SAMPLE_DIR / "test_tone.wav"
        if not test_file.exists():
            print("🎵 Generating test tone...")
            sample_rate = 16000
            duration = 3  # seconds
            frequency = 440  # A4 note

            t = np.linspace(0, duration, int(sample_rate * duration))
            audio = np.sin(2 * np.pi * frequency * t) * 0.5

            # Add some variations
            audio[int(sample_rate * 1) : int(sample_rate * 2)] *= (
                0.3  # Quieter middle section
            )

            wavfile.write(test_file, sample_rate, (audio * 32767).astype(np.int16))
            print("✅ Generated test tone")
    except ImportError:
        pass  # scipy not available, skip test tone generation


# Download samples
download_sample_audio()

# Alternative: Record your own audio in the notebook (if running locally)
print("\n💡 Tip: You can also:")
print("   1. Record audio directly in Jupyter (if running locally)")
print("   2. Upload your own audio files to data/sample/ or data/raw/")
print("   3. Use any audio file from your system")

# List available audio files
audio_extensions = [".wav", ".mp3", ".m4a", ".flac", ".ogg", ".mp4", ".mkv", ".avi"]
audio_files = []

for ext in audio_extensions:
    audio_files.extend(list(SAMPLE_DIR.glob(f"*{ext}")))
    audio_files.extend(list(RAW_DIR.glob(f"*{ext}")))

print(f"\n📂 Found {len(audio_files)} audio files:")
for i, file in enumerate(audio_files, 1):
    size_mb = file.stat().st_size / (1024 * 1024)
    print(f"{i}. {file.name} ({size_mb:.2f} MB)")

# Manual file path option
print("\n📝 Or specify a custom file path:")
print('custom_audio_path = Path("/path/to/your/audio.wav")')

📥 Downloading jfk_speech.wav...
❌ HTTP Error 404 for jfk_speech.wav: Not Found
📥 Downloading ted_talk_sample.wav...
❌ HTTP Error 404 for ted_talk_sample.wav: Not Found
📥 Downloading female_speaker.wav...
❌ HTTP Error 404 for female_speaker.wav: Not Found
🎵 Generating test tone...
✅ Generated test tone

💡 Tip: You can also:
   1. Record audio directly in Jupyter (if running locally)
   2. Upload your own audio files to data/sample/ or data/raw/
   3. Use any audio file from your system

📂 Found 1 audio files:
1. test_tone.wav (0.09 MB)

📝 Or specify a custom file path:
custom_audio_path = Path("/path/to/your/audio.wav")


## 4. Initialize Faster-Whisper Model

In [4]:
# Initialize the model
print(f"🚀 Loading Whisper {WHISPER_CONFIG['model_size']} model...")
print("This may take a few minutes on first run as the model downloads.")

try:
    model = WhisperModel(
        WHISPER_CONFIG["model_size"],
        device=WHISPER_CONFIG["device"],
        compute_type=WHISPER_CONFIG["compute_type"],
        num_workers=WHISPER_CONFIG["num_workers"],
        download_root=WHISPER_CONFIG["download_root"],
    )
    print("✅ Model loaded successfully!")
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("Try a smaller model size or check your system resources.")

🚀 Loading Whisper base model...
This may take a few minutes on first run as the model downloads.


tokenizer.json:   0%|          | 0.00/2.20M [00:00<?, ?B/s]

vocabulary.txt:   0%|          | 0.00/460k [00:00<?, ?B/s]

model.bin:   0%|          | 0.00/145M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.31k [00:00<?, ?B/s]

✅ Model loaded successfully!


## 5. Audio Transcription Functions

In [5]:
def format_timestamp(seconds: float) -> str:
    """Convert seconds to HH:MM:SS.mmm format."""
    td = timedelta(seconds=seconds)
    hours = int(td.total_seconds() // 3600)
    minutes = int((td.total_seconds() % 3600) // 60)
    seconds = td.total_seconds() % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:06.3f}"


def transcribe_audio(audio_path: Path, model: WhisperModel, config: Dict) -> Dict:
    """
    Transcribe an audio file using faster-whisper.

    Returns:
        Dictionary containing:
        - segments: List of transcribed segments with timestamps
        - info: Audio metadata and detection info
        - full_text: Complete transcription
        - word_segments: Word-level timestamps if available
    """
    print(f"🎙️ Transcribing: {audio_path.name}")

    try:
        # Transcribe
        segments_generator, info = model.transcribe(
            str(audio_path),
            language=config.get("language"),
            task=config.get("task", "transcribe"),
            beam_size=config.get("beam_size", 5),
            best_of=config.get("best_of", 5),
            patience=config.get("patience", 1),
            length_penalty=config.get("length_penalty", 1),
            temperature=config.get("temperature", [0.0]),
            compression_ratio_threshold=config.get("compression_ratio_threshold", 2.4),
            log_prob_threshold=config.get("log_prob_threshold", -1.0),
            no_speech_threshold=config.get("no_speech_threshold", 0.6),
            word_timestamps=config.get("word_timestamps", True),
            vad_filter=config.get("vad_filter", True),
            vad_parameters=config.get("vad_parameters", {}),
        )

        # Convert generator to list
        segments = list(segments_generator)

        # Process segments
        processed_segments = []
        word_segments = []
        full_text = ""

        for segment in segments:
            # Segment-level info
            seg_dict = {
                "id": segment.id,
                "start": segment.start,
                "end": segment.end,
                "text": segment.text.strip(),
                "tokens": segment.tokens,
                "temperature": segment.temperature,
                "avg_logprob": segment.avg_logprob,
                "compression_ratio": segment.compression_ratio,
                "no_speech_prob": segment.no_speech_prob,
                "start_formatted": format_timestamp(segment.start),
                "end_formatted": format_timestamp(segment.end),
                "duration": segment.end - segment.start,
            }
            processed_segments.append(seg_dict)
            full_text += segment.text + " "

            # Word-level info if available
            if hasattr(segment, "words") and segment.words:
                for word in segment.words:
                    word_dict = {
                        "word": word.word,
                        "start": word.start,
                        "end": word.end,
                        "probability": word.probability,
                        "start_formatted": format_timestamp(word.start),
                        "end_formatted": format_timestamp(word.end),
                    }
                    word_segments.append(word_dict)

        # Audio info
        audio_info = {
            "duration": info.duration,
            "duration_formatted": format_timestamp(info.duration),
            "language": info.language,
            "language_probability": info.language_probability,
            "all_language_probs": info.all_language_probs
            if hasattr(info, "all_language_probs")
            else None,
        }

        result = {
            "file": audio_path.name,
            "segments": processed_segments,
            "word_segments": word_segments,
            "info": audio_info,
            "full_text": full_text.strip(),
            "num_segments": len(processed_segments),
            "num_words": len(word_segments),
        }

        print("✅ Transcription complete!")
        print(f"   - Duration: {audio_info['duration_formatted']}")
        print(
            f"   - Language: {audio_info['language']} (confidence: {audio_info['language_probability']:.2%})"
        )
        print(f"   - Segments: {result['num_segments']}")
        print(f"   - Words: {result['num_words']}")

        return result

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return None

## 6. Transcribe Sample Audio

In [6]:
# Select an audio file to transcribe
if audio_files:
    # Use the first available audio file
    selected_audio = audio_files[0]
    print(f"🎵 Selected: {selected_audio.name}")

    # Display audio player
    display(Audio(str(selected_audio)))

    # Transcribe
    transcription = transcribe_audio(selected_audio, model, AUDIO_CONFIG)
else:
    print(
        "❌ No audio files found! Please add audio files to data/sample/ or data/raw/"
    )

🎵 Selected: test_tone.wav


🎙️ Transcribing: test_tone.wav
❌ Transcription failed: VadOptions.__init__() got an unexpected keyword argument 'window_size_samples'


## 7. Display Transcription Results

In [7]:
if transcription:
    # Display full text
    print("📝 Full Transcription:")
    print("=" * 80)
    print(transcription["full_text"])
    print("=" * 80)
    print(f"\n📊 Total words: {len(transcription['full_text'].split())}")

In [8]:
if transcription and transcription["segments"]:
    # Create DataFrame for segments
    segments_df = pd.DataFrame(transcription["segments"])

    # Display formatted segments
    print("🎯 Segment Details:")
    for idx, seg in segments_df.iterrows():
        print(
            f"\n[{seg['start_formatted']} → {seg['end_formatted']}] ({seg['duration']:.1f}s)"
        )
        print(f"Text: {seg['text']}")
        print(
            f"Confidence: {-seg['avg_logprob']:.3f} | No speech prob: {seg['no_speech_prob']:.3f}"
        )

## 8. Visualize Transcription Metrics

In [9]:
if transcription and transcription["segments"]:
    segments_df = pd.DataFrame(transcription["segments"])

    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # 1. Segment durations
    axes[0, 0].bar(range(len(segments_df)), segments_df["duration"])
    axes[0, 0].set_xlabel("Segment Index")
    axes[0, 0].set_ylabel("Duration (seconds)")
    axes[0, 0].set_title("Segment Durations")

    # 2. Confidence scores (negative log probability)
    axes[0, 1].plot(segments_df.index, -segments_df["avg_logprob"], "o-")
    axes[0, 1].set_xlabel("Segment Index")
    axes[0, 1].set_ylabel("Confidence Score")
    axes[0, 1].set_title("Transcription Confidence by Segment")
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Words per segment
    words_per_segment = segments_df["text"].apply(lambda x: len(x.split()))
    axes[1, 0].bar(range(len(segments_df)), words_per_segment)
    axes[1, 0].set_xlabel("Segment Index")
    axes[1, 0].set_ylabel("Word Count")
    axes[1, 0].set_title("Words per Segment")

    # 4. No speech probability
    axes[1, 1].scatter(
        segments_df.index,
        segments_df["no_speech_prob"],
        c=segments_df["no_speech_prob"],
        cmap="RdYlGn_r",
    )
    axes[1, 1].set_xlabel("Segment Index")
    axes[1, 1].set_ylabel("No Speech Probability")
    axes[1, 1].set_title("No Speech Probability by Segment")
    axes[1, 1].axhline(y=0.5, color="r", linestyle="--", alpha=0.5, label="Threshold")
    axes[1, 1].legend()

    plt.tight_layout()
    plt.show()

    # Summary statistics
    print("\n📊 Summary Statistics:")
    print(f"Average segment duration: {segments_df['duration'].mean():.2f}s")
    print(f"Average confidence: {-segments_df['avg_logprob'].mean():.3f}")
    print(f"Average words per segment: {words_per_segment.mean():.1f}")
    print(f"Total speaking time: {segments_df['duration'].sum():.1f}s")

## 9. Export Results

In [10]:
def export_transcription(
    transcription: Dict, output_dir: Path = None
) -> Dict[str, Path]:
    """
    Export transcription results in multiple formats.
    """
    if output_dir is None:
        output_dir = DATA_DIR / "processed" / transcription["file"].replace(".", "_")

    output_dir.mkdir(parents=True, exist_ok=True)

    exports = {}

    # 1. Plain text transcript
    txt_path = output_dir / "transcript.txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(transcription["full_text"])
    exports["text"] = txt_path

    # 2. Timestamped transcript
    srt_path = output_dir / "transcript_timestamped.txt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for seg in transcription["segments"]:
            f.write(f"[{seg['start_formatted']} - {seg['end_formatted']}]\n")
            f.write(f"{seg['text']}\n\n")
    exports["timestamped"] = srt_path

    # 3. JSON with all metadata
    json_path = output_dir / "transcript_full.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(transcription, f, indent=2, ensure_ascii=False)
    exports["json"] = json_path

    # 4. CSV for segments
    if transcription["segments"]:
        csv_path = output_dir / "segments.csv"
        segments_df = pd.DataFrame(transcription["segments"])
        segments_df.to_csv(csv_path, index=False)
        exports["segments_csv"] = csv_path

    # 5. Word-level CSV if available
    if transcription["word_segments"]:
        words_csv_path = output_dir / "words.csv"
        words_df = pd.DataFrame(transcription["word_segments"])
        words_df.to_csv(words_csv_path, index=False)
        exports["words_csv"] = words_csv_path

    print(f"\n✅ Exported transcription to: {output_dir}")
    for format_name, path in exports.items():
        print(f"   - {format_name}: {path.name}")

    return exports


# Export the transcription
if transcription:
    exported_files = export_transcription(transcription)

## 10. Batch Processing Function

In [11]:
def batch_transcribe(
    audio_files: List[Path], model: WhisperModel, config: Dict
) -> pd.DataFrame:
    """
    Transcribe multiple audio files and return summary DataFrame.
    """
    results = []

    for audio_file in audio_files:
        print(f"\n{'=' * 60}")
        transcription = transcribe_audio(audio_file, model, config)

        if transcription:
            # Export results
            export_transcription(transcription)

            # Collect summary
            summary = {
                "file": audio_file.name,
                "duration": transcription["info"]["duration"],
                "language": transcription["info"]["language"],
                "confidence": transcription["info"]["language_probability"],
                "segments": transcription["num_segments"],
                "words": transcription["num_words"],
                "word_count": len(transcription["full_text"].split()),
                "chars": len(transcription["full_text"]),
                "status": "success",
            }
        else:
            summary = {"file": audio_file.name, "status": "failed"}

        results.append(summary)

    return pd.DataFrame(results)


# Example: Process all audio files
# Uncomment to run batch processing
# if len(audio_files) > 1:
#     print(f"\n🚀 Starting batch transcription of {len(audio_files)} files...")
#     batch_results = batch_transcribe(audio_files, model, AUDIO_CONFIG)
#     print("\n📊 Batch Processing Summary:")
#     display(batch_results)

## 11. Advanced: Custom Audio Processing Pipeline

In [12]:
class AudioTranscriptionPipeline:
    """
    Complete audio transcription pipeline with preprocessing and postprocessing.
    """

    def __init__(self, model_config: Dict, audio_config: Dict):
        self.model_config = model_config
        self.audio_config = audio_config
        self.model = None
        self._load_model()

    def _load_model(self):
        """Load Whisper model."""
        self.model = WhisperModel(
            self.model_config["model_size"],
            device=self.model_config["device"],
            compute_type=self.model_config["compute_type"],
            num_workers=self.model_config["num_workers"],
            download_root=self.model_config["download_root"],
        )

    def preprocess_audio(self, audio_path: Path) -> Optional[Path]:
        """
        Preprocess audio if needed (normalize, convert format, etc.)
        For now, just returns the original path.
        """
        # TODO: Add audio preprocessing (normalization, format conversion)
        return audio_path

    def postprocess_text(self, text: str) -> str:
        """
        Clean and format transcribed text.
        """
        # Remove extra whitespace
        text = " ".join(text.split())

        # Fix common transcription errors
        replacements = {
            " i ": " I ",
            " i'm ": " I'm ",
            " i'll ": " I'll ",
            " i'd ": " I'd ",
            " i've ": " I've ",
        }

        for old, new in replacements.items():
            text = text.replace(old, new)

        # Ensure proper sentence capitalization
        sentences = text.split(". ")
        sentences = [s.strip().capitalize() for s in sentences if s.strip()]
        text = ". ".join(sentences)

        # Add final period if missing
        if text and not text.endswith((".", "!", "?")):
            text += "."

        return text

    def extract_keywords(self, text: str, top_n: int = 10) -> List[str]:
        """
        Extract keywords from transcribed text.
        Simple frequency-based approach for now.
        """
        # Common stop words
        stop_words = {
            "the",
            "a",
            "an",
            "and",
            "or",
            "but",
            "in",
            "on",
            "at",
            "to",
            "for",
            "of",
            "with",
            "by",
            "from",
            "as",
            "is",
            "was",
            "are",
            "were",
            "been",
            "be",
            "have",
            "has",
            "had",
            "do",
            "does",
            "did",
            "will",
            "would",
            "could",
            "should",
            "may",
            "might",
            "must",
            "shall",
            "can",
            "it",
            "that",
            "this",
            "these",
            "those",
            "i",
            "you",
            "he",
            "she",
            "we",
            "they",
        }

        # Tokenize and count
        words = text.lower().split()
        words = [w.strip('.,!?;:"') for w in words]
        words = [w for w in words if w and w not in stop_words and len(w) > 2]

        # Count frequencies
        from collections import Counter

        word_freq = Counter(words)

        # Return top keywords
        return [word for word, _ in word_freq.most_common(top_n)]

    def process(self, audio_path: Path) -> Dict:
        """
        Complete processing pipeline.
        """
        # Preprocess
        processed_path = self.preprocess_audio(audio_path)

        # Transcribe
        result = transcribe_audio(processed_path, self.model, self.audio_config)

        if result:
            # Postprocess text
            result["processed_text"] = self.postprocess_text(result["full_text"])

            # Extract keywords
            result["keywords"] = self.extract_keywords(result["processed_text"])

            # Add summary statistics
            result["statistics"] = {
                "words_per_minute": len(result["full_text"].split())
                / (result["info"]["duration"] / 60),
                "average_segment_duration": sum(
                    s["duration"] for s in result["segments"]
                )
                / len(result["segments"]),
                "silence_ratio": 1
                - (
                    sum(s["duration"] for s in result["segments"])
                    / result["info"]["duration"]
                ),
            }

        return result


# Initialize pipeline
pipeline = AudioTranscriptionPipeline(WHISPER_CONFIG, AUDIO_CONFIG)

# Process with enhanced pipeline
if audio_files:
    enhanced_result = pipeline.process(audio_files[0])

    if enhanced_result:
        print("\n🎯 Enhanced Processing Results:")
        print(f"\n📝 Processed Text:\n{enhanced_result['processed_text'][:500]}...")
        print(f"\n🔑 Keywords: {', '.join(enhanced_result['keywords'])}")
        print("\n📊 Statistics:")
        for key, value in enhanced_result["statistics"].items():
            print(f"   - {key}: {value:.2f}")

🎙️ Transcribing: test_tone.wav
❌ Transcription failed: VadOptions.__init__() got an unexpected keyword argument 'window_size_samples'


## 12. Next Steps

Now that we have a working audio transcription pipeline, the next steps are:

1. **Speaker Diarization**: Add `pyannote.audio` to identify different speakers
2. **Named Entity Recognition**: Use the transcribed text with `span-marker`
3. **Topic Modeling**: Apply `bertopic` to discover themes
4. **Summarization**: Use `mlx-lm` or transformers for text summarization

### Key Takeaways:
- ✅ Faster-whisper provides accurate transcriptions with timestamps
- ✅ Word-level timestamps enable precise alignment
- ✅ VAD filtering helps with silence detection
- ✅ Confidence scores help identify uncertain segments
- ✅ Export formats support downstream processing

### Performance Tips:
- Use larger models for better accuracy (if resources allow)
- Enable GPU acceleration for faster processing
- Adjust VAD parameters for your audio characteristics
- Consider batch processing for multiple files