# Azure Speech Service Batch Transcription

This notebook tests Azure's Batch Transcription REST API for Kannada speech-to-text.

## Key Features Being Tested

1. **Kannada (kn-IN) support** - Basic transcription quality
2. **Diarization** - Speaker separation (tutor vs. child)
3. **Word-level timestamps** - For alignment with the Whisper baseline
4. **Cost tracking** - Compare with OpenAI Whisper pricing

## Test Strategy

- **Test 1**: Single short file WITHOUT diarization (baseline)
- **Test 2**: Same file WITH diarization (check Kannada support)
- **Test 3**: Process all 40 files (with diarization only if Test 2 shows it works)

## Expected Outcome

- Transcription results in `files/transcriptions/azure_batch/`
- JSON format matching Whisper output structure
- Cost comparison: Azure (~$9 for 9 hours) vs. Whisper ($2.67)
- **Bonus**: Speaker labels if diarization works for Kannada
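
The cost figures above can be reduced to per-hour rates and a ratio with a little arithmetic. This is a minimal sketch that uses only the estimates quoted in this notebook (~$9 and $2.67 for 9 hours), not measured results; the helper name `cost_summary` is illustrative and not part of `voice_eval`.

```python
def cost_summary(azure_total: float, whisper_total: float, hours: float) -> dict:
    """Derive per-hour rates and the Azure/Whisper cost ratio from totals."""
    if hours <= 0 or whisper_total <= 0:
        raise ValueError("hours and whisper_total must be positive")
    return {
        "azure_per_hour": azure_total / hours,
        "whisper_per_hour": whisper_total / hours,
        "ratio": azure_total / whisper_total,
    }

# Estimates quoted in this notebook, not measured results
summary = cost_summary(azure_total=9.0, whisper_total=2.67, hours=9.0)
print(f"Azure:   ${summary['azure_per_hour']:.2f}/hour")
print(f"Whisper: ${summary['whisper_per_hour']:.2f}/hour")
print(f"Azure / Whisper: {summary['ratio']:.1f}x")
```

Once the batch run finishes, the same helper could be fed the measured totals from the batch summary instead of these estimates.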

In [None]:
import sys
import os
import json
from pathlib import Path
from datetime import datetime

# Add project root to path
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root / 'src'))

from voice_eval.azure_batch_api import AzureBatchTranscription, transcribe_audio
from dotenv import load_dotenv

# Load environment variables
load_dotenv(project_root / '.env')

print("✓ Imports successful")
print(f"✓ Azure Region: {os.getenv('AZURE_REGION')}")
print(f"✓ Storage Account: {os.getenv('AZURE_STORAGE_BUCKET_NAME')}")
print(f"✓ Container: {os.getenv('AZURE_STORAGE_CONTAINER_NAME')}")

## Setup Output Directory

In [None]:
# Create output directory for transcription results
output_dir = project_root / 'files' / 'transcriptions' / 'azure_batch'
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Output directory: {output_dir}")

## Test 1: Single File WITHOUT Diarization (Baseline)

Test basic Kannada transcription without speaker separation.

**Test file**: `+919742536994_3_converted.mp3` (10 seconds, shortest file)

In [None]:
# Initialize client
client = AzureBatchTranscription()

# Test with shortest file (10 seconds)
test_file = "+919742536994_3_converted.mp3"

print("="*80)
print("TEST 1: Basic Transcription (NO Diarization)")
print("="*80)

try:
    result_no_diarization = client.transcribe_file(
        blob_name=test_file,
        locale="kn-IN",
        enable_diarization=False,  # Test without diarization first
        poll_interval=3,
        max_wait_time=300
    )
    
    print("\n" + "="*80)
    print("RESULT")
    print("="*80)
    print(f"Duration: {result_no_diarization['duration']:.1f} seconds")
    print(f"Cost: ${result_no_diarization['cost']:.4f}")
    print(f"Segments: {len(result_no_diarization['segments'])}")
    print(f"\nTranscription text:\n{result_no_diarization['text'][:500]}")
    
    # Save result
    output_file = output_dir / f"{Path(test_file).stem}_no_diarization.json"
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(result_no_diarization, f, indent=2, ensure_ascii=False)
    print(f"\n✓ Saved to: {output_file}")
    
except Exception as e:
    print(f"\n❌ Error: {e}")
    import traceback
    traceback.print_exc()

## Test 2: Single File WITH Diarization

**Critical Test**: Does Azure support diarization for Kannada (kn-IN)?

Azure's official docs do not say whether diarization is available for kn-IN, so we test it empirically.
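
For orientation, a diarization request to the Batch Transcription REST API is a JSON body with a `properties` object. The sketch below builds such a body using the field names from the v3.1 request schema as we understand it (`diarizationEnabled`, `diarization.speakers.minCount`/`maxCount`, `wordLevelTimestampsEnabled`); other API versions name these fields differently. Whether `AzureBatchTranscription` sends exactly this payload is an assumption, and `build_transcription_request` and the example URL are hypothetical.

```python
import json

def build_transcription_request(content_url, locale="kn-IN", diarize=True,
                                min_speakers=2, max_speakers=2):
    """Build a request body in the shape of the v3.1 schema (assumed)."""
    properties = {"wordLevelTimestampsEnabled": True}
    if diarize:
        # Speaker bounds: tutor + child in this project
        properties["diarizationEnabled"] = True
        properties["diarization"] = {
            "speakers": {"minCount": min_speakers, "maxCount": max_speakers}
        }
    return {
        "contentUrls": [content_url],
        "locale": locale,
        "displayName": f"diarization-test-{locale}",
        "properties": properties,
    }

# Placeholder URL, not a real blob
body = build_transcription_request(
    "https://example.blob.core.windows.net/audio/sample.mp3"
)
print(json.dumps(body, indent=2))
```

Printing the body before submitting makes it easier to rule out "incorrect request parameters" if the service rejects the job.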

In [None]:
print("="*80)
print("TEST 2: Transcription WITH Diarization (Speaker Separation)")
print("="*80)
print("Testing if Azure supports diarization for Kannada (kn-IN)...\n")

try:
    result_with_diarization = client.transcribe_file(
        blob_name=test_file,
        locale="kn-IN",
        enable_diarization=True,  # Enable speaker separation
        min_speakers=2,  # Tutor + child
        max_speakers=2,
        poll_interval=3,
        max_wait_time=300
    )
    
    print("\n" + "="*80)
    print("RESULT")
    print("="*80)
    print(f"Duration: {result_with_diarization['duration']:.1f} seconds")
    print(f"Cost: ${result_with_diarization['cost']:.4f}")
    print(f"Segments: {len(result_with_diarization['segments'])}")
    
    # Check if speaker info is present
    has_speakers = any('speaker' in seg for seg in result_with_diarization['segments'])
    if has_speakers:
        print("\n🎉 SUCCESS: Diarization works for Kannada!")
        print("\nSpeaker breakdown:")
        speakers = {}
        for seg in result_with_diarization['segments']:
            speaker_id = seg.get('speaker', 'Unknown')
            speakers[speaker_id] = speakers.get(speaker_id, 0) + 1
        for speaker, count in speakers.items():
            print(f"  Speaker {speaker}: {count} segments")
    else:
        print("\n⚠️  WARNING: No speaker information in results")
        print("Diarization may not be supported for Kannada (kn-IN)")
    
    print(f"\nTranscription text:\n{result_with_diarization['text'][:500]}")
    
    # Save result
    output_file = output_dir / f"{Path(test_file).stem}_with_diarization.json"
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(result_with_diarization, f, indent=2, ensure_ascii=False)
    print(f"\n✓ Saved to: {output_file}")
    
except Exception as e:
    print(f"\n❌ Error: {e}")
    print("\nPossible reasons:")
    print("1. Diarization not supported for Kannada (kn-IN)")
    print("2. API version or region issue")
    print("3. Incorrect request parameters")
    import traceback
    traceback.print_exc()

## Compare Results: No Diarization vs. With Diarization

Compare the two transcriptions to see if they differ.
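
Exact string equality is a blunt check, since the two runs could differ only in spacing or segmentation. A similarity ratio gives a softer signal. The sketch below uses `difflib.SequenceMatcher` from the standard library; `text_similarity` is an illustrative helper, and the sample strings stand in for the `text` fields of the two results.

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio after collapsing whitespace."""
    norm_a = " ".join(a.split())
    norm_b = " ".join(b.split())
    # autojunk=False keeps the heuristic from discarding frequent characters
    return SequenceMatcher(None, norm_a, norm_b, autojunk=False).ratio()

# Stand-ins for result_no_diarization['text'] and result_with_diarization['text']
sample_a = "ನಮಸ್ಕಾರ  ಹೇಗಿದ್ದೀರಿ"
sample_b = "ನಮಸ್ಕಾರ ಹೇಗಿದ್ದೀರಿ"
print(f"Similarity: {text_similarity(sample_a, sample_b):.2f}")
```

A ratio near 1.0 with `Text identical: False` would suggest the difference is cosmetic rather than a change in recognized words.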

In [None]:
print("="*80)
print("COMPARISON")
print("="*80)

if 'result_no_diarization' in locals() and 'result_with_diarization' in locals():
    print(f"\nText identical: {result_no_diarization['text'] == result_with_diarization['text']}")
    print(f"Segments (no diarization): {len(result_no_diarization['segments'])}")
    print(f"Segments (with diarization): {len(result_with_diarization['segments'])}")
    
    # Check if diarization added speaker info
    has_speaker_info = any('speaker' in seg for seg in result_with_diarization['segments'])
    print(f"Speaker information present: {has_speaker_info}")
    
    if has_speaker_info:
        print("\n✓ Diarization is supported for Kannada!")
        print("Proceeding with batch processing using diarization.")
    else:
        print("\n⚠️  Diarization may not be supported for Kannada.")
        print("Will proceed with batch processing WITHOUT diarization.")
else:
    print("\n⚠️  Could not compare—one or both tests failed.")

## Test 3: Batch Processing (All 40 Files)

Based on the diarization test results, process all files.

**Note**: This will take 20-40 minutes depending on Azure's processing speed.
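
Most of that time is spent waiting: each batch job is submitted and then polled until it finishes. The sketch below shows one way the `poll_interval` and `max_wait_time` arguments could be honored. It is not the implementation inside `AzureBatchTranscription`; `wait_for_job` and the `get_status` callable are hypothetical, and the status strings are assumed to follow the values the Batch Transcription API reports (`NotStarted`, `Running`, `Succeeded`, `Failed`).

```python
import time

def wait_for_job(get_status, poll_interval=5.0, max_wait_time=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the job ends or max_wait_time elapses."""
    deadline = clock() + max_wait_time
    while True:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {max_wait_time}s")
        sleep(poll_interval)

# Simulated job: two polls report progress, the third reports success
statuses = iter(["NotStarted", "Running", "Succeeded"])
final = wait_for_job(lambda: next(statuses), poll_interval=0, max_wait_time=5)
print(final)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.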

In [None]:
# List all audio files in blob storage
from azure.storage.blob import BlobServiceClient

# Get blob list
connection_string = os.getenv('AZURE_STORAGE_CONN_STR')
container_name = os.getenv('AZURE_STORAGE_CONTAINER_NAME')

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container_name)

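# Collect MP3 blob names, sorted so the processing order is reproducible
all_blobs = sorted(
    blob.name for blob in container_client.list_blobs() if blob.name.endswith('.mp3')
)

print(f"Found {len(all_blobs)} MP3 files in blob storage:")
for blob in all_blobs[:5]:
    print(f"  - {blob}")
# Only report a remainder when there is one (avoids printing a negative count)
if len(all_blobs) > 5:
    print(f"  ... and {len(all_blobs) - 5} more")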

In [None]:
# Determine whether to use diarization based on Test 2
use_diarization = False
if 'result_with_diarization' in locals():
    use_diarization = any('speaker' in seg for seg in result_with_diarization['segments'])

print(f"\nBatch processing with diarization: {use_diarization}")
print(f"Files to process: {len(all_blobs)}")
print("\nThis will take approximately 20-40 minutes...\n")

# Confirm before proceeding
proceed = input("Proceed with batch processing? (yes/no): ")
if proceed.lower() != 'yes':
    print("Batch processing cancelled.")

In [None]:
# Batch process all files
if proceed.lower() == 'yes':
    results = []
    failed_files = []
    
    print("="*80)
    print(f"BATCH PROCESSING: {len(all_blobs)} files")
    print("="*80)
    
    for idx, blob_name in enumerate(all_blobs, 1):
        print(f"\n[{idx}/{len(all_blobs)}] {blob_name}")
        
        # Check if already processed
        output_file = output_dir / f"{Path(blob_name).stem}_azure.json"
        if output_file.exists():
            print(f"  ⊙ Already processed, skipping")
            continue
        
        try:
            result = client.transcribe_file(
                blob_name=blob_name,
                locale="kn-IN",
                enable_diarization=use_diarization,
                min_speakers=2,
                max_speakers=2,
                poll_interval=5,
                max_wait_time=600
            )
            
            results.append(result)
            
            # Save individual result
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(result, f, indent=2, ensure_ascii=False)
            
        except Exception as e:
            print(f"  ❌ Failed: {e}")
            failed_files.append({'file': blob_name, 'error': str(e)})
    
    # Summary
    print("\n" + "="*80)
    print("BATCH PROCESSING COMPLETE")
    print("="*80)
    print(f"Successfully processed: {len(results)} files")
    print(f"Failed: {len(failed_files)} files")
    
    if results:
        total_duration = sum(r['duration'] for r in results)
        total_cost = sum(r['cost'] for r in results)
        print(f"\nTotal audio processed: {total_duration / 3600:.2f} hours")
        print(f"Total cost: ${total_cost:.2f}")
    
    if failed_files:
        print("\nFailed files:")
        for fail in failed_files:
            print(f"  - {fail['file']}: {fail['error']}")
    
    # Save batch summary
    summary_file = output_dir / f"batch_summary_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(summary_file, 'w', encoding='utf-8') as f:
        json.dump({
            'processed': len(results),
            'failed': len(failed_files),
            'total_duration_hours': total_duration / 3600 if results else 0,
            'total_cost': total_cost if results else 0,
            'diarization_enabled': use_diarization,
            'failed_files': failed_files
        }, f, indent=2)
    print(f"\n✓ Summary saved to: {summary_file}")

## Compare with Whisper Baseline

Load a Whisper transcription and compare with Azure's output.

In [None]:
# Load corresponding Whisper result
whisper_dir = project_root / 'files' / 'transcriptions' / 'batch_whisper_gpt4o'
whisper_file = whisper_dir / f"{Path(test_file).stem}.json"

if whisper_file.exists() and 'result_no_diarization' in locals():
    with open(whisper_file, 'r', encoding='utf-8') as f:
        whisper_result = json.load(f)
    
    print("="*80)
    print("AZURE vs. WHISPER COMPARISON")
    print("="*80)
    print(f"\nFile: {test_file}")
    print(f"\nDuration:")
    print(f"  Whisper: {whisper_result.get('duration', 0):.1f}s")
    print(f"  Azure:   {result_no_diarization['duration']:.1f}s")
    
    print(f"\nSegments:")
    print(f"  Whisper: {len(whisper_result.get('segments', []))}")
    print(f"  Azure:   {len(result_no_diarization['segments'])}")
    
    print(f"\nCost:")
    print(f"  Whisper: ${whisper_result.get('whisper_cost', 0):.4f}")
    print(f"  Azure:   ${result_no_diarization['cost']:.4f}")
    
    print(f"\nWhisper transcription (Kannada):")
    print(f"  {whisper_result.get('kannada_full_text', '')[:200]}...")
    
    print(f"\nAzure transcription (Kannada):")
    print(f"  {result_no_diarization['text'][:200]}...")
    
    print("\n⚠️  Note: Visual comparison only. Use WER/CER metrics for quantitative analysis.")
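else:
    # Either the Whisper JSON is missing or Test 1 did not produce a result
    print("\n⚠️  Cannot compare: Whisper baseline not found or Test 1 has no result.")
    print("Run the batch_whisper_gpt4o notebook and Test 1 first.")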

## Summary & Next Steps

### What We Learned

1. ✅ **Kannada Transcription**: Azure Speech Service supports kn-IN
2. ❓ **Diarization**: Check results above to see if speaker separation works
3. 💰 **Cost**: Azure is roughly 3x more expensive than Whisper (~$9 vs. $2.67 for 9 hours)
4. ⏱️ **Speed**: The Batch API is asynchronous, so results arrive after queueing and polling rather than in real time

### Next Steps

1. **Implement WER/CER evaluation** (notebook 11)
   - Compare Azure vs. Whisper accuracy
   - Evaluate on child speech segments (if diarization works)

2. **Add AssemblyAI** for comparison
   - Another Tier 1 STT provider with Kannada support

3. **Process remaining 3 large files**
   - Azure can handle 20-32 minute files (Whisper hit its 25 MB file-size limit)

4. **Decision point**: Best STT for Youth Impact
   - Accuracy (WER/CER)
   - Cost ($/hour)
   - Features (diarization, latency)
   - Infrastructure (already using Azure?)
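
For the WER/CER evaluation listed under Next Steps, a minimal sketch of both metrics using a plain Levenshtein distance is shown here. It assumes a reference transcript is available and does no text normalization. The function names are illustrative. Note that this CER counts Unicode code points, so Kannada vowel signs count as separate characters; a grapheme-aware variant may be more appropriate.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over code points, spaces included."""
    ref_chars = list(" ".join(reference.split()))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(" ".join(hypothesis.split()))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)

# English stand-in so the edits are easy to verify by hand
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

In the example, one substitution and one deleted word give a WER of 2/6; the same edits at character level give a CER of 5/22.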