# Somali Solfege Converter - Audio Processing

## PRSE (Python Rapid-Systems Engine) Mode

This notebook implements **Phase 1: Initialization** and **Phase 2: Pitch Detection & Note Segmentation** of the Somali Solfege Converter system.

### Design Blueprint: Phase 1

* **Architectural Choice:** Procedural (Linear pipeline is most efficient for this automation)
* **Key Libraries:** `moviepy` (Video handling), `scipy` (I/O), `numpy` (DSP)
* **Memory Strategy:** Immediate conversion to `float32` and deletion of video objects after audio extraction
* **Target Sample Rate:** 22.05kHz for memory efficiency on 8GB RAM systems
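
To make the memory strategy concrete, the sketch below estimates the in-memory size of a recording under two configurations. The five-minute duration and the helper name `audio_bytes` are illustrative assumptions, not values taken from the pipeline.

```python
import numpy as np

def audio_bytes(duration_s, sr, dtype, channels=1):
    """Bytes needed to hold the audio in memory (hypothetical helper)."""
    return int(duration_s * sr) * channels * np.dtype(dtype).itemsize

duration_s = 5 * 60  # assumed example length: five minutes

# A common source format versus the format this notebook targets
baseline = audio_bytes(duration_s, 44100, np.float64, channels=2)
target = audio_bytes(duration_s, 22050, np.float32, channels=1)

print(f"44.1 kHz stereo float64: {baseline / 1e6:.1f} MB")
print(f"22.05 kHz mono float32:  {target / 1e6:.1f} MB")
print(f"Reduction factor: {baseline / target:.0f}x")
```

Halving the sample rate, the channel count, and the sample width each contribute a factor of two.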

### Supported Formats

* **Video:** .mp4, .mov, .avi, .mkv (audio will be extracted automatically)
* **Audio:** .wav, .mp3, .flac, .ogg

## Cell 1: Environment & Dependencies

This cell ensures your virtual environment is ready and all required libraries are installed.

In [None]:
import sys
import subprocess

# List of required libraries
libraries = ['moviepy', 'numpy', 'scipy', 'matplotlib', 'librosa']

def check_setup():
    print("Checking environment dependencies...")
    for lib in libraries:
        try:
            __import__(lib)
            print(f"✅ {lib} is installed.")
        except ImportError:
            print(f"❌ {lib} missing. Installing now...")
            subprocess.check_call([sys.executable, "-m", "pip", "install", lib])

check_setup()

## Cell 2: Audio Extraction & Pre-processing

This cell detects whether your input is a video and extracts its audio stream, or loads an audio file directly.
It then performs the **22.05kHz downsampling** for memory efficiency.

### Features:
* Automatic video-to-audio extraction
* Stereo-to-mono conversion
* Downsampling to 22.05kHz
* Memory-efficient float32 normalization
* Automatic cleanup of temporary files
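
Downsampling without a low-pass filter lets content above the new Nyquist frequency fold back into the retained band. The sketch below compares plain slicing with `scipy.signal.resample_poly` on a synthetic signal. The test tones and the helper `energy_above` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import resample_poly

def energy_above(x, sr, cutoff_hz):
    """Fraction of spectral energy above cutoff_hz (hypothetical helper)."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return spectrum[freqs > cutoff_hz].sum() / spectrum.sum()

sr = 44100
t = np.arange(sr) / sr  # one second of audio
# Assumed test signal: a 440 Hz tone plus a weaker 15 kHz tone,
# which lies above the new Nyquist frequency of 11.025 kHz
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 15000 * t)

naive = x[::2]                             # slicing only, no filter
filtered = resample_poly(x, up=1, down=2)  # low-pass filter, then decimate

# Both results have a sample rate of 22050 Hz
print(f"Energy above 5 kHz, naive:    {energy_above(naive, 22050, 5000):.4f}")
print(f"Energy above 5 kHz, filtered: {energy_above(filtered, 22050, 5000):.4f}")
```

In the sliced version the 15 kHz tone reappears as an alias at 7.05 kHz (22.05 kHz minus 15 kHz), while the filtered version suppresses it.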

In [None]:
import os
# Import moviepy - compatible with both v1.x and v2.x
try:
    from moviepy import VideoFileClip
except ImportError:
    from moviepy.editor import VideoFileClip
from scipy.io import wavfile
import numpy as np
import gc

def prepare_audio_input(file_path, target_sr=22050):
    """
    Extracts audio from video if needed and loads it into memory.
    Optimized for 8GB RAM using float32 and downsampling.
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Input file not found: {file_path}")
    
    ext = os.path.splitext(file_path)[1].lower()
    temp_audio = "temp_extracted_audio.wav"
    
    # Step 1: Video to Audio Extraction (If needed)
    if ext in ['.mp4', '.mov', '.avi', '.mkv']:
        print(f"Video detected. Extracting audio from {file_path}...")
        try:
            video = VideoFileClip(file_path)
            if video.audio is None:
                raise ValueError(f"Video file has no audio track: {file_path}")
            # `verbose` was removed in moviepy v2; `logger=None` silences output in both versions
            video.audio.write_audiofile(temp_audio, fps=target_sr, logger=None)
            video.close() # Close file handle immediately
            load_path = temp_audio
            del video
        except Exception as e:
            if os.path.exists(temp_audio):
                os.remove(temp_audio)
            raise ValueError(f"Failed to extract audio from video: {e}")
    elif ext in ['.wav', '.mp3', '.flac', '.ogg']:
        load_path = file_path
    else:
        raise ValueError(f"Unsupported file format: {ext}. "
                        f"Supported: .mp4, .mov, .avi, .mkv, .wav, .mp3, .flac, .ogg")

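    # Step 2: Load and Downsample
    print("Loading and normalizing audio...")
    try:
        if load_path.lower().endswith('.wav'):
            sr, data = wavfile.read(load_path)
        else:
            # scipy.io.wavfile only reads WAV; decode .mp3/.flac/.ogg with librosa
            import librosa
            data, sr = librosa.load(load_path, sr=None, mono=False)
            data = data.T  # (channels, samples) -> (samples, channels)
    except Exception as e:
        if os.path.exists(temp_audio) and ext in ['.mp4', '.mov', '.avi', '.mkv']:
            os.remove(temp_audio)
        raise ValueError(f"Failed to load audio file: {e}")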
    
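    # Memory-safe conversion to float32 first, so that mono mixing and
    # resampling operate on a consistently scaled signal
    if np.issubdtype(data.dtype, np.integer):
        if data.dtype == np.uint8:
            # 8-bit WAV is unsigned and centered on 128
            samples = (data.astype(np.float32) - 128.0) / 128.0
        else:
            # Signed integer audio - normalize by type max
            info = np.iinfo(data.dtype)
            samples = data.astype(np.float32) / max(abs(info.min), abs(info.max))
    else:
        # Already float - just convert and peak-normalize
        samples = data.astype(np.float32)
        max_val = np.max(np.abs(samples))
        if max_val > 0:
            samples /= max_val

    # Convert to Mono if Stereo
    if samples.ndim > 1:
        print("Converting stereo to mono...")
        samples = samples.mean(axis=1)

    # Resample with an anti-aliasing polyphase filter. Plain slicing would
    # alias high frequencies and only supports integer ratios
    # (e.g. 48000 // 22050 = 2 would yield 24000 Hz, not the target).
    if sr != target_sr:
        from math import gcd
        from scipy.signal import resample_poly
        print(f"Resampling from {sr}Hz to {target_sr}Hz...")
        g = gcd(int(sr), int(target_sr))
        samples = resample_poly(samples, target_sr // g, sr // g).astype(np.float32)
        sr = target_sr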
    
    # Cleanup
    if os.path.exists(temp_audio) and ext in ['.mp4', '.mov', '.avi', '.mkv']:
        os.remove(temp_audio)
    
    del data
    gc.collect()
    
    duration = len(samples) / sr
    print(f"✅ Done. Loaded {duration:.2f}s of audio at {sr}Hz.")
    return samples, sr

# --- TEST THE CELL ---
# Uncomment the lines below and provide your input file path
# INPUT_FILE = "your_video_or_audio_here.mp4" 
# samples, sr = prepare_audio_input(INPUT_FILE)
print("Audio extraction function ready. Set INPUT_FILE and run to process.")

## Cell 3: Visualize Audio Waveform (Optional)

Once you have loaded audio, you can visualize it to verify the extraction worked correctly.

In [None]:
import matplotlib.pyplot as plt

def plot_waveform(samples, sr, duration_limit=10.0):
    """
    Plot the audio waveform.
    
    Args:
        samples: Audio samples array
        sr: Sample rate
        duration_limit: Maximum duration to plot in seconds (default: 10s)
    """
    # Limit the plot to avoid memory issues
    max_samples = int(duration_limit * sr)
    plot_samples = samples[:max_samples]
    
    time = np.arange(len(plot_samples)) / sr
    
    plt.figure(figsize=(12, 4))
    plt.plot(time, plot_samples, linewidth=0.5)
    plt.xlabel('Time (seconds)')
    plt.ylabel('Amplitude')
    plt.title(f'Audio Waveform (first {duration_limit}s)')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# --- VISUALIZE AUDIO ---
# Uncomment to visualize after loading audio
# plot_waveform(samples, sr)
print("Waveform visualization function ready.")

---

## Phase 2: Pitch Detection & Note Segmentation

This phase implements the **YIN Algorithm** for accurate pitch detection and note segmentation.

### Design Blueprint: Phase 2

* **Algorithm:** YIN, a fundamental-frequency estimator (the name alludes to yin and yang; it is not an acronym) - robust for monophonic pitch tracking
* **Frequency Range:** 80-800 Hz (suitable for voice and most melodic instruments)
* **Post-processing:** Median filtering for smooth pitch tracks
* **Note Segmentation:** Groups consecutive frames with similar pitch into notes
* **Optimization:** Focused on clean frequency tracking for Somali Pentatonic practice
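
The core of YIN is the cumulative mean normalized difference (CMND): a squared-difference function over candidate lags, divided by its running mean so that the value at the true period drops towards zero. The sketch below computes it for a single frame of a synthetic tone and applies the threshold rule. The 220 Hz test tone and the helper name `cmnd_frame` are assumptions for illustration; it is a simplified single-frame version, not the full detector.

```python
import numpy as np

def cmnd_frame(frame, lag_max):
    """Cumulative mean normalized difference for lags 0..lag_max (sketch)."""
    diff = np.zeros(lag_max + 1)
    for lag in range(1, lag_max + 1):
        delta = frame[:-lag] - frame[lag:]
        diff[lag] = np.dot(delta, delta)
    cmnd = np.ones(lag_max + 1)          # lag 0 is defined as 1
    running = np.cumsum(diff[1:])        # running sum starts at lag 1
    lags = np.arange(1, lag_max + 1)
    ok = running > 0
    cmnd[1:][ok] = diff[1:][ok] * lags[ok] / running[ok]
    return cmnd

sr = 22050
f0 = 220.0                               # assumed test tone
frame = np.sin(2 * np.pi * f0 * np.arange(2048) / sr)

lag_min, lag_max = int(sr / 800), int(sr / 80)
threshold = 0.1
cmnd = cmnd_frame(frame, lag_max)

# First lag below the threshold, then walk down to the local minimum
below = np.flatnonzero(cmnd[lag_min:lag_max + 1] < threshold)
lag = lag_min + int(below[0])
while lag < lag_max and cmnd[lag + 1] < cmnd[lag]:
    lag += 1

print(f"Estimated pitch: {sr / lag:.1f} Hz (true value {f0} Hz)")
```

Without sub-sample interpolation the estimate is quantized to integer lags, which is why the detector refines it with a parabolic fit.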

### Key Parameters

* **Frame Length:** 2048 samples (~93ms at 22.05kHz)
* **Hop Length:** 512 samples (~23ms at 22.05kHz)
* **Threshold:** 0.1 (lower = stricter, fewer frames accepted as voiced; higher = more permissive)
* **Min Note Duration:** 0.1s (filters out very short notes)
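
The quantities these parameters imply can be checked with a few lines of arithmetic. The sketch below uses the values listed above together with the 80-800 Hz search range; the ten-second clip length is an assumed example.

```python
sr = 22050
frame_length, hop_length = 2048, 512
freq_min, freq_max = 80, 800

frame_ms = 1000 * frame_length / sr
hop_ms = 1000 * hop_length / sr

# Period search range in samples: high frequencies mean short lags
lag_min = int(sr / freq_max)
lag_max = int(sr / freq_min)

# How many periods of the lowest pitch fit into one frame
periods_at_freq_min = frame_length / lag_max

# Frame count for an assumed 10-second clip
n_samples = 10 * sr
n_frames = 1 + (n_samples - frame_length) // hop_length

print(f"Frame: {frame_ms:.1f} ms, hop: {hop_ms:.1f} ms")
print(f"Lag range: {lag_min}-{lag_max} samples")
print(f"Periods of {freq_min} Hz per frame: {periods_at_freq_min:.1f}")
print(f"Frames in 10 s: {n_frames}")
```

With a 23 ms hop, the 0.1 s minimum note duration corresponds to roughly four to five consecutive frames.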

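## Cell 4: YIN Pitch Detection

This cell estimates one fundamental frequency per frame with YIN. Frames in which no lag falls below the threshold are reported as 0 Hz (unvoiced).

In [None]: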
import numpy as np
from scipy.signal import medfilt

def yin_pitch_detection(audio_samples, sr, frame_length=2048, hop_length=512, 
                        threshold=0.1, freq_min=80, freq_max=800):
    """
    YIN algorithm for robust pitch detection.
    Optimized for clean frequency tracking in musical contexts.
    """
    # Calculate lag range based on frequency limits
    lag_min = max(1, int(sr / freq_max))
    lag_max = min(int(sr / freq_min), frame_length - 1)

    # Number of frames (zero if the signal is shorter than one frame)
    n_frames = max(0, 1 + (len(audio_samples) - frame_length) // hop_length)

    # Output arrays
    times = np.arange(n_frames) * hop_length / sr
    pitches = np.zeros(n_frames)

    # Process each frame
    for frame_idx in range(n_frames):
        start = frame_idx * hop_length
        frame = audio_samples[start:start + frame_length].astype(np.float64)

        # Step 1: Difference function for every lag from 1 to lag_max
        # (vectorized per lag instead of a per-sample Python loop)
        diff = np.zeros(lag_max + 1)
        for lag in range(1, lag_max + 1):
            delta = frame[:-lag] - frame[lag:]
            diff[lag] = np.dot(delta, delta)

        # Step 2: Cumulative mean normalized difference.
        # The running sum must start at lag 1, not at lag_min.
        cmnd = np.ones(lag_max + 1)
        cumsum = np.cumsum(diff[1:])
        lags = np.arange(1, lag_max + 1)
        nonzero = cumsum > 0
        cmnd[1:][nonzero] = diff[1:][nonzero] * lags[nonzero] / cumsum[nonzero]

        # Step 3: Find first lag below threshold, then descend to the local minimum
        pitch_lag = None
        for lag in range(lag_min, lag_max + 1):
            if cmnd[lag] < threshold:
                while lag + 1 <= lag_max and cmnd[lag + 1] < cmnd[lag]:
                    lag += 1
                pitch_lag = lag
                break

        if pitch_lag is None:
            continue  # Unvoiced frame: pitch stays 0.0

        # Step 4: Parabolic interpolation for accuracy (skipped at the range edges)
        refined_lag = float(pitch_lag)
        if pitch_lag - 1 >= 1 and pitch_lag + 1 <= lag_max:
            alpha = cmnd[pitch_lag - 1]
            beta = cmnd[pitch_lag]
            gamma = cmnd[pitch_lag + 1]
            denom = alpha - 2 * beta + gamma
            if alpha > beta and gamma > beta and denom > 0:
                refined_lag += 0.5 * (alpha - gamma) / denom
        pitches[frame_idx] = sr / refined_lag

    return times, pitches

# --- DETECT PITCH ---
# Uncomment after loading audio with prepare_audio_input()
# times, raw_pitches = yin_pitch_detection(samples, sr)
# print(f"Detected {len(times)} frames over {times[-1]:.2f}s")
# print(f"Pitch range: {raw_pitches[raw_pitches > 0].min():.1f} - {raw_pitches[raw_pitches > 0].max():.1f} Hz")
print("YIN pitch detection function ready.")

## Cell 5: Pitch Smoothing & Note Segmentation

This cell smooths the raw pitch track and segments it into individual musical notes.
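
The effect of median filtering on an isolated octave error can be seen on a toy pitch track. The values below are made up for illustration.

```python
import numpy as np
from scipy.signal import medfilt

# Assumed toy pitch track (Hz): a steady note near 220 Hz with one octave-error spike
track = np.array([220.0, 221.0, 219.0, 440.0, 220.0, 222.0, 220.0])

smoothed = medfilt(track, kernel_size=5)

print("raw:     ", track)
print("smoothed:", smoothed)
```

The 440 Hz outlier is replaced by the median of its neighbours, while the surrounding values move by at most a hertz or two. `medfilt` pads the signal with zeros at both ends, so the first and last few values of a track can be pulled downwards; `kernel_size` must be odd.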

In [None]:
def smooth_pitch_track(pitches, kernel_size=5):
    """
    Smooth the pitch track using median filtering.
    Removes outliers while preserving note transitions.
    """
    smoothed = pitches.copy()
    non_zero_mask = pitches > 0
    
    if np.sum(non_zero_mask) > kernel_size:
        smoothed[non_zero_mask] = medfilt(pitches[non_zero_mask], kernel_size=kernel_size)
    
    return smoothed

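def segment_notes(times, pitches, min_note_duration=0.1, pitch_tolerance=20):
    """
    Segment the pitch track into individual notes.
    Groups consecutive voiced frames whose pitch stays within
    pitch_tolerance (Hz) of the running mean of the current note.
    Notes shorter than min_note_duration (seconds) are discarded.
    Returns a list of dicts with start_time, end_time, duration,
    mean_pitch, median_pitch, start_idx, and pitches.
    """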
    if len(times) == 0 or len(pitches) == 0:
        return []
    
    notes = []
    current_note = None
    
    for i, (time, pitch) in enumerate(zip(times, pitches)):
        if pitch > 0:  # Voiced frame
            if current_note is None:
                current_note = {
                    'start_time': time,
                    'start_idx': i,
                    'pitches': [pitch]
                }
            else:
                mean_pitch = np.mean(current_note['pitches'])
                if abs(pitch - mean_pitch) < pitch_tolerance:
                    current_note['pitches'].append(pitch)
                else:
                    # End current note
                    end_time = times[i - 1] if i > 0 else time
                    duration = end_time - current_note['start_time']
                    
                    if duration >= min_note_duration:
                        current_note['end_time'] = end_time
                        current_note['duration'] = duration
                        current_note['mean_pitch'] = np.mean(current_note['pitches'])
                        current_note['median_pitch'] = np.median(current_note['pitches'])
                        notes.append(current_note)
                    
                    # Start new note
                    current_note = {
                        'start_time': time,
                        'start_idx': i,
                        'pitches': [pitch]
                    }
        else:  # Unvoiced frame
            if current_note is not None:
                end_time = times[i - 1] if i > 0 else time
                duration = end_time - current_note['start_time']
                
                if duration >= min_note_duration:
                    current_note['end_time'] = end_time
                    current_note['duration'] = duration
                    current_note['mean_pitch'] = np.mean(current_note['pitches'])
                    current_note['median_pitch'] = np.median(current_note['pitches'])
                    notes.append(current_note)
                
                current_note = None
    
    # Handle last note
    if current_note is not None:
        end_time = times[-1]
        duration = end_time - current_note['start_time']
        
        if duration >= min_note_duration:
            current_note['end_time'] = end_time
            current_note['duration'] = duration
            current_note['mean_pitch'] = np.mean(current_note['pitches'])
            current_note['median_pitch'] = np.median(current_note['pitches'])
            notes.append(current_note)
    
    return notes

# --- PROCESS PITCH TRACK ---
# Uncomment after detecting pitch
# smoothed_pitches = smooth_pitch_track(raw_pitches, kernel_size=5)
# detected_notes = segment_notes(times, smoothed_pitches, min_note_duration=0.1, pitch_tolerance=20)
# print(f"Detected {len(detected_notes)} notes")
# for i, note in enumerate(detected_notes[:5]):
#     print(f"Note {i+1}: {note['mean_pitch']:.1f} Hz, duration: {note['duration']:.2f}s")
print("Pitch smoothing and note segmentation functions ready.")

## Cell 6: Visualize Pitch Detection Results

Visualize the pitch track and detected notes to verify the detection quality.

In [None]:
def plot_pitch_track(times, raw_pitches, smoothed_pitches=None, notes=None, duration_limit=10.0):
    """
    Plot the pitch track with optional smoothing and note boundaries.
    """
    # Limit to duration
    mask = times <= duration_limit
    times_plot = times[mask]
    raw_plot = raw_pitches[mask]
    
    plt.figure(figsize=(14, 6))
    
    # Plot raw pitch track
    voiced_mask = raw_plot > 0
    plt.plot(times_plot[voiced_mask], raw_plot[voiced_mask], 'o', 
             alpha=0.3, markersize=2, label='Raw pitch', color='gray')
    
    # Plot smoothed pitch track if provided
    if smoothed_pitches is not None:
        smoothed_plot = smoothed_pitches[mask]
        voiced_mask_smooth = smoothed_plot > 0
        plt.plot(times_plot[voiced_mask_smooth], smoothed_plot[voiced_mask_smooth], 
                 '-', linewidth=2, label='Smoothed pitch', color='blue')
    
    # Plot note boundaries if provided
    if notes is not None:
        for note in notes:
            if note['start_time'] <= duration_limit:
                plt.axvline(note['start_time'], color='green', linestyle='--', alpha=0.5)
                # hlines takes data coordinates; axhline's xmin/xmax are axes
                # fractions and would misplace the bar once the axis autoscales
                plt.hlines(note['mean_pitch'],
                           note['start_time'],
                           min(note['end_time'], duration_limit),
                           colors='red', linewidth=3, alpha=0.6)
    
    plt.xlabel('Time (seconds)', fontsize=12)
    plt.ylabel('Frequency (Hz)', fontsize=12)
    plt.title('Pitch Detection Results', fontsize=14)
    plt.grid(True, alpha=0.3)
    plt.legend()
    plt.tight_layout()
    plt.show()

# --- VISUALIZE PITCH ---
# Uncomment after pitch detection
# plot_pitch_track(times, raw_pitches, smoothed_pitches, detected_notes, duration_limit=10.0)
print("Pitch visualization function ready.")

---

## Complete Workflow Example

Here's how to use all the cells together:

```python
# Step 1: Load audio (Cell 2)
INPUT_FILE = "your_audio.mp4"
samples, sr = prepare_audio_input(INPUT_FILE, target_sr=22050)

# Step 2: Detect pitch (Cell 4)
times, raw_pitches = yin_pitch_detection(samples, sr)

# Step 3: Smooth and segment (Cell 5)
smoothed_pitches = smooth_pitch_track(raw_pitches, kernel_size=5)
detected_notes = segment_notes(times, smoothed_pitches, 
                               min_note_duration=0.1, 
                               pitch_tolerance=20)

# Step 4: Visualize (Cell 6)
plot_pitch_track(times, raw_pitches, smoothed_pitches, detected_notes)

# Step 5: Analyze results
print(f"Total notes detected: {len(detected_notes)}")
for i, note in enumerate(detected_notes):
    print(f"Note {i+1}: {note['mean_pitch']:.1f} Hz, "
          f"duration: {note['duration']:.2f}s, "
          f"time: {note['start_time']:.2f}-{note['end_time']:.2f}s")
```
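
To try the workflow without a recording at hand, a synthetic melody can be written to a WAV file first. This is a minimal sketch: the helper `write_test_melody`, the tone frequencies, and the file name are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

def write_test_melody(path, freqs_hz, note_s=0.5, gap_s=0.1, sr=22050):
    """Write sine tones separated by silence to a 16-bit mono WAV (hypothetical helper)."""
    parts = []
    for f in freqs_hz:
        t = np.arange(int(note_s * sr)) / sr
        parts.append(0.8 * np.sin(2 * np.pi * f * t))  # tone
        parts.append(np.zeros(int(gap_s * sr)))        # silence between notes
    signal = np.concatenate(parts)
    wavfile.write(path, sr, (signal * 32767).astype(np.int16))
    return len(signal) / sr

# Assumed example pitches (Hz), all inside the 80-800 Hz search range
test_freqs = [220.0, 262.0, 294.0, 330.0, 392.0]
test_path = os.path.join(tempfile.gettempdir(), "test_melody.wav")

duration = write_test_melody(test_path, test_freqs)
print(f"Wrote {duration:.2f}s of audio to {test_path}")
```

Passing `test_path` as `INPUT_FILE` in Step 1 should then yield detected notes close to the frequencies in `test_freqs`, which gives a quick sanity check of the detection settings.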

## Next Steps: Phase 3 - Somali Pentatonic Scale Mapping

The next phase will implement:
1. **Pentatonic Scale Detection**: Identify the tonic and scale type
2. **Solfege Mapping**: Map detected notes to Somali solfege syllables
3. **Export Functionality**: Save results in various formats (JSON, MIDI, etc.)
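
As a possible starting point for the export step, the sketch below converts note frequencies to fractional MIDI note numbers (using the standard convention A4 = 440 Hz = MIDI 69) and serializes the scalar fields to JSON. The function names and output field names are assumptions. The mapping to Somali solfege syllables is deliberately left out, since it is not yet defined in this notebook.

```python
import json
import math

def hz_to_midi(freq_hz):
    """Fractional MIDI note number for a frequency (A4 = 440 Hz = 69)."""
    return 69 + 12 * math.log2(freq_hz / 440.0)

def notes_to_json(notes):
    """Serialize note dicts to JSON, keeping only scalar summary fields (sketch)."""
    rows = []
    for note in notes:
        mean_pitch = float(note["mean_pitch"])  # also converts NumPy scalars
        rows.append({
            "start_time": round(float(note["start_time"]), 3),
            "end_time": round(float(note["end_time"]), 3),
            "mean_pitch_hz": round(mean_pitch, 2),
            "midi": round(hz_to_midi(mean_pitch), 2),
        })
    return json.dumps(rows, indent=2)

# Assumed sample input shaped like the output of segment_notes()
example_notes = [
    {"start_time": 0.0, "end_time": 0.5, "mean_pitch": 440.0},
    {"start_time": 0.6, "end_time": 1.1, "mean_pitch": 220.0},
]
print(notes_to_json(example_notes))
```

The per-frame `pitches` list is dropped on purpose to keep the export small; it could be added back if frame-level detail is needed downstream.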