## Key install commands (Python 3.10):

### Core audio + features:

pip install librosa soundfile numpy scipy

#### **VAD:**

pip install webrtcvad-wheels (ships prebuilt wheels; the legacy webrtcvad package is built from source at install time)

plus PyTorch + torchaudio if/when you swap to Silero VAD:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu (CPU build example)

#### **LOUDNESS:**

pip install pyloudnorm

#### **ASR:**

pip install faster-whisper (CTranslate2-based Whisper)
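
After running the installs, a quick check from Python can confirm which modules are importable in the active kernel. This is only a minimal sketch: the helper name `missing_modules` is hypothetical, and the list uses import names (for example `faster_whisper`, not the pip name `faster-whisper`).

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names (not pip package names) used by the pipeline in this notebook.
PIPELINE_MODULES = [
    "numpy", "soundfile", "librosa", "pyloudnorm", "webrtcvad", "faster_whisper",
]

if __name__ == "__main__":
    missing = missing_modules(PIPELINE_MODULES)
    print("Missing modules:", missing if missing else "none")
```

If anything is reported as missing, re-run the matching install command and restart the kernel.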

In [1]:
%pip install soundfile librosa pyloudnorm numpy

Collecting pyloudnorm
  Downloading pyloudnorm-0.1.1-py3-none-any.whl.metadata (5.6 kB)
Collecting future>=0.16.0 (from pyloudnorm)
  Downloading future-1.0.0-py3-none-any.whl.metadata (4.0 kB)
Downloading pyloudnorm-0.1.1-py3-none-any.whl (9.6 kB)
Downloading future-1.0.0-py3-none-any.whl (491 kB)
Installing collected packages: future, pyloudnorm
Successfully installed future-1.0.0 pyloudnorm-0.1.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
# ASR:
%pip install faster-whisper

Note: you may need to restart the kernel to use updated packages.


In [4]:
# VAD:
#webrtcvad:
%pip install webrtcvad

#Silero VAD: (matching our CUDA/CPU build)
%pip install torch torchaudio

Collecting webrtcvad
  Downloading webrtcvad-2.0.10.tar.gz (66 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: webrtcvad
  Building wheel for webrtcvad (pyproject.toml) ... done
  Created wheel for webrtcvad: filename=webrtcvad-2.0.10-cp310-cp310-macosx_15_0_arm64.whl size=30716 sha256=ebb66f3d5ab8b1d57c93e6c83ab2b360392c9afbb84fde699c304757946749ec
  Stored in directory: /Users/libiv/Library/Caches/pip/wheels/2a/2b/84/ac7bacfe8c68a87c1ee3dd3c66818a54c71599abf308e8eb35
Successfully built webrtcvad
Installing collected packages: webrtcvad
Successfully installed webrtcvad-2.0.10
Note: you may need to restart the kernel to use updated packages.
Collecting torchaudio
  Downloading torchaudio-2.9.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (6.9 kB)
Downloading torchaudio-2.9.1-cp310-cp310-macosx_11_0_arm64.whl (80

# Imports summary:

In [None]:
# Used in audio pipeline 1
import numpy as np
import soundfile as sf
import librosa
import pyloudnorm as pyln
import webrtcvad
from faster_whisper import WhisperModel

# Used in feature interpretation functions
from typing import Dict, Any, List, Optional

# Used to inspect the full structured result
import pprint

## Pipeline 1:
### librosa + Silero VAD + librosa.pyin + pyloudnorm + faster‑whisper

* librosa (features, pitch via pyin)

* Silero VAD for pause ratio (the code below uses webrtcvad as a lighter drop-in)

* pyloudnorm for LUFS-based loudness / volume stability

* faster-whisper for ASR-based speech rate (WPM)

In [None]:
'''
  Extract and calculate the following features:
  Pause ratio,
  Pitch mean/std/range (Hz)
  RMS mean/std/CV,
  Integrated loudness (LUFS)
  Speech rate (WPM)
  Acoustic speech rate proxy (voiced frames/sec)
'''

AUDIO_PATH = "/Users/libiv/code/VERA/data/raw/extracted_audio/test_video_1_clean_slice_20251201_165326.mp3"

# 1) Load audio (mono, 16 kHz)
y, sr = sf.read(AUDIO_PATH)
if y.ndim > 1:
    y = np.mean(y, axis=1)
target_sr = 16000
if sr != target_sr:
    y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    sr = target_sr

duration_sec = len(y) / sr

# 2) VAD-based pause ratio with webrtcvad
vad = webrtcvad.Vad(2)  # 0–3, higher = more aggressive
frame_ms = 30
frame_len = int(sr * frame_ms / 1000)
num_frames = len(y) // frame_len
speech_frames = 0

# webrtcvad expects 16-bit mono PCM bytes; clip first so resampling overshoot cannot wrap around in int16
pcm = (np.clip(y, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
for i in range(num_frames):
    start = i * frame_len * 2  # 2 bytes per sample
    end = start + frame_len * 2
    frame = pcm[start:end]
    if len(frame) < frame_len * 2:
        break
    if vad.is_speech(frame, sr):
        speech_frames += 1

speech_time = speech_frames * frame_ms / 1000.0
pause_time = duration_sec - speech_time
pause_ratio = max(pause_time, 0.0) / max(duration_sec, 1e-6)

# 3) Pitch variation via librosa.pyin (F0 in Hz)
f0, voiced_flag, _ = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"),
    sr=sr
)
f0_voiced = f0[~np.isnan(f0)]
if len(f0_voiced) > 0:
    pitch_mean = float(np.mean(f0_voiced))
    pitch_std = float(np.std(f0_voiced))
    pitch_range = float(np.max(f0_voiced) - np.min(f0_voiced))
else:
    pitch_mean = pitch_std = pitch_range = np.nan

# 4) Volume stability via RMS + LUFS
frame_len_rms = int(0.05 * sr)
hop_len_rms = frame_len_rms // 2
rms = librosa.feature.rms(y=y, frame_length=frame_len_rms, hop_length=hop_len_rms)[0]
rms_mean = float(np.mean(rms))
rms_std = float(np.std(rms))
rms_cv = float(rms_std / (rms_mean + 1e-8))

meter = pyln.Meter(sr)
lufs = float(meter.integrated_loudness(y))

# 5) Speech rate via faster-whisper (WPM)
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _ = model.transcribe(AUDIO_PATH, beam_size=1)
words = 0
first_t = None
last_t = None
for seg in segments:
    text = seg.text.strip()
    if not text:
        continue
    seg_words = text.split()
    words += len(seg_words)
    if first_t is None:
        first_t = seg.start
    last_t = seg.end

if words > 0 and last_t is not None and first_t is not None:
    spoken_dur_min = (last_t - first_t) / 60.0
    wpm = words / max(spoken_dur_min, 1e-6)
else:
    wpm = 0.0

# Acoustic speech rate proxy: voiced frames per second from pyin
voiced_rate = float(np.mean(voiced_flag)) * (len(voiced_flag) / duration_sec) if duration_sec > 0 else 0.0

print("Pause ratio (fraction of non-speech time):", pause_ratio)
print("Pitch mean/std/range (Hz):", pitch_mean, pitch_std, pitch_range)
print("RMS mean/std/CV:", rms_mean, rms_std, rms_cv)
print("Integrated loudness (LUFS - Loudness Units relative to Full Scale):", lufs)
print("Speech rate (WordsPerMinute):", wpm)
print("Acoustic speech rate proxy (voiced frames/sec):", voiced_rate)




Pause ratio: 0.17900000000000002
Pitch mean/std/range (Hz): 226.4934830700988 40.350131208724754 315.5499338904208
RMS mean/std/CV: 0.04626308009028435 0.04169349744915962 0.9012259528663784
Integrated loudness (LUFS): -23.473406650906426
Speech rate (WPM): 143.04635761589404
Acoustic speech rate proxy (voiced frames/sec): 24.999999999999996


## 1. Speech rate
**What it means:** How fast the person is speaking, measured in words per minute (WPM), plus an acoustic proxy for how dense the voiced sound is.

#### **How Pipeline 1 gets it:**

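**Words/minute (WPM):** faster‑whisper transcribes the audio and returns timestamps for each segment; we count the words and divide by the time from the start of the first segment to the end of the last one, so pauses between segments are included in the denominator.

**Acoustic proxy:** librosa's pyin marks “voiced frames” (frames where the algorithm finds a clear pitch); more voiced frames per second suggests denser speech (faster speaking or fewer pauses), so it is only a rough proxy for rate.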
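
The WPM calculation described above can be sketched independently of the ASR model. The snippet below is an illustration under assumptions: segments are plain `(start_sec, end_sec, text)` tuples standing in for transcription segments, and `words_per_minute` is a hypothetical helper name.

```python
from typing import Iterable, Optional, Tuple

# (start_sec, end_sec, text): a stand-in for one transcription segment.
Segment = Tuple[float, float, str]


def words_per_minute(segments: Iterable[Segment]) -> float:
    """Whitespace-separated word count divided by the first-to-last span in minutes."""
    total_words = 0
    first_start: Optional[float] = None
    last_end: Optional[float] = None
    for start, end, text in segments:
        words = text.split()
        if not words:
            continue  # ignore empty segments
        total_words += len(words)
        if first_start is None:
            first_start = start
        last_end = end
    if total_words == 0 or first_start is None or last_end is None:
        return 0.0
    span_min = (last_end - first_start) / 60.0
    if span_min <= 0:
        return 0.0
    return total_words / span_min
```

Because the span runs from the first segment to the last, long silences inside the recording lower the WPM value; dividing by VAD speech time instead would give an articulation-rate style number.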

## 2. Pause ratio
**What it means:** The fraction of the 1‑minute recording that is silence or non‑speech rather than actual speaking.

#### How Pipeline 1 gets it:
VAD (voice activity detection) from WebRTC or Silero marks each small frame as speech or not (the code above uses webrtcvad with 30 ms frames).

Pause ratio = total non‑speech time / total duration (so 0.2 means 20% of the time is pauses).
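
The pause-ratio formula can be written directly over per-frame VAD decisions. A minimal sketch, assuming equal-length frames whose speech/non-speech flags have already been computed; `pause_ratio_from_flags` is a hypothetical name.

```python
from typing import Sequence


def pause_ratio_from_flags(is_speech: Sequence[bool]) -> float:
    """Fraction of equal-length frames flagged as non-speech (0.0 to 1.0)."""
    if len(is_speech) == 0:
        return 0.0
    speech_frames = sum(1 for flag in is_speech if flag)
    return 1.0 - speech_frames / len(is_speech)
```

This differs slightly from the pipeline cell, which divides by the full audio duration and therefore counts a trailing partial frame as pause time.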


## 3. Pitch variation
**What it means:** How much the voice moves up and down in pitch (monotone vs expressive).

#### How Pipeline 1 gets it:
librosa.pyin extracts a pitch curve (F0 in Hz) over time.
Take basic stats on voiced F0: mean (average pitch), standard deviation (how much it varies), and range (highest minus lowest).
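
The statistics step can be illustrated on its own. This sketch assumes an F0 array in Hz where unvoiced frames are NaN (the convention used in the pipeline cell); the function name `pitch_stats` is hypothetical.

```python
from typing import Dict

import numpy as np


def pitch_stats(f0_hz: np.ndarray) -> Dict[str, float]:
    """Mean, standard deviation and range of F0 over voiced frames (NaN = unvoiced)."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz[~np.isnan(f0_hz)]
    if voiced.size == 0:
        nan = float("nan")
        return {"mean_hz": nan, "std_hz": nan, "range_hz": nan}
    return {
        "mean_hz": float(np.mean(voiced)),
        "std_hz": float(np.std(voiced)),
        "range_hz": float(np.max(voiced) - np.min(voiced)),
    }
```

Note that max minus min is sensitive to single octave errors in the pitch tracker, so a percentile-based range may be more robust; that would be a change from the current pipeline, not something it does today.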

## 4. Volume stability
**What it means:** How steady the loudness is; does the speaker keep a consistent level or jump between too quiet and too loud.

#### How Pipeline 1 gets it:
Short‑term RMS from librosa gives an energy curve; compute average and how much it fluctuates (coefficient of variation).

pyloudnorm measures overall loudness in LUFS using the ITU‑R BS.1770 algorithm, giving a “perceived loudness” number for the whole minute.
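
To make the coefficient of variation concrete, here is a NumPy-only sketch of frame-wise RMS and its CV. It is an approximation under assumptions: unlike `librosa.feature.rms` it does not pad or center frames, so values will differ slightly from the pipeline; `rms_cv_from_signal` is a hypothetical name.

```python
import numpy as np


def rms_cv_from_signal(y: np.ndarray, frame_length: int, hop_length: int) -> float:
    """Coefficient of variation (std / mean) of frame-wise RMS energy."""
    y = np.asarray(y, dtype=float)
    if len(y) < frame_length:
        return 0.0
    starts = range(0, len(y) - frame_length + 1, hop_length)
    # One RMS value per frame: square, average, square-root.
    rms = np.array([np.sqrt(np.mean(y[s:s + frame_length] ** 2)) for s in starts])
    return float(np.std(rms) / (np.mean(rms) + 1e-8))
```

A perfectly steady signal gives a CV near 0, while a recording that alternates between loud speech and silence approaches or exceeds 1. Since pauses are included in the RMS curve, a high pause ratio also pushes the CV up.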

In plain terms for a user:

* “You spoke at X words per minute, which is fast/slow for a 1‑minute pitch.”
* “You paused for Y% of the time; that’s low/normal/high compared to a typical clear talk.”
* “Your pitch moved a little/a lot; this sounds monotone vs expressive.”
* “Your volume stayed stable / jumped around a lot; that sounds calm vs slightly chaotic.”

## A feature result interpretation can be found in the "Speech interpretation and coaching" (local) document

# Next steps can be:

## 1.)
Refining the exact Silero VAD integration into the current minimal code,

or defining a clean FeatureExtractor class around librosa + Silero + pyloudnorm + faster-whisper with typed outputs ready for ML.
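
As a starting point for the typed outputs, the ten values produced by the pipeline could be carried in a dataclass. This is a sketch only: the class name `SpeechFeatures` and its methods are hypothetical, and the field names simply mirror the variables used in the pipeline cell.

```python
from dataclasses import asdict, dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class SpeechFeatures:
    """Typed container for one audio sample's features."""
    pause_ratio: float
    pitch_mean: float
    pitch_std: float
    pitch_range: float
    rms_mean: float
    rms_std: float
    rms_cv: float
    lufs: float
    wpm: float
    voiced_rate: float

    def to_dict(self) -> Dict[str, float]:
        """Plain dict, e.g. for JSON export or keyword arguments."""
        return asdict(self)

    def to_vector(self) -> List[float]:
        """Values in field-definition order, ready to stack into an ML feature matrix."""
        return [getattr(self, f.name) for f in fields(self)]

    @staticmethod
    def feature_names() -> List[str]:
        """Column names matching to_vector()."""
        return [f.name for f in fields(SpeechFeatures)]
```

Because the field names match the parameters of `summarize_speech_features`, a call such as `summarize_speech_features(**features.to_dict())` should work without extra glue.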

## 2.) Reference ranges:
We already have a target “good” range in mind for each feature, based on recommendations we found in research sources:

* WPM (e.g., 120–160 words/min for a 1‑minute pitch)
* Pause ratio (e.g., 10–30% of time as pauses)
* Pitch variation (e.g., minimum std/range to avoid sounding monotone)
* Volume stability (e.g., limited loudness swings)
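
The two ranges that already have concrete numbers could be kept in a small table and checked mechanically. A minimal sketch, assuming only the WPM and pause-ratio examples given above; the names `REFERENCE_RANGES` and `check_against_reference` are hypothetical, and pitch and volume are left out because no numeric range is fixed for them yet.

```python
from typing import Dict, Mapping, Tuple

# (low, high) bounds taken from the example ranges listed above.
REFERENCE_RANGES: Dict[str, Tuple[float, float]] = {
    "wpm": (120.0, 160.0),
    "pause_ratio": (0.10, 0.30),
}


def check_against_reference(features: Mapping[str, float]) -> Dict[str, str]:
    """Return 'below', 'within' or 'above' for every feature that has a reference range."""
    verdicts: Dict[str, str] = {}
    for name, (low, high) in REFERENCE_RANGES.items():
        if name not in features:
            continue  # no value supplied for this feature
        value = features[name]
        if value < low:
            verdicts[name] = "below"
        elif value > high:
            verdicts[name] = "above"
        else:
            verdicts[name] = "within"
    return verdicts
```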

#### Sometimes we can't pinpoint exactly what makes a speech persuasive and "good"; we just feel that it is.
#### In our next step we would like to select and clearly label "good" and "not good" speeches, and then let our model learn which features have the most influence and what the "best" range is for each of them.
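
Before training a model, a simple first look at which features separate the two groups is a per-feature effect size. The sketch below uses Cohen's d, which is a swapped-in statistic for illustration and not the learning method itself; it assumes a feature matrix `X` (one row per speech), binary labels `y` (1 = "good", 0 = "not good"), and the hypothetical function name `rank_features_by_effect_size`.

```python
from typing import List, Sequence, Tuple

import numpy as np


def rank_features_by_effect_size(
    X: np.ndarray, y: np.ndarray, names: Sequence[str]
) -> List[Tuple[str, float]]:
    """Rank features by |Cohen's d| between the 'good' (1) and 'not good' (0) groups."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    good, bad = X[y == 1], X[y == 0]
    if len(good) < 2 or len(bad) < 2:
        raise ValueError("need at least two labeled speeches per class")
    # Pooled standard deviation per feature (sample variance, ddof=1).
    pooled_sd = np.sqrt((good.var(axis=0, ddof=1) + bad.var(axis=0, ddof=1)) / 2.0)
    d = (good.mean(axis=0) - bad.mean(axis=0)) / (pooled_sd + 1e-12)
    return sorted(zip(names, d.tolist()), key=lambda item: abs(item[1]), reverse=True)
```

A large |d| only says the group means differ; it cannot find a "best" middle range (for example, WPM being bad when both too low and too high), so it complements rather than replaces a trained model.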

The interpretation functions in the next cell do not compute features; they only take the numeric values from your pipeline and return clear labels + interpretation + coaching text. The ranges are heuristic starting points that follow typical public‑speaking WPM guidance, pitch references, and LUFS recommendations for spoken content, and should be calibrated on our own data.
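
The if/elif ladders used for labeling can also be expressed as a threshold table, which keeps the cut points in one place when they are recalibrated. A minimal sketch for WPM only, reusing the cut points and labels from `interpret_speech_rate_wpm`; the names `WPM_CUTS`, `WPM_LABELS` and `label_wpm` are hypothetical.

```python
from bisect import bisect_right
from typing import List

# Upper bounds (exclusive) of each band, and one label per band.
WPM_CUTS: List[float] = [100, 120, 160, 190]
WPM_LABELS: List[str] = ["very_slow", "slow", "optimal", "fast", "very_fast"]


def label_wpm(wpm: float) -> str:
    """Map a WPM value to its band; a value equal to a cut point falls in the upper band."""
    return WPM_LABELS[bisect_right(WPM_CUTS, wpm)]
```

`bisect_right` reproduces the strict `<` comparisons of the ladder: exactly 160 WPM is labeled "fast", not "optimal".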



In [None]:
# ============================================================
# 1. Pause ratio (percentage of non-speech time)
# ============================================================

def interpret_pause_ratio(pause_ratio: float) -> Dict[str, Any]:
    """
    Interpret the pause ratio feature.

    Arguments
    ---------
    pause_ratio : float
        Fraction of total audio time that is non-speech (0.0 to 1.0).
        Example: 0.18 means 18% of the time is silence or non-speech.

    Returns
    -------
    result : dict
        {
          "value": float,          # original pause_ratio
          "label": str,            # e.g. "balanced_pauses"
          "interpretation": str,   # human explanation
          "coaching": str          # suggested improvement
        }
    """

    # Decide which range the pause_ratio falls into
    if pause_ratio < 0.05:
        label = "very_low_pauses"
        interpretation = "Almost no silence; the speech likely feels rushed or dense."
        coaching = "Add short pauses (about 0.5-1 second) after key ideas so listeners can absorb information."
    elif pause_ratio < 0.15:
        label = "low_pauses"
        interpretation = "Some pauses, but the speech is still quite continuous and fast-paced."
        coaching = "Use slightly longer pauses after important points to add emphasis and clarity."
    elif pause_ratio < 0.35:
        label = "balanced_pauses"
        interpretation = "Healthy balance between speaking and silence; the rhythm likely feels natural."
        coaching = "Pause usage looks good; keep using pauses to highlight important moments."
    elif pause_ratio < 0.50:
        label = "many_pauses"
        interpretation = "There are many or long pauses; this can aid clarity but may sound hesitant."
        coaching = "Try connecting some sentences more smoothly while keeping key pauses for emphasis."
    else:
        label = "very_high_pauses"
        interpretation = "More than half of the audio is silence; the speech may feel fragmented or disjointed."
        coaching = "Reduce long pauses and keep a more continuous flow between ideas."

    return {
        "value": pause_ratio,
        "label": label,
        "interpretation": interpretation,
        "coaching": coaching,
    }


# ============================================================
# 2. Speech rate (Words Per Minute, WPM)
# ============================================================

def interpret_speech_rate_wpm(wpm: float) -> Dict[str, Any]:
    """
    Interpret the speech rate feature (words per minute).

    Arguments
    ---------
    wpm : float
        Speech rate in words per minute. Typical comfortable range for
        presentations is roughly 120–160 WPM. [web:69][web:72][web:78]

    Returns
    -------
    result : dict
        {
            "value": float,          # original wpm
            "label": str,            # e.g. "optimal"
            "interpretation": str,   # human explanation
            "coaching": str          # suggested improvement
        }
    """

    if wpm < 100:
        label = "very_slow"
        interpretation = "Very slow pace; can feel heavy or overly deliberate."
        coaching = "Shorten pauses slightly and increase your speaking tempo to add more energy."
    elif wpm < 120:
        label = "slow"
        interpretation = "Slower than typical presentation pace; very clear but may lack energy."
        coaching = "Speed up a little on less critical details to keep a more dynamic feel."
    elif wpm < 160:
        label = "optimal"
        interpretation = "Comfortable range for presentations; easy for most people to follow."
        coaching = "Pace is appropriate; focus on articulation and using pauses for emphasis."
    elif wpm < 190:
        label = "fast"
        interpretation = "Fast pace; sounds energetic but can reduce comprehension for some listeners."
        coaching = "Slow down slightly and add clearer pauses between key points."
    else:
        label = "very_fast"
        interpretation = "Very fast pace; listeners may miss information or feel overwhelmed."
        coaching = "Deliberately slow down and pause more often so important points are not lost."

    return {
        "value": wpm,
        "label": label,
        "interpretation": interpretation,
        "coaching": coaching,
    }


# ============================================================
# 3. Acoustic speech rate proxy (voiced frames per second)
# ============================================================

def interpret_acoustic_speech_rate(voiced_rate: float) -> Dict[str, Any]:
    """
    Interpret the acoustic speech rate proxy (voiced frames per second).

    Arguments
    ---------
    voiced_rate : float
        Approximate number of "voiced frames" per second from your pitch
        extraction. Higher values usually mean denser speech (faster or
        with fewer pauses). Exact numbers depend on your frame size and hop.

    Returns
    -------
    result : dict
        {
            "value": float,          # original voiced_rate
            "label": str,            # e.g. "normal_density"
            "interpretation": str,   # human explanation
            "coaching": str          # suggested improvement
        }
    """

    # These ranges are heuristic and should be calibrated on your data.
    if voiced_rate < 15:
        label = "sparse_voicing"
        interpretation = "Very low density of voiced sound; suggests slow speech or many pauses."
        coaching = "If this is not intentional, slightly reduce pause lengths or increase your speaking tempo."
    elif voiced_rate < 22:
        label = "below_average_density"
        interpretation = "Lower-than-typical voicing density; deliberate pacing or pause-heavy style."
        coaching = "Make sure your pauses are strategic, not accidental, and keep articulation steady."
    elif voiced_rate < 30:
        label = "normal_density"
        interpretation = "Typical voicing density for moderate-paced speech."
        coaching = "Good baseline density; combine this with clear pauses and articulation."
    elif voiced_rate < 38:
        label = "high_density"
        interpretation = "Dense voicing; suggests fast speech or minimal pauses."
        coaching = "Insert brief pauses after key ideas to avoid sounding rushed."
    else:
        label = "very_high_density"
        interpretation = "Very dense voicing; likely rapid speech with very little silence."
        coaching = "Add more breathing room with short pauses so listeners can keep up."

    return {
        "value": voiced_rate,
        "label": label,
        "interpretation": interpretation,
        "coaching": coaching,
    }


# ============================================================
# 4. Pitch variation (mean / std / range in Hz)
# ============================================================

def interpret_pitch(
    pitch_mean: float,
    pitch_std: float,
    pitch_range: float,
    speaker_profile: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Interpret pitch statistics: mean, standard deviation, and range.

    Arguments
    ---------
    pitch_mean : float
        Average pitch in Hz over voiced frames.
        Typical adult male: ~85-155 Hz; adult female: ~165-255 Hz. [web:80][web:94]
    pitch_std : float
        Standard deviation of pitch in Hz, describing how much pitch moves.
    pitch_range : float
        Difference between max and min pitch in Hz during voiced regions.
    speaker_profile : str or None
        Optional hint for context: "male", "female", or None.

    Returns
    -------
    result : dict
        {
            "values": {
                "mean_hz": float,
                "std_hz": float,
                "range_hz": float
            },
            "labels": {
                "variation": str,   # e.g. "expressive"
                "range": str        # e.g. "normal_range"
            },
            "interpretation": {
                "variation": str,   # explanation of std
                "range": str,       # explanation of range
                "mean_context": Optional[str]  # optional gender-based context
            },
            "coaching": str        # combined coaching message
        }
    """

    # --- Decide pitch variation label based on standard deviation (std) ---
    if pitch_std < 20:
        var_label = "very_flat"
        var_interp = "Minimal pitch movement; may sound monotone or robotic."
        var_coaching = "Add small rises on important words and clear falls at the end of sentences to sound more expressive."
    elif pitch_std < 40:
        var_label = "moderate"
        var_interp = "Some pitch movement; functional but not highly expressive."
        var_coaching = "Increase contrast slightly on key ideas to sound more dynamic."
    elif pitch_std < 80:
        var_label = "expressive"
        var_interp = "Good pitch movement; likely natural and engaging."
        var_coaching = "This level of variation is engaging; keep it natural and context-appropriate."
    else:
        var_label = "very_wide"
        var_interp = "Very large pitch movement; can be engaging but may feel theatrical."
        var_coaching = "Consider smoothing extreme highs and lows unless you intentionally want a very dramatic style."

    # --- Decide pitch range label based on (max - min) range ---
    if pitch_range < 80:
        range_label = "narrow_range"
        range_interp = "Limited pitch span; reinforces a flatter tone."
        range_coaching = "Practice using a wider pitch span on key phrases to add emphasis."
    elif pitch_range < 200:
        range_label = "normal_range"
        range_interp = "Typical pitch span for conversational speech."
        range_coaching = "Range is appropriate; focus on where you place pitch changes for impact."
    else:
        range_label = "wide_range"
        range_interp = "Large pitch span; indicates expressive or emotional delivery."
        range_coaching = "Ensure your wide pitch range matches the message and audience expectations."

    # --- Optional context about pitch_mean, based on speaker_profile ---
    mean_context = None
    if speaker_profile == "male":
        if pitch_mean < 85:  # lower bound matches the ~85-155 Hz range in the docstring
            mean_context = "Mean pitch is relatively low compared to typical adult male ranges."
        elif pitch_mean > 155:
            mean_context = "Mean pitch is relatively high compared to typical adult male ranges."
        else:
            mean_context = "Mean pitch is within typical adult male ranges."
    elif speaker_profile == "female":
        if pitch_mean < 165:
            mean_context = "Mean pitch is relatively low compared to typical adult female ranges."
        elif pitch_mean > 255:
            mean_context = "Mean pitch is relatively high compared to typical adult female ranges."
        else:
            mean_context = "Mean pitch is within typical adult female ranges."

    # Combine coaching from variation and range
    combined_coaching = f"{var_coaching} {range_coaching}".strip()

    return {
        "values": {
            "mean_hz": pitch_mean,
            "std_hz": pitch_std,
            "range_hz": pitch_range,
        },
        "labels": {
            "variation": var_label,
            "range": range_label,
        },
        "interpretation": {
            "variation": var_interp,
            "range": range_interp,
            "mean_context": mean_context,
        },
        "coaching": combined_coaching,
    }


# ============================================================
# 5. Volume stability (RMS mean / std / CV)
# ============================================================

def interpret_volume_stability(
    rms_mean: float,
    rms_std: float,
    rms_cv: float,
) -> Dict[str, Any]:
    """
    Interpret volume stability based on RMS statistics.

    Arguments
    ---------
    rms_mean : float
        Average RMS value (relative loudness). Used for context only.
    rms_std : float
        Standard deviation of RMS. Used for context only.
    rms_cv : float
        Coefficient of variation (rms_std / rms_mean). Higher means more fluctuation.

    Returns
    -------
    result : dict
        {
            "values": {
                "rms_mean": float,
                "rms_std": float,
                "rms_cv": float
            },
            "label": str,            # e.g. "variable"
            "interpretation": str,   # human explanation
            "coaching": str          # suggested improvement
        }
    """

    if rms_cv < 0.30:
        label = "very_stable"
        interpretation = "Very small volume changes; loudness is highly consistent."
        coaching = "Good for clarity; if it feels too flat, add small dynamic emphasis on key sentences."
    elif rms_cv < 0.60:
        label = "stable"
        interpretation = "Moderate volume changes; sounds natural for most speech."
        coaching = "Volume variation is healthy; make sure quieter words remain easy to hear."
    elif rms_cv < 1.00:
        label = "variable"
        interpretation = "Noticeable volume swings; some parts may be quieter or louder than others."
        coaching = "Keep a steadier distance from the microphone and avoid trailing off at the ends of sentences."
    else:
        label = "highly_variable"
        interpretation = "Strong volume swings; can be distracting or tiring."
        coaching = "Aim for a more consistent baseline level and reserve big changes only for important emphasis."

    return {
        "values": {
            "rms_mean": rms_mean,
            "rms_std": rms_std,
            "rms_cv": rms_cv,
        },
        "label": label,
        "interpretation": interpretation,
        "coaching": coaching,
    }


# ============================================================
# 6. Integrated loudness (LUFS)
# ============================================================

def interpret_loudness_lufs(lufs: float) -> Dict[str, Any]:
    """
    Interpret integrated loudness (LUFS) for speech.

    Arguments
    ---------
    lufs : float
        Integrated loudness value in LUFS. For spoken content, many
        recommendations suggest roughly -23 to -18 LUFS as a good region,
        depending on platform and standard. [web:90][web:93][web:95]

    Returns
    -------
    result : dict
        {
            "value": float,          # original lufs
            "label": str,            # e.g. "comfortable"
            "interpretation": str,   # human explanation
            "coaching": str          # suggested improvement
        }
    """

    if lufs < -26:
        label = "very_quiet"
        interpretation = "Overall loudness is very low; difficult to hear without raising volume a lot."
        coaching = "Increase your recording level or move closer to the microphone; aim closer to about -20 to -18 LUFS."
    elif lufs < -22:
        label = "quiet"
        interpretation = "A bit below a comfortable range; okay in a silent room but harder in noisy places."
        coaching = "Raise your recording gain slightly so your voice sits in a more comfortable range."
    elif lufs < -18:
        label = "comfortable"
        interpretation = "Within a comfortable range for spoken content; clear and easy to listen to."
        coaching = "This loudness level is good; just avoid clipping when you emphasize certain words."
    elif lufs < -14:
        label = "loud"
        interpretation = "Louder than typical speech; can sound energetic but may be tiring."
        coaching = "Lower your recording level a bit to reduce fatigue while keeping your voice present."
    else:
        label = "very_loud"
        interpretation = "Very loud overall; risk of listener fatigue or distortion."
        coaching = "Reduce your gain or move further from the microphone so the sound is not overwhelming."

    return {
        "value": lufs,
        "label": label,
        "interpretation": interpretation,
        "coaching": coaching,
    }


# ============================================================
# 7. High-level helper: interpret all features together
# ============================================================

def summarize_speech_features(
    pause_ratio: float,
    pitch_mean: float,
    pitch_std: float,
    pitch_range: float,
    rms_mean: float,
    rms_std: float,
    rms_cv: float,
    lufs: float,
    wpm: float,
    voiced_rate: float,
    speaker_profile: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Combine all per-feature interpretations into one summary.

    Arguments
    ---------
    pause_ratio : float
        Fraction of time that is non-speech.
    pitch_mean : float
        Mean pitch in Hz.
    pitch_std : float
        Standard deviation of pitch in Hz.
    pitch_range : float
        Pitch range in Hz (max - min over voiced frames).
    rms_mean : float
        Mean RMS value (relative loudness).
    rms_std : float
        Standard deviation of RMS.
    rms_cv : float
        Coefficient of variation of RMS (rms_std / rms_mean).
    lufs : float
        Integrated loudness in LUFS.
    wpm : float
        Speech rate in words per minute.
    voiced_rate : float
        Acoustic speech rate proxy (voiced frames per second).
    speaker_profile : str or None
        Optional hint for pitch interpretation: "male", "female", or None.

    Returns
    -------
    result : dict
        {
            "pause_ratio": {...},          # output of interpret_pause_ratio
            "speech_rate_wpm": {...},      # output of interpret_speech_rate_wpm
            "acoustic_speech_rate": {...}, # output of interpret_acoustic_speech_rate
            "pitch": {...},                # output of interpret_pitch
            "volume_stability": {...},     # output of interpret_volume_stability
            "loudness_lufs": {...},        # output of interpret_loudness_lufs
            "coaching_summary": [str, ...] # list of coaching sentences
        }
    """

    pause_info = interpret_pause_ratio(pause_ratio)
    speech_rate_info = interpret_speech_rate_wpm(wpm)
    acoustic_info = interpret_acoustic_speech_rate(voiced_rate)
    pitch_info = interpret_pitch(pitch_mean, pitch_std, pitch_range, speaker_profile)
    volume_info = interpret_volume_stability(rms_mean, rms_std, rms_cv)
    loudness_info = interpret_loudness_lufs(lufs)

    # Collect coaching messages from all features into a list
    coaching_messages: List[str] = []
    for item in [pause_info, speech_rate_info, acoustic_info, pitch_info, volume_info, loudness_info]:
        msg = item.get("coaching")
        if msg:
            coaching_messages.append(msg)

    return {
        "pause_ratio": pause_info,
        "speech_rate_wpm": speech_rate_info,
        "acoustic_speech_rate": acoustic_info,
        "pitch": pitch_info,
        "volume_stability": volume_info,
        "loudness_lufs": loudness_info,
        "coaching_summary": coaching_messages,
    }


In [None]:
# ## Reminder of our previous features:

# pause_ratio = 0.17900000000000002

# # Pitch: mean, std, range (Hz)
# pitch_mean = 226.4934830700988
# pitch_std = 40.350131208724754
# pitch_range = 315.5499338904208

# # RMS: mean, std, coefficient of variation
# rms_mean, rms_std, rms_cv = 0.04626308009028435, 0.04169349744915962, 0.9012259528663784

# # Loudness, speech rate, acoustic speech rate
# lufs = -23.473406650906426
# wpm = 143.04635761589404
# voiced_rate = 24.999999999999996


In [None]:
'''
  summarize_speech_features(...)

What it does:
This function takes all your numeric features for one speech sample:
(pause_ratio, pitch_mean, pitch_std, pitch_range, rms_mean, rms_std, rms_cv, lufs, wpm, voiced_rate, and optional speaker_profile)
and calls each of the smaller interpretation functions.

It returns one big dictionary that contains:
1) A sub-dictionary for each feature (pause ratio, speech rate, acoustic speech rate, pitch, volume stability, loudness), with:
    - the original value(s),
    - a categorical label (e.g. "optimal", "very_slow", "expressive"),
    - a human explanation,
    - a coaching suggestion.

2) A coaching_summary list that collects all coaching sentences in one place so you can show them to the user easily.

In short:
input = all feature numbers for one audio;
output = structured interpretation + a list of coaching messages.
'''

result = summarize_speech_features(
    pause_ratio=pause_ratio,
    pitch_mean=pitch_mean,
    pitch_std=pitch_std,
    pitch_range=pitch_range,
    rms_mean=rms_mean,
    rms_std=rms_std,
    rms_cv=rms_cv,
    lufs=lufs,
    wpm=wpm,
    voiced_rate=voiced_rate,
    speaker_profile=None,   # or "male"/"female" if you want pitch-mean context
)


In [19]:
print(result.keys())

dict_keys(['pause_ratio', 'speech_rate_wpm', 'acoustic_speech_rate', 'pitch', 'volume_stability', 'loudness_lufs', 'coaching_summary'])


In [None]:
print(result['pause_ratio'].keys())

dict_keys(['value', 'label', 'interpretation', 'coaching'])


In [None]:
# print the coaching summary:
'''
result["coaching_summary"] is the list of coaching sentences
created by summarize_speech_features(...).
'''
# loop to print each coaching message on its own line, like bullet points.
for msg in result["coaching_summary"]:
    print("-", msg)


- Pause usage looks good; keep using pauses to highlight important moments.
- Pace is appropriate; focus on articulation and using pauses for emphasis.
- Good baseline density; combine this with clear pauses and articulation.
- This level of variation is engaging; keep it natural and context-appropriate. Ensure your wide pitch range matches the message and audience expectations.
- Keep a steadier distance from the microphone and avoid trailing off at the ends of sentences.
- Raise your recording gain slightly so your voice sits in a more comfortable range.


In [None]:
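from typing import Any, Dict


def extract_labels_and_interpretations_from_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """
    Take the existing `result` dictionary (from summarize_speech_features)
    and return only the label(s) and interpretation for each feature.

    Arguments
    ---------
    result : dict
        The dictionary returned by summarize_speech_features(...).
        It is expected to have the keys:
        'pause_ratio', 'speech_rate_wpm', 'acoustic_speech_rate',
        'pitch', 'volume_stability', 'loudness_lufs', 'coaching_summary'.

    Returns
    -------
    slim : dict
        A smaller dictionary containing only labels and interpretations, e.g.:

        {
          "pause_ratio":          {"label": ..., "interpretation": ...},
          "speech_rate_wpm":      {"label": ..., "interpretation": ...},
          "acoustic_speech_rate": {"label": ..., "interpretation": ...},
          "pitch":                {"labels": {...}, "interpretation": {...}},
          "volume_stability":     {"label": ..., "interpretation": ...},
          "loudness_lufs":        {"label": ..., "interpretation": ...},
        }

        The 'coaching_summary' list is not copied.
    """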

    slim: Dict[str, Any] = {}

    # pause_ratio: has "label" and "interpretation" directly
    slim["pause_ratio"] = {
        "label": result["pause_ratio"]["label"],
        "interpretation": result["pause_ratio"]["interpretation"],
    }

    # speech_rate_wpm: has "label" and "interpretation" directly
    slim["speech_rate_wpm"] = {
        "label": result["speech_rate_wpm"]["label"],
        "interpretation": result["speech_rate_wpm"]["interpretation"],
    }

    # acoustic_speech_rate: has "label" and "interpretation" directly
    slim["acoustic_speech_rate"] = {
        "label": result["acoustic_speech_rate"]["label"],
        "interpretation": result["acoustic_speech_rate"]["interpretation"],
    }

    # pitch: structure is different – labels and interpretation are nested dictionaries
    slim["pitch"] = {
        "labels": result["pitch"]["labels"],                 # {"variation": ..., "range": ...}
        "interpretation": result["pitch"]["interpretation"], # {"variation": ..., "range": ..., "mean_context": ...}
    }

    # volume_stability: has "label" and "interpretation" directly
    slim["volume_stability"] = {
        "label": result["volume_stability"]["label"],
        "interpretation": result["volume_stability"]["interpretation"],
    }

    # loudness_lufs: has "label" and "interpretation" directly
    slim["loudness_lufs"] = {
        "label": result["loudness_lufs"]["label"],
        "interpretation": result["loudness_lufs"]["interpretation"],
    }

    return slim
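
Because five of the six features share the same layout and only `pitch` uses `labels` instead of `label`, the extraction can also be written as a loop. This is a minimal sketch of that alternative, not a replacement for the function above: `slim_view` and `sample_result` are hypothetical names, and the sample values (including the pitch labels) are illustrative.

```python
from typing import Any, Dict

# Hypothetical stand-in for the output of summarize_speech_features(...).
sample_result: Dict[str, Any] = {
    "pause_ratio": {
        "value": 0.0,  # placeholder value
        "label": "balanced_pauses",
        "interpretation": "Healthy balance between speaking and silence.",
        "coaching": "Pause usage looks good.",
    },
    "pitch": {
        "labels": {"variation": "expressive", "range": "wide"},
        "interpretation": {
            "variation": "Good pitch movement.",
            "range": "Large pitch span.",
            "mean_context": None,
        },
    },
    "coaching_summary": ["Pause usage looks good."],
}


def slim_view(result: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the label(s) and interpretation of every feature entry."""
    slim: Dict[str, Any] = {}
    for name, entry in result.items():
        if not isinstance(entry, dict):
            continue  # skips coaching_summary, which is a list
        # pitch stores a dict under "labels"; the other features use "label"
        label_key = "labels" if "labels" in entry else "label"
        slim[name] = {
            label_key: entry[label_key],
            "interpretation": entry["interpretation"],
        }
    return slim


print(slim_view(sample_result))
```

The explicit version above is easier to read when each feature may later need its own handling; the loop adapts automatically if a new feature with the same layout is added.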


In [None]:
# Build the slimmed-down view and pretty-print it.
from pprint import pprint

slim = extract_labels_and_interpretations_from_result(result)
pprint(slim)


{'acoustic_speech_rate': {'interpretation': 'Typical voicing density for '
                                            'moderate-paced speech.',
                          'label': 'normal_density'},
 'loudness_lufs': {'interpretation': 'A bit below a comfortable range; okay in '
                                     'a silent room but harder in noisy '
                                     'places.',
                   'label': 'quiet'},
 'pause_ratio': {'interpretation': 'Healthy balance between speaking and '
                                   'silence; the rhythm likely feels natural.',
                 'label': 'balanced_pauses'},
 'pitch': {'interpretation': {'mean_context': None,
                              'range': 'Large pitch span; indicates expressive '
                                       'or emotional delivery.',
                              'variation': 'Good pitch movement; likely '
                                           'natural and engaging.'},
           'labels

In [None]:
# Confirm the top-level keys of slim, then the keys of a single entry.
print(slim.keys())

print(slim['pause_ratio'].keys())

dict_keys(['pause_ratio', 'speech_rate_wpm', 'acoustic_speech_rate', 'pitch', 'volume_stability', 'loudness_lufs'])
dict_keys(['label', 'interpretation'])
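
With only labels and interpretations left, the slim dictionary is straightforward to turn into display lines. The following is a rough sketch under the assumption that `pitch` is the only entry with nested `labels`; `slim_to_lines` and `sample_slim` are hypothetical names, and the sample text is shortened for illustration.

```python
from typing import Any, Dict, List

# Hypothetical, shortened stand-in for the `slim` dictionary.
sample_slim: Dict[str, Any] = {
    "loudness_lufs": {
        "label": "quiet",
        "interpretation": "A bit below a comfortable range.",
    },
    "pitch": {
        "labels": {"variation": "expressive", "range": "wide"},
        "interpretation": {
            "variation": "Good pitch movement.",
            "range": "Large pitch span.",
            "mean_context": None,
        },
    },
}


def slim_to_lines(slim: Dict[str, Any]) -> List[str]:
    """Flatten the slim dictionary into one text line per label."""
    lines: List[str] = []
    for name, entry in slim.items():
        if "labels" in entry:
            # Nested case (pitch): one line per labelled aspect.
            for aspect, label in entry["labels"].items():
                text = entry["interpretation"].get(aspect)
                lines.append(f"{name}.{aspect} [{label}]: {text}")
        else:
            lines.append(f"{name} [{entry['label']}]: {entry['interpretation']}")
    return lines


for line in slim_to_lines(sample_slim):
    print(line)
```

Interpretation entries without a matching label, such as `mean_context`, are skipped by this sketch.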


In [None]:
# To inspect the full structured result:
'''
pprint is Python's "pretty print" module; the pprint function was imported from it in an earlier cell.

pprint(result) does not change the result data; it just prints the nested dictionary in a more readable, indented format than a plain print(result).

It is mainly useful in notebooks or scripts for inspecting the structure and contents of result during debugging or development.
'''
# Call the function directly: after `from pprint import pprint`, the name
# pprint refers to the function, so pprint.pprint(result) would raise AttributeError.
pprint(result)


{'acoustic_speech_rate': {'coaching': 'Good baseline density; combine this '
                                      'with clear pauses and articulation.',
                          'interpretation': 'Typical voicing density for '
                                            'moderate-paced speech.',
                          'label': 'normal_density',
                          'value': 24.999999999999996},
 'coaching_summary': ['Pause usage looks good; keep using pauses to highlight '
                      'important moments.',
                      'Pace is appropriate; focus on articulation and using '
                      'pauses for emphasis.',
                      'Good baseline density; combine this with clear pauses '
                      'and articulation.',
                      'This level of variation is engaging; keep it natural '
                      'and context-appropriate. Ensure your wide pitch range '
                      'matches the message and audience expectati