## Key install commands (Python 3.10):

### Core audio + features:

pip install librosa soundfile numpy scipy

#### **VAD:**

pip install webrtcvad-wheels (ships prebuilt wheels; the legacy webrtcvad package builds from source)

Add PyTorch + torchaudio if/when you swap to Silero VAD:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu (CPU-only build example)

#### **LOUDNESS:**

pip install pyloudnorm

#### **ASR:**

pip install faster-whisper (CTranslate2-based Whisper)
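
After running the installs above, a quick import check can confirm nothing is missing. This is a minimal sketch using only the standard library; `REQUIRED_MODULES` and `missing_modules` are hypothetical names introduced here, and the list holds import names (for example `faster_whisper`), not pip names.

```python
import importlib.util

# Import names (not pip names) for the packages installed above.
REQUIRED_MODULES = [
    "librosa", "soundfile", "numpy", "scipy",
    "webrtcvad", "pyloudnorm", "faster_whisper",
]

def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list means every package is importable from the current kernel.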

In [1]:
%pip install soundfile librosa pyloudnorm numpy

Collecting pyloudnorm
  Downloading pyloudnorm-0.1.1-py3-none-any.whl.metadata (5.6 kB)
Collecting future>=0.16.0 (from pyloudnorm)
  Downloading future-1.0.0-py3-none-any.whl.metadata (4.0 kB)
Downloading pyloudnorm-0.1.1-py3-none-any.whl (9.6 kB)
Downloading future-1.0.0-py3-none-any.whl (491 kB)
Installing collected packages: future, pyloudnorm
Successfully installed future-1.0.0 pyloudnorm-0.1.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
# ASR:
%pip install faster-whisper

Note: you may need to restart the kernel to use updated packages.


In [4]:
# VAD:
#webrtcvad:
%pip install webrtcvad

#Silero VAD: (matching our CUDA/CPU build)
%pip install torch torchaudio

Collecting webrtcvad
  Downloading webrtcvad-2.0.10.tar.gz (66 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: webrtcvad
  Building wheel for webrtcvad (pyproject.toml) ... done
  Created wheel for webrtcvad: filename=webrtcvad-2.0.10-cp310-cp310-macosx_15_0_arm64.whl size=30716 sha256=ebb66f3d5ab8b1d57c93e6c83ab2b360392c9afbb84fde699c304757946749ec
  Stored in directory: /Users/libiv/Library/Caches/pip/wheels/2a/2b/84/ac7bacfe8c68a87c1ee3dd3c66818a54c71599abf308e8eb35
Successfully built webrtcvad
Installing collected packages: webrtcvad
Successfully installed webrtcvad-2.0.10
Note: you may need to restart the kernel to use updated packages.
Collecting torchaudio
  Downloading torchaudio-2.9.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (6.9 kB)
Downloading torchaudio-2.9.1-cp310-cp310-macosx_11_0_arm64.whl (80

## Pipeline 1:
### librosa + Silero VAD + librosa.pyin + pyloudnorm + faster‑whisper

* librosa for features and pitch (via pyin)

* Silero VAD for pause ratio (the code below uses webrtcvad as a lighter drop-in)

* pyloudnorm for LUFS-based loudness / volume stability

* faster-whisper for ASR-based speech rate (WPM)

In [5]:
import numpy as np
import soundfile as sf
import librosa
import pyloudnorm as pyln
import webrtcvad
from faster_whisper import WhisperModel

AUDIO_PATH = "/Users/libiv/code/VERA/data/raw/extracted_audio/test_video_1_clean_slice_20251201_165326.mp3"

# 1) Load audio (mono, 16 kHz)
y, sr = sf.read(AUDIO_PATH)
if y.ndim > 1:
    y = np.mean(y, axis=1)
target_sr = 16000
if sr != target_sr:
    y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    sr = target_sr

duration_sec = len(y) / sr

# 2) VAD-based pause ratio with webrtcvad
vad = webrtcvad.Vad(2)  # 0–3, higher = more aggressive
frame_ms = 30
frame_len = int(sr * frame_ms / 1000)
num_frames = len(y) // frame_len
speech_frames = 0

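# webrtcvad expects 16-bit mono PCM bytes; clip first so samples outside
# [-1, 1] (possible after resampling) do not wrap around in int16
pcm = (np.clip(y, -1.0, 1.0) * 32767).astype(np.int16).tobytes()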
for i in range(num_frames):
    start = i * frame_len * 2  # 2 bytes per sample
    end = start + frame_len * 2
    frame = pcm[start:end]
    if len(frame) < frame_len * 2:
        break
    if vad.is_speech(frame, sr):
        speech_frames += 1

speech_time = speech_frames * frame_ms / 1000.0
pause_time = duration_sec - speech_time
pause_ratio = max(pause_time, 0.0) / max(duration_sec, 1e-6)

# 3) Pitch variation via librosa.pyin (F0 in Hz)
f0, voiced_flag, _ = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"),
    sr=sr
)
f0_voiced = f0[~np.isnan(f0)]
if len(f0_voiced) > 0:
    pitch_mean = float(np.mean(f0_voiced))
    pitch_std = float(np.std(f0_voiced))
    pitch_range = float(np.max(f0_voiced) - np.min(f0_voiced))
else:
    pitch_mean = pitch_std = pitch_range = np.nan

# 4) Volume stability via RMS + LUFS
frame_len_rms = int(0.05 * sr)
hop_len_rms = frame_len_rms // 2
rms = librosa.feature.rms(y=y, frame_length=frame_len_rms, hop_length=hop_len_rms)[0]
rms_mean = float(np.mean(rms))
rms_std = float(np.std(rms))
rms_cv = float(rms_std / (rms_mean + 1e-8))

meter = pyln.Meter(sr)
lufs = float(meter.integrated_loudness(y))

# 5) Speech rate via faster-whisper (WPM)
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _ = model.transcribe(AUDIO_PATH, beam_size=1)
words = 0
first_t = None
last_t = None
for seg in segments:
    text = seg.text.strip()
    if not text:
        continue
    seg_words = text.split()
    words += len(seg_words)
    if first_t is None:
        first_t = seg.start
    last_t = seg.end

if words > 0 and last_t is not None and first_t is not None:
    spoken_dur_min = (last_t - first_t) / 60.0
    wpm = words / max(spoken_dur_min, 1e-6)
else:
    wpm = 0.0

# Acoustic speech rate proxy: voiced frames per second from pyin
voiced_rate = float(np.mean(voiced_flag)) * (len(voiced_flag) / duration_sec) if duration_sec > 0 else 0.0

print("Pause ratio:", pause_ratio)
print("Pitch mean/std/range (Hz):", pitch_mean, pitch_std, pitch_range)
print("RMS mean/std/CV:", rms_mean, rms_std, rms_cv)
print("Integrated loudness (LUFS):", lufs)
print("Speech rate (WPM):", wpm)
print("Acoustic speech rate proxy (voiced frames/sec):", voiced_rate)




Pause ratio: 0.17900000000000002
Pitch mean/std/range (Hz): 226.4934830700988 40.350131208724754 315.5499338904208
RMS mean/std/CV: 0.04626308009028435 0.04169349744915962 0.9012259528663784
Integrated loudness (LUFS): -23.473406650906426
Speech rate (WPM): 143.04635761589404
Acoustic speech rate proxy (voiced frames/sec): 24.999999999999996


## 1. Speech rate
**What it means:** How fast the person is speaking, measured in words per minute (WPM), plus an acoustic proxy for how dense the voiced sound is.

#### **How Pipeline 1 gets it:**

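**Words/minute (WPM):** faster‑whisper transcribes the audio and returns timestamps for each segment. Count the words and divide by the time between the start of the first segment and the end of the last one to get WPM.

**Acoustic proxy:** librosa.pyin flags “voiced” frames (frames where the algorithm detects a clear pitch). The number of voiced frames per second serves as a rough proxy for speaking speed. Strictly, it measures how much of the time is voiced, so treat it as a density measure rather than a true rate.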
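
The WPM calculation can be isolated into a small helper. This is a minimal sketch: `words_per_minute` is a hypothetical name, and plain `(start_sec, end_sec, text)` tuples stand in for faster-whisper segment objects (which expose `.start`, `.end`, and `.text`).

```python
from typing import Iterable, Optional, Tuple

def words_per_minute(segments: Iterable[Tuple[float, float, str]]) -> float:
    """Compute WPM from (start_sec, end_sec, text) tuples.

    The time base runs from the start of the first non-empty segment to
    the end of the last one, so pauses between segments are included.
    """
    words = 0
    first_t: Optional[float] = None
    last_t: Optional[float] = None
    for start, end, text in segments:
        tokens = text.split()
        if not tokens:
            continue  # skip empty segments
        words += len(tokens)
        if first_t is None:
            first_t = start
        last_t = end
    if words == 0 or first_t is None or last_t is None or last_t <= first_t:
        return 0.0
    return words / ((last_t - first_t) / 60.0)

# Toy input: 10 words spread over 30 seconds.
demo = [
    (0.0, 15.0, "one two three four five"),
    (15.0, 30.0, " six seven eight nine ten "),
]
print(words_per_minute(demo))  # 10 words / 0.5 min -> 20.0
```

Because pauses count toward the time base, this is an overall speaking rate, not an articulation rate.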

## 2. Pause ratio
**What it means:** The fraction of the one-minute clip that is silence or non‑speech rather than actual speaking.

#### **How Pipeline 1 gets it:**
VAD (voice activity detection) from WebRTC or Silero marks each short frame as speech or non-speech.

Pause ratio = total non‑speech time / total duration (so 0.2 means 20% of the time is pauses).
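
The formula above can be sketched independently of any particular VAD. The helper below assumes the VAD has already produced one boolean per fixed-length frame; the function name `pause_ratio` and its parameters are illustrative, not part of webrtcvad or Silero.

```python
from typing import Sequence

def pause_ratio(speech_flags: Sequence[bool], frame_ms: float, duration_sec: float) -> float:
    """Fraction of the clip not marked as speech.

    speech_flags: one VAD decision per frame (True = speech).
    frame_ms:     frame length in milliseconds.
    duration_sec: total clip duration in seconds.
    """
    if duration_sec <= 0:
        return 0.0
    speech_time = sum(1 for flag in speech_flags if flag) * frame_ms / 1000.0
    pause_time = max(duration_sec - speech_time, 0.0)
    return pause_time / duration_sec

# 100 frames of 30 ms (3 s total), 80 of them marked as speech.
flags = [True] * 80 + [False] * 20
print(pause_ratio(flags, frame_ms=30, duration_sec=3.0))  # roughly 0.2
```

Any trailing audio shorter than one frame is never classified, so it counts as pause time here, just as it does in the notebook cell.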


## 3. Pitch variation
**What it means:** How much the voice moves up and down in pitch (monotone vs expressive).

#### **How Pipeline 1 gets it:**
librosa.pyin extracts a pitch curve (F0 in Hz) over time.

Take basic stats on the voiced F0 values: mean (average pitch), standard deviation (how much it varies), and range (highest minus lowest).
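
As a small illustration of those statistics, the sketch below takes an F0 track in which unvoiced frames are NaN (the convention librosa.pyin uses) and returns mean, population standard deviation, and range. It uses only the standard library; `pitch_stats` is a hypothetical helper name.

```python
import math
from typing import Dict, Sequence

def pitch_stats(f0_hz: Sequence[float]) -> Dict[str, float]:
    """Mean, population std, and range over voiced (non-NaN) F0 values."""
    voiced = [f for f in f0_hz if not math.isnan(f)]
    if not voiced:
        return {"mean": math.nan, "std": math.nan, "range": math.nan}
    mean = sum(voiced) / len(voiced)
    variance = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    return {
        "mean": mean,
        "std": math.sqrt(variance),
        "range": max(voiced) - min(voiced),
    }

# Toy track: two unvoiced frames around three voiced ones.
track = [math.nan, 200.0, 220.0, 240.0, math.nan]
print(pitch_stats(track))
```

The range depends on just two frames, so a single pitch-tracking error can inflate it; the standard deviation is usually the steadier of the two.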

## 4. Volume stability
**What it means:** How steady the loudness is: does the speaker keep a consistent level or jump between too quiet and too loud?

#### **How Pipeline 1 gets it:**
Short‑term RMS from librosa gives an energy curve; compute its average and how much it fluctuates (coefficient of variation).

pyloudnorm measures overall loudness in LUFS using the ITU‑R BS.1770 algorithm, giving a “perceived loudness” number for the whole minute.
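
To make the coefficient of variation concrete, here is a simplified pure-Python sketch. It is not a reimplementation of `librosa.feature.rms` (which pads and centers frames by default); `frame_rms` and `coefficient_of_variation` are hypothetical helpers.

```python
import math
from typing import List, Sequence

def frame_rms(samples: Sequence[float], frame_len: int, hop_len: int) -> List[float]:
    """RMS of each full frame (no padding, unlike librosa's default)."""
    values = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        values.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return values

def coefficient_of_variation(values: Sequence[float], eps: float = 1e-8) -> float:
    """Population std divided by mean; eps guards against division by zero."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return std / (mean + eps)

steady = [0.5, -0.5] * 100                    # constant level
uneven = [0.1, -0.1] * 50 + [0.9, -0.9] * 50  # quiet half, then loud half

print(coefficient_of_variation(frame_rms(steady, 20, 10)))  # 0.0
print(coefficient_of_variation(frame_rms(uneven, 20, 10)))  # clearly above 0
```

A CV near 0 means a steady level. Silent pauses pull frame RMS toward zero and raise the CV, so part of the value for a real recording comes from pausing, not from loudness changes while speaking.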

In plain terms for a user:

* “You spoke at X words per minute, which is fast/slow for a 1‑minute pitch.”
* “You paused for Y% of the time; that’s low/normal/high compared to a typical clear talk.”
* “Your pitch moved a little/a lot; this sounds monotone vs expressive.”
* “Your volume stayed stable / jumped around a lot; that sounds calm vs slightly chaotic.”
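
One way to turn measured values into sentences like these is a small bucketing helper. The sketch below is illustrative only: `bucket` and `feedback` are hypothetical names, and the threshold bands passed in the demo are arbitrary placeholders, not validated norms.

```python
from typing import List, Sequence, Tuple

def bucket(value: float, low: float, high: float,
           labels: Sequence[str] = ("low", "normal", "high")) -> str:
    """Map a value to one of three labels using caller-supplied cut-offs."""
    if value < low:
        return labels[0]
    if value > high:
        return labels[2]
    return labels[1]

def feedback(wpm: float, pause_ratio: float,
             wpm_band: Tuple[float, float],
             pause_band: Tuple[float, float]) -> List[str]:
    """Fill the X / Y slots of the user-facing sentences."""
    rate_label = bucket(wpm, *wpm_band, labels=("slow", "typical", "fast"))
    pause_label = bucket(pause_ratio, *pause_band)
    return [
        f"You spoke at {wpm:.0f} words per minute, which is {rate_label} for a 1-minute pitch.",
        f"You paused for {pause_ratio * 100:.0f}% of the time; that's {pause_label} "
        "compared to a typical clear talk.",
    ]

# Placeholder bands (assumptions for the demo, not calibrated values).
lines = feedback(143.05, 0.179, wpm_band=(120.0, 170.0), pause_band=(0.10, 0.30))
for line in lines:
    print(line)
```

Passing the bands in as arguments lets them be calibrated later against real recordings without touching the wording.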

## Next steps

* Refine the Silero VAD integration into the current minimal code, or

* Define a clean FeatureExtractor class around librosa + Silero + pyloudnorm + faster-whisper with typed outputs ready for ML.
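
As a starting point for the typed outputs, a frozen dataclass could hold the Pipeline 1 metrics. This is a sketch under assumptions: the class name `VoiceFeatures`, its field names, and `to_vector` are hypothetical, and the demo values are the rounded numbers printed by the cell above.

```python
from dataclasses import dataclass, fields
from typing import List

@dataclass(frozen=True)
class VoiceFeatures:
    """Typed container for the metrics computed by Pipeline 1."""
    wpm: float
    pause_ratio: float
    pitch_mean_hz: float
    pitch_std_hz: float
    pitch_range_hz: float
    rms_cv: float
    lufs: float

    @classmethod
    def feature_names(cls) -> List[str]:
        """Field names in declaration order (stable column order for ML)."""
        return [f.name for f in fields(cls)]

    def to_vector(self) -> List[float]:
        """Values in the same order as feature_names()."""
        return [getattr(self, name) for name in self.feature_names()]

features = VoiceFeatures(
    wpm=143.05, pause_ratio=0.179,
    pitch_mean_hz=226.49, pitch_std_hz=40.35, pitch_range_hz=315.55,
    rms_cv=0.901, lufs=-23.47,
)
print(features.feature_names())
print(features.to_vector())
```

A FeatureExtractor class could then return a `VoiceFeatures` instance from a single `extract(path)` method, keeping the extraction code and the ML-facing schema separate.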