Escucha

Transcribe, identify speakers, and summarize Spanish MP4 recordings — fully local, runs on Windows, no cloud required.

Upload a video through the browser, watch it process in real time, and get a timestamped transcript with speaker labels saved to disk.

What it does

MP4 video
   │
   ├─ FFmpeg extracts 16 kHz mono WAV
   ├─ Whisper transcribes speech → timestamped text segments
   ├─ pyannote identifies who speaks when → speaker labels
   ├─ Merger aligns text + speakers
   ├─ Ollama / Claude generates a structured summary (optional)
   │
   └─ output/YYYYMMDD_HHMMSS_{id}_transcript.txt   ← saved to disk
      output/YYYYMMDD_HHMMSS_{id}_summary.txt       ← saved to disk
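
The "Merger aligns text + speakers" stage can be pictured as an overlap match between Whisper segments and diarization turns. The following is a minimal sketch under that assumption, not the code in merger.py: the `Segment` and `Turn` types, the `assign_speakers` function, and the `SPEAKER_00` fallback are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A transcribed span of speech (hypothetical shape)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A diarization turn: who spoke between start and end (hypothetical shape)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Return the length in seconds shared by two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, default="SPEAKER_00"):
    """Label each segment with the speaker whose turn overlaps it the most.

    Segments that overlap no turn keep the default label (an assumption).
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = default, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg))
    return labeled
```

Picking the largest overlap keeps a segment that straddles a speaker change attached to whoever spoke for most of it; a real merger might instead split such segments at the turn boundary.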

Key features

- Real-time progress in the browser via WebSocket
- Up to 6 speakers identified automatically
- Summary in Spanish (bullet points + decisions + action items)
- Download transcript as TXT or SRT subtitles
- GPU (CUDA) and CPU-only modes
- Transcription and speaker identification run locally, so audio never leaves your machine (the optional Claude summary sends transcript text to the Claude API)

Requirements

Requirement	Notes
Windows 10/11 64-bit
Python 3.10 – 3.14	python.org
NVIDIA GPU + CUDA drivers	Optional but strongly recommended
FFmpeg full-shared build	See Step 4 below
HuggingFace account + token	Free — needed for speaker identification
Ollama (optional)	For local summarization

Setup

Step 1 — Clone the repository

git clone <repo-url> escucha
cd escucha

Step 2 — Create a virtual environment

python -m venv .venv
.venv\Scripts\activate

Step 3 — Install PyTorch

Choose one of the following depending on your hardware:

# GPU (NVIDIA with CUDA 12.8) — recommended
pip install torch==2.11.0 torchaudio==2.11.0 --index-url https://download.pytorch.org/whl/cu128

# CPU only — slower but works on any machine
pip install torch==2.11.0 torchaudio==2.11.0 --index-url https://download.pytorch.org/whl/cpu

Why first? If you run pip install -r requirements.txt before this step, pip pulls CPU-only PyTorch wheels from PyPI even with a GPU available. Always install torch with the correct index URL first.
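
To confirm which PyTorch build ended up in the virtual environment, a quick check along these lines may help. It is a sketch only: `describe_torch` is a hypothetical helper, not part of Escucha, and it relies on the standard `torch.cuda.is_available()` call.

```python
def describe_torch() -> str:
    """Report the installed torch version and whether CUDA is usable."""
    try:
        import torch  # imported lazily so the check also runs without torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Device 0 is the first visible NVIDIA GPU.
        name = torch.cuda.get_device_name(0)
        return f"torch {torch.__version__}: CUDA available ({name})"
    return f"torch {torch.__version__}: CPU only"


if __name__ == "__main__":
    print(describe_torch())
```

If this reports "CPU only" on a machine with an NVIDIA GPU, the CPU wheel was probably installed; reinstalling torch with the cu128 index URL from this step is the likely fix.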

Step 4 — Install FFmpeg (full-shared build)

Speaker diarization (pyannote) requires FFmpeg's shared libraries (DLLs), not just the executable.

# Install with winget — adds DLLs to PATH automatically
winget install -e --id Gyan.FFmpeg.Shared

Then open a new terminal so the updated PATH takes effect.

Verify:

ffmpeg -version   # should show "full_build-shared" in the first line

If you skip this step, transcription still works but speaker identification will fail silently and all segments will be labeled SPEAKER_00.
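
Because the failure is silent, it can be worth scripting the verification. The sketch below assumes the marker string "full_build-shared" quoted above appears on the first line of `ffmpeg -version`; the helper names `is_shared_build` and `check_ffmpeg` are hypothetical and not part of Escucha.

```python
import shutil
import subprocess


def is_shared_build(version_output: str, marker: str = "full_build-shared") -> bool:
    """Return True if the first line of `ffmpeg -version` contains the marker."""
    lines = version_output.strip().splitlines()
    return bool(lines) and marker in lines[0]


def check_ffmpeg() -> str:
    """Locate ffmpeg on PATH and report whether it looks like the shared build."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return "ffmpeg not found on PATH"
    result = subprocess.run([exe, "-version"], capture_output=True, text=True, check=False)
    if is_shared_build(result.stdout):
        return "shared build detected"
    return "ffmpeg found, but it does not look like the full-shared build"


if __name__ == "__main__":
    print(check_ffmpeg())
```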

Step 5 — Install Python dependencies

pip install -r requirements.txt --prefer-binary
pip install -e .

Step 6 — Set up HuggingFace (for speaker identification)

1. Create a free account at huggingface.co
2. Accept the terms for these two models (required):
   - pyannote/speaker-diarization-3.1
   - pyannote/speaker-diarization-community-1
3. Create an access token at huggingface.co/settings/tokens

Step 7 — Configure environment

copy .env.example .env

Open .env and fill in your values:

# Required for speaker identification
HF_TOKEN=hf_your_token_here

# Model size: tiny/base/small/medium/large-v3
# GPU (>=8GB VRAM): large-v3
# CPU only: small or medium
WHISPER_MODEL=large-v3

Step 8 — Install Ollama (optional, for summarization)

Download from ollama.com/download/windows, then pull a model:

ollama pull llama3.2

Ollama runs as a background service automatically on Windows. Skip this step if you plan to use the Claude API instead.
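
To check that the Ollama service is reachable before starting a job, one option is to query its `/api/tags` endpoint, which lists locally pulled models. This is a sketch under the assumption that Ollama listens on the default address `http://localhost:11434`; `ollama_is_running` is a hypothetical helper, not a function from Escucha.

```python
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama HTTP API answers at base_url."""
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors (URLError subclasses OSError).
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```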

Step 9 — Launch

python -m escucha.main

Open your browser at http://127.0.0.1:8000

On first launch, Whisper and pyannote models download automatically (~4 GB total). This only happens once.

Usage

1. Drag an MP4 file onto the upload zone (or click Browse)
2. Choose options:
   - Hablantes (speakers): number of speakers (auto-detect or 2–6)
   - Idioma (language): Auto, Español, or English
   - Generar resumen (generate summary): enable or disable summarization
   - Usar Claude (use Claude): uses the Claude API instead of Ollama (requires ANTHROPIC_API_KEY in .env)
3. Click Iniciar Transcripción (start transcription)
4. Watch the progress bar advance through each stage
5. When done, the transcript and summary appear in the browser
6. Click Descargar TXT or Descargar SRT (download TXT / download SRT) to save the transcript
7. Files are also saved automatically to the output/ folder

Output files

Every completed job writes two files to output/:

output/
  20260428_195640_b6742983_transcript.txt
  20260428_195640_b6742983_summary.txt
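
The naming pattern can be reproduced with a timestamp plus a short job ID. The sketch below is illustrative rather than the project's own code: `output_paths` and `new_job_id` are hypothetical helpers, and the 8-character hex ID is an assumption based on the example filenames.

```python
import uuid
from datetime import datetime
from pathlib import Path


def new_job_id() -> str:
    """Return a short random ID (8 hex characters, assumed from the examples)."""
    return uuid.uuid4().hex[:8]


def output_paths(output_dir, job_id, now=None):
    """Build the transcript and summary paths for one job.

    The stem is YYYYMMDD_HHMMSS_{id}, so two jobs only collide if they
    share both the same second and the same ID.
    """
    now = now or datetime.now()
    stem = f"{now:%Y%m%d_%H%M%S}_{job_id}"
    base = Path(output_dir)
    return base / f"{stem}_transcript.txt", base / f"{stem}_summary.txt"


if __name__ == "__main__":
    transcript, summary = output_paths("output", new_job_id())
    print(transcript)
    print(summary)
```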

Transcript format (TXT):

[00:00:00] SPEAKER_00: Buenos días, vamos a comenzar la reunión.
[00:00:04] SPEAKER_01: Perfecto, tengo los números del trimestre.

SRT format (for video players / subtitle editors):

1
00:00:00,000 --> 00:00:04,520
[SPEAKER_00] Buenos días, vamos a comenzar la reunión.
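
Both formats follow mechanically from a segment's start time, end time, speaker, and text. The functions below are a minimal sketch that reproduces the two samples above; they are hypothetical stand-ins and not the formatters in export.py.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def txt_line(start: float, speaker: str, text: str) -> str:
    """Format one transcript line: [HH:MM:SS] SPEAKER: text (fractions dropped)."""
    hours, rem = divmod(int(start), 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}] {speaker}: {text}"


def srt_cue(index: int, start: float, end: float, speaker: str, text: str) -> str:
    """Format one SRT cue: index, time range, then the labeled text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n[{speaker}] {text}\n"


if __name__ == "__main__":
    print(txt_line(0.0, "SPEAKER_00", "Buenos días, vamos a comenzar la reunión."))
    print(srt_cue(1, 0.0, 4.52, "SPEAKER_00", "Buenos días, vamos a comenzar la reunión."))
```

Working in integer milliseconds avoids floating-point drift when splitting a time into hours, minutes, and seconds.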

Files are named by date, time, and job ID — nothing is ever overwritten.

Expected processing times

Audio length	GPU (RTX 3060+)	CPU only (i7/Ryzen 7)
30 minutes	~5–8 min	~40–60 min
60 minutes	~10–15 min	~80–120 min

GPU times use large-v3 + INT8. CPU times use medium + INT8.

Configuration reference

All settings are in .env. See .env.example for the full list.

Variable	Default	Description
`HF_TOKEN`	—	HuggingFace token (required for diarization)
`WHISPER_MODEL`	`large-v3`	Model size: tiny, base, small, medium, large-v3
`WHISPER_COMPUTE_TYPE`	`int8`	int8 (fast), float16 (quality), float32 (CPU safe)
`DEVICE`	`auto`	auto, cuda, or cpu
`OLLAMA_BASE_URL`	`http://localhost:11434`	Ollama API endpoint
`OLLAMA_MODEL`	`llama3.2`	Ollama model for summarization
`ANTHROPIC_API_KEY`	—	Optional — enables Claude summarization toggle in UI
`HOST`	`127.0.0.1`	Server bind address
`PORT`	`8000`	Server port
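
As an illustration of how these variables and defaults fit together, the sketch below reads them from the process environment and falls back to the defaults in the table. It is not the project's config.py: `load_settings` and `DEFAULTS` are hypothetical names, and parsing of the .env file itself is left out.

```python
import os

# Defaults copied from the configuration table above.
DEFAULTS = {
    "WHISPER_MODEL": "large-v3",
    "WHISPER_COMPUTE_TYPE": "int8",
    "DEVICE": "auto",
    "OLLAMA_BASE_URL": "http://localhost:11434",
    "OLLAMA_MODEL": "llama3.2",
    "HOST": "127.0.0.1",
    "PORT": "8000",
}


def load_settings(env=None) -> dict:
    """Merge environment values over the defaults and validate a few fields."""
    env = os.environ if env is None else env
    settings = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    settings["PORT"] = int(settings["PORT"])
    # Variables without a default become None when unset or empty.
    settings["HF_TOKEN"] = env.get("HF_TOKEN") or None
    settings["ANTHROPIC_API_KEY"] = env.get("ANTHROPIC_API_KEY") or None
    if settings["DEVICE"] not in {"auto", "cuda", "cpu"}:
        raise ValueError(f"DEVICE must be auto, cuda, or cpu, got {settings['DEVICE']!r}")
    return settings


if __name__ == "__main__":
    for key, value in load_settings().items():
        # Avoid printing secrets in full.
        shown = "(set)" if key in {"HF_TOKEN", "ANTHROPIC_API_KEY"} and value else value
        print(f"{key}={shown}")
```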

Project structure

escucha/
├── src/escucha/
│   ├── audio.py          # FFmpeg extraction
│   ├── transcriber.py    # faster-whisper
│   ├── diarizer.py       # pyannote speaker ID
│   ├── merger.py         # align text + speakers
│   ├── summarizer.py     # Ollama / Claude
│   ├── export.py         # TXT and SRT formatters
│   ├── pipeline.py       # orchestrates everything
│   ├── jobs.py           # job state + WebSocket broadcast
│   ├── routes.py         # FastAPI endpoints
│   ├── config.py         # settings from .env
│   └── main.py           # app factory + entry point
├── static/
│   └── index.html        # single-file frontend
├── tests/                # 38 unit + integration tests
├── output/               # transcripts saved here (gitignored)
├── bin/                  # optional: bundled ffmpeg.exe
├── .env.example          # config template
└── requirements.txt

Running tests

.venv\Scripts\activate
pytest --tb=short -q

The first run downloads the tiny Whisper model (~75 MB). All 38 tests should pass. Diarization tests are fully mocked and require no internet.

Troubleshooting

Speaker labels all show SPEAKER_00
FFmpeg full-shared DLLs are not on PATH. Run winget install -e --id Gyan.FFmpeg.Shared and open a new terminal.

No speech detected / empty transcript
Try changing the language dropdown to Auto instead of Español. The forced-language mode can fail on recordings with background noise or atypical accents.

pydantic-core fails to install
On Python 3.14, add --prefer-binary to the pip command. A pre-built wheel exists for Python 3.14 starting from pydantic 2.13.3.

Port 8000 already in use
Another server process is running. Stop it with Get-Process python | Stop-Process -Force (this stops every running Python process, not only Escucha), or set PORT=8001 in .env.
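
Before stopping processes, it may be worth confirming that something is actually listening on the port. The following sketch uses a plain TCP connection attempt; `port_in_use` is a hypothetical helper, and the host and port defaults simply mirror the HOST and PORT defaults in the configuration reference.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 8000, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("Port 8000 in use:", port_in_use())
```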

Summarization says "Ollama is not running"
Start Ollama: ollama serve. Or set ANTHROPIC_API_KEY in .env and use the Claude toggle in the UI.

Models take long to load at startup
First launch downloads ~4 GB. Subsequent launches use the local HuggingFace cache and take ~20 seconds on GPU.
