This repository contains the code and evaluation pipeline for the research paper: "Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults".
This pipeline evaluates state-of-the-art Automatic Speech Recognition (ASR) models on Dutch speech from older adults, comparing their performance on both clinical conversation data (Welzijn.AI chatbot interactions) and general speech data (Mozilla Common Voice). The evaluation focuses on accuracy-speed trade-offs and model generalization capabilities.
- Audio Processing: Converts and segments audio files, and performs speaker diarization to separate speakers
- ASR Evaluation: Tests multiple ASR models including:
  - Generic multilingual models (Whisper variants, Voxtral)
  - Dutch-specific models (wav2vec2-xls-r-1b-dutch-3, whisper-native-elderly-9-dutch)
- Performance Analysis: Computes Word Error Rate (WER) and processing time metrics
- Comparative Study: Evaluates models on two datasets:
  - Welzijn.AI (clinical conversations with older adults), referred to as "Beatrix" in the code
  - Mozilla Common Voice (general Dutch speech of older adults)
- Generic multilingual models often outperform fine-tuned models
- Model truncation helps balance accuracy-speed trade-offs
- Some models show high WER due to hallucinations and mishearings
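For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition, not the code used in this repository: the function name `word_error_rate` is hypothetical, and the lowercasing and whitespace tokenization are assumptions that may differ from the normalization applied in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is dropped by the hypothesis.
    print(word_error_rate("ik heb goed geslapen", "ik heb geslapen"))
```

Because insertions count as errors, WER can exceed 1.0 when a model hallucinates more words than the reference contains, which is one way hallucinations inflate the score.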
Due to privacy concerns, no audio files or transcripts are provided in this repository. The pipeline is designed to work with your own audio data following the same structure as described in the paper.
Install dependencies with:
pip install -r requirements.txt
The main analysis is conducted through the analysis.ipynb notebook, which processes audio data, runs ASR models, and generates performance comparisons and visualizations.
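Processing time can be measured by wrapping each transcription call in a wall-clock timer. The sketch below is one possible approach and is not taken from the notebook: `timed_transcription` and the `transcribe` callable are hypothetical names standing in for whichever model wrapper you use.

```python
import time
from typing import Callable, Tuple


def timed_transcription(
    transcribe: Callable[[str], str], audio_path: str
) -> Tuple[str, float]:
    """Run a transcription callable and return (text, elapsed seconds)."""
    start = time.perf_counter()  # monotonic clock suited to interval timing
    text = transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return text, elapsed


if __name__ == "__main__":
    # Placeholder transcriber; replace with a real ASR model call.
    def fake_transcribe(path: str) -> str:
        return f"transcript of {path}"

    text, seconds = timed_transcription(fake_transcribe, "example.wav")
    print(text, f"{seconds:.6f}s")
```

Pairing the elapsed time with the WER of the returned text gives one accuracy-speed data point per model and recording.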
This paper is currently available as a preprint (arXiv:2508.08684) and has been accepted for publication at the HCINLP workshop @ EMNLP 2025.