Hi,
Thanks for the great toolkit. I've been using MeetEval recently while experimenting with multi-speaker ASR evaluation.
I’m an engineer/researcher from Korea, working with multilingual speech recognition systems.
While using MeetEval, I noticed that the current metrics are mainly WER-based (WER, cpWER, tcpWER). This works well for languages with clear whitespace word boundaries like English.
However, for languages such as Korean, Japanese, and Chinese, word segmentation can be ambiguous, and evaluation scores can depend on the tokenizer used. Because of this, many ASR benchmarks for these languages also report Character Error Rate (CER).
So I was wondering whether it would make sense to add character-level variants of the existing metrics, for example cpCER and tcpCER as counterparts of cpWER and tcpWER.
Conceptually this would only change the evaluation unit (characters instead of words) while keeping the existing pipeline (speaker permutation, timing constraints, etc.) the same.
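To illustrate what I mean by "only changing the evaluation unit": a common workaround today is to re-tokenize transcripts so that every character becomes a token, and then feed them through the existing word-level metrics unchanged. A minimal sketch (the helper name is mine, not part of MeetEval's API):

```python
def to_char_tokens(text: str) -> str:
    """Re-tokenize a transcript so each character becomes a 'word'.

    Whitespace is dropped first, since word boundaries should not
    count as units in a character-level metric. Hypothetical helper,
    shown only to illustrate the idea.
    """
    return " ".join(ch for ch in text if not ch.isspace())

# Applying this to both reference and hypothesis turns any WER-style
# metric (WER, cpWER, tcpWER) into its character-level counterpart,
# since only the unit of comparison changes.
print(to_char_tokens("안녕 하세요"))  # characters separated by spaces
```

A built-in option would mainly save users from this preprocessing step and avoid subtle inconsistencies (e.g. how whitespace or punctuation is handled) across papers.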
Before trying to implement this, I wanted to ask:
- Would adding character-level metrics fit within the scope of MeetEval?
- If yes, would you prefer a separate CER implementation, or a more general token unit abstraction?
If this sounds useful, I’d be happy to try preparing a PR.
Thanks!