Add optional character-level metrics (CER / cpCER / tcpCER) for languages with ambiguous word segmentation #126

Description

@hwanython

Hi,

Thanks for the great toolkit. I've been using MeetEval recently while experimenting with multi-speaker ASR evaluation.

I’m an engineer/researcher from Korea, working with multilingual speech recognition systems.

While using MeetEval, I noticed that the current metrics are mainly WER-based (WER, cpWER, tcpWER). This works well for languages with clear whitespace word boundaries like English.

However, for languages such as:

  • Chinese
  • Japanese
  • Korean

word segmentation can be ambiguous, so the same transcript pair can yield different WERs depending on the tokenizer or spacing convention used. Because of this, many ASR benchmarks for these languages also report Character Error Rate (CER).
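
To make this concrete, here is a tiny self-contained illustration (plain Python, no MeetEval code; the Korean sentence and its two spacings are invented for the example). The same character sequence, segmented two different ways, produces a nonzero WER but a zero CER:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds D[i-1][j-1]; d[j] still holds D[i-1][j] at this point
            prev, d[j] = d[j], min(d[j - 1] + 1, d[j] + 1, prev + (r != h))
    return d[-1]

ref = '나는 학교에 간다'   # "I go to school", one spacing convention
hyp = '나는 학교 에 간다'  # identical characters, different segmentation

wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())
cer = edit_distance(list(ref.replace(' ', '')),
                    list(hyp.replace(' ', ''))) / len(ref.replace(' ', ''))
print(f'WER = {wer:.2f}, CER = {cer:.2f}')  # WER = 0.67, CER = 0.00
```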

So I was wondering whether it would make sense to add character-level variants of the existing metrics, for example:

  • CER
  • cpCER
  • tcpCER

Conceptually, this would only change the evaluation unit (characters instead of words) while keeping the rest of the pipeline (speaker permutation, timing constraints, etc.) the same; a rough sketch follows.
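
In fact, cpCER can already be approximated by pre-tokenizing transcripts into characters and feeding them through the existing cpWER. A minimal sketch, assuming the `meeteval.wer.cp_word_error_rate` entry point from the README accepts `{speaker: transcript}` mappings (please correct me if the API differs):

```python
import meeteval

def to_char_tokens(text: str) -> str:
    # One possible convention: drop whitespace, then make every remaining
    # character its own "word" so the word-level aligner operates on characters.
    # Whether whitespace should count as a character is a separate design choice.
    return ' '.join(text.replace(' ', ''))

reference = {'spk0': '나는 학교에 간다', 'spk1': '오늘 날씨가 좋다'}
hypothesis = {'A': '나는 학교 에 간다', 'B': '오늘 날씨가 좋다'}

cpcer = meeteval.wer.cp_word_error_rate(
    reference={spk: to_char_tokens(t) for spk, t in reference.items()},
    hypothesis={spk: to_char_tokens(t) for spk, t in hypothesis.items()},
)
print(cpcer)  # same cpWER machinery, now counting character-level errors
```

tcpCER would additionally need a convention for subdividing a word's time interval across its characters, similar in spirit to the pseudo-word-level timing that tcpWER already uses, but I have not dug into that part of the code yet.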

Before trying to implement this, I wanted to ask:

  • Would adding character-level metrics fit within the scope of MeetEval?
  • If yes, would you prefer a separate CER implementation, or a more general token-unit abstraction? (A rough sketch of what I mean follows this list.)
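
By "token-unit abstraction" I mean something like the following. This is purely hypothetical; the `tokenize` parameter and these names do not exist in MeetEval today and are invented for illustration:

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]

def word_tokenizer(text: str) -> List[str]:
    """Current behaviour: whitespace-separated words."""
    return text.split()

def char_tokenizer(text: str) -> List[str]:
    """Character units; drops whitespace (one possible convention)."""
    return [c for c in text if not c.isspace()]

print(word_tokenizer('나는 학교에 간다'))  # ['나는', '학교에', '간다']
print(char_tokenizer('나는 학교에 간다'))  # ['나', '는', '학', '교', '에', '간', '다']

# A character-level metric would then be the same alignment code with a
# different tokenizer, e.g. (hypothetical signature):
# cp_word_error_rate(reference, hypothesis, tokenize=char_tokenizer)
```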

If this sounds useful, I’d be happy to try preparing a PR.

Thanks!
