In [1]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


## Classification

In [2]:
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)

Device set to use cpu


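The `anton-l/xtreme_s_xlsr_300m_minds14` checkpoint is a fine-tuned version of `facebook/wav2vec2-xls-r-300m`, trained on the MINDS-14 portion of the XTREME-S benchmark (listed on the model card as GOOGLE/XTREME_S - MINDS14.ALL).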

In [3]:
example = minds[0]


In [5]:
example

{'path': '/teamspace/studios/this_studio/.cache/huggingface/datasets/downloads/extracted/02889adf80bc6103bb58df021e360f1b594319f40059d87953dcab9c5d45c2bf/en-AU~PAY_BILL/response_4.wav',
 'audio': {'path': '/teamspace/studios/this_studio/.cache/huggingface/datasets/downloads/extracted/02889adf80bc6103bb58df021e360f1b594319f40059d87953dcab9c5d45c2bf/en-AU~PAY_BILL/response_4.wav',
  'array': array([2.36119668e-05, 1.92324660e-04, 2.19284790e-04, ...,
         9.40907281e-04, 1.16613181e-03, 7.20883254e-04]),
  'sampling_rate': 16000},
 'transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'english_transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'intent_class': 13,
 'lang_id': 2}

In [6]:
classifier(example["audio"]["array"])[:5]


[{'score': 0.9625310301780701, 'label': 'pay_bill'},
 {'score': 0.028672808781266212, 'label': 'freeze'},
 {'score': 0.0033498003613203764, 'label': 'card_issues'},
 {'score': 0.002005805494263768, 'label': 'abroad'},
 {'score': 0.0008484331192448735, 'label': 'high_value_payment'}]

In [7]:
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

'pay_bill'
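
The top-scoring prediction matches the ground-truth intent for this example. As a minimal sketch of how that comparison could be automated, the helper below picks the highest-scoring label from a pipeline-style output list. The name `top_prediction` is hypothetical (it is not part of `transformers`), and the scores are copied, rounded, from the output above.

```python
def top_prediction(predictions):
    """Return the label with the highest score from a list of
    {'score': float, 'label': str} dicts (the shape shown above)."""
    return max(predictions, key=lambda p: p["score"])["label"]


# Scores copied (rounded) from the classifier output above.
predictions = [
    {"score": 0.9625, "label": "pay_bill"},
    {"score": 0.0287, "label": "freeze"},
    {"score": 0.0033, "label": "card_issues"},
    {"score": 0.0020, "label": "abroad"},
    {"score": 0.0008, "label": "high_value_payment"},
]

# In the notebook this would be id2label(example["intent_class"]).
true_label = "pay_bill"

is_correct = top_prediction(predictions) == true_label
print(is_correct)
```

Using `max` rather than taking the first element avoids relying on the list being sorted by score.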

## Automatic speech recognition (ASR)

In [8]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 22aad52 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu


In [9]:
example = minds[0]
asr(example["audio"]["array"])

{'text': 'I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST'}

In [10]:
example["english_transcription"]


'I would like to pay my electricity bill using my card can you please assist'
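
The model emits upper-case text while the reference is mixed case, so the two strings need a shared normalization before they are compared word by word. The sketch below is one possible normalizer, assuming that lower-casing, dropping punctuation (except apostrophes), and collapsing whitespace are acceptable for this dataset. The function name `normalize_text` is hypothetical.

```python
import re


def normalize_text(text: str) -> str:
    """Lower-case, replace punctuation (except apostrophes) with spaces,
    and collapse runs of whitespace into single spaces."""
    text = text.lower()
    # Keep word characters, whitespace, and apostrophes; blank out the rest.
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


hypothesis = "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST"
reference = "I would like to pay my electricity bill using my card can you please assist"

print(normalize_text(hypothesis))
print(normalize_text(reference))
```

After normalization the two strings differ only in the word the model got wrong (`cad` instead of `card`).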

**Evaluation metrics**

Word Error Rate (WER) is the most widely used metric for ASR. It is defined as:

WER = (S + D + I) / N

where:

- S = substitutions (a wrong word in place of the correct one)
- D = deletions (a missed word)
- I = insertions (an extra word)
- N = total number of words in the reference

Example:

- Reference: how are you doing
- Hypothesis: how do you doing

Here S = 1 (are → do), D = 0, and I = 0, so WER = 1 / 4 = 25%.

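Lower WER is better: a perfect ASR system would have WER = 0. Because insertions are counted in the numerator but not in N, WER can exceed 1 when the hypothesis contains many extra words.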

In [15]:
import numpy as np

def wer(reference: str, hypothesis: str) -> float:
    """
    Calculate the Word Error Rate (WER) between a reference and a hypothesis sentence.

    Args:
        reference (str): The ground truth sentence.
        hypothesis (str): The ASR model output sentence.

    Returns:
        float: WER = (S + D + I) / N
    """
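    ref_words = reference.strip().split()
    hyp_words = hypothesis.strip().split()
    n = len(ref_words)
    if n == 0:
        # WER divides by the reference length, so an empty reference is undefined.
        raise ValueError("reference must contain at least one word")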

    # Initialize the edit distance matrix
    dp = np.zeros((len(ref_words) + 1, len(hyp_words) + 1), dtype=int)

    for i in range(len(ref_words) + 1):
        dp[i][0] = i
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j

    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                substitution = dp[i - 1][j - 1] + 1
                insertion = dp[i][j - 1] + 1
                deletion = dp[i - 1][j] + 1
                dp[i][j] = min(substitution, insertion, deletion)

    wer_score = dp[len(ref_words)][len(hyp_words)] / n
    return wer_score


In [18]:
wer(example["english_transcription"].lower(),
    asr(example["audio"]["array"])["text"].lower())

0.06666666666666667
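
The `wer` function above returns only the total error rate. To see which kind of error produced the score, one option is to backtrack through the same edit-distance table and count S, D, and I separately. The sketch below does this in pure Python. The name `wer_breakdown` is hypothetical, and when several alignments have the same minimum cost it simply prefers substitutions, then deletions, then insertions.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Return counts of substitutions (S), deletions (D), insertions (I),
    the reference length (N), and the resulting WER."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost,  # match or substitution
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner, counting each edit type.
    s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1  # words match, no edit
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            s += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return {"S": s, "D": d, "I": ins, "N": len(ref), "WER": (s + d + ins) / len(ref)}


# Lower-cased reference and model output from the cells above.
reference = "i would like to pay my electricity bill using my card can you please assist"
hypothesis = "i would like to pay my electricity bill using my cad can you please assist"

result = wer_breakdown(reference, hypothesis)
print(result["S"], result["D"], result["I"], result["N"])
```

For this example the breakdown is a single substitution (`card` → `cad`) in a 15-word reference, which accounts for the score of 1/15 ≈ 0.067.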