## Pre-trained models

Remark:
For audio classification, decoder-only models introduce **unnecessary complexity**: they assume that the **output can be a sequence of predictions** (rather than a single class label), and so **generate multiple outputs**. As a result, they have slower inference and tend not to be used. Encoder-decoder models are largely omitted for the same reason. These architecture choices mirror those in NLP, where encoder-only models such as BERT are favoured for sequence classification tasks, and decoder-only models such as GPT are reserved for sequence generation tasks.
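
To make the contrast concrete, here is a minimal sketch of what an encoder-only classifier does with the encoder output: it pools the sequence of hidden states into one vector and maps it to one score per class, all in a single pass with no step-by-step decoding. The toy numbers, the mean-pooling choice and the function names (`mean_pool`, `classify`) are illustrative assumptions, not the internals of any specific checkpoint.

```python
import math

def mean_pool(hidden_states):
    """Average a (time, dim) list of frame vectors into one (dim,) vector."""
    n_frames = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(frame[d] for frame in hidden_states) / n_frames for d in range(dim)]

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(hidden_states, weights, bias):
    """Return one probability per class for the whole utterance."""
    pooled = mean_pool(hidden_states)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Toy encoder output: 3 frames, 2 features per frame (made-up numbers).
hidden_states = [[1.0, 0.0], [3.0, 2.0], [2.0, 1.0]]
# Toy classification head for 2 classes (made-up weights).
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

probs = classify(hidden_states, weights, bias)
print(probs)
```

However long the input is, the output is a single probability vector, which is why this design suits classification.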

## Task 1: Keyword spotting
Keyword spotting (KWS) is the task of identifying a keyword in a spoken utterance. The set of possible keywords forms the set of predicted class labels.

Thought: this is what happens in automated customer phone service systems, where the machine asks a question and listens for a keyword in the reply.


**Dataset 1: MINDS-14**

In [1]:
from datasets import load_dataset

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [2]:
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)

Device set to use cuda:0


In [3]:
# !pip install torchaudio

In [4]:
classifier(minds[0]["audio"])

[{'score': 0.9611986875534058, 'label': 'pay_bill'},
 {'score': 0.02960182912647724, 'label': 'freeze'},
 {'score': 0.0035503047984093428, 'label': 'card_issues'},
 {'score': 0.0021323109976947308, 'label': 'abroad'},
 {'score': 0.0008829635917209089, 'label': 'high_value_payment'},
 {'score': 0.0007431716076098382, 'label': 'direct_debit'},
 {'score': 0.00041438330663368106, 'label': 'latest_transactions'},
 {'score': 0.0003488645306788385, 'label': 'joint_account'},
 {'score': 0.00033650704426690936, 'label': 'address'},
 {'score': 0.00032862863736227155, 'label': 'balance'},
 {'score': 0.0001560128730488941, 'label': 'app_error'},
 {'score': 0.00015001850260887295, 'label': 'atm_limit'},
 {'score': 8.987425098894164e-05, 'label': 'cash_deposit'},
 {'score': 6.637126352870837e-05, 'label': 'business_loan'}]
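
The pipeline returns a list of `{'score', 'label'}` dictionaries, so turning it into a single decision is a small post-processing step. The sketch below reuses the first three entries of the output above; the helper name `top_prediction` and the `min_score` threshold of 0.5 are assumptions for illustration, not part of the `transformers` API.

```python
# First three entries of the pipeline output shown above.
predictions = [
    {"score": 0.9611986875534058, "label": "pay_bill"},
    {"score": 0.02960182912647724, "label": "freeze"},
    {"score": 0.0035503047984093428, "label": "card_issues"},
]

def top_prediction(preds, min_score=0.5):
    """Return the best label, or None when the model is not confident enough."""
    best = max(preds, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None

print(top_prediction(predictions))
```

Returning `None` below the threshold leaves room for a fallback, for example asking the caller to repeat the request.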

**Dataset 2: Speech Commands**

**Debug**

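Loading the dataset raised this error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8f in position 1: invalid start byte`. The cells below show the steps tried: upgrading `datasets` to 3.5.0 and clearing the datasets cache.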


In [13]:
# !pip install -U datasets==3.5.0

Collecting datasets==3.5.0
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 3.0.0
    Uninstalling datasets-3.0.0:
      Successfully uninstalled datasets-3.0.0
Successfully installed datasets-3.5.0


In [6]:
# Clean up cache
# from datasets import load_dataset
# import shutil
# import os
# from datasets import config

# # Clear the datasets cache
# cache_dir = config.HF_DATASETS_CACHE
# if os.path.exists(cache_dir):
#     shutil.rmtree(cache_dir)

In [1]:
from datasets import load_dataset

speech_commands = load_dataset(
    "google/speech_commands",
    "v0.02",
    split="test",
    streaming=True,
    trust_remote_code=True,
)

In [2]:
sample = next(iter(speech_commands))

In [4]:
from transformers import pipeline
classifier = pipeline(
    "audio-classification", model="MIT/ast-finetuned-speech-commands-v2"
)
classifier(sample["audio"].copy())

Device set to use cuda:0


[{'score': 0.9999935626983643, 'label': 'backward'},
 {'score': 3.480505768038711e-07, 'label': 'forward'},
 {'score': 3.386815592421044e-07, 'label': 'wow'},
 {'score': 3.346547146065859e-07, 'label': 'no'},
 {'score': 3.25733594763733e-07, 'label': 'nine'},
 {'score': 3.202465279628086e-07, 'label': 'bed'},
 {'score': 2.9240482035675086e-07, 'label': 'one'},
 {'score': 2.7300310989630816e-07, 'label': 'happy'},
 {'score': 2.6830909405362036e-07, 'label': 'go'},
 {'score': 2.597722641439759e-07, 'label': 'marvin'},
 {'score': 2.4865516934369225e-07, 'label': 'eight'},
 {'score': 2.2869116378387844e-07, 'label': 'two'},
 {'score': 2.2801869192790036e-07, 'label': 'up'},
 {'score': 2.1338999545150727e-07, 'label': 'three'},
 {'score': 1.9935103523494035e-07, 'label': 'zero'},
 {'score': 1.954659580860607e-07, 'label': 'four'},
 {'score': 1.8153299663481448e-07, 'label': 'five'},
 {'score': 1.815070334032498e-07, 'label': 'off'},
 {'score': 1.7533157858906634e-07, 'label': 'down'},
 ...]

## Language Identification

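Language identification can serve as a triage step: once the spoken language is known, the audio can be routed to an ASR model trained specifically for that language.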
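
As a rough sketch of that triage idea, the snippet below picks an ASR checkpoint from the language-identification output. The checkpoint names, the `route_to_asr` helper, the 0.8 threshold and the example scores are all hypothetical placeholders chosen for illustration.

```python
# Hypothetical mapping from detected language to an ASR checkpoint name.
# These names are placeholders, not real checkpoints.
ASR_MODELS = {
    "Afrikaans": "example-org/asr-afrikaans",
    "English": "example-org/asr-english",
}
FALLBACK_MODEL = "example-org/asr-multilingual"

def route_to_asr(lang_predictions, min_score=0.8):
    """Choose an ASR checkpoint from language-identification scores."""
    best = max(lang_predictions, key=lambda p: p["score"])
    if best["score"] < min_score:
        # Not confident enough: use a general multilingual model instead.
        return FALLBACK_MODEL
    return ASR_MODELS.get(best["label"], FALLBACK_MODEL)

# Made-up scores in the same format the audio-classification pipeline returns.
lang_predictions = [
    {"score": 0.97, "label": "Afrikaans"},
    {"score": 0.02, "label": "Dutch"},
]
print(route_to_asr(lang_predictions))
```

The fallback covers both low-confidence predictions and languages without a dedicated model.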

In [5]:
fleurs = load_dataset("google/fleurs", "all", split="validation", streaming=True)
sample = next(iter(fleurs))

The repository for google/fleurs contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/google/fleurs.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


In [6]:
classifier = pipeline(
    "audio-classification", model="sanchit-gandhi/whisper-medium-fleurs-lang-id"
)

Device set to use cuda:0


In [7]:
classifier(sample["audio"])

[{'score': 1.0, 'label': 'Afrikaans'},
 {'score': 7.092952728271484e-06, 'label': 'Northern-Sotho'},
 {'score': 4.291534423828125e-06, 'label': 'Icelandic'},
 {'score': 3.2782554626464844e-06, 'label': 'Danish'},
 {'score': 3.2782554626464844e-06, 'label': 'Cantonese Chinese'},
 {'score': 3.2186508178710938e-06, 'label': 'Luxembourgish'},
 {'score': 3.159046173095703e-06, 'label': 'Tamil'},
 {'score': 3.039836883544922e-06, 'label': 'Uzbek'},
 {'score': 3.039836883544922e-06, 'label': 'Bengali'},
 {'score': 2.9206275939941406e-06, 'label': 'Maltese'},
 {'score': 1.6093254089355469e-06, 'label': 'Kazakh'},
 {'score': 1.3709068298339844e-06, 'label': 'Oromo'},
 {'score': 1.2516975402832031e-06, 'label': 'Polish'},
 {'score': 1.1920928955078125e-06, 'label': 'English'},
 {'score': 1.0728836059570312e-06, 'label': 'Dutch'},
 {'score': 1.0132789611816406e-06, 'label': 'Estonian'},
 {'score': 9.5367431640625e-07, 'label': 'Belarusian'},
 {'score': 8.940696716308594e-07, 'label': 'Marathi'},
 ...]


## Zero-shot classification

In [8]:
dataset = load_dataset("ashraq/esc50", split="train", streaming=True)
audio_sample = next(iter(dataset))["audio"]["array"]

Repo card metadata block was not found. Setting CardData to empty.

In [9]:
candidate_labels = ["Sound of a dog", "Sound of vacuum cleaner"]


In [10]:
classifier = pipeline(
    task="zero-shot-audio-classification", model="laion/clap-htsat-unfused"
)
classifier(audio_sample, candidate_labels=candidate_labels)

Device set to use cuda:0


[{'score': 0.9997242093086243, 'label': 'Sound of a dog'},
 {'score': 0.0002758323971647769, 'label': 'Sound of vacuum cleaner'}]
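
The two scores sum to 1 because the candidate labels compete with each other. A CLAP-style model embeds the audio and each label text into a shared space, compares them by similarity, and normalises the similarities with a softmax. The sketch below imitates that with toy vectors; the embedding values, the `scale` factor and the function names are made-up assumptions, whereas a real model produces the embeddings with learned audio and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_scores(audio_emb, label_embs, scale=10.0):
    """Rank candidate labels by softmax over scaled audio-text similarity."""
    labels = list(label_embs)
    logits = [scale * cosine(audio_emb, label_embs[name]) for name in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    ranked = sorted(zip(labels, exps), key=lambda pair: pair[1], reverse=True)
    return [{"score": e / total, "label": name} for name, e in ranked]

# Toy embeddings (made-up): the audio vector points mostly along the "dog" axis.
audio_emb = [0.9, 0.1, 0.0]
label_embs = {
    "Sound of a dog": [1.0, 0.0, 0.0],
    "Sound of vacuum cleaner": [0.0, 1.0, 0.0],
}

scores = zero_shot_scores(audio_emb, label_embs)
print(scores[0]["label"])
```

Because the labels are only compared with one another, the top score says which candidate fits best, not whether any candidate fits well.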