## Intro to Audio Applications

- 01 Audio Classification with a Pipeline
- 02 Automatic Speech Recognition with a Pipeline
- 03 Audio Generation with a Pipeline
- 04 Hands-on exercise

### 01 Audio Classification with a Pipeline

In [1]:
from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))


In [2]:
# Load a pretrained audio classification model for intent detection

from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)

Device set to use mps:0


In [3]:
example = minds[0]

In [7]:
example["audio"]["array"]

array([2.36120541e-05, 1.92325111e-04, 2.19285139e-04, ...,
       9.40908212e-04, 1.16613181e-03, 7.20883720e-04], shape=(124830,))
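
The array length and the sampling rate together give the clip duration. A minimal sketch, using the values shown above (124830 samples, resampled to 16 kHz by `cast_column`); the helper name `clip_duration_seconds` is hypothetical and not part of `datasets`:

```python
def clip_duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Return the length of a mono clip in seconds."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


# Values taken from the cell output above: shape=(124830,) at 16_000 Hz.
duration = clip_duration_seconds(124_830, 16_000)
print(f"{duration:.2f} s")  # prints "7.80 s"
```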

In [4]:
classifier(example["audio"]["array"])

[{'score': 0.9625312685966492, 'label': 'pay_bill'},
 {'score': 0.028672562912106514, 'label': 'freeze'},
 {'score': 0.0033497822005301714, 'label': 'card_issues'},
 {'score': 0.0020057964138686657, 'label': 'abroad'},
 {'score': 0.0008484313148073852, 'label': 'high_value_payment'},
 {'score': 0.0007367939106188715, 'label': 'direct_debit'},
 {'score': 0.0004056981997564435, 'label': 'latest_transactions'},
 {'score': 0.00033970671938732266, 'label': 'joint_account'},
 {'score': 0.0003312783665023744, 'label': 'address'},
 {'score': 0.00032886446570046246, 'label': 'balance'},
 {'score': 0.00014877464855089784, 'label': 'app_error'},
 {'score': 0.00014772477152291685, 'label': 'atm_limit'},
 {'score': 8.815657929517329e-05, 'label': 'cash_deposit'},
 {'score': 6.512470281450078e-05, 'label': 'business_loan'}]

In [6]:
id2label = minds.features["intent_class"].int2str
print(id2label)
id2label(example["intent_class"])

<bound method ClassLabel.int2str of ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None)>


'pay_bill'
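
Comparing the top-scoring prediction with the dataset label is a quick sanity check. The sketch below hard-codes the first three entries of the classifier output shown earlier instead of re-running the model; `top_label` is a hypothetical helper, not part of `transformers`:

```python
def top_label(predictions: list[dict]) -> str:
    """Return the label with the highest score from a pipeline-style result."""
    return max(predictions, key=lambda p: p["score"])["label"]


# First three entries copied from the classifier output above.
predictions = [
    {"score": 0.9625312685966492, "label": "pay_bill"},
    {"score": 0.028672562912106514, "label": "freeze"},
    {"score": 0.0033497822005301714, "label": "card_issues"},
]
true_label = "pay_bill"  # value returned by id2label(example["intent_class"])

predicted = top_label(predictions)
print(predicted, predicted == true_label)  # prints "pay_bill True"
```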

### 02 Automatic Speech Recognition with a Pipeline

In [9]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 22aad52 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps:0


In [11]:
example

{'path': '/Users/enricd/.cache/huggingface/datasets/downloads/extracted/d6eb2e407ef3d8dd04e944bb7109d54bc4e0510e3a720fb10f0ee698ece7d25e/en-AU~PAY_BILL/response_4.wav',
 'audio': {'path': '/Users/enricd/.cache/huggingface/datasets/downloads/extracted/d6eb2e407ef3d8dd04e944bb7109d54bc4e0510e3a720fb10f0ee698ece7d25e/en-AU~PAY_BILL/response_4.wav',
  'array': array([2.36120541e-05, 1.92325111e-04, 2.19285139e-04, ...,
         9.40908212e-04, 1.16613181e-03, 7.20883720e-04], shape=(124830,)),
  'sampling_rate': 16000},
 'transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'english_transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'intent_class': 13,
 'lang_id': 2}

In [None]:
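# Play the audio clip in a small Gradio demo

import gradio as gr

with gr.Blocks() as demo:
    with gr.Column():
        # gr.Audio accepts a (sampling_rate, samples) tuple
        audio = (example["audio"]["sampling_rate"], example["audio"]["array"])
        label = id2label(example["intent_class"])
        output = gr.Audio(audio, label=label)

# Launch once the layout is fully defined, outside the Blocks context
demo.launch(debug=True)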

In [10]:
example = minds[0]
asr(example["audio"]["array"])

{'text': 'I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST'}

In [13]:
example["english_transcription"]

'I would like to pay my electricity bill using my card can you please assist'
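
The transcript differs from the reference in one word ("CAD" instead of "card"). A common way to quantify this is the word error rate (WER): the word-level edit distance divided by the number of reference words. The following is a minimal, self-contained sketch that lowercases both strings before comparing; `word_error_rate` is a hypothetical helper written for illustration, and dedicated packages such as `jiwer` offer more complete implementations:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Both strings are copied from the cell outputs above.
reference = "I would like to pay my electricity bill using my card can you please assist"
hypothesis = "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST"

wer = word_error_rate(reference, hypothesis)
print(f"{wer:.3f}")  # prints "0.067"
```

One substitution over 15 reference words gives a WER of 1/15.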

In [None]:
# German ASR

minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

Generating train split: 611 examples [00:00, 9933.83 examples/s]


In [15]:
example = minds[0]
example["transcription"]

'ich möchte gerne Geld auf mein Konto einzahlen'

In [16]:
asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")
asr(example["audio"]["array"])

Device set to use mps:0


{'text': 'ich möchte gerne geld auf mein konto einzallen'}

### 03 Audio Generation with a Pipeline

In [None]:
# Generating speech

from transformers import pipeline

pipe = pipeline("text-to-speech", model="suno/bark-small")

Device set to use mps:0


In [None]:
text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
output = pipe(text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [3]:
# Note: this rebinds the name Audio, shadowing datasets.Audio imported earlier
from IPython.display import Audio

Audio(output["audio"], rate=output["sampling_rate"])


In [22]:
fr_text = "Contrairement à une idée répandue, le nombre de points sur les élytres d'une coccinelle ne correspond pas à son âge, ni en nombre d'années, ni en nombre de mois. "
output = pipe(fr_text)
Audio(output["audio"], rate=output["sampling_rate"])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


In [23]:
song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ "
output = pipe(song)
Audio(output["audio"], rate=output["sampling_rate"])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


In [4]:
# Generating music

music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")

Device set to use mps:0


In [5]:
text = "90s rock song with electric guitar and heavy drums"

forward_params = {"max_new_tokens": 512}

output = music_pipe(text, forward_params=forward_params)
Audio(output["audio"][0], rate=output["sampling_rate"])

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
`cache.key_cache[idx]` is deprecated and will be removed in v4.56.0. Use `cache.layers[idx].keys` instead.
`cache.value_cache[idx]` is deprecated and will be removed in v4.56.0. Use `cache.layers[idx].values` instead.
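
To keep generated audio beyond the notebook session, it can be written to a WAV file. The sketch below uses only the Python standard library and a synthetic sine wave as a stand-in for model output; `write_wav` is a hypothetical helper, and the 32 kHz rate and the output path are illustrative assumptions:

```python
import array
import math
import os
import sys
import tempfile
import wave


def write_wav(path: str, samples, sampling_rate: int) -> None:
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    pcm = array.array("h")
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # avoid integer overflow
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV stores samples in little-endian order
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())


# Stand-in signal: one second of a 440 Hz sine wave (not model output).
sampling_rate = 32_000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate)
        for n in range(sampling_rate)]

wav_path = os.path.join(tempfile.gettempdir(), "tone.wav")
write_wav(wav_path, tone, sampling_rate)
```

With the pipeline output above, a call along the lines of `write_wav("music.wav", output["audio"].squeeze(), output["sampling_rate"])` may work, assuming the audio is a NumPy array that is one-dimensional after squeezing.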
