1. Audio classification: easily categorize audio clips into different categories. You can identify whether a recording is of a barking dog or a meowing cat, or what music genre a song belongs to.
2. Automatic speech recognition: transform audio clips into text by transcribing them automatically. You can get a text representation of a recording of someone speaking, like “How are you doing today?”. Rather useful for note taking!
3. Speaker diarization: Ever wondered who’s speaking in a recording? With 🤗 Transformers, you can identify which speaker is talking at any given time in an audio clip. Imagine being able to differentiate between “Alice” and “Bob” in a recording of them having a conversation.
4. Text to speech: create a narrated version of a text that can be used to produce an audio book, help with accessibility, or give a voice to an NPC in a game. With 🤗 Transformers, you can easily do that!

## Audio Classification with a pipeline

1. Audio classification involves assigning one or more labels to an audio recording based on its content.
2. Labels can correspond to broad sound categories such as music, speech, or noise, or to more specific categories such as bird song or engine sounds.

#### Model: 
For this experiment, let's use an off-the-shelf pre-trained audio classification model, which takes only a few lines of code with 🤗 Transformers.

#### Dataset: 
MINDS-14: 
1. Contains recordings of people asking an e-banking system questions in several languages and dialects.
2. Each recording has an intent_class label, so we can classify the audio recordings by the intent of the call.

In [1]:
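from datasets import Audio, load_dataset

# Load the Australian English (en-AU) subset of MINDS-14
minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
# Resample the audio to 16 kHz, the sampling rate the models used below expect
minds = minds.cast_column("audio", Audio(sampling_rate=16000))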

We can use the audio-classification pipeline from 🤗 Transformers to classify audio recordings.
For this case, we need a model that has been fine-tuned for intent classification.
If we visit the MINDS-14 dataset page, we can see all the models that have been trained on this dataset.
"anton-l/xtreme_s_xlsr_300m_minds14" is one such model, trained on the MINDS-14 dataset for audio classification.

In [2]:
from transformers import pipeline

classifier = pipeline("audio-classification", model="anton-l/xtreme_s_xlsr_300m_minds14", device="mps")

Device set to use mps


In [3]:
# This pipeline expects the audio data as a NumPy array
example = minds[0]

# The NumPy array is under ["audio"]["array"]
classifier(example["audio"]["array"])

[{'score': 0.9625311493873596, 'label': 'pay_bill'},
 {'score': 0.028672676533460617, 'label': 'freeze'},
 {'score': 0.0033497849944978952, 'label': 'card_issues'},
 {'score': 0.0020057971123605967, 'label': 'abroad'},
 {'score': 0.0008484316058456898, 'label': 'high_value_payment'}]

In [4]:
# Actual label
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

'pay_bill'
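
The pipeline returns a list of dictionaries with `score` and `label` keys, as in the output above. As a minimal sketch, the highest-scoring entry can be compared against the ground-truth label. The `top_prediction` helper is a hypothetical name introduced for illustration, and the scores are copied (rounded) from the output shown earlier rather than recomputed.

```python
# Predictions copied from the classifier output above (scores rounded for readability).
predictions = [
    {"score": 0.9625, "label": "pay_bill"},
    {"score": 0.0287, "label": "freeze"},
    {"score": 0.0033, "label": "card_issues"},
    {"score": 0.0020, "label": "abroad"},
    {"score": 0.0008, "label": "high_value_payment"},
]


def top_prediction(preds):
    """Return the entry with the highest score (hypothetical helper)."""
    return max(preds, key=lambda p: p["score"])


best = top_prediction(predictions)
true_label = "pay_bill"  # value returned by id2label(example["intent_class"]) above
is_correct = best["label"] == true_label
print(best["label"], is_correct)
```

Taking the maximum explicitly, rather than relying on the list order, keeps the check valid even if the predictions are reordered.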

## Automatic Speech Recognition with a pipeline

1. Automatic Speech Recognition (ASR) is a task that involves transcribing a speech recording into text.
2. This task has numerous practical applications, from creating closed captions for videos to enabling voice commands for virtual assistants like Siri and Alexa.

We’ll use the automatic-speech-recognition pipeline to transcribe a recording of a person asking about paying a bill, taken from the same MINDS-14 dataset as before.

In [5]:
# all the required libraries are imported above
asr = pipeline("automatic-speech-recognition", device='mps')

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 22aad52 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use mps


In [6]:
example = minds[0]
asr(example["audio"]["array"])

{'text': 'I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST'}

In [13]:
example["english_transcription"]

'I would like to pay my electricity bill using my card can you please assist'
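
Comparing the two outputs, the model transcribed "card" as "CAD"; everything else matches once case is ignored. One common way to quantify this is the word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python illustration rather than part of the original notebook, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "I would like to pay my electricity bill using my card can you please assist"
hypothesis = "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

Here a single substitution out of 15 reference words gives a WER of roughly 0.067. Both strings are lowercased first because the model emits upper-case text only.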

## Audio Generation with a pipeline

Audio generation covers a range of tasks that produce audio output. The tasks we will look at here are speech generation (also known as “text-to-speech”) and music generation. In text-to-speech, a model transforms a piece of text into lifelike spoken audio, opening the door to applications such as virtual assistants, accessibility tools for the visually impaired, and personalized audiobooks. Music generation, on the other hand, can enable creative expression and is used mostly in the entertainment and game development industries.

In [21]:
pipe = pipeline("text-to-speech", model="suno/bark-small", device="cpu")

  self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)
Device set to use cpu


In [22]:
text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
output = pipe(text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
Error during conversion: ChunkedEncodingError(ProtocolError('Response ended prematurely'))


In [23]:
from IPython.display import Audio

Audio(output["audio"], rate=output["sampling_rate"])
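
To keep the generated speech instead of only playing it inline, the samples can be written to a WAV file. The sketch below uses only the standard library and assumes the audio is a one-dimensional sequence of floats in [-1, 1] (flatten the array first if yours has an extra leading dimension). A synthetic sine wave stands in for `output["audio"]`, the 24000 Hz rate is a placeholder for `output["sampling_rate"]`, and `write_wav` is a hypothetical helper.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sampling_rate):
    """Write float samples in [-1, 1] as 16-bit mono PCM (hypothetical helper)."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for the pipeline output: one second of a 440 Hz tone.
sampling_rate = 24000  # placeholder; use output["sampling_rate"] in practice
samples = [0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate)
           for n in range(sampling_rate)]

wav_path = os.path.join(tempfile.gettempdir(), "speech.wav")
write_wav(wav_path, samples, sampling_rate)
```

With a real pipeline result, `samples` would be replaced by the values in `output["audio"]`.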

In [24]:
song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ "
output = pipe(song)
Audio(output["audio"], rate=output["sampling_rate"])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


We can also generate music from a text prompt, here with the facebook/musicgen-small model.

In [None]:
music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")
text = "90s rock song with electric guitar and heavy drums"
forward_params = {"max_new_tokens": 512}

output = music_pipe(text, forward_params=forward_params)
Audio(output["audio"][0], rate=output["sampling_rate"])
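
`max_new_tokens` controls how long the generated clip is. As a rough sketch, assuming MusicGen's audio codec produces about 50 tokens per second of audio (an assumption taken from the model's documentation, not stated in this notebook), the clip length can be estimated as follows. `estimate_duration_seconds` is a hypothetical helper.

```python
def estimate_duration_seconds(max_new_tokens, frames_per_second=50):
    """Estimate clip length, assuming one generated token step per codec frame."""
    return max_new_tokens / frames_per_second


duration = estimate_duration_seconds(512)
print(f"~{duration:.1f} seconds")
```

Under that assumption, the 512 tokens requested above correspond to roughly ten seconds of music; raising `max_new_tokens` lengthens the clip at the cost of longer generation time.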