<a href="https://colab.research.google.com/github/Vaibhavs10/dcase-2023-workshop/blob/main/01_huggingface_for_audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face for Audio

A whirlwind tour of what the 🤗 ecosystem offers for audio.

## Architectures
The HF Hub hosts tens of thousands of pre-trained and fine-tuned checkpoints covering all major audio tasks.

These architectures are supported by two of our main libraries:

1. [Transformers](https://github.com/huggingface/transformers)
2. [Diffusers](https://github.com/huggingface/diffusers)

#### Speech Recognition ([Collection](https://huggingface.co/collections/hf-audio/automatic-speech-recognition-64fb38fc365e3069d713569e))
1. Whisper (70x faster inference, 5x faster fine-tuning)
2. Wav2Vec2
3. MMS
4. SeamlessM4T


In [2]:
!pip install -q transformers diffusers datasets

In [3]:
from transformers import pipeline
from datasets import load_dataset

In [4]:
pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-tiny.en")

In [7]:
ls = load_dataset("hf-internal-testing/librispeech_asr_dummy",
                  "clean",
                  split="validation")

In [8]:
ls[0]

{'file': '/root/.cache/huggingface/datasets/downloads/extracted/dfbece23564f422bc5794f3090902cd16d52d86767b746125ebc2ff3ea5f89ef/dev_clean/1272/128104/1272-128104-0000.flac',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/dfbece23564f422bc5794f3090902cd16d52d86767b746125ebc2ff3ea5f89ef/dev_clean/1272/128104/1272-128104-0000.flac',
  'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00042725, 0.00057983,
         0.0010376 ]),
  'sampling_rate': 16000},
 'text': 'MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL',
 'speaker_id': 1272,
 'chapter_id': 128104,
 'id': '1272-128104-0000'}
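
Each decoded sample carries its waveform and its sampling rate, so the clip length follows directly from the two. Below is a minimal sketch, assuming a dict shaped like `ls[0]["audio"]` above. The helper names and the stand-in clip are hypothetical, and the 16 kHz default reflects the rate Whisper checkpoints expect.

```python
def clip_duration_seconds(audio):
    """Return the clip length in seconds from a decoded audio dict."""
    return len(audio["array"]) / audio["sampling_rate"]


def needs_resampling(audio, target_rate=16_000):
    """Report whether the clip's sampling rate differs from the target rate."""
    return audio["sampling_rate"] != target_rate


# Hypothetical stand-in for ls[0]["audio"]: 2 seconds of silence at 16 kHz.
fake_audio = {"array": [0.0] * 32_000, "sampling_rate": 16_000}

print(clip_duration_seconds(fake_audio))  # 2.0
print(needs_resampling(fake_audio))       # False
```

With the real dataset, the same helpers could be called as `clip_duration_seconds(ls[0]["audio"])`.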

In [9]:
pipe(ls[0]["audio"])

{'text': ' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'}
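
The transcript differs from the dataset's reference text in casing and punctuation, and in "Mr." versus "MISTER". One common way to quantify the gap is the word error rate (WER). The sketch below is a hand-rolled illustration, not the implementation used by any 🤗 library; the normalization rule (lowercase, strip punctuation) is an assumption, and the function names are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation so casing and commas do not count as errors."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance computed over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = ("MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES "
             "AND WE ARE GLAD TO WELCOME HIS GOSPEL")
hypothesis = (" Mr. Quilter is the apostle of the middle classes, "
              "and we are glad to welcome his gospel.")

wer = word_error_rate(reference, hypothesis)
print(f"{wer:.3f}")  # 0.059 (one substitution, "mr" for "mister", over 17 words)
```

A normalizer that expands abbreviations such as "Mr." would bring this particular example down to zero.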

#### Text to Speech/Audio ([Collection](https://huggingface.co/collections/hf-audio/text-to-speech-64fb378ac45dd732ac9a695e))
1. Bark
2. SpeechT5
3. AudioLDM 1/2 (Diffusers)

#### Text to Music ([Collection](https://huggingface.co/collections/hf-audio/text-to-music-6502c850c130d99814bd05ea))
1. MusicGen
2. MusicLDM


In [10]:
pipe = pipeline("text-to-speech",
                "suno/bark-small")

In [11]:
pipe("How is it going y'all?")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


{'audio': array([[0.02595375, 0.02556818, 0.02605408, ..., 0.00305051, 0.00337358,
         0.00442385]], dtype=float32),
 'sampling_rate': 24000}
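
The pipeline returns raw float samples plus a sampling rate, so the audio has to be written to a file or handed to a player before you can hear it. Below is a minimal sketch that writes mono float samples to a 16-bit PCM WAV file using only the Python standard library. The `write_wav` helper, the output path, and the sine tone standing in for the Bark output are all hypothetical.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)              # mono
        wav_file.setsampwidth(2)              # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Hypothetical stand-in for the pipeline output: 0.1 s of a 440 Hz tone at 24 kHz.
sampling_rate = 24_000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate) for n in range(2_400)]

out_path = os.path.join(tempfile.gettempdir(), "tts_sketch.wav")
write_wav(out_path, tone, sampling_rate)

# Read the header back to confirm what was written.
with wave.open(out_path, "rb") as wav_file:
    rate_read = wav_file.getframerate()
    frames_read = wav_file.getnframes()
print(rate_read, frames_read)  # 24000 2400
```

For the output above, whose `audio` array has shape `(1, N)`, something like `write_wav("bark.wav", out["audio"].squeeze().tolist(), out["sampling_rate"])` should work, assuming `out` holds the dict returned by `pipe(...)`.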

#### Audio Classification ([Collection](https://huggingface.co/collections/hf-audio/audio-classification-6502ca72e373323dabc62ee4))
1. Whisper
2. Wav2Vec2
3. HuBERT

#### Audio Codec Embeddings ([Collection](https://huggingface.co/collections/hf-audio/audio-codecs-embeddings-6502cba32cec4c6d94f1087d))
1. CLAP
2. EnCodec

## Datasets

🤗 Datasets is a one-stop shop for all your dataset needs. It scales without any limits on dataset size. We also host and maintain major academic and industrial datasets.

P.S. You can also bring your own datasets and share them with the community.

Some speech datasets include:
1. Common Voice 13
2. FLEURS
3. VoxPopuli
4. LibriSpeech
5. And many more


Best of all, everything comes with a unified, easy-to-use interface!

In [None]:
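# Assumed example (the original left the arguments blank): Common Voice 13 is gated, so accept its terms on the Hub and log in first
cv = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)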

## Utility libraries

1. [PEFT](https://github.com/huggingface/peft) (Parameter Efficient Fine Tuning)
2. [Accelerate](https://github.com/huggingface/accelerate)
3. [Evaluate](https://github.com/huggingface/evaluate)

HF offers a plethora of ecosystem libraries that plug and play with your existing training/inference code to scale it up.