Let’s take a look at some examples of audio tasks:

- Audio classification: easily categorize audio clips into different categories. You can identify whether a recording is of a barking dog or a meowing cat, or what music genre a song belongs to.

- Automatic speech recognition: transform audio clips into text by transcribing them automatically. You can get a text representation of a recording of someone speaking, like “How are you doing today?”. Rather useful for note-taking!

- Speaker diarization: Ever wondered who’s speaking in a recording? With 🤗 Transformers, you can identify which speaker is talking at any given time in an audio clip. Imagine being able to differentiate between “Alice” and “Bob” in a recording of them having a conversation.

- Text to speech: create a narrated version of a text that can be used to produce an audio book, help with accessibility, or give a voice to an NPC in a game. With 🤗 Transformers, you can easily do that!
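
Three of the tasks above correspond to task identifiers accepted by the 🤗 Transformers `pipeline` factory. The sketch below is illustrative and not part of the original notebook: `PIPELINE_TASKS` and `run_audio_task` are hypothetical names, and when no `model` is passed the pipeline is assumed to fall back to a default checkpoint for the task (downloaded on first use). Speaker diarization is omitted, since the sketch sticks to task names that `pipeline` is known to accept.

```python
from typing import Any, Optional

# Hypothetical lookup table: task described above -> `pipeline` task identifier.
PIPELINE_TASKS = {
    "audio classification": "audio-classification",
    "automatic speech recognition": "automatic-speech-recognition",
    "text to speech": "text-to-speech",
}


def run_audio_task(task: str, inputs: Any, model: Optional[str] = None) -> Any:
    """Run one of the audio tasks listed above through a Transformers pipeline."""
    # Imported lazily so the table can be inspected without transformers installed.
    from transformers import pipeline

    pipe = pipeline(PIPELINE_TASKS[task.lower()], model=model)
    return pipe(inputs)
```

For instance, `run_audio_task("automatic speech recognition", "recording.wav")` would return a dictionary with a `"text"` entry; `recording.wav` is a placeholder path.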

In [1]:
from IPython.display import display, HTML
display(HTML(
"""
<a target="_blank" href="https://colab.research.google.com/github/facebookresearch/segment-anything/blob/main/notebooks/automatic_mask_generator_example.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
"""
))

In [2]:
using_colab = True #@param

In [3]:
if using_colab:
    # import torch
    # import torchvision
    # print("PyTorch version:", torch.__version__)
    # print("Torchvision version:", torchvision.__version__)
    # print("CUDA is available:", torch.cuda.is_available())
    import sys
    !{sys.executable} -m pip install -q 'git+https://github.com/huggingface/diffusers'
    !{sys.executable} -m pip install -Uq 'git+https://github.com/huggingface/transformers'
    !{sys.executable} -m pip install -Uqq  datasets accelerate bitsandbytes
    !{sys.executable} -m pip install -Uqq  'git+https://github.com/suno-ai/bark.git'
    !{sys.executable} -m pip install -Uq 'git+https://github.com/facebookresearch/audiocraft#egg=audiocraft'

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for diffusers (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done

In [6]:
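from bark import SAMPLE_RATE, generate_audio, preload_models
# Audio lets us listen to the generated audio array in the notebook.
from IPython.display import Audio

# Import the submodule explicitly; a bare `import scipy` does not guarantee
# that `scipy.io.wavfile` is loaded.
import scipy.io.wavfile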

# Text Prompt
text_prompt1 = """
A Learjet 45 aircraft with eight people on board
veered off on thursday"""

In [7]:
# Generate an audio NumPy array for the given text
speech_array1 = generate_audio(text_prompt1,
                history_prompt="en_speaker_6")

# Save the result as a WAV file
scipy.io.wavfile.write("bark_out1.wav", rate=SAMPLE_RATE, data=speech_array1)

# Play the audio in the notebook; this must be the last expression in the
# cell for the player to be displayed
Audio(speech_array1, rate=SAMPLE_RATE)

Downloading text_2.pt:   0%|          | 0.00/5.35G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading coarse_2.pt:   0%|          | 0.00/3.93G [00:00<?, ?B/s]

Downloading fine_2.pt:   0%|          | 0.00/3.74G [00:00<?, ?B/s]

Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /root/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th
100%|██████████| 88.9M/88.9M [00:00<00:00, 144MB/s]
100%|██████████| 445/445 [00:07<00:00, 61.57it/s]
100%|██████████| 23/23 [00:21<00:00,  1.05it/s]
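
`generate_audio` returns floating-point samples, and some players are more comfortable with WAV files that hold 16-bit PCM. As a hedged sketch, the conversion can be done with the standard library alone. The helper `write_pcm16_wav` is a hypothetical name, the samples are assumed to lie in [-1, 1], and the default rate of 24 000 Hz is an assumption standing in for Bark's `SAMPLE_RATE`.

```python
import math
import os
import struct
import tempfile
import wave


def write_pcm16_wav(path, samples, sample_rate=24_000):
    """Write mono float samples (assumed to lie in [-1, 1]) as 16-bit PCM WAV."""
    # Clip to [-1, 1] so loud samples saturate instead of wrapping around.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    # Scale to the signed 16-bit range and pack as little-endian shorts.
    pcm = struct.pack(f"<{len(clipped)}h", *(int(round(s * 32767)) for s in clipped))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)


# Stand-in for a generated array: one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24_000) for n in range(24_000)]
demo_path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")
write_pcm16_wav(demo_path, tone)
```

With the notebook's variables, the call would be `write_pcm16_wav("bark_out1.wav", speech_array1, SAMPLE_RATE)`.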


In [8]:
text_prompt3 = """
She gazed at the night sky, a canvas of a billion stars ... lost in cosmic wonder.
"""  # The "..." in the prompt asks Bark for a hesitation, here a contemplative pause.
speech_array3 = generate_audio(text_prompt3, history_prompt="en_speaker_4")
# Play the text in the notebook
Audio(speech_array3, rate=SAMPLE_RATE)


100%|██████████| 661/661 [00:12<00:00, 52.93it/s]
100%|██████████| 34/34 [00:33<00:00,  1.01it/s]


In [9]:
# Hindi prompt: "Patience and time are both very important. When you are
# moving toward your goal, you need both. Be patient and give it time;
# success will surely come."
text_prompt = 'धैर्य और समय, दोनों ही बहुत महत्वपूर्ण हैं। जब तुम अपने लक्ष्य की ओर बढ़ रहे हो, तो इन दोनों की आवश्यकता होती है। धैर्य रखो और समय दो, सफलता जरूर मिलेगी।'

audio_arrays = generate_audio(text_prompt)
Audio(audio_arrays, rate=SAMPLE_RATE)

100%|██████████| 709/709 [00:09<00:00, 72.57it/s]
100%|██████████| 36/36 [00:34<00:00,  1.05it/s]


In [11]:
# Chinese prompt: "A journey of a thousand miles begins with a single step.
# So do not be afraid to take the first step."
text_prompt = '千里之行，始于足下。所以，不要害怕迈出第一步'

audio_arrays = generate_audio(text_prompt)
Audio(audio_arrays, rate=SAMPLE_RATE)

100%|██████████| 643/643 [00:09<00:00, 68.00it/s]
100%|██████████| 33/33 [00:31<00:00,  1.06it/s]


In [12]:
# Mixed-language prompt: a French lead-in ("My father always told me:")
# followed by an English quote.
text_prompt = '''
Mon père me disait toujours:
"The best way to predict the future is to create it."
'''

audio_arrays = generate_audio(text_prompt)
Audio(audio_arrays, rate=SAMPLE_RATE)

100%|██████████| 524/524 [00:05<00:00, 91.49it/s]
100%|██████████| 27/27 [00:25<00:00,  1.05it/s]


In [13]:
text_prompt = '''
WOMAN: I would like an oatmilk latte please.
MAN: Wow, that's expensive!
'''

audio_arrays = generate_audio(text_prompt)
Audio(audio_arrays, rate=SAMPLE_RATE)

100%|██████████| 695/695 [00:07<00:00, 87.45it/s]
100%|██████████| 35/35 [00:34<00:00,  1.02it/s]
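
The prompt above marks each turn with a speaker label followed by a colon. A small helper can assemble such prompts from structured data. `build_dialogue_prompt` is a hypothetical name, and the `LABEL: text` layout simply mirrors the cell above rather than any documented Bark API.

```python
from typing import Iterable, Tuple


def build_dialogue_prompt(turns: Iterable[Tuple[str, str]]) -> str:
    """Join (speaker, line) pairs into one newline-separated prompt."""
    lines = [f"{speaker.strip().upper()}: {text.strip()}" for speaker, text in turns]
    # Leading and trailing newlines match the triple-quoted prompt in the notebook.
    return "\n" + "\n".join(lines) + "\n"


prompt = build_dialogue_prompt([
    ("woman", "I would like an oatmilk latte please."),
    ("man", "Wow, that's expensive!"),
])
print(prompt)
```

The resulting string can be passed to `generate_audio` exactly like the hand-written prompt.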


In [14]:
text_prompt5 = """
   what happened my friend?
"""
speech_array5 = generate_audio(text_prompt5, history_prompt="v2/en_speaker_6")
# Play the audio in the notebook
Audio(speech_array5, rate=SAMPLE_RATE)

100%|██████████| 113/113 [00:02<00:00, 54.42it/s]
100%|██████████| 6/6 [00:05<00:00,  1.08it/s]
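
The `history_prompt` values used in this notebook (`"en_speaker_6"`, `"en_speaker_4"`, and `"v2/en_speaker_6"`) follow a `<language>_speaker_<index>` pattern with an optional `v2/` prefix. As a sketch, a hypothetical helper `speaker_preset` can build these strings. The index range 0 to 9 is an assumption about the bundled presets, and the helper does not check that a given language code actually ships with Bark.

```python
def speaker_preset(language: str, index: int, v2: bool = True) -> str:
    """Build a preset name such as "v2/en_speaker_6" for `history_prompt`."""
    # Assumption: each language ships ten voices, indexed 0 through 9.
    if not 0 <= index <= 9:
        raise ValueError(f"speaker index must be in 0..9, got {index}")
    name = f"{language.lower()}_speaker_{index}"
    return f"v2/{name}" if v2 else name


print(speaker_preset("en", 6))            # v2/en_speaker_6
print(speaker_preset("en", 4, v2=False))  # en_speaker_4
```

A call such as `generate_audio(text_prompt5, history_prompt=speaker_preset("en", 6))` would then be equivalent to the cell above.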


In [15]:
# The ♪ symbols mark the text as song lyrics, so Bark sings rather than speaks.
text_prompt = '♪ I want to break free! I want to break FREE!!  ♪'

audio_arrays = generate_audio(text_prompt)
Audio(audio_arrays, rate=SAMPLE_RATE)

100%|██████████| 147/147 [00:02<00:00, 58.09it/s]
100%|██████████| 8/8 [00:05<00:00,  1.55it/s]
