<a href="https://colab.research.google.com/github/diyavol76/vox-core/blob/master/Notebooks/Bark_HuggingFace_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bark in 🤗 Transformers

The Bark model is available in 🤗 Transformers from v4.31.0 onwards!

In this notebook, we'll demonstrate how to use Bark with the 🤗 Transformers library, covering unconditional generation, speaker-prompted generation, and advanced text prompts for controllable generation.

## Bark Architecture


Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).

Bark is made of 4 main models:

- `BarkSemanticModel` (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
- `BarkCoarseModel` (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of `BarkSemanticModel` as input and predicts the first two audio codebooks required by EnCodec.
- `BarkFineModel` (the 'fine acoustics' model): a non-causal autoencoder transformer that iteratively predicts the remaining codebooks based on the sum of the previous codebook embeddings.
- `EncodecModel`: once all the codebook channels have been predicted, Bark uses it to decode the output audio array.

Each of the first three modules can take conditional speaker embeddings, which condition the output sound on a specific predefined voice.
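
To make the division of labour between the sub-models more concrete, here is a rough back-of-the-envelope sketch of how many tokens each stage handles for a clip of a given length. The rates and codebook counts are assumptions taken from the constants in the original suno-ai/bark code (about 49.9 semantic tokens per second, 75 EnCodec frames per second, 2 coarse codebooks out of 8 in total, 24 kHz audio), and `token_budget` is a hypothetical helper, so check the numbers against the generation config of the checkpoint you load.

```python
# Assumed constants (from the original suno-ai/bark code, not read from the
# Transformers checkpoint): verify them against your model's generation config.
SEMANTIC_RATE_HZ = 49.9   # semantic tokens per second of audio
CODEC_FRAME_RATE_HZ = 75  # EnCodec frames per second
N_COARSE_CODEBOOKS = 2    # predicted by the coarse acoustics model
N_TOTAL_CODEBOOKS = 8     # coarse + fine codebooks passed to EnCodec
SAMPLE_RATE_HZ = 24_000   # sample rate of the decoded waveform


def token_budget(duration_s: float) -> dict:
    """Estimate how many tokens each Bark stage handles for one clip."""
    frames = round(CODEC_FRAME_RATE_HZ * duration_s)
    return {
        "semantic_tokens": round(SEMANTIC_RATE_HZ * duration_s),
        "coarse_tokens": frames * N_COARSE_CODEBOOKS,
        "fine_tokens": frames * (N_TOTAL_CODEBOOKS - N_COARSE_CODEBOOKS),
        "audio_samples": round(SAMPLE_RATE_HZ * duration_s),
    }


budget = token_budget(10.0)
for stage, count in budget.items():
    print(f"{stage}: {count}")
```

Under these assumptions, the fine model fills in three times as many codebook entries as the coarse model, which is one reason it predicts them in a non-causal fashion rather than token by token.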


## Prepare the Environment

Let’s make sure we’re connected to a GPU to run this notebook. To get a GPU, click Runtime -> Change runtime type, then change Hardware accelerator from None to GPU. We can verify that we’ve been assigned a GPU and view its specifications through the nvidia-smi command:

In [1]:
!nvidia-smi

Sun Nov 12 21:19:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We see here that we've got one Tesla T4 16GB GPU, although this may vary for you depending on GPU availability and Colab GPU assignment.
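
If you prefer to run the same check from Python, for instance to fall back gracefully when no GPU is attached, a minimal sketch could look like the following. `gpu_summary` is a hypothetical helper name; it only relies on the standard library and on the `nvidia-smi` tool being on the `PATH`.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the GPU names and memory reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return None  # the tool exists but could not query a GPU
    return result.stdout.strip() or None


summary = gpu_summary()
print(summary if summary else "No GPU detected; generation will run on the CPU.")
```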

Next, we install the 🤗 Transformers package from the main branch:

In [2]:
!pip install --upgrade --quiet pip
!pip install --quiet git+https://github.com/huggingface/transformers.git

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m21.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m98.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m23.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
[0m

## Load the Model

The pre-trained Bark small and large checkpoints can be loaded from the [pre-trained weights](https://huggingface.co/suno/bark) on the Hugging Face Hub. You can change the repo id to match the checkpoint size that you wish to use.

We'll default to the small checkpoint, which is faster at inference. For better quality at the cost of slower inference, you can use the large checkpoint by passing `"suno/bark"` instead of `"suno/bark-small"`.



In [3]:
from transformers import BarkModel

model = BarkModel.from_pretrained("suno/bark-small")

In [4]:
model

BarkModel(
  (semantic): BarkSemanticModel(
    (input_embeds_layer): Embedding(129600, 768)
    (position_embeds_layer): Embedding(1024, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x BarkBlock(
        (layernorm_1): BarkLayerNorm()
        (layernorm_2): BarkLayerNorm()
        (attn): BarkSelfAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (att_proj): Linear(in_features=768, out_features=2304, bias=False)
          (out_proj): Linear(in_features=768, out_features=768, bias=False)
        )
        (mlp): BarkMLP(
          (in_proj): Linear(in_features=768, out_features=3072, bias=False)
          (out_proj): Linear(in_features=3072, out_features=768, bias=False)
          (dropout): Dropout(p=0.0, inplace=False)
          (gelu): GELU(approximate='none')
        )
      )
    )
    (layernorm_final): BarkLayerNorm()
    (lm_head): Linear(in_features=76
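
The full module tree is long, so a compact per-sub-model summary can be easier to read. The sketch below defines a hypothetical helper, `parameters_per_submodel`, that only relies on the standard `named_children()` and `parameters()` methods of a PyTorch module. Parameters shared between sub-models, if any, would be counted once per sub-model.

```python
def parameters_per_submodel(model):
    """Map each direct child module name to its parameter count in millions."""
    return {
        name: sum(p.numel() for p in child.parameters()) / 1e6
        for name, child in model.named_children()
    }


def print_parameter_summary(model):
    """Print one line per sub-model, e.g. the semantic or coarse model."""
    for name, millions in parameters_per_submodel(model).items():
        print(f"{name}: {millions:.1f}M parameters")
```

In this notebook you would call `print_parameter_summary(model)` right after loading the checkpoint.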

Move the model to an accelerator device if one is available.

In [5]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

## Generating speech

Bark is a highly controllable text-to-speech model, meaning you can use it with various settings, as we are going to see.

First, load `BarkProcessor` to pre-process the inputs.

The processor's role here is twofold:
1. It tokenizes the input text, i.e. cuts it into small pieces that the model can understand.
2. It stores speaker embeddings, i.e. voice presets that can condition the generation.

In [6]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("suno/bark")

### Unconditional generation

First, let's generate speech in the simplest way possible, with no frills.

In [7]:
# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt)

# generate speech
speech_output = model.generate(**inputs.to(device))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


The audio output is a two-dimensional Torch tensor of shape `(batch_size, sequence_length)`. To listen to the generated audio, you can either play it in a notebook:

In [8]:
from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

Or save it as a .wav file using a third-party library, e.g. scipy (note that we index the first item of the batch to obtain a one-dimensional audio array):

In [18]:
# import the submodule explicitly; a bare `import scipy` may not load scipy.io.wavfile
from scipy.io import wavfile

wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_output[0].cpu().numpy())
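
If you would rather avoid the extra dependency, the standard-library `wave` module can also write the file once the float samples are converted to 16-bit PCM. This is a minimal sketch, not part of the original notebook: `write_wav_int16` is a hypothetical helper, the sine tone stands in for generated speech, and the 24 kHz rate is an assumption about Bark's output (in the notebook, use `sampling_rate`).

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_int16(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to the signed 16-bit integer range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for generated speech: one second of a 440 Hz tone.
sample_rate = 24_000  # assumed Bark output rate
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

out_path = os.path.join(tempfile.gettempdir(), "bark_out_pcm16.wav")
write_wav_int16(out_path, tone, sample_rate)

# Read the file back to confirm its duration.
with wave.open(out_path, "rb") as wav_file:
    duration_s = wav_file.getnframes() / wav_file.getframerate()
print(f"wrote {out_path}: {duration_s:.2f} s")
```

With the notebook's variables, the call would be `write_wav_int16("bark_out.wav", speech_output[0].cpu().numpy().tolist(), sampling_rate)`.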

### Conditional generation

The Suno AI team provides a [library of preset voices](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) that can be used to condition the generated speech. In other words, the model generates speech that sounds like the chosen preset voice.

The processor can be used to automatically load these speaker prompts when tokenising the input text.

Let's try one voice preset:

In [10]:
voice_preset = "v2/en_speaker_6"

# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Great, let's try another voice preset:

In [11]:
voice_preset = "v2/en_speaker_3"

# prepare the inputs
text_prompt = "Let's try generating speech, with Bark, a text-to-speech model"
inputs = processor(text_prompt, voice_preset=voice_preset)

# generate speech
speech_output = model.generate(**inputs.to(device))

# let's hear it
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
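
Preset names such as `v2/en_speaker_6` appear to follow a regular pattern: an optional `v2/` prefix, a language code, and a speaker index. The sketch below builds such names with a hypothetical helper, `build_voice_preset`. The list of language codes and the 0 to 9 speaker range are assumptions based on the Suno preset library, so treat them as a convenience rather than a guarantee that a given preset exists.

```python
# Assumed language codes, based on the Suno preset library.
SUPPORTED_LANGUAGES = {
    "de", "en", "es", "fr", "hi", "it", "ja", "ko", "pl", "pt", "ru", "tr", "zh",
}


def build_voice_preset(language: str, speaker: int, v2: bool = True) -> str:
    """Build a preset name such as 'v2/en_speaker_6' (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unknown language code: {language!r}")
    if not 0 <= speaker <= 9:
        raise ValueError("speaker index is assumed to be between 0 and 9")
    name = f"{language}_speaker_{speaker}"
    return f"v2/{name}" if v2 else name


print(build_voice_preset("en", 6))
print(build_voice_preset("fr", 3, v2=False))
```

The returned string can be passed to the processor as `voice_preset=...`, exactly like the hard-coded names above.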


### More advanced generation techniques

The previous examples all used the default sampling mode (`do_sample=True`), but you can also use [more advanced generation techniques](https://huggingface.co/docs/transformers/generation_strategies) such as beam search, which may improve quality.

You can also specify generation parameters for each sub-model by prepending `semantic_`, `coarse_` or `fine_` to the parameter name.

Let's try this with the previous `text_prompt` and voice preset.



In [12]:
speech_output = model.generate(**inputs.to(device), num_beams=4, temperature=0.5, semantic_temperature=0.8)

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
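
To illustrate the prefix convention, here is a small standalone sketch of how such arguments could be routed to the sub-models. It is an illustration of the idea only, not the library's implementation: `split_generation_kwargs` is a hypothetical function, and the rule that a prefixed value takes priority over an unprefixed one is an assumption.

```python
SUB_MODELS = ("semantic", "coarse", "fine")


def split_generation_kwargs(**kwargs):
    """Route generation kwargs to sub-models based on their name prefix."""
    routed = {name: {} for name in SUB_MODELS}

    # First pass: a prefixed argument goes to exactly one sub-model.
    for key, value in kwargs.items():
        for name in SUB_MODELS:
            prefix = f"{name}_"
            if key.startswith(prefix):
                routed[name][key[len(prefix):]] = value

    # Second pass: an unprefixed argument applies to every sub-model,
    # unless a prefixed value was already set for that sub-model.
    for key, value in kwargs.items():
        if any(key.startswith(f"{name}_") for name in SUB_MODELS):
            continue
        for name in SUB_MODELS:
            routed[name].setdefault(key, value)

    return routed


routed = split_generation_kwargs(num_beams=4, temperature=0.5, semantic_temperature=0.8)
for name, params in routed.items():
    print(name, params)
```

With the arguments from the cell above, the semantic model would sample at temperature 0.8 while the coarse and fine models use 0.5, and all three share `num_beams=4`.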


### Multilingual speech

Bark can also generate speech in languages other than English, such as French and Chinese.

In [13]:
# Multilingual speech - simplified Chinese
inputs = processor("惊人的！我会说中文")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


In [14]:
# Multilingual speech - French - let's use a voice_preset as well
inputs = processor("Je peux générer du son facilement avec ce modèle.", voice_preset="fr_speaker_3")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


### Non-verbal communication

The model can also produce **non-verbal sounds** such as laughing, sighing and crying.


In [15]:
# Adding non-speech cues to the input text
inputs = processor("[clears throat] Hello uh ..., my dog is cute [laughter]")


# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
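
Since cues are plain text in square brackets, a typo in a tag is easy to miss. The sketch below checks a prompt against a short list of cues. The list is an assumption taken from the Bark README and `unknown_cues` is a hypothetical helper; other tags may work too, so treat a reported cue as something to double-check rather than an error.

```python
import re

# Assumed list of cues, taken from the Bark README; other tags may also work.
KNOWN_CUES = {"laughter", "laughs", "sighs", "music", "gasps", "clears throat"}


def unknown_cues(prompt: str) -> list:
    """Return the bracketed cues in the prompt that are not in KNOWN_CUES."""
    found = re.findall(r"\[([^\[\]]+)\]", prompt)
    return [cue for cue in found if cue.strip().lower() not in KNOWN_CUES]


prompt = "[clears throat] Hello uh ..., my dog is cute [laughter]"
print(unknown_cues(prompt))
print(unknown_cues("[coughs] Excuse me [sighs]"))
```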


### More applications

Bark can also generate music. You can help it out by adding music notes around your lyrics.

In [16]:
inputs = processor("♪ In the jungle, the mighty jungle, the lion barks tonight ♪")

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


In [17]:
# more advanced prompts!

text_prompt = """
    WOMAN: I would like an oatmilk latte please.
    MAN: Wow, that's expensive!
"""

inputs = processor(text_prompt)

# generate speech
speech_output = model.generate(**inputs.to(device))

Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
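
For longer exchanges, it can be convenient to build the dialogue prompt from structured data instead of writing the string by hand. A minimal sketch, with `build_dialogue_prompt` as a hypothetical helper that reproduces the `SPEAKER: line` layout used above:

```python
def build_dialogue_prompt(turns):
    """Format (speaker, line) pairs as one 'SPEAKER: line' row per turn."""
    rows = [f"{speaker.upper()}: {line.strip()}" for speaker, line in turns]
    return "\n".join(rows)


text_prompt = build_dialogue_prompt([
    ("woman", "I would like an oatmilk latte please."),
    ("man", "Wow, that's expensive!"),
])
print(text_prompt)
```

The resulting string can be passed to `processor(text_prompt)` as in the cell above.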


## Concluding remarks

Bark is a versatile model. Play with it to discover more about its capabilities and limits!