# Unit 2. A gentle introduction to audio applications

## Audio generation with a pipeline

Please see [https://huggingface.co/learn/audio-course/chapter2/tts_pipeline#audio-generation-with-a-pipeline](https://huggingface.co/learn/audio-course/chapter2/tts_pipeline#audio-generation-with-a-pipeline)

Also refer to:

* [Text to speech](https://huggingface.co/docs/transformers/en/tasks/text-to-speech#text-to-speech) in the Multimodal task guide

* [`suno/bark-small`](https://huggingface.co/suno/bark-small) text-to-audio model 

In [1]:
from transformers import pipeline

pipe = pipeline("text-to-speech", model="suno/bark-small")



### Audio generation: _English-to-English_

In [2]:
en_text = """
Ladybugs have had important roles in culture and religion,
being associated with luck, love, fertility and prophecy.
""".strip()

In [3]:
output = pipe(en_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


In [4]:
type(output)

dict
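The pipeline returns a plain `dict`. The helper below is a hypothetical sketch (not part of `transformers`) that summarizes such a result. It assumes only the two keys used in this notebook, `"audio"` and `"sampling_rate"`, and, since the waveform may come back as `(samples,)` or `(channels, samples)`, it treats the last axis as time.

```python
from typing import Any, Dict, Sequence


def describe_tts_output(output: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize a text-to-speech pipeline result (hypothetical helper).

    Assumes `output["audio"]` is a 1-D or 2-D (channels, samples) sequence
    and `output["sampling_rate"]` is in samples per second.
    """
    audio = output["audio"]
    rate = output["sampling_rate"]
    samples: Sequence[float] = audio
    channels = 1
    # If the first element is itself a sequence, the input is (channels, samples).
    if len(samples) > 0 and hasattr(samples[0], "__len__"):
        channels = len(samples)
        samples = samples[0]
    n = len(samples)
    return {
        "channels": channels,
        "num_samples": n,
        "sampling_rate": rate,
        "duration_s": n / rate,  # length of the clip in seconds
    }
```

Calling `describe_tts_output(output)` on the result above should report how many seconds of audio Bark produced for the prompt.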

In [5]:
from IPython.display import Audio

Audio(output["audio"], rate=output["sampling_rate"])
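`IPython.display.Audio` only plays the clip inside the notebook. To keep it, one option is a small writer built on the standard-library `wave` module. This is a minimal sketch, assuming mono float samples in the range [-1.0, 1.0]; it clips anything outside that range and stores 16-bit PCM.

```python
import struct
import wave
from typing import Iterable


def save_wav(path: str, samples: Iterable[float], sampling_rate: int) -> int:
    """Write mono float audio in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Values outside [-1, 1] are clipped. Returns the number of frames written.
    """
    pcm = []
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))  # clip to the valid range
        pcm.append(round(x * 32767))       # scale to a signed 16-bit integer
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)                 # mono
        wf.setsampwidth(2)                 # 2 bytes per sample = 16-bit
        wf.setframerate(int(sampling_rate))
        wf.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))  # little-endian
    return len(pcm)
```

Assuming `output["audio"]` is a NumPy array, something like `save_wav("ladybugs_en.wav", output["audio"].squeeze(), output["sampling_rate"])` would flatten a `(1, samples)` array before writing.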

### Audio generation: _French-to-French_

In [6]:
fr_text = """
Les coccinelles ont joué un rôle important dans la culture et la religion,
étant associées à la chance, à l’amour, à la fertilité et à la prophétie.
""".strip()

output = pipe(fr_text)

Audio(output["audio"], rate=output["sampling_rate"])


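A pitfall worth checking with triple-quoted prompts: writing four opening quotes (`""""`) silently puts a literal `"` at the start of the text, and `.strip()` only removes whitespace, so the quote would be sent to Bark along with the sentence. The short check below is a hypothetical helper, not part of any library.

```python
bad = """"
Les coccinelles
""".strip()

good = """
Les coccinelles
""".strip()

# .strip() removes whitespace only, so the stray quote survives in `bad`.
print(repr(bad))   # '"\nLes coccinelles'
print(repr(good))  # 'Les coccinelles'


def starts_with_stray_quote(text: str) -> bool:
    """Hypothetical pre-flight check for prompts sent to the TTS pipeline."""
    return text[:1] in ('"', "'")
```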

### Audio generation: _Japanese-to-Japanese_

In [7]:
jp_text = """
てんとう虫は文化や宗教において重要な役割を果たしてきました。
幸運、愛、豊饒、予言に関連しています。
""".strip()

output = pipe(jp_text)

Audio(output["audio"], rate=output["sampling_rate"])


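Bark tends to work best on short prompts like the ones above. A common workaround for longer text is to synthesize sentence by sentence and join the pieces with short pauses. The sketch below assumes that approach; `split_sentences` and `synthesize_long` are hypothetical helpers, the 0.25 s pause is an arbitrary choice, and the naive splitter will also break on abbreviations such as "e.g.".

```python
import re
from typing import Callable, List, Sequence

# Split after sentence-final punctuation: Latin (.!?) and Japanese (。！？).
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")


def split_sentences(text: str) -> List[str]:
    """Split text after sentence-final punctuation, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def synthesize_long(
    text: str,
    synthesize: Callable[[str], Sequence[float]],
    sampling_rate: int,
    pause_s: float = 0.25,
) -> List[float]:
    """Synthesize each sentence separately and join them with short silences.

    `synthesize` maps one sentence to a 1-D waveform at `sampling_rate`.
    """
    silence = [0.0] * int(pause_s * sampling_rate)
    pieces: List[float] = []
    for i, sentence in enumerate(split_sentences(text)):
        if i > 0:
            pieces.extend(silence)  # pause between consecutive sentences
        pieces.extend(float(x) for x in synthesize(sentence))
    return pieces
```

With this notebook's pipeline, `synthesize` could be `lambda s: pipe(s)["audio"].squeeze()` (assuming the pipeline returns a NumPy array), paired with `output["sampling_rate"]` from an earlier call.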

### Bark can sing?

... not too sure about this one...

In [8]:
song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ "

output = pipe(song)

Audio(output["audio"], rate=output["sampling_rate"])


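Bark's README describes `♪` as a marker for song lyrics, which is what the prompt above relies on; whether the output actually sounds sung varies from run to run. For multi-line lyrics, a hypothetical helper can wrap each line consistently.

```python
from typing import Iterable


def to_song_prompt(lyrics: Iterable[str]) -> str:
    """Wrap each non-empty lyric line in ♪ ... ♪ and join with spaces."""
    lines = [line.strip() for line in lyrics if line.strip()]
    return " ".join(f"♪ {line} ♪" for line in lines)
```

For example, `pipe(to_song_prompt(["In the jungle, the mighty jungle", "the ladybug was seen"]))` sends both lines as lyrics.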

### Generating music

Please see [`facebook/musicgen-small`](https://huggingface.co/facebook/musicgen-small#musicgen---small---300m).

In [9]:
music_pipe = pipeline(
    "text-to-audio", 
    model="facebook/musicgen-small"
)

In [10]:
text = "90s rock song with electric guitar and heavy drums"

In [11]:
output = music_pipe(
    text, 
    forward_params={"max_new_tokens": 512}
)

Audio(output["audio"][0], rate=output["sampling_rate"])
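The clip length is controlled by `max_new_tokens`. The `transformers` MusicGen docs estimate the duration as `max_new_tokens / model.config.audio_encoder.frame_rate`; the sketch below assumes a frame rate of 50 tokens per second (verify it against the model config), which would make 512 tokens roughly 10 seconds of audio.

```python
import math


def tokens_for_seconds(seconds: float, frame_rate: float = 50.0) -> int:
    """Smallest number of new tokens that covers `seconds` of audio.

    ASSUMPTION: one generated token per audio frame at `frame_rate` frames/s.
    """
    return math.ceil(seconds * frame_rate)


def seconds_for_tokens(max_new_tokens: int, frame_rate: float = 50.0) -> float:
    """Approximate audio length produced by `max_new_tokens` tokens."""
    return max_new_tokens / frame_rate
```

Under that assumption, `music_pipe(text, forward_params={"max_new_tokens": tokens_for_seconds(8)})` would request about 8 seconds of music.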