# Text to Speech and Text to Music Pipelines in Hugging Face
* Notebook by Adam Lang
* Date: 12/4/2024

# Overview
* In this notebook I demonstrate how to implement two pipelines in Hugging Face:
1. Text to Speech
2. Text to Music

In [5]:
## imports
import transformers
from transformers import pipeline
transformers.__version__ #check version

'4.46.2'

Next we import `soundfile`, which we will use to write the generated audio to a `.wav` file.
* More info here: https://pypi.org/project/soundfile/

In [2]:
import soundfile as sf

In [3]:
## text to sample
text = """
At its re:Invent conference on Tuesday, Amazon Web Services (AWS), Amazon’s cloud computing division, announced a new family of multimodal generative AI models it calls Nova.
There are four text-generating models in total: Micro, Lite, Pro, and Premier. Micro, Lite, and Pro are available Tuesday to AWS customers, while Premier will arrive in early 2025, Amazon CEO Andy Jassy said onstage.
The text-generating Nova models, which are optimized for 15 languages (but primarily English), have widely varying sizes and capabilities.
Micro can only take in text and output text but delivers the lowest latency of the bunch — processing text and generating responses the fastest.

"""

## Default Text to Speech Model
* The default text-to-speech model in HF pipelines (as of `transformers` 4.46.2) is `suno/bark-small`, revision `1dbd7a1`.
  * model card: https://huggingface.co/suno/bark-small
  * Bark is a transformer-based text-to-audio model created by Suno.
    * Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects.
    * The model can also produce nonverbal communications like laughing, sighing and crying.

In [6]:
## synth pipeline
synth = pipeline("text-to-speech")

No model was supplied, defaulted to suno/bark-small and revision 1dbd7a1 (https://huggingface.co/suno/bark-small).
Using a pipeline without specifying a model name and revision in production is not recommended.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
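
The messages above can be addressed by naming the model, revision, and device explicitly when creating the pipeline. Below is a minimal sketch, assuming the model id and revision printed in the log. The helper names `tts_pipeline_kwargs` and `load_tts` are hypothetical, and `load_tts` is defined but not called here because it would download the model again.

```python
def tts_pipeline_kwargs(cuda_available: bool) -> dict:
    """Build explicit keyword arguments for transformers.pipeline (hypothetical helper)."""
    return {
        "task": "text-to-speech",
        "model": "suno/bark-small",  # model id taken from the log message above
        "revision": "1dbd7a1",       # revision taken from the log message above
        "device": "cuda" if cuda_available else "cpu",
    }


def load_tts():
    """Create the pipeline on the GPU when one is available."""
    # Imported lazily so the kwargs helper can be inspected without loading the libraries.
    import torch
    from transformers import pipeline

    return pipeline(**tts_pipeline_kwargs(torch.cuda.is_available()))


print(tts_pipeline_kwargs(cuda_available=False))
```

Calling `synth = load_tts()` would then replace the bare `pipeline("text-to-speech")` call, so the notebook keeps using the same model even if the library default changes.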


In [7]:
# pass text data to synth pipeline
speech = synth(text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [8]:
## output --> speech variable
speech

{'audio': array([[-0.00975459, -0.00944385, -0.01065428, ...,  0.00245522,
          0.00256116,  0.00251117]], dtype=float32),
 'sampling_rate': 24000}
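
The output is a dict holding the waveform and its sampling rate, so the clip length follows directly from the array shape. A small sketch, assuming the samples sit on the last axis as in the output above. The helper name `clip_duration_seconds` and the `fake_speech` stand-in are hypothetical.

```python
import numpy as np


def clip_duration_seconds(output: dict) -> float:
    """Return the clip length in seconds for a pipeline-style output dict."""
    num_samples = output["audio"].shape[-1]  # samples are on the last axis
    return num_samples / output["sampling_rate"]


# Stand-in for the real `speech` dict: two seconds of silence at 24 kHz.
fake_speech = {
    "audio": np.zeros((1, 48_000), dtype=np.float32),
    "sampling_rate": 24_000,
}
print(clip_duration_seconds(fake_speech))  # 2.0
```

Running `clip_duration_seconds(speech)` on the real output is a quick way to check whether the whole passage was synthesized or only part of it.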

In [10]:
## write result to a `.wav` file
sf.write("speech.wav",
         speech["audio"].T,
         samplerate=speech['sampling_rate'])

## Summary
* After you run `sf.write`, the `.wav` file is saved in the current working directory.
* You can then download the file and listen to it.
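
Before downloading, it can help to confirm that the file was written as expected. The sketch below reads the header with Python's standard `wave` module. It assumes the file is integer PCM, which I believe is what `soundfile` writes by default for `.wav`, but treat that as an assumption; `wave` cannot open floating-point WAV files. The helper name `wav_summary` and the demo file name are hypothetical.

```python
import os
import tempfile
import wave


def wav_summary(path: str) -> dict:
    """Read basic header information from a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


# Demo on a small file created here: one second of 16-bit mono silence at 24 kHz.
demo_path = os.path.join(tempfile.gettempdir(), "demo_silence.wav")
with wave.open(demo_path, "wb") as demo:
    demo.setnchannels(1)
    demo.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    demo.setframerate(24_000)
    demo.writeframes(b"\x00\x00" * 24_000)

print(wav_summary(demo_path))
```

In the notebook, `wav_summary("speech.wav")` should report one channel and a sample rate of 24000 if the write succeeded.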

# Text to Music Generation
* We now move from speech to music, again using a text-to-audio pipeline.
* We will implement a text-to-music generation pipeline in Hugging Face.
* The model we will use is: `facebook/musicgen-small`
  * model card: https://huggingface.co/facebook/musicgen-small
  * MusicGen is a text-to-music model capable of generating high-quality music samples conditioned on **text descriptions or audio prompts.**
  * It is a single-stage auto-regressive Transformer model trained over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz.
  * Unlike existing methods, like MusicLM, **MusicGen doesn't require a self-supervised semantic representation**, and it generates all 4 codebooks in one pass.
  * By introducing a small delay between the codebooks, the authors show that the codebooks can be predicted in parallel, so only 50 auto-regressive steps are needed per second of audio.

* **The overall concept is that you send a text prompt to the model and it can generate music based on the prompt.**
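
The step counts quoted above can be worked out with a few lines of arithmetic. This is a rough sketch using only the numbers from the model card text (50 Hz frame rate, 4 codebooks). It ignores the few extra steps introduced by the delay between codebooks, and the "flattened" scheme is a hypothetical comparison, not something MusicGen does.

```python
import math

FRAME_RATE_HZ = 50  # codebook frames per second of audio (from the model card text)
NUM_CODEBOOKS = 4   # codebooks per frame (from the model card text)


def decoding_steps(seconds: float, parallel: bool = True) -> int:
    """Auto-regressive steps needed for `seconds` of audio.

    parallel=True: all codebooks are predicted together, one step per frame.
    parallel=False: hypothetical flattened scheme, one step per codebook entry.
    """
    frames = math.ceil(seconds * FRAME_RATE_HZ)
    return frames if parallel else frames * NUM_CODEBOOKS


for secs in (1, 5, 30):
    print(secs, decoding_steps(secs), decoding_steps(secs, parallel=False))
```

For a 5-second clip this gives 250 steps with parallel prediction versus 1000 if every codebook entry needed its own step, which is why the delay pattern matters for generation speed.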


In [11]:
## setup synth pipeline again
synth = pipeline("text-to-audio",
                 model="facebook/musicgen-small")

# prompt
text = "an upbeat song with 90s Seattle Grunge vibes" #prompt from the repo


# run generation pipeline
music = synth(text,
              forward_params={"do_sample":True})

Config of the text_encoder: <class 'transformers.models.t5.modeling_t5.T5EncoderModel'> is overwritten by shared text_encoder config: T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_be

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [13]:
# write results to a `.wav` file
# sf.write("music.wav", music["audio"].T, samplerate=music["sampling_rate"])

# sf.write would need the audio array reshaped first, so we use scipy instead
import scipy.io.wavfile

## write wavfile using scipy
scipy.io.wavfile.write("music.wav", rate=music["sampling_rate"], data=music["audio"])
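
The comment in the cell above mentions reshaping. One way to do that is sketched below, assuming the pipeline returns the audio channels-first with extra singleton axes, for example `(1, channels, frames)`. I have not confirmed the exact shape MusicGen returns, so check `music["audio"].shape` first. The helper name `to_frames_by_channels` is hypothetical.

```python
import numpy as np


def to_frames_by_channels(audio: np.ndarray) -> np.ndarray:
    """Drop singleton axes and return (frames,) for mono or (frames, channels) otherwise.

    Assumes the input is laid out channels-first, e.g. (1, channels, frames).
    """
    squeezed = np.squeeze(audio)
    if squeezed.ndim == 1:
        return squeezed
    if squeezed.ndim == 2:
        return squeezed.T
    raise ValueError(f"unexpected audio shape: {audio.shape}")


mono = to_frames_by_channels(np.zeros((1, 1, 100), dtype=np.float32))
stereo = to_frames_by_channels(np.zeros((1, 2, 100), dtype=np.float32))
print(mono.shape, stereo.shape)  # (100,) (100, 2)
```

With the array in this layout, `sf.write("music.wav", to_frames_by_channels(music["audio"]), samplerate=music["sampling_rate"])` should be accepted, though I have not run it against the actual MusicGen output.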