# Generating Audio with AudioCraft AudioGen 

This notebook contains examples of using the **AudioGen** model, part of the [AudioCraft](https://ai.meta.com/resources/models-and-libraries/audiocraft/) library.

The content below was adapted from the AudioCraft [demos](https://github.com/facebookresearch/audiocraft/tree/main/demos) on GitHub.

## Outline

- [Setup](#setup)
- [Construct model](#construct-model)
- [Audio Continuation](#audio-continuation)
- [Text-conditional Generation](#text-generation)

## Setup <a id='setup'></a>

In [1]:
# Import libraries and modules
import math
import torch
import torchaudio
from audiocraft.models import AudioGen
from audiocraft.utils.notebook import display_audio

This notebook is running on a local M2 Mac Studio.

In [2]:
# Verify Metal Performance Shaders (MPS) support
if torch.backends.mps.is_available():
    mps_device = torch.device('mps')
    x = torch.ones(1, device=mps_device)
    print(x)
else:
    print('MPS device not found.')

tensor([1.], device='mps:0')


## Construct model <a id='construct-model'></a>

Start by initializing AudioGen. One pretrained checkpoint is provided: `facebook/audiogen-medium`, a medium-sized 1.5B transformer decoder.

**Important note:** This variant differs from the original AudioGen model presented in ["AudioGen: Textually-guided audio generation"](https://arxiv.org/abs/2209.15352). Its architecture is similar to MusicGen, with a smaller frame rate and multiple streams of tokens, which reduces generation time.

In [3]:
# Construct model
model = AudioGen.get_pretrained('facebook/audiogen-medium')



Next, configure the generation parameters. Specifically, the following can be controlled:
* `use_sampling` (bool, optional): use sampling if True, else do argmax decoding. Defaults to True.
* `top_k` (int, optional): top_k used for sampling. Defaults to 250.
* `top_p` (float, optional): top_p used for sampling; when set to 0, top_k is used instead. Defaults to 0.0.
* `temperature` (float, optional): softmax temperature parameter. Defaults to 1.0.
* `duration` (float, optional): duration of the generated waveform. Defaults to 10.0.
* `cfg_coef` (float, optional): coefficient used for classifier free guidance. Defaults to 3.0.

Any parameter that is not set explicitly keeps its default value.
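To make the sampling parameters above more concrete, here is a minimal pure-Python sketch of how `temperature`, `top_k`, and `top_p` could shape the distribution a token is drawn from. The function name `sampling_distribution` is hypothetical, and this is not AudioCraft's implementation, which operates on tensors and may differ in detail. The only behavior taken from the parameter list is that `top_k` applies when `top_p` is 0.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=250, top_p=0.0):
    """Return the probabilities a sampler would draw from (illustrative only)."""
    # Temperature rescales the logits before the softmax:
    # values below 1.0 sharpen the distribution, values above 1.0 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_p > 0.0:
        # Nucleus filtering: keep the smallest prefix whose mass reaches top_p.
        keep, mass = set(), 0.0
        for i in order:
            keep.add(i)
            mass += probs[i]
            if mass >= top_p:
                break
    else:
        # top_p == 0 means top_k is used: keep the k most likely tokens.
        keep = set(order[:top_k]) if top_k > 0 else set(order)

    # Zero out everything else and renormalize the survivors.
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


example_logits = [2.0, 1.0, 0.0, -1.0]
print(sampling_distribution(example_logits, top_k=2))
print(sampling_distribution(example_logits, top_p=0.5))
```

With `use_sampling=False`, the equivalent step would simply pick the index of the largest logit instead of drawing from this distribution.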

In [4]:
# Configure generation parameters
model.set_generation_params(
    use_sampling=True,
    top_k=250,
    duration=5
)
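The `duration=5` setting is consistent with the `/ 250` total shown in the progress counters of the generation cells in this notebook. The sketch below works through that arithmetic, assuming a token frame rate of 50 frames per second and the 16 kHz sample rate used elsewhere in this notebook. The 50 Hz figure is an assumption that this notebook does not state, and both helper names are hypothetical.

```python
def generation_steps(duration_s, frame_rate_hz=50):
    """Number of token frames to generate, assuming a fixed frame rate."""
    return int(duration_s * frame_rate_hz)


def waveform_samples(duration_s, sample_rate=16000):
    """Number of audio samples in a clip of the given duration."""
    return int(duration_s * sample_rate)


print(generation_steps(5))   # 250 token frames for a 5 second clip
print(waveform_samples(5))   # 80000 audio samples at 16 kHz
print(generation_steps(10))  # 500 token frames for the default 10 seconds
```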

Next, start generating sound using one of the following modes:
* Audio continuation using `model.generate_continuation`
* Text-conditional samples using `model.generate`

## Audio Continuation <a id='audio-continuation'></a>

In [5]:
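# Define a helper that synthesizes a short beeping prompt
def get_bip_bip(bip_duration=0.125, frequency=440,
                duration=0.5, sample_rate=16000, device='mps'):
    """Generates a series of bip bip at the given frequency."""
    # Time axis in seconds, created on the requested device
    t = torch.arange(
        int(duration * sample_rate), device=device, dtype=torch.float) / sample_rate
    # Pure tone with a leading channel dimension: shape [1, T]
    wav = torch.cos(2 * math.pi * frequency * t)[None]
    # Position within each on/off cycle, in [0, 1)
    tp = (t % (2 * bip_duration)) / (2 * bip_duration)
    # Gate the tone on during the second half of each cycle
    envelope = (tp >= 0.5).float()
    return wav * envelope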
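The gating in `get_bip_bip` can be checked without PyTorch. The sketch below mirrors the envelope formula in plain Python, using a hypothetical helper named `bip_gate`, and counts how many samples of the default 0.5 second prompt carry sound. It is only an illustration of the formula, not part of the original demo.

```python
def bip_gate(t, bip_duration=0.125):
    """Return 1.0 when the beep is on at time t (in seconds), else 0.0."""
    period = 2 * bip_duration        # one silent half plus one audible half
    phase = (t % period) / period    # position within the cycle, in [0, 1)
    return 1.0 if phase >= 0.5 else 0.0


sample_rate = 16000
num_samples = int(0.5 * sample_rate)  # default prompt length: 8000 samples
gate = [bip_gate(n / sample_rate) for n in range(num_samples)]

on_samples = int(sum(gate))
print(on_samples)                # 4000: half of the prompt is audible
print(gate[0], gate[2000])       # 0.0 1.0: silence first, then the beep
```

With the defaults, the prompt is silent for the first 0.125 seconds, beeps for the next 0.125 seconds, and then repeats that pattern once.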

In [6]:
# Use a synthetic signal to prompt the generated audio
res = model.generate_continuation(
    prompt=get_bip_bip(0.125).expand(2, -1, -1), 
    prompt_sample_rate=16000, 
    descriptions=['Whistling with wind blowing', 'Typing on a typewriter'], 
    progress=True)

# Render audio player for given audio samples
display_audio(samples=res, sample_rate=16000)

Unable to find python bindings at /usr/local/dcgm/bindings/python3. No data will be captured.


   228 /    250

In [7]:
# Use any audio from a file. Make sure to trim the file if it is too long!
prompt_waveform, prompt_sr = torchaudio.load(
    'assets/sirens_and_a_humming_engine_approach_and_pass.mp3')
prompt_duration = 2  # seconds of the file to keep as the prompt
prompt_waveform = prompt_waveform[..., :int(prompt_duration * prompt_sr)]
output = model.generate_continuation(
    prompt=prompt_waveform, prompt_sample_rate=prompt_sr, progress=True)

# Render audio player for given audio samples
display_audio(samples=output, sample_rate=16000)

   153 /    250

## Text-conditional Generation <a id='text-generation'></a>

In [8]:
# Generate samples conditioned on text
output = model.generate(
    descriptions=[
        'Subway train blowing its horn',
        'A cat meowing',
    ],
    progress=True
)

# Render audio player for given audio samples
display_audio(samples=output, sample_rate=16000)

   253 /    250