<a href="https://colab.research.google.com/github/domswp/Bark-ai-long-generation/blob/main/Bark_AI_long_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Check GPU

In [None]:
!nvidia-smi

Wed Jul 26 09:22:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Import Modules

In [None]:
import torch
print(torch.__version__)

2.0.1+cu118


In [None]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Install Bark Repo

In [None]:
# install bark (make sure you have torch>=2 for much faster flash-attention)
!pip install git+https://github.com/suno-ai/bark.git

# Bark AI

In [None]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"


from IPython.display import Audio
import nltk  # we'll use this to split into sentences
import numpy as np

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE

preload_models()

# Simple Long-Form Generation
We split longer text into sentences using `nltk` and generate the sentences one by one.

In [None]:
script = """
Hey, have you heard about this new text-to-audio model called "Bark"?
Apparently, it's the most realistic and natural-sounding text-to-audio model
out there right now. People are saying it sounds just like a real person speaking.
I think it uses advanced machine learning algorithms to analyze and understand the
nuances of human speech, and then replicates those nuances in its own speech output.
It's pretty impressive, and I bet it could be used for things like audiobooks or podcasts.
In fact, I heard that some publishers are already starting to use Bark to create audiobooks.
It would be like having your own personal voiceover artist. I really think Bark is going to
be a game-changer in the world of text-to-audio technology.
""".replace("\n", " ").strip()

In [None]:
sentences = nltk.sent_tokenize(script)

In [None]:
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    audio_array = generate_audio(sentence, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]


100%|██████████| 250/250 [00:02<00:00, 86.99it/s]
100%|██████████| 13/13 [00:12<00:00,  1.04it/s]
100%|██████████| 265/265 [00:03<00:00, 74.75it/s]
100%|██████████| 14/14 [00:12<00:00,  1.08it/s]
100%|██████████| 213/213 [00:02<00:00, 101.36it/s]
100%|██████████| 11/11 [00:10<00:00,  1.02it/s]
100%|██████████| 677/677 [00:08<00:00, 81.60it/s]
100%|██████████| 34/34 [00:32<00:00,  1.04it/s]
100%|██████████| 617/617 [00:06<00:00, 88.40it/s]
100%|██████████| 31/31 [00:30<00:00,  1.03it/s]
100%|██████████| 564/564 [00:06<00:00, 91.94it/s]
100%|██████████| 29/29 [00:27<00:00,  1.05it/s]
100%|██████████| 296/296 [00:03<00:00, 85.61it/s]
100%|██████████| 15/15 [00:14<00:00,  1.05it/s]
100%|██████████| 518/518 [00:05<00:00, 93.49it/s]
100%|██████████| 26/26 [00:25<00:00,  1.03it/s]


In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
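
The loop above appends a pause after every sentence, including the last one. As a minimal sketch, the joining step can be written so that pauses go only between clips. `join_with_silence` is a hypothetical helper, not part of Bark, and the sample rate of 24,000 Hz is assumed here only to keep the example self-contained; in the notebook you would pass Bark's `SAMPLE_RATE`.

```python
import numpy as np


def join_with_silence(pieces, sample_rate, gap_seconds=0.25):
    """Concatenate audio clips, inserting a fixed pause between consecutive clips.

    Hypothetical helper for illustration; not part of Bark.
    """
    gap = np.zeros(int(gap_seconds * sample_rate), dtype=np.float32)
    out = []
    for i, clip in enumerate(pieces):
        if i > 0:
            out.append(gap)  # pause only between clips, not after the last one
        out.append(np.asarray(clip, dtype=np.float32))
    if not out:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(out)


sample_rate = 24_000  # assumed value for this sketch; use SAMPLE_RATE in the notebook
clips = [
    np.ones(sample_rate, dtype=np.float32),       # stand-in for a 1.0 s clip
    np.ones(sample_rate // 2, dtype=np.float32),  # stand-in for a 0.5 s clip
]
audio = join_with_silence(clips, sample_rate)
print(len(audio) / sample_rate)  # duration in seconds: 1.0 + 0.25 + 0.5
```

In the notebook this would be used as `Audio(join_with_silence(pieces, SAMPLE_RATE), rate=SAMPLE_RATE)`, with `pieces` holding only the generated clips.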

# Advanced Long-Form Generation
Sometimes Bark will hallucinate a little extra audio at the end of the prompt.
We can mitigate this by lowering the threshold at which Bark stops generating, so that generation ends sooner.
We do this with the `min_eos_p` kwarg of `generate_text_semantic`.
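
To see why a lower threshold ends generation sooner, here is a toy illustration. It is not Bark's implementation: `steps_before_stop` and the list of end-of-sequence probabilities are made up for this example, and the stop rule (halt as soon as the end-of-sequence probability reaches `min_eos_p`) is an assumption about how such a threshold typically behaves.

```python
def steps_before_stop(eos_probs, min_eos_p):
    """Count how many tokens are emitted before the stop rule fires.

    eos_probs: hypothetical per-step probabilities of the end-of-sequence token.
    Toy model only; this is not Bark's code.
    """
    for step, p in enumerate(eos_probs):
        if p >= min_eos_p:
            return step  # stop here, before emitting a token at this step
    return len(eos_probs)  # threshold never reached within the given steps


# Made-up probabilities that rise as the sentence nears its end.
eos_probs = [0.01, 0.02, 0.06, 0.10, 0.30, 0.90]

print(steps_before_stop(eos_probs, min_eos_p=0.2))   # higher threshold: stops later
print(steps_before_stop(eos_probs, min_eos_p=0.05))  # lower threshold: stops sooner
```

With a lower threshold the model has fewer steps in which to append unwanted audio, at the cost of occasionally cutting a sentence short.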

In [None]:
GEN_TEMP = 0.6
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    semantic_tokens = generate_text_semantic(
        sentence,
        history_prompt=SPEAKER,
        temp=GEN_TEMP,
        min_eos_p=0.05,  # end-of-sequence probability threshold; lower values let generation stop sooner
    )

    audio_array = semantic_to_waveform(semantic_tokens, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]



100%|██████████████████████████████████████████████████████████████████████| 100/100 [00:02<00:00, 38.05it/s]
100%|████████████████████████████████████████████████████████████████████████| 18/18 [00:07<00:00,  2.46it/s]
100%|██████████████████████████████████████████████████████████████████████| 100/100 [00:03<00:00, 32.28it/s]
100%|████████████████████████████████████████████████████████████████████████| 21/21 [00:08<00:00,  2.54it/s]
100%|██████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 55.78it/s]
100%|████████████████████████████████████████████████████████████████████████| 14/14 [00:05<00:00,  2.57it/s]
100%|██████████████████████████████████████████████████████████████████████| 100/100 [00:06<00:00, 14.73it/s]
100%|████████████████████████████████████████████████████████████████████████| 35/35 [00:14<00:00,  2.47it/s]
100%|██████████████████████████████████████████████████████████████████████| 100/100 [00:02<00:00, 40.29it/s]
100%|█████

In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)

# Make a Long-Form Dialog with Bark

### Step 1: Format a script and speaker lookup

In [None]:
speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}

# Script generated by ChatGPT
script = """
Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?

John: No, I haven't. What's so special about it?

Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.

John: Wow, that sounds amazing. How does it work?

Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.

John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?

Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.

John: I can imagine. It would be like having your own personal voiceover artist.

Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audio technology."""
script = script.strip().split("\n")
script = [s.strip() for s in script if s]
script

['Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?',
 "John: No, I haven't. What's so special about it?",
 "Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.",
 'John: Wow, that sounds amazing. How does it work?',
 'Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.',
 "John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?",
 'Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.',
 'John: I can imagine. It would be like having your own personal voiceover artist.',
 'Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audi

### Step 2: Generate the audio for every speaker turn

In [None]:
pieces = []
silence = np.zeros(int(0.5 * SAMPLE_RATE))  # half a second of silence between turns
for line in script:
    # Split on the first ": " only, so colons inside the spoken text are preserved
    speaker, text = line.split(": ", 1)
    audio_array = generate_audio(text, history_prompt=speaker_lookup[speaker])
    pieces += [audio_array, silence.copy()]

100%|██████████| 267/267 [00:02<00:00, 96.38it/s]
100%|██████████| 14/14 [00:13<00:00,  1.06it/s]
100%|██████████| 150/150 [00:01<00:00, 100.01it/s]
100%|██████████| 8/8 [00:07<00:00,  1.05it/s]
100%|██████████| 706/706 [00:08<00:00, 87.58it/s]
100%|██████████| 36/36 [00:34<00:00,  1.05it/s]
100%|██████████| 309/309 [00:04<00:00, 69.56it/s]
100%|██████████| 16/16 [00:14<00:00,  1.07it/s]
100%|██████████| 692/692 [00:07<00:00, 89.03it/s]
100%|██████████| 35/35 [00:33<00:00,  1.04it/s]
100%|██████████| 521/521 [00:05<00:00, 88.90it/s]
100%|██████████| 27/27 [00:24<00:00,  1.08it/s]
100%|██████████| 676/676 [00:07<00:00, 85.24it/s]
100%|██████████| 34/34 [00:32<00:00,  1.04it/s]
100%|██████████| 212/212 [00:02<00:00, 100.97it/s]
100%|██████████| 11/11 [00:10<00:00,  1.07it/s]
100%|██████████| 354/354 [00:03<00:00, 97.42it/s]
100%|██████████| 18/18 [00:16<00:00,  1.06it/s]
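
A slightly more defensive way to parse each turn can be sketched as follows, assuming every line has the form `Speaker: text`. `parse_turn` is a hypothetical helper, not part of Bark. It splits on the first `: ` only, so colons inside the spoken text survive, and it reports malformed lines or unknown speakers with a clear message instead of a bare `KeyError`.

```python
def parse_turn(line, speaker_lookup):
    """Split a 'Speaker: text' line into (voice preset, text).

    Hypothetical helper for illustration; not part of Bark.
    """
    speaker, sep, text = line.partition(": ")  # split on the first ": " only
    if not sep:
        raise ValueError(f"expected 'Speaker: text', got {line!r}")
    if speaker not in speaker_lookup:
        raise ValueError(f"unknown speaker {speaker!r}")
    return speaker_lookup[speaker], text.strip()


speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}

# The second ": " belongs to the spoken text and must not be split on.
preset, text = parse_turn("John: One thing: how does it work?", speaker_lookup)
print(preset, "|", text)
```

Inside the generation loop, the returned pair would be passed on as `generate_audio(text, history_prompt=preset)`.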


### Step 3: Concatenate all of the audio and play it

In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)