In [3]:
import os
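# My GPU has 6 GB of VRAM rather than 8 GB, so use the small models and offload to the CPU.
# These variables must be set before bark is imported.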
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["SUNO_USE_SMALL_MODELS"] = "True"
os.environ["SUNO_OFFLOAD_CPU"] = "True"

In [4]:
import bark
from IPython.display import Audio
import nltk
import spacy  # we'll use this to split into sentences
import numpy as np

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE

In [5]:
nlp = spacy.load('en_core_web_lg') # Load the English Model

In [6]:
preload_models()

# Simple Long-Form Generation
We split longer text into sentences with spaCy and generate audio for the sentences one by one.

In [20]:
script = """
Hey, have you heard about this new text-to-audio model called "Bark"? 
Apparently, it's the most realistic and natural-sounding text-to-audio model 
out there right now. People are saying it sounds just like a real person speaking. 
I think it uses advanced machine learning algorithms to analyze and understand the 
nuances of human speech, and then replicates those nuances in its own speech output. 
It's pretty impressive, and I bet it could be used for things like audiobooks or podcasts. 
In fact, I heard that some publishers are already starting to use Bark to create audiobooks. 
It would be like having your own personal voiceover artist. I really think Bark is going to 
be a game-changer in the world of text-to-audio technology.
""".replace("\n", " ").strip()

script = """
READER,

That this part and the next are imperfect, and so much only is written
as I might, and not as I would, I need not excuse to thee if thou know
me, and where and when I live. But some of that which is wanting, if
thou desire, thou mayst find, 1. In my "Universal Concord." 2. In my
"Christian Concord." 3. In our "Agreement for Catechising," and my
"Reformed Pastor." 4. In the "Reformed Liturgy," offered to the
commissioned bishops at the Savoy. Farewell.

CHAPTER I.

OF THE WORSHIP OF GOD IN GENERAL.

_Direct._ I. Understand what it is to worship God aright, lest you
offer him vanity and sin for worship. The worshipping of God is the
direct acknowledging of his being and perfections to his honour.
Indirectly or consequentially he is acknowledged in every obediential
act by those that truly obey and serve him; and this is indirectly and
participatively to worship him; and therefore all things are holy to
the holy, because they are holy in the use of all, and Holiness to the
Lord is, as it were, written upon all that they possess or do (as they
are holy): but this is not the worship which we are here to speak of;
but that which is primarily and directly done to glorify him by the
acknowledgment of his excellencies. Thus God is worshipped either
inwardly by the soul alone, or also outwardly by the body expressing
the worship of the soul. For that which is done by the body alone,
without the concurrence of the heart, is not true worship, but a
hypocritical image or show of it, equivocally called worship.[2] The
inward worship of the heart alone, I have spoken of in the former
part. The outward or expressive worship, is simple or mixed: simple
when we only intend God's worship immediately in the action; and this
is found chiefly in praises and thanksgiving, which therefore are the
most pure and simple sort of expressive worship. Mixed worship is that
in which we join some other intention, for our own benefit in the
action; as in prayer, where we worship God by seeking to him for
mercy; and in reverent hearing or reading of his word, where we
worship him by a holy attendance upon his instructions and commands;
and in his sacraments, where we worship him by receiving and
acknowledging his benefits to our souls; and in oblations, where we
have respect also to the use of the thing offered; and in holy vows
and oaths, in which we acknowledge him our Lord and Judge. All these
are acts of divine worship, though mixed with other uses.
""".replace("\n", " ").strip()

In [26]:
import re
# Remove bracketed footnote markers such as "[2]" so they are not read aloud.
script = re.sub(r'\[\d+\]', '', script)

In [27]:
script

'READER,  That this part and the next are imperfect, and so much only is written as I might, and not as I would, I need not excuse to thee if thou know me, and where and when I live. But some of that which is wanting, if thou desire, thou mayst find, 1. In my "Universal Concord." 2. In my "Christian Concord." 3. In our "Agreement for Catechising," and my "Reformed Pastor." 4. In the "Reformed Liturgy," offered to the commissioned bishops at the Savoy. Farewell.  CHAPTER I.  OF THE WORSHIP OF GOD IN GENERAL.  _Direct._ I. Understand what it is to worship God aright, lest you offer him vanity and sin for worship. The worshipping of God is the direct acknowledging of his being and perfections to his honour. Indirectly or consequentially he is acknowledged in every obediential act by those that truly obey and serve him; and this is indirectly and participatively to worship him; and therefore all things are holy to the holy, because they are holy in the use of all, and Holiness to the Lord 

In [28]:
doc = nlp(script)

In [29]:
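def extract_english_words(text):
    """Keep alphabetic tokens and sentence-ending punctuation; drop everything else."""
    doc = nlp(text)
    # Note: token.lang_ is the language of the pipeline's vocabulary, so with an
    # English model it is always "en" and does not filter out foreign words.
    english_words = [
        token.text for token in doc
        if (token.is_alpha and token.lang_ == "en") or (token.is_punct and token.text in ['.', '!', '?'])
    ]
    cleaned_text = " ".join(english_words)
    return cleaned_text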

In [31]:
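text_ = extract_english_words(script)
text_ = script  # keep the original punctuation; the filtered text above is discarded
# Collect the sentences so the generation cells below can iterate over them.
sentences = [sent.text.strip() for sent in nlp(text_).sents]
for sentence in sentences:
    print(sentence)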

READER,  That this part and the next are imperfect, and so much only is written as I might, and not as I would, I need not excuse to thee if thou know me, and where and when I live.
But some of that which is wanting, if thou desire, thou mayst find, 1.
In my "Universal Concord."
2.
In my "Christian Concord."
3.
In our "Agreement for Catechising," and my "Reformed Pastor."
4.
In the "Reformed Liturgy," offered to the commissioned bishops at the Savoy.
Farewell.  CHAPTER I.  OF THE WORSHIP OF GOD IN GENERAL.  
_Direct._
I. Understand what it is to worship God aright, lest you offer him vanity and sin for worship.
The worshipping of God is the direct acknowledging of his being and perfections to his honour.
Indirectly or consequentially he is acknowledged in every obediential act by those that truly obey and serve him; and this is indirectly and participatively to worship him; and therefore all things are holy to the holy, because they are holy in the use of all, and Holiness to the Lord 
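
The output above shows that list markers such as `2.` and `3.` come out as sentences of their own, and each would be sent to Bark as a separate prompt. A minimal sketch of one way to handle this is to attach very short fragments to the sentence that follows them. `merge_short_fragments` and its `min_chars` cutoff are hypothetical names chosen for illustration; they are not part of spaCy or Bark.

```python
def merge_short_fragments(sentences, min_chars=10):
    """Attach fragments shorter than `min_chars` to the sentence that follows them."""
    merged = []
    carry = ""  # short fragments waiting to be prefixed to the next sentence
    for sentence in sentences:
        text = " ".join(sentence.split())  # collapse runs of whitespace
        if not text:
            continue
        if carry:
            text = f"{carry} {text}"
            carry = ""
        if len(text) < min_chars:
            carry = text
        else:
            merged.append(text)
    if carry:
        # A trailing fragment has no successor, so attach it to the last sentence.
        if merged:
            merged[-1] = f"{merged[-1]} {carry}"
        else:
            merged.append(carry)
    return merged


# A few of the fragments printed above, used as sample input.
fragments = [
    "But some of that which is wanting, if thou desire, thou mayst find, 1.",
    'In my "Universal Concord."',
    "2.",
    'In my "Christian Concord."',
    "3.",
]
merged = merge_short_fragments(fragments)
for sentence in merged:
    print(sentence)
```

In the notebook this would be applied as `sentences = merge_short_fragments(sentences)` before the generation loop. The cutoff is a judgment call: a value of 10 also merges short real sentences such as "Farewell." into their neighbor.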

In [6]:
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    audio_array = generate_audio(sentence, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]




In [7]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)


# Advanced Long-Form Generation
Sometimes Bark will hallucinate a little extra audio at the end of the prompt.
We can reduce this by lowering the end-of-sequence probability threshold at which Bark stops generating semantic tokens.
This threshold is the `min_eos_p` kwarg of `generate_text_semantic`.
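
To see why a lower threshold ends generation earlier, here is a toy sketch. It assumes that generation stops at the first step where the probability assigned to the end-of-sequence token reaches `min_eos_p`. The function `first_stop_step` and the probabilities are made up for illustration; this is not Bark's sampling code.

```python
def first_stop_step(eos_probs, min_eos_p):
    """Return the index of the first step whose EOS probability reaches `min_eos_p`.

    Returns None if no step reaches the threshold.
    """
    for step, p in enumerate(eos_probs):
        if p >= min_eos_p:
            return step
    return None


# Made-up EOS probabilities for six generation steps (illustrative only).
eos_probs = [0.01, 0.02, 0.06, 0.10, 0.15, 0.40]

print(first_stop_step(eos_probs, 0.2))   # higher threshold: stops at step 5
print(first_stop_step(eos_probs, 0.05))  # lower threshold: stops at step 2
```

With the lower threshold the toy generation ends three steps sooner, which is the effect we want when Bark keeps producing audio after the sentence is finished. Setting the threshold too low risks cutting the sentence short.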

In [37]:
GEN_TEMP = 0.6
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    semantic_tokens = generate_text_semantic(
        sentence,
        history_prompt=SPEAKER,
        temp=GEN_TEMP,
        min_eos_p=0.05,  # this controls how likely the generation is to end
    )

    audio_array = semantic_to_waveform(semantic_tokens, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]




In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)


# Make a Long-Form Dialog with Bark

### Step 1: Format a script and speaker lookup

In [14]:
speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}

# Script generated by ChatGPT
script = """
Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?

John: No, I haven't. What's so special about it?

Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.

John: Wow, that sounds amazing. How does it work?

Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.

John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?

Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.

John: I can imagine. It would be like having your own personal voiceover artist.

Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audio technology."""
script = script.strip().split("\n")
script = [s.strip() for s in script if s]
script

['Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?',
 "John: No, I haven't. What's so special about it?",
 "Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.",
 'John: Wow, that sounds amazing. How does it work?',
 'Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.',
 "John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?",
 'Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.',
 'John: I can imagine. It would be like having your own personal voiceover artist.',
 'Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audi
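
Each entry in this list has the form `Speaker: text`. A minimal sketch of turning such lines into `(speaker, text)` pairs is shown here, assuming we want to fail early on a malformed line or an unknown speaker. `parse_script` is a hypothetical helper, not part of Bark, and the third sample line is invented to show that a colon inside the spoken text is preserved.

```python
def parse_script(lines, speaker_lookup):
    """Split 'Speaker: text' lines into (speaker, text) pairs and validate the speaker."""
    turns = []
    for number, line in enumerate(lines, start=1):
        # partition splits at the first ": " only, so later colons stay in the text.
        speaker, sep, text = line.partition(": ")
        if not sep:
            raise ValueError(f"line {number} has no 'Speaker: ' prefix: {line!r}")
        if speaker not in speaker_lookup:
            raise KeyError(f"line {number} uses unknown speaker {speaker!r}")
        turns.append((speaker, text.strip()))
    return turns


speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}
lines = [
    'Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?',
    "John: No, I haven't. What's so special about it?",
    "Samantha: Here is the thing: it sounds natural.",  # invented line with an inner colon
]
turns = parse_script(lines, speaker_lookup)
for speaker, text in turns:
    print(speaker_lookup[speaker], "|", text)
```

Validating the whole script before generating any audio avoids discovering a typo in a speaker name only after several minutes of synthesis.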

### Step 2: Generate the audio for every speaker turn

In [15]:
pieces = []
silence = np.zeros(int(0.5*SAMPLE_RATE))
for line in script:
    # Split on the first ": " only, so a colon inside the spoken text does not break unpacking.
    speaker, text = line.split(": ", 1)
    audio_array = generate_audio(text, history_prompt=speaker_lookup[speaker])
    pieces += [audio_array, silence.copy()]


### Step 3: Concatenate all of the audio and play it

In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
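
The `Audio` widget only plays the result inside the notebook. To keep the audio, the concatenated samples can be written to a WAV file. The sketch below uses only the standard library and assumes the samples are floats in the range [-1.0, 1.0] at 24 kHz, which should match `bark.SAMPLE_RATE` (import the real constant in the notebook rather than hard-coding it). `write_wav` is a hypothetical helper, and a generated tone stands in for `np.concatenate(pieces)`.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24_000  # assumption: same value as bark.SAMPLE_RATE


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    # Scale to the signed 16-bit range and pack as little-endian shorts.
    frames = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)


# Stand-in for np.concatenate(pieces): half a second of a 440 Hz tone
# followed by a quarter second of silence.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 2)]
silence = [0.0] * int(0.25 * SAMPLE_RATE)

path = os.path.join(tempfile.gettempdir(), "bark_demo.wav")
write_wav(path, tone + silence)

with wave.open(path, "rb") as wav_file:
    print(wav_file.getnframes() / wav_file.getframerate())  # 0.75 seconds
```

In the notebook the call would be `write_wav("dialog.wav", np.concatenate(pieces))`, since iterating over a NumPy array yields values that `float()` accepts.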