## Notebook 4: TTS Workflow

We now have the exact podcast transcript ready, so we can generate the audio for the podcast.

In this notebook, we will first learn how to generate audio using the `Kokoro` model.

After that, we will use the output from Notebook 3 to generate our complete podcast.

Note: Please feel free to extend this notebook with newer models. Kokoro was chosen after some tests using a sample prompt.

Credit: [This](https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing#scrollTo=68QtoUqPWdLk) Colab was used for starter code


We can install these packages for speedups.

For Kokoro, you need to follow these steps:

* 1️⃣ Install kokoro: `pip install -q kokoro soundfile` (prefix with `!` inside a notebook)
* 2️⃣ Install espeak-ng, used for English OOD fallback and some non-English languages: `apt-get -qq -y install espeak-ng` on Debian/Ubuntu, or `brew install espeak-ng` on macOS

If you face any issues with thinc as a dependency of packages like spaCy, try one of the options below.

Use Python 3.12:

* Install Python 3.12.6.
* Create and activate a virtual environment.
* Install thinc or spaCy within this environment.

Install with build isolation disabled (if you prefer to use Python 3.13):

* Ensure numpy is installed: `pip install numpy`
* Install thinc without build isolation: `pip install --no-build-isolation thinc`

Use conda for installation (conda can handle dependencies more effectively on macOS):

* Create and activate a conda environment: `conda create -n myenv python=3.12`, then `conda activate myenv`
* Install thinc using conda: `conda install -c conda-forge thinc`

For more detailed information on installing thinc, refer to the official installation guide at thinc.ai.

If you continue to experience issues, consider consulting the spaCy GitHub discussions or Stack Overflow for community support.



Let's import the necessary frameworks

In [1]:
from IPython.display import Audio
import IPython.display as ipd
from tqdm import tqdm

In [2]:
from transformers import BarkModel, AutoProcessor, AutoTokenizer
import torch
import json
import numpy as np

### Testing the Audio Generation

Let's try generating audio with the Kokoro model to understand how it works.


Please set `device = "cuda"` below if you're using a single GPU node.

#### Kokoro Model

Let's try the Kokoro model first and generate a short sample with the `af_heart` voice.

In [3]:
# 1️⃣ Install kokoro
# !pip install -q "kokoro>=0.8.2" soundfile
# 2️⃣ Install espeak, used for English OOD fallback and some non-English languages
# !apt-get -qq -y install espeak-ng > /dev/null 2>&1
# 🇪🇸 'e' => Spanish es
# 🇫🇷 'f' => French fr-fr
# 🇮🇳 'h' => Hindi hi
# 🇮🇹 'i' => Italian it
# 🇧🇷 'p' => Brazilian Portuguese pt-br

# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
# 🇯🇵 'j' => Japanese: pip install misaki[ja]
# 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]
pipeline = KPipeline(lang_code='a') # <= make sure lang_code matches voice

# This text is for demonstration purposes only, unseen during training
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.

[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
# text = '「もしおれがただ偶然、そしてこうしようというつもりでなくここに立っているのなら、ちょっとばかり絶望するところだな」と、そんなことが彼の頭に思い浮かんだ。'
# text = '中國人民不信邪也不怕邪，不惹事也不怕事，任何外國不要指望我們會拿自己的核心利益做交易，不要指望我們會吞下損害我國主權、安全、發展利益的苦果！'
# text = 'Los partidos políticos tradicionales compiten con los populismos y los movimientos asamblearios.'
# text = 'Le dromadaire resplendissant déambulait tranquillement dans les méandres en mastiquant de petites feuilles vernissées.'
# text = 'ट्रांसपोर्टरों की हड़ताल लगातार पांचवें दिन जारी, दिसंबर से इलेक्ट्रॉनिक टोल कलेक्शनल सिस्टम'
# text = "Allora cominciava l'insonnia, o un dormiveglia peggiore dell'insonnia, che talvolta assumeva i caratteri dell'incubo."
# text = 'Elabora relatórios de acompanhamento cronológico para as diferentes unidades do Departamento que propõem contratos.'

# 4️⃣ Generate, display, and save audio files in a loop.
generator = pipeline(
    text, voice='af_heart', # <= change voice here
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    # print(i)  # i => index
    # print(gs) # gs => graphemes/text
    # print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file






0
The sky above the port was the color of television, tuned to a dead channel.
ðə skˈI əbˈʌv ðə pˈɔɹt wʌz ðə kˈʌləɹ ʌv tˈɛləvˌɪʒən, tˈund tə ɐ dˈɛd ʧˈænᵊl.


1
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
“ˌɪts nˌɑt lˈIk ˌIm jˈuzɪŋ,” kˈAs hˈɜɹd sˈʌmwˌʌn sˈA, æz hi ʃˈOldəɹd hɪz wˈA θɹu ðə kɹˈWd əɹˈWnd ðə dˈɔɹ ʌv ðə ʧˈæt. “ˌɪts lˈIk mI bˈɑdiz dəvˈɛləpt ðɪs mˈæsɪv dɹˈʌɡ dəfˈɪʃənsi.”


2
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
ˌɪt wʌz ɐ spɹˈɔl vˈYs ænd ɐ spɹˈɔl ʤˈOk. ðə ʧætsˈubO wʌz ɐ bˈɑɹ fɔɹ pɹəfˈɛʃᵊnəl ɛkspˈAtɹiəts; ju kʊd dɹˈɪŋk ðɛɹ fɔɹ ɐ wˈik ænd nˈɛvəɹ hˈɪɹ tˈu wˈɜɹdz ɪn ʤˌæpənˈiz.


3
These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures.
ðˌiz wɜɹ tə hæv ɐn ɪnˈɔɹməs ˈɪmpˌækt, nˌɑt ˈOnli bəkˈʌz ðA wɜɹ əsˈOsiˌATᵻd wɪð kˈɑnstəntˌin, bˌʌt ˈɔlsO bəkˈʌz, æz ɪn sˌO mˈɛni ˈʌðəɹ ˈɛɹiəz, ðə dəsˈɪʒᵊnz tˈAkən bI kˈɑnstəntˌin (ɔɹ ɪn hɪz nˈAm) wɜɹ tə hæv ɡɹˈAt səɡnˈɪfəkᵊns fɔɹ sˈɛnʧəɹiz tə kˈʌm. wˈʌn ʌv ðə mˈAn ˈɪʃjuz wʌz ðə ʃˈAp ðæt kɹˈɪsʧən ʧˈɜɹʧᵻz wɜɹ tə tˈAk, sˈɪns ðɛɹ wʌz nˌɑt, əpˈɛɹəntli, ɐ tɹədˈɪʃən ʌv mˌɑnjəmˈɛntᵊl ʧˈɜɹʧ bˈɪldɪŋz wˌɛn kˈɑnstəntˌin dəsˈIdᵻd tə hˈɛlp ðə kɹˈɪsʧən ʧˈɜɹʧ bˈɪld ɐ sˈɪɹiz ʌv tɹˈuli spɛktˈækjələɹ stɹˈʌkʧəɹz.


4
The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need.
ðə mˈAn fˈɔɹm ðæt ðiz ʧˈɜɹʧᵻz tˈʊk wʌz ðˈæt ʌv ðə bəsˈɪləkə, ɐ mˌʌltipˈɜɹpəs ɹɛktˈæŋɡjələɹ stɹˈʌkʧəɹ, bˈAst ˈʌltəmətli ˌɔn ði ˈɜɹliəɹ ɡɹˈik stˈOə, wˌɪʧ kʊd bi fˈWnd ɪn mˈOst ʌv ðə ɡɹˈAt sˈɪTiz ʌv ði ˈɛmpˌIəɹ. kɹˌɪsʧiˈænəTi, ˌʌnlˈIk klˈæsəkᵊl pˈɑliθiˌɪzəm, nˈidᵻd ɐ lˈɑɹʤ ɪntˈɪɹiəɹ spˈAs fɔɹ ðə sˌɛləbɹˈAʃən ʌv ɪts ɹəlˈɪʤəs sˈɜɹvəsᵻz, ænd ðə bəsˈɪləkə ˈæptli fˈɪld ðˈæt nˈid.


5
We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
wˌi nˈæʧəɹəli dˈu nˌɑt nˈO ðə dəɡɹˈi tə wˌɪʧ ði ˈɛmpəɹəɹ wʌz ɪnvˈɑlvd ɪn ðə dəzˈIn ʌv nˈu ʧˈɜɹʧᵻz, bˌʌt ɪt ɪz tˈɛmptɪŋ tə kənˈɛkt ðɪs wɪð ðə sˈɛkjələɹ bəsˈɪləkə ðæt kˈɑnstəntˌin kəmplˈiTᵻd ɪn ðə ɹˈOmən fˈɔɹəm (ðə sˌOkˈɔld bəsˈɪləkə ʌv mæksˈɛntiəs) ænd ðə wˈʌn hi pɹˈɑbəbli bˈɪlt ɪn tɹˈɪɹ, ɪn kənˈɛkʃən wɪð hɪz ɹˈɛzədᵊns ɪn ðə sˈɪTi æt ɐ tˈIm wˌɛn hi wʌz stˈɪl sˈizəɹ.


6
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
kˈOkəɹO ɪz ɐn ˈOpᵊnwˌAt tˌitˌiˈɛs mˈɑdᵊl wɪð ˈATi tˈu mˈɪljᵊn pəɹˈæməTəɹz. dəspˈIt ɪts lˈItwˌAt ˈɑɹkətˌɛkʧəɹ, ɪt dəlˈɪvəɹz kˈɑmpəɹəbᵊl kwˈɑləTi tə lˈɑɹʤəɹ mˈɑdᵊlz wˌIl bˈiɪŋ səɡnˈɪfəkəntli fˈæstəɹ ænd mˈɔɹ kˈɔstəfˌɪʃənt. wˌɪð əpˌæʧilˈIsᵊnst wˈAts, kˈOkəɹO kæn bi dəplˈYd ˈɛniwˌɛɹ fɹʌm pɹədˈʌkʃən ənvˈIɹənmᵊnts tə pˈɜɹsᵊnəl pɹˈɑʤˌɛkts.
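
To see where the clip boundaries in the output above come from, it can help to reproduce the first splitting step on its own. The sketch below applies the same `split_pattern` regex with Python's `re.split`. It is only an approximation of what the pipeline does: the sample text has five newline-separated pieces but produced seven clips, which suggests Kokoro also breaks long paragraphs into shorter chunks by itself. The helper name `split_segments` and the sample text are assumptions made for illustration.

```python
import re


def split_segments(text, split_pattern=r'\n+'):
    """Split text on the pattern and drop empty pieces (hypothetical helper)."""
    pieces = re.split(split_pattern, text.strip())
    return [piece.strip() for piece in pieces if piece.strip()]


sample = '''
First line of the script.
Second line of the script.

A new paragraph after a blank line.
'''

segments = split_segments(sample)
for i, segment in enumerate(segments):
    print(i, segment)
```

Because `\n+` matches one or more newlines, a blank line between paragraphs does not create an empty segment.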


## Bringing it together: Making the Podcast

Now that we understand how the model works, we can use the complete pipeline to generate the entire podcast.

Let's load in our pickle file from earlier and proceed:

In [4]:
import pickle

with open('./resources/podcast_ready_data.pkl', 'rb') as file:
    PODCAST_TEXT = pickle.load(file)

Let's set the device for the TTS model and define its hyperparameters.

We will concatenate the generated audio segments and keep track of their sampling rate, since we need both to build the final audio.

In [5]:
device="mps"
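
Hard-coding `"mps"` only works on Apple Silicon, so a small fallback chain may be more portable. The sketch below is one possible way to pick a device, assuming PyTorch is the backend; `pick_device` is a hypothetical helper, not part of Kokoro, and the `try`/`except` around the import is there only so the snippet also runs where PyTorch is missing.

```python
try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None


def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds, hence the getattr
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```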

In the code below, the speech speed for each segment is chosen at random from a range that depends on the segment's length relative to the average segment length. Segments shorter than average get a speed between 0.8 and 0.97, while average or longer segments get a speed between 0.97 and 1.15. Note that the final loop passes a single average computed over all segments rather than a per-speaker average. Varying the speed this way keeps the synthesized voices from sounding uniform.

In [6]:
import random
# Function to determine speed range based on text length relative to average length
def determine_speed_range(text_length, average_length):
    if text_length < average_length:
        return 0.8, .97  # Shorter than average: lower speed range
    else:
        return .97, 1.15  # Average or longer: higher speed range
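
As a quick check of how the function behaves, the sketch below repeats its definition so that it runs on its own and samples a speed for one short and one long segment. The average of 150 characters, the sample texts, and the fixed seed are assumptions chosen for illustration.

```python
import random


def determine_speed_range(text_length, average_length):
    """Same rule as above: shorter segments get the lower speed range."""
    if text_length < average_length:
        return 0.8, 0.97
    return 0.97, 1.15


random.seed(0)        # fixed seed so repeated runs sample the same speeds
average_length = 150  # assumed average, for illustration only
samples = ["Hmm, so it's like learning by example?", "x" * 250]

speeds = []
for text in samples:
    low, high = determine_speed_range(len(text), average_length)
    speed = random.uniform(low, high)
    speeds.append(speed)
    print(f"{len(text):>3} chars -> range ({low}, {high}), speed {speed:.3f}")
```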

In [10]:
import re

# PODCAST_TEXT holds the transcript string loaded from the pickle file
content_string = PODCAST_TEXT

# Define a regular expression pattern to match the tuples
pattern = r'\("([^"]+)", "([^"]+)"\)'

# Find all matches in the string
matches = re.findall(pattern, content_string)

# Convert matches to a list of tuples
podcast_text = [(speaker, text) for speaker, text in matches]

# Now you can iterate over the list and access each tuple
for speaker, text in podcast_text:
    print(f"{speaker}: {text}")

Speaker 1: Welcome, everyone! Today, we're diving into the fascinating world of knowledge distillation from large language models like GPT-4. It's like taking a masterpiece and creating a miniature version that retains all the essence - but in a smaller package.
Speaker 2: Oh wow, I'm thrilled to be here! So, what exactly is knowledge distillation?
Speaker 1: Great question! Knowledge distillation is essentially teaching a smaller model from a larger one. Think of it as an apprenticeship, where the little guy learns all the tricks from the big shot.
Speaker 2: Hmm, so it's like learning by example?
Speaker 1: Exactly! There are several ways to do this. Let's start with labeling - we generate annotated data by labeling input-output pairs from our teacher models.
Speaker 2: Ooh, interesting. And what about the other methods?
Speaker 1: There's expansion, which uses the in-context learning capabilities of LLMs to expand initial seed tasks into a broader range of instructions. Then there's

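We often argue that data structures aren't very useful in practice, but this time the knowledge comes in handy.

The cell above parsed the string from the pickle file into a list of `(speaker, text)` tuples with a regular expression; `ast.literal_eval()` is an alternative when the string is a valid Python literal. Next, we measure the length of each text segment and the average length per speaker.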
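
For reference, here is a minimal sketch of the `ast.literal_eval()` route. It assumes the transcript string is a Python list literal of `(speaker, text)` tuples; `parse_transcript` and the sample string are hypothetical. Since `literal_eval` raises on malformed or truncated input, the sketch falls back to the same regular expression used earlier.

```python
import ast
import re


def parse_transcript(raw):
    """Return a list of (speaker, text) tuples from a transcript string."""
    try:
        parsed = ast.literal_eval(raw.strip())
        return [(str(speaker), str(text)) for speaker, text in parsed]
    except (ValueError, SyntaxError, TypeError):
        # Not a valid literal: recover whatever complete tuples the regex finds
        return re.findall(r'\("([^"]+)", "([^"]+)"\)', raw)


raw = '''[
    ("Speaker 1", "Welcome, everyone!"),
    ("Speaker 2", "Oh wow, I'm thrilled to be here!"),
]'''

podcast_pairs = parse_transcript(raw)
for speaker, text in podcast_pairs:
    print(f"{speaker}: {text}")
```

The regex fallback only matches double-quoted strings without embedded double quotes, so it can silently skip segments that `literal_eval` would have parsed.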

In [11]:
from collections import defaultdict


# Initialize a dictionary to store lengths of text segments per speaker
speaker_lengths = defaultdict(list)

# Populate the dictionary with lengths of each text segment
for speaker, text in podcast_text:
    speaker_lengths[speaker].append(len(text))

# Calculate and display the length of each text segment and the average length per speaker
for speaker, lengths in speaker_lengths.items():
    print(f"\n{speaker}:")
    for i, length in enumerate(lengths, 1):
        print(f"  Text segment {i} length: {length} characters")
    average_length = sum(lengths) / len(lengths) if lengths else 0
    print(f"  Average length of text segments: {average_length:.2f} characters")



Speaker 1:
  Text segment 1 length: 251 characters
  Text segment 2 length: 193 characters
  Text segment 3 length: 154 characters
  Text segment 4 length: 257 characters
  Text segment 5 length: 256 characters
  Text segment 6 length: 259 characters
  Text segment 7 length: 284 characters
  Text segment 8 length: 273 characters
  Text segment 9 length: 172 characters
  Text segment 10 length: 208 characters
  Text segment 11 length: 245 characters
  Average length of text segments: 232.00 characters

Speaker 2:
  Text segment 1 length: 76 characters
  Text segment 2 length: 38 characters
  Text segment 3 length: 51 characters
  Text segment 4 length: 86 characters
  Text segment 5 length: 66 characters
  Text segment 6 length: 56 characters
  Text segment 7 length: 67 characters
  Text segment 8 length: 63 characters
  Text segment 9 length: 57 characters
  Text segment 10 length: 43 characters
  Text segment 11 length: 70 characters
  Average length of text segments: 61.18 characters
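
These numbers matter for the speed logic. The final loop uses one average over all 22 segments, which works out to roughly 146.6 characters here. Every Speaker 1 segment is longer than that and every Speaker 2 segment is shorter, so each speaker always lands in the same speed range. If per-speaker variation is wanted, one option is to compare each segment against its own speaker's average. The sketch below shows that idea; `per_speaker_average` and the demo data are assumptions, not part of the original notebook.

```python
from collections import defaultdict
from statistics import mean


def per_speaker_average(pairs):
    """Map each speaker to the mean character length of their segments."""
    lengths = defaultdict(list)
    for speaker, text in pairs:
        lengths[speaker].append(len(text))
    return {speaker: mean(values) for speaker, values in lengths.items()}


# Made-up segments, for illustration only
demo = [
    ("Speaker 1", "a" * 200),
    ("Speaker 2", "b" * 40),
    ("Speaker 1", "c" * 100),
    ("Speaker 2", "d" * 60),
]

averages = per_speaker_average(demo)
for speaker, text in demo:
    relation = "shorter" if len(text) < averages[speaker] else "average or longer"
    print(f"{speaker}: {len(text)} chars is {relation} than this speaker's average")
```

Passing `averages[speaker]` instead of the global average to the generation functions would then let both speakers use both speed ranges.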

Function to generate audio for Speaker 1

In [17]:
def generate_speaker1_audio(text, average_length):
    """Generate audio using Kokoro for Speaker 1"""
    text_length = len(text)
    # speed = random.uniform(.75, 1.2)
    min_speed, max_speed = determine_speed_range(text_length, average_length)
    speed = random.uniform(min_speed, max_speed)
    print(speed)
    generator = pipeline(
        text, voice='am_liam', # <= change voice here  af_heart am_eric
        speed=speed, split_pattern=r'\n+'
    )
    chunks = []
    for _, _, audio in generator:
        # The pipeline may yield several chunks for one segment, so keep all of them
        chunks.append(np.ravel(audio.cpu().detach().numpy()))
    # Join the chunks into one 1-D array (empty if nothing was generated)
    audio_array = np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.float32)

    return audio_array, 24000

Function to generate audio for Speaker 2


In [18]:
def generate_speaker2_audio(text, average_length):
    """Generate audio using Kokoro for Speaker 2"""
    text_length = len(text)
    # speed = random.uniform(.75, 1.2)
    min_speed, max_speed = determine_speed_range(text_length, average_length)
    speed = random.uniform(min_speed, max_speed)
    print(speed)
    generator = pipeline(
        text, voice='af_heart', # <= change voice here  af_heart af_jessica
        speed=speed, split_pattern=r'\n+'
    )
    chunks = []
    for _, _, audio in generator:
        # The pipeline may yield several chunks for one segment, so keep all of them
        chunks.append(np.ravel(audio.cpu().detach().numpy()))
    # Join the chunks into one 1-D array (empty if nothing was generated)
    audio_array = np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.float32)

    return audio_array, 24000
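
The two functions above differ only in the voice they pass to the pipeline, so they could be folded into one function driven by a speaker-to-voice mapping, which would also make it easier to add more speakers. The sketch below illustrates the idea. `VOICES`, `generate_audio`, and the `synthesize` callable are hypothetical names, and a stand-in synthesizer replaces the Kokoro pipeline so the example runs without a TTS model.

```python
import random

# Voices taken from the two functions above; extend the mapping to add speakers
VOICES = {"Speaker 1": "am_liam", "Speaker 2": "af_heart"}


def determine_speed_range(text_length, average_length):
    """Same rule as earlier: shorter segments get the lower speed range."""
    if text_length < average_length:
        return 0.8, 0.97
    return 0.97, 1.15


def generate_audio(speaker, text, average_length, synthesize, voices=VOICES):
    """Pick a voice and a speed for the speaker, then delegate to synthesize."""
    voice = voices[speaker]
    low, high = determine_speed_range(len(text), average_length)
    speed = random.uniform(low, high)
    return synthesize(text, voice=voice, speed=speed)


def fake_synthesize(text, voice, speed):
    """Stand-in for the real TTS call; it just echoes its arguments."""
    return {"text": text, "voice": voice, "speed": speed}


result = generate_audio("Speaker 2", "Ooh, interesting.", 150, fake_synthesize)
print(result["voice"], round(result["speed"], 3))
```

In the notebook, `synthesize` would wrap the Kokoro pipeline call and return the audio array instead of a dictionary.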

Helper function to convert the numpy output from the models into audio

In [19]:
import numpy as np
from pydub import AudioSegment
import ast

def numpy_to_audio_segment(audio_arr, sampling_rate):
    """Convert NumPy array to AudioSegment."""
    # Ensure the NumPy array is in the correct format
    if audio_arr.dtype != np.int16:
        # Clip to [-1, 1] to avoid integer wrap-around, then scale the float array to int16
        audio_arr = (np.clip(audio_arr, -1.0, 1.0) * 32767).astype(np.int16)
    
    # Create an AudioSegment instance from the raw data
    audio_segment = AudioSegment(
        audio_arr.tobytes(), 
        frame_rate=sampling_rate,
        sample_width=audio_arr.dtype.itemsize, 
        channels=1
    )
    
    return audio_segment
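
To make the float-to-int16 step concrete, the sketch below performs the same scaling with only the standard library and wraps the result in an in-memory WAV container. It is an illustration of the conversion, not a replacement for the pydub helper; the function names are hypothetical, and the 24000 Hz default mirrors the sample rate used in this notebook.

```python
import array
import io
import sys
import wave


def floats_to_pcm16(samples):
    """Clip floats to [-1.0, 1.0] and scale them to signed 16-bit integers."""
    return array.array(
        "h", (int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples)
    )


def pcm16_to_wav_bytes(pcm, sample_rate=24000):
    """Wrap mono 16-bit samples in an in-memory WAV container."""
    frames = array.array("h", pcm)  # copy so the caller's data is not modified
    if sys.byteorder == "big":
        frames.byteswap()  # WAV stores samples as little-endian
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono, matching channels=1 above
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames.tobytes())
    return buffer.getvalue()


pcm = floats_to_pcm16([0.0, 1.0, -1.0, 1.7])  # 1.7 is out of range and gets clipped
wav_bytes = pcm16_to_wav_bytes(pcm)
print(list(pcm), len(wav_bytes))
```

Without the clipping step, an out-of-range value such as 1.7 would not fit in 16 bits once scaled.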


#### Generating the Final Podcast

Finally, we can loop over the list of tuples and use our helper functions to generate the audio.

In [20]:
final_audio = None

# Calculate the length of each text segment
lengths = [len(text) for _, text in podcast_text]

# Compute the average length
average_length = sum(lengths) / len(lengths) if lengths else 0


for speaker, text in tqdm(podcast_text, desc="Generating podcast segments", unit="segment"):
    if speaker == "Speaker 1":
        audio_arr, rate = generate_speaker1_audio(text,average_length)
    else:  # Speaker 2
        audio_arr, rate = generate_speaker2_audio(text,average_length)
    
    # Convert to AudioSegment (both speakers are generated at the same 24 kHz sample rate)
    audio_segment = numpy_to_audio_segment(audio_arr, rate)
    
    # Add to final audio
    if final_audio is None:
        final_audio = audio_segment
    else:
        final_audio += audio_segment

Generating podcast segments:   0%|                | 0/22 [00:00<?, ?segment/s]

1.0653703784043882


Generating podcast segments:   5%|▎       | 1/22 [00:02<00:53,  2.53s/segment]

0.9226699210041456


Generating podcast segments:   9%|▋       | 2/22 [00:03<00:32,  1.63s/segment]

1.1329055338410596


Generating podcast segments:  14%|█       | 3/22 [00:05<00:30,  1.59s/segment]

0.9524399846690954


Generating podcast segments:  18%|█▍      | 4/22 [00:05<00:21,  1.19s/segment]

1.0055021321569315


Generating podcast segments:  23%|█▊      | 5/22 [00:07<00:22,  1.35s/segment]

0.8044935305105642


Generating podcast segments:  27%|██▏     | 6/22 [00:08<00:18,  1.17s/segment]

0.9967476188271961


Generating podcast segments:  32%|██▌     | 7/22 [00:11<00:26,  1.77s/segment]

0.9440865903025524


Generating podcast segments:  36%|██▉     | 8/22 [00:12<00:22,  1.57s/segment]

0.9825369229182392


Generating podcast segments:  41%|███▎    | 9/22 [00:14<00:24,  1.89s/segment]

0.8500751168847271


Generating podcast segments:  45%|███▏   | 10/22 [00:15<00:19,  1.59s/segment]

1.0780327196542245


Generating podcast segments:  50%|███▌   | 11/22 [00:18<00:20,  1.83s/segment]

0.8731631647462867


Generating podcast segments:  55%|███▊   | 12/22 [00:18<00:15,  1.53s/segment]

1.0284834792797293


Generating podcast segments:  59%|████▏  | 13/22 [00:21<00:17,  1.96s/segment]

0.961208263570568


Generating podcast segments:  64%|████▍  | 14/22 [00:22<00:13,  1.63s/segment]

0.9738055757741748


Generating podcast segments:  68%|████▊  | 15/22 [00:25<00:14,  2.01s/segment]

0.9308506442102589


Generating podcast segments:  73%|█████  | 16/22 [00:26<00:10,  1.69s/segment]

1.0073560053012451


Generating podcast segments:  77%|█████▍ | 17/22 [00:28<00:08,  1.73s/segment]

0.9292971104254444


Generating podcast segments:  82%|█████▋ | 18/22 [00:29<00:05,  1.39s/segment]

1.0593627197580644


Generating podcast segments:  86%|██████ | 19/22 [00:31<00:04,  1.61s/segment]

0.9026675396617407


Generating podcast segments:  91%|██████▎| 20/22 [00:31<00:02,  1.31s/segment]

1.1078217456275712


Generating podcast segments:  95%|██████▋| 21/22 [00:34<00:01,  1.60s/segment]

0.9612099348904658


Generating podcast segments: 100%|███████| 22/22 [00:34<00:00,  1.59s/segment]


### Output the Podcast

We can now save this as an MP3 file.

In [21]:
final_audio.export("./resources/_podcast.mp3", 
                  format="mp3", 
                  bitrate="192k",
                  parameters=["-q:a", "0"])

<_io.BufferedRandom name='./resources/_podcast.mp3'>

### Suggested Next Steps:

- Experiment with the prompts: Please feel free to experiment with the SYSTEM_PROMPT in the notebooks
- Extend workflow beyond two speakers
- Test other TTS Models
- Experiment with Speech Enhancer models as a step 5.

In [22]:
# !pip freeze > requirements.txt

In [174]:
#fin