# Text-To-Speech (TTS) Model
This tutorial introduces the **#1** TTS model (as of Jan 15, 2025) on the [Hugging Face TTS Arena Leaderboard](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena), [Kokoro-TTS](https://huggingface.co/hexgrad/Kokoro-82M).

1. What makes this model remarkable is its small size (only 82 million parameters, trained on < 100 hours of audio) combined with strong performance. It outranks much larger TTS models, such as MetaVoice-1B (a 1.2 billion parameter base model trained on 100K hours of speech) and Edge TTS (proprietary, owned by Microsoft).
2. This means you can run the model locally on a CPU rather than an expensive GPU!
3. You can also use your own voice to train the model.
    - You must follow the model's usage policy, e.g., be responsible for the voice data you use to train the model.
4. You can try a hosted demo on its Hugging Face page: https://huggingface.co/spaces/hexgrad/Kokoro-TTS.
5. Technical details, if you are interested:
    - Architecture
        - StyleTTS 2: https://arxiv.org/abs/2306.07691
        - ISTFTNet: https://arxiv.org/abs/2203.02395
        - Decoder only: no diffusion, no encoder release
    - Architected by: Li et al. @ https://github.com/yl4579/StyleTTS2

> Note: [HuggingFace](https://huggingface.co/) is an open-source community and is viewed as the "GitHub for AI".

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/YuxiaoLuo/AI_Intro/blob/main/tts/tts_kokoro.ipynb)

## Some trending external projects for Kokoro

1. Kokoro-onnx. https://github.com/thewh1teagle/kokoro-onnx
2. FastAPI Endpoints for Kokoro-82M TTS model, emulating the OpenAI TTS. https://github.com/remsky/Kokoro-FastAPI  
3. Kokoro for Rust (a programming language). https://github.com/lucasjinreal/Kokoros

Let's run it with Python and experiment with the model.

### 1️⃣ Install dependencies silently
Run the following cell in a notebook. Lines starting with `!` are passed to the system shell, so the cell emulates a terminal.

Outside a notebook, run the same commands (without the leading `!` or `%`) in a terminal:
- "cmd" on a PC
- "Terminal" on a Mac

In [1]:
# install large file storage system for git
!git lfs install
# clone the HuggingFace model to local folder
!git clone https://huggingface.co/hexgrad/Kokoro-82M
# go into the model folder
%cd Kokoro-82M
# install espeak-ng (apt-get exists only on Debian/Ubuntu systems such as Google Colab)
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# install necessary dependencies for Python code
!pip install -q phonemizer torch transformers scipy munch

Updated Git hooks.
Git LFS initialized.
c:\Users\YuxiaoLuo\Documents\python3\AI_Intro\tts\Kokoro-82M


Cloning into 'Kokoro-82M'...
Filtering content: 100% (17/17), 820.18 MiB | 34.49 MiB/s, done.
The system cannot find the path specified.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the sourc

#### Install `espeak-ng`
- If you failed to install espeak-ng (the open-source speech synthesizer) using the code above, try installing it manually by following this guide: https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md.
- I used the espeak-ng 1.52 release.
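
Before moving on, it can help to confirm that an espeak-ng executable is reachable from Python. The sketch below is only an illustration: the helper name `find_espeak` and the candidate command names are assumptions, not part of Kokoro or phonemizer. It relies on the standard-library `shutil.which`.

```python
import shutil

def find_espeak(candidates=("espeak-ng", "espeak")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None

# Hypothetical check; the command names above are assumptions.
location = find_espeak()
if location is None:
    print("espeak-ng was not found on PATH; install it manually or set the path variables described below.")
else:
    print(f"Found espeak at: {location}")
```

A `None` result does not always mean the install failed: on Windows the installer may not add the program folder to PATH, which is why the environment variables in the next section matter.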

#### If you have trouble setting up espeak-ng

- Windows users see [this](https://github.com/bootphon/phonemizer/issues/44#issuecomment-1540885186).
    - If you are a Windows user, remember to add the espeak-ng variables under "System variables" in the "Environment Variables" settings:
    ```
    PHONEMIZER_ESPEAK_LIBRARY="c:\Program Files\eSpeak NG\libespeak-ng.dll"
    PHONEMIZER_ESPEAK_PATH="c:\Program Files\eSpeak NG"
    ```


- Mac users see [this](https://huggingface.co/hexgrad/Kokoro-82M/discussions/12#67742594fdeebf74f001ecfc)

#### Run the following Python code to ensure path variables are correctly set

In [7]:
import os
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"
os.environ["PHONEMIZER_ESPEAK_PATH"] = r"C:\Program Files\eSpeak NG\espeak-ng.exe"

In [9]:
print("PHONEMIZER_ESPEAK_LIBRARY:", os.environ.get("PHONEMIZER_ESPEAK_LIBRARY"))
print("PHONEMIZER_ESPEAK_PATH:", os.environ.get("PHONEMIZER_ESPEAK_PATH"))

PHONEMIZER_ESPEAK_LIBRARY: C:\Program Files\eSpeak NG\libespeak-ng.dll
PHONEMIZER_ESPEAK_PATH: C:\Program Files\eSpeak NG\espeak-ng.exe
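
Printing the variables shows that they are set, but not that the files exist. A minimal check, assuming the two variable names used above, could look like this. The helper `check_espeak_env` is hypothetical and only wraps `os.path.exists`.

```python
import os

ESPEAK_VARS = ("PHONEMIZER_ESPEAK_LIBRARY", "PHONEMIZER_ESPEAK_PATH")

def check_espeak_env(env=None):
    """Map each variable name to True if it is set and its path exists on disk."""
    env = os.environ if env is None else env
    return {
        name: bool(env.get(name)) and os.path.exists(env[name])
        for name in ESPEAK_VARS
    }

# Report the status of each variable in the current process.
for name, ok in check_espeak_env().items():
    print(f"{name}: {'OK' if ok else 'missing or path not found'}")
```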


### 2️⃣ Build the model and load the default voicepack
Run the following code in Jupyter Notebook, Google Colab, or any Python development software you are used to.

In [1]:
%cd Kokoro-82M

C:\Users\YuxiaoLuo\Documents\python3\AI_Intro\tts\Kokoro-82M


In [11]:
# build_model is defined in models.py inside the cloned Kokoro-82M folder
from models import build_model
import torch

In [47]:
# use an NVIDIA GPU (CUDA) for faster processing if one is available;
# otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load the model "kokoro-v0_19.pth"
MODEL = build_model('kokoro-v0_19.pth', device)

# choose one of the available voice names ([0] selects the default 'af')
VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky'][0]

# load the voicepack; weights_only=True restricts unpickling to tensors and other basic types, which is safer for downloaded files
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)

# print the voice name
print(f'Loaded voice: {VOICE_NAME}')

Loaded voice: af


### 3️⃣ Input your text and generate voice based on selected voice pack
Run the following code to generate voice based on your text input.
- Ensure you have [espeak](https://pypi.org/project/python-espeak/) installed (see step 1), the speech synthesizer used to turn text into phonemes.
- Import `generate` from `kokoro` and call it with the model, the text, and the voicepack.
- It returns 24 kHz audio and the phonemes used.

In [40]:
text = "Hello, World!"
# I love taking the CIS 2350! Dr. Luo is amazing and I'm so glad he is my instructor! HA HA HA, what a wonderful day!

In [52]:
from kokoro import generate
audio, out_ps = generate(MODEL, 
                         text, 
                         VOICEPACK, 
                         lang=VOICE_NAME[0])

# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English => en-gb
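
The first-letter rule in the comments above can be written as a small lookup. This is a sketch: the helper `lang_from_voice` and the dictionary name are illustrative, and only the two prefixes listed above are covered.

```python
# Map the first letter of a voice name to the language described above.
LANG_BY_PREFIX = {
    "a": "en-us",  # American English
    "b": "en-gb",  # British English
}

def lang_from_voice(voice_name):
    """Return (prefix, language code) for a voice name such as 'af_bella'."""
    if not voice_name:
        raise ValueError("voice name is empty")
    prefix = voice_name[0]
    if prefix not in LANG_BY_PREFIX:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return prefix, LANG_BY_PREFIX[prefix]

for name in ["af", "af_bella", "bm_george"]:
    print(name, "->", lang_from_voice(name))
```

The prefix (the first element of the returned pair) is the value this tutorial passes as `lang`.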

### 4️⃣ Display the 24 kHz audio and print the output phonemes

In [53]:
from IPython.display import display, Audio
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)

həlˈoʊ, wˈɜːld!
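
Because the sample rate is fixed at 24 kHz, the clip length follows directly from the number of samples. A small sketch of that arithmetic follows. `duration_seconds` is an illustrative helper, and the sample count is made up rather than taken from a real run.

```python
SAMPLE_RATE = 24000  # samples per second, matching rate=24000 in the Audio(...) call

def duration_seconds(num_samples, rate=SAMPLE_RATE):
    """Length of a clip in seconds: number of samples divided by samples per second."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return num_samples / rate

# Hypothetical example: 36,000 samples at 24 kHz
print(duration_seconds(36000))  # 1.5
```

Assuming `audio` is a one-dimensional array, the same call would be `duration_seconds(len(audio))`.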


### 5️⃣ Blending voices
Voicepacks are Torch tensors (i.e., multi-dimensional arrays), similar to NumPy arrays.

To blend voices:
1. Load several voicepacks
2. Combine them with element-wise arithmetic on the tensors (for example, an average)

In [32]:
import numpy as np
# see the dimension of VOICEPACK
VOICEPACK.shape

torch.Size([511, 1, 256])

In [34]:
VOICEPACK_01 = torch.load('voices/af_bella.pt', weights_only=True).to(device)

In [35]:
VOICEPACK_02 = torch.load('voices/bf_isabella.pt', weights_only=True).to(device)

In [36]:
VOICEPACK_03 = torch.load('voices/af_sky.pt', weights_only=True).to(device)

In [62]:
# average the tensors
af_average = (VOICEPACK_01 + VOICEPACK_03)/2
voice_3 = (VOICEPACK_01 + VOICEPACK_02 + VOICEPACK_03)/3
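
The equal-weight averages above are a special case of a weighted blend. The sketch below is an illustration under stated assumptions: `blend_voices` is a hypothetical helper, and small NumPy arrays stand in for the real voicepacks so the example runs without the model files. The same `+` and `*` arithmetic applies to Torch tensors of matching shape.

```python
import numpy as np

def blend_voices(voicepacks, weights):
    """Weighted average of same-shaped arrays; weights are normalized to sum to 1."""
    if len(voicepacks) != len(weights) or not voicepacks:
        raise ValueError("need one weight per voicepack")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    blended = voicepacks[0] * (weights[0] / total)
    for pack, weight in zip(voicepacks[1:], weights[1:]):
        blended = blended + pack * (weight / total)
    return blended

# Stand-in arrays (tiny, for illustration only; real voicepacks are much larger)
voice_a = np.zeros((2, 1, 4))
voice_b = np.ones((2, 1, 4))

mix = blend_voices([voice_a, voice_b], [3, 1])  # 75% voice_a, 25% voice_b
print(mix.shape, mix[0, 0, 0])
```

With the tutorial's tensors, `blend_voices([VOICEPACK_01, VOICEPACK_03], [1, 1])` should match `af_average` up to floating-point rounding.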

In [57]:
audio_af_average, out_ps = generate(MODEL,
                        text,
                        af_average,
                        lang = VOICE_NAME[0])

In [60]:
display(Audio(data=audio_af_average, rate=24000, autoplay=True))

In [63]:
audio_voice_3, out_ps = generate(MODEL,
                        text,
                        voice_3,
                        lang = VOICE_NAME[0])
display(Audio(data=audio_voice_3, rate=24000, autoplay=True))

In [61]:
display(Audio(data=audio, rate=24000, autoplay=True))

#### Save the audio file to the current folder

In [73]:
from scipy.io.wavfile import write

# arguments: file name, sample rate (24000 Hz), audio data
# data must be a 1-D or 2-D NumPy array of an integer or float data type
write("output.wav", 24000, audio)

# print out saving path
print(f'The audio file is saved to {os.getcwd()}.')

The audio file is saved to C:\Users\YuxiaoLuo\Documents\python3\AI_Intro\tts\Kokoro-82M.
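
`scipy.io.wavfile.write` stores floating-point input as a floating-point WAV file, which some players may not open. One hedged alternative is to convert to 16-bit PCM first. The sketch below assumes the audio is a float array in the range [-1, 1], which is not verified here for Kokoro's output. `save_pcm16` is a hypothetical helper, a generated sine tone stands in for the model output, and the standard-library `wave` module does the writing.

```python
import wave
import numpy as np

def save_pcm16(path, audio, rate=24000):
    """Write a mono float array in [-1, 1] to a 16-bit PCM WAV file."""
    clipped = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    pcm = (clipped * 32767).astype("<i2")  # little-endian 16-bit integers
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)

# Stand-in signal: one second of a 440 Hz tone instead of real model output
t = np.arange(24000) / 24000
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
frames = save_pcm16("tone_pcm16.wav", tone)
print(f"Wrote {frames} frames")
```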


## Appendix
### Training Details

#### Compute
Kokoro was trained on A100 80GB VRAM instances rented from Vast.ai. Vast was chosen over other compute providers due to its competitive on-demand hourly rates. The average cost of the A100 80GB VRAM instances used for training was below 1 dollar per hour per GPU, around half the rates quoted by other providers at the time.


#### Data
Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc.
- Synthetic audio[1] generated by closed[2] TTS models from large providers
    - [1] https://copyright.gov/ai/ai_policy_guidance.pdf
    - [2] No synthetic audio from open TTS models or "custom voice clones"

#### Epochs
Less than 20 epochs

#### Total Dataset Size
Less than 100 hours of audio