# kokoro

- https://huggingface.co/hexgrad/Kokoro-82M
- https://github.com/yl4579/StyleTTS2
- [2023. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
- [2022. iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform](https://arxiv.org/abs/2203.02395)
- Decoder only: no diffusion, no encoder release

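Parameters:

Kokoro v0.19: 82M params (measured: "Model total has 81.763 million parameters"), Apache license, trained on <100 hours of audio

The parameter count is small, so the model can run directly on low-end devices such as phones and edge hardware.


The released open weights are Kokoro v0.19, which does not support Chinese. Chinese text can still be converted to phonemes with phonemizer and fed to the model, but the results are poor.

Kokoro v0.23 supports Chinese, but its weights have not been released.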


https://huggingface.co/spaces/hexgrad/Kokoro-TTS


## run kokoro-tts with pytorch

In [1]:
# 1️⃣ Install dependencies silently
!git lfs install
!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch

Git LFS initialized.
Cloning into 'Kokoro-82M'...
remote: Enumerating objects: 131, done.
remote: Counting objects: 100% (127/127), done.
remote: Compressing objects: 100% (127/127), done.
remote: Total 131 (delta 57), reused 0 (delta 0), pack-reused 4 (from 1)
Receiving objects: 100% (131/131), 57.13 KiB | 4.76 MiB/s, done.
Resolving deltas: 100% (57/57), done.
Filtering content: 100% (17/17), 820.18 MiB | 33.18 MiB/s, done.
/content/Kokoro-82M


In [2]:
# 2️⃣ Build the model and load the default voicepack
from models import build_model
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)

total_params = 0
for key,model in MODEL.items():
    print(f'{key} Model: {model}')
    params = sum(p.numel() for p in model.parameters())
    total_params += params
    model_million_params = params / 1e6
    print(f'{key} Model has {model_million_params:.3f} million parameters')

model_million_params = total_params / 1e6
print(f'Model total has {model_million_params:.3f} million parameters')

VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
][0]
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')




bert Model: CustomAlbert(
  (embeddings): AlbertEmbeddings(
    (word_embeddings): Embedding(178, 128, padding_idx=0)
    (position_embeddings): Embedding(512, 128)
    (token_type_embeddings): Embedding(2, 128)
    (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0, inplace=False)
  )
  (encoder): AlbertTransformer(
    (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
    (albert_layer_groups): ModuleList(
      (0): AlbertLayerGroup(
        (albert_layers): ModuleList(
          (0): AlbertLayer(
            (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (attention): AlbertSdpaAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (attention_dropout): Dropout(p=0, inplac

In [5]:
# 3️⃣ Call generate, which returns 24 kHz audio and the phonemes used
from kokoro import generate
#text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
text = """
In the quiet corners of our minds,
Where thoughts dance and shadows intertwine,
There lies a place both soft and bright,
A sanctuary for the heart’s own fight.

Mental health is like a gentle stream,
Sometimes calm, at times it teems.
But when waves rise, threatening to crest,
It's knowing you're not alone that brings rest.

We all journey through life’s winding path,
Each with our trials, each with our wrath.
Yet in the presence of those who care,
Our burdens lighten, and we find repair.

Seeing someone with a listening ear,
Brings comfort to fears both far and near.
A gentle hand can lift us from our fall,
And remind us that together, we stand tall.

Let’s cherish the bonds that help us heal,
For in each other's strength, we find real zeal.
In this world of ours, vast and wide,
Compassion is where true healing resides.
"""

audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English => en-gb




Truncated to 510 tokens
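
The comments in the cell above say that the first letter of the voice name selects the language. A minimal sketch of that rule follows. The helper name `lang_for_voice` and the mapping table are hypothetical and only cover the two prefixes listed in the comments; they are not part of the Kokoro repository.

```python
# Hypothetical lookup table built from the comments in the cell above:
# 'a' => American English (en-us), 'b' => British English (en-gb).
VOICE_PREFIX_TO_LANG = {"a": "en-us", "b": "en-gb"}


def lang_for_voice(voice_name: str) -> str:
    """Return the language code implied by the first letter of a voice name."""
    prefix = voice_name[:1]
    if prefix not in VOICE_PREFIX_TO_LANG:
        raise ValueError(f"Unknown voice prefix {prefix!r} in {voice_name!r}")
    return VOICE_PREFIX_TO_LANG[prefix]


for name in ("af", "af_bella", "bm_george"):
    print(name, "->", lang_for_voice(name))
```

Failing loudly on an unknown prefix is a design choice of this sketch: it avoids silently phonemizing text with the wrong accent.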


In [6]:
# 4️⃣ Display the 24 kHz audio and print the output phonemes
from IPython.display import display, Audio
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)


ɪnðə kwˈaɪət kˈɔːɹnɚz ʌv ˌaʊɚ mˈaɪndz,wˌɛɹ θˈɔːts dˈæns ænd ʃˈædoʊz ˌɪntɚtwˈaɪn,ðɛɹ lˈaɪz ɐ plˈeɪs bˈoʊθ sˈɔft ænd bɹˈaɪt,ɐ sˈæŋktjuːˌɛɹi fɚðə hˈɑːɹts ˈoʊn fˈaɪt.mˈɛntəl hˈɛlθ ɪz lˈaɪk ɐ dʒˈɛntəl stɹˈiːm,sˈʌmtaɪmz kˈɑːm, æt tˈaɪmz ɪt tˈiːmz.bˌʌt wɛn wˈeɪvz ɹˈaɪz, θɹˈɛʔn̩ɪŋ tə kɹˈɛst,ɪts nˈoʊɪŋ jʊɹ nˌɑːt ɐlˈoʊn ðæt bɹˈɪŋz ɹˈɛst.wiː ˈɔːl dʒˈɜːni θɹuː lˈaɪfz wˈaɪndɪŋ pˈæθ,ˈiːtʃ wɪð ˌaʊɚ tɹˈaɪəlz, ˈiːtʃ wɪð ˌaʊɚ ɹˈæθ.jˈɛt ɪnðə pɹˈɛzəns ʌv ðoʊz hˌuː kˈɛɹ,ˌaʊɚ bˈɜːdənz lˈaɪʔn̩, ænd wiː fˈaɪnd ɹᵻpˈɛɹ.sˈiːɪŋ sˈʌm
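
The phoneme output above appears to stop partway through the fourth stanza, which is consistent with the "Truncated to 510 tokens" message: the rest of the poem is never spoken. One possible workaround is to split the text into smaller pieces and synthesize each piece separately. The sketch below is an assumption-laden illustration, not the author's method: `chunk_text` is a hypothetical helper, and its character budget (`max_chars`) is only a rough proxy for the 510-token limit, which is counted in phoneme tokens rather than characters.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack lines/sentences into chunks of at most max_chars characters.

    A single piece longer than max_chars is kept whole as its own chunk.
    """
    # Split on line breaks and on whitespace that follows ., ! or ?
    pieces = [p.strip() for p in re.split(r"\n+|(?<=[.!?])\s+", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "One. Two. Three.\nFour."
print(chunk_text(sample, max_chars=10))
```

In the notebook, each chunk could then be passed to `generate(MODEL, chunk, VOICEPACK, lang=VOICE_NAME[0])` and the resulting audio arrays joined (for example with `numpy.concatenate`, assuming `generate` returns a NumPy array as the `Audio(data=audio, ...)` call suggests). The budget would need tuning until the truncation message no longer appears.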


Note: I was not able to get the code below working. I applied the changes Colab suggested, but could not fix the error. Maybe you can.


## run kokoro-tts with onnx

In [7]:
import io
import json

import numpy as np
import requests
import torch

voices = [
    "af",
    "af_bella",
    "af_nicole",
    "af_sarah",
    "af_sky",
    "am_adam",
    "am_michael",
    "bf_emma",
    "bf_isabella",
    "bm_george",
    "bm_lewis",
]
voices_json = {}
pattern = "https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/{voice}.pt"
for voice in voices:
    url = pattern.format(voice=voice)
    print(f"Downloading {url}")
    r = requests.get(url)
    content = io.BytesIO(r.content)
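    # weights_only=True restricts unpickling to tensors, matching the earlier voicepack load
    voice_data: np.ndarray = torch.load(content, weights_only=True).numpy()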
    voices_json[voice] = voice_data.tolist()

with open("/content/voices.json", "w") as f:
    json.dump(voices_json, f, indent=4)

Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af.pt




Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_bella.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_nicole.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_sarah.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_sky.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/am_adam.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/am_michael.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/bf_emma.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/bf_isabella.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/bm_george.pt
Downloading https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/bm_lewis.pt


In [8]:
!ls -lh /content/voices.json

-rw-r--r-- 1 root root 52M Jan 14 12:50 /content/voices.json


In [9]:
!wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/kokoro-v0_19.onnx -O /content/kokoro-v0_19.onnx
!wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/voices.json -O /content/kokoro-voices.json


--2025-01-14 12:50:52--  https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/kokoro-v0_19.onnx
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/911666237/95b2ba24-78a2-4e31-a4a2-053c54b97b3c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250114%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250114T125052Z&X-Amz-Expires=300&X-Amz-Signature=83a742d71ff8407baeadfb98d02c87296050fa0952824590388d6f89bc4ad131&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dkokoro-v0_19.onnx&response-content-type=application%2Foctet-stream [following]
--2025-01-14 12:50:52--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/911666237/95b2ba24-78a2-4e31-a4a2-053c54b97b3c?X-Amz-Algorithm=AWS4-HMAC-S

In [10]:
!ls -lh /content/kokoro-voices.json /content/kokoro-v0_19.onnx

-rw-r--r-- 1 root root 311M Jan 12 00:07 /content/kokoro-v0_19.onnx
-rw-r--r-- 1 root root  52M Jan  3 16:34 /content/kokoro-voices.json


In [11]:
!pip uninstall -q phonemizer # kokoro-onnx uses phonemizer-fork instead (text -> phonemes)
!pip install -Uq kokoro-onnx


Proceed (Y/n)? y

In [13]:
!pip install --upgrade kokoro-onnx



In [15]:
!pip install --upgrade git+https://github.com/thewh1teagle/kokoro-onnx.git

Collecting git+https://github.com/thewh1teagle/kokoro-onnx.git
  Cloning https://github.com/thewh1teagle/kokoro-onnx.git to /tmp/pip-req-build-huvqfj9p
  Running command git clone --filter=blob:none --quiet https://github.com/thewh1teagle/kokoro-onnx.git /tmp/pip-req-build-huvqfj9p
  Resolved https://github.com/thewh1teagle/kokoro-onnx.git to commit 59cc383f3756c590096b0051c23a00dfe5e3ae68
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [16]:
import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("/content/kokoro-v0_19.onnx", "/content/kokoro-voices.json")
samples, sample_rate = kokoro.create(
    "Hello. This audio generated by kokoro!", voice="af_sarah", speed=1.0, lang="en-us"
)
sf.write("audio.wav", samples, sample_rate)
print("Created audio.wav")

AttributeError: type object 'EspeakWrapper' has no attribute 'set_data_path'
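
The error says that the `EspeakWrapper` class visible to this session has no `set_data_path` attribute. One plausible cause, stated here as an unverified assumption, is that the original `phonemizer` package imported earlier in this notebook is still loaded in memory, so the fork that `kokoro-onnx` expects never takes effect; in that case restarting the runtime after reinstalling might help. The sketch below only diagnoses the situation. `class_has_attr` is a hypothetical helper, and the module path `phonemizer.backend.espeak.wrapper` is assumed from the upstream package layout.

```python
import importlib


def class_has_attr(module_path: str, class_name: str, attr: str):
    """Return True/False if the class can be imported, or None if it cannot."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return None  # module is not installed or not importable
    cls = getattr(module, class_name, None)
    if cls is None:
        return None  # module exists but does not define the class
    return hasattr(cls, attr)


# Assumed module path; adjust if the installed package is laid out differently.
result = class_has_attr(
    "phonemizer.backend.espeak.wrapper", "EspeakWrapper", "set_data_path"
)
if result is None:
    print("EspeakWrapper could not be imported")
elif result:
    print("EspeakWrapper.set_data_path is available")
else:
    print("EspeakWrapper lacks set_data_path (a different phonemizer build is loaded)")
```

If the last message is printed in a fresh runtime as well, the installed package itself is the one without `set_data_path`, and the stale-import explanation does not apply.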

In [17]:
from IPython.display import display, Audio
display(Audio(data="audio.wav",rate=sample_rate))



NameError: name 'sample_rate' is not defined

In [None]:
import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("/content/kokoro-v0_19.onnx", "/content/kokoro-voices.json")
samples, sample_rate = kokoro.create(
    "Hello. 你好啊！从前，有一个小女孩，名叫莉莉。她喜欢在阳光下外面玩耍。有一天，她在后院看到一棵柠檬树。它很高，上面结满了柠檬。", voice="af_sarah", speed=1.0, lang="cmn"
)





In [None]:
from IPython.display import display, Audio
display(Audio(data=samples, rate=sample_rate, autoplay=True))