# Fish Speech

### For Windows users

In [None]:
!chcp 65001

### For Linux users

In [33]:
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

import os
import nltk

nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import glob


[nltk_data] Downloading package punkt to /home/eingerman/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Prepare Model

In [None]:
# Users in China may want to use a mirror to speed up the download.
# `!set` / `!export` only affect a throwaway subshell, so set the variable with %env instead:
# %env HF_ENDPOINT=https://hf-mirror.com

!huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini/
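
The mirror setting only takes effect if the variable is visible to the process that performs the download. A minimal sketch of one way to arrange that from Python, assuming `huggingface-cli` is on the PATH; the helper names `mirror_env` and `download_command` are hypothetical and not part of Fish Speech:

```python
import os


def mirror_env(base_env, endpoint=None):
    """Return a copy of base_env, with HF_ENDPOINT set when a mirror is given."""
    env = dict(base_env)
    if endpoint:
        env["HF_ENDPOINT"] = endpoint
    return env


def download_command(repo_id, local_dir):
    """Build the same huggingface-cli call as the cell above, as an argument list."""
    return ["huggingface-cli", "download", repo_id, "--local-dir", local_dir]


env = mirror_env(os.environ, "https://hf-mirror.com")
cmd = download_command("fishaudio/openaudio-s1-mini", "checkpoints/openaudio-s1-mini/")
print(cmd, env["HF_ENDPOINT"])
# To actually download: subprocess.run(cmd, env=env, check=True)
```

Passing `env` explicitly keeps the mirror choice local to this one call instead of changing the whole notebook session.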

## WebUI Inference

> You can pass `--compile` to fuse CUDA kernels for faster inference (up to roughly 10x).

In [None]:
!python tools/run_webui.py \
    --llama-checkpoint-path checkpoints/openaudio-s1-mini \
    --decoder-checkpoint-path checkpoints/openaudio-s1-mini/codec.pth \
    # --compile

## Break-down CLI Inference

### 1. Encode reference audio

This step should produce a `fake.npy` file, which step 2 uses as the voice prompt.

In [31]:
## Enter the path to the audio file here
src_audio = r"../outputs/kokoro/1.txt.wav"

!python fish_speech/models/dac/inference.py \
    -i {src_audio} \
    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"

from IPython.display import Audio, display
audio = Audio(filename="fake.wav")
display(audio)

  WeightNorm.apply(module, name, dim)
2025-06-08 14:05:42.152 | INFO     | __main__:load_model:46 - Loaded model: <All keys matched successfully>
2025-06-08 14:05:42.152 | INFO     | __main__:main:75 - Processing in-place reconstruction of ../outputs/kokoro/1.txt.wav
2025-06-08 14:05:42.237 | INFO     | __main__:main:84 - Loaded audio with 10.05 seconds
2025-06-08 14:05:43.693 | INFO     | __main__:main:95 - Generated indices of shape torch.Size([10, 217])
2025-06-08 14:05:44.030 | INFO     | __main__:main:112 - Generated audio of shape torch.Size([1, 1, 444416]), equivalent to 10.08 seconds from 217 features, features/second: 21.53
2025-06-08 14:05:44.051 | INFO     | __main__:main:119
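
The figures in this log can be cross-checked by hand. The sketch below reproduces the reported duration and feature rate from the tensor shapes; the 44100 Hz sample rate is an assumption (the log does not print it), chosen because it matches the reported 10.08 seconds:

```python
num_samples = 444416   # last dimension of the generated audio tensor in the log
num_frames = 217       # second dimension of the generated indices in the log
sample_rate = 44100    # assumption: not printed in the log

duration = num_samples / sample_rate         # seconds of audio
frames_per_second = num_frames / duration    # feature (token frame) rate
samples_per_frame = num_samples / num_frames

print(f"{duration:.2f} s, {frames_per_second:.2f} frames/s, {samples_per_frame:.0f} samples/frame")
# prints: 10.08 s, 21.53 frames/s, 2048 samples/frame
```

Under that assumption each frame of codes covers 2048 audio samples, which is why 217 frames decode to slightly more audio than the 10.05 seconds that were loaded.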

### 2. Generate semantic tokens from text

> This command creates a `codes_N` file, where N is an integer starting from 0; the cell below reads it from `temp/codes_0.npy`.

> Pass `--compile` to fuse CUDA kernels for faster inference (~30 tokens/second -> ~300 tokens/second).
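
Because N increases with each sample, it can be handy to pick up the newest codes file programmatically instead of hard-coding `codes_0`. A minimal sketch, assuming the files are named `codes_N.npy` as in the cell below; `latest_codes_file` is a hypothetical helper, not part of Fish Speech:

```python
import re
import tempfile
from pathlib import Path


def latest_codes_file(directory="."):
    """Return the codes_N.npy path with the largest N, or None if there is none."""
    pattern = re.compile(r"codes_(\d+)\.npy")
    best = None
    for path in Path(directory).iterdir():
        match = pattern.fullmatch(path.name)
        # Compare N numerically so that codes_10 ranks above codes_2.
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), path)
    return best[1] if best else None


# Demonstration on empty placeholder files in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    for n in (0, 2, 10):
        (Path(tmp) / f"codes_{n}.npy").touch()
    newest = latest_codes_file(tmp).name

print(newest)  # codes_10.npy
```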

In [36]:
# with open("../inputs/2.txt", "r") as f:
#     text = f.read()
#     text = text.replace("\"", "\\""") #.replace("\n", " ")
#     print(text)
# text = "She sells seashells on the seashore. The shells she sells are seashells, I’m sure, so if she sells seashells on the seashore, then I’m sure she sells seashore shells."
# Synthesize each input file sentence by sentence, then join the clips with sox.
os.makedirs("../outputs/fish-speech", exist_ok=True)
for fname in glob.glob("../inputs/*.txt"):
    with open(fname, "r", encoding="utf-8") as f:
        texts = f.read().strip()
    print(texts)
    cmd = "sox"
    # Split the text into sentences using nltk
    for i, text in enumerate(sent_tokenize(texts)):
        # Escape double quotes for the shell and flatten newlines
        text = text.replace('"', '\\"').replace("\n", " ")
        print(f"Processing text: {text}")
        !python fish_speech/models/text2semantic/inference.py \
            --text "{text}" \
            --prompt-text "Expressive audiobook text" \
            --prompt-tokens "fake.npy" \
            --checkpoint-path "checkpoints/openaudio-s1-mini" \
            --num-samples 1 --compile
        !python fish_speech/models/dac/inference.py \
            -i "temp/codes_0.npy" \
            --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
            -o "temp/{i}.wav"
        cmd += f" temp/{i}.wav"
    # The last argument to sox is the concatenated output file
    cmd += f' "../outputs/fish-speech/{os.path.basename(fname)}.wav"'
    print(cmd)
    !{cmd}



Of all the problems which have been submitted to my friend, Mr. Sherlock Holmes, for solution during the years of our intimacy, there were only two which I was the means of introducing to his notice—that of Mr. Hatherley’s thumb, and that of Colonel Warburton’s madness.
Of these the latter may have afforded a finer field for an acute and original observer, but the other was so strange in its inception and so dramatic in its details that it may be the more worthy of being placed upon record, even if it gave my friend fewer openings for those deductive methods of reasoning by which he achieved such remarkable results.
The story has, I believe, been told more than once in the newspapers, but, like all such narratives, its effect is much less striking when set forth en bloc in a single half-column of print than when the facts slowly evolve before your own eyes, and the mystery clears gradually away as each new discovery furnishes a step which leads on to the complete truth.
At the time the
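
Interpolating raw text into a shell line is fragile when a sentence contains quotes, `$`, or backticks. A minimal sketch of an alternative that passes the arguments as a list, so no shell quoting is involved; the flags are copied from the cell above, while `semantic_command` and `sox_command` are hypothetical helper names:

```python
import sys


def semantic_command(text, prompt_tokens="fake.npy"):
    """Argument list for the text-to-semantic step; text is passed through unescaped."""
    return [
        sys.executable, "fish_speech/models/text2semantic/inference.py",
        "--text", text,
        "--prompt-text", "Expressive audiobook text",
        "--prompt-tokens", prompt_tokens,
        "--checkpoint-path", "checkpoints/openaudio-s1-mini",
        "--num-samples", "1", "--compile",
    ]


def sox_command(num_clips, out_path):
    """Argument list that concatenates temp/0.wav .. temp/{num_clips-1}.wav."""
    return ["sox", *[f"temp/{i}.wav" for i in range(num_clips)], out_path]


sentence = 'He said "stop", then paid $5.'
cmd = semantic_command(sentence)
print(cmd[3])  # the sentence arrives as one argument, exactly as written
print(sox_command(3, "../outputs/fish-speech/example.wav"))
# To run a step: subprocess.run(cmd, check=True) from the fish-speech repository root.
```

The trade-off is that the `!` shorthand is shorter to read in a notebook, while the list form avoids having to escape the text at all.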

### 3. Generate speech from semantic tokens

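In [None]:
# Decode the semantic tokens from step 2; without -o the result is written to fake.wav
!python fish_speech/models/dac/inference.py \
    -i "temp/codes_0.npy" \
    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"

from IPython.display import Audio, display
audio = Audio(filename="fake.wav")
display(audio)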

  WeightNorm.apply(module, name, dim)
2025-06-06 21:11:38.665 | INFO     | __main__:load_model:46 - Loaded model: <All keys matched successfully>
2025-06-06 21:11:38.666 | INFO     | __main__:main:100 - Processing precomputed indices from temp/codes_0.npy
Traceback (most recent call last):
  File "/home/eingerman/Projects/TTS/TTSEval/fish-speech/fish_speech/models/dac/inference.py", line 123, in <module>
    main()
  File "/home/eingerman/Projects/TTS/TTSEval/fish-speech/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/eingerman/Projects/TTS/TTSEval/fish-speech/.venv/lib/python3.10/site-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
  File "/home/eingerman/Projects/TTS/TTSEval/fish-speech/.venv/lib/python3.10/site-packages/click/core.py", line 1363, in ma