<a href="https://colab.research.google.com/github/akashjss/hibiki/blob/main/Demo_Hibiki.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hibiki: High-Fidelity Simultaneous Speech-To-Speech Translation


Hibiki is a model for **streaming speech translation** (also known as
*simultaneous* translation). Unlike offline translation, where one waits for the end of the source utterance before
translating, Hibiki **adapts its flow** to accumulate just enough context to produce a correct translation in real time,
chunk by chunk. As the user speaks, Hibiki generates natural speech in the target language,
optionally with voice transfer, **along with a text translation**.
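
To make "chunk by chunk" concrete, here is a minimal sketch of how a waveform could be cut into fixed-size frames before being fed to a streaming model. The 24 kHz sample rate and 12.5 Hz frame rate are assumptions based on the Mimi codec that Hibiki loads, and `iter_frames` is a hypothetical helper, not part of the `moshi` package.

```python
# Assumed values (not stated in this notebook): Mimi works on 24 kHz audio
# at 12.5 frames per second, i.e. 80 ms of audio per frame.
SAMPLE_RATE = 24_000
FRAME_RATE = 12.5
FRAME_SIZE = int(SAMPLE_RATE / FRAME_RATE)  # 1920 samples


def iter_frames(samples, frame_size=FRAME_SIZE):
    """Yield consecutive frames of `frame_size` samples, zero-padding the last one."""
    for start in range(0, len(samples), frame_size):
        frame = list(samples[start:start + frame_size])
        # Pad the final, possibly shorter, frame with silence.
        frame += [0.0] * (frame_size - len(frame))
        yield frame


one_second = [0.0] * SAMPLE_RATE
frames = list(iter_frames(one_second))
print(len(frames), len(frames[0]))  # 13 1920
```

A streaming translator consumes such frames one at a time and may emit output before the input ends, whereas an offline system would wait for the full list.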

For more information, check out our
[[repo]](https://github.com/kyutai-labs/hibiki),
[[samples]](https://huggingface.co/spaces/kyutai/hibiki-samples), and [[paper]](https://arxiv.org/abs/2410.00037).

In [None]:
!pip install "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
!pip install gradio
!wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3

Collecting moshi
  Cloning https://****@github.com/kyutai-labs/moshi.git to /tmp/pip-install-vg4deqhj/moshi_b3c883244e884c32b84de51d91d8dbc0
  Running command git clone --filter=blob:none --quiet 'https://****@github.com/kyutai-labs/moshi.git' /tmp/pip-install-vg4deqhj/moshi_b3c883244e884c32b84de51d91d8dbc0
  Resolved https://****@github.com/kyutai-labs/moshi.git to commit 0146d47f29726b134730acfd6f56f3575c4b236f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bitsandbytes<0.46,>=0.45 (from moshi)
  Downloading bitsandbytes-0.45.2-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting sounddevice==0.5 (from moshi)
  Downloading sounddevice-0.5.0-py3-none-any.whl.metadata (1.4 kB)
Collecting sphn>=0.1.4 (from moshi)
  Downloading sphn-0.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.2 kB)
Collecting nvidia-cuda-nvrtc-c

In [None]:
import IPython.display as ipd  # type: ignore
# Let's translate a demo file we have downloaded from the Moshi repo.
# Batch size controls the number of parallel translations of the same file.
# With --batch-size 2, the outputs are written to crepes_out-0.wav and crepes_out-1.wav.
# A CFG coefficient above 1 can increase speaker similarity, but large values lead to artifacts.
!python -m moshi.run_inference sample_fr_hibiki_crepes.mp3 crepes_out.wav --hf-repo kyutai/hibiki-1b-pytorch-bf16 --cfg-coef 3 --half --batch-size 2
# Also available in a 2B version for better quality.
# !python -m moshi.run_inference sample_fr_hibiki_crepes.mp3 crepes_out.wav --hf-repo kyutai/hibiki-2b-pytorch-bf16 --cfg-coef 3
ipd.display(ipd.Audio('crepes_out-0.wav'))
ipd.display(ipd.Audio('crepes_out-1.wav'))
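
The cell above asks for `crepes_out.wav` with `--batch-size 2` and then plays `crepes_out-0.wav` and `crepes_out-1.wav`. A small sketch of that naming pattern, inferred only from those two file names (it is an assumption, not documented behavior of `moshi.run_inference`), makes it easier to change the batch size without editing the display lines by hand. `batch_output_paths` is a hypothetical helper.

```python
from pathlib import Path


def batch_output_paths(output, batch_size):
    """Return the per-item output paths, assuming the '<stem>-<index><suffix>' pattern."""
    out = Path(output)
    return [out.with_name(f"{out.stem}-{i}{out.suffix}") for i in range(batch_size)]


paths = batch_output_paths("crepes_out.wav", 2)
print([str(p) for p in paths])  # ['crepes_out-0.wav', 'crepes_out-1.wav']
```

In the notebook, one could then loop over `paths` and call `ipd.display(ipd.Audio(str(p)))` for each path that exists.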

In [None]:
# You can also run the model through our WebUI for live translation.
# Click on the gradio.live link!
! python -m moshi.server --gradio-tunnel --hf-repo kyutai/hibiki-1b-pytorch-bf16 --cfg-coef 3 --half

[1;34m[Info][0m retrieving checkpoint
config.json: 100% 1.52k/1.52k [00:00<00:00, 11.7MB/s]
hibikim-pytorch-dc2cf5a5@80.safetensors: 100% 3.60G/3.60G [01:25<00:00, 42.2MB/s]
mimi-pytorch-e351c8d8@125.safetensors: 100% 385M/385M [00:09<00:00, 42.2MB/s]
tokenizer_spm_48k_multi6_2.model: 100% 857k/857k [00:00<00:00, 49.1MB/s]
[1;34m[Info][0m loading mimi
[1;34m[Info][0m mimi loaded
[1;34m[Info][0m loading moshi
[1;34m[Info][0m moshi loaded
[1;34m[Info][0m warming up the model
[1;34m[Info][0m retrieving the static content
dist.tgz: 100% 589k/589k [00:00<00:00, 2.68MB/s]
[1;34m[Info][0m serving static content from /root/.cache/huggingface/hub/models--kyutai--moshi-artifacts/snapshots/8481e95f73827e4e70ac7311c12b0be099276182/dist
[1;34m[Info][0m Access the Web UI directly at http://localhost:8998
[1;34m[Info][0m Tunnel started, if executing on a remote GPU, you can use https://76b0fa9da2a181f479.gradio.live.
[1;34m[Info][0m Note that this tunnel goes through the US and 

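The server log above prints both a local address and a `gradio.live` tunnel address. If you capture that log as text, the following sketch pulls the tunnel link out of it. It assumes the log format shown in the output above and that color codes arrive as standard ANSI escape sequences; `extract_urls` and `find_tunnel_url` are hypothetical helpers, not part of `moshi`.

```python
import re

# Matches ANSI escape sequences such as the color codes around "[Info]".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")
URL = re.compile(r"https?://[^\s,]+")


def extract_urls(log_text):
    """Return all URLs in the log, without color codes or a trailing period."""
    clean = ANSI_ESCAPE.sub("", log_text)
    return [url.rstrip(".") for url in URL.findall(clean)]


def find_tunnel_url(log_text):
    """Return the first gradio.live URL in the log, or None if there is none."""
    for url in extract_urls(log_text):
        if url.endswith(".gradio.live"):
            return url
    return None


sample_log = (
    "\x1b[1;34m[Info]\x1b[0m Access the Web UI directly at http://localhost:8998\n"
    "\x1b[1;34m[Info]\x1b[0m Tunnel started, if executing on a remote GPU, "
    "you can use https://76b0fa9da2a181f479.gradio.live.\n"
)
print(find_tunnel_url(sample_log))  # https://76b0fa9da2a181f479.gradio.live
```
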
In [None]:
from pathlib import Path
import shlex

from google.colab import files

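def from_upload(model='hibiki-1b', batch_size=1, cfg_coef=3.):
    """Upload audio files from your machine and translate each of them.

    `model` should be 'hibiki-1b' or 'hibiki-2b'.
    `batch_size` is the number of parallel translations generated per file.
    `cfg_coef` above 1 can increase speaker similarity; large values cause artifacts.
    """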
    out_path = Path('dst_en')
    in_path = Path('src_fr')

    in_path.mkdir(exist_ok=True, parents=True)
    out_path.mkdir(exist_ok=True, parents=True)

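    # Open the Colab upload dialog and save every uploaded file under `in_path`.
    uploaded = files.upload()
    for name, content in uploaded.items():
        (in_path / name).write_bytes(content)

    # Translate each uploaded file with a separate run of the inference script.
    for name in uploaded:
        argv = [
            'python', '-m', 'moshi.run_inference', '--hf-repo',
            f'kyutai/{model}-pytorch-bf16', '--cfg-coef', str(cfg_coef), '--half',
            '--batch-size', str(batch_size),
            str(in_path / name), str(out_path / name),
        ]
        # Quote every argument so that file names with spaces survive the shell.
        command = " ".join(shlex.quote(x) for x in argv)
        !{command}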

from_upload()
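
The command construction in `from_upload` can be checked without Colab or a GPU. The sketch below rebuilds the same argument list in a plain function. Two things are assumptions rather than behavior taken from the cell above: the model name is validated against the two names in the docstring, and the output file is renamed to `.wav`, on the guess that the script writes WAV audio as in the `crepes_out.wav` example. `build_argv` is a hypothetical helper.

```python
import shlex
from pathlib import Path

KNOWN_MODELS = ("hibiki-1b", "hibiki-2b")


def build_argv(name, model="hibiki-1b", batch_size=1, cfg_coef=3.0,
               in_dir="src_fr", out_dir="dst_en"):
    """Build the moshi.run_inference argument list for one uploaded file."""
    if model not in KNOWN_MODELS:
        raise ValueError(f"model must be one of {KNOWN_MODELS}, got {model!r}")
    src = Path(in_dir) / name
    # Assumption: the output is WAV, so give it a matching extension.
    dst = (Path(out_dir) / name).with_suffix(".wav")
    return [
        "python", "-m", "moshi.run_inference",
        "--hf-repo", f"kyutai/{model}-pytorch-bf16",
        "--cfg-coef", str(cfg_coef), "--half",
        "--batch-size", str(batch_size),
        str(src), str(dst),
    ]


argv = build_argv("my talk.mp3", model="hibiki-2b", batch_size=2)
# shlex.join quotes each argument, so the space in the file name is preserved.
print(shlex.join(argv))
```

Outside a notebook, the returned list can be passed directly to `subprocess.run(argv, check=True)`, which avoids shell quoting altogether.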