<a href="https://colab.research.google.com/github/benjaminsinzore/ASP.NET-MYSQL/blob/main/Fine_Tuning_Moshi_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with Fine-Tuning Moshi 7B

This notebook walks through a simple example of LoRA fine-tuning Moshi 7B. You can run it in Google Colab on an A100 GPU.

<a target="_blank" href="https://colab.research.google.com/github/kyutai-labs/moshi-finetune/blob/main/tutorials/moshi_finetune.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Check out the `moshi-finetune` GitHub repo to learn more: https://github.com/kyutai-labs/moshi-finetune/
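
Before installing anything, it can help to confirm which GPU the runtime actually attached. The sketch below is not part of the `moshi-finetune` tutorial: `list_gpus` is a hypothetical helper that shells out to `nvidia-smi` and returns an empty list when the tool is unavailable or fails.

```python
import shutil
import subprocess


def list_gpus() -> list[str]:
    """Return one 'name, total memory' string per visible GPU, or [] if none is found."""
    # nvidia-smi ships with the NVIDIA driver; it is absent on CPU-only runtimes.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU detected; switch the Colab runtime type to a GPU.")
```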


## Installation

Clone the `moshi-finetune` repo:


In [1]:
%cd /content/
!git clone https://github.com/kyutai-labs/moshi-finetune.git

/content
Cloning into 'moshi-finetune'...
remote: Enumerating objects: 227, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 227 (delta 28), reused 24 (delta 24), pack-reused 190 (from 1)
Receiving objects: 100% (227/227), 623.92 KiB | 7.01 MiB/s, done.
Resolving deltas: 100% (127/127), done.


Install all required dependencies:


In [2]:
%pip install -e /content/moshi-finetune

Obtaining file:///content/moshi-finetune
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting moshi@ git+https://github.com/kyutai-labs/moshi.git#subdirectory=moshi (from finetune==0.0.0)
  Cloning https://github.com/kyutai-labs/moshi.git to /tmp/pip-install-1nc0qg6s/moshi_57d0eb88241e4a2382a5fbceb0a5c9e7
  Running command git clone --filter=blob:none --quiet https://github.com/kyutai-labs/moshi.git /tmp/pip-install-1nc0qg6s/moshi_57d0eb88241e4a2382a5fbceb0a5c9e7
  Resolved https://github.com/kyutai-labs/moshi.git to commit 62d0154eb199074c459f1fef2ef71028486fd528
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fire (from finetune==0.0.0)
  Dow

## Prepare dataset


In [None]:
from pathlib import Path
from huggingface_hub import snapshot_download
import time

Path("/content/data/daily-talk-contiguous").mkdir(parents=True, exist_ok=True)

# Download the dataset with retries and delay
local_dir = None
retries = 3  # Number of retries
delay = 5    # Delay in seconds between retries

for i in range(retries):
    try:
        local_dir = snapshot_download(
            "kyutai/DailyTalkContiguous",
            repo_type="dataset",
            local_dir="/content/data/daily-talk-contiguous",
        )
        break  # Exit loop if successful
    except Exception as e:
        if "429" in str(e):  # Check for rate limit error
            print(f"Rate limit hit. Retrying in {delay} seconds... (Attempt {i+1}/{retries})")
            time.sleep(delay)
        else:
            raise e  # Re-raise other exceptions

if local_dir is None:
    print("Failed to download dataset after multiple retries.")
else:
    print("Dataset downloaded successfully!")

Fetching 5085 files:   0%|          | 0/5085 [00:00<?, ?it/s]


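The training configuration later in this notebook points `train_data` at `dailytalk.jsonl` inside the downloaded folder. A quick sanity check on that manifest can catch a partial download early. The sketch below assumes each line of the manifest is a JSON object with a `path` and a `duration` (in seconds) key; that schema and the helper name `summarize_manifest` are assumptions, so adjust them if your file differs. The demo runs on a throwaway file with made-up entries.

```python
import json
import tempfile
from pathlib import Path


def summarize_manifest(manifest_path: Path) -> tuple[int, float]:
    """Return (number of entries, total duration in seconds) for a JSONL manifest."""
    count, total_seconds = 0, 0.0
    with manifest_path.open() as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            entry = json.loads(line)
            # Assumed schema: {"path": "<audio file>", "duration": <seconds>}
            total_seconds += float(entry["duration"])
            count += 1
    return count, total_seconds


# Demo on a temporary manifest with invented entries.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text(
        '{"path": "a.wav", "duration": 12.5}\n'
        '{"path": "b.wav", "duration": 30.0}\n'
    )
    print(summarize_manifest(demo))  # (2, 42.5)
```

On Colab, the same call would be `summarize_manifest(Path("/content/data/daily-talk-contiguous/dailytalk.jsonl"))`.
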
## Start training


In [None]:
# environment settings needed for training: expose only GPU 0, ordered by PCI bus ID
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
# define training configuration
# for your own use cases, you might want to change the data paths, model path, run_dir, and other hyperparameters
import yaml

config = """
# data
data:
  train_data: '/content/data/daily-talk-contiguous/dailytalk.jsonl' # Fill
  eval_data: '' # Optionally Fill
  shuffle: true

# model
moshi_paths:
  hf_repo_id: "kyutai/moshiko-pytorch-bf16"


full_finetuning: false # Activate lora.enable if partial finetuning
lora:
  enable: true
  rank: 128
  scaling: 2.
  ft_embed: false

# training hyperparameters
first_codebook_weight_multiplier: 100.
text_padding_weight: .5


# tokens per training step = batch_size x num_GPUs x duration_sec
# we recommend a sequence duration of 300 seconds
# If you run into memory errors, try reducing the sequence length (duration_sec)
duration_sec: 100
batch_size: 1
max_steps: 300

gradient_checkpointing: true # Activate checkpointing of layers

# optim
optim:
  lr: 2.e-6
  weight_decay: 0.1
  pct_start: 0.05

# other
seed: 0
log_freq: 10
eval_freq: 1
do_eval: False
ckpt_freq: 10

save_adapters: True

run_dir: "/content/test"  # Fill
"""

# save the same file locally into the example.yaml file
with open("/content/example.yaml", "w") as file:
    yaml.dump(yaml.safe_load(config), file)
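
To make the `batch_size x num_GPUs x duration_sec` comment in the config concrete, here is a small worked calculation of how much audio a run with these settings consumes. It is a back-of-the-envelope sketch, not code from `moshi-finetune`; `audio_seconds_per_step` is a hypothetical helper, and the single GPU matches the `--nproc-per-node 1` launch used in this notebook.

```python
def audio_seconds_per_step(batch_size: int, num_gpus: int, duration_sec: float) -> float:
    """Seconds of audio consumed by one training step, following the config comment."""
    return batch_size * num_gpus * duration_sec


# Values copied from the config above: batch_size=1, duration_sec=100, max_steps=300.
per_step = audio_seconds_per_step(batch_size=1, num_gpus=1, duration_sec=100)
total_hours = per_step * 300 / 3600

print(per_step)               # 100
print(round(total_hours, 2))  # 8.33
```

Raising `duration_sec` to the recommended 300 triples the audio seen per step, which is why memory use grows with it.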

In [None]:
# make sure the run_dir does not already exist
# only run this if a previous torchrun already created the /content/test directory
# ! rm -r /content/test
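
The comment above asks that `run_dir` not exist before training starts. If you prefer a guarded cleanup over an unconditional `rm -r`, a minimal sketch could look like the following. `reset_run_dir` is a hypothetical helper, and the demo deliberately runs on a temporary directory rather than a real training run.

```python
import shutil
import tempfile
from pathlib import Path


def reset_run_dir(run_dir: str) -> bool:
    """Delete run_dir if it exists; return True when something was removed."""
    path = Path(run_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


# Demo on a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "test"
    scratch.mkdir()
    print(reset_run_dir(str(scratch)))  # True: the directory existed and was removed
    print(reset_run_dir(str(scratch)))  # False: nothing left to remove
```

In this notebook the call would be `reset_run_dir("/content/test")`. It permanently deletes any checkpoints stored there.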

In [None]:
# start training

!cd /content/moshi-finetune && torchrun --nproc-per-node 1 -m train /content/example.yaml

## Inference

Once the model has been trained, inference can also run on the Colab GPU, and Gradio can be used to tunnel the audio data from a local client to the notebook.

More details on how to set this up can be found in the [moshi readme](https://github.com/kyutai-labs/moshi?tab=readme-ov-file#python-pytorch).


In [None]:
!pip install gradio

In [None]:
!python -m moshi.server --gradio-tunnel --lora-weight=/content/test/checkpoints/checkpoint_000300/consolidated/lora.safetensors --config-path=/content/test/checkpoints/checkpoint_000300/consolidated/config.json
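
The command above hard-codes `checkpoint_000300`, which matches `max_steps: 300` in the config. If you change `max_steps` or `ckpt_freq`, the directory name changes too. The sketch below finds the highest-numbered checkpoint, assuming the `<run_dir>/checkpoints/checkpoint_<step>/consolidated/` layout seen in that command; `latest_checkpoint` is a hypothetical helper, and the demo builds a fake directory tree instead of reading a real run.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(run_dir: Path) -> Optional[Path]:
    """Return the `consolidated` folder of the highest-numbered checkpoint, if any."""
    candidates = []
    for ckpt in (run_dir / "checkpoints").glob("checkpoint_*"):
        step = ckpt.name.split("_")[-1]
        # Keep only directories named checkpoint_<digits> that were consolidated.
        if step.isdigit() and (ckpt / "consolidated").is_dir():
            candidates.append((int(step), ckpt / "consolidated"))
    return max(candidates)[1] if candidates else None


# Demo on a fake run directory.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for step in (10, 290, 300):
        (run_dir / "checkpoints" / f"checkpoint_{step:06d}" / "consolidated").mkdir(parents=True)
    best = latest_checkpoint(run_dir)
    print(best.parent.name)  # checkpoint_000300
```

With a real run, `best / "lora.safetensors"` and `best / "config.json"` are the paths to pass to `--lora-weight` and `--config-path`.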