## Before training

This notebook keeps the last 3 generations of model checkpoints in Google Drive. One generation is larger than 1 GB, so you need at least 3 GB of free space in Google Drive. If you do not have that much free space, consider creating another Google Account.
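
The storage requirement above is simple arithmetic (3 generations at roughly 1 GB each), and it can be checked before training starts. The following is a minimal sketch using only the standard library. The function name and its defaults are illustrative and not part of so-vits-svc-fork, and the free space reported for a mounted Drive folder may not match your actual Drive quota.

```python
import shutil


def has_room_for_checkpoints(path=".", generations=3, gb_per_generation=1.0):
    """Return True if `path` has enough free space for the kept generations."""
    # Required space = number of kept generations x size of one generation.
    required_bytes = int(generations * gb_per_generation * 1024**3)
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


# On Colab, the Drive mount point would typically be "/content/drive" (assumed here).
print(has_room_for_checkpoints("."))
```

If the check returns `False`, free up space or lower the number of kept generations before starting a long training run.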

Training requires more than 10 GB of VRAM (a T4 should be enough). Inference needs much less.
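
To compare a GPU against the training requirement, the total memory can be read from the `used / total` column that `nvidia-smi` prints. The sketch below parses that column from plain text. The helper names and the 10 GiB threshold (10240 MiB) are assumptions made for illustration, not values taken from so-vits-svc-fork.

```python
import re


def total_vram_mib(nvidia_smi_text):
    """Return the total memory (MiB) of each GPU found in `nvidia-smi` output."""
    # Matches the "used / total" memory column, e.g. "6MiB /  6144MiB".
    pairs = re.findall(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_text)
    return [int(total) for _used, total in pairs]


def enough_for_training(nvidia_smi_text, required_mib=10 * 1024):
    """Return True if at least one GPU has `required_mib` of total memory."""
    totals = total_vram_mib(nvidia_smi_text)
    return bool(totals) and max(totals) >= required_mib


sample = "|  0%   45C    P8    18W / 125W |      6MiB /  6144MiB |      0%      Default |"
print(total_vram_mib(sample))       # [6144]
print(enough_for_training(sample))  # False
```

A GPU reporting 6144 MiB, as in the sample row, falls below the threshold, so it would be suitable for inference rather than training.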

## Installation

In [1]:
#@title Check GPU
!nvidia-smi

Thu Jun  8 22:41:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   45C    P8    18W / 125W |      6MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
#@title Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
#@title Install dependencies
#@markdown pip may fail to resolve dependencies and print an ERROR; this can be ignored.
!python -m pip install -U pip wheel
%pip install -U ipython 

#@markdown Branch (for development)
BRANCH = "none" #@param {"type": "string"}
if BRANCH == "none":
    %pip install -U so-vits-svc-fork
else:
    %pip install -U git+https://github.com/34j/so-vits-svc-fork.git@{BRANCH}

Collecting pip
  Using cached pip-23.1.2-py3-none-any.whl (2.1 MB)
Collecting wheel
  Using cached wheel-0.40.0-py3-none-any.whl (64 kB)
Installing collected packages: wheel, pip
  Attempting uninstall: wheel
    Found existing installation: wheel 0.38.4
    Uninstalling wheel-0.38.4:
      Successfully uninstalled wheel-0.38.4
  Attempting uninstall: pip
    Found existing installation: pip 23.0.1
    Uninstalling pip-23.0.1:
      Successfully uninstalled pip-23.0.1
Successfully installed pip-23.1.2 wheel-0.40.0
Collecting ipython
  Using cached ipython-8.12.2-py3-none-any.whl (797 kB)
Installing collected packages: ipython
  Attempting uninstall: ipython
    Found existing installation: ipython 8.12.0
    Uninstalling ipython-8.12.0:
      Successfully uninstalled ipython-8.12.0
Successfully installed ipython-8.12.2
Note: you may need to restart the kernel to use updated packages.
Collecting so-vits-svc-fork
  Using cached so_vits_svc_fork-4.0.1-py3-none-any.whl (92 kB)
Collecting S

## Training

In [None]:
#@title Make dataset directory
!mkdir -p "dataset_raw"

In [None]:
#!rm -r "dataset_raw"
#!rm -r "dataset/44k"

In [None]:
#@title Copy your dataset
#@markdown **We assume that your dataset is in your Google Drive's `so-vits-svc-fork/dataset/(speaker_name)` directory.**
DATASET_NAME = "kiritan" #@param {type: "string"}
!cp -R /content/drive/MyDrive/so-vits-svc-fork/dataset/{DATASET_NAME}/ -t "dataset_raw/"
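
Before preprocessing, it can help to confirm that the copy produced one folder per speaker under `dataset_raw/`. The sketch below counts audio files per speaker folder, searching subfolders as well. The function name and the list of file suffixes are illustrative assumptions; the tool itself may accept other formats.

```python
from pathlib import Path

# Illustrative set of audio suffixes; extend it to match your dataset.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".ogg"}


def count_audio_per_speaker(root="dataset_raw"):
    """Map each speaker folder under `root` to its number of audio files."""
    counts = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return counts
    for speaker_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        # rglob("*") walks all nested subfolders of the speaker folder.
        counts[speaker_dir.name] = sum(
            1
            for f in speaker_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in AUDIO_SUFFIXES
        )
    return counts


print(count_audio_per_speaker("dataset_raw"))
```

An empty dictionary or a speaker with zero files usually means the copy step pointed at the wrong Drive folder.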

In [None]:
#@title Download dataset (Tsukuyomi-chan JVS)
#@markdown You can download this dataset if you don't have your own dataset.
#@markdown Make sure you agree to the license when using this dataset.
#@markdown https://tyc.rei-yumesaki.net/material/corpus/#toc6
# !wget https://tyc.rei-yumesaki.net/files/sozai-tyc-corpus1.zip
# !unzip sozai-tyc-corpus1.zip
# !mv "/content/つくよみちゃんコーパス Vol.1 声優統計コーパス（JVSコーパス準拠）/おまけ：WAV（+12dB増幅＆高音域削減）/WAV（+12dB増幅＆高音域削減）" "dataset_raw/tsukuyomi"

In [None]:
#@title Automatic preprocessing
!svc pre-resample

In [None]:
!svc pre-config

In [None]:
#@title Copy config file
!cp configs/44k/config.json drive/MyDrive/so-vits-svc-fork

In [None]:
F0_METHOD = "dio" #@param ["crepe", "crepe-tiny", "parselmouth", "dio", "harvest"]
!svc pre-hubert -fm {F0_METHOD}

In [None]:
#@title Train
%load_ext tensorboard
%tensorboard --logdir drive/MyDrive/so-vits-svc-fork/logs/44k
!svc train --model-path drive/MyDrive/so-vits-svc-fork/logs/44k
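
Training writes generator checkpoints into the model directory, named like `G_XXXX.pth` according to the CLI help text. To see which checkpoint is the most recent, a sketch such as the following can be used. It assumes that the part after `G_` is a step number; the function name is hypothetical and this is not the selection logic the CLI itself uses.

```python
import re
from pathlib import Path


def latest_generator_checkpoint(log_dir="logs/44k"):
    """Return the G_<step>.pth file with the highest step number, or None."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.glob("G_*.pth"):
        # Only accept purely numeric step suffixes, e.g. G_1600.pth.
        match = re.fullmatch(r"G_(\d+)\.pth", path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


print(latest_generator_checkpoint("logs/44k"))
```

Comparing step numbers as integers avoids the pitfall of alphabetical sorting, where `G_90.pth` would sort after `G_1600.pth`.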

## Training the cluster model

In [None]:
!svc train-cluster --output-path drive/MyDrive/so-vits-svc-fork/logs/44k/kmeans.pt

## Inference

In [None]:
#@title Get the author's voice as a source
import random
INDEX = random.randint(1, 49)
NAME = str(INDEX)
TYPE = "fsd50k" #@param ["", "digit", "dog", "fsd50k"]
CUSTOM_FILEPATH = "" #@param {type: "string"}
if CUSTOM_FILEPATH != "":
    # Path to your own file, without the ".wav" extension
    NAME = CUSTOM_FILEPATH
else:
    # Voice samples that can be downloaded directly from the internet are hard to find
    if TYPE == "dog":
        # Assumption: files are numbered with four zero-padded digits
        URL = f"https://huggingface.co/datasets/437aewuh/dog-dataset/resolve/main/dogs/dogs_{INDEX:04d}.wav"
    elif TYPE == "digit":
        # Other speakers: george, jackson, lucas, nicolas, ...
        URL = f"https://github.com/Jakobovski/free-spoken-digit-dataset/raw/master/recordings/0_george_{INDEX}.wav"
    elif TYPE == "fsd50k":
        # "resolve" (not "blob") returns the raw file instead of an HTML page
        URL = f"https://huggingface.co/datasets/Fhrozen/FSD50k/resolve/main/clips/dev/{10000 + INDEX}.wav"
    else:
        URL = f"https://zunko.jp/sozai/utau/voice_{'kiritan' if INDEX < 25 else 'itako'}{INDEX % 5 + 1}.wav"
    # -O sets the output file name; -N has no effect together with -O
    !wget "{URL}" -O "{NAME}.wav"
from IPython.display import Audio, display
display(Audio(f"{NAME}.wav"))

In [1]:
!svc -h

[04:35:55] INFO     [04:35:55] Version: 4.0.1                     __main__.py:31
Usage: svc [OPTIONS] COMMAND [ARGS]...

  so-vits-svc allows any folder structure for training data.
  However, the following folder structure is recommended.
      When training: dataset_raw/{speaker_name}/**/{wav_name}.{any_format}
      When inference: configs/44k/config.json, logs/44k/G_XXXX.pth
  If the folder structure is followed, you DO NOT NEED TO SPECIFY model path, config path, etc.
  (The latest model will be automatically loaded.)
  To train a model, run pre-resample, pre-config, pre-hubert, train.
  To infer a model, run infer.

Options:
  -h

In [2]:
!svc infer -h

[04:36:08] INFO     [04:36:08] Version: 4.0.1                     __main__.py:31
Usage: svc infer [OPTIONS] INPUT_PATH

  Inference

Options:
  -o, --output-path PATH          path to output dir
  -s, --speaker TEXT              speaker name
  -m, --model-path PATH           path to model  [default: logs/44k]
  -c, --config-path PATH          path to config  [default: configs/44k/config.json]
  -k, --cluster-model-path PATH   path to cluster model
  -re, --recursive                Search recursively
  -t, --transpose INTEGER         transpose  [default: 0]
  -db, --db-thresh INTEGER        threshold (DB) (RELATIVE)  [default: -20]
  -f

## Preprocess target

In [1]:
!pip install AudioConverter

Collecting AudioConverter
  Downloading AudioConverter-1.0.0-py3-none-any.whl (6.2 kB)
Collecting click<8.0.0,>=7.1.2 (from AudioConverter)
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82.8/82.8 kB 5.6 MB/s eta 0:00:00
Collecting colorama<1.0.0,>=0.4.3 (from AudioConverter)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting pydub<1.0.0,>=0.24.1 (from AudioConverter)
  Using cached pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub, colorama, click, AudioConverter
  Attempting uninstall: click
    Found existing installation: click 8.1.3
    Uninstalling click-8.1.3:
      Successfully uninstalled click-8.1.3
Successfully installed AudioConverter-1.0.0 click-7.1.2 colorama-0.4.6 pydub-0.25.1


In [5]:
import os
# `separate` comes from a local helper module; it runs Demucs on the input directory
from utils.demucs.utils import separate

# Build the input and output directory paths
filename = 'iu_strawberry_concert_preview'
output_dir = os.path.join('/home/arvin/so-vits-svc-fork/audio/', filename)
demucs_dir = os.path.join(output_dir, 'demucs')

# Split the track into stems (including vocals) with Demucs
separate(inp=output_dir, outp=demucs_dir)

Going to separate the files:
/home/arvin/so-vits-svc-fork/audio/iu_strawberry_concert_preview/iu_strawberry_concert_preview.mp3
With command:  python3 -m demucs.separate -o /home/arvin/so-vits-svc-fork/audio/iu_strawberry_concert_preview/demucs -n htdemucs --mp3 --mp3-bitrate=320
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /home/arvin/so-vits-svc-fork/audio/iu_strawberry_concert_preview/demucs/htdemucs
Separating track /home/arvin/so-vits-svc-fork/audio/iu_strawberry_concert_preview/iu_strawberry_concert_preview.mp3


100%|██████████████████████████████████████████████| 269.09999999999997/269.09999999999997 [00:10<00:00, 24.93seconds/s]
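
The log above shows the Demucs command that the `separate` helper runs and where the stems end up. A minimal sketch of assembling the same command, and of locating the resulting vocals file, might look like this. The function names are hypothetical, only the flags visible in the log are used, and the output layout (`<out_dir>/<model>/<track name>/vocals.mp3`) is inferred from the paths in this notebook.

```python
import sys
from pathlib import Path


def build_demucs_command(input_files, out_dir, model="htdemucs", mp3_bitrate=320):
    """Assemble the argument list for `python -m demucs.separate`."""
    cmd = [
        sys.executable, "-m", "demucs.separate",
        "-o", str(out_dir),            # output directory
        "-n", model,                   # model name, as shown in the log
        "--mp3", f"--mp3-bitrate={mp3_bitrate}",
    ]
    cmd.extend(str(f) for f in input_files)
    return cmd


def vocals_path(input_file, out_dir, model="htdemucs"):
    """Expected location of the separated vocals for one input track."""
    return Path(out_dir) / model / Path(input_file).stem / "vocals.mp3"


cmd = build_demucs_command(["song.mp3"], "out/demucs")
print(" ".join(cmd[1:]))
print(vocals_path("song.mp3", "out/demucs"))
# To run the separation (requires demucs to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.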


In [7]:
#@title Use trained model
#@markdown **Put your .wav file in `so-vits-svc-fork/audio` directory**
from IPython.display import Audio, display

PATH = f'/home/arvin/so-vits-svc-fork/audio/{filename}/demucs/htdemucs/{filename}'
VERSION = 109
NAME = 'vocals'

# !svc infer drive/MyDrive/so-vits-svc-fork/audio/{NAME}.wav -m drive/MyDrive/so-vits-svc-fork/logs/44k/ -c drive/MyDrive/so-vits-svc-fork/logs/44k/config.json
!svc infer -t -5 -fm crepe -na {PATH}/{NAME}.mp3 -m /home/arvin/so-vits-svc-fork/logs/44k/{VERSION} -c /home/arvin/so-vits-svc-fork/logs/44k/config.json

# display(Audio(f"drive/MyDrive/so-vits-svc-fork/audio/{NAME}.out.wav", autoplay=True))

                    transpose = -5. If you want to change the
                    pitch, please set transpose.Generally
                    transpose = 0 does not work because your
                    voice pitch and target voice pitch are
                    different.
[09:09:52] INFO     [09:09:52] Since model_path is a directory,    __main__.py:273
                    use

In [1]:
#@title Use trained model
#@markdown **Put your .wav file in `so-vits-svc-fork/audio` directory**
from IPython.display import Audio, display

VERSION = 109
NAME = 'vocals'

# !svc infer drive/MyDrive/so-vits-svc-fork/audio/{NAME}.wav -m drive/MyDrive/so-vits-svc-fork/logs/44k/ -c drive/MyDrive/so-vits-svc-fork/logs/44k/config.json
!svc infer -t -5 -fm crepe -na /home/arvin/so-vits-svc-fork/audio/iu_blueming_orig_1.mp3 -m /home/arvin/so-vits-svc-fork/logs/44k/{VERSION} -c /home/arvin/so-vits-svc-fork/logs/44k/config.json

# display(Audio(f"drive/MyDrive/so-vits-svc-fork/audio/{NAME}.out.wav", autoplay=True))

                    transpose = -5. If you want to change the
                    pitch, please set transpose.Generally
                    transpose = 0 does not work because your
                    voice pitch and target voice pitch are
                    different.
[10:38:58] INFO     [10:38:58] Since model_path is a directory,    __main__.py:273
                    use

In [None]:
#@title Use trained model (with cluster)
!svc infer {NAME}.wav -s speaker -r 0.1 -m drive/MyDrive/so-vits-svc-fork/logs/44k/ -c drive/MyDrive/so-vits-svc-fork/logs/44k/config.json -k drive/MyDrive/so-vits-svc-fork/logs/44k/kmeans.pt
display(Audio(f"{NAME}.out.wav", autoplay=True))
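
The earlier inference cells pass `-t -5` to shift the pitch of the source voice. Assuming the transpose value counts semitones in equal temperament (the usual convention, though the help text shown here does not state the unit), the shift corresponds to a frequency ratio of 2^(t/12). The sketch below works this out; the function names are illustrative.

```python
def transpose_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in equal temperament."""
    return 2.0 ** (semitones / 12.0)


def transposed_frequency(freq_hz, semitones):
    """Frequency obtained after shifting `freq_hz` by `semitones`."""
    return freq_hz * transpose_ratio(semitones)


print(round(transpose_ratio(-5), 4))               # 0.7492
print(round(transposed_frequency(440.0, -5), 2))   # 329.63
```

A value of `-5` therefore lowers every pitch to about 75% of its original frequency, which moves A4 (440 Hz) down to E4. Positive values raise the pitch by the same rule, and `12` doubles the frequency.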

### Pretrained models

In [None]:
#@title https://huggingface.co/TachibanaKimika/so-vits-svc-4.0-models/tree/main
!wget -N "https://huggingface.co/TachibanaKimika/so-vits-svc-4.0-models/resolve/main/riri/G_riri_220.pth"
!wget -N "https://huggingface.co/TachibanaKimika/so-vits-svc-4.0-models/resolve/main/riri/config.json"

In [None]:
!svc infer {NAME}.wav -c config.json -m G_riri_220.pth
display(Audio(f"{NAME}.out.wav", autoplay=True))

In [None]:
#@title https://huggingface.co/therealvul/so-vits-svc-4.0/tree/main
!wget -N "https://huggingface.co/therealvul/so-vits-svc-4.0/resolve/main/Pinkie%20(speaking%20sep)/G_166400.pth"
!wget -N "https://huggingface.co/therealvul/so-vits-svc-4.0/resolve/main/Pinkie%20(speaking%20sep)/config.json"

In [None]:
!svc infer {NAME}.wav --speaker "Pinkie {neutral}" -c config.json -m G_166400.pth
display(Audio(f"{NAME}.out.wav", autoplay=True))