<a href="https://colab.research.google.com/github/alexledd/So-VITS-SVC-Notebook/blob/main/so-vits-svc_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://repository-images.githubusercontent.com/654490329/ab89913c-c654-48e9-ac2c-ac667cf31155)

# **SO-VITS-SVC Colaboratory**

#### **🌡️ Before training**
* 💾 This program saves the last 3 generations of models to Google Drive, and 1 generation of models is >1GB.

* 🔺 Make sure your Google Drive has enough free storage; 4GB is the minimum!

* 🧑‍🏫 Training requires >10GB of VRAM (a T4 should be enough).

* ✍️ Inference requires much less VRAM.

* 📁 If your dataset audio is longer than 10 minutes, you need to split it into sections. Split the audio manually or use the `Split Tool` below.
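
The storage requirement above can be checked before training starts. The following is a minimal sketch using only the standard library; the helper names `free_gib` and `has_room` are hypothetical, and the figure reported for a mounted Google Drive folder may not match your Drive quota exactly, so treat the result as a rough guide.

```python
import shutil


def free_gib(path: str) -> float:
    """Return the free space reported for the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path: str, required_gib: float = 4.0) -> bool:
    """Return True if at least `required_gib` GiB appear to be free at `path`."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # In Colab you would pass the mounted Drive folder,
    # e.g. "/content/drive/MyDrive"; "." is used here so the sketch runs anywhere.
    print(f"Free space: {free_gib('.'):.1f} GiB, enough: {has_room('.')}")
```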

**📝 Note: be cautious with your file/folder names, preferably no spaces!**

**Also note that playing audio directly in Colab can cause the runtime to restart. To avoid this, download the file manually, or move it into /content/drive/MyDrive and play it from Google Drive instead.**
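
To illustrate the note on file and folder names, here is a small sketch that lists the parts of a path containing spaces or other characters that tend to break unquoted shell commands. The function name `unsafe_parts` and the allowed character set are assumptions made for this example, not rules enforced by the notebook.

```python
import re
from pathlib import PurePosixPath

# Assumed "safe" set: letters, digits, dot, underscore and hyphen.
SAFE_NAME = re.compile(r"^[A-Za-z0-9._-]+$")


def unsafe_parts(path: str) -> list:
    """Return the path components that contain characters outside the safe set."""
    parts = PurePosixPath(path).parts
    return [p for p in parts if p != "/" and not SAFE_NAME.match(p)]


if __name__ == "__main__":
    print(unsafe_parts("/content/my data/a.wav"))
    print(unsafe_parts("/content/drive/MyDrive/voice_01.wav"))
```

An empty list means every component uses only the assumed safe characters.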

In [None]:
#@title NVIDIA SMI (GPU Check)
!nvidia-smi


# Dependencies & Mount Gdrive
Restart runtime after everything is installed
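
After the restart, it can be useful to confirm that the pinned versions from the dependency cell below are the ones actually installed. This is a sketch under assumptions: the helper names are hypothetical, and `importlib.metadata` reports what is installed on disk, not what an already running kernel has imported, which is one reason the restart matters.

```python
from importlib import metadata

# Pins taken from the SO-VITS-SVC dependencies cell of this notebook.
PINS = {"numpy": "1.23.5", "pyworld": "0.3.2"}


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def mismatches(pins: dict) -> dict:
    """Map each package whose installed version differs to (wanted, found)."""
    result = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        if found != wanted:
            result[name] = (wanted, found)
    return result


if __name__ == "__main__":
    for name, (wanted, found) in mismatches(PINS).items():
        print(f"{name}: wanted {wanted}, found {found}")
```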


In [None]:
#@title Mount GDrive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
#@title Audio editor dependencies

!pip install yt_dlp
!pip install ffmpeg
!mkdir -p youtubeaudio
!python3 -m pip install -U demucs
!python3 -m pip install pydub

In [None]:
#@title SO-VITS-SVC dependencies

!python -m pip install -U pip wheel
!pip install pyworld==0.3.2
!pip install numpy==1.23.5
!pip install -U ipython
!pip install -U tensorboard-plugin-profile
!pip install -U so-vits-svc-fork
!mkdir -p drive/MyDrive/so-vits-svc-fork
#@markdown pip may fail to resolve dependencies and print an ERROR; this can be ignored.
#@markdown You MUST restart the runtime after running this cell!


# Training
This set is for training an SVC model
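
Before copying a dataset, you may want to find the clips that exceed the 10-minute limit mentioned in the notes at the top. The sketch below uses the standard `wave` module and assumes uncompressed PCM WAV files and a per-file limit; the function names are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def too_long(path: str, limit_seconds: float = 600.0) -> bool:
    """Return True if the clip is longer than `limit_seconds` (10 minutes by default)."""
    return wav_duration_seconds(path) > limit_seconds
```

Clips flagged by `too_long` are candidates for the `Split Tool` in the Audio Editor section.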

In [None]:
#@title Make dataset directory
!mkdir -p "dataset_raw"

#!rm -r "dataset_raw"
#!rm -r "dataset/44k"

In [None]:
#@title Copy your dataset
DATASET_NAME = "" #@param {type: "string"}
DATASET_DIR = "" #@param {type: "string"}

DS_TO = f'dataset_raw/{DATASET_NAME}'
!mkdir -p {DS_TO}

!cp -R {DATASET_DIR}/. -t {DS_TO}

In [None]:
#@title Automatic preprocessing
!svc pre-resample

In [None]:
#@title Pre-Config for new dataset
!svc pre-config

In [None]:
#@title Copy configs file
!cp configs/44k/config.json drive/MyDrive/so-vits-svc-fork

In [None]:
#@title Training Method
#@markdown Here's a brief explanation of each F0 (pitch) estimation method:

#@markdown * CREPE (Convolutional Representation for Pitch Estimation): CREPE is a pitch estimation model designed for monophonic audio signals. It uses a deep convolutional neural network (CNN) to estimate the fundamental frequency (pitch) of the input audio.
#@markdown * CREPE-Tiny: CREPE-Tiny is a smaller version of the CREPE model. It is a lightweight model with reduced complexity, making it suitable for environments where computational resources are limited.
#@markdown * DIO (Distributed Inline-filter Operation): DIO is a fast fundamental frequency (F0) estimation algorithm from the WORLD toolkit. It is intended for monophonic audio and is generally quicker, but less robust to noise, than Harvest.
#@markdown * Parselmouth: Parselmouth is a Python interface to the Praat speech analysis software. It provides functions for extracting acoustic features such as pitch, intensity, formants, and spectrograms; here it is used for pitch extraction.
#@markdown * Harvest: Harvest is a pitch estimation algorithm implemented as part of the WORLD toolkit. It estimates the F0 values of monophonic audio signals by analyzing the harmonic structure and periodicity in the signal.
F0_METHOD = "dio" #@param ["crepe", "crepe-tiny", "parselmouth", "dio", "harvest"]
!svc pre-hubert -fm {F0_METHOD}

In [None]:
#@title Training
%load_ext tensorboard
%tensorboard --logdir drive/MyDrive/so-vits-svc-fork/logs/44k
!svc train --model-path drive/MyDrive/so-vits-svc-fork/logs/44k

In [None]:
#@title Training Cluster Model
!svc train-cluster --output-path drive/MyDrive/so-vits-svc-fork/logs/44k/kmeans.pt

# Inference
This set is for using the SO-VITS-SVC model for conversion
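
The inference cell below takes `PITCH` in semitones, where 12 is one octave up and -12 is one octave down. As a worked illustration, assuming equal temperament, a shift of n semitones multiplies frequency by 2^(n/12). The function names here are hypothetical and are not part of so-vits-svc-fork.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in equal temperament."""
    return 2.0 ** (semitones / 12.0)


def transpose_hz(frequency_hz: float, semitones: float) -> float:
    """Frequency obtained by shifting `frequency_hz` by `semitones`."""
    return frequency_hz * semitones_to_ratio(semitones)


if __name__ == "__main__":
    # One octave down halves the frequency; one octave up doubles it.
    print(transpose_hz(440.0, -12))
    print(transpose_hz(440.0, 12))
```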

In [None]:
#@title **INFERENCE**
#@markdown #INFERENCE USING A PRE-TRAINED SVC MODEL
#@markdown * Enter AUDIO without the **".wav"** extension
from IPython.display import Audio

AUDIO = "" #@param {type:"string"}
MODEL = "" #@param {type:"string"}
CONFIG = "" #@param {type:"string"}
#@markdown * Change according to your model's voice pitch. 12 = +1 octave | -12 = -1 octave.
#@markdown * Converting higher-pitched audio to a lower-pitched model usually uses -12 to -24, and vice versa.
PITCH = -12 #@param {type:"integer"}
#@markdown * Options, or leave it by default
Auto_Predict = False #@param {type:"boolean"}
Pitch_Bypass = False #@param {type:"boolean"}
DisplayAudio_Infer = False #@param {type:"boolean"}

def Auto_PredictFalse():
  if Pitch_Bypass:
    !svc infer {AUDIO}.wav -c {CONFIG} -m {MODEL} -na
  else:
    !svc infer {AUDIO}.wav -c {CONFIG} -m {MODEL} -na -t {PITCH}

def Auto_PredictTrue():
  if Pitch_Bypass:
    !svc infer {AUDIO}.wav -c {CONFIG} -m {MODEL}
  else:
    !svc infer {AUDIO}.wav -c {CONFIG} -m {MODEL} -t {PITCH}

if Auto_Predict:
    Auto_PredictTrue()
else:
    Auto_PredictFalse()

#@markdown Displaying audio can restart the runtime sometimes
if DisplayAudio_Infer :
  display(Audio(f"{AUDIO}.out.wav"))

# Downloader
This cell downloads files from the internet; each URL must point directly to the file.
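
Whether a link is "direct" cannot be decided from the URL alone, but a rough heuristic catches many share-page links before `wget` is run. The sketch below is only that heuristic: it assumes a direct link uses http(s) and has a path ending in a file extension. The function name is hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def looks_like_direct_file_url(url: str) -> bool:
    """Heuristic: http(s) URL with a host and a path that ends in a file extension."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    _, extension = posixpath.splitext(parsed.path)
    return extension != ""


if __name__ == "__main__":
    print(looks_like_direct_file_url("https://example.com/models/G_1000.pth"))
    print(looks_like_direct_file_url("https://example.com/share?id=123"))
```

A link that fails this check may still work, and one that passes may still redirect to a web page, so the result is a hint only.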

In [None]:
#@title Downloader
#@markdown The default download folder is "/content/downloaded"
file_url = "" #@param {type:"string"}
file_url2 = "" #@param {type:"string"}

!mkdir downloaded
!wget -N {file_url} -P downloaded/
!wget -N {file_url2} -P downloaded/

In [None]:
#@title YouTube Audio Downloader (WAV Output)
import yt_dlp

ydl_opts = {
    'format': 'bestaudio/best',
    # Extract the audio track and convert it to WAV with FFmpeg
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav',
    }],
    "outtmpl": 'youtubeaudio/audio',  # this is where you can edit how you'd like the filenames to be formatted
}

def download_from_url(ydl, url):
    """Download the audio for `url` using the given YoutubeDL instance."""
    ydl.download([url])

url = "" #@param {type:"string"}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    download_from_url(ydl, url)


In [None]:
#@title Unzip Tool
ZIP_PATH = "" #@param {type:"string"}
FOLDER_NAME = "" #@param {type:"string"}

!unzip {ZIP_PATH} -d {FOLDER_NAME}

# Audio Editor
This set is for audio editing
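
The `Split Tool` below cuts audio wherever the signal rises out of silence. Its core idea can be sketched on a toy signal: slide a window over the samples, compute each window's mean energy relative to the maximum possible energy, and mark a cut at every window where the energy first rises above the threshold. This is a simplified pure-Python illustration, not the cell's exact code; `cut_points` is a hypothetical helper.

```python
def energy(samples):
    """Mean squared amplitude of a window."""
    return sum(s * s for s in samples) / len(samples)


def windows(signal, window_size, step_size):
    """Yield overlapping windows, stopping before the end of the signal."""
    for start in range(0, len(signal), step_size):
        end = start + window_size
        if end >= len(signal):
            break
        yield signal[start:end]


def rising_edges(flags):
    """Yield the indices where a flag turns from False to True."""
    previous = False
    for index, flag in enumerate(flags):
        if flag and not previous:
            yield index
        previous = flag


def cut_points(signal, max_amplitude, window_size, step_size, threshold):
    """Return the sample offsets where the signal rises out of silence."""
    max_energy = float(max_amplitude) ** 2
    loud = (energy(w) / max_energy > threshold
            for w in windows(signal, window_size, step_size))
    return [index * step_size for index in rising_edges(loud)]


if __name__ == "__main__":
    # Two bursts of sound separated by silence (16-bit style amplitudes).
    toy = [0] * 10 + [1000] * 10 + [0] * 10 + [1000] * 10
    print(cut_points(toy, 32767, window_size=4, step_size=2, threshold=1e-4))
```

Each cut lands at the start of the first window that overlaps a burst, which is slightly before the sound itself begins.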

In [None]:
#@title Split Tool
from scipy.io import wavfile
import os
import numpy as np
import argparse
from tqdm import tqdm
import json

from datetime import datetime, timedelta

# Utility functions

def GetTime(video_seconds):

    if (video_seconds < 0) :
        return 00

    else:
        sec = timedelta(seconds=float(video_seconds))
        d = datetime(1,1,1) + sec

        instant = str(d.hour).zfill(2) + ':' + str(d.minute).zfill(2) + ':' + str(d.second).zfill(2) + str('.001')

        return instant

def GetTotalTime(video_seconds):

    sec = timedelta(seconds=float(video_seconds))
    d = datetime(1,1,1) + sec
    delta = str(d.hour) + ':' + str(d.minute) + ":" + str(d.second)

    return delta

def windows(signal, window_size, step_size):
    if type(window_size) is not int:
        raise AttributeError("Window size must be an integer.")
    if type(step_size) is not int:
        raise AttributeError("Step size must be an integer.")
    for i_start in range(0, len(signal), step_size):
        i_end = i_start + window_size
        if i_end >= len(signal):
            break
        yield signal[i_start:i_end]

def energy(samples):
    return np.sum(np.power(samples, 2.)) / float(len(samples))

def rising_edges(binary_signal):
    previous_value = 0
    index = 0
    for x in binary_signal:
        if x and not previous_value:
            yield index
        previous_value = x
        index += 1

# Change the parameters here
split_input = "" #@param {type:"string"}
split_output = "" #@param {type:"string"}
#@markdown ---
#@markdown The minimum length of silence at which a split may occur [seconds]. This notebook defaults to 0.6 seconds.
min_silence_length = 0.6 #@param {type:"number"}
#@markdown The energy level (between 0.0 and 1.0) below which the signal is regarded as silent.
silence_threshold = 1e-4 #@param {type:"number"}
#@markdown The amount of time to step forward in the input file after calculating energy [seconds]. Smaller value = slower, but more accurate silence detection. Larger value = faster, but might miss some split opportunities. If unset, it falls back to **(min-silence-length / 10.)**.
step_duration = 0.003 #@param {type:"number"}
#@markdown ---
#@markdown Or leave the parameters at their defaults: *0.6 ; 1e-4 ; 0.003*

input_filename = split_input
window_duration = min_silence_length
# Fall back to a tenth of the window if no step duration is given
if step_duration is None:
    step_duration = window_duration / 10.

output_filename_prefix = os.path.splitext(os.path.basename(input_filename))[0]
dry_run = False

print("Splitting {} where energy is below {}% for longer than {}s.".format(
    input_filename,
    silence_threshold * 100.,
    window_duration
    )
)

# Read and split the file

sample_rate, samples = wavfile.read(filename=input_filename, mmap=True)

max_amplitude = np.iinfo(samples.dtype).max
print(max_amplitude)

max_energy = energy([max_amplitude])
print(max_energy)

window_size = int(window_duration * sample_rate)
step_size = int(step_duration * sample_rate)

signal_windows = windows(
    signal=samples,
    window_size=window_size,
    step_size=step_size
)

window_energy = (energy(w) / max_energy for w in tqdm(
    signal_windows,
    total=int(len(samples) / float(step_size))
))

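# Despite the name, this is True for windows whose energy is ABOVE the threshold (non-silent)
window_silence = (e > silence_threshold for e in window_energy)

# A cut is placed wherever the signal rises from silence to sound
cut_times = (r * step_duration for r in rising_edges(window_silence))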

# This is the step that takes long, since we force the generators to run.
print("Finding silences...")
cut_samples = [int(t * sample_rate) for t in cut_times]
cut_samples.append(-1)

cut_ranges = [(i, cut_samples[i], cut_samples[i+1]) for i in range(len(cut_samples) - 1)]

video_sub = {str(i) : [str(GetTime(((cut_samples[i])/sample_rate))),
                       str(GetTime(((cut_samples[i+1])/sample_rate)))]
             for i in range(len(cut_samples) - 1)}

for i, start, stop in tqdm(cut_ranges):
    output_file_path = "{}_{:03d}.wav".format(
        os.path.join(split_output, output_filename_prefix),
        i
    )
    if not dry_run:
        print("Writing file {}".format(output_file_path))
        wavfile.write(
            filename=output_file_path,
            rate=sample_rate,
            data=samples[start:stop]
        )
    else:
        print("Not writing file {}".format(output_file_path))

# Use os.path.join so the JSON lands inside split_output on Linux (Colab) as well
with open(os.path.join(split_output, output_filename_prefix + '.json'), 'w') as output:
    json.dump(video_sub, output)


In [None]:
#@title Convert to Waveform (.WAV)
#@markdown Enter the input path without its file extension and put the extension (e.g. mp3, m4a) in FILE_EXT. The default output folder is "/content/converted"
FFMPEG_INPUT = "" #@param {type:"string"}
FILE_EXT = "" #@param {type:"string"}
OUT = "" #@param {type:"string"}

!mkdir converted
!ffmpeg -i {FFMPEG_INPUT}.{FILE_EXT} -acodec pcm_s16le /content/converted/{OUT}.wav

In [None]:
#@title Demuxer (Separate Vocal and Background)
import subprocess
AUDIO_INPUT = "" #@param {type:"string"}

# Pass the arguments as a list so a path containing spaces is not split apart
command = ["demucs", "--two-stems=vocals", AUDIO_INPUT]
result = subprocess.run(command, stdout=subprocess.PIPE)
print(result.stdout.decode())

In [None]:
#@title Analyzing Audio Volume
ANLZ_INPUT = "" #@param {type:"string"}

!ffmpeg -i {ANLZ_INPUT} -filter:a volumedetect -f null /dev/null

In [None]:
#@title Volume Manipulation
VM_INPUT = "" #@param {type:"string"}
#@markdown Value can be a multiplier such as "1.5" (150% of the original volume) or a gain such as "10dB" (10dB increase)
VM_VALUE = "" #@param {type:"string"}
#@markdown Output filename; In /content/volume_changed
VM_OUTPUT = "" #@param {type:"string"}

!mkdir volume_changed
!ffmpeg -i {VM_INPUT} -filter:a "volume={VM_VALUE}" -c:a pcm_s16le /content/volume_changed/{VM_OUTPUT}.volume.wav

In [None]:
#@title Audio Normalization
#@markdown * Audio Normalization input; this cell will also convert audio file to waveform.
AN_INPUT = "" #@param {type:"string"}
#@markdown * Target loudness; type just the value in dB (ex. "-6")
TARGET_LDNS = "-6" #@param {type:"string"}
#@markdown * The default Loudness Range is 11dB
RANGE_LDNS = "11" #@param {type:"string"}
#@markdown * The default value is -1.5dB
TRUE_PEAK = "-1.5" #@param {type:"string"}
#@markdown * Output filename; in /content/normalized
AN_OUTPUT = "" #@param {type:"string"}

!mkdir normalized
!ffmpeg -i {AN_INPUT} -af loudnorm=I={TARGET_LDNS}:LRA={RANGE_LDNS}:TP={TRUE_PEAK} -c:a pcm_s16le /content/normalized/{AN_OUTPUT}.normalized.wav



In [None]:
#@title Combine
from IPython.display import Audio, display
from pydub import AudioSegment
!mkdir combined

AUDIO_01 = "" #@param {type:"string"}
AUDIO_02 = "" #@param {type:"string"}
DisplayAudio_Combined = False #@param {type:"boolean"}

sound1 = AudioSegment.from_file(AUDIO_01)
sound2 = AudioSegment.from_file(AUDIO_02)

combined = sound1.overlay(sound2)

combined.export("/content/combined/audio.combined.wav", format='wav')

def DisplayAudioResult():
    display(Audio(f"/content/combined/audio.combined.wav"))

if DisplayAudio_Combined :
  DisplayAudioResult()

In [None]:
#@title Audio Recording
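# Imports needed by this cell so it can run on its own
import io
import os
from base64 import b64decode

from google.colab.output import eval_js
from IPython.display import HTML, display
from scipy.io import wavfile

REC_NAME = "Audio.wav" #@param {type:"string"}
REC_OUT = "/content/" #@param {type:"string"}
REC_COMB = os.path.join(REC_OUT, REC_NAME)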

AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    mimeType : 'audio/webm;codecs=opus'
  };
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data);
    reader.onloadend = function() {
      base64data = reader.result;
    }
  };
  recorder.start();
};

recordButton.innerText = "Recording... Press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);

function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = "Saving the recording..."
  }
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
  recordButton.onclick = ()=>{
    toggleRecording();

    sleep(2000).then(() => {
      resolve(base64data.toString());
    });
  };
});
</script>
"""


import subprocess
import ffmpeg

def get_audio():
    display(HTML(AUDIO_HTML))
    data = eval_js("data")
    binary = b64decode(data.split(',')[1])

    process = subprocess.Popen(
        ["ffmpeg", "-i", "pipe:0", "-f", "wav", "pipe:1"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    output, err = process.communicate(input=binary)

    riff_chunk_size = len(output) - 8
    # Break up the chunk size into four bytes, held in b.
    q = riff_chunk_size
    b = []
    for i in range(4):
        q, r = divmod(q, 256)
        b.append(r)

    # Replace bytes 4:8 in proc.stdout with the actual size of the RIFF chunk.
    riff = output[:4] + bytes(b) + output[8:]

    sr, audio = wavfile.read(io.BytesIO(riff))

    return audio, sr

audio, sr = get_audio()
wavfile.write(REC_COMB, sr, audio)
