## Clone StarGANv2-VC

In [1]:
!git clone https://github.com/yl4579/StarGANv2-VC.git

Cloning into 'StarGANv2-VC'...
remote: Enumerating objects: 159, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 159 (delta 29), reused 16 (delta 16), pack-reused 124
Receiving objects: 100% (159/159), 108.18 MiB | 12.54 MiB/s, done.
Resolving deltas: 100% (54/54), done.
Checking out files: 100% (33/33), done.


In [2]:
%cd StarGANv2-VC

/content/StarGANv2-VC


In [None]:
!pip install SoundFile torchaudio munch parallel_wavegan torch pydub

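## Prepare the data

Download and extract the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443), then use VCTK.ipynb to prepare the data (downsampling to 24 kHz, etc.).

Alternatively, download this smaller [dataset](https://drive.google.com/file/d/1t7QQbu4YC_P1mv9puA_KgSomSFDsSzD6/view?usp=sharing).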

In [4]:
%cd ..

/content


In [None]:
# download the smaller dataset
!gdown --id '1t7QQbu4YC_P1mv9puA_KgSomSFDsSzD6'

In [5]:
# download the VCTK dataset (takes 30+ min)
!wget https://datashare.ed.ac.uk/download/DS_10283_3443.zip -O DS_10283_3443.zip
# extract the outer archive
!unzip -q DS_10283_3443.zip -d VCTK_Data
# delete the outer archive
!rm -f DS_10283_3443.zip
# extract the corpus archive
!unzip -q VCTK_Data/VCTK-Corpus-0.92.zip -d VCTK_Data
# delete the corpus archive
!rm -f VCTK_Data/VCTK-Corpus-0.92.zip

--2022-05-18 11:19:52--  https://datashare.ed.ac.uk/download/DS_10283_3443.zip
Resolving datashare.ed.ac.uk (datashare.ed.ac.uk)... 192.41.117.26
Connecting to datashare.ed.ac.uk (datashare.ed.ac.uk)|192.41.117.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11749118637 (11G) [application/zip]
Saving to: ‘DS_10283_3443.zip’


2022-05-18 11:51:42 (5.87 MB/s) - ‘DS_10283_3443.zip’ saved [11749118637/11749118637]



In [None]:
# # delete the output folder
# !rm -rf /content/Data

In [49]:
# VCTK Corpus Path
# __CORPUSPATH__ = "/share/naplab/users/yl4579/data/VCTK-Corpus/VCTK-Corpus"
__CORPUSPATH__ = "/content/VCTK_Data/wav48_silence_trimmed"

# output path
__OUTPATH__ = "/content/Data"

In [53]:
import os
import random
import pandas as pd
from scipy.io import wavfile
from pydub import AudioSegment
from pydub.silence import split_on_silence


def split(sound):
    dBFS = sound.dBFS
    chunks = split_on_silence(sound,
        min_silence_len = 100,
        silence_thresh = dBFS-16,
        keep_silence = 100
    )
    return chunks

def combine(_src):
    audio = AudioSegment.empty()
    for i,filename in enumerate(os.listdir(_src)):
        if i>=10: break  # demo shortcut: stop after the first 10 directory entries per speaker
        if filename.endswith('.wav'):
            filename = os.path.join(_src, filename)
            audio += AudioSegment.from_wav(filename)
        elif filename.endswith('.flac'):
            filename = os.path.join(_src, filename)
            audio += AudioSegment.from_file(filename, format="flac") 

    return audio

def save_chunks(chunks, directory):
    os.makedirs(directory, exist_ok=True)
    if not chunks:
        # nothing to export (e.g. the audio was entirely silence)
        return

    # merge consecutive chunks until each output is at least 5 seconds long
    target_length = 5 * 1000  # milliseconds
    output_chunks = [chunks[0]]
    for chunk in chunks[1:]:
        if len(output_chunks[-1]) < target_length:
            output_chunks[-1] += chunk
        else:
            # the last output chunk has reached the target length,
            # so we can start a new one
            output_chunks.append(chunk)

    # export as 24 kHz mono WAV files named 1.wav, 2.wav, ...
    for counter, chunk in enumerate(output_chunks, start=1):
        chunk = chunk.set_frame_rate(24000)
        chunk = chunk.set_channels(1)
        chunk.export(os.path.join(directory, str(counter) + '.wav'), format="wav")

In [54]:
# Source: http://speech.ee.ntu.edu.tw/~jjery2243542/resource/model/is18/en_speaker_used.txt
# Source: https://github.com/jjery2243542/voice_conversion

speakers = [225,228,229,230,231,233,236,239,240,244,226,227,232,243,254,256,258,259,270,273]

In [55]:
# downsample to 24 kHz

for p in speakers:
    directory = __OUTPATH__ + '/p' + str(p)
    if not os.path.exists(directory):
        audio = combine(__CORPUSPATH__ + '/p' + str(p))
        chunks = split(audio)
        save_chunks(chunks, directory)

In [56]:
# get all speakers

data_list = []
for path, subdirs, files in os.walk(__OUTPATH__):
    for name in files:
        if name.endswith(".wav"):
            speaker = int(path.split('/')[-1].replace('p', ''))
            if speaker in speakers:
                data_list.append({"Path": os.path.join(path, name), "Speaker": int(speakers.index(speaker)) + 1})
                


data_list = pd.DataFrame(data_list)
data_list = data_list.sample(frac=1)


split_idx = round(len(data_list) * 0.1)

test_data = data_list[:split_idx]
train_data = data_list[split_idx:]

In [57]:
# write to file (speaker indices are written zero-based)

with open(__OUTPATH__ + "/train_list.txt", "w") as text_file:
    for index, k in train_data.iterrows():
        text_file.write(k['Path'] + "|" + str(k['Speaker'] - 1) + '\n')

with open(__OUTPATH__ + "/val_list.txt", "w") as text_file:
    for index, k in test_data.iterrows():
        text_file.write(k['Path'] + "|" + str(k['Speaker'] - 1) + '\n')

## Training
```
python train.py --config_path ./Configs/config.yml
```
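Specify the training and validation data in config.yml, and change num_domains to the number of speakers in the dataset. The data list format must be filename.wav|speaker_number; see train_list.txt for an example.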
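The `filename.wav|speaker_number` format can be checked before training starts. The following is a minimal sketch, not part of the repository: `parse_data_list` and `num_domains_needed` are hypothetical helper names, and the sketch assumes speaker numbers are zero-based and contiguous, as they are in the lists written by the cells above.

```python
from typing import List, Tuple


def parse_data_list(lines: List[str]) -> List[Tuple[str, int]]:
    """Parse 'path|speaker_number' lines into (path, speaker) pairs."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last '|' so that any earlier '|' stays in the path
        path, sep, speaker = line.rpartition("|")
        if not sep or not path:
            raise ValueError(
                f"line {line_no}: expected 'path|speaker_number', got {line!r}"
            )
        entries.append((path, int(speaker)))  # int() rejects non-numeric labels
    return entries


def num_domains_needed(entries: List[Tuple[str, int]]) -> int:
    """Smallest num_domains covering every zero-based speaker index (assumption)."""
    return max(speaker for _, speaker in entries) + 1


# illustrative lines in the same shape as train_list.txt
sample = ["/content/Data/p225/1.wav|0\n", "/content/Data/p228/3.wav|1\n"]
entries = parse_data_list(sample)
print(entries, num_domains_needed(entries))
```

Running the same check on the lines read from train_list.txt and val_list.txt gives a quick sanity check on the num_domains value in config.yml.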

Checkpoints and TensorBoard logs are saved in log_dir. To speed up training, make batch_size as large as your GPU RAM allows. Note, however, that batch_size = 5 already requires about 10 GB of GPU RAM.
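Checkpoints in log_dir are named with the `epoch_%05d.pth` pattern used by the training loop in this notebook. As a rough sketch, the hypothetical helper `latest_checkpoint` below (not part of the repository) finds the most recent one, which may be handy when resuming an interrupted Colab session.

```python
import os
import re
from typing import Optional

# matches the 'epoch_%05d.pth' names written by the training loop
CHECKPOINT_RE = re.compile(r"^epoch_(\d+)\.pth$")


def latest_checkpoint(log_dir: str) -> Optional[str]:
    """Return the path of the checkpoint with the highest epoch, or None."""
    best_epoch, best_path = -1, None
    for name in os.listdir(log_dir):
        match = CHECKPOINT_RE.match(name)
        if match is None:
            continue  # ignore train.log, the tensorboard folder, etc.
        epoch = int(match.group(1))
        if epoch > best_epoch:
            best_epoch, best_path = epoch, os.path.join(log_dir, name)
    return best_path
```

The returned path could then be assigned to the `pretrained_model` key that the commented-out `trainer.load_checkpoint` call in this notebook reads from the config.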

In [58]:
%cd StarGANv2-VC

[Errno 2] No such file or directory: 'StarGANv2-VC'
/content/StarGANv2-VC


In [None]:
# !python train.py --config_path ./Configs/config.yml

In [59]:
import os
import os.path as osp
import re
import sys
import yaml
import shutil
import numpy as np
import torch
import click
import warnings
warnings.simplefilter('ignore')

from functools import reduce
from munch import Munch

from meldataset import build_dataloader
from optimizers import build_optimizer
from models import build_model
from trainer import Trainer
from torch.utils.tensorboard import SummaryWriter

from Utils.ASR.models import ASRCNN
from Utils.JDC.model import JDCNet

import logging
from logging import StreamHandler
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
handler = StreamHandler()
handler.setLevel(logging.DEBUG)
logger.addHandler(handler)

torch.backends.cudnn.benchmark = True

Edit Configs/config.yml so that it contains:

train_data: "/content/Data/train_list.txt"

val_data: "/content/Data/val_list.txt"
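The two keys can also be changed from a cell instead of by hand. The sketch below is one possible approach under stated assumptions: `set_top_level_key` is a hypothetical helper, it only touches unindented `key: value` lines, and it always writes the value in double quotes, so it is not a general YAML editor.

```python
import re


def set_top_level_key(yaml_text: str, key: str, value: str) -> str:
    """Replace the value of a top-level `key:` line, or append the key if absent."""
    # '^' with MULTILINE matches only at line starts, so indented (nested) keys are skipped
    pattern = re.compile(rf"^{re.escape(key)}:.*$", flags=re.MULTILINE)
    new_line = f'{key}: "{value}"'
    if pattern.search(yaml_text):
        # a function replacement avoids backslash interpretation in the value
        return pattern.sub(lambda _m: new_line, yaml_text, count=1)
    prefix = yaml_text.rstrip("\n")
    return (prefix + "\n" if prefix else "") + new_line + "\n"


# illustrative config text; the real file has many more keys
example = 'log_dir: "Models/VCTK20"\ntrain_data: "Data/train_list.txt"\n'
example = set_top_level_key(example, "train_data", "/content/Data/train_list.txt")
example = set_top_level_key(example, "val_data", "/content/Data/val_list.txt")
print(example)
```

To apply it, read Configs/config.yml into a string, pass it through the helper for each key, and write the result back before loading the config.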

In [68]:
config_path = "./Configs/config.yml"

# read the config with a context manager so the file handle is closed
with open(config_path) as f:
    config = yaml.safe_load(f)

log_dir = config['log_dir']
os.makedirs(log_dir, exist_ok=True)
shutil.copy(config_path, osp.join(log_dir, osp.basename(config_path)))
writer = SummaryWriter(log_dir + "/tensorboard")


In [69]:
# write logs
file_handler = logging.FileHandler(osp.join(log_dir, 'train.log'))
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(logging.Formatter('%(levelname)s:%(asctime)s: %(message)s'))
logger.addHandler(file_handler)

batch_size = config.get('batch_size', 10)
device = config.get('device', 'cpu')
epochs = config.get('epochs', 1000)
save_freq = config.get('save_freq', 20)
train_path = config.get('train_data', None)
val_path = config.get('val_data', None)
stage = config.get('stage', 'star')
fp16_run = config.get('fp16_run', False)


In [70]:
def get_data_path_list(train_path=None, val_path=None):
    if train_path is None:
        train_path = "../Data/train_list.txt"
    if val_path is None:
        val_path = "../Data/val_list.txt"

    with open(train_path, 'r') as f:
        train_list = f.readlines()
    with open(val_path, 'r') as f:
        val_list = f.readlines()

    return train_list, val_list


# load data
train_list, val_list = get_data_path_list(train_path, val_path)
train_dataloader = build_dataloader(train_list,
                                    batch_size=batch_size,
                                    num_workers=4,
                                    device=device)
val_dataloader = build_dataloader(val_list,
                                    batch_size=batch_size,
                                    validation=True,
                                    num_workers=2,
                                    device=device)


In [71]:

# load pretrained ASR model
ASR_config = config.get('ASR_config', False)
ASR_path = config.get('ASR_path', False)
with open(ASR_config) as f:
    ASR_config = yaml.safe_load(f)
ASR_model_config = ASR_config['model_params']
ASR_model = ASRCNN(**ASR_model_config)
params = torch.load(ASR_path, map_location='cpu')['model']
ASR_model.load_state_dict(params)
_ = ASR_model.eval()


In [72]:
# load pretrained F0 model
F0_path = config.get('F0_path', False)
F0_model = JDCNet(num_class=1, seq_len=192)
params = torch.load(F0_path, map_location='cpu')['net']
F0_model.load_state_dict(params)

<All keys matched successfully>

In [73]:
    
# build model
model, model_ema = build_model(Munch(config['model_params']), F0_model, ASR_model)

scheduler_params = {
    "max_lr": float(config['optimizer_params'].get('lr', 2e-4)),
    "pct_start": float(config['optimizer_params'].get('pct_start', 0.0)),
    "epochs": epochs,
    "steps_per_epoch": len(train_dataloader),
}

_ = [model[key].to(device) for key in model]
_ = [model_ema[key].to(device) for key in model_ema]
scheduler_params_dict = {key: scheduler_params.copy() for key in model}
scheduler_params_dict['mapping_network']['max_lr'] = 2e-6
optimizer = build_optimizer({key: model[key].parameters() for key in model},scheduler_params_dict=scheduler_params_dict)

trainer = Trainer(args=Munch(config['loss_params']), 
                  model=model,
                  model_ema=model_ema,
                  optimizer=optimizer,
                  device=device,
                  train_dataloader=train_dataloader,
                  val_dataloader=val_dataloader,
                  logger=logger,
                  fp16_run=fp16_run)

# if config.get('pretrained_model', '') != '':
#     trainer.load_checkpoint(config['pretrained_model'],load_only_params=config.get('load_only_params', True))


{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 15}
{'max_lr': 2e-06, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 15}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 15}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 15}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 15}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 150, 'steps_per_epoch': 15}


In [74]:
epochs = 2 # 150

for _ in range(1, epochs+1):
    epoch = trainer.epochs
    train_results = trainer._train_epoch()
    eval_results = trainer._eval_epoch()
    results = train_results.copy()
    results.update(eval_results)
    logger.info('--- epoch %d ---' % epoch)
    for key, value in results.items():
        if isinstance(value, float):
            logger.info('%-15s: %.4f' % (key, value))
            writer.add_scalar(key, value, epoch)
        else:
            for v in value:
                writer.add_figure('eval_spec', v, epoch)
    if (epoch % save_freq) == 0:
        trainer.save_checkpoint(osp.join(log_dir, 'epoch_%05d.pth' % epoch))

[train]: 100%|██████████| 15/15 [01:41<00:00,  6.74s/it]
[eval]: 100%|██████████| 2/2 [00:11<00:00,  5.66s/it]
--- epoch 0 ---
train/real     : 0.6498
train/fake     : 0.7972
train/reg      : 0.0005
train/real_adv_cls: 0.0000
train/con_reg  : 0.0000
train/adv      : 0.9618
train/sty      : 0.0339
train/ds       : 0.0010
train/cyc      : 1.5190
train/norm     : 9.1607
train/asr      : 0.3855
train/f0       : 0.6260
train/adv_cls  : 0.0000
eval/real      : 0.5716
eval/fake      : 0.5427
eval/reg       : 0.0000
eval/real_adv_cls: 0.0000
eval/con_reg   : 0.0000
eval/co

#### Download the pretrained models

In [None]:
!gdown --id '1nzTyyl-9A1Hmqya2Q_f2bpZkUoRjbZsY' --output Models.zip
!gdown --id '1q8oSAzwkqi99oOGXDZyLypCiz0Qzn3Ab' --output Vocoder.zip

In [85]:
# extract the archives
!unzip -q Models.zip
!unzip -q Vocoder.zip

In [82]:
# delete the archives
!rm -f Models.zip
!rm -f Vocoder.zip

## Inference

See inference.ipynb for details.

The StarGANv2 and ParallelWaveGAN models pretrained on the VCTK corpus can be downloaded from the [StarGANv2 Link](https://drive.google.com/file/d/1nzTyyl-9A1Hmqya2Q_f2bpZkUoRjbZsY/view?usp=sharing) and the [ParallelWaveGAN Link](https://drive.google.com/file/d/1q8oSAzwkqi99oOGXDZyLypCiz0Qzn3Ab/view?usp=sharing). Unzip them into Models and Vocoder respectively, then run each cell in the notebook.
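Before running the inference cells, it can help to confirm that the archives were unpacked where the notebook expects them. This is a small sketch: `missing_files` is a hypothetical helper, and the paths in `EXPECTED` are the ones the cells in this notebook load (the StarGANv2 checkpoint itself is left out because its location depends on how Models.zip was unpacked).

```python
import os
from typing import Iterable, List

# relative paths read by the inference cells in this notebook
EXPECTED = [
    "Utils/JDC/bst.t7",
    "Vocoder/checkpoint-400000steps.pkl",
    "Models/VCTK20/config.yml",
]


def missing_files(paths: Iterable[str], root: str = ".") -> List[str]:
    """Return the entries of `paths` that are not regular files under `root`."""
    return [p for p in paths if not os.path.isfile(os.path.join(root, p))]


if __name__ == "__main__":
    missing = missing_files(EXPECTED)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```

The check assumes the current working directory is the StarGANv2-VC folder, as it is after the `%cd StarGANv2-VC` cell.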

In [75]:
# load packages
import random
import yaml
from munch import Munch
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torchaudio
import librosa

from Utils.ASR.models import ASRCNN
from Utils.JDC.model import JDCNet
from models import Generator, MappingNetwork, StyleEncoder

%matplotlib inline

In [76]:
# Source: http://speech.ee.ntu.edu.tw/~jjery2243542/resource/model/is18/en_speaker_used.txt
# Source: https://github.com/jjery2243542/voice_conversion

speakers = [225,228,229,230,231,233,236,239,240,244,226,227,232,243,254,256,258,259,270,273]

to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def build_model(model_params={}):
    args = Munch(model_params)
    generator = Generator(args.dim_in, args.style_dim, args.max_conv_dim, w_hpf=args.w_hpf, F0_channel=args.F0_channel)
    mapping_network = MappingNetwork(args.latent_dim, args.style_dim, args.num_domains, hidden_dim=args.max_conv_dim)
    style_encoder = StyleEncoder(args.dim_in, args.style_dim, args.num_domains, args.max_conv_dim)
    
    nets_ema = Munch(generator=generator,
                     mapping_network=mapping_network,
                     style_encoder=style_encoder)

    return nets_ema

def compute_style(speaker_dicts):
    reference_embeddings = {}
    for key, (path, speaker) in speaker_dicts.items():
        if path == "":
            label = torch.LongTensor([speaker]).to('cuda')
            latent_dim = starganv2.mapping_network.shared[0].in_features
            ref = starganv2.mapping_network(torch.randn(1, latent_dim).to('cuda'), label)
        else:
            wave, sr = librosa.load(path, sr=24000)
            audio, index = librosa.effects.trim(wave, top_db=30)
            if sr != 24000:
                # keyword arguments are required by recent librosa releases
                wave = librosa.resample(wave, orig_sr=sr, target_sr=24000)
            mel_tensor = preprocess(wave).to('cuda')

            with torch.no_grad():
                label = torch.LongTensor([speaker])
                ref = starganv2.style_encoder(mel_tensor.unsqueeze(1), label)
        reference_embeddings[key] = (ref, label)
    
    return reference_embeddings

In [77]:
# load F0 model

F0_model = JDCNet(num_class=1, seq_len=192)
params = torch.load("Utils/JDC/bst.t7")['net']
F0_model.load_state_dict(params)
_ = F0_model.eval()
F0_model = F0_model.to('cuda')

### 1

In [86]:
# load vocoder
from parallel_wavegan.utils import load_model
vocoder = load_model("Vocoder/checkpoint-400000steps.pkl").to('cuda').eval()
vocoder.remove_weight_norm()
_ = vocoder.eval()

In [89]:
# load starganv2

# model_path = 'Models/VCTK20/epoch_00150.pth'
model_path = 'Models/epoch_00150.pth'

with open('Models/VCTK20/config.yml') as f:
    starganv2_config = yaml.safe_load(f)
starganv2 = build_model(model_params=starganv2_config["model_params"])
params = torch.load(model_path, map_location='cpu')
params = params['model_ema']
_ = [starganv2[key].load_state_dict(params[key]) for key in starganv2]
_ = [starganv2[key].eval() for key in starganv2]
starganv2.style_encoder = starganv2.style_encoder.to('cuda')
starganv2.mapping_network = starganv2.mapping_network.to('cuda')
starganv2.generator = starganv2.generator.to('cuda')

### Conversion

In [90]:
# load input wave
selected_speakers = [273, 259, 258, 243, 254, 244, 236, 233, 230, 228]
k = random.choice(selected_speakers)
wav_path = 'Demo/VCTK-corpus/p' + str(k) + '/p' + str(k) + '_023.wav'
audio, source_sr = librosa.load(wav_path, sr=24000)
audio = audio / np.max(np.abs(audio))
audio = audio.astype(np.float32)  # convert values instead of reinterpreting the buffer

#### Convert by style encoder

In [91]:
# with reference, using style encoder
speaker_dicts = {}
for s in selected_speakers:
    k = s
    speaker_dicts['p' + str(s)] = ('Demo/VCTK-corpus/p' + str(k) + '/p' + str(k) + '_023.wav', speakers.index(s))

reference_embeddings = compute_style(speaker_dicts)

In [None]:
# conversion 
import time
start = time.time()
    
source = preprocess(audio).to('cuda:0')
keys = []
converted_samples = {}
reconstructed_samples = {}
converted_mels = {}

for key, (ref, _) in reference_embeddings.items():
    with torch.no_grad():
        f0_feat = F0_model.get_feature_GAN(source.unsqueeze(1))
        out = starganv2.generator(source.unsqueeze(1), ref, F0=f0_feat)
        
        c = out.transpose(-1, -2).squeeze().to('cuda')
        y_out = vocoder.inference(c)
        y_out = y_out.view(-1).cpu()

        if key not in speaker_dicts or speaker_dicts[key][0] == "":
            recon = None
        else:
            wave, sr = librosa.load(speaker_dicts[key][0], sr=24000)
            mel = preprocess(wave)
            c = mel.transpose(-1, -2).squeeze().to('cuda')
            recon = vocoder.inference(c)
            recon = recon.view(-1).cpu().numpy()

    converted_samples[key] = y_out.numpy()
    reconstructed_samples[key] = recon

    converted_mels[key] = out
    
    keys.append(key)
end = time.time()
print('total processing time: %.3f sec' % (end - start) )

import IPython.display as ipd
for key, wave in converted_samples.items():
    print('Converted: %s' % key)
    display(ipd.Audio(wave, rate=24000))
    print('Reference (vocoder): %s' % key)
    if reconstructed_samples[key] is not None:
        display(ipd.Audio(reconstructed_samples[key], rate=24000))

print('Original (vocoder):')
wave, sr = librosa.load(wav_path, sr=24000)
mel = preprocess(wave)
c = mel.transpose(-1, -2).squeeze().to('cuda')
with torch.no_grad():
    recon = vocoder.inference(c)
    recon = recon.view(-1).cpu().numpy()
display(ipd.Audio(recon, rate=24000))
print('Original:')
display(ipd.Audio(wav_path, rate=24000))


#### Convert by mapping network

In [93]:
# no reference, using mapping network
speaker_dicts = {}
selected_speakers = [273, 259, 258, 243, 254, 244, 236, 233, 230, 228]
for s in selected_speakers:
    k = s
    speaker_dicts['p' + str(s)] = ('', speakers.index(s))

reference_embeddings = compute_style(speaker_dicts)

In [94]:
# conversion 
import time
start = time.time()
    
source = preprocess(audio).to('cuda:0')
keys = []
converted_samples = {}
reconstructed_samples = {}
converted_mels = {}

for key, (ref, _) in reference_embeddings.items():
    with torch.no_grad():
        f0_feat = F0_model.get_feature_GAN(source.unsqueeze(1))
        out = starganv2.generator(source.unsqueeze(1), ref, F0=f0_feat)
        
        c = out.transpose(-1, -2).squeeze().to('cuda')
        y_out = vocoder.inference(c)
        y_out = y_out.view(-1).cpu()

        if key not in speaker_dicts or speaker_dicts[key][0] == "":
            recon = None
        else:
            wave, sr = librosa.load(speaker_dicts[key][0], sr=24000)
            mel = preprocess(wave)
            c = mel.transpose(-1, -2).squeeze().to('cuda')
            recon = vocoder.inference(c)
            recon = recon.view(-1).cpu().numpy()

    converted_samples[key] = y_out.numpy()
    reconstructed_samples[key] = recon

    converted_mels[key] = out
    
    keys.append(key)
end = time.time()
print('total processing time: %.3f sec' % (end - start) )

import IPython.display as ipd
for key, wave in converted_samples.items():
    print('Converted: %s' % key)
    display(ipd.Audio(wave, rate=24000))
    print('Reference (vocoder): %s' % key)
    if reconstructed_samples[key] is not None:
        display(ipd.Audio(reconstructed_samples[key], rate=24000))

print('Original (vocoder):')
wave, sr = librosa.load(wav_path, sr=24000)
mel = preprocess(wave)
c = mel.transpose(-1, -2).squeeze().to('cuda')
with torch.no_grad():
    recon = vocoder.inference(c)
    recon = recon.view(-1).cpu().numpy()
display(ipd.Audio(recon, rate=24000))
print('Original:')
display(ipd.Audio(wav_path, rate=24000))

total processing time: 15.474 sec
Converted: p273


Reference (vocoder): p273
Converted: p259


Reference (vocoder): p259
Converted: p258


Reference (vocoder): p258
Converted: p243


Reference (vocoder): p243
Converted: p254


Reference (vocoder): p254
Converted: p244


Reference (vocoder): p244
Converted: p236


Reference (vocoder): p236
Converted: p233


Reference (vocoder): p233
Converted: p230


Reference (vocoder): p230
Converted: p228


Reference (vocoder): p228
Original (vocoder):


Original:


## ASR & F0 Models

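Pretrained F0 and ASR models are provided under the Utils folder. Both models are trained on mel-spectrograms preprocessed by meldataset.py, and both are trained on speech data only.

The ASR model is trained on an English corpus, but it also appears to work when training StarGANv2 models on other languages such as Japanese. The F0 model also appears to work on singing data. For best performance, however, you are encouraged to train your own ASR and F0 models for non-English and non-speech data.

You can edit meldataset.py to use your own mel-spectrogram preprocessing, but the provided pretrained models will then no longer work, and you will need to train your own ASR and F0 models with the new preprocessing. For example, you can refer to the repos Diamondfan/CTC_pytorch and keums/melodyExtraction_JDC to train your own ASR and F0 models.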
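To make the dependency on preprocessing concrete, the sketch below reproduces, for a single mel power value, the log normalization used by the `preprocess` function in this notebook (`mean, std = -4, 4` and an offset of `1e-5`). The inverse function is an addition for illustration only and does not come from the repository.

```python
import math

MEAN, STD = -4.0, 4.0  # constants used by preprocess() in this notebook
EPS = 1e-5             # offset that keeps log() finite for silent bins


def normalize_mel_value(power: float) -> float:
    """Map one mel power value to the normalized log scale the models expect."""
    return (math.log(EPS + power) - MEAN) / STD


def denormalize_mel_value(normalized: float) -> float:
    """Invert normalize_mel_value (illustrative, not part of the repository)."""
    return math.exp(normalized * STD + MEAN) - EPS


for power in (0.0, 0.01, 1.0):
    print(power, normalize_mel_value(power))
```

If any of these constants, or the mel-spectrogram parameters passed to `torchaudio.transforms.MelSpectrogram`, are changed, the inputs no longer match what the pretrained ASR and F0 models were trained on.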