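# File processing with Modelscope and fs2 training with NeMo
Use modelscope for data processing & use nemo for tts training.

Workflow:
1. Data processing
2. Extract auxiliary features
3. Convert to the format NeMo requires, plus other miscellaneous steps


This run uses every 2022 livestream from the 峰哥 footage archive; after filtering, about 40 streams (roughly 50 hours) were kept.

**In practice, unless you plan to learn complex emotion (峰哥's emotion is usually not that complex, haha), about 10 hours of data is generally enough. All timings below were measured on 10 h of data.**

**All experiments were run on a 3090 or a GPU of comparable compute, and every step needs less than 24 GB of GPU memory. With a 3090 and a 24-core CPU, the timings below should hold.**

Finally, making the video and this notebook took a lot of time and effort, so please follow and give it a triple (like, coin, favorite).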
<!-- 
Steps:
1. data processing
2. get auxiliary features
3. trans metadata to nemo format, run fastpitch and hifigan -->

In [None]:
!apt install ffmpeg -y
!pip install pqdm
!pip install "modelscope[audio]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

----------
## Data processing pipeline

I patched a few issues in modelscope (if you are interested, feel free to help get them merged):

1. Important! `funasr/bin/vad_inference.py` raises an error (it started after updating to modelscope 1.4.3; earlier versions did not seem to have it).
```
funasr/bin/vad_inference.py line 275:
    delete: results[i] = json.loads(results[i])
```

2. `funasr/modules/nets_utils.py`: `pad_list` no longer pads when `len(xs) == 1` *(reduces CPU usage)*.
```
funasr/modules/nets_utils.py line 54:
    if len(xs) == 1:
        return xs
```

3. `funasr/models/frontend/wav_frontend.py`: in `apply_cmvn`, replace `.to(device)` with `torch.tensor(, device=device)` *(reduces CPU usage)*.
```
funasr/models/frontend/wav_frontend.py line 53:
    inputs += torch.tensor(means, dtype=dtype, device=device)
    inputs *= torch.tensor(vars, dtype=dtype, device=device)
```

4. `funasr/utils/postprocess_utils.py`: the `@` problem, an error that appears during batch inference. Thanks to the modelscope DingTalk group for the answer!
```
funasr/utils/postprocess_utils.py line 9:
    delete the branch: ch == '@'
```

5. `funasr/bin/asr_inference_paraformer.py`: `self.frontend.forward()` *(was not GPU-accelerated)*.
```
funasr/bin/asr_inference_paraformer.py line 213:
    feats, feats_len = self.frontend.forward(speech.to(self.device), speech_lengths.to(self.device))
```


In [None]:
from glob import glob
from tqdm import tqdm
import os
import json
import soundfile as sf
import subprocess
import numpy as np
import re

import random
from IPython.display import Audio
import IPython.display as iply

from pqdm.processes import pqdm
from pqdm.threads import pqdm as pqdmT
import pandas as pd

import tgt
from pypinyin import lazy_pinyin, Style

import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

import torch

In [None]:
# Define the working directory; put the source audio under its raw/ subfolder
VERSION = ''
start_path = f"./{VERSION}"
if start_path[-1] != '/':
    start_path = start_path + '/'
    
os.makedirs(os.path.join(start_path, 'raw'), exist_ok=True)

In [None]:
def run_multiprocess(x):
    DEVICE, IJOB = x
    subprocess.run(f"CUDA_VISIBLE_DEVICES={DEVICE} "+
                   f"nohup python {os.path.join(start_path, 'tmp', 'temp.py')} {IJOB} "+
                   f"> {os.path.join(start_path, 'tmp','tmp_'+str(IJOB))} 2>&1",
                   shell=True)

In [None]:
SR = 22050
sg_len = 300 # seconds: 5 minutes * 60
CPU_kernels = 64 # number of parallel processes; set it to the number of CPU cores you have

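### Data selection and slicing (CPU-friendly task, should be fast)

1. Pre-select videos with no co-hosted calls and little music; keep those and drop the other odd ones (outdoor streams, for example).
2. Slice the data: convert it to WAV at a 16000 Hz sample rate and cut it into 5-minute segments. (22050 Hz is recommended for the final training audio, because a trained hifigan exists at that rate; the audio is re-cut at the target rate in a later step.)

The VAD model is then used to cut the audio into short clips of roughly 10-20 seconds.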


In [None]:
def run(x):
    # Convert one raw file to 16 kHz mono PCM, then cut it into sg_len-second WAV segments under wavs/.
    name = "/".join(x.split("/")[:-1]).replace("/raw","/wavs/") + x.split("/")[-1].split(".")[0]
    # Both paths are quoted so that file names containing spaces survive the shell.
    return subprocess.run(f"ffmpeg -hide_banner -loglevel panic -i '{x}' -ac 1 -ar 16000 -f s16le - "+
                   f"| ffmpeg -hide_banner -loglevel panic -f s16le -ar 16000 -i - -f segment -segment_time {sg_len} '{name}_%03d.wav'",
                   shell=True)

os.makedirs(os.path.join(start_path, "wavs"), exist_ok=True)
wavs_name = glob(os.path.join(start_path, "raw/*.*"))

# run(wavs_name[0]) # In principle, many audio files should now appear in the wavs folder.

In [None]:
# Run ffmpeg in multiple processes.
res = pqdm(wavs_name, run, n_jobs=CPU_kernels)

-----------

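### VAD slicing (about 30 min)
The VAD model is used to cut the audio further into short clips of roughly 10-20 seconds.

Use `ps -ef | grep tmp/temp.py | grep -v grep | cut -c 9-16| xargs kill -9` to kill the worker processes that are already running.

Use `watch -n1 "cat vad/vad_*/1best_recog/text |wc -l"` to watch how many entries have been produced.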
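
The merge-and-split rule applied to the VAD segments can be sketched as a pure function on `[start_ms, end_ms]` pairs. This is a simplified illustration under assumptions, not the notebook's exact code: the name `plan_slices` is hypothetical, and it returns clip durations in seconds instead of writing audio.

```python
import math

def plan_slices(segments_ms, min_length=2.0, max_length=15.0):
    """Return the clip durations (seconds) produced from VAD segments.

    segments_ms: list of [start_ms, end_ms] pairs (hypothetical input format
    mirroring the millisecond timestamps used in this notebook).
    """
    clips = []
    pending = []  # short segments waiting to be merged with a later one
    for start_ms, end_ms in segments_ms:
        pending.append((end_ms - start_ms) / 1000)
        if pending[-1] < min_length:
            continue  # too short on its own: keep accumulating
        total = sum(pending)
        if total > max_length:
            # Split into 2**k equal parts so that each part is at most max_length.
            k = 2 ** math.ceil(math.log2(total / max_length))
            clips.extend([total / k] * k)
        else:
            clips.append(total)
        pending = []
    return clips

print(plan_slices([[0, 1000], [1500, 6500], [7000, 47000]]))
```

Here the 1 s segment is merged into the following 5 s segment, and the 40 s segment is cut into four equal parts. As in the slicing loop of this notebook, short segments still pending at the end of a file are dropped.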


In [None]:
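# cuda_devices = [0]*2
cuda_devices = [0,1,2,3]*2 # GPU ids, one entry per worker process: here GPUs 0-3 with two workers each. For a single GPU, use [0]*2
NJOB = len(cuda_devices) # number of processes running VAD concurrently; one GPU can usually hold 2-4 processes

min_length = 2.0 # minimum segment length in seconds; a shorter segment is merged with the segments that follow
max_length = 15.0 # maximum clip length in seconds; a longer clip is cut into equal parts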

In [None]:
!rm -rf {os.path.join(start_path, "tmp")}
os.makedirs(os.path.join(start_path, "tmp"), exist_ok=True)

_script = f'''
import os
import sys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

start_path = '{start_path}'
inference_pipeline = pipeline(
    task=Tasks.voice_activity_detection,
    model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    model_revision='v1.1.8',
    output_dir=os.path.join(start_path, "vad", "vad_"+sys.argv[1]),
)

print(os.path.join(start_path, 'tmp', "temp_"+sys.argv[1]+".scp"))
inference_pipeline(audio_in=os.path.join(start_path, 'tmp', "temp_"+sys.argv[1]+".scp"))
'''
with open(os.path.join(start_path, "tmp", "temp.py"), "w") as f:
    f.write(_script)
    
data = glob(os.path.join(start_path,"wavs", "*.wav"))
slice_len = (len(data) + NJOB - 1) // NJOB
for IJOB in range(NJOB):
    data_dir = data[IJOB*slice_len : (IJOB+1)*slice_len]
    with open(os.path.join(start_path, "tmp", f"temp_{IJOB}.scp"), "w") as f:
        for i in data_dir:
            f.writelines(i.split("/")[-1].split(".")[0] + " "+ i + "\n")

In [None]:
res = pqdm(list(zip(cuda_devices, list(range(NJOB)))), run_multiprocess, n_jobs=NJOB)

In [None]:
datas = glob(os.path.join(start_path,"vad/*/1best_recog/*"))
os.makedirs(start_path + "slices", exist_ok=True)
os.makedirs(start_path + "metas", exist_ok=True)

with open(start_path+"metas/meta.csv", "w") as meta:
    for data in tqdm(datas):
        with open(data) as f:
            f={i[:i.find(" ")]:json.loads(i[i.find(" "):]) for i in f.readlines()}
        for name in f:
            wav, fs = sf.read(start_path + f"wavs/{name}.wav")
            
            if f[name] == []:
                continue
            
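            temp = []  # pending [start, end] sample ranges, merged into the next clip
            count = 0
            for idxs in f[name]:
                # VAD timestamps are in milliseconds; convert them to sample indices.
                temp.append([int(idxs[0] / 1000 * fs), int(idxs[1] / 1000 * fs)])
                if (idxs[1] - idxs[0]) / 1000 < min_length:
                    continue  # too short: keep it pending and merge it with the following segment
                
                _wavs = [np.concatenate([wav[i[0]:i[1]] for i in temp])]
                if (_wavs[0].shape[0] / fs) > max_length:
                    # Too long: split into 2**k equal parts so that each part is at most max_length.
                    _k = 2 ** np.ceil(np.log2((_wavs[0].shape[0]) / fs / max_length))
                    for i in range(int(_k)):
                        _wavs.append(_wavs[0][int(i/_k*_wavs[0].shape[0]) : int((i+1)/_k*_wavs[0].shape[0])])
                    _wavs = _wavs[1:]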
                    
                for _wav in _wavs:
                    sf.write(start_path + f"slices/{name}_{str(count).zfill(4)}.wav", _wav, fs)
                    _meta = {'audio_filepath': start_path + f"slices/{name}_{str(count).zfill(4)}.wav", "duration": round(_wav.shape[0]/fs, 3)}
                    meta.writelines(json.dumps(_meta) + '\n')
                
                    count += 1
                temp = []

In [None]:
with open(start_path+"metas/meta.csv") as meta:
    meta = [json.loads(i) for i in meta.readlines()]
    
meta = pd.DataFrame(meta)
print(
    '15th percentile duration (s)', meta['duration'].quantile(0.15), 
    '\n85th percentile duration (s)', meta['duration'].quantile(0.85), 
    '\ntotal duration (h)', round(meta['duration'].sum()/60/60, 5)
     )

print(meta.head(5))
Audio(random.choice(meta.values)[0])

-----
### modelscope ASR (about 30 min)

Call the modelscope ASR model directly.

Use `watch -n1 "cat asr/asr_*/1best_recog/text |wc -l"` to watch the current progress.
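
Each worker process reads its own `.scp` list, so the clip list is split into `NJOB` contiguous shards with ceiling division. A minimal sketch of that split follows; the helper names `shard` and `scp_line` and the clip paths are hypothetical, chosen only for illustration.

```python
def shard(items, n_jobs):
    """Split items into n_jobs contiguous shards using ceiling division."""
    slice_len = (len(items) + n_jobs - 1) // n_jobs
    return [items[i * slice_len:(i + 1) * slice_len] for i in range(n_jobs)]

def scp_line(path):
    """Format one '<utterance id> <path>' line of an .scp list."""
    utt_id = path.split("/")[-1].split(".")[0]
    return f"{utt_id} {path}"

# Ten made-up clip paths spread over four workers.
shards = shard([f"./slices/clip_{i:04d}.wav" for i in range(10)], n_jobs=4)
print([len(s) for s in shards])
print(scp_line(shards[0][0]))
```

With ceiling division the last shards can be shorter than the others, or even empty when there are fewer clips than workers, so a worker may receive an empty list.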

In [None]:
cuda_devices = [0,1,2,3] * 2
NJOB = len(cuda_devices) 

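# BATCH depends on your GPU; with more GPU memory you can raise it.
# Note that a larger batch seems to degrade the results slightly.
# Also, with a batch larger than 1 the `@` problem mentioned earlier shows up.
# It is quite random and may affect only a few clips. If you do not want to patch the code, keep it at 1.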
BATCH = 1 

In [None]:
!rm -rf {os.path.join(start_path, "tmp")}
os.makedirs(os.path.join(start_path, "tmp"), exist_ok=True)

_script = f'''
import os
import sys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

start_path = '{start_path}'
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    output_dir=os.path.join(start_path, "asr", "asr_"+sys.argv[1]),
    batch={BATCH},
)

print(os.path.join(start_path, 'tmp', "temp_"+sys.argv[1]+".scp"))
inference_pipeline(audio_in=os.path.join(start_path, 'tmp', "temp_"+sys.argv[1]+".scp"))
'''

with open(os.path.join(start_path, "tmp", "temp.py"), "w") as f:
    f.write(_script)
    
data = glob(os.path.join(start_path, "slices", "*.wav"))
slice_len = (len(data) + NJOB - 1) // NJOB
for IJOB in range(NJOB):
    data_dir = data[IJOB*slice_len : (IJOB+1)*slice_len]
    with open(os.path.join(start_path, "tmp", f"temp_{IJOB}.scp"), "w") as f:
        for i in data_dir:
            f.writelines(i.split("/")[-1].split(".")[0] + " "+ i + "\n")

In [None]:
res = pqdm(list(zip(cuda_devices, list(range(NJOB)))), run_multiprocess, n_jobs=NJOB)

In [None]:
datas = glob(os.path.join(start_path,"asr/*/1best_recog/text"))

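# meta.csv holds JSON lines right after the VAD step and a regular CSV once this cell has run; support both.
try:
    with open(start_path+"metas/meta.csv") as meta:
        meta = [json.loads(i) for i in meta.readlines()]
    meta = pd.DataFrame(meta).set_index('audio_filepath')
except ValueError:  # json.JSONDecodeError: the file is already a CSV
    meta = pd.read_csv(start_path + "metas/meta.csv").set_index("audio_filepath")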
meta['_text'] = None

for data in tqdm(datas):
    with open(data) as f:
        _f = {}
        for i in f.readlines():
            if len(i.strip().split()) > 1:
                _f[start_path +'slices/'+ i.split()[0] + '.wav'] = " ".join(i.split()[1:]).strip()
    for name in _f:
        meta.at[name, '_text'] = _f[name]
        
data_dir = meta[meta['_text'].apply(lambda x: x is None)].reset_index()['audio_filepath'].values.tolist()
if len(data_dir) > 0:        
    print("Some clips have no transcript. With a batch of 1, every clip should normally get one.")
    
meta = meta[meta['_text'].apply(lambda x: x is not None)].reset_index()
meta.to_csv(start_path + "metas/meta.csv", index=False)


### modelscope punctuation (about 10 min)

In [None]:
!rm -rf {os.path.join(start_path, "tmp")}
os.makedirs(os.path.join(start_path, "tmp"), exist_ok=True)

_script = f'''
import os
import sys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

start_path = '{start_path}'
inference_pipeline = pipeline(
    task=Tasks.punctuation,
    model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
    output_dir=os.path.join(start_path, "punc", "punc_"+sys.argv[1])
)

print(os.path.join(start_path, 'tmp', "temp_"+sys.argv[1]+".txt"))
rec_result = inference_pipeline(text_in=os.path.join(start_path, 'tmp', "temp_"+sys.argv[1]+".txt"))    
'''
with open(os.path.join(start_path, "tmp", "temp.py"), "w") as f:
    f.write(_script)
    
meta = pd.read_csv(start_path + "metas/meta.csv")
slice_len = (len(meta) + NJOB - 1) // NJOB
for IJOB in range(NJOB):
    data_dir = meta.values[IJOB*slice_len : (IJOB+1)*slice_len]
    with open(os.path.join(start_path, "tmp", f"temp_{IJOB}.txt"), "w") as f:
        for i in data_dir:
            f.writelines(i[0].split("/")[-1].split(".")[0] + "\t"+ i[2] + "\n")

In [None]:
res = pqdm(list(zip(cuda_devices, list(range(NJOB)))), run_multiprocess, n_jobs=NJOB)

In [None]:
from pypinyin import lazy_pinyin, Style
def get_pinyin(text):
    """Convert text to a space-separated phoneme string; pinyin tokens are prefixed with '@'."""
    text = text.lower()
    initials = lazy_pinyin(text, neutral_tone_with_five=False, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, neutral_tone_with_five=False, style=Style.FINALS_TONE3)

    text_phone = []
    for _o in zip(initials, finals):
        if _o[0] != _o[1] and _o[0] != '':
            # regular syllable: initial + final
            _o = ['@'+i for i in _o]
            text_phone.extend(_o)
        elif _o[0] != _o[1] and _o[0] == '':
            # syllable without an initial: final only
            text_phone.append('@'+_o[1])
        else:
            # non-Chinese text (punctuation, letters): emit it character by character
            text_phone.extend(list(_o[0]))

    text_phone = " ".join(text_phone)
    return text_phone

In [None]:
datas = glob(os.path.join(start_path,"punc/*/infer.out"))
meta = pd.read_csv(start_path + "metas/meta.csv").set_index("audio_filepath")

meta['text'] = None
meta['_text_with_punc'] = None

for data in tqdm(datas):
    with open(data) as f:
        f = {start_path +'slices/'+ i.split("\t")[0] + '.wav': i.split("\t")[-1].strip() for i in f.readlines()}
        
    for name in f:
        text = "".join(f[name].split())
        meta.at[name, 'text'] = text
        meta.at[name, '_text_with_punc'] = get_pinyin(text)
            
meta = meta.reset_index()
meta.to_csv(start_path + "metas/meta.csv", index=False)

In [None]:
# Check for consecutive punctuation marks and for None values; some odd bugs have shown up here.
# If you hit one, please contact me 😂.

def _if_continue_punc(x):
    punc = False
    for i in list(x):
        if i in '、，。？！～…—':
            if punc == True:
                return True
            else:
                punc = True
        else:
            punc = False
    return False
        
# Test for missing or empty text first: _if_continue_punc fails on None/NaN.
meta[meta['text'].apply(lambda x: not isinstance(x, str) or x == '' or _if_continue_punc(x))]

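### Convert to the target SR
modelscope resamples its input to 16000 Hz at inference time, which is slow, so everything was converted to 16000 Hz up front.

To train at the target sample rate, the first step (conversion and slicing) has to be repeated at that rate.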
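
Because the VAD timestamps are stored in milliseconds, they do not depend on the sample rate, so the same segments can be cut again from audio at another rate. A minimal sketch of the conversion, assuming the hypothetical helper name `ms_to_samples`:

```python
def ms_to_samples(start_ms, end_ms, fs):
    """Convert a [start_ms, end_ms] segment to sample indices at sample rate fs."""
    return int(start_ms / 1000 * fs), int(end_ms / 1000 * fs)

segment = (1500, 6500)  # made-up segment, in milliseconds
print(ms_to_samples(*segment, fs=16000))
print(ms_to_samples(*segment, fs=22050))
```

The same segment maps to different sample indices at each rate, while its duration in seconds is unchanged.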

In [None]:
def run(x):
    # Same as the first conversion, but at the target sample rate SR.
    name = "/".join(x.split("/")[:-1]).replace("/raw","/wavs/") + x.split("/")[-1].split(".")[0]
    # Both paths are quoted so that file names containing spaces survive the shell.
    return subprocess.run(f"ffmpeg -hide_banner -loglevel panic -i '{x}' -ac 1 -ar {SR} -f s16le - "+
                   f"| ffmpeg -hide_banner -loglevel panic -f s16le -ar {SR} -i - -f segment -segment_time {sg_len} '{name}_%03d.wav'",
                   shell=True)

os.makedirs(os.path.join(start_path, "wavs"), exist_ok=True)
wavs_name = glob(os.path.join(start_path, "raw/*.*"))

res = pqdm(wavs_name, run, n_jobs=CPU_kernels)

In [None]:
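# Re-cut the slices at the target sample rate using the VAD timestamps computed earlier.
# The slicing logic and file names match the 16 kHz run, so the paths in meta.csv stay valid.
datas = glob(os.path.join(start_path,"vad/*/1best_recog/*"))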
for data in tqdm(datas):
    with open(data) as f:
        f={i[:i.find(" ")]:json.loads(i[i.find(" "):]) for i in f.readlines()}
    for name in f:
        wav, fs = sf.read(start_path + f"wavs/{name}.wav")
        assert fs == SR
        
        if f[name] == []:
            continue

        temp = []
        count = 0
        for idxs in f[name]:
            temp.append([int(idxs[0] / 1000 * fs), int(idxs[1] / 1000 * fs)])
            if (idxs[1] - idxs[0]) / 1000 < min_length:
                continue

            _wavs = [np.concatenate([wav[i[0]:i[1]] for i in temp])]
            if (_wavs[0].shape[0] / fs) > max_length:
                _k = 2 ** np.ceil(np.log2((_wavs[0].shape[0]) / fs / max_length))
                for i in range(int(_k)):
                    _wavs.append(_wavs[0][int(i/_k*_wavs[0].shape[0]) : int((i+1)/_k*_wavs[0].shape[0])])
                _wavs = _wavs[1:]

            for _wav in _wavs:
                sf.write(start_path + f"slices/{name}_{str(count).zfill(4)}.wav", _wav, fs)
                count += 1
            temp = []

-----
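## Extract auxiliary features

### Extract bert features (about 30 min)
The features produced by bert are simply duplicated to follow the phoneme structure.

For example, for 我的A with per-character features (1 2 3): w o3 d e4 a -> 我 我 的 的 a -> 1 1 2 2 3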
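
The duplication rule can be sketched without any model or pinyin library. This is a minimal illustration under assumptions: the function name `expand_to_phonemes` is hypothetical, the (initial, final) pairs are written by hand rather than produced by `lazy_pinyin`, and integers stand in for feature vectors.

```python
def expand_to_phonemes(pairs, vecs):
    """Repeat each character-level feature once per phoneme token.

    pairs: one (initial, final) tuple per character.
    vecs:  one feature per character (integers here, for readability).
    """
    tokens, expanded = [], []
    for (initial, final), vec in zip(pairs, vecs):
        if initial != final and initial != '':
            # Regular syllable: initial + final, two tokens.
            tokens.extend(['@' + initial, '@' + final])
            expanded.extend([vec] * 2)
        elif initial != final:
            # Syllable without an initial: only the final, one token.
            tokens.append('@' + final)
            expanded.append(vec)
        else:
            # Non-Chinese text is passed through: one token per character.
            tokens.extend(list(initial))
            expanded.extend([vec] * len(initial))
    return tokens, expanded

# Hand-written pairs for "我的a" (assumed values for illustration).
pairs = [('w', 'o3'), ('d', 'e'), ('a', 'a')]
print(expand_to_phonemes(pairs, [1, 2, 3]))
```

Each Chinese character contributes two tokens when it has an initial and one otherwise, so the expanded feature list always has the same length as the phoneme sequence.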

In [None]:
use_gpt = False # True: GPT2 features; False: Bert-based features. The two seem to perform about the same.

In [None]:
!rm -rf {os.path.join(start_path, "tmp")}
os.makedirs(os.path.join(start_path, "tmp"), exist_ok=True)

_script = f'''
import os
import sys
import torch
from tqdm import tqdm
from glob import glob
import json

import numpy as np
import pandas as pd

from pypinyin import lazy_pinyin, Style

if {not use_gpt}:
    from transformers import AutoTokenizer, AutoModelForMaskedLM
    tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
    model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

if {use_gpt}:
    from transformers import AutoTokenizer, AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
    model = AutoModelForCausalLM.from_pretrained("uer/gpt2-chinese-cluecorpussmall")

device = 'cuda:0'
model = model.to(device)

start_path = '{start_path}'

NJOB={NJOB}
meta = pd.read_csv(start_path + "metas/meta.csv")
slice_len = (len(meta) + NJOB - 1) // NJOB
meta = meta.iloc[int(sys.argv[1])*slice_len : (int(sys.argv[1])+1)*slice_len]  # iloc: end-exclusive, so shards do not overlap
'''

In [None]:
_script += f'''
os.makedirs(start_path + 'bert_feats/', exist_ok=True)
for _, series in tqdm(meta.iterrows(), total=len(meta)):
    name = start_path + "bert_feats/" + series['audio_filepath'].split("/")[-1].replace(".wav",".npy")
    text = series['text']
    
    with torch.no_grad():
        inputs = tokenizer(text, return_tensors='pt')
        for i in inputs:
            inputs[i] = inputs[i].to(device)
        res = model(**inputs, output_hidden_states=True)
        res = torch.cat(res['hidden_states'][-3:-2], -1)[0].cpu().numpy() # includes the start and end special tokens
    
    initials = lazy_pinyin(text, neutral_tone_with_five=False, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, neutral_tone_with_five=False, style=Style.FINALS_TONE3)
    
    _vecs = []
    _text = []
    _chars = []
    for _o in zip(zip(initials, finals), text, res[1:-1]):
        _o, _c, _vec = _o
        if _o[0] != _o[1] and _o[0] != '':
            _text.extend(['@'+i for i in _o])
            _chars.extend([_c]*2)
            _vecs.extend([_vec]*2)
        elif _o[0] != _o[1] and _o[0] == '':
            _text.append('@'+_o[1])
            _chars.append(_c)
            _vecs.append(_vec)
        else:
            _text.extend(list(_o[0]))  
            _chars.extend([_c]*len(_o[0]))
            _vecs.extend([_vec]*len(_o[0]))
    try:
        assert len(_text) == len(_chars)
        assert len(_vecs) == len(_text)
    except:
        print(name)
        continue
        
    _vecs = np.stack([res[0]] + _vecs + [res[-1]])
    np.save(name, _vecs)
'''

In [None]:
with open(os.path.join(start_path, "tmp", "temp.py"), "w") as f:
    f.write(_script)
res = pqdm(list(zip(cuda_devices, list(range(NJOB)))), run_multiprocess, n_jobs=NJOB)

------
## To NEMO format & Ready to go.
Convert meta to the format NeMo requires. After that, training can start!
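
A manifest is a text file with one JSON object per line. The sketch below builds one such line with the four fields this notebook writes (`audio_filepath`, `duration`, `text`, `normalized_text`); the helper name `manifest_line`, the file path and the phoneme values are made up for illustration.

```python
import json

def manifest_line(audio_filepath, duration, text, phonemes):
    """Build one JSON line with the manifest fields used in this notebook."""
    entry = {
        "audio_filepath": audio_filepath,
        "duration": duration,
        "text": text,
        "normalized_text": " ".join(phonemes),
    }
    # ensure_ascii=False keeps Chinese characters readable in the file.
    return json.dumps(entry, ensure_ascii=False)

line = manifest_line(
    "./slices/demo_000_0000.wav",  # hypothetical clip path
    3.2,
    "大家好",
    ["@d", "@a4", "@j", "@ia1", "@h", "@ao3"],  # assumed phonemes
)
print(line)
```

Keeping one object per line means the training and validation manifests can be produced by simply writing different subsets of rows to different files.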

### Pack into the required data format

In [None]:
for_valid = 128 # number of clips held out for validation

In [None]:
def meta_to_nemo(meta, g):
    # Each row is [audio_filepath, duration, _text, text, _text_with_punc].
    for series in tqdm(meta):
        lines = {}
        lines['audio_filepath'] = start_path + 'slices/' + series[0].split("/")[-1]
        lines['duration'] = series[1]
        lines['text'] = series[-2]
        
        phonemes = series[-1].split()
        if len(phonemes) == 0:
            print(series[0])  # report clips with an empty phoneme sequence and skip them
            continue
        lines['normalized_text'] = " ".join(phonemes)
        
        g.write(json.dumps(lines, ensure_ascii=False)+'\n')

In [None]:
meta = pd.read_csv(start_path + 'metas/meta.csv')

portion = (meta['_text_with_punc'].apply(lambda x:len(x.split())) / meta['duration'])
meta_processed = meta[(portion >= portion.quantile(0.05)) &(portion <= portion.quantile(0.95))]
meta_processed = meta_processed[(3 < meta_processed['duration']) & (meta_processed['duration'] < 13)]
meta_processed = meta_processed.values

print("Total:", len(meta_processed))

In [None]:
np.random.shuffle(meta_processed)
os.makedirs(start_path + "metas/nemo", exist_ok=True)

with open(start_path + "metas/nemo/train_manifest.json", "w") as g:
    meta_to_nemo(meta_processed[:-for_valid], g)
        
with open(start_path + "metas/nemo/val_manifest.json", "w") as g:
    meta_to_nemo(meta_processed[-for_valid:], g)

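### Train Fastpitch (about 7 h)

There are two main steps. The first is training fastpitch; remember to install NeMo's requirements.

With 10 hours of data each epoch takes about 50 seconds; training is planned for 500 epochs, so it should finish after about 7 hours.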
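
The estimate is simple arithmetic on the per-epoch time. A small sketch, assuming the hypothetical helper name `training_hours`:

```python
def training_hours(seconds_per_epoch, epochs):
    """Total wall-clock training time in hours."""
    return seconds_per_epoch * epochs / 3600

# 50 s per epoch for 500 epochs, the figures quoted above.
print(round(training_hours(50, 500), 1))
```

The same calculation applies to any other stage once its per-epoch time is known, which makes it easy to re-estimate the schedule for a larger dataset.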

In [None]:
os.makedirs(os.path.join(start_path, "codes"), exist_ok=True)
!cd {os.path.join(start_path, "codes")} && git clone -b r1.20.0 https://github.com/NVIDIA/NeMo.git

In [None]:
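# Patch a few files
##################################
# 1. Data classes: mainly adds a tokenizer, which makes it easy to test MoeGoe's IPA (it helps for multilingual setups but makes no real difference for Chinese, so pinyin is what is actually used).
# 2. Adds bert features, which requires changes to the model code and the data-loading code.
# 3. With pyin as the pitch extractor the first epoch is very slow; switching to pyworld is considerably faster.
# 4. Adjusts some basic hyperparameters, such as learning rate and batch size.
# 5. NeMo's MAS uses numba, which occupies all CPU cores by default; limiting it to 4 threads does not affect speed.
# 6. Finally, training uses fp16. Very bad data can occasionally make training blow up; if it does, rerun in fp32 (the probability is actually very low).
##################################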

!cp ./replace_files/tts_tokenizers.py {start_path}/codes/NeMo/nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py
!cp ./replace_files/dataset.py {start_path}/codes/NeMo/nemo/collections/tts/data/dataset.py
!cp ./replace_files/tts_data_types.py {start_path}/codes/NeMo/nemo/collections/tts/torch/tts_data_types.py

!cp ./replace_files/fs_model.py {start_path}/codes/NeMo/nemo/collections/tts/models/fastpitch.py
!cp ./replace_files/fs_modules.py {start_path}/codes/NeMo/nemo/collections/tts/modules/fastpitch.py
!cp ./replace_files/bert_tsfm.py {start_path}/codes/NeMo/nemo/collections/tts/modules/transformer.py

!cp ./replace_files/fastpitch_align_v1.05.yaml {start_path}/codes/NeMo/examples/tts/conf/fastpitch_align_v1.05.yaml
!cp ./replace_files/fs_run.py {start_path}/codes/NeMo/examples/tts/fastpitch.py

# !cd {start_path}/codes/NeMo && ./reinstall.sh
!cd {start_path}/codes/NeMo && cat ./requirements/requirements_tts.txt ./requirements/requirements_asr.txt \
./requirements/requirements_common.txt ./requirements/requirements_lightning.txt ./requirements/requirements.txt > ./requirements.txt \
&& pip install -r requirements.txt

!conda install -c conda-forge pyworld -y

In [None]:
# Extract the pitch first; it is used below to compute the pitch statistics.
# Run the command that echo prints. On Colab you can remove echo and run it directly.

!echo PYTHONPATH={start_path}/codes/NeMo CUDA_VISIBLE_DEVICES=0 python {start_path}/codes/NeMo/examples/tts/fastpitch.py \
    train_dataset={start_path}/metas/nemo/train_manifest.json \
    validation_datasets={start_path}/metas/nemo/val_manifest.json \
    sup_data_path={start_path}/sup_data \
    exp_manager.exp_dir={start_path}/tmp \
    bert_path={start_path}/bert_feats \
    trainer.strategy=null name=testing pitch_mean=130.01991 pitch_std=50.18665 trainer.max_epochs=1

In [None]:
# Compute the pitch mean and standard deviation: let NeMo generate the sup_data files first, stop the run, then compute the statistics here.

temp = []
for i in tqdm(glob(start_path + "sup_data/pitch/*")):
    _pitch = torch.load(i)
    _pitch = _pitch[_pitch!=0].numpy()
    temp.append(_pitch)
    
_pitch = np.concatenate(temp)
pmax, pmin = _pitch.max(), _pitch.min()
pstd, pmean = _pitch.std(), _pitch.mean()

print(pmax, pmin)
print(pstd, pmean)

In [None]:
# Start training!

!echo PYTHONPATH={start_path}/codes/NeMo CUDA_VISIBLE_DEVICES=0 python {start_path}/codes/NeMo/examples/tts/fastpitch.py \
    train_dataset={start_path}/metas/nemo/train_manifest.json \
    validation_datasets={start_path}/metas/nemo/val_manifest.json \
    sup_data_path={start_path}/sup_data \
    exp_manager.exp_dir={start_path}/results \
    bert_path={start_path}/bert_feats \
    trainer.strategy=null name=fs2 pitch_mean={pmean} pitch_std={pstd} pitch_fmin={pmin} pitch_fmax={pmax} 
#     +init_from_pretrained_model='tts_zh_fastpitch_sfspeech'

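### Train HIFIGAN (time: ?)

This step takes a long time; in general, the longer you train, the less fuzzy the audio.

Once a checkpoint appears, you can periodically check it in the inference section below.

hifigan training is fairly slow because FP16 is not used (FP16 causes some quality loss).
<br>One epoch takes about 3 minutes, and it usually takes around 100 epochs, i.e. 5 hours.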

In [None]:
# Dump the mel spectrograms predicted by the trained fs2 model

!echo PYTHONPATH={start_path}/codes/NeMo CUDA_VISIBLE_DEVICES=0 python {start_path}/codes/NeMo/examples/tts/fastpitch.py \
    train_dataset={start_path}/metas/nemo/train_manifest.json \
    validation_datasets={start_path}/metas/nemo/val_manifest.json \
    sup_data_path={start_path}/sup_data \
    exp_manager.exp_dir={start_path}/results \
    bert_path={start_path}/bert_feats \
    name=fs2 pitch_mean={pmean} pitch_std={pstd} pitch_fmin={pmin} pitch_fmax={pmax} \
    model.train_ds.dataloader_params.batch_size=1 trainer.precision=32 \
    model.get_mel_result={start_path}/sup_data/pred_mels \
    trainer.strategy=null trainer.max_epochs=1000000

In [None]:
# Build the manifest files

with open(f'{start_path}/metas/nemo/train_manifest_mel.json', "w") as g:
    for s in tqdm(glob(f'{start_path}/sup_data/pred_mels/*.wav')):
        _wav, _fs = sf.read(s)
        _dict = {
            'audio_filepath': s,
            'duration': round(_wav.shape[0] / _fs, 3),
            'mel_filepath': s.replace(".wav", ".npy"),
        }
        g.writelines(json.dumps(_dict, ensure_ascii=False)+'\n')
        
!head -n32 {start_path}/metas/nemo/train_manifest_mel.json > {start_path}/metas/nemo/val_manifest_mel.json

In [None]:
# finetune HIFIGAN.
# This step downloads NeMo's checkpoint, which can be slow >_<

!echo PYTHONPATH={start_path}/codes/NeMo CUDA_VISIBLE_DEVICES=0 python {start_path}/codes/NeMo/examples/tts/hifigan_finetune.py \
    train_dataset={start_path}/metas/nemo/train_manifest_mel.json \
    validation_datasets={start_path}/metas/nemo/val_manifest_mel.json \
    exp_manager.exp_dir={start_path}/results \
    model/train_ds=train_ds_finetune model/validation_ds=val_ds_finetune \
    trainer.strategy=null name=hifigan \
    trainer.check_val_every_n_epoch=1 \
    +init_from_pretrained_model='tts_zh_hifigan_sfspeech' \
    --config-name hifigan.yaml 

----------
## Infer with NeMo

Once all the training jobs above have finished, you can run inference.

Run the cells in order and change raw_text to the text you need.

In [None]:
import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION']='python'

In [None]:
import logging
import sys
logging.disable(logging.ERROR)
sys.path.append(start_path + "codes/NeMo")
from nemo.collections.tts.models import HifiGanModel, FastPitchModel
from transformers import AutoTokenizer, AutoModelForMaskedLM

device='cuda:0'
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
bert_model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
bert_model = bert_model.to(device).eval()

In [None]:
# load hifigan
hfg_path = glob(start_path + "results/hifigan/*/checkpoints/*last*")
if len(hfg_path) > 0:
    vocoder_model_pt = HifiGanModel.load_from_checkpoint(checkpoint_path=hfg_path[0]).eval().to(device)
else:
    # If fine-tuning has not finished yet, you can still get a rough listen
    print("Using the vocoder provided by nemo; fine-tune it for better quality!")
    vocoder_model_pt = HifiGanModel.from_pretrained(model_name="tts_zh_hifigan_sfspeech").to(device) 

In [None]:
# load fastpitch
fastpitch_model_path =glob(start_path + "results/fs2/checkpoints/*last*")[0]
spec_gen_model = FastPitchModel.load_from_checkpoint(checkpoint_path=fastpitch_model_path).eval().to(device)

In [None]:
# Edit the text here.
raw_text = "大家好我是二次元峰哥"

In [None]:
punc_e2c = {',':'，', '.':'。','?':'？',"、":"，"}
raw_text="".join([punc_e2c[i] if i in punc_e2c else i for i in raw_text])
text = raw_text.replace(" ","").lower()

with torch.no_grad():
    inputs = tokenizer("".join(text), return_tensors='pt')
    for i in inputs:
        inputs[i] = inputs[i].to(device)
    res = bert_model(**inputs, output_hidden_states=True)
    res = torch.cat(res['hidden_states'][-3:-2], -1)[0].cpu().numpy()

# Use the same cleaned string for pinyin as for BERT so that characters and features stay aligned.
initials = lazy_pinyin(text, neutral_tone_with_five=False, style=Style.INITIALS, strict=False)
finals = lazy_pinyin(text, neutral_tone_with_five=False, style=Style.FINALS_TONE3)

_vecs = []
_text = []
for _o in zip(zip(initials, finals), list(text), res[1:-1]):
    _o, _c, _vec = _o
    if _o[0] != _o[1] and _o[0] != '':
        _text.extend(['@'+i for i in _o])
        _vecs.extend([_vec]*2)
    elif _o[0] != _o[1] and _o[0] == '':
        _text.append('@'+_o[1])
        _vecs.append(_vec)
    else:
        _text.extend(list(_o[0]))
        _vecs.extend([_vec]*len(_o[0]))

_vecs = np.stack([res[0]] + _vecs + [res[-1]])
bert = torch.tensor(_vecs.T[None]).to(device)
phoneme = " ".join(_text)

In [None]:
with torch.no_grad():
    parsed = spec_gen_model.parse(str_input=phoneme, normalize=False)
    res = spec_gen_model.generate_spectrogram(
        tokens=parsed, pace=1,
        bert_feats=bert,
    )
    spectrogram = res
    audio = vocoder_model_pt.convert_spectrogram_to_audio(spec=spectrogram)

spectrogram = spectrogram.to('cpu').numpy()[0]
audio = audio.to('cpu').numpy()
# audio = audio / np.abs(audio).max()

In [None]:
iply.display(iply.Audio(audio[0], rate=22050))

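## Summary

1. For TTS, 10 hours of data is usually enough for good results. With a good base model, training can be much faster.
2. In most cases there is no need to train VITS; for video production and livestreaming, fs2 is enough.
3. More data means better results. Both chatgpt and stable diffusion show that the arms race at the data level is far from over. In the author's own experiments, quality kept improving as the data grew from 10 h to 50 h.
4. Better models mean greater risk: fraud with chatgpt and NSFW images from unstable diffusion are the dark side of technology. Avoiding that dark side will remain an important challenge. While this video was being made, the Cyberspace Administration of China (网信办) drafted new regulations; I believe that, within a legal framework, technology can gradually serve everyone better.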

