# Train your first 🐸 TTS model 💫

### 👋 Hello and welcome to Coqui (🐸) TTS

The goal of this notebook is to show you a **typical workflow** for **training** and **testing** a TTS model with 🐸.

Let's train a very small model on a very small amount of data so we can iterate quickly.

In this notebook, we will:

1. Download data and format it for 🐸 TTS.
2. Configure the training and testing runs.
3. Train a new model.
4. Test the model and display its performance.

So, let's jump right in!


In [1]:
## Install Coqui TTS
!git clone https://github.com/coqui-ai/TTS

Cloning into 'TTS'...
remote: Enumerating objects: 28948, done.
remote: Counting objects: 100% (162/162), done.
remote: Compressing objects: 100% (127/127), done.
remote: Total 28948 (delta 45), reused 128 (delta 29), pack-reused 28786
Receiving objects: 100% (28948/28948), 129.69 MiB | 9.32 MiB/s, done.
Resolving deltas: 100% (21044/21044), done.


In [1]:
%cd TTS
!make system-deps  # only on Linux systems.
!make install

/content/TTS
sudo apt-get install -y libsndfile1-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
libsndfile1-dev is already the newest version (1.0.28-7ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.
pip install -e .[all]
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/TTS
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting numba==0.55.1
  Using cached numba-0.55.1-1-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.4 MB)
Installing collected packages: numba, TTS
  Attempting uninstall: numba
    Found existing installation: numba 0.55.2
    Uninstalling numba-0.55.2:
      Successfully uninstalled numba-0.55.2
  Attemptin

## ✅ Data Preparation

### **First things first**: we need some data.

We're training a Text-to-Speech model, so we need some _text_ and we need some _speech_. Specifically, we want _transcribed speech_. The speech must be divided into audio clips, and each clip needs a transcription. More details about data requirements, such as recording characteristics, background noise, and vocabulary coverage, can be found in the [🐸TTS documentation](https://tts.readthedocs.io/en/latest/formatting_your_dataset.html).

If you have a single long audio file, you need to **split** it into clips. It is also important to use a lossless audio format to prevent compression artifacts. We recommend the **wav** file format.

The data format we adopt in this tutorial is taken from the widely used **LJSpeech** dataset, where the **wav** files are collected under a single folder:

<span style="color:purple;font-size:15px">
/wavs<br /> 
 &emsp;| - audio1.wav<br /> 
 &emsp;| - audio2.wav<br /> 
 &emsp;| - audio3.wav<br /> 
  ...<br /> 
</span>

and a **metadata.csv** file will have the audio file name in parallel to the transcript, delimited by `|`: 
 
<span style="color:purple;font-size:15px">
# metadata.csv <br /> 
audio1|This is my sentence. <br /> 
audio2|This is maybe my sentence. <br /> 
audio3|This is certainly my sentence. <br /> 
audio4|Let this be your sentence. <br /> 
...
</span>

In the end, we should have the following **folder structure**:

<span style="color:purple;font-size:15px">
/MyTTSDataset <br /> 
&emsp;| <br /> 
&emsp;| -> metadata.csv<br /> 
&emsp;| -> /wavs<br /> 
&emsp;&emsp;| -> audio1.wav<br /> 
&emsp;&emsp;| -> audio2.wav<br /> 
&emsp;&emsp;| ...<br /> 
</span>
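
Before moving on, it can help to verify that the folder really matches this layout. The cell below is a minimal sketch, assuming the two-column `name|transcript` metadata shown above; `check_ljspeech_layout` is a hypothetical helper written for this notebook, not part of 🐸TTS.

```python
from pathlib import Path


def check_ljspeech_layout(root):
    """Return a list of problems found in an LJSpeech-style dataset folder.

    Hypothetical helper: expects <root>/metadata.csv and <root>/wavs/*.wav.
    """
    root = Path(root)
    meta = root / "metadata.csv"
    if not meta.is_file():
        return [f"missing {meta}"]

    problems = []
    lines = meta.read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("|")
        # Every row needs a file name and a non-empty transcript.
        if len(parts) < 2 or not parts[1].strip():
            problems.append(f"line {line_no}: expected 'name|transcript'")
            continue
        wav = root / "wavs" / f"{parts[0].strip()}.wav"
        if not wav.is_file():
            problems.append(f"line {line_no}: missing {wav.name}")
    return problems
```

An empty list means every metadata row has a transcript and a matching clip under `wavs/`.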

🐸TTS already provides tooling for the _LJSpeech_ format. If you use the same format, you can start training your models right away. <br /> 

After you collect and format your dataset, you need to check two things: whether you need a **_formatter_** and whether you need a **_text_cleaner_**. <br /> The **_formatter_** loads the text file (created above) as a list, and the **_text_cleaner_** performs a sequence of text normalization operations that convert the raw text into its spoken representation (e.g. converting numbers, acronyms, and symbols to their spoken form).
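
To make the idea of a text cleaner concrete, here is a deliberately tiny sketch. It assumes a cleaner is simply a function that takes a string and returns a normalized string; `simple_cleaner` and its symbol table are hypothetical, and real cleaners (number expansion, abbreviations) are considerably more involved.

```python
import re

# Hypothetical symbol table: spoken replacements for a couple of symbols.
_SYMBOLS = {"&": " and ", "%": " percent "}
_WHITESPACE = re.compile(r"\s+")


def simple_cleaner(text: str) -> str:
    """Lowercase, spell out a few symbols, and collapse whitespace."""
    text = text.lower()
    for symbol, spoken in _SYMBOLS.items():
        text = text.replace(symbol, spoken)
    return _WHITESPACE.sub(" ", text).strip()
```

For example, `simple_cleaner("  Hello   &  World ")` yields `"hello and world"`.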

If you use a dataset format other than LJSpeech or the other public datasets that 🐸TTS supports, you need to write your own **_formatter_** and **_text_cleaner_**.
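
As a rough illustration, a custom formatter for the two-column metadata file above could look like the sketch below. The function name and the `my_speaker` label are placeholders, and the dictionary keys follow the convention described in the 🐸TTS dataset-formatting documentation as I understand it, so please verify them against your installed version.

```python
import os


def my_formatter(root_path, meta_file, **kwargs):
    """Hypothetical formatter for 'name|transcript' metadata files."""
    speaker_name = "my_speaker"  # placeholder label for a single-speaker set
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split only on the first '|' so transcripts may contain pipes.
            name, text = line.split("|", maxsplit=1)
            items.append(
                {
                    "text": text.strip(),
                    "audio_file": os.path.join(root_path, "wavs", name.strip() + ".wav"),
                    "speaker_name": speaker_name,
                    "root_path": root_path,
                }
            )
    return items
```

Recent 🐸TTS versions appear to accept such a function through a `formatter` argument of `load_tts_samples`; check the signature in your installation before relying on it.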

## ⏳️ Loading your dataset
Load one of the datasets supported by 🐸TTS.

We start by defining a dataset config that sets LJSpeech as the dataset format and points to the dataset path.


If you run into a "character not found" warning further down, run this cell and copy-paste the characters it prints.

In [2]:
# pylint: disable=bad-option-value
import argparse
from argparse import RawTextHelpFormatter

from TTS.config import load_config
from TTS.tts.datasets import load_tts_samples

parser = argparse.ArgumentParser(
    description="""Find all the unique characters or phonemes in a dataset.\n\n"""
    """
Example runs:

python TTS/bin/find_unique_chars.py --config_path config.json
    """,
    formatter_class=RawTextHelpFormatter,
)
# parser.add_argument("--config_path", type=str, help="Path to dataset config file.", required=True)
# args = parser.parse_args()

c = load_config('path/to/config.json')

# load all datasets
train_items, eval_items = load_tts_samples(
    c.datasets, eval_split=True, eval_split_max_size=c.eval_split_max_size, eval_split_size=c.eval_split_size
)

items = train_items + eval_items

texts = "".join(item["text"] for item in items)
chars = set(texts)
lower_chars = filter(lambda c: c.islower(), chars)
chars_force_lower = [c.lower() for c in chars]
chars_force_lower = set(chars_force_lower)

print(f" > Number of unique characters: {len(chars)}")
print(f" > Unique characters: {''.join(sorted(chars))}")
print(f" > Unique lower characters: {''.join(sorted(lower_chars))}")
print(f" > Unique all forced to lower characters: {''.join(sorted(chars_force_lower))}")


 | > Found 2495 files in /content/TTS/tts_train_dir/test
 > Number of unique characters: 808
 > Unique characters: 
 !-.0123456789?CDEFJNPRWXeghinortw~가각간갇갈감갑갓갔강갖같개거걱건걷걸검겁것게겜겠겨격결겸겼경계고곡곤곳공과관광괜괭굉교구국군굳굴굿궁권귀그극근글금급기긴길김깃깄깊까깎깐깔깜깝깨깽꺼껍껏껴꼈꼬꼭꽂꽉꽤꾸꿀꿔꿨뀌뀔끄끈끊끌끔끝끼낀낌나낚난날남낫났낮내냅냉냐냥너넌널넓넘넣네넥넨녀년녕노녹놀놈농높놓놔놨누눈눌눴느는늘능늦늪니닌닐님다닥단닫달닭닮담답당닿대댓댔더덕던덜덮데덴델도독돈돌동돼됐되된될됩두둑둘둠둥둬뒀뒤드든들듯등디딨딩따딱딴땀땅때땐떡떨떴떻떼또똑똘똥뜨뜬뜯뜻띄라란랄람랍랐랑랗래랬량러럭런럴럼럽렀렇레렉려력련렬렵렸령로록론롤롭료루룩류률륭르른를름릇릉리린릴림립릿링마막만많말맙맛망맞매맨머먹먼멀멈멋멍메멘멜며면멸명몇모목몬몰몸몹못몽무문묻물뭇뭐뭔뭘미민믿밀밌밍밑바박밖반받발밝밤밥방배밴밸버번벌법벗베벤벨벼벽변별볍보복본볼봅봐봤부분불붓붕붙브블비빈빌빛빠빡빨빵빼뺏뻐뻔뻘뻤뼈뽀뽑뿅뿌뿐쁘쁜삐삔사산살삼삽상새색샌생서석선설섬섭성세센셔션셨소속손솔송쇳쇼수숙술숫숲숴쉐쉬쉽슈슉스슨슬습슷시식신실싫심싶싸싹싼쌀쌉쌍써썼쎄쒯쓰쓴쓸씌씨씩아안않알암앗았앞애앤앨앰야약양얘얜어억언얻얼엄업없었에엔여역연열였영옆예옛오옥온올옮옵옷옹와완왔왕왜외왼요욕용우운울움웃워원월웠웬웰위유육율으은을음의이익인일읽임입잇있잉잊자작잔잖잘잠잡잤장잦재잼쟤쟨저적전절점접정제젠져졌조족존좀좁종좋좌죄죠주죽준줄중줘줬즈즉즘증지직진질짐집짓징짜짝쨌쩌쩐쩔쪽쫄쫙쭉쯤찌찍차착찮참창찾채책챙처천철첫청체첸쳐쳤초총최쵸추축충춰츄츠치칙친침칫카칸칼캐캔캡캤컨컴컵케켈켜켰코콘콜쾌쿠큐크큰클큼키킬킹타탄탈탐탑탕태택탭터턴털테텐텔템토톤톰통투튜트특튼틀티틴틸파판팔패팩퍼펑페펙펜편평폐포폭폰폼표풀품풍퓨프픈플픔피필핑하학한할함합핫항해햇했행향허험헷혀현협혔형호혹혼홀홈홉화확환활황횃회획효후훌훑휠휴흐흑흘흠흡희히힌힐힘
 > Unique lower characters: eghinortw
 > Unique all forced to lower characte

In [2]:
import os

# BaseDatasetConfig: defines name, formatter and path of the dataset.
from TTS.tts.configs.shared_configs import BaseDatasetConfig

output_path = "tts_train_dir"
os.makedirs(output_path, exist_ok=True)  # create the output folder if it does not exist yet

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd tts_train_dir
if not os.path.exists("test"):
    os.makedirs("test")
%cd test

/content/TTS/tts_train_dir
/content/TTS/tts_train_dir/test


In [6]:
!sudo apt-get install p7zip-full
!7z x /content/drive/MyDrive/dataset.zip

Reading package lists... Done
Building dependency tree       
Reading state information... Done
p7zip-full is already the newest version (16.02+dfsg-7build1).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan /content/drive/MyDrive/                                 1 file, 562400110 bytes (537 MiB)

Extracting archive: /content/drive/MyDrive/dataset.zip
--
Path = /content/drive/MyDrive/dataset.zip
Type = zip
Physical Size = 562400110

  0%      1% 66         2% 142 - wavs/1_0140.wav                            4% 200 - wavs/1_0198.wav                          

In [5]:
%cd ..

/content/TTS/tts_train_dir


In [4]:
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="/content/TTS/tts_train_dir/test/metadata.txt", path=os.path.join(output_path, "test")
)

## ✅ Train a new model

Let's kick off a training run 🚀🚀🚀.

Which model architecture to use depends on your needs and available resources. Each architecture has its pros and cons, which determine run-time efficiency and voice quality.
We have many recipes under `TTS/recipes/` that provide a good starting point. For this tutorial, we will be using `GlowTTS`.

We will begin by initializing the model training configuration.

In [6]:
from TTS.tts.configs.shared_configs import CharactersConfig
characters_config = CharactersConfig(
    pad = '<PAD>',
    eos = '।', #'<EOS>', #'।',
    bos = '<BOS>',# None,
    blank = '<BLNK>',
    phonemes = None,
    # Paste the unique characters printed by the cell above here
    characters =  "0123456789CDEFJNPRWXceghinoprtw~가각간갇갈감갑갓갔강갖같개거걱건걷걸검겁것게겜겠겨격결겸겼경계고곡곤곳공과관광괜괭굉교구국군굳굴굿궁권귀그극근글금급기긴길김깃깄깊까깎깐깔깜깝깨깽꺼껍껏껴꼈꼬꼭꽂꽉꽤꾸꿀꿔꿨뀌뀔끄끈끊끌끔끝끼낀낌나낚난날남낫났낮내냅냉냐냥너넌널넓넘넣네넥넨녀년녕노녹놀놈농높놓놔놨누눈눌눴느는늘능늦늪니닌닐님다닥단닫달닭닮담답당닿대댓댔더덕던덜덮데덴델도독돈돌동돼됐되된될됩두둑둘둠둥둬뒀뒤드든들듯등디딨딩따딱딴땀땅때땐떡떨떴떻떼또똑똘똥뜨뜬뜯뜻띄라란랄람랍랐랑랗래랬량러럭런럴럼럽렀렇레렉려력련렬렵렸령로록론롤롭료루룩류률륭르른를름릇릉리린릴림립릿링마막만많말맙맛망맞매맨머먹먼멀멈멋멍메멘멜며면멸명몇모목몬몰몸몹못몽무문묻물뭇뭐뭔뭘미민믿밀밌밍밑바박밖반받발밝밤밥방배밴밸버번벌법벗베벤벨벼벽변별볍보복본볼봅봐봤부분불붓붕붙브블비빈빌빛빠빡빨빵빼뺏뻐뻔뻘뻤뼈뽀뽑뿅뿌뿐쁘쁜삐삔사산살삼삽상새색샌생서석선설섬섭성세센셔션셨소속손솔송쇳쇼수숙술숫숲숴쉐쉬쉽슈슉스슨슬습슷시식신실싫심싶싸싹싼쌀쌉쌍써썼쎄쒯쓰쓴쓸씌씨씩아안않알암앗았앞애앤앨앰야약양얘얜어억언얻얼엄업없었에엔여역연열였영옆예옛오옥온올옮옵옷옹와완왔왕왜외왼요욕용우운울움웃워원월웠웬웰위유육율으은을음의이익인일읽임입잇있잉잊자작잔잖잘잠잡잤장잦재잼쟤쟨저적전절점접정제젠져졌조족존좀좁종좋좌죄죠주죽준줄중줘줬즈즉즘증지직진질짐집짓징짜짝쨌쩌쩐쩔쪽쫄쫙쭉쯤찌찍차착찮참창찾채책챙처천철첫청체첸쳐쳤초총최쵸추축충춰츄츠치칙친침칫카칸칼캐캔캡캤컨컴컵케켈켜켰코콘콜쾌쿠큐크큰클큼키킬킹타탄탈탐탑탕태택탭터턴털테텐텔템토톤톰통투튜트특튼틀티틴틸파판팔패팩퍼펑페펙펜편평폐포폭폰폼표풀품풍퓨프픈플픔피필핑하학한할함합핫항해햇했행향허험헷혀현협혔형호혹혼홀홈홉화확환활황횃회획효후훌훑휠휴흐흑흘흠흡희히힌힐힘",
    punctuations = "!'(),-.:;? ",
)
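
The training log further down prints warnings such as `Character 'd' not found in the vocabulary` even though `D` is in the character set. A likely cause is that `basic_cleaners` lowercases the text before tokenization, so the vocabulary needs the lowercase forms. The sketch below checks for this ahead of time; `missing_characters` is a hypothetical helper, and lowercasing is assumed to be the only transformation the cleaner applies.

```python
def missing_characters(texts, characters, punctuations, lowercase=True):
    """Return the characters that occur in `texts` but not in the vocabulary.

    Set `lowercase=True` to mimic a cleaner that lowercases its input.
    """
    vocab = set(characters) | set(punctuations)
    seen = set()
    for text in texts:
        seen.update(text.lower() if lowercase else text)
    return seen - vocab


# The vocabulary contains "D" but the cleaned text contains "d".
print(missing_characters(["4D 진짜 재밌는데"], "0123456789D진짜재밌는데", "!'(),-.:;? "))
```

If the returned set is not empty, add those characters to `characters` (or use the "forced to lower" list printed earlier).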

In [7]:
# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=100,
    text_cleaner="basic_cleaners",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    phoneme_language=None,
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    save_step=1000,
    characters=characters_config
)
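
To get a feel for what `batch_size`, `epochs` and `save_step` mean in practice, here is a small worked calculation. It uses the 2471 training instances reported in the training log further down and assumes the last, smaller batch is kept; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


n_steps = steps_per_epoch(2471, 32)   # matches the "STEP: x/78" lines in the log
total_steps = n_steps * 100           # epochs=100
epochs_per_save = 1000 / n_steps      # save_step=1000
print(n_steps, total_steps, round(epochs_per_save, 1))  # prints: 78 7800 12.8
```

So with these settings a regular checkpoint is written roughly every 13 epochs.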

Fix for the numba version error

In [19]:
!pip install numpy==1.21.6
!pip uninstall numba
!pip install numba==0.55.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Found existing installation: numba 0.55.1
Uninstalling numba-0.55.1:
  Would remove:
    /usr/local/bin/numba
    /usr/local/bin/pycc
    /usr/local/lib/python3.8/dist-packages/numba-0.55.1.dist-info/*
    /usr/local/lib/python3.8/dist-packages/numba/*
Proceed (Y/n)? y
  Successfully uninstalled numba-0.55.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numba==0.55.2
  Using cached numba-0.55.2-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.4 MB)
Installing collected packages: numba
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tts 0.11.1 requires numba==0.55.1; python_version < "3.10", but you have numba 0.55.2 which is incompatible.
Successfully installed numba-0.55.

Next we will initialize the audio processor which is used for feature extraction and audio I/O.

In [8]:
from TTS.utils.audio import AudioProcessor
ap = AudioProcessor.init_from_config(config)
# Adjust to match your data: 22050 usually works well enough
ap.sample_rate = 16000
ap.resample = True

 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024


Next we will initialize the tokenizer, which converts text to sequences of token IDs. If the config does not define a character set, a default set is used and written back to the config.

In [9]:
from TTS.tts.utils.text.tokenizer import TTSTokenizer
tokenizer, config = TTSTokenizer.init_from_config(config)

Next we will load the data samples. Each sample is a dictionary holding the text, the audio file path, and the speaker name (the character-counting cell earlier reads the text as `item["text"]`). You can also define a custom sample loader that returns the list of samples.

In [10]:
from TTS.tts.datasets import load_tts_samples
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

 | > Found 2495 files in /content/TTS/tts_train_dir/tts_train_dir/test
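
The training log later reports 2471 training and 24 evaluation instances out of these 2495 files, which is consistent with holding out about 1% for evaluation. The sketch below shows the idea of such a split; it is a simplified stand-in and not the library's own implementation, and `split_samples` with its default fraction is an assumption.

```python
import random


def split_samples(samples, eval_split_size=0.01, seed=0):
    """Shuffle the samples and hold out a fraction of them for evaluation.

    Returns (train_samples, eval_samples).
    """
    n_eval = int(len(samples) * eval_split_size)
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[n_eval:], shuffled[:n_eval]


train, evaluation = split_samples(list(range(2495)))
print(len(train), len(evaluation))  # prints: 2471 24
```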


Now we're ready to initialize the model.

Models take a config object and a speaker manager as input. Config defines the details of the model like the number of layers, the size of the embedding, etc. Speaker manager is used by multi-speaker models.

In [11]:
from TTS.tts.models.glow_tts import GlowTTS
model = GlowTTS(config, ap, tokenizer, speaker_manager=None)

Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training, distributed training, etc.

In [12]:
from trainer import Trainer, TrainerArgs
trainer = Trainer(
    TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples
)

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 2
 | > Num. of Torch Threads: 1
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Start Tensorboard: tensorboard --logdir=tts_train_dir/run-March-01-2023_04+36AM-16b98622

 > Model has 28742353 parameters


In [25]:
!pip install python-mecab-ko

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [28]:
%cd ..

/content/TTS


### AND... 3,2,1... START TRAINING 🚀🚀🚀

#### 🚀 Run the Tensorboard. 🚀
You can monitor the progress of your model both in the notebook and in Tensorboard. Tensorboard also provides figures and sample outputs.

In [None]:
!pip install tensorboard
!tensorboard --logdir=tts_train_dir

In [None]:
trainer.fit()


[4m[1m > EPOCH: 0/100[0m
 --> tts_train_dir/run-March-01-2023_04+30AM-0000000

[1m > TRAINING (2023-03-01 04:31:34) [0m




> DataLoader initialization
| > Tokenizer:
	| > add_blank: False
	| > use_eos_bos: False
	| > use_phonemes: False
| > Number of instances : 2471
 | > Preprocessing samples
 | > Max text length: 54
 | > Min text length: 2
 | > Avg text length: 13.5475515985431
 | 
 | > Max audio length: 64029.0
 | > Min audio length: 64029.0
 | > Avg audio length: 64029.0
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.
1이 진짜 최초의 3d라며 아바타가
 [!] Character 'd' not found in the vocabulary. Discarding it.



   --> STEP: 0/78 -- GLOBAL_STEP: 0
     | > current_lr: 0.00000 
     | > step_time: 5.02240  (5.02241)
     | > loader_time: 2.21260  (2.21255)



아바타 4dx 아니라며
 [!] Character 'd' not found in the vocabulary. Discarding it.
아바타 4dx 아니라며
 [!] Character 'x' not found in the vocabulary. Discarding it.
ej라고 하나?
 [!] Character 'j' not found in the vocabulary. Discarding it.
4d 진짜 재밌는데
 [!] Character 'd' not found in the vocabulary. Discarding it.



   --> STEP: 25/78 -- GLOBAL_STEP: 25
     | > loss: 4.53957  (4.89454)
     | > log_mle: 0.56062  (0.56070)
     | > loss_dur: 3.97895  (4.33384)
     | > amp_scaler: 16384.00000  (16384.00000)
     | > grad_norm: 10.72633  (10.42657)
     | > current_lr: 0.00000 
     | > step_time: 0.44550  (0.52304)
     | > loader_time: 0.00370  (0.00707)



4d를 그렇게 보고싶어
 [!] Character 'd' not found in the vocabulary. Discarding it.



   --> STEP: 50/78 -- GLOBAL_STEP: 50
     | > loss: 4.31247  (4.70554)
     | > log_mle: 0.56038  (0.56061)
     | > loss_dur: 3.75209  (4.14493)
     | > amp_scaler: 16384.00000  (16384.00000)
     | > grad_norm: 10.34927  (10.62666)
     | > current_lr: 0.00000 
     | > step_time: 0.47860  (0.55596)
     | > loader_time: 0.00620  (0.00662)



f3 뭐였지?
 [!] Character 'f' not found in the vocabulary. Discarding it.



   --> STEP: 75/78 -- GLOBAL_STEP: 75
     | > loss: 5.24223  (4.72302)
     | > log_mle: 0.56034  (0.56054)
     | > loss_dur: 4.68190  (4.16248)
     | > amp_scaler: 16384.00000  (16384.00000)
     | > grad_norm: 11.62535  (10.74053)
     | > current_lr: 0.00000 
     | > step_time: 0.35080  (0.50234)
     | > loader_time: 0.00250  (0.00597)


 > EVALUATION





> DataLoader initialization
| > Tokenizer:
	| > add_blank: False
	| > use_eos_bos: False
	| > use_phonemes: False
| > Number of instances : 24
 | > Preprocessing samples
 | > Max text length: 22
 | > Min text length: 3
 | > Avg text length: 13.541666666666666
 | 
 | > Max audio length: 64029.0
 | > Min audio length: 64029.0
 | > Avg audio length: 64029.0
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.
 | > Synthesizing test sentences.
it took me quite a long time to develop a voice, and now that i have it i'm not going to be silent.
 [!] Character 'k' not found in the vocabulary. Discarding it.
it took me quite a long time to develop a voice, and now that i have it i'm not going to be silent.
 [!] Character 'm' not found in the vocabulary. Discarding it.
it took me quite a long time to develop a voice, and now that i have it i'm not going to be silent.
 [!] Character 'q' not found in the vocabulary. Discarding it.
it took me quite a long time to develop a voice, an


  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.00115 (+0.00000)
     | > avg_loss: 4.79026 (+0.00000)
     | > avg_log_mle: 0.56067 (+0.00000)
     | > avg_loss_dur: 4.22959 (+0.00000)

 > BEST MODEL : tts_train_dir/run-March-01-2023_04+30AM-0000000/best_model_78.pth

[4m[1m > EPOCH: 1/100[0m
 --> tts_train_dir/run-March-01-2023_04+30AM-0000000

[1m > TRAINING (2023-03-01 04:32:31) [0m




> DataLoader initialization
| > Tokenizer:
	| > add_blank: False
	| > use_eos_bos: False
	| > use_phonemes: False
	| > 12 not found characters:
	| > k
	| > m
	| > q
	| > u
	| > a
	| > l
	| > d
	| > v
	| > b
	| > s
	| > y
	| > f
| > Number of instances : 2471
 | > Preprocessing samples
 | > Max text length: 54
 | > Min text length: 2
 | > Avg text length: 13.5475515985431
 | 
 | > Max audio length: 64029.0
 | > Min audio length: 64029.0
 | > Avg audio length: 64029.0
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.
아바타 4dx 아니라며
 [!] Character 'x' not found in the vocabulary. Discarding it.
ej라고 하나?
 [!] Character 'j' not found in the vocabulary. Discarding it.


 ! Run is kept in tts_train_dir/run-March-01-2023_04+30AM-0000000
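
A quick way to read the log above: in every step, the reported `loss` appears to be the sum of `log_mle` and `loss_dur`. The sketch below checks that observation against the three logged steps; it only confirms the arithmetic of this particular log and says nothing about how the terms are weighted inside the model.

```python
# (loss, log_mle, loss_dur) copied from steps 25, 50 and 75 of the log above.
logged = [
    (4.53957, 0.56062, 3.97895),
    (4.31247, 0.56038, 3.75209),
    (5.24223, 0.56034, 4.68190),
]


def max_residual(rows):
    """Largest absolute difference between loss and log_mle + loss_dur."""
    return max(abs(loss - (log_mle + loss_dur)) for loss, log_mle, loss_dur in rows)


# The residual stays at the level of the log's rounding (five decimals).
print(max_residual(logged) < 1e-4)  # prints: True
```

In these steps `log_mle` barely moves while `loss_dur` dominates, so early changes in `loss` mostly reflect the duration term.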


## ✅ Test the model

We made it! 🙌

Let's kick off the testing run, which displays performance metrics.

We're committing the cardinal sin of ML 😈 (aka - testing on our training data) so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇

You can see from the test output that our tiny model has overfit to the data, and basically memorized this one sentence.

When you start training your own models, make sure your testing data doesn't include your training data 😅

Let's get the latest saved checkpoint. 

In [7]:
import glob, os
output_path = "tts_train_dir"
ckpts = sorted([f for f in glob.glob(output_path+"/*/*.pth")])
configs = sorted([f for f in glob.glob(output_path+"/*/*.json")])
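
Note that `sorted` orders file names alphabetically, so a file ending in `_10000.pth` sorts before one ending in `_2000.pth`. Here is a small sketch that picks the checkpoint with the highest step number instead. It assumes file names end in `_<step>.pth`, as `best_model_78.pth` in the log does; `latest_checkpoint` is a hypothetical helper and the `checkpoint_<step>.pth` names in the example are assumptions.

```python
import re


def latest_checkpoint(paths):
    """Return the path whose file name ends in the largest '_<step>.pth'."""

    def step(path):
        match = re.search(r"_(\d+)\.pth$", path)
        return int(match.group(1)) if match else -1  # files without a step go last

    return max(paths, key=step) if paths else None


example = [
    "run/best_model.pth",
    "run/best_model_78.pth",
    "run/checkpoint_2000.pth",
    "run/checkpoint_10000.pth",
]
print(latest_checkpoint(example))  # prints: run/checkpoint_10000.pth
```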

In [4]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [14]:
!pip install TTS

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting TTS
  Using cached TTS-0.11.1-cp38-cp38-manylinux1_x86_64.whl (604 kB)
Installing collected packages: TTS
Successfully installed TTS-0.11.1


In [24]:
!tts --text "Enter your text here" \
     --model_path "path/to/checkpoint.pth" \
     --config_path 'path/to/config.json' \
     --out_path out.wav

 > Using model: glow_tts
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Text: 에메랄드
 > Text splitted to sentences.
['에메랄드']
 > Processing time: 0.620692253112793
 > Real-time factor: 0.5565331888881379
 > Saving output to out.wav
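
The last lines of the output are easier to interpret with a bit of arithmetic. Assuming the real-time factor is defined as processing time divided by the duration of the generated audio (the usual convention), the length of the clip follows from the two logged numbers. `audio_seconds` is a hypothetical helper.

```python
def audio_seconds(processing_time, real_time_factor):
    """Duration of the synthesized audio implied by the logged values."""
    return processing_time / real_time_factor


duration = audio_seconds(0.620692253112793, 0.5565331888881379)
print(f"{duration:.2f} s of audio")
```

A real-time factor below 1 means the model synthesizes speech faster than it takes to play it back.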


## 📣 Listen to the synthesized wave 📣

In [25]:
import IPython
IPython.display.Audio("out.wav")

## 🎉 Congratulations! 🎉 You have now trained your first TTS model!
Follow up with the next tutorials to learn more advanced material.