
DeepForcedAligner

Paths have been modified to train on the ICASSP LIMMITS23 dataset. The aligner is trained on each speaker separately by changing "metadata_path" to one of the 6 language_gender combinations, e.g. Hindi_M, Telugu_F, etc. You may also train on all 6 speakers together in a multilingual setup. For any questions, contact challenge.syspin@iisc.ac.in
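For example, a single-speaker run might point the config at that speaker's data; the paths below are illustrative assumptions, not the challenge's actual layout:

  dataset_dir: LIMMITS23/Hindi_M
  metadata_path: LIMMITS23/Hindi_M/metadata.csv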


With this tool you can create accurate text-audio alignments given a set of audio files and their transcriptions. The alignments can, for example, be used to train text-to-speech models such as FastSpeech. In comparison to other forced alignment tools, this repo has the following advantages:

  • Multilingual: By design, the DFA is language-agnostic and can align either characters or phonemes.
  • Robustness: The alignment extraction is highly tolerant of text errors and silent characters.
  • Convenience: Easy installation with no extra dependencies. You can provide your own data in the standard LJSpeech format without special preprocessing (such as applying phonetic dictionaries, non-speech annotations, etc.).

The approach is based on training a simple speech recognition model with CTC loss on mel spectrograms extracted from the wav files.
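To make the idea concrete, here is a minimal sketch of such a CTC-trained recognizer in PyTorch. The layer sizes, symbol count, and model shape are illustrative assumptions, not the repo's actual architecture:

  import torch
  import torch.nn as nn

  # Toy CTC recognizer over mel spectrograms (illustrative, not the DFA model).
  class ToyAligner(nn.Module):
      def __init__(self, n_mels=80, n_symbols=40, hidden=256):
          super().__init__()
          self.rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
          self.proj = nn.Linear(2 * hidden, n_symbols + 1)  # +1 for the CTC blank

      def forward(self, mels):  # mels: (batch, time, n_mels)
          x, _ = self.rnn(mels)
          return self.proj(x)   # (batch, time, n_symbols + 1)

  model = ToyAligner()
  ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

  mels = torch.randn(2, 500, 80)            # two dummy utterances
  targets = torch.randint(1, 41, (2, 60))   # dummy symbol ids, 0 is the blank
  log_probs = model(mels).log_softmax(-1).transpose(0, 1)  # (time, batch, classes)
  loss = ctc_loss(log_probs, targets,
                  torch.tensor([500, 500]),  # mel frame lengths
                  torch.tensor([60, 60]))    # target text lengths
  loss.backward()

Once such a model is trained, its per-frame predictions can be turned into per-character durations, which is what the extraction step below produces.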

Installation

Runs on Python >= 3.6

pip install -r requirements.txt

Example Training and Extraction

Check out the demo notebook (Open In Colab badge) for training and character duration extraction on the LJSpeech dataset.

(1) Download the LJSpeech dataset, set paths in config.yaml:

  dataset_dir: LJSpeech
  metadata_path: LJSpeech/metadata.csv

(2) Preprocess the data and train aligner:

  python preprocess.py
  python train.py

(3) Extract durations with the latest model checkpoint (60k steps should be sufficient):

  python extract_durations.py

By default, durations are written as numpy files into:

  output/durations

Each character duration is given in mel time steps; one mel time step translates to hop_length / sample_rate seconds in the wav file.
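For example, a duration array can be converted to seconds like this; the file id, the .npy extension, and the hop_length and sample_rate values below are assumptions, so take the real values from your config.yaml:

  import numpy as np

  hop_length = 256       # assumed value, see config.yaml
  sample_rate = 22050    # assumed value, see config.yaml

  durations = np.load('output/durations/00001.npy')  # mel frames per character
  seconds = durations * hop_length / sample_rate
  print(f'total aligned audio: {seconds.sum():.2f} s')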

Tensorboard

You can monitor the training with

  tensorboard --logdir dfa_checkpoints

Using Your Own Dataset

Just bring your dataset into the LJSpeech format. We recommend cleaning and preprocessing the text in the metadata file before running the DFA, e.g. lower-casing, phonemization, etc.
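A minimal cleanup pass could look like the following sketch, assuming pipe-separated lines of the form id|text; adapt it to your own metadata layout:

  # Lower-case the transcriptions in an LJSpeech-style metadata file.
  with open('metadata.csv', encoding='utf-8') as f_in, \
       open('metadata_clean.csv', 'w', encoding='utf-8') as f_out:
      for line in f_in:
          file_id, text = line.rstrip('\n').split('|', 1)
          f_out.write(f'{file_id}|{text.lower()}\n')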

Using Preprocessed Mel Spectrograms

You can provide your own mel spectrograms by setting the following in config.yaml:

  precomputed_mels: /path/to/mels

Make sure that the mel names match the ids in the metafile, e.g.

  00001.mel ---> 00001|First sample text
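If you want to precompute the mels yourself, a sketch with librosa follows. The STFT parameters and the log compression are assumptions; make sure they match the audio settings the aligner expects (see config.yaml), and check the repo's dataloader for the exact file format:

  import numpy as np
  import librosa

  # Compute a mel spectrogram whose file name matches the metadata id.
  wav, sr = librosa.load('wavs/00001.wav', sr=22050)
  mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                       hop_length=256, n_mels=80)
  # Saving through a file object keeps the '.mel' name as-is (np.save would
  # otherwise append '.npy' to a plain string path).
  with open('/path/to/mels/00001.mel', 'wb') as f:
      np.save(f, np.log(mel + 1e-5).astype(np.float32))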
