
FCN_GCI

Detection of GCIs from raw speech signals using a fully-convolutional network (FCN)

Code for running Glottal Closure Instant (GCI) detection using the fully-convolutional neural network models described in the following publication:

GCI detection from raw speech using a fully-convolutional network
Luc Ardaillon, Axel Roebel.
Submitted to arXiv on 22 Oct 2019.

We kindly request academic publications making use of our FCN models to cite the aforementioned paper.

Description

The code provided in this repository performs GCI detection using a fully-convolutional neural network. Note that it also allows predicting the glottal flow shape (normalized in amplitude), from which more information than the GCIs may be extracted.

The provided code allows running GCI detection on given speech sound files using the provided pretrained models, but no code is currently provided to train the model on new data.
All pre-trained models evaluated in the above-mentioned paper are provided.
The models "FCN_synth_GF" and "FCN_synth_tri" have been trained on a large database of high-quality synthetic speech (obtained by resynthesizing the BREF [1] and TIMIT [2] databases using the PaN vocoder [3, and 4 Section 3.5.2]). The difference between these two models is that "FCN_synth_tri" predicts a triangular curve from which the GCIs are extracted by simple peak-picking on the maxima, while "FCN_synth_GF" predicts the glottal flow shape and performs the peak-picking on its negative derivative. The "FCN_CMU__10_90" and "FCN_CMU__60_20_20" models have been trained on the CMU database (with different train/validation/test splits) using a triangle shape as target.
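The peak-picking step described above can be sketched as follows. This is an illustrative reimplementation, not the repository's code: the function name `gcis_from_curve`, the `max_f0` distance bound, and the use of `scipy.signal.find_peaks` are all assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def gcis_from_curve(curve, fs=16000, model="tri", max_f0=500.0):
    """Extract GCI times (in seconds) from a predicted shape curve.

    For the triangle target ("tri"), GCIs are taken as the maxima of
    the curve itself; for the glottal-flow target ("GF"), as the maxima
    of the curve's negative derivative. `max_f0` sets the minimum
    allowed spacing between two consecutive GCIs (illustrative value).
    """
    curve = np.asarray(curve, dtype=float)
    if model == "GF":
        # Negative time-derivative of the predicted glottal flow.
        curve = -np.diff(curve)
    # Simple peak-picking with a minimum inter-peak distance of one
    # period at max_f0.
    peaks, _ = find_peaks(curve, distance=int(fs / max_f0))
    return peaks / fs
```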

The models, algorithm, training, and evaluation procedures, as well as the constitution of the databases, have been described in our publication "GCI detection from raw speech using a fully-convolutional network" (https://arxiv.org/abs/1910.10235).

Below are the results of our evaluations comparing our models to the SEDREAMS [5] and DPI [6] algorithms in terms of IDR, MR, FAR, and IDA. The evaluation has been conducted on both a test database of synthetic speech and two datasets of real speech samples from the CMU ARCTIC [7] and PTDB-TUG [8] databases. All models and algorithms have been evaluated on 16kHz audio.

IDR, MR, and FAR are given in %, and IDA in ms.

| Model | IDR (synth) | IDR (CMU) | IDR (PTDB) | MR (synth) | MR (CMU) | MR (PTDB) | FAR (synth) | FAR (CMU) | FAR (PTDB) | IDA (synth) | IDA (CMU) | IDA (PTDB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FCN-synth-tri | 99.90 | 97.95 | 95.37 | 0.08 | 1.89 | 3.40 | 0.02 | 0.17 | 1.22 | 0.08 | 0.26 | 0.32 |
| FCN-synth-GF | 99.91 | 98.43 | 95.64 | 0.06 | 1.20 | 2.91 | 0.04 | 0.37 | 1.45 | 0.11 | 0.34 | 0.38 |
| FCN-CMU-10/90 | 49.63 | 99.39 | 90.13 | 48.05 | 0.50 | 8.91 | 0.51 | 0.11 | 0.95 | 0.52 | 0.10 | 0.26 |
| FCN-CMU-60/20/20 | 60.06 | 99.52 | 88.17 | 39.14 | 0.40 | 11.00 | 0.64 | 0.08 | 0.81 | 0.50 | 0.09 | 0.26 |
| SEDREAMS | 89.26 | 99.04 | 95.34 | 3.86 | 0.21 | 2.15 | 6.88 | 0.75 | 2.51 | 0.68 | 0.36 | 0.62 |
| DPI | 88.22 | 98.69 | 91.3 | 2.14 | 0.23 | 2.16 | 9.64 | 1.08 | 6.53 | 0.83 | 0.23 | 0.49 |
| DCNN (from [5]) | — | 99.3 | — | — | 0.3 | — | — | 0.4 | — | — | 0.2 | — |
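For reference, these four metrics follow the standard GCI evaluation protocol: each larynx cycle around a reference GCI counts as correctly identified (IDR) if exactly one detection falls inside it, as a miss (MR) if none does, and as a false alarm (FAR) if several do; IDA is the standard deviation of the timing error over correctly identified cycles. A minimal sketch in Python, assuming midpoints between consecutive reference GCIs as cycle boundaries (the function name and this boundary convention are our assumptions, not the repository's code):

```python
import numpy as np

def gci_metrics(ref, det):
    """Compute IDR, MR, FAR (in %) and IDA (in ms) from sorted arrays
    of reference and detected GCI times, both in seconds."""
    ref = np.asarray(ref, dtype=float)
    det = np.asarray(det, dtype=float)
    # Larynx cycles delimited by midpoints between consecutive
    # reference GCIs (a common convention; the paper may differ).
    mids = (ref[:-1] + ref[1:]) / 2.0
    hits = misses = false_alarms = 0
    errors = []
    for i, t in enumerate(ref):
        lo = mids[i - 1] if i > 0 else -np.inf
        hi = mids[i] if i < len(mids) else np.inf
        in_cycle = det[(det > lo) & (det <= hi)]
        if in_cycle.size == 0:
            misses += 1           # no detection in this cycle
        elif in_cycle.size == 1:
            hits += 1             # correctly identified cycle
            errors.append(in_cycle[0] - t)
        else:
            false_alarms += 1     # multiple detections in one cycle
    n = len(ref)
    idr = 100.0 * hits / n
    mr = 100.0 * misses / n
    far = 100.0 * false_alarms / n
    ida = 1000.0 * float(np.std(errors)) if errors else float("nan")
    return idr, mr, far, ida
```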

Example command-line usage (using provided pretrained models)

Default analysis:

This will run glottal flow prediction and GCI detection with the FCN_synth_GF model on the input file, and store the output files (the predicted glottal flow as a 16kHz wav file, and the GCI markers as an sdif file) in the same folder as the input file:

```
python /path_to/FCN-f0/FCN_GCI.py -i /path_to/test_file.wav
```

Note that you may also specify a directory of audio files as input instead of a single file.

Run the analysis on a whole folder of audio files and specify an output directory:

```
python /path_to/FCN-f0/FCN_GCI.py -i /path_to/audio_files -o /path_to/output_directory
```

If the output directory doesn't exist, it will be created.

Run the analysis using a specific model (the default is FCN_synth_GF), e.g. the FCN_synth_tri model:

```
python /path_to/FCN-f0/FCN_GCI.py -i /path_to/audio_files -m FCN_synth_tri -o /path_to/output.FCN-synth-tri.GCI.sdif
```

The possible tags for pre-trained models are "FCN_synth_GF", "FCN_synth_tri", "FCN_CMU__10_90", and "FCN_CMU__60_20_20".

Example figures

Example of prediction of triangle shape from a real speech extract:

Example of prediction of glottal flow shape from a real speech extract:

Dependencies

- keras
- tensorflow
- scipy
- numpy
- pysndfile (optional)
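Assuming a standard pip setup (package names as on PyPI; no versions are pinned here, so you may need versions matching the pretrained models), the dependencies can be installed with:

```shell
pip install keras tensorflow scipy numpy
pip install pysndfile  # optional
```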

References

[1] J. L. Gauvain, L. F. Lamel, and M. Eskenazi, "Design Considerations and Text Selection for BREF, a large French Read-Speech Corpus", 1st International Conference on Spoken Language Processing, ICSLP, http://www.limsi.fr/~lamel/kobe90.pdf

[2] V. Zue, S. Seneff, and J. Glass, "Speech Database Development At MIT : TIMIT And Beyond"

[3] Stefan Huber and Axel Roebel, "On glottal source shape parameter transformation using a novel deterministic and stochastic speech analysis and synthesis system", in Interspeech 2015

[4] L. Ardaillon, "Synthesis and expressive transformation of singing voice", Ph.D. dissertation, EDITE; UPMC-Paris 6 Sorbonne Universités, 2017 (Section 3.5.2 : "PaN engine")

[5] Thomas Drugman and Thierry Dutoit, "Glottal Closure and Opening Instant Detection from Speech Signals", in Interspeech 2009

[6] A. P. Prathosh, T. V. Ananthapadmanabha, and A. G. Ramakrishnan, "Epoch Extraction Based on Integrated Linear Prediction Residual Using Plosion Index"

[7] John Kominek and Alan W Black, "THE CMU ARCTIC SPEECH DATABASES", in 5th ISCA Speech Synthesis Workshop, 2004

[8] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf, "A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario"
