Please register by sending us an email to

We will keep you informed of changes to the code.

Zero Speech Challenge 2017

All you need to get started to work on the development data and evaluation for the Track 1 and Track 2 of the Zero Speech Challenge 2017:

  • Development data (train and test sets): raw speech, ABX tasks
  • Evaluation software setup
  • Baseline and topline replication

The setup procedure is described for Linux. It has been tested on several distributions (Ubuntu 16.04, Debian Jessie and CentOS 6). It should also work on macOS.

This setup is for tuning the hyper-parameters of your system across the three development languages. Once done, you'll be ready to test your system on two surprise languages. To get the surprise data and submit official results, see the ZeroSpeech2017_surprise GitHub repository.


  • In order to participate in this challenge and have access to the datasets, you have to send an email to :

General setup

  • First of all install git and conda on your system, if not already done.

  • Create a dedicated Python2 virtual environment to isolate this code from the system wide Python installation.

      conda create --name zerospeech python=2
  • Then clone this repository and go to its root directory:

      git clone --recursive
      cd zerospeech2017
  • Do not forget to activate your virtual environment

      source activate zerospeech

Getting the development dataset (hyperparameter training set)

  • To obtain the password required to download the data, you first have to register by sending an email to :

  • Download the whole challenge hyper-training dataset from the ZeroSpeech Website.

  • The dataset is about 34 GB, so it will take a while to download.

  • Once downloaded, create the ./data directory in the root folder of this package:

      mkdir -p data/train
      mkdir -p data/test

    and unzip the datasets in the following way:

      train/
          {english, mandarin, french}/
      test/
          {english, mandarin, french}/
              {1s, 10s, 120s}/
  • All the wav files are sampled at 16 kHz with 16-bit samples.

  • The *.abx files are ABX task files required for the evaluation.

  • The train corpora have been cut by speakers, and the speaker identity is the name of its wav file (e.g. speaker B07 is in the file B07.wav).

  • The test corpora have been cut into small files of controlled size (1s, 10s or 120s). Speaker identification is not available for test (wav files randomly named).
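The properties listed above can be checked after unzipping. The following is a minimal sketch, not part of the challenge code: the helper names check_wav and speaker_of are hypothetical, and it relies only on Python's standard wave module (contextlib.closing is used so that it runs under both Python 2 and Python 3).

```python
import contextlib
import os
import wave


def check_wav(path, expected_rate=16000, expected_width_bytes=2):
    """Return True if the wav file is sampled at 16 kHz with 16-bit samples."""
    with contextlib.closing(wave.open(path, "rb")) as wav:
        return (wav.getframerate() == expected_rate
                and wav.getsampwidth() == expected_width_bytes)


def speaker_of(train_wav_path):
    """Speaker identity of a train file: its base name without extension."""
    return os.path.splitext(os.path.basename(train_wav_path))[0]
```

For instance, speaker_of("data/train/english/B07.wav") gives "B07". This only makes sense for the train set, since test files are randomly named.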

Track 1: Unsupervised subword modeling


  • With your virtual environment activated, simply run:


    This installs the dependencies of the track 1 evaluation program, baseline and topline replication from the ./track1/src folder to your virtual environment. Those dependencies are:

  • To make sure the installation is correct, you can run the tests:

      pytest ./track1/src

Features file format

  • Our evaluation system requires that your unsupervised subword modeling system outputs a vector of feature values for each frame in the following format:

      <time> <val1>    ... <valN>
      <time> <val1>    ... <valN>


      0.0125 12.3 428.8 -92.3 0.021 43.23
      0.0225 19.0 392.9 -43.1 10.29 40.02
  • The time is in seconds. It corresponds to the center of the frame of each feature. In this example, there are frames every 10ms and the first frame spans a duration of 25ms starting at the beginning of the file, hence, the first frame is centered at .0125 seconds and the second 10ms later. It is not required that the frames be regularly spaced, in fact the only requirement is that the timestamp of frame n+1 is strictly larger than the timestamp of frame n. The frame timestamps are used by the evaluation software to determine which features correspond to a particular triphone among the sequence of features for a whole sentence on the basis of manual phone-level alignments for that sentence.

  • For each wav in the test set (e.g. data/test/mandarin/1s/aghsu09.wav), an ASCII features file with the same base name (e.g. features/test/mandarin/1s/aghsu09.fea) should be generated, following the same subdirectory structure as the test set.
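The format rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the challenge's own tooling: the function names (frame_centers, feature_path, write_features), the "data" and "features" root folders, and the number formatting are all choices made for the example.

```python
import os


def frame_centers(n_frames, window=0.025, shift=0.010):
    """Center times (seconds) of regularly spaced frames: 0.0125, 0.0225, ..."""
    return [window / 2 + i * shift for i in range(n_frames)]


def feature_path(wav_path, data_root="data", feature_root="features"):
    """Map data/test/<lang>/<dur>/<name>.wav to features/test/<lang>/<dur>/<name>.fea."""
    relative = os.path.relpath(wav_path, data_root)
    return os.path.join(feature_root, os.path.splitext(relative)[0] + ".fea")


def write_features(path, times, frames):
    """Write one '<time> <val1> ... <valN>' line per frame."""
    if len(times) != len(frames):
        raise ValueError("need exactly one timestamp per frame")
    # The only constraint on timestamps: strictly increasing.
    if any(t1 <= t0 for t0, t1 in zip(times, times[1:])):
        raise ValueError("timestamps must be strictly increasing")
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    with open(path, "w") as out:
        for time, frame in zip(times, frames):
            values = " ".join("%g" % value for value in frame)
            out.write("%.6f %s\n" % (time, values))
```

Note that "%g" keeps only six significant digits; use a wider format if your features need more precision.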

Evaluation program

  • The Track 1 evaluation programs are located in ./track1/eval/. The details of their arguments are given by the --help option:

      cd ./track1/eval
      ./ --help
  • For example this command will evaluate features extracted on 1s files for the mandarin corpus:

      ./ mandarin 1 /path/to/feature/folder/ /path/to/output/folder/
  • The input feature folder must contain a collection of feature files as described above, one file per wav file in the test corpus.

  • The evaluation result is an aggregated ABX discriminability. An example of an output file, called results.txt, is:

      task	score
      across_talkers:	0.757
      within_talkers:	0.868
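When comparing many runs, it can be handy to read such a file back into a dictionary. The sketch below is only a convenience example: parse_results is a hypothetical helper, and it assumes every score line holds a task name with a trailing colon followed by a number, separated by whitespace, as in the example above.

```python
def parse_results(text):
    """Parse the content of a results.txt file into {task: score}."""
    scores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the 'task score' header.
        if len(fields) != 2 or fields[0] == "task":
            continue
        scores[fields[0].rstrip(":")] = float(fields[1])
    return scores
```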

Using your own distance in evaluation

To see how to provide your own distance, let us first show how to obtain the default DTW+cosine distance using the -d option. The distance function used by default (DTW+cosine) is defined in the python script ./track1/eval/ by the function named distance. Calling the executable from the ./track1/eval folder with the option -d ./distance.distance therefore reproduces the default behavior.

To define your own distance function, you can for example copy the file ./track1/eval/ into a directory dir somewhere on your system, modify the distance function definition to suit your needs, and call the executable with the option -d dir/distance.distance.

You will see that the script begins by importing three other python modules, one for DTW, one for cosine distance and one for Kullback-Leibler divergence. The cosine and Kullback-Leibler modules are located in folder ./track1/src/ABXpy/distances/metrics and implement frame-to-frame distance computations in a fashion similar to the scipy.spatial.distance.cdist function from the scipy python library.

The DTW module is also located in the folder ./track1/src/ABXpy/distances/metrics, but as a static library compiled from the cython source file ./track1/src/ABXpy/distances/metrics/install/dtw.pyx for efficiency reasons. You can use our optimized DTW implementation with any frame-to-frame distance function whose synopsis matches that of scipy.spatial.distance.cdist by modifying your copy of the script accordingly. You can also replace the whole distance computation with any python or cython module of your own design, as long as it has the same input and output format as the distance function in the script.
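To make the idea concrete, here is a self-contained pure-Python sketch of a DTW distance built on a frame-to-frame cosine distance. It is not the ABXpy implementation: the cosine distance here is the plain 1 - cos(u, v), the DTW is an unoptimized textbook version, and the signature distance(x, y), taking two sequences of frames and returning a single number, is an assumption. Check the distance function in the provided script for the exact input and output format before plugging in your own.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two frames (frames must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dtw(x, y, frame_distance):
    """Accumulated cost of the best monotonic alignment between x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(x[i - 1], y[j - 1])
            # Allowed moves: advance in x, advance in y, or advance in both.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def distance(x, y):
    """Assumed signature: two sequences of frames in, one number out."""
    return dtw(x, y, cosine_distance)
```

Swapping cosine_distance for another frame-to-frame function changes the metric while keeping the same alignment logic.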

Baseline replication

The track 1 baseline consists of running the evaluation on MFCC features with delta and delta-delta (39 dimensions) extracted directly from the test corpora, without any learning.

You can replicate the baseline (i.e. extract MFCCs and evaluate them) on the entire test dataset with the command:

./track1/baseline/ ./data ./track1/baseline/results

In this example, ./data is the path to the downloaded challenge dataset and ./track1/baseline/results is the output directory that will be created, containing the baseline features and evaluation results.

Topline replication

The topline consists in training an HMM-GMM phone recogniser with speaker adaptation on the training set and extracting a frame-by-frame posteriorgram from the test set. This is done with a Kaldi script (track1/ It requires abkhazia.

THIS IS NOT YET AVAILABLE. When the topline is available, we will email you about the update (available with git pull).

Track 2: Spoken term discovery

Please register to be kept informed of changes to the baseline results and the evaluation code.

Track 2 evaluation is done on the training data sets only. This may seem strange, but remember this is unsupervised training (the test set contains files that are not cut in a way that is appropriate for Track 2).

NOTICE: The Track 2 evaluation pipeline is not yet in its final state. Right now, the evaluation scripts only evaluate a small percentage of the files, and the baseline and toplines are not finished. You can still work on the data; you will be notified when the updated evaluation is available (through a git pull).


  • The installation of track 2 depends on external packages, mainly to compile the source code and to preprocess speech waveforms. You will need the following packages installed on your system:

    • GNU autoconf to build the compilation files of sprachcore-feacalc.
    • GNU gcc C/C++ compiler to build ZRTools and feacalc.
    • sox to preprocess the speech waveforms (resampling, volume control, etc.).
    • GNU Make to build ZRTools and feacalc.

    The installation of these dependencies differs from system to system; please follow the installation instructions on their respective web pages.

  • With your virtual environment activated and all the dependencies installed on your computer, simply run the following command:

      $ ./track2/setup/

    Running that script installs all the packages needed for track 2, including the evaluation program and the baseline and topline replication. Python libraries are installed from the ./track2/src folder into your virtual environment, and binary programs into ./track2/bin. Add ./track2/bin to your $PATH before moving further. The installed packages are:

    • ZRTools, the same program that was used for the baseline of the 2015 challenge.
    • Term Discovery Evaluation (tde), whose zerospeech2017 branch is used to compute the scores of the 2017 challenge.
    • sprachcore-feacalc, used to compute speech features for the baseline.

Evaluation input format

To compute the scores for Track 2, the evaluation toolkit needs your results in an ASCII file listing the set of fragments that were found, in the following format:

 Class <classnb>
 <filename> <fragment_onset> <fragment_offset>
 <filename> <fragment_onset> <fragment_offset>
 Class <classnb>
 <filename> <fragment_onset> <fragment_offset>


 Class 1
 dsgea01 1.238 1.763
 dsgea19 3.380 3.821
 reuiz28 18.036 18.537

 Class 2
 zeoqx71   8.389  9.132

Note: the onset and offset are in seconds. If your system only does matching and not clustering, your classes will have only two elements each. If your system does not only matching but also clustering and parsing, the fragments found will cover the entirety of the files, and there may be classes with only one element (the remainder of lexical-based segmentation).
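A file in this format could be produced as sketched below. This is an illustration only: format_classes is a hypothetical helper, classes are numbered from 1 in the order given, times are written with millisecond precision, and a blank line is written after each class as in the example above (the format description itself does not say whether that blank line is required).

```python
def format_classes(classes):
    """Render discovered classes as text.

    classes: list of classes, each a list of (filename, onset, offset)
    tuples with onset and offset in seconds.
    """
    lines = []
    for number, fragments in enumerate(classes, 1):
        lines.append("Class %d" % number)
        for filename, onset, offset in fragments:
            if not onset < offset:
                raise ValueError("onset must precede offset in %s" % filename)
            lines.append("%s %.3f %.3f" % (filename, onset, offset))
        lines.append("")  # blank line after each class
    return "\n".join(lines)
```

For a matching-only system, each inner list would hold exactly two fragments.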

Evaluation program

  • The Track 2 evaluation programs are located in ./track2/eval/. The details of their arguments are given by the --help option:

      $ cd ./track2/bin
      $ python ./ --help
  • For example, the following command will evaluate the output of your program, formatted as described in the previous section, for the Mandarin corpus:

      $ python ./ mandarin.classes result_dir/
  • To run the evaluation on multiple cores, use the j-flag. Evaluation runtime and memory usage also depend strongly on the particulars of the input file. Using more than 10 cores is not useful, since each parallel job processes one of the 10 subsampling folds.

  • The output directory will contain one file each for the above described measures, with scores for both cross-speaker and within-speaker performance. The directory will also contain a file called ``VERSION_$'' indicating the version of the evaluation code that was used. Please make sure to report that number in your report. The version number can also be obtained by:

    $ python ./track2/bin/ -V

Baseline replication

You will be able to replicate the baseline on each corpus with the corresponding command:

bash /track2/baseline/baseline_french/
bash /track2/baseline/baseline_english/
bash /track2/baseline/baseline_mandarin/

All scores are stored in individual files (ned/coverage, boundary, group matching, and token/type) in the out directory under each corpus directory.

Packages installation on Docker

If you want to install and run the tracks 1 and 2 in a virtualized environment, you can install and use docker. After cloning this repository, go to its root directory and build the docker image by doing:

$ docker build -t zerospeech2017 .

To run the docker image in interactive mode, use the command:

$ docker run -it zerospeech2017 bash

or, if you have downloaded the corpus you can run an interactive session with the data attached by doing:

$ docker run -it -v /my/host/data/folder:/zerospeech2017/data zerospeech2017 bash


Troubles with track 1 or track 2

Please open an issue here.

Troubles with software dependencies

If you have problems with ABXpy, h5features or another of our tools, please refer to their related github page on the Bootphon repository.

The parallelisation of our program relies on a module from Python's standard library, which can be a bit unstable. If you experience problems when running the evaluation, try requesting only one CPU to avoid using this module altogether.


Copyright 2016, 2017 LSCP Bootphon team

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see

