Module to perform speech to text using Tensorflow

DEEP Open Catalogue: Speech to Text


Author: Lara Lloret Iglesias (CSIC)

Project: This work is part of the DEEP Hybrid-DataCloud project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 777435.

This is a plug-and-play tool to train and evaluate a speech-to-text model using deep neural networks. The network architecture is based on one of the tutorials provided by Tensorflow.

The architecture used in this tutorial is based on some described in the paper Convolutional Neural Networks for Small-footprint Keyword Spotting. It was chosen because it's comparatively simple, quick to train, and easy to understand, rather than being state of the art. There are lots of different approaches to building neural network models to work with audio, including recurrent networks or dilated (atrous) convolutions.

This tutorial is based on the kind of convolutional network that will feel very familiar to anyone who's worked with image recognition. That may seem surprising at first, since audio is inherently a one-dimensional continuous signal across time, not a 2D spatial problem. We define a window of time we believe our spoken words should fit into, and convert the audio signal in that window into an image. This is done by grouping the incoming audio samples into short segments, just a few milliseconds long, and calculating the strength of the frequencies across a set of bands. Each set of frequency strengths from a segment is treated as a vector of numbers, and those vectors are arranged in time order to form a two-dimensional array. This array of values can then be treated like a single-channel image, and is known as a spectrogram.

[Figure: example of one of these spectrograms]
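The audio-to-image step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used by this project: the function name `spectrogram`, the 30 ms window, the 10 ms stride, and the 16 kHz sample rate are all assumptions chosen for the example.

```python
import numpy as np

def spectrogram(samples, sample_rate=16000, window_ms=30.0, stride_ms=10.0):
    """Return a 2D array (time frames x frequency bins) of magnitudes.

    All default values are illustrative assumptions, not project settings.
    """
    window = int(sample_rate * window_ms / 1000)   # samples per short segment
    stride = int(sample_rate * stride_ms / 1000)   # hop between segments
    n_frames = 1 + (len(samples) - window) // stride
    hann = np.hanning(window)                      # taper to reduce spectral leakage
    frames = np.stack([
        samples[i * stride: i * stride + window] * hann
        for i in range(n_frames)
    ])
    # Strength of each frequency band for every segment, in time order.
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone as a stand-in for a spoken word.
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 440 * t)
print(spectrogram(tone).shape)  # (98, 241)
```

Each row is one short segment and each column one frequency band, so the result can be fed to a convolutional network as a single-channel image.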


To start using this framework run:

git clone
cd speech-to-text-tf
pip install -e .


  • This project has been tested on Ubuntu 18.04 with Python 3.6.5. Further package requirements are described in the requirements.txt file.
  • Tensorflow>=1.12.0 must be installed (either in GPU or CPU mode). It is not listed in requirements.txt because doing so breaks GPU support.
  • Run python -c 'import cv2' to check that the opencv-python package is correctly installed (dependencies are sometimes missed in pip installations).
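The import check above can be extended to several packages at once. The helper below is a hypothetical convenience, not part of this project; it only reports whether a module can be found and does not verify versions.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]

# Module names taken from the requirements listed above.
print("Missing:", missing_modules(["tensorflow", "cv2"]))
```

An empty list means both packages are visible to the current Python interpreter.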

Project Organization

├──              <- The top-level README for developers using this project.
├── data
│   ├── audios                <- The original, immutable data dump.
│   │
│   └── data_splits            <- Scripts to download or generate data
├── docs                   <- A default Sphinx project; see for details
├── docker                 <- Directory for Dockerfile(s)
├── models                 <- Trained and serialized models, model predictions, or model summaries
├── notebooks              <- Jupyter notebooks. 
├── references             <- Data dictionaries, manuals, and all other explanatory materials.
├── reports                <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures            <- Generated graphics and figures to be used in reporting
├── requirements.txt       <- The requirements file for reproducing the analysis environment, e.g.
│                             generated with `pip freeze > requirements.txt`
├── test-requirements.txt  <- The requirements file for the test environment
├──               <- makes project pip installable (pip install -e .) so speechclas can be imported
├── speechclas    <- Source code for use in this project.
│   ├──        <- Makes speechclas a Python module
│   │
│   ├── dataset            <- Scripts to download or generate data
│   │   └──
│   │
│   ├── features           <- Scripts to turn raw data into features for modeling
│   │   └──
│   │
│   ├── models             <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   └──
│   │
│   ├── tests              <- Scripts to perform code testing + pylint script
│   │
│   └── visualization      <- Scripts to create exploratory and results oriented visualizations
│       └──
└── tox.ini                <- tox file with settings for running tox; see

Project based on the cookiecutter data science project template. #cookiecutterdatascience


1. Data preprocessing

The first step to train your speech to text neural network is to put your .wav files into folders. The name of each folder should correspond to the label for those particular audios.
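As a sketch of this folder-per-label convention, the snippet below walks a directory and pairs each .wav file with the name of its parent folder. The function name `list_labelled_wavs` is hypothetical and this is not the project's own loader.

```python
from pathlib import Path

def list_labelled_wavs(root):
    """Return sorted (wav_path, label) pairs; the label is the folder name."""
    root = Path(root)
    pairs = []
    for label_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for wav in sorted(label_dir.glob("*.wav")):
            pairs.append((wav, label_dir.name))
    return pairs
```

For example, a file stored as yes/recording1.wav would be returned with the label "yes".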

1.1 Prepare the audios

Put your audios in the ./data/dataset_files folder. If you are using the DEEP API, you can also provide a URL pointing to a tar.gz archive that contains all the folders with the training files. The archive is then downloaded automatically, the labels are read, and everything is set up to start the training.

Please use wav files.
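The archive handling can be illustrated with the standard library. This is only a sketch under stated assumptions: it starts from an archive that is already on disk (the download step is left out), it assumes the label folders sit at the top level of the archive, and `extract_labels` is a hypothetical name rather than a function of this package.

```python
import tarfile
from pathlib import Path

def extract_labels(archive_path, dest_dir):
    """Unpack a .tar.gz of label folders and return the sorted label names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Only unpack archives from a trusted source: extraction writes
    # whatever paths the archive contains.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    # Assumption: every top-level folder in the archive is one label.
    return sorted(p.name for p in dest.iterdir() if p.is_dir())
```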

2. Train the classifier

Before training the classifier you can customize the default parameters of the configuration file. To get an idea of which parameters you can change, explore them using the dataset exploration notebook. This step is optional: training can be launched with the default configuration parameters and still give reasonably good results.

Once you have customized the configuration parameters in the ./etc/config.yaml file you can launch the training by running ./imgclas/ You can monitor the training status using Tensorboard.

After training you can check the training statistics and the logs, which contain the standard output produced during training together with the confusion matrix computed once training has finished.

Since this type of model is usually used in mobile phone applications, the training exports the model in .pb format, which makes it easy to perform inference from a mobile phone app.

3. Test the classifier

You can test the classifier on a number of tasks: predict a single local wav file (or url) or predict multiple wavs (or urls).

You can also make and store the predictions for the test.txt file (if you provided one). Once you have done that, you can visualize statistics of the predictions, such as popular metrics (accuracy, recall, precision, f1-score) and the confusion matrix, by running the predictions statistics notebook.
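For reference, the metrics named above can all be derived from the confusion matrix. The NumPy sketch below is an illustration of the standard definitions, not the code of the predictions statistics notebook; it assumes labels are encoded as integers from 0 to n_classes - 1, and the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

def metrics(matrix):
    """Return overall accuracy and per-class precision, recall and f1-score."""
    tp = np.diag(matrix).astype(float)   # correct predictions per class
    predicted = matrix.sum(axis=0)       # how often each class was predicted
    actual = matrix.sum(axis=1)          # how often each class truly occurs
    zeros = np.zeros_like(tp)
    precision = np.divide(tp, predicted, out=zeros.copy(), where=predicted > 0)
    recall = np.divide(tp, actual, out=zeros.copy(), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros.copy(), where=denom > 0)
    accuracy = tp.sum() / matrix.sum()
    return accuracy, precision, recall, f1
```

Classes that are never predicted (or never occur) get a precision (or recall) of 0 here instead of raising a division error; that choice is a convention of this sketch.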

Finally, you can launch a simple web page that uses the trained classifier to predict audios (both local files and URLs) in your favorite browser.

Launching the full API

Preliminaries for prediction

If you want to use the API for prediction, you have to do some preliminary steps to select the model you want to predict with:

  • copy your desired .models/[timestamp] to .models/api. If there is no .models/api folder, the default is to use the last available timestamp.
  • in the .models/api/ckpts leave only the desired checkpoint to use for prediction. If there is more than one checkpoint, the default is to use the last available checkpoint.
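The fallback rules in the two bullets above can be pictured as follows. This is a hypothetical sketch, not the package's actual selection code: it assumes that "last" means the alphabetically greatest name (which holds for sortable timestamp folder names) and treats each entry in ckpts as one checkpoint.

```python
from pathlib import Path

def select_model_dir(models_dir):
    """Prefer the api folder; otherwise fall back to the last timestamp folder."""
    models_dir = Path(models_dir)
    api_dir = models_dir / "api"
    if api_dir.is_dir():
        return api_dir
    timestamps = sorted(p for p in models_dir.iterdir() if p.is_dir())
    if not timestamps:
        raise FileNotFoundError(f"no model folders found in {models_dir}")
    return timestamps[-1]

def select_checkpoint(model_dir):
    """Return the last entry (by name) of the ckpts folder."""
    ckpts = sorted((Path(model_dir) / "ckpts").iterdir())
    if not ckpts:
        raise FileNotFoundError(f"no checkpoints found in {model_dir}")
    return ckpts[-1]
```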

Running the API

To access this package's complete functionality (both for training and predicting) through an API you have to install the DEEPaaS package:

git clone
cd deepaas
pip install -e .

and run deepaas-run --listen-ip From there you will be able to run training and predictions of this package using model_name=speechclas.

