Athena accepts a textual manifest file as data set interface, which describes speech data set in csv format. In such file, each line contains necessary meta data (e.g. key, audio path, transcription) of a speech audio. For custom data, such manifest file needs to be prepared first. An example is shown as follows:
wav_filename wav_length_ms transcript
/dataset/train-clean-100-wav/374-180298-0000.wav 465004 chapter sixteen i might have told you of the beginning of this liaison in a few lines but i wanted you to see every step by which we came i to agree to whatever marguerite wished
/dataset/train-clean-100-wav/374-180298-0001.wav 514764 marguerite to be unable to live apart from me it was the day after the evening when she came to see me that i sent her manon lescaut from that time seeing that i could not change my mistress's life i changed my own
/dataset/train-clean-100-wav/374-180298-0002.wav 425484 i wished above all not to leave myself time to think over the position i had accepted for in spite of myself it was a great distress to me thus my life generally so calm
/dataset/train-clean-100-wav/374-180298-0003.wav 356044 assumed all at once an appearance of noise and disorder never believe however disinterested the love of a kept woman may be that it will cost one nothing
All of our training/ inference configurations are written in config.json. Below is an example configuration file with comments to help you understand.
{
"batch_size":32,
"num_epochs":20,
"sorta_epoch":1,
"ckpt":"examples/asr/hkust/ckpts/transformer",
"solver_gpu":[0],
"solver_config":{
"clip_norm":100,
"log_interval":10,
"enable_tf_function":true
},
"model":"speech_transformer",
"num_classes": null,
"pretrained_model": null,
"model_config":{
"return_encoder_output":false,
"num_filters":512,
"d_model":512,
"num_heads":8,
"num_encoder_layers":12,
"num_decoder_layers":6,
"dff":1280,
"rate":0.1,
"label_smoothing_rate":0.0
},
"optimizer":"warmup_adam",
"optimizer_config":{
"d_model":512,
"warmup_steps":8000,
"k":0.5
},
"dataset_builder": "speech_recognition_dataset",
"num_data_threads": 1,
"trainset_config":{
"data_csv": "examples/asr/hkust/data/train.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/hkust/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"},
"input_length_range":[10, 8000]
},
"devset_config":{
"data_csv": "examples/asr/hkust/data/dev.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/hkust/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"},
"input_length_range":[10, 8000]
},
"testset_config":{
"data_csv": "examples/asr/hkust/data/dev.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/hkust/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"}
}
}
With all the above preparation done, training becomes straight-forward. athena/main.py
is the entry point of the training module. Just run:
$ python athena/main.py <your_config_in_json_file>
Please install Horovod and MPI at first, if you want to train model using multi-gpu. See the Horovod page for more instructions.
To run on a machine with 4 GPUs with Athena:
$ horovodrun -np 4 -H localhost:4 python athena/horovod_main.py <your_config_in_json_file>
To run on 4 machines with 4 GPUs each with Athena:
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python athena/horovod_main.py <your_config_in_json_file>
After training, you can deploy the model on servers using the TensorFlow C++ API. Below are some steps to achieve this functionality with an ASR model.
- Install all dependencies, including TensorFlow, Protobuf, absl, Eigen3 and kenlm (optional).
- Freeze the model to pb format with
athena/deploy_main.py
. - Compile the C++ codes.
- Load the model and do argmax decoding in C++ codes, see
deploy/src/argmax.cpp
for the entry point.
After compiling, an executable file will be generated and you can run the executable file:
$ ./argmax
Detailed implementation is described here.