The Audio Book Corpus (ABC) project was developed to aid linguistics researchers working on text-to-speech, for purely academic purposes. In its current form, the corpus consists of approximately 200 minutes of German speech data. Besides German, we are also developing corpora for Portuguese and Italian. Future versions of the corpus are intended to cover most European languages, such as French, Spanish, Czech, Dutch, Polish, and Romanian.
CORPUS DETAILS
The ABC project consists of three modules. The speech data is in WAV file format and is taken from Librivox, https://librivox.org/. Librivox provides free public-domain audiobooks, which are available to academics for linguistic research.
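As a minimal sketch, assuming the recordings are stored as uncompressed PCM WAV files, the amount of speech data in a set of files can be checked with Python's standard `wave` module. The function names below are hypothetical and are not part of the ABC project.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frame_count = wav_file.getnframes()
        sample_rate = wav_file.getframerate()
    return frame_count / float(sample_rate)


def total_minutes(paths):
    """Sum the durations of several WAV files and return the total in minutes."""
    return sum(wav_duration_seconds(p) for p in paths) / 60.0
```

Calling `total_minutes` on the list of corpus files gives a quick check against the stated corpus size.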
TECHNIQUE FOR ANNOTATION
After removing noise from the audio data, we used semi-automatic annotation based on deep learning and a fuzzy matching technique. 20% of the corpus was annotated manually; using deep learning techniques, we then trained a model to validate the remaining 80% of the speech data. To support this, we built a small GUI (in Python) to visualize the audio files alongside the annotated text and to check that the text matches the speech signal.
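The fuzzy matching idea can be illustrated with a minimal sketch. It assumes that an automatically produced transcript of a segment is compared against candidate sentences from the book text. The project's actual matching algorithm is not described here, so `difflib.SequenceMatcher` from the Python standard library is swapped in, and the function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not affect the score."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(hypothesis, candidates, threshold=0.8):
    """Return (candidate, score) for the closest candidate.

    The candidate is None when no candidate reaches the (assumed) threshold,
    which would flag the segment for manual review.
    """
    if not candidates:
        return None, 0.0
    best = max(candidates, key=lambda c: similarity(hypothesis, c))
    score = similarity(hypothesis, best)
    return (best, score) if score >= threshold else (None, score)
```

A segment whose best score falls below the threshold could be routed to a human annotator, while high-scoring segments are accepted automatically.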
CONTRIBUTORS/CORRESPONDENCE
- Ajinkya Kulkarni (ajinkyakulkarni14@gmail.com)
LICENSE FOR USAGE
This work/project is licensed under the GNU GPL, which gives users:
- the freedom to use the software for any purpose,
- the freedom to change the software to suit your needs,
- the freedom to share the software with your friends and neighbors, and
- the freedom to share the changes you make.
It is recommended that due acknowledgement be given to the authors, Ajinkya Kulkarni and Parth Gargava, when using the corpus for research.