ajinkyakulkarni14/Audio-Book-Corpus-for-European-Languages-

# Audio-Book-Corpus

The Audio Book Corpus (ABC) project was developed to aid linguistics researchers working on text-to-speech, for purely academic purposes. In its current form, the corpus consists of approximately 200 minutes of speech data in German. Besides German, we are also developing corpora for Portuguese and Italian. Future versions of the corpus shall encompass most European languages, such as French, Spanish, Czech, Dutch, Polish, and Romanian.

CORPUS DETAILS

The ABC project consists of three modules. The speech data is in WAV file format and is taken from LibriVox, https://librivox.org/. LibriVox provides free public-domain audiobooks, which makes the recordings usable for academic linguistic research.
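Since the speech data is distributed as WAV files, the amount of audio in a local copy can be checked with Python's standard `wave` module. This is a minimal sketch, not part of the ABC project itself; the function names `wav_duration_seconds` and `total_minutes` are hypothetical, and the code assumes uncompressed PCM WAV files.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def total_minutes(paths):
    """Return the combined duration of several WAV files in minutes."""
    return sum(wav_duration_seconds(p) for p in paths) / 60.0
```

Calling `total_minutes` on the list of WAV paths in a corpus folder gives a quick check against the stated corpus size.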

TECHNIQUE FOR ANNOTATION

After removing noise from the audio data, we used semi-automatic annotation based on deep learning and a fuzzy matching technique. We annotated 20% of the corpus manually and then, using deep learning techniques, trained a model to validate the remaining 80% of the speech data. To support this, we built a small GUI (in Python) to visualize the audio files together with the annotated text and to check that the text matches the speech signal.
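The fuzzy matching step can be illustrated with a short sketch. The exact algorithm used for ABC is not described here, so this example substitutes `difflib.SequenceMatcher` from the Python standard library as a stand-in. The function names, the sample sentences, and the 0.85 review threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop punctuation so that only the words are compared."""
    kept = (ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join("".join(kept).split())


def similarity(recognized, reference):
    """Return a similarity ratio in [0, 1] between two pieces of text."""
    return SequenceMatcher(None, normalize(recognized), normalize(reference)).ratio()


def best_match(recognized, candidates):
    """Return (index, score) of the candidate sentence closest to the recognized text."""
    scores = [similarity(recognized, c) for c in candidates]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]


def needs_review(score, threshold=0.85):
    """Flag a segment for manual checking (the threshold is an assumed value)."""
    return score < threshold


if __name__ == "__main__":
    book_sentences = ["Es war einmal ein König.", "Der Hund lief in den Wald."]
    index, score = best_match("es war ein mal ein könig", book_sentences)
    print(index, round(score, 3), needs_review(score))
```

Here the recognized text (for example, the output of a speech recognizer) is compared against candidate sentences from the book, and segments with a low score would be sent back for manual annotation.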

CONTRIBUTORS/CORRESPONDENCE

  1. Ajinkya Kulkarni (ajinkyakulkarni14@gmail.com)

LICENSE FOR USAGE

This project is licensed under the GNU GPL, which gives users:

 the freedom to use the software for any purpose,

 the freedom to change the software to suit your needs,

 the freedom to share the software with your friends and neighbors, and

 the freedom to share the changes you make.

It is recommended that due acknowledgement be given to the authors, Ajinkya Kulkarni and Parth Gargava, when using the corpus for research.
