Speech-to-text data collection with Kafka, Airflow, and Spark


Table of Contents

  1. Introduction
  2. Project Structure
  3. Installation guide

Introduction

Large, high-quality datasets are critical to the performance, fairness, robustness, reliability, and scalability of ML systems, yet data scientists often lack datasets that are both large and diverse enough to train and test the models they design. This project develops a tool that posts and receives text and audio files to and from a data lake, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model. The general objective is to build a data engineering pipeline with Apache Kafka, Apache Spark, and Apache Airflow that can collect millions of Amharic and Swahili audio recordings from speakers reading digital texts in app and web platforms. These recordings can then be used to produce a large and diverse dataset for training and testing speech-to-text models.

The proposed data pipeline is built on Apache Kafka, an open-source distributed event streaming platform. By combining messaging, storage, and stream processing, the pipeline allows real-time audio datasets to be collected, stored, and analyzed. It consists of the following key components (a minimal producer sketch follows the list):

  1. Data producers
  2. Data consumers
  3. Apache Kafka cluster
  4. Amazon S3 bucket Connectors
  5. Apache Spark Stream preprocessors
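
As a concrete illustration of the producer side, the sketch below publishes one recorded clip and its transcript to a Kafka topic using the kafka-python client. It is a minimal example, not code from this repository: the topic name "audio-uploads", the record field names, and the broker address localhost:9092 are assumptions made for illustration.

# producer_sketch.py -- hypothetical example, not part of the repository.
# Publishes an audio clip and its transcript text to a Kafka topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",           # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open("sample.wav", "rb") as f:
    audio_bytes = f.read()

record = {
    "speaker_id": "sw-0001",                      # illustrative field names
    "language": "swahili",
    "text": "Habari ya asubuhi",
    "audio_hex": audio_bytes.hex(),               # small clips only; large files
}                                                 # belong in S3, with a reference here

producer.send("audio-uploads", value=record)      # assumed topic name
producer.flush()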

Project Structure

  • images/: the folder where all snapshots for the project are stored.
  • logs/: the folder where script logs are stored.
  • data/: the folder where the dataset files are stored.
  • .github/: the folder where GitHub Actions workflows and unit tests are integrated.
  • cml.yaml: the file where the CML configuration is stored.
  • .vscode/: the folder where local paths are stored.
  • notebooks/: Jupyter notebooks for preprocessing the data.
  • scripts/: the folder where the project's modules are stored.
  • tests/: the folder containing unit tests for the scripts.
  • requirements.txt: a text file listing the project's dependencies.
  • .travis.yml: the Travis CI configuration file for running unit tests.
  • setup.py: a configuration file for installing the scripts as a package.
  • results.txt: a text file containing the results of the CML report.
  • README.md: Markdown text with a brief explanation of the project and the repository structure.

Installation guide

Conda Environment

conda create --name stt python==3.8
conda activate stt

Next, clone the repository and install the package

git clone https://github.com/Speech-to-text-Kafka-Airflow-Spark/StoTkas.git
cd StoTkas
sudo python3 setup.py install

Setting up Docker containers for the project

Setting up Kafka and ZooKeeper

docker-compose -f docker-compose.yml up -d
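
Once the broker container is up, a quick way to confirm it is reachable from the host is to list its topics. This is a hypothetical check, not a required step; it assumes the broker is published on localhost:9092.

from kafka import KafkaConsumer

# Connect to the broker started by docker-compose.yml and list its topics.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())
consumer.close()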

Setting up Spark and Airflow

docker-compose -f docker-compose1.yml up -d
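
With Spark running, the stream-preprocessing step can subscribe to the same topic the producers write to. The snippet below is a minimal PySpark Structured Streaming sketch rather than the repository's actual job; the topic name, broker address, and Kafka connector package version are assumptions matching the producer example above.

# spark_stream_sketch.py -- illustrative only, not the repository's job.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("stt-audio-stream")
    # The Kafka source needs the spark-sql-kafka package on the classpath.
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2")
    .getOrCreate()
)

# Subscribe to the hypothetical "audio-uploads" topic from the producer sketch.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "audio-uploads")
    .load()
)

# Kafka values arrive as binary; cast to string before further transformation.
decoded = stream.selectExpr("CAST(value AS STRING) AS json_payload")

query = (
    decoded.writeStream
    .format("console")          # replace with an S3/warehouse sink in a real job
    .outputMode("append")
    .start()
)
query.awaitTermination()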

License

MIT
