Speech-to-Text Data Collection

A deployable tool that posts text and audio files to a data lake and receives them from it, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.
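The extract, transform, and load flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual implementation: the column names (`headline`, `article`), the table name `news`, and the helper functions are hypothetical, an in-memory string stands in for a file pulled from the data lake, and SQLite stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a text file pulled from the data lake.
# The column names ("headline", "article") are assumptions for illustration.
SAMPLE_CSV = (
    "headline,article\n"
    "  First headline ,Some article text.\n"
    ",Row without a headline is dropped.\n"
    "Second headline,  More article text.\n"
)


def extract(source):
    """Read CSV rows from a file-like object into dictionaries."""
    return list(csv.DictReader(source))


def transform(rows):
    """Trim surrounding whitespace and drop rows with any empty field."""
    cleaned = []
    for row in rows:
        stripped = {key: (value or "").strip() for key, value in row.items()}
        if all(stripped.values()):
            cleaned.append(stripped)
    return cleaned


def load(rows, connection):
    """Insert cleaned rows into a table and return the table's row count.

    SQLite is only a stand-in for the real warehouse.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS news (headline TEXT, article TEXT)"
    )
    connection.executemany(
        "INSERT INTO news (headline, article) VALUES (:headline, :article)",
        rows,
    )
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM news").fetchone()[0]


def run_pipeline(source, connection):
    """Chain the three stages: extract, then transform, then load."""
    return load(transform(extract(source)), connection)


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    loaded = run_pipeline(io.StringIO(SAMPLE_CSV), warehouse)
    print(f"rows in warehouse: {loaded}")
```

Keeping each stage as a separate function makes it straightforward to wrap them as individual tasks in a scheduler such as Airflow, where each stage can be retried and monitored on its own.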

Data Capture Pipeline

Directory Structure

.
├── airflow
│   ├── dags
│   │   ├── extract_load.py
│   │   └── scripts
│   │       ├── dataloader.py
│   │       ├── db_connection.py
│   │       ├── __init__.py
│   │       └── schema
│   │           └── amharicnews.sql
│   ├── data
│   │   └── AmharicNewsDataset.csv
│   ├── docker-compose.yaml
│   └── logs
│       └── scheduler
│           └── latest -> /opt/airflow/logs/scheduler/2022-10-05
├── backend
│   └── dummy.txt
├── frontend
│   ├── dummy.txt
│   ├── frontend
│   │   ├── package.json
│   │   ├── package-lock.json
│   │   ├── public
│   │   │   ├── favicon.ico
│   │   │   ├── index.html
│   │   │   ├── logo192.png
│   │   │   ├── logo512.png
│   │   │   ├── manifest.json
│   │   │   └── robots.txt
│   │   ├── README.md
│   │   └── src
│   │       ├── App.css
│   │       ├── App.js
│   │       ├── App.test.js
│   │       ├── index.css
│   │       ├── index.js
│   │       ├── logo.svg
│   │       ├── reportWebVitals.js
│   │       └── setupTests.js
│   └── proto.png
├── img
│   ├── logo.png
│   └── pipelineDiagram.png
├── LICENSE
├── logging
│   └── dummy.txt
├── notebook
│   └── Amharic_news_Classification.ipynb
├── README.md
├── requirements.txt
├── screenshots
│   ├── airflowscreenshoot.png
│   └── design diagram.png
└── testing
    ├── dummy.txt
    └── test_dataloading.py

17 directories, 39 files

Run Locally

Clone the project

  git clone https://github.com/create-speech-to-text-pipeline/pipeline

Go to the project directory

  cd pipeline

Install dependencies

  pip3 install -r requirements.txt

Set up the pipeline

  python3 setup.py

License
