A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models
This repository accompanies the 🤗 HuggingFace Community Paper on finetuning Wav2Vec2 XLSR for low-resource languages [link]
(Mostly identical to the huggingface/datasets contributing guide)
- Fork the repository by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.
- Clone your fork to your local disk, and add the base repository as a remote:
    git clone git@github.com:<your Github handle>/wav2vec-toolkit.git
    cd wav2vec-toolkit
    git remote add upstream https://github.com/anton-l/wav2vec-toolkit.git
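To confirm that the remote was added as intended, a quick check can help. The sketch below is only an illustration: `check_upstream` is a hypothetical helper name, not part of this repository, and it assumes it is run from inside the cloned `wav2vec-toolkit` directory.

```shell
# Hypothetical helper (not part of wav2vec-toolkit): report the URL of the
# "upstream" remote, or fail if no such remote is configured.
check_upstream() {
    if url=$(git remote get-url upstream 2>/dev/null); then
        echo "upstream -> $url"
    else
        echo "no 'upstream' remote configured" >&2
        return 1
    fi
}
```

Calling `check_upstream` after the `git remote add` step should print the base repository URL. `git remote -v` lists all remotes if you prefer to inspect them by eye.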
- Set up a development environment by running the following commands in a virtual environment:
    conda create -n env python=3.7 -y
    conda activate env
    pip install -e ".[dev]"
    pip install -r languages/{YOUR_SPECIFIC_LANGUAGE}/requirements.txt
  (If wav2vec-toolkit was already installed in the virtual environment, remove it with `pip uninstall wav2vec_toolkit` before reinstalling it in editable mode with the `-e` flag.)
- Create a new branch to hold your development changes:
    git checkout -b a-descriptive-name-for-my-changes
  Do not work on the `main` branch.
- Develop the features on your branch.
- Adding a new language here
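Before committing, it can be worth confirming that you are on your feature branch and not on the repository's default branch. The following is a minimal sketch under stated assumptions: `ensure_feature_branch` is a hypothetical helper, and it treats both `main` and `master` as protected names because the default branch name is an assumption here.

```shell
# Hypothetical helper (not part of wav2vec-toolkit): succeed only when the
# checked-out branch is something other than main or master.
ensure_feature_branch() {
    # symbolic-ref fails on a detached HEAD, which is not a feature branch either.
    branch=$(git symbolic-ref --short HEAD 2>/dev/null) || {
        echo "Not on a branch (detached HEAD?)." >&2
        return 1
    }
    case "$branch" in
        main|master)
            echo "You are on '$branch'; create a feature branch first." >&2
            return 1
            ;;
    esac
    echo "On feature branch: $branch"
}
```

A typical use would be `ensure_feature_branch && git commit`, so that the commit only happens on a development branch.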
- Format your code. Run black and isort so that your newly added files look nice with the following commands:
    black --line-length 119 --target-version py36 src scripts languages
    isort src scripts languages
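If you only want to see whether anything would be reformatted, without touching the files, both tools offer a check mode (`black --check` and `isort --check-only`). The wrapper below is a sketch: `check_format` is a hypothetical name, and it assumes it is run from the repository root with black and isort installed.

```shell
# Hypothetical helper (not part of wav2vec-toolkit): verify formatting without
# modifying any file. Returns non-zero if either tool would make changes.
check_format() {
    black --check --line-length 119 --target-version py36 src scripts languages &&
        isort --check-only src scripts languages
}
```

A non-zero exit status from `check_format` suggests that the formatting commands above still need to be run.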
- Once you're happy with your implementation, add your changes and make a commit to record them locally:
    git add .
    git commit
  It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
    git fetch upstream
    git rebase upstream/main
  Push the changes to your account using:
    git push -u origin a-descriptive-name-for-my-changes
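The sync step can stop partway if the rebase hits conflicts. The sketch below wraps the two sync commands and prints a hint in that case. `sync_with_upstream` is a hypothetical helper name, and it assumes the `upstream` remote and its `main` branch exist as configured in the earlier steps.

```shell
# Hypothetical helper (not part of wav2vec-toolkit): fetch the base repository
# and rebase the current branch onto upstream/main.
sync_with_upstream() {
    git fetch upstream || return 1
    if ! git rebase upstream/main; then
        # The rebase is left in progress so that conflicts can be resolved by hand.
        echo "Rebase stopped. Resolve conflicts, then run 'git rebase --continue'," >&2
        echo "or run 'git rebase --abort' to return to the previous state." >&2
        return 1
    fi
}
```

If the branch was already pushed before the rebase, the next push generally needs `git push --force-with-lease`, because a rebase rewrites the branch history.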
- Once you are satisfied, go to the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.