Data2Vec 2.0

Data2Vec is a highly efficient, general self-supervised framework for learning representations for vision, speech, and text. This repository contains a ready-to-train data2vec (arXiv) implementation, along with helper scripts to load and process the data and train the model.

If you want to understand Data2Vec in detail, check out this blog on Paperspace.

Run in a Free GPU powered Gradient Notebook

Setup

The file installations.sh contains all the commands needed to install the required dependencies. Note that your system must have CUDA to train data2vec. You may also need a different version of torch depending on your CUDA version. If you are running this on Paperspace, the default CUDA version is 11.6, which is compatible with this code. If you are running it elsewhere, check your CUDA version with nvcc --version. If it differs from ours, you may want to change the versions of the PyTorch libraries in the first line of installations.sh according to the compatibility table.
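The version check described above can be sketched as a small shell snippet. This is only an illustration: the helper name parse_cuda_release is hypothetical and not part of this repository, and it assumes that nvcc --version prints a line containing "release X.Y", as current CUDA toolkits do.

```shell
# Hypothetical helper (not part of this repo): read `nvcc --version` output
# on stdin and print only the "X.Y" that follows the word "release".
parse_cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

if command -v nvcc >/dev/null 2>&1; then
  cuda_version="$(nvcc --version | parse_cuda_release)"
  echo "Detected CUDA ${cuda_version}"
  if [ "${cuda_version}" != "11.6" ]; then
    # 11.6 is the Paperspace default mentioned in this README.
    echo "CUDA differs from 11.6: consider editing the first line of installations.sh"
  fi
else
  echo "nvcc not found: the CUDA toolkit may be missing or not on PATH"
fi
```

If the detected version is not 11.6, pick the matching PyTorch build from the compatibility table before running installations.sh.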

To install all the dependencies, run the command below:

bash installations.sh

Downloading datasets & starting training

The datasets directory in this repo contains the scripts needed to download the data and prepare it for training. Currently, this repository supports downloading three datasets: ImageNet (Vision), LibriSpeech (Speech), and OpenWebText (Text).

We have already set up bash scripts that automatically download the dataset and start the training. The scripts directory in this repo contains these bash scripts for a few of the many tasks that data2vec supports. You can look at any of these task scripts to understand what it does.

These bash scripts are written for a Paperspace workspace. If you are running them elsewhere, you will need to replace the base path in the paths mentioned in these task scripts.
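One way to do this replacement is with sed, as in the minimal sketch below. The function rebase_paths and the paths /old/base and /new/base are placeholders for illustration only; open the task script first to see which base path it actually uses. The sketch treats the old path as a sed pattern, so it assumes the paths contain no "|" or "&" characters.

```shell
# Hypothetical helper (not part of this repo).
# Usage: rebase_paths OLD_BASE NEW_BASE FILE
# Rewrites every occurrence of OLD_BASE in FILE to NEW_BASE and keeps the
# untouched original next to it as FILE.bak.
rebase_paths() {
  old_base="$1"
  new_base="$2"
  file="$3"
  cp "$file" "$file.bak"
  sed "s|${old_base}|${new_base}|g" "$file.bak" > "$file"
}

# Example with placeholder paths (replace them with your own):
# rebase_paths "/old/base" "/new/base" scripts/train_data2vec_multi_image.sh
```

Because a .bak copy is kept, you can compare the two files with diff to confirm that only the intended paths changed.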

To download the data files and start training, run the command below that corresponds to your task:

# Downloads ImageNet and starts training data2vec_multi with it.
bash scripts/train_data2vec_multi_image.sh

# Downloads OpenWebText and starts training data2vec_multi with it.
bash scripts/train_data2vec_multi_text.sh

# Downloads LibriSpeech and starts training data2vec_multi with it.
bash scripts/train_data2vec_multi_speech.sh

Note that you may want to change some of the arguments in these task scripts based on your system. Since we have a single GPU, we set distributed_training.distributed_world_size=1; change it to match your setup.
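If you are unsure which value to use, the sketch below counts the GPUs visible to nvidia-smi and prints a matching override. The helpers count_gpus and world_size_override are hypothetical names, not part of this repository, and the sketch assumes you want one process per GPU. It only prints the argument; you still need to edit it into the task script yourself.

```shell
# Hypothetical helper: count visible NVIDIA GPUs, falling back to 1 when
# nvidia-smi is unavailable or reports no GPUs.
count_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # `nvidia-smi --list-gpus` prints one "GPU <index>: ..." line per device.
    n="$(nvidia-smi --list-gpus 2>/dev/null | grep -c '^GPU ' || true)"
  else
    n=0
  fi
  [ "$n" -ge 1 ] 2>/dev/null || n=1
  echo "$n"
}

# Hypothetical helper: format the override for a given world size.
world_size_override() {
  echo "distributed_training.distributed_world_size=$1"
}

echo "Suggested override: $(world_size_override "$(count_gpus)")"
```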

Original Code

The data2vec directory contains the original code taken from the fairseq repository. The code in this directory is exactly the same as the original; we have only changed some of the config files corresponding to the tasks.

Reference

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language -- https://arxiv.org/abs/2202.03555

@article{DBLP:journals/corr/abs-2202-03555,
  author    = {Alexei Baevski and
               Wei{-}Ning Hsu and
               Qiantong Xu and
               Arun Babu and
               Jiatao Gu and
               Michael Auli},
  title     = {data2vec: {A} General Framework for Self-supervised Learning in Speech,
               Vision and Language},
  journal   = {CoRR},
  volume    = {abs/2202.03555},
  year      = {2022}
}

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language -- https://arxiv.org/abs/2212.07525

@misc{baevski2022efficient,
      title={Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language},
      author={Alexei Baevski and Arun Babu and Wei-Ning Hsu and Michael Auli},
      year={2022},
      eprint={2212.07525},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

See the LICENSE file.
