Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation (EMNLP 2023)

Dynosaur aims to 1) build a dynamically growing instruction-tuning dataset at low maintenance cost, and 2) provide a venue to study how to dynamically improve instruction-tuning models. The repo contains:

  • Dynosaur data
  • Crawling code for capturing dataset metadata information
  • Instruction generation process with ChatGPT based on metadata
  • Model checkpoints

Usage and License Notices: All generated task instructions (excluding the instances of each task) are released under the Apache-2.0 license. The instances of each task are subject to the license under which the original dataset was released. This license information is available in the Dynosaur data and in instruction_data/license_info.json.

Updates

  • Nov 30, 2023: We release the Dynosaur training data for Longform evaluation at 🤗 Huggingface Datasets.
  • Oct 10, 2023: Dynosaur is accepted at EMNLP 2023! We update Dynosaur performance on Longform in the camera-ready version. See you in Singapore!
  • Jul 6, 2023: Dynosaur v1 is coming! Dynosaur-full and Dynosaur-sub-superni are released at 🤗 Huggingface Datasets. T5-3B and LLaMA-7B fine-tuned with Dynosaur are released at 🤗 Huggingface Models.
  • May 23, 2023: Dynosaur v0 is here! Dynosaur-full data, the metadata collection method, and the instruction generation code are released! We will upload them to Huggingface soon!

Overview

We propose Dynosaur, a large-scale instruction tuning dataset obtained automatically with significantly lower generation costs. Dynosaur leverages the metadata of existing NLP datasets to generate task instructions and organize corresponding inputs/outputs. By utilizing LLMs, we generate multiple task instructions applicable to various NLP domains and determine the relevant data fields for constructing instruction tuning data.

Dynosaur offers several advantages, including

  • Lower generation costs ($11.5 for generating 800K instruction-tuning instances)
  • Decent quality of instruction tuning data (better performance than Alpaca and Instruction GPT-4 on Super-NI)
  • Ability to grow dynamically by incorporating new datasets from Huggingface Datasets Platform
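
To make this concrete, below is a minimal sketch of the metadata-to-instruction step, assuming an OpenAI-style chat client (openai>=1.0). The prompt wording, the ag_news example, and the JSON answer format are illustrative only, not the exact prompts used in this repo; see generate_tasks_with_description.py for the real ones.

import json
from openai import OpenAI   # assumes openai>=1.0 and OPENAI_API_KEY in the environment

client = OpenAI()

# Illustrative metadata record; Dynosaur builds these from Huggingface dataset metadata.
metadata = {
    "dataset": "ag_news",
    "description": "News articles labeled with one of four topic categories.",
    "fields": ["text", "label"],
}

prompt = (
    "Given the dataset metadata below, propose instruction-tuning tasks.\n"
    f"Metadata: {json.dumps(metadata)}\n"
    'Answer as a JSON list: [{"instruction": ..., "input_fields": [...], "output_field": ...}]'
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
tasks = json.loads(response.choices[0].message.content)   # may raise if the reply is not valid JSON
print(tasks)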

Data Release

We offer dynosaur-full, containing all the generated instruction tuning data in Dynosaur. It covers most licensed and non-null English datasets in 🤗 Huggingface Datasets as of Feb 23, 2023. Each instance is a dictionary with the following keys (a minimal loading sketch follows the list):

  • instruction: str, describes task instructions
  • input: str, input text for the task
  • output: str, ground-truth output for the given instruction and input
  • license: str, license of the source dataset
  • website: str, URL of the source dataset's Huggingface page
  • taskname: str, Huggingface dataset name
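
As a quick start, the following sketch loads the released data with 🤗 Datasets and prints one example. The hub id "Dynosaur/dynosaur-full" and the "train" split name are assumptions taken from the release notes above; adjust them if the hosted names differ.

from datasets import load_dataset

ds = load_dataset("Dynosaur/dynosaur-full", split="train")   # assumed hub id and split
example = ds[0]
for key in ("instruction", "input", "output", "license", "website", "taskname"):
    print(f"{key}: {example.get(key)}")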

We also provide the collected metadata huggingface_metadata.jsonl and data huggingface_data.jsonl from Huggingface. They form the foundation for synthesizing Dynosaur instructions.

Data Generation Process

To generate Dynosaur-full, please run the following commands step by step:

Metadata Collection

python parse_hf.py                                            # crawl Huggingface datasets and collect metadata
python license_info.py                                        # capture license information of each dataset
python check_multilingual.py                                  # select English-only datasets

If you want to skip these steps, you may use the collected metadata huggingface_metadata.jsonl and data huggingface_data.jsonl on AWS S3.
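
For intuition about what the crawling step collects, here is a rough, independent sketch built on huggingface_hub's list_datasets. The output schema below is a simplification, not the exact format of huggingface_metadata.jsonl, and parse_hf.py may take a different crawling route entirely.

import json
from huggingface_hub import list_datasets

with open("huggingface_metadata_sample.jsonl", "w") as f:
    for info in list_datasets(full=True, limit=100):   # limit kept small for the sketch
        record = {
            "name": info.id,
            "tags": info.tags or [],   # language, license, and task tags live here
        }
        f.write(json.dumps(record) + "\n")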

Instruction Generation and Filtering

python generate_tasks_with_description.py                     # generate description-aware tasks
python generate_tasks_without_description.py                  # generate description-unaware tasks
python filter_invalid_tasks.py                                # filter out invalid tasks
python organize_data.py                                       # organize instruction data

These steps produce the instruction data instruction_data/instruction-full.json and the instruction-tuning dataset dynosaur-full. We will provide the sampled data for evaluation on Super-NI and user instructions soon.
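
Conceptually, the organizing step maps each generated task (an instruction plus its chosen input and output fields) onto the rows of the source dataset. The sketch below shows that mapping on a made-up task and two made-up rows; it is not the actual logic of organize_data.py, which also handles filtering and license bookkeeping.

# Made-up task and rows, for illustration only.
task = {
    "instruction": "Classify the news article into one of the topic categories.",
    "input_fields": ["text"],
    "output_field": "label",
}
rows = [
    {"text": "Stocks rallied after the earnings report.", "label": "Business"},
    {"text": "The team clinched the title in overtime.", "label": "Sports"},
]

instruction_data = [
    {
        "instruction": task["instruction"],
        "input": "\n".join(str(row[field]) for field in task["input_fields"]),
        "output": str(row[task["output_field"]]),
    }
    for row in rows
]
print(instruction_data[0])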

Fine-tuning

We fine-tune T5-3B and LLaMA-7B with the following hyperparameters:

Hyperparameter    T5-3B    LLaMA-7B
Batch size        16       128
Learning rate     1e-5     3e-4
Epochs            2        3
Max length        512      512

To fine-tune your own models, please refer to the training code of Tk-Instruct, Stanford Alpaca, or Alpaca-LoRA.
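
If you would rather wire things up yourself, here is a minimal transformers Trainer sketch using the LLaMA-7B settings from the table (learning rate 3e-4, 3 epochs, max length 512, effective batch size 128 via gradient accumulation). The checkpoint id "huggyllama/llama-7b", the hub id "Dynosaur/dynosaur-full", and the prompt template are assumptions for illustration; the released models were trained with the Tk-Instruct and Alpaca recipes referenced above.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "huggyllama/llama-7b"                    # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token           # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

def format_and_tokenize(example):
    # Simple instruction/input/output template; Alpaca-style templates differ slightly.
    text = (f"Instruction: {example['instruction']}\n"
            f"Input: {example['input']}\n"
            f"Output: {example['output']}")
    return tokenizer(text, truncation=True, max_length=512)

train = load_dataset("Dynosaur/dynosaur-full", split="train").map(format_and_tokenize)

args = TrainingArguments(
    output_dir="dynosaur-llama-7b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,    # 4 x 32 = effective batch size 128 on one GPU
    learning_rate=3e-4,
    num_train_epochs=3,
    bf16=True,                         # assumes bf16-capable hardware
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()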

Limitations

Dynosaur is still under development and has room for improvement. We are studying methods that better control instruction quality and generate more diverse instructions, and we are working to mitigate any biases introduced in Dynosaur. Stay tuned!

Citation

If you find this work relevant to your research, please feel free to cite it!

@article{yin2023dynosaur,
  title={Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation},
  author={Yin, Da and Liu, Xiao and Yin, Fan and Zhong, Ming and Bansal, Hritik and Han, Jiawei and Chang, Kai-Wei},
  journal={EMNLP},
  year={2023}
}

As the Dynosaur data is built on Huggingface Datasets resources, please also cite the Huggingface Datasets paper:

  • Datasets: A Community Library for Natural Language Processing. Lhoest et al., 2021. EMNLP 2021 demo.

Acknowledgements

We greatly appreciate Huggingface for their effort in open-sourcing fantastic NLP datasets! We also sincerely thank the authors of all the datasets incorporated in Dynosaur.

We also thank Yizhong Wang for providing the diversity analysis plotting code and the Tk-Instruct training code, and the Stanford Alpaca team for releasing the code to fine-tune LLaMA.
