AutoSynth

Automatically create synthetic data using SOTA techniques (Self Instruct, Magpie, Agent Instruct, Arena Learning, Genstruct, Instruction Synthesizer, Self-Curation) using your LLMs.

## Research log

11-09-2024
----------
implemented [Instruction Synthesizer](./techniques/instruction_synthesizer/) 
using the instruction-synthesizer pretrained model from huggingface hub. Defined custom pipeline 
steps to generate multiple instruction response pairs from a single text corpus. Fixed the 
multiprocessing conflicts arising when using external models using transformers library in the custom Step sub-classes.

10-09-2024
----------
implemented [Genstruct](./techniques/genstruct/) technique using the provided distilabel task. 
Used the Genstruct-7B pretrained models to generate instruction-response pairs from raw text corpora.

31-08-2024
----------
implemented [Magpie](./techniques/magpie/) technique for conversation generation using the 
provided distilabel task. Defined custom distilabel pipeline step to perform list 
splitting and converting conversation to instruction-response pairs.

30-08-2024
----------
implemented Self-Curation technique for Data Curation as described in this [paper](https://huggingface.co/blog/akjindal53244/llama31-storm8b). 
Used strong models like Meta Llama and Facebook Bert Classifier to separate quality data.

26-08-2024
----------
implemented [Self Instruct](./techniques/self_instruct/) technique. Coded custom distilabel 
pipeline steps to perform tasks like renaming columns or swapping columns during pipeline processing. 
Used the distilabel Self-Instruct task to generate n instructions from a raw text corpora.

24-08-2024
----------
implemented [Arena Learning](./techniques/arena_learning/). Used Phi-3 mini, TinyLlama, OpenAI GPT 4o 
mini for testing. It generates multiple answers to a question using different models 
and then ranks them using a stronger model. Used GPT 4o mini for ranking.

22-08-2024
----------
implemented [Agent Instruct](./techniques/agent_instruct/). Used Phi-3 mini for testing. It generates instructions using multiple agents.

✅ Techniques

Setup

1. Install Dependencies:

pip install -r requirements.txt

2. Setup configs:

The directory of each technique in techniques contains an example config.yaml file. Adjust the values of datasets, models, and other parameters in that file.

3. Run the main.py script: Specify the config.yaml file path and the technique name while runnning this command.

python main.py --config=./techniques/genstruct/config.yaml --technique=genstruct

For help run,

python main.py --help

🦋 Citation

If you find this code useful, please cite the following:

@misc{Zain2024AutoSynth,
  author = {Zain, Abideen},
  title = {AutoSynth},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/abideenml/AutoSynth}},
}

Connect with me

If you'd love to have some more AI-related content in your life 🤓, consider:

Connect and reach me on LinkedIn and Twitter
Follow me on 📚 Medium
Check out my 🤗 HuggingFace

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
techniques		techniques
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
test.py		test.py
test.yaml		test.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AutoSynth

✅ Techniques

Setup

🦋 Citation

Connect with me

About

Languages

License

abideenml/AutoSynth

Folders and files

Latest commit

History

Repository files navigation

AutoSynth

✅ Techniques

Setup

🦋 Citation

Connect with me

About

Topics

Resources

License

Stars

Watchers

Forks

Languages