πŸ‘©πŸ€πŸ€– awesome-llm-datasets

This repository is a collection of useful links related to datasets for large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF).

It includes a variety of open datasets, as well as tools, pre-trained models, and research papers that can help researchers and developers work with LLMs and RLHF from a data perspective.

Follow and star for the latest and greatest links related to datasets for LLMs and RLHF.

Table of Contents

  1. πŸ“¦ Datasets
    1. πŸ“š For pre-training
      1. 2023
      2. Before 2023
    2. πŸ—£οΈ For instruction-tuning
    3. πŸ‘©πŸ€πŸ€– For RLHF
    4. βš–οΈ For evaluation
    5. πŸ‘½ For other purposes
  2. 🦾 Models and their datasets
  3. 🧰 Tools and methods
  4. πŸ“” Papers

Datasets

For pre-training

2023

RedPajama Data:

A 1.2-trillion-token dataset in English:

| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |

Also includes code for data preparation, deduplication, tokenization, and visualization.

Created by Ontocord.ai, MILA QuΓ©bec AI Institute, ETH DS3Lab, UniversitΓ© de MontrΓ©al, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research research group and LAION.
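For quick exploration, the corpus can be streamed rather than downloaded in full. The snippet below is a minimal sketch that assumes the data is published on the Hugging Face Hub as togethercomputer/RedPajama-Data-1T with per-source configurations such as "arxiv"; adjust the names if your copy differs.

```python
# Sketch: stream a slice of the RedPajama corpus with the `datasets` library.
# Assumes the Hub id togethercomputer/RedPajama-Data-1T and a per-source
# config name like "arxiv"; recent `datasets` releases may also require
# trust_remote_code=True for script-based datasets.
from datasets import load_dataset

arxiv = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",
    split="train",
    streaming=True,  # avoid downloading the full multi-terabyte corpus
)

for i, example in enumerate(arxiv):
    print(example["text"][:200])  # each record carries raw text plus metadata
    if i == 2:
        break
```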

Before 2023

For instruction-tuning

For RLHF & Alignment

For evaluation

For other purposes

Models and their datasets

LLaMA

Overview: A collection of foundation models released by Meta AI, ranging in size from 7B to 65B parameters and available under a non-commercial research license.

License: Non-commercial bespoke (model), GPL-3.0 (code)

πŸ“ Release blog post πŸ“„ arXiv publication πŸƒ Model card

Vicuna

Overview: A 13B-parameter open chatbot fine-tuned from LLaMA on ~70K user-shared ChatGPT conversations (ShareGPT); it reportedly retains about 92% of ChatGPT’s quality and outperforms LLaMA and Alpaca.

License: Non-commercial bespoke license (model), Apache 2.0 (code).

πŸ“¦ Repo

πŸ“ Release blog post

πŸ”— ShareGPT dataset

πŸ€— Models

πŸ€– Gradio demo
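ShareGPT-style data is typically distributed as JSON records containing a conversations list of alternating human/assistant turns. Below is a minimal sketch for flattening such records into prompt/response pairs; the field names (conversations, from, value) follow the common ShareGPT export format and may differ in a given dump, and sharegpt.json is a hypothetical local file.

```python
# Sketch: flatten ShareGPT-style multi-turn records into prompt/response pairs.
# Field names ("conversations", "from", "value") follow the common ShareGPT
# export format; a particular dump may use different keys.
import json

def to_pairs(record):
    pairs, prompt = [], None
    for turn in record.get("conversations", []):
        if turn["from"] == "human":
            prompt = turn["value"]
        elif turn["from"] == "gpt" and prompt is not None:
            pairs.append({"prompt": prompt, "response": turn["value"]})
            prompt = None
    return pairs

with open("sharegpt.json") as f:  # hypothetical local dump
    records = json.load(f)

pairs = [p for r in records for p in to_pairs(r)]
print(len(pairs))
```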

Dolly 2.0

Overview: A fully open source 12B parameter instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

License: CC BY-SA 3.0 (model), CC BY-SA 3.0 (dataset), Apache 2.0 (code).

πŸ“¦ Repo

πŸ“ Release blog post

πŸ€— Models
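The underlying instruction dataset is small enough to load directly. The sketch below assumes it is published on the Hugging Face Hub as databricks/databricks-dolly-15k with instruction, context, response, and category columns.

```python
# Sketch: inspect the Dolly instruction dataset with the `datasets` library.
# Assumes the Hub id databricks/databricks-dolly-15k and its
# instruction / context / response / category columns.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

example = dolly[0]
print(example["category"])        # e.g. "closed_qa", "brainstorming", ...
print(example["instruction"])
print(example["response"][:200])
```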

LLaVA

Overview: A multi-modal LLM that combines a vision encoder and Vicuna for general-purpose visual and language understanding, with capabilities similar to GPT-4.

License: Non-commercial bespoke (model), CC BY NC 4.0 (dataset), Apache 2.0 (code).

πŸ“¦ Repo

πŸ“ Project homepage

πŸ“„ arXiv publication

πŸ€— Dataset & models

πŸ€– Gradio demo

StableLM

Overview: A suite of small (3B and 7B parameter) LLMs trained on a new experimental dataset built on The Pile that contains 1.5 trillion tokens of content.

License: CC BY-SA-4.0 (models).

πŸ“¦ Repo

πŸ“ Release blog post

πŸ€— Models

πŸ€– Gradio demo
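A basic generation sketch with Hugging Face Transformers, assuming the checkpoints are published under ids such as stabilityai/stablelm-base-alpha-7b (swap in the 3B or tuned variants as needed):

```python
# Sketch: generate text with a StableLM-Alpha checkpoint via Transformers.
# The model id stabilityai/stablelm-base-alpha-7b is an assumption; adjust it
# to the checkpoint you actually want. device_map="auto" requires `accelerate`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-base-alpha-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Open datasets matter because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```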

Alpaca

Overview: A partially open instruction-following model fine-tuned from LLaMA that is smaller and cheaper to reproduce yet behaves similarly to GPT-3.5 (text-davinci-003).

License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).

πŸ“ Release blog post

πŸ€— Dataset
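The 52K self-instruct records can be pulled from the Hub and turned into training prompts. The sketch below assumes the dataset id tatsu-lab/alpaca with instruction, input, and output columns; the prompt template is illustrative rather than the exact one used for training.

```python
# Sketch: load the Alpaca instruction data and build simple training prompts.
# Assumes the Hub id tatsu-lab/alpaca with instruction/input/output columns;
# the prompt format here is illustrative only.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")

def build_prompt(row):
    if row["input"]:
        return (
            f"Instruction: {row['instruction']}\n"
            f"Input: {row['input']}\n"
            f"Response: {row['output']}"
        )
    return f"Instruction: {row['instruction']}\nResponse: {row['output']}"

print(build_prompt(alpaca[0]))
```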

Tools and methods

Papers
