Open-LLM-datasets

A repository for organizing datasets used in open LLMs.


Table of Contents


Datasets

To download or browse the most commonly used datasets, see the Hugging Face Hub: https://huggingface.co/datasets
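One common way to pull any of these datasets is through the Hugging Face `datasets` library; the sketch below assumes that library is installed and uses the Alpaca dataset purely as an illustration.

```python
# Sketch: loading a dataset from the Hugging Face Hub with the `datasets`
# library. The dataset name is an illustrative example, not an endorsement.
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")  # downloads and caches locally
print(dataset[0])  # inspect the first record
```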

General Open Access Datasets for Alignment

Open Datasets for Pretraining

Domain-specific datasets and Private datasets

Potential Overlap

|                   | OIG     | hh-rlhf  | xP3     | Natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT |
|-------------------|---------|----------|---------|------------------|-------------------|-----------|------------|
| OIG               | -       | Contains | Overlap | Overlap          | Overlap           |           | Overlap    |
| hh-rlhf           | Part of | -        |         |                  |                   |           | Overlap    |
| xP3               | Overlap |          | -       | Overlap          |                   |           | Overlap    |
| Natural instruct  | Overlap |          | Overlap | -                |                   |           | Overlap    |
| AlpacaDataCleaned | Overlap |          |         |                  | -                 | Overlap   | Overlap    |
| GPT-4-LLM         |         |          |         |                  | Overlap           | -         | Overlap    |
| Alpaca-CoT        | Overlap | Overlap  | Overlap | Overlap          | Overlap           | Overlap   | -          |

Papers

Pre-trained LLM

Instruction finetuned LLM

Aligned LLM


Open LLM

LLM Leaderboard


  • LLaMA - A foundational, 65-billion-parameter large language model.
  • Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
  • Flan-Alpaca - Instruction Tuning from Humans and Machines.
  • Baize - Baize is an open-source chat model trained with LoRA.
  • Cabrita - A Portuguese finetuned instruction LLaMA.
  • Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
  • Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
  • Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
  • GPTQ-for-LLaMA - 4 bits quantization of LLaMA using GPTQ.
  • GPT4All - Demo, data, and code to train an open-source assistant-style large language model based on GPT-J and LLaMA.
  • Koala - A Dialogue Model for Academic Research.
  • BELLE - Be Everyone's Large Language model Engine.
  • StackLLaMA - A hands-on guide to train LLaMA with RLHF.
  • RedPajama - An open-source recipe to reproduce the LLaMA training dataset.
  • Chimera - Latin Phoenix.
  • CaMA - a Chinese-English Bilingual LLaMA Model.
  • BLOOM - BigScience Large Open-science Open-access Multilingual Language Model.
  • BLOOMZ & mT0 - A family of models capable of following human instructions in dozens of languages zero-shot.
  • Phoenix
  • T5 - Text-to-Text Transfer Transformer.
  • T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization.
  • OPT - Open Pre-trained Transformer Language Models.
  • UL2 - a unified framework for pretraining models that are universally effective across datasets and setups.
  • GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective; it can be fine-tuned on a variety of natural language understanding and generation tasks.
  • ChatGLM-6B - ChatGLM-6B is an open-source bilingual (Chinese and English) dialogue language model based on the General Language Model (GLM) architecture.
  • RWKV - Parallelizable RNN with Transformer-level LLM Performance.
  • ChatRWKV - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model.
  • StableLM - Stability AI Language Models.
  • YaLM - a GPT-like neural network for generating and processing text.
  • GPT-Neo - An implementation of model & data parallel GPT3-like models.
  • GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile.
  • Dolly - a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT.
  • Pythia - Interpreting Autoregressive Transformers Across Time and Scale.
  • Dolly 2.0 - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
  • OpenFlamingo - an open-source reproduction of DeepMind's Flamingo model.
  • Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.
  • GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
  • GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
  • Palmyra - Palmyra Base was primarily pre-trained with English text.
  • Camel - a state-of-the-art instruction-following large language model.
  • h2oGPT
  • PanGu-α - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model.
  • MOSS - MOSS is an open-source dialogue language model that supports Chinese and English.
  • Open-Assistant - a project meant to give everyone access to a great chat-based large language model.
  • HuggingChat - A chat interface powered by Open Assistant's latest model, one of the strongest open-source chat models available, and the Hugging Face Inference API.
  • StarCoder - Hugging Face's LLM for code.
  • MPT-7B - An open LLM from MosaicML, licensed for commercial use.

LLM Training Frameworks

LLM Optimization

State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods

  • github: https://github.com/huggingface/peft
  • abstract: Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly; PEFT methods instead fine-tune only a small number of (extra) model parameters, greatly decreasing computational and storage costs, and recent state-of-the-art PEFT techniques achieve performance comparable to full fine-tuning. The library integrates seamlessly with Hugging Face Accelerate for large-scale models, leveraging DeepSpeed and Big Model Inference. A minimal usage sketch follows the list of supported methods below.
  • Supported methods:
  1. LoRA: Low-Rank Adaptation of Large Language Models
  2. Prefix Tuning: Prefix-Tuning: Optimizing Continuous Prompts for Generation, P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
  3. P-Tuning: GPT Understands, Too
  4. Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
  5. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
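The sketch below shows how the peft library is typically wired into a Transformers model to apply LoRA; the base model name and the LoRA hyperparameters are illustrative assumptions, not recommendations from this repository.

```python
# Minimal LoRA setup with Hugging Face `peft` (a sketch, not a full training
# script). Model name and hyperparameter values are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # adapting a causal language model
    r=8,                           # rank of the low-rank update matrices
    lora_alpha=16,                 # scaling applied to the LoRA updates
    lora_dropout=0.05,             # dropout on the LoRA layers
)

# Wrap the base model; only the small set of LoRA parameters stays trainable.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# The wrapped `model` can then be passed to a standard Trainer / training loop.
```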

Tools for deploying LLM

Tutorials about LLM

  • [Andrej Karpathy] State of GPT video
  • [Hyung Won Chung] Instruction finetuning and RLHF lecture Youtube
  • [Jason Wei] Scaling, emergence, and reasoning in large language models Slides
  • [Susan Zhang] Open Pretrained Transformers Youtube
  • [Ameet Deshpande] How Does ChatGPT Work? Slides
  • [Yao Fu] The Source of the Capability of Large Language Models: Pretraining, Instructional Fine-tuning, Alignment, and Specialization Bilibili
  • [Hung-yi Lee] ChatGPT: Analyzing the Principle Youtube
  • [Jay Mody] GPT in 60 Lines of NumPy Link
  • [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models Link
  • [NeurIPS 2022] Foundational Robustness of Foundation Models Link
  • [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. Video|Code
  • [DAIR.AI] Prompt Engineering Guide Link
  • [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers Link
  • [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) Link
  • [HuggingFace] What Makes a Dialog Agent Useful? Link
  • [HeptaAI] ChatGPT Kernel: InstructGPT, PPO Reinforcement Learning Based on Feedback Instructions Link
  • [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources Link
  • [Stephen Wolfram] What Is ChatGPT Doing ... and Why Does It Work? Link
  • [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? Link
  • [Hung-yi Lee] ChatGPT: How It Was (Possibly) Created - The Socialization Process of GPT Video
  • [OpenAI] Improving mathematical reasoning with process supervision

Courses about LLM

  • [DeepLearning.AI] ChatGPT Prompt Engineering for Developers Homepage
  • [Princeton] Understanding Large Language Models Homepage
  • [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF Slides
  • [Stanford] CS324-Large Language Models Homepage
  • [Stanford] CS25-Transformers United V2 Homepage
  • [Stanford Webinar] GPT-3 & Beyond Video
  • [MIT] Introduction to Data-Centric AI Homepage

Opinions about LLM

Other Awesome Lists

Other Useful Resources


How to Contribute

Since this repository focuses on collecting datasets for open LLMs, you are welcome to contribute by adding datasets in whatever form you prefer.

References

[1] https://github.com/KennethanCeyer/awesome-llm
[2] https://github.com/Hannibal046/Awesome-LLM
[3] https://github.com/Zjh-819/LLMDataHub
[4] https://huggingface.co/datasets