Open-LLM-datasets

A repository for organizing datasets used in open LLMs.


Table of Contents


Datasets

To download or browse the most commonly used datasets, see the Hugging Face Hub: https://huggingface.co/datasets
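One common way to pull any of these datasets is through the Hugging Face `datasets` library; the sketch below assumes that library is installed and uses the Alpaca dataset purely as an illustration.

```python
# Sketch: loading a dataset from the Hugging Face Hub with the `datasets`
# library. The dataset name is an illustrative example, not an endorsement.
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")  # downloads and caches locally
print(dataset[0])  # inspect the first record
```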

General Open Access Datasets for Alignment

Open Datasets for Pretraining

Domain-specific datasets and Private datasets

Potential Overlap

|                   | OIG     | hh-rlhf  | xP3     | Natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT |
|-------------------|---------|----------|---------|------------------|-------------------|-----------|------------|
| OIG               | -       | Contains | Overlap | Overlap          | Overlap           |           | Overlap    |
| hh-rlhf           | Part of | -        |         |                  |                   |           | Overlap    |
| xP3               | Overlap |          | -       | Overlap          |                   |           | Overlap    |
| Natural instruct  | Overlap |          | Overlap | -                |                   |           | Overlap    |
| AlpacaDataCleaned | Overlap |          |         |                  | -                 | Overlap   | Overlap    |
| GPT-4-LLM         |         |          |         |                  | Overlap           | -         | Overlap    |
| Alpaca-CoT        | Overlap | Overlap  | Overlap | Overlap          | Overlap           | Overlap   | -          |

Papers

Pre-trained LLM

Instruction finetuned LLM

Aligned LLM


Open LLM

LLM Leaderboard


  • LLaMA - A foundational, 65-billion-parameter large language model.
  • Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
  • Flan-Alpaca - Instruction Tuning from Humans and Machines.
  • Baize - Baize is an open-source chat model trained with LoRA.
  • Cabrita - A Portuguese finetuned instruction LLaMA.
  • Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
  • Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
  • Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
  • GPTQ-for-LLaMA - 4 bits quantization of LLaMA using GPTQ.
  • GPT4All - Demo, data, and code to train an open-source assistant-style large language model based on GPT-J and LLaMA.
  • Koala - A Dialogue Model for Academic Research.
  • BELLE - Be Everyone's Large Language model Engine.
  • StackLLaMA - A hands-on guide to train LLaMA with RLHF.
  • RedPajama - An open-source recipe to reproduce the LLaMA training dataset.
  • Chimera - Latin Phoenix.
  • CaMA - a Chinese-English Bilingual LLaMA Model.
  • BLOOM - BigScience Large Open-science Open-access Multilingual Language Model.
  • BLOOMZ & mT0 - A family of models capable of following human instructions in dozens of languages zero-shot.
  • Phoenix
  • T5 - Text-to-Text Transfer Transformer.
  • T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization.
  • OPT - Open Pre-trained Transformer Language Models.
  • UL2 - a unified framework for pretraining models that are universally effective across datasets and setups.
  • GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective; it can be fine-tuned on a variety of natural language understanding and generation tasks.
  • ChatGLM-6B - ChatGLM-6B is an open-source bilingual (Chinese and English) dialogue language model based on the General Language Model (GLM) architecture.
  • RWKV - Parallelizable RNN with Transformer-level LLM Performance.
  • ChatRWKV - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model.
  • StableLM - Stability AI Language Models.
  • YaLM - a GPT-like neural network for generating and processing text.
  • GPT-Neo - An implementation of model & data parallel GPT3-like models.
  • GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile.
  • Dolly - a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT.
  • Pythia - Interpreting Autoregressive Transformers Across Time and Scale.
  • Dolly 2.0 - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
  • OpenFlamingo - an open-source reproduction of DeepMind's Flamingo model.
  • Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.
  • GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
  • GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
  • Palmyra - Palmyra Base was primarily pre-trained with English text.
  • Camel - a state-of-the-art instruction-following large language model.
  • h2oGPT
  • PanGu-α - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model.
  • MOSS - MOSS is an open-source dialogue language model that supports Chinese and English.
  • Open-Assistant - a project meant to give everyone access to a great chat-based large language model.
  • HuggingChat - A chat interface powered by Open Assistant's latest model, one of the strongest open-source chat models available, and the Hugging Face Inference API.
  • StarCoder - Hugging Face's LLM for code.
  • MPT-7B - An open LLM from MosaicML, licensed for commercial use.

LLM Training Frameworks

LLM Optimization

State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods

  • github: https://github.com/huggingface/peft
  • abstract: Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly; PEFT methods instead fine-tune only a small number of (extra) model parameters, greatly decreasing computational and storage costs, and recent state-of-the-art PEFT techniques achieve performance comparable to full fine-tuning. The library integrates seamlessly with Hugging Face Accelerate for large-scale models, leveraging DeepSpeed and Big Model Inference. A minimal usage sketch follows the list of supported methods below.
  • Supported methods:
  1. LoRA: Low-Rank Adaptation of Large Language Models
  2. Prefix Tuning: Prefix-Tuning: Optimizing Continuous Prompts for Generation, P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
  3. P-Tuning: GPT Understands, Too
  4. Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
  5. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
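The sketch below shows how the peft library is typically wired into a Transformers model to apply LoRA; the base model name and the LoRA hyperparameters are illustrative assumptions, not recommendations from this repository.

```python
# Minimal LoRA setup with Hugging Face `peft` (a sketch, not a full training
# script). Model name and hyperparameter values are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # adapting a causal language model
    r=8,                           # rank of the low-rank update matrices
    lora_alpha=16,                 # scaling applied to the LoRA updates
    lora_dropout=0.05,             # dropout on the LoRA layers
)

# Wrap the base model; only the small set of LoRA parameters stays trainable.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# The wrapped `model` can then be passed to a standard Trainer / training loop.
```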

Tools for deploying LLM

Tutorials about LLM

  • [Andrej Karpathy] State of GPT video
  • [Hyung Won Chung] Instruction finetuning and RLHF lecture Youtube
  • [Jason Wei] Scaling, emergence, and reasoning in large language models Slides
  • [Susan Zhang] Open Pretrained Transformers Youtube
  • [Ameet Deshpande] How Does ChatGPT Work? Slides
  • [Yao Fu] The Source of the Capability of Large Language Models: Pretraining, Instructional Fine-tuning, Alignment, and Specialization Bilibili
  • [Hung-yi Lee] ChatGPT: Analyzing the Principle Youtube
  • [Jay Mody] GPT in 60 Lines of NumPy Link
  • [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models Link
  • [NeurIPS 2022] Foundational Robustness of Foundation Models Link
  • [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. Video|Code
  • [DAIR.AI] Prompt Engineering Guide Link
  • [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers Link
  • [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) Link
  • [HuggingFace] What Makes a Dialog Agent Useful? Link
  • [HeptaAI] ChatGPT Kernel: InstructGPT, PPO Reinforcement Learning Based on Feedback Instructions Link
  • [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources Link
  • [Stephen Wolfram] What Is ChatGPT Doing ... and Why Does It Work? Link
  • [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? Link
  • [Hung-yi Lee] ChatGPT: How It Was (Possibly) Created - The Socialization Process of GPT Video
  • [OpenAI] Improving mathematical reasoning with process supervision

Courses about LLM

  • [DeepLearning.AI] ChatGPT Prompt Engineering for Developers Homepage
  • [Princeton] Understanding Large Language Models Homepage
  • [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF Slides
  • [Stanford] CS324-Large Language Models Homepage
  • [Stanford] CS25-Transformers United V2 Homepage
  • [Stanford Webinar] GPT-3 & Beyond Video
  • [MIT] Introduction to Data-Centric AI Homepage

Opinions about LLM

Other Awesome Lists

Other Useful Resources


How to Contribute

Since this repository focuses on collecting datasets for open LLMs, you are welcome to contribute by adding datasets in whatever form you prefer.

References

[1] https://github.com/KennethanCeyer/awesome-llm
[2] https://github.com/Hannibal046/Awesome-LLM
[3] https://github.com/Zjh-819/LLMDataHub
[4] https://huggingface.co/datasets