- Best practices for end-to-end pre-training of LLMs for science on HPC systems
- Open release of a set of foundation models (and domain datasets) trained on a scientific corpus
- Propose science-related downstream benchmarks for evaluating LLMs for science
- Provide heuristics for large-batch training and communication requirements
- Evaluate current practices and share our observations
Model | #Params | #Tokens | Link |
---|---|---|---|
Forge-bio | 1.44B | 38B | download |
Forge-che | 1.44B | 41B | download |
Forge-eng | 1.44B | 29B | download |
Forge-mat | 1.44B | 15B | download |
Forge-phy | 1.44B | 32B | download |
Forge-soc | 1.44B | 90B | download |
Forge-s1 | 1.44B | 10B | download |
Forge-s2 | 1.44B | 20B | download |
Forge-s3 | 1.44B | 30B | download |
Forge-s4 | 1.44B | 257B | download |
Forge-m1 | 13B | 30B | download |
Forge-m2 | 13B | 257B | download |
Forge-l | 22.4B | 257B | download |
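
As a rough aid for comparing the released checkpoints, the sketch below computes the ratio of training tokens to parameters for a few rows of the table. The numbers are copied from the table; the ratio itself is only a convenience summary assumed here for illustration, not a metric reported by the authors, and the names `MODELS` and `tokens_per_param` are hypothetical.

```python
# (model name, parameters in billions, training tokens in billions),
# copied from the table above. Only a subset of rows is shown.
MODELS = [
    ("Forge-bio", 1.44, 38),
    ("Forge-s4", 1.44, 257),
    ("Forge-m1", 13, 30),
    ("Forge-m2", 13, 257),
    ("Forge-l", 22.4, 257),
]


def tokens_per_param(models):
    """Return {name: training tokens / parameters} for each model."""
    return {name: tokens / params for name, params, tokens in models}


ratios = tokens_per_param(MODELS)
for name, ratio in ratios.items():
    print(f"{name:10s} {ratio:7.1f} tokens per parameter")
```

With the same 257B-token budget, the ratio shrinks as the model grows, which is one simple way to see how differently the small and large checkpoints were trained.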
- CORE: https://core.ac.uk/documentation/dataset (core_2020-12-20)
- MAG: https://www.microsoft.com/en-us/research/project/open-academic-graph/ (v2-1)
- AMiner: https://www.microsoft.com/en-us/research/project/open-academic-graph/ (v2-1)
- arXiv: https://huggingface.co/datasets/arxiv_dataset
- Scopus: 6M abstracts for the DOIs extracted via Scopus API
- Forge models can be loaded and used through the standard Hugging Face Transformers API
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast
model = GPTNeoXForCausalLM.from_pretrained("path_to_forge_model")
tokenizer = GPTNeoXTokenizerFast.from_pretrained("path_to_forge_model")
prompt = "high entropy alloy applications include"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(input_ids,
                            do_sample=True,
                            temperature=0.7,
                            max_length=100)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
high entropy alloy applications include high strength steels, alloys, composites, as well some metal alloys. In recent years, there has been much interest the use of such materials for manufacturing parts, components, machinery. For example, automotive sector an increasing number applications. most widely used is steels.
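
The snippet above handles a single prompt. A minimal sketch for running several prompts with shared generation settings is given below. The helper names `generation_kwargs`, `complete`, and `demo` are hypothetical and not part of the Forge release; `max_new_tokens` and `skip_special_tokens` are standard Transformers options chosen here for illustration, and `"path_to_forge_model"` is the same placeholder used above.

```python
from typing import Any, Dict, List


def generation_kwargs(max_new_tokens: int = 64,
                      temperature: float = 0.7,
                      sample: bool = True) -> Dict[str, Any]:
    """Collect keyword arguments for `model.generate`."""
    kwargs: Dict[str, Any] = {"max_new_tokens": max_new_tokens,
                              "do_sample": sample}
    if sample:
        # Temperature only matters when sampling is enabled.
        kwargs["temperature"] = temperature
    return kwargs


def complete(model, tokenizer, prompts: List[str], **kwargs) -> List[str]:
    """Generate one continuation per prompt, one prompt at a time."""
    outputs = []
    for prompt in prompts:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        gen_tokens = model.generate(input_ids, **kwargs)
        text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0]
        outputs.append(text)
    return outputs


def demo(path: str = "path_to_forge_model") -> None:
    """Load a Forge checkpoint from `path` (placeholder) and print completions."""
    # Imported here so the helpers above can be reused without Transformers.
    from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

    model = GPTNeoXForCausalLM.from_pretrained(path)
    tokenizer = GPTNeoXTokenizerFast.from_pretrained(path)
    prompts = ["high entropy alloy applications include"]
    for text in complete(model, tokenizer, prompts, **generation_kwargs()):
        print(text)
```

Calling `demo()` with the path to a downloaded checkpoint prints one completion per prompt. Passing `sample=False` to `generation_kwargs` switches to greedy decoding, which makes the output repeatable across runs.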
- Steps for preprocessing CORE, MAG, and AMiner
- Steps for domain partitioning
- Software environment, configurations, and steps for pre-training
- The raw performance data, including computational performance, loss, and downstream evaluations, is available
- A Jupyter notebook for plotting these results is also provided
@INPROCEEDINGS{10.1145/3581784.3613215,
author={Junqi Yin and Sajal Dash and Feiyi Wang and Mallikarjun Shankar},
title={FORGE: Pre-training Open Foundation Models for Science},
booktitle={SC23: International Conference for High Performance Computing, Networking, Storage and Analysis},
year={2023},
doi={10.1145/3581784.3613215}}