Merge pull request #213 from allenai/llm-inference: LLM inference

16 changed files with 1,242 additions and 0 deletions.

---
# LLM inference

The goal here is to run inference for all OLMo models (up to 70 GB) on a single A100. Our approach for this at present is *post-hoc model quantization*.
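For rough intuition on why quantization is the lever here: a 70B-parameter model stored in fp16 needs roughly 140 GB for the weights alone, which overflows an 80 GB A100, while 4-bit weights take about a quarter of that (~35 GB) and fit with room left for activations and the KV cache.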
Here's the [inference workstream google doc](https://docs.google.com/document/d/1DpCOsmTluGS0NDutgV7h_QNtiiVC8ocUEqEzqG76yfM/edit?usp=sharing).

## Available methods

We chatted with [Tim Dettmers](https://timdettmers.com/), who is an expert on model quantization.

- According to Tim, [GPTQ](https://arxiv.org/abs/2210.17323) is state-of-the-art for post-hoc 4-bit model quantization. Based on this, we're currently using GPTQ.
- Tim has more recently released [QLoRA](https://arxiv.org/abs/2305.14314), which can be used for 4-bit finetuning. I'm not sure whether this technique is relevant for our use case. **It might be worth checking whether we should switch to it**, because the code is likely easier to work with (it's available through Huggingface).

## GPTQ implementations

There are a number of implementations available for GPTQ:

- Original GPTQ code from the paper authors: <https://github.com/IST-DASLab/gptq>.
- GPTQ-for-LLaMa: <https://github.com/qwopqwop200/GPTQ-for-LLaMa>. What it sounds like: GPTQ adapted to work with LLaMa models.
- AutoGPTQ: <https://github.com/PanQiWei/AutoGPTQ>. This builds on the original code, but it's more nicely engineered and makes it pretty easy to add new models via inheritance. **This is what we're using now**; a minimal inference sketch follows below.
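To make the workflow concrete, here's a minimal sketch of inference with an AutoGPTQ-quantized checkpoint. This is an illustration, not our exact setup: the checkpoint directory is a hypothetical output of the quantization script included later in this PR, and the prompt is arbitrary.

```python
# Minimal sketch: load a GPTQ-quantized checkpoint with AutoGPTQ and generate.
# "quantized_opt125m" is a hypothetical directory produced by the quantization
# script shown further below.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=False)
model = AutoGPTQForCausalLM.from_quantized("quantized_opt125m", device="cuda:0")

inputs = tokenizer("Post-hoc quantization lets us", return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```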
### Progress so far

#### Compressing LLaMa models with GPTQ

I've used AutoGPTQ to compress Hamish and Yizhong's instruction-tuned LLaMa models. Models up to 70B run on a single GPU. Code to do this is here: <https://github.com/allenai/open-instruct/tree/compress/quantize>. Most of my work is on the `compress` branch; some of it has been merged into `main`, but not all.

There's also some [code](https://github.com/allenai/open-instruct/tree/compress/quantize/efficiency-benchmark) to run Hao's [efficiency benchmarking code](https://github.com/allenai/efficiency-benchmark) on compressed models. I haven't examined the results of this thoroughly, but the code runs and provides stats on energy usage, latency, etc.

##### Things that could be improved

- Inference latency. Roughly 200 ms / token for the 70B model (see the timing sketch after this list). It's possible that [hidet](https://pytorch.org/blog/introducing-hidet/) could speed this up. It's also possible that the AutoGPTQ code is just better now than it was a month ago, and that latency would be lower if the models were quantized now.
- Evaluation. I've implemented accuracy evaluation on MMLU (see [eval_on_mmlu](https://github.com/allenai/open-instruct/blob/compress/quantize/scripts/eval_on_mmlu.sh)), but evals on more datasets would be good. This requires slightly modifying the evaluation code to accommodate `AutoGPTQ` models, as I did [here](https://github.com/allenai/open-instruct/blob/compress/eval/mmlu_eval/evaluate_hf_lm.py#LL114C18-L114C18).
- I added the results to [Yizhong's spreadsheet](https://docs.google.com/spreadsheets/d/1jt_bkJXBmNN5ZmEFZg4NKsu8F9PtKpHWwG1WcqNb17E/edit?usp=sharing) under the tab `efficiency-stuff`; right now it's just a single number that confirms that nothing disastrous happens on MMLU at 65B. This could be improved.
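The per-token latency numbers above come from timing generation; a rough way to reproduce that kind of measurement is sketched below. This assumes `model` and `tokenizer` are loaded as in the earlier sketch; the prompt and token count are arbitrary.

```python
# Sketch: rough ms / token measurement for greedy decoding.
import time

import torch

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda:0")
n_new = 128

torch.cuda.synchronize()  # Make sure pending GPU work doesn't pollute the timer.
start = time.time()
model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
torch.cuda.synchronize()

print(f"{1000 * (time.time() - start) / n_new:.1f} ms / token")
```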
#### Implementing GPTQ for OLMo

There are two steps:

- Add an `olmo.py` module [here](https://github.com/PanQiWei/AutoGPTQ/tree/main/auto_gptq/modeling) that inherits from `BaseGPTQForCausalLM`; see [llama.py](https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/modeling/llama.py) for an example.
- Register the OLMo implementation in the list of [supported models](https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/modeling/_const.py#L10).

To add an `olmo.py` module, we can basically just imitate what was done for other models (e.g. LLaMa), mapping LLaMa model component names to OLMo names (Akshita, I shared a version of what I have for this). To do the mapping, I just instantiated a couple of models and printed them out. I'm including a dump of all the models below for reference.

There's one important wrinkle here: some OLMo models use *fused linear attention*. I'm not sure how GPTQ handles this or whether any existing supported models implement attention the same way. (The BLOOM dump below does show a fused `query_key_value` projection, which may be the closest analogue.) This might be something to discuss with Dirk and Pete.
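The dumps were produced by just instantiating each model and printing it, along the lines of this sketch (the Huggingface model names here are illustrative, not the exact checkpoints used):

```python
# Sketch: print module trees to work out the LLaMa -> OLMo name mapping.
# Note this downloads each model's weights.
from transformers import AutoModelForCausalLM

for name in ["huggyllama/llama-7b", "bigscience/bloom-560m", "EleutherAI/gpt-j-6b"]:
    model = AutoModelForCausalLM.from_pretrained(name)
    print(model)
```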
```python
Olmo(
  (transformer): ModuleDict(
    (wte): Embedding(50304, 768)
    (emb_drop): Dropout(p=0.1, inplace=False)
    (blocks): ModuleList(
      (0-11): 12 x OlmoSequentialBlock(
        (dropout): Dropout(p=0.1, inplace=False)
        (norm): LayerNorm()
        (act): SwiGLU()
        (attn_out): Linear(in_features=768, out_features=768, bias=True)
        (ff_out): Linear(in_features=1536, out_features=768, bias=True)
        (att_proj): Linear(in_features=768, out_features=2304, bias=True)
        (ff_proj): Linear(in_features=768, out_features=3072, bias=True)
      )
    )
    (ln_f): LayerNorm()
    (wpe): Embedding(1024, 768)
  )
)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-3): 4 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)

BloomModel(
  (word_embeddings): Embedding(250880, 64)
  (word_embeddings_layernorm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  (h): ModuleList(
    (0-1): 2 x BloomBlock(
      (input_layernorm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (self_attention): BloomAttention(
        (query_key_value): Linear(in_features=64, out_features=192, bias=True)
        (dense): Linear(in_features=64, out_features=64, bias=True)
        (attention_dropout): Dropout(p=0.0, inplace=False)
      )
      (post_attention_layernorm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (mlp): BloomMLP(
        (dense_h_to_4h): Linear(in_features=64, out_features=256, bias=True)
        (gelu_impl): BloomGelu()
        (dense_4h_to_h): Linear(in_features=256, out_features=64, bias=True)
      )
    )
  )
  (ln_f): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
)

GPTJForCausalLM(
  (transformer): GPTJModel(
    (wte): Embedding(50400, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-27): 28 x GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): GPTJMLP(
          (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
          (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4096, out_features=50400, bias=True)
)
```
---
# LLM Inference

## Compress

Run the following:

```
python compression/run_compression.py \
    --pretrained-model facebook/opt-125m \
    --quantized-model-dir quantized_opt125m \
    --n-samples 128
```

## Run accuracy benchmark

Run the following:

```
cd eval/mmlu
./eval_on_mmlu.sh ../../quantized_opt125m facebook/opt-125m /net/nfs.cirrascale/allennlp/akshitab/data/mmlu eval_results
```
Output format:

```
Average accuracy 0.202 - math
Average accuracy 0.232 - health
Average accuracy 0.219 - physics
Average accuracy 0.270 - business
Average accuracy 0.198 - biology
Average accuracy 0.172 - chemistry
Average accuracy 0.267 - computer science
Average accuracy 0.204 - economics
Average accuracy 0.234 - engineering
Average accuracy 0.238 - philosophy
Average accuracy 0.236 - other
Average accuracy 0.233 - history
Average accuracy 0.177 - geography
Average accuracy 0.204 - politics
Average accuracy 0.225 - psychology
Average accuracy 0.250 - culture
Average accuracy 0.250 - law
Average accuracy 0.212 - STEM
Average accuracy 0.241 - humanities
Average accuracy 0.215 - social sciences
Average accuracy 0.238 - other (business, health, misc.)
Average accuracy: 0.229
```
## Run efficiency benchmark

Run the following:

```
cd efficiency
./run_efficiency_benchmark.sh facebook/opt-125m quantized_opt125m
```

Output format:

```
Time Elapsed: 500.91 s
Max GPU memory usage: 2.09 GiB.
Average GPU power: 9.00e+01 W.
Average power: 2.04e+02 W.
Total energy: 7.49e-02 kWh.
CO2 emission: 6.35e-03 kg.
Throughput: 0.20 instances / s.
Throughput: 47.30 words / s.
Latency: 5009.10 ms / batch.
```
---
```python
from auto_gptq.modeling._base import BaseGPTQForCausalLM

# NOTE: In progress; may change if the OLMo model is updated.


class OlmoGPTQForCausalLM(BaseGPTQForCausalLM):
    # Attribute name of the Transformer layer block.
    layers_block_name = "transformer.blocks"  # NOTE(wadden) Correct.

    # Attribute names of other modules at the same level as the transformer layer
    # block. Excludes `transformer.emb_drop`, which has no parameters; this is
    # consistent with GPT-J.
    # TODO(wadden) Figure out if I need wpe.
    outside_layer_modules = ["transformer.wte", "transformer.ln_f", "transformer.wpe"]

    # Attribute names of linear layers inside the transformer layer module, given
    # relative to the layer block (matching the convention in llama.py). These
    # should be ordered as they are executed, which is usually:
    # - Attention Q / K / V projection
    # - Attention output projection
    # - MLP projection
    # - MLP output
    # NOTE(wadden) For other models, layer norm, dropout, and activation functions
    # are not included; I do the same here.
    # TODO deal with the case of fused attention.
    inside_layer_modules = [
        ["att_proj"],
        ["attn_out"],
        ["ff_proj"],
        ["ff_out"],
    ]


__all__ = ["OlmoGPTQForCausalLM"]
```
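The design note above also calls for registering the new class in AutoGPTQ's list of supported models. A hedged sketch of what that might look like in `auto_gptq/modeling/_const.py`, based on how the linked file registers the other models at the time of writing (the exact list contents may differ by version, and the model type may also need to be mapped to `OlmoGPTQForCausalLM` wherever the other classes are dispatched):

```python
# Sketch (assumption): AutoGPTQ keeps a plain list of supported model types,
# so registration amounts to adding "olmo" to it.
SUPPORTED_MODELS = [
    "bloom",
    "gptj",
    "gpt_neox",
    "opt",
    "llama",
    "olmo",  # new entry
]
```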
---
""" | ||
Run 4-bit model quantization with GPTQ, using Wikitext as train data. | ||
Based on `examples/quantization/basic_usage_wikitext2` in AutoGPT. | ||
Usage example (runs on a single GPU): | ||
python quantize_autogptq.py \ | ||
--pretrained_model_dir "/net/nfs.cirrascale/allennlp/hamishi/open-instruct/alpaca_fixed_65b" \ | ||
--quantized_model_dir "/net/nfs.cirrascale/allennlp/davidw/checkpoints/gptq_alpaca_fixed_65b" | ||
""" | ||
|
||
|
||
import argparse | ||
import time | ||
|
||
import numpy as np | ||
import torch | ||
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig | ||
from datasets import load_dataset | ||
from transformers import AutoTokenizer | ||
|
||
|
||
def get_wikitext2(nsamples, seed, seqlen, model): | ||
traindata = load_dataset("wikitext", "wikitext-2-raw-v1", split="train") | ||
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test") | ||
|
||
tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False) | ||
trainenc = tokenizer("\n\n".join(traindata["text"]), return_tensors="pt") | ||
testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt") | ||
|
||
import random | ||
|
||
random.seed(seed) | ||
np.random.seed(0) | ||
torch.random.manual_seed(0) | ||
|
||
traindataset = [] | ||
for _ in range(nsamples): | ||
i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1) | ||
j = i + seqlen | ||
inp = trainenc.input_ids[:, i:j] | ||
attention_mask = torch.ones_like(inp) | ||
traindataset.append({"input_ids": inp, "attention_mask": attention_mask}) | ||
return traindataset, testenc | ||
|
||
|
||
def get_args(): | ||
parser = argparse.ArgumentParser(description="Run 4-bit model quantization using GPTQ.") | ||
parser.add_argument( | ||
"--pretrained-model", | ||
type=str, | ||
help="Path to the unquantized model / Name of the unquantized huggingface model.", | ||
) | ||
parser.add_argument("--quantized-model-dir", type=str, help="Output path for the quantized model.") | ||
parser.add_argument("--n-samples", type=int, help="Number of samples from Wikitext", default=128) | ||
args = parser.parse_args() | ||
|
||
return args | ||
|
||
|
||
def main(): | ||
"Run quantization." | ||
args = get_args() | ||
|
||
print("Getting data.") | ||
trainloader, testenc = get_wikitext2(args.n_samples, 0, 2048, args.pretrained_model_dir) | ||
print("Done.") | ||
|
||
quantize_config = BaseQuantizeConfig( | ||
bits=4, # quantize model to 4-bit | ||
group_size=128, # it is recommended to set the value to 128 | ||
) | ||
|
||
print("Loading unquantized model") | ||
# Load un-quantized model, the model will always be force loaded into cpu | ||
model = AutoGPTQForCausalLM.from_pretrained(args.pretrained_model_dir, quantize_config) | ||
print("Done") | ||
|
||
# Quantize model, the examples should be list of dict whose keys can only be | ||
# "input_ids" and "attention_mask" with value under torch.LongTensor type. | ||
print("Quantizing") | ||
tick = time.time() | ||
model.quantize(trainloader, use_triton=True) | ||
elapsed = (time.time() - tick) / 60 | ||
print(f"Elapsed time:{elapsed:0.2f} minutes.") | ||
|
||
# save quantized model | ||
print("Saving") | ||
model.save_quantized(args.quantized_model_dir) | ||
print("Done") | ||
|
||
|
||
if __name__ == "__main__": | ||
main() |
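The script collects the held-out Wikitext test encodings (`testenc`) but never uses them. A quick perplexity sanity check on the quantized output could look roughly like the sketch below. The checkpoint path and sequence length are assumptions mirroring the values above, and it assumes the AutoGPTQ wrapper forwards calls to the underlying Huggingface model, as its examples do.

```python
# Sketch: Wikitext-2 test perplexity for a quantized checkpoint.
# Assumes `testenc` from get_wikitext2 and a saved checkpoint directory.
import torch
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized("quantized_opt125m", device="cuda:0")
seqlen = 2048
input_ids = testenc.input_ids.to("cuda:0")

nlls = []
for i in range(0, input_ids.shape[1] - seqlen, seqlen):
    batch = input_ids[:, i : i + seqlen]
    with torch.no_grad():
        logits = model(batch).logits
    # Shift so that each position predicts the next token.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = batch[:, 1:].reshape(-1)
    nlls.append(torch.nn.functional.cross_entropy(shift_logits, shift_labels))

# Rough aggregate: mean NLL over equal-length windows, exponentiated.
print(f"Perplexity: {torch.exp(torch.stack(nlls).mean()).item():.2f}")
```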