Merge pull request #213 from allenai/llm-inference

LLM inference

AkshitaB committed Jun 12, 2023
2 parents ccb3869 + 22491fc commit fd1cfe8
Showing 16 changed files with 1,242 additions and 0 deletions.
144 changes: 144 additions & 0 deletions inference/NOTES.md
@@ -0,0 +1,144 @@
# LLM inference

The goal here is to run inference for all OLMo models (up to 70B parameters) on a single A100. For scale: at fp16, a 70B-parameter model needs roughly 140 GB just for the weights, which doesn't fit on an 80 GB A100, while at 4 bits the weights drop to roughly 35 GB. Our approach for this at present is *post-hoc model quantization*.

Here's the [inference workstream google doc](https://docs.google.com/document/d/1DpCOsmTluGS0NDutgV7h_QNtiiVC8ocUEqEzqG76yfM/edit?usp=sharing).

## Available methods

We chatted with [Tim Dettmers](https://timdettmers.com/), who is an expert on model quantization.

- According to Tim, [GPTQ](https://arxiv.org/abs/2210.17323) is state-of-the-art for post-hoc 4-bit model quantization. Based on this, we're currently using GPTQ.
- Tim has more recently released [QLoRA](https://arxiv.org/abs/2305.14314), which can be used for 4-bit finetuning. I'm not sure if this technique is relevant for our use case. **It might be worth checking whether we should switch to this**, because the code is likely easier to work with (it's available through Huggingface).

## GPTQ implementations

There are a number of implementations available for GPTQ:

- Original GPTQ code from paper authors: <https://github.com/IST-DASLab/gptq>.
- GPTQ-for-LLaMa: <https://github.com/qwopqwop200/GPTQ-for-LLaMa>. What it sounds like: GPTQ adapted to work with LLaMa models.
- AutoGPTQ: <https://github.com/PanQiWei/AutoGPTQ>. This builds on the original code, but it's more nicely engineered and makes it pretty easy to add new models via inheritance. **This is what we're using now**.

### Progress so far

#### Compressing LLaMa models with GPTQ

I've used AutoGPTQ to compress Hamish and Yizhong's instruction-tuned LLaMa models. Models up to 70B run on a single GPU. Code to do this is here: <https://github.com/allenai/open-instruct/tree/compress/quantize>. Most of my work is on the `compress` branch; some of it has been merged into `main`, but not all.
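
For reference, running inference on one of these quantized checkpoints looks roughly like the sketch below (the paths are placeholders, and the exact `from_quantized` arguments depend on the AutoGPTQ version):

```python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths; substitute the real base model and quantized output directory.
base_model = "/path/to/alpaca_fixed_65b"
quantized_dir = "/path/to/gptq_alpaca_fixed_65b"

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False)
# Loads the 4-bit weights directly onto a single GPU.
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")

inputs = tokenizer("The key idea behind GPTQ is", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```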

There's also some [code](https://github.com/allenai/open-instruct/tree/compress/quantize/efficiency-benchmark) to run Hao's [efficiency benchmarking code](https://github.com/allenai/efficiency-benchmark) on compressed models. I haven't examined the results of this thoroughly, but the code runs and provides stats on energy usage, latency, etc.

##### Things that could be improved

- Inference latency. Roughly 200 ms/token for the 70B model. It's possible that [hidet](https://pytorch.org/blog/introducing-hidet/) could speed this up (see the sketch after this list). It's also possible that the AutoGPTQ code is simply better now than it was a month ago, and that latency would be lower if the models were quantized now.
- Evaluation. I've implemented accuracy evaluation on MMLU (see [eval_on_mmlu](https://github.com/allenai/open-instruct/blob/compress/quantize/scripts/eval_on_mmlu.sh)), but evals on more datasets would be good. This requires slightly modifying the evaluation code to accommodate `AutoGPTQ` models, as I did [here](https://github.com/allenai/open-instruct/blob/compress/eval/mmlu_eval/evaluate_hf_lm.py#LL114C18-L114C18).
- I added the results to [Yizhong's spreadsheet](https://docs.google.com/spreadsheets/d/1jt_bkJXBmNN5ZmEFZg4NKsu8F9PtKpHWwG1WcqNb17E/edit?usp=sharing) under the `efficiency-stuff` tab; right now it's just a single number confirming that nothing disastrous happens on MMLU at 65B. This could be expanded.
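
On the hidet idea above: hidet is exposed as a `torch.compile` backend, so a first experiment could be as small as the sketch below. This assumes hidet is installed and that `torch.compile` can trace AutoGPTQ's custom 4-bit kernels, which is unverified.

```python
import torch

# Assumes `model` is an AutoGPTQ wrapper whose underlying Huggingface module
# is exposed as `model.model`; whether hidet's backend can handle the custom
# quantized linear layers is an open question.
compiled = torch.compile(model.model, backend="hidet")

input_ids = torch.randint(0, 32000, (1, 128), device="cuda:0")  # Dummy batch.
with torch.no_grad():
    logits = compiled(input_ids).logits
```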

#### Implementing GPTQ for OLMo

There are two steps:

- Add an `olmo.py` module [here](https://github.com/PanQiWei/AutoGPTQ/tree/main/auto_gptq/modeling) that inherits from `BaseGPTQForCausalLM`; see [llama.py](https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/modeling/llama.py) for an example.
- Register the OLMo implementation in the list of [supported models](https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/modeling/_const.py#L10); a sketch of this registration follows the list.
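
The registration itself is small; it looks roughly like the following (a sketch based on AutoGPTQ's layout at the time of writing — the exact variable names and file locations may differ across versions):

```python
# In auto_gptq/modeling/_const.py: add "olmo" to the supported model types.
SUPPORTED_MODELS = ["bloom", "gptj", "gpt2", "gpt_neox", "opt", "moss", "llama", "olmo"]

# In auto_gptq/modeling/auto.py: map the model type to the new class.
from auto_gptq.modeling.olmo import OlmoGPTQForCausalLM

GPTQ_CAUSAL_LM_MODEL_MAP = {
    # ... existing entries ...
    "olmo": OlmoGPTQForCausalLM,
}
```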

To add an `olmo.py` module, we can basically imitate what was done for other models (e.g. LLaMa), mapping the LLaMa model component names to their OLMo equivalents (Akshita, I shared a version of what I have for this). To work out the mapping, I just instantiated a couple of models and printed them out, as in the snippet below. I'm including a dump of all the models below for reference.
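
Producing these dumps is a one-liner per model (the model name below is illustrative; the OLMo dump came from instantiating the model from our own codebase rather than through Huggingface):

```python
from transformers import AutoModelForCausalLM

# Printing any nn.Module shows its full submodule tree, which is enough
# to work out the layer-name mapping that GPTQ needs.
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
print(model)
```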

There's one important wrinkle here: some OLMo models use *fused linear attention*. I'm not sure how GPTQ handles this or whether any existing supported models implement attention the same way. This might be something to discuss with Dirk and Pete.

```python
Olmo(
  (transformer): ModuleDict(
    (wte): Embedding(50304, 768)
    (emb_drop): Dropout(p=0.1, inplace=False)
    (blocks): ModuleList(
      (0-11): 12 x OlmoSequentialBlock(
        (dropout): Dropout(p=0.1, inplace=False)
        (norm): LayerNorm()
        (act): SwiGLU()
        (attn_out): Linear(in_features=768, out_features=768, bias=True)
        (ff_out): Linear(in_features=1536, out_features=768, bias=True)
        (att_proj): Linear(in_features=768, out_features=2304, bias=True)
        (ff_proj): Linear(in_features=768, out_features=3072, bias=True)
      )
    )
    (ln_f): LayerNorm()
    (wpe): Embedding(1024, 768)
  )
)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-3): 4 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)

BloomModel(
  (word_embeddings): Embedding(250880, 64)
  (word_embeddings_layernorm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  (h): ModuleList(
    (0-1): 2 x BloomBlock(
      (input_layernorm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (self_attention): BloomAttention(
        (query_key_value): Linear(in_features=64, out_features=192, bias=True)
        (dense): Linear(in_features=64, out_features=64, bias=True)
        (attention_dropout): Dropout(p=0.0, inplace=False)
      )
      (post_attention_layernorm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (mlp): BloomMLP(
        (dense_h_to_4h): Linear(in_features=64, out_features=256, bias=True)
        (gelu_impl): BloomGelu()
        (dense_4h_to_h): Linear(in_features=256, out_features=64, bias=True)
      )
    )
  )
  (ln_f): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
)

GPTJForCausalLM(
  (transformer): GPTJModel(
    (wte): Embedding(50400, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-27): 28 x GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): GPTJMLP(
          (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
          (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4096, out_features=50400, bias=True)
)
```
72 changes: 72 additions & 0 deletions inference/README.md
@@ -0,0 +1,72 @@
# LLM Inference

## Compress

Run the following:

```
python compression/run_compression.py \
--pretrained-model facebook/opt-125m \
--quantized-model-dir quantized_opt125m \
--n-samples 128
```

## Run accuracy benchmark

Run the following:

```
cd eval/mmlu
./eval_on_mmlu.sh ../../quantized_opt125m facebook/opt-125m /net/nfs.cirrascale/allennlp/akshitab/data/mmlu eval_results
```

Example output:

```
Average accuracy 0.202 - math
Average accuracy 0.232 - health
Average accuracy 0.219 - physics
Average accuracy 0.270 - business
Average accuracy 0.198 - biology
Average accuracy 0.172 - chemistry
Average accuracy 0.267 - computer science
Average accuracy 0.204 - economics
Average accuracy 0.234 - engineering
Average accuracy 0.238 - philosophy
Average accuracy 0.236 - other
Average accuracy 0.233 - history
Average accuracy 0.177 - geography
Average accuracy 0.204 - politics
Average accuracy 0.225 - psychology
Average accuracy 0.250 - culture
Average accuracy 0.250 - law
Average accuracy 0.212 - STEM
Average accuracy 0.241 - humanities
Average accuracy 0.215 - social sciences
Average accuracy 0.238 - other (business, health, misc.)
Average accuracy: 0.229
```


## Run efficiency benchmark

Run the following:

```
cd efficiency
./run_efficiency_benchmark.sh facebook/opt-125m quantized_opt125m
```

Example output:

```
Time Elapsed: 500.91 s
Max GPU memory usage: 2.09 GiB.
Average GPU power: 9.00e+01 W.
Average power: 2.04e+02 W.
Total energy: 7.49e-02 kWh.
CO2 emission: 6.35e-03 kg.
Throughput: 0.20 instances / s.
Throughput: 47.30 words / s.
Latency: 5009.10 ms / batch.
```
Empty file added inference/__init__.py
Empty file.
35 changes: 35 additions & 0 deletions inference/compression/olmo_compression.py
@@ -0,0 +1,35 @@
from auto_gptq.modeling._base import BaseGPTQForCausalLM

# NOTE: In progress; may change if the OLMo model is updated.


class OlmoGPTQForCausalLM(BaseGPTQForCausalLM):
    # Attribute name of the Transformer layer block.
    layers_block_name = "transformer.blocks"  # NOTE(wadden) Correct

    # Attribute names of other modules at the same level as the transformer layer
    # block. Excludes `transformer.emb_drop`, which has no parameters; this is
    # consistent with GPT-J.

    # TODO(wadden) Figure out if I need wpe.
    outside_layer_modules = ["transformer.wte", "transformer.ln_f", "transformer.wpe"]

    # Attribute names of linear layers in the transformer layer module.
    # These should be ordered as they are executed, which is usually:
    # - Attention Q / K / V projection
    # - Attention output projection
    # - MLP projection
    # - MLP output

    # NOTE(wadden) For other models, layer norm, dropout, and activation functions
    # are not included; I do the same here.
    # TODO deal with the case of fused attention.
    inside_layer_modules = [
        ["transformer.blocks.att_proj"],
        ["transformer.blocks.attn_out"],  # Matches `attn_out` in the model dump.
        ["transformer.blocks.ff_proj"],
        ["transformer.blocks.ff_out"],
    ]


__all__ = ["OlmoGPTQForCausalLM"]
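
# Hypothetical usage sketch (assumes this class has been registered in AutoGPTQ's
# model map; the checkpoint paths are placeholders):
#
#     from auto_gptq import BaseQuantizeConfig
#     quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
#     model = OlmoGPTQForCausalLM.from_pretrained("/path/to/olmo", quantize_config)
#     model.quantize(examples)  # examples: list of {"input_ids", "attention_mask"} dicts
#     model.save_quantized("/path/to/olmo-4bit")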
93 changes: 93 additions & 0 deletions inference/compression/run_compression.py
@@ -0,0 +1,93 @@
"""
Run 4-bit model quantization with GPTQ, using Wikitext as train data.
Based on `examples/quantization/basic_usage_wikitext2` in AutoGPT.
Usage example (runs on a single GPU):
python quantize_autogptq.py \
--pretrained_model_dir "/net/nfs.cirrascale/allennlp/hamishi/open-instruct/alpaca_fixed_65b" \
--quantized_model_dir "/net/nfs.cirrascale/allennlp/davidw/checkpoints/gptq_alpaca_fixed_65b"
"""


import argparse
import random
import time

import numpy as np
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer


def get_wikitext2(nsamples, seed, seqlen, model):
    """Sample `nsamples` random windows of `seqlen` tokens from Wikitext-2."""
    traindata = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

    tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
    trainenc = tokenizer("\n\n".join(traindata["text"]), return_tensors="pt")
    testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")

    # Seed all RNGs for reproducible sampling.
    random.seed(seed)
    np.random.seed(seed)
    torch.random.manual_seed(seed)

    traindataset = []
    for _ in range(nsamples):
        # Pick a random window of `seqlen` tokens from the tokenized training text.
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        traindataset.append({"input_ids": inp, "attention_mask": attention_mask})
    return traindataset, testenc


def get_args():
    parser = argparse.ArgumentParser(description="Run 4-bit model quantization using GPTQ.")
    parser.add_argument(
        "--pretrained-model",
        type=str,
        help="Path to the unquantized model / name of the unquantized Huggingface model.",
    )
    parser.add_argument("--quantized-model-dir", type=str, help="Output path for the quantized model.")
    parser.add_argument("--n-samples", type=int, help="Number of samples from Wikitext.", default=128)
    args = parser.parse_args()

    return args


def main():
    "Run quantization."
    args = get_args()

    print("Getting data.")
    trainloader, testenc = get_wikitext2(args.n_samples, 0, 2048, args.pretrained_model)
    print("Done.")

    quantize_config = BaseQuantizeConfig(
        bits=4,  # Quantize the model to 4-bit.
        group_size=128,  # It is recommended to set this value to 128.
    )

    print("Loading unquantized model")
    # Load the unquantized model; it is always force-loaded onto CPU first.
    model = AutoGPTQForCausalLM.from_pretrained(args.pretrained_model, quantize_config)
    print("Done")

    # Quantize the model. The examples should be a list of dicts whose only keys are
    # "input_ids" and "attention_mask", with torch.LongTensor values.
    print("Quantizing")
    tick = time.time()
    model.quantize(trainloader, use_triton=True)
    elapsed = (time.time() - tick) / 60
    print(f"Elapsed time: {elapsed:0.2f} minutes.")

    # Save the quantized model.
    print("Saving")
    model.save_quantized(args.quantized_model_dir)
    print("Done")


if __name__ == "__main__":
    main()