[Community] OPT Inference in HF Transformers #88
Thanks @patrickvonplaten! Do you plan to add code for running generation using those checkpoints? I have been trying to do this through #89, borrowing from your next-token prediction scripts |
We'll have the checkpoints added to Transformers by the end of the week, then it should be quite easy to run generation on them :-) |
Thank you Patrick! This is huge for accessibility of the models. Metaseq is notoriously unfriendly. |
@patrickvonplaten that's awesome! Will we just be able to call |
Mind sharing your conversion script? |
Hope this happens, because otherwise distributing the layers across every GPU is another headache. |
@sanxchep We'll open-source something for this tomorrow (hopefully :-)) |
All other checkpoints are available now here: https://huggingface.co/models?other=opt_metasq |
@zhisbug yes:

```python
#!/usr/bin/env python3
"""
Script for backing out of the MP-resharded (reshard.pt) files and getting back
a non-flattened state dict.

Particularly useful for converting our models to other repositories.

Usage:
    $ ls 125m
    dict.txt
    gpt2-merges.txt
    gpt2-vocab.json
    reshard-model_part-0.pt
    reshard-model_part-1.pt

    $ python -m metaseq.scripts.convert_to_singleton 125m

    $ ls 125m
    dict.txt
    gpt2-merges.txt
    gpt2-vocab.json
    reshard-model_part-0.pt
    reshard-model_part-1.pt
    restored.pt
"""
import argparse
import glob

import torch

from metaseq import options, tasks, checkpoint_utils, utils
from metaseq.dataclass.configs import MetaseqConfig
from metaseq.dataclass.utils import convert_namespace_to_omegaconf
from metaseq.distributed import utils as dist_utils
from metaseq.distributed import fsdp_enable_wrap, fsdp_wrap
from metaseq.distributed.stitch_fsdp_ckpt import glue_megatron_parts


def worker_main(cfg: MetaseqConfig):
    """
    Load up the model on all workers for Model Parallelism, then
    unflatten, move to cpu, and save to "restored.pt".
    """
    task = tasks.setup_task(cfg.task)

    def _build_model(cfg, task):
        # hardcoded to fp16 on GPU; parameters are moved back to CPU below
        model = task.build_model(cfg.model).half().cuda()
        return fsdp_wrap(model)

    with fsdp_enable_wrap(
        cfg.distributed_training,
        use_sharded_state=cfg.distributed_training.use_sharded_state,
    ):
        models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
            utils.split_paths(cfg.common_eval.path),
            arg_overrides=None,
            task=task,
            suffix=cfg.checkpoint.checkpoint_suffix,
            strict=True,
            num_shards=cfg.checkpoint.checkpoint_shard_count,
            build_model_hook=_build_model,
        )
        model = models[0]

    # consolidate everything on rank0
    mp_size = dist_utils.get_model_parallel_world_size()
    model_parts = [{} for _ in range(mp_size)]

    with model.summon_full_params():
        for name, p in model.named_parameters():
            gathered = [torch.zeros_like(p) for _ in range(mp_size)]
            torch.distributed.all_gather(
                gathered, p, group=dist_utils.get_global_group()
            )
            for r, t in enumerate(gathered):
                model_parts[r][name] = t.cpu()

    glued = glue_megatron_parts(model_parts)
    if "decoder.output_projection.weight" in glued:
        del glued["decoder.output_projection.weight"]

    _model_args["model"] = vars(_model_args["model"])
    _model_args["model"]["_name"] = "transformer_lm"
    _model_args["model"]["decoder.version"] = torch.tensor([3])
    _model_args["criterion"] = vars(_model_args["criterion"])

    glued = {"cfg": _model_args, "model": glued}

    if dist_utils.get_global_rank() == 0:
        with open(cfg.task.data + "/restored.pt", "wb") as f:
            torch.save(glued, f)


def main():
    # parser to be used like docstring shows
    real_parser = argparse.ArgumentParser()
    real_parser.add_argument("location")
    args = real_parser.parse_args()

    files = glob.glob(f"{args.location}/reshard*.pt")
    MP = len(files)
    BPE_MERGES = args.location + "/gpt2-merges.txt"
    BPE_VOCAB = args.location + "/gpt2-vocab.json"

    # Skeleton out all the annoying command line args we can infer
    ARGS = [
        "--model-parallel-size",
        str(MP),
        "--distributed-world-size",
        str(MP),
        "--task",
        "language_modeling",
        "--bpe-merges",
        BPE_MERGES,
        "--bpe-vocab",
        BPE_VOCAB,
        "--bpe",
        "hf_byte_bpe",
        "--path",
        args.location + "/reshard.pt",
        "--checkpoint-shard-count",
        "1",
        "--use-sharded-state",
        args.location,
    ]

    # build up the config file
    parser = options.get_generation_parser()
    # dumb defaults overriding
    parser.set_defaults(lr_scheduler=None, criterion=None)
    args = options.parse_args_and_arch(parser, input_args=ARGS)
    cfg = convert_namespace_to_omegaconf(args)
    cfg.distributed_training.distributed_world_size = MP
    dist_utils.call_main(cfg, worker_main)


if __name__ == "__main__":
    main()
```

Note you should run this on this branch: #60 |
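As a side note on how the script above infers its settings: `main()` simply counts the `reshard*.pt` shard files to get the model-parallel size. A pure-Python sketch of that logic (`infer_mp_size` is a hypothetical helper for illustration, not part of metaseq):

```python
# Sketch of the MP = len(files) inference in main() above.
# infer_mp_size is a hypothetical helper, not part of metaseq.
import fnmatch


def infer_mp_size(filenames):
    """Model-parallel size = number of reshard*.pt shards in the directory."""
    return len(fnmatch.filter(filenames, "reshard*.pt"))


# The 125m example from the docstring has two shards, so MP == 2.
listing = ["dict.txt", "gpt2-merges.txt", "gpt2-vocab.json",
           "reshard-model_part-0.pt", "reshard-model_part-1.pt"]
print(infer_mp_size(listing))  # -> 2
```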
Thanks for publishing these! Any chance you recognize this familiar error:
I’m using the default arguments. Here's a larger stack trace:
Thanks for any help. |
I end up with the exact same stacktrace:
$ bash run.sh
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
setting global batch size to 1
using torch.float32 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
activations_checkpoint_method ................... None
activations_checkpoint_num_layers ............... 1
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_size .............................. 1
data_path ....................................... None
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
distribute_checkpointed_activations ............. False
distributed_backend ............................. nccl
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_seq_length .............................. 2048
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 100
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
ffn_hidden_size ................................. 3072
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 1
hidden_dropout .................................. 0.1
hidden_size ..................................... 768
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_dim ......................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
kv_channels ..................................... 64
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ None
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. None
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 2048
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_warmup ..................................... False
no_async_tensor_model_parallel_allreduce ........ False
no_load_optim ................................... None
no_load_rng ..................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 12
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 12
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_lr_scheduler ........................... False
params_dtype .................................... torch.float32
patch_dim ....................................... 16
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
sample_rate ..................................... 1.0
save ............................................ None
save_interval ................................... None
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 969, 30, 1
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
titles_data_path ................................ None
tokenizer_type .................................. None
train_iters ..................................... None
train_samples ................................... None
use_checkpoint_lr_scheduler ..................... False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... None
use_one_sent_docs ............................... False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_file ...................................... None
weight_decay .................................... 0.01
world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
Traceback (most recent call last):
File "run_model.py", line 19, in <module>
"encoder_seq_length": 2048
File "/home/yves/Megatron-LM/megatron/initialize.py", line 82, in initialize_megatron
finish_mpu_init()
File "/home/yves/Megatron-LM/megatron/initialize.py", line 65, in finish_mpu_init
_set_random_seed(args.seed)
File "/home/yves/Megatron-LM/megatron/initialize.py", line 210, in _set_random_seed
seed = seed_ + (100 * mpu.get_pipeline_model_parallel_rank())
File "/home/yves/Megatron-LM/megatron/mpu/initialize.py", line 294, in get_pipeline_model_parallel_rank
return torch.distributed.get_rank(group=get_pipeline_model_parallel_group())
File "/home/yves/Megatron-LM/megatron/mpu/initialize.py", line 223, in get_pipeline_model_parallel_group
'pipeline_model parallel group is not initialized'
AssertionError: pipeline_model parallel group is not initialized
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 31376) of binary: /home/yves/opt/bin/python3.7
Traceback (most recent call last):
File "/home/yves/opt/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/yves/opt/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/yves/opt/lib/python3.7/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/yves/opt/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/yves/opt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/yves/opt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_model.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-05-12_12:20:21
host : yves-zumbach-3-tcp.tenant-chairesearch-test.svc.cluster.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 31376)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================ |
When trying to run opt_metaseq_1300m, I also got errors: |
I just found HF's published colab notebook https://twitter.com/huggingface/status/1524783493489774592?s=20&t=DZLKFh3FrVadi2zmMs62aA. You might want to try this method. Shoutout @suchenzang for the great community engagement. |
@patrickvonplaten Thanks for the great work! Code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b", torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)

prompt = "Hello, I'm am conscious and"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

set_seed(32)
generated_ids = model.generate(input_ids, do_sample=True)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```

Stacktrace: Traceback (most recent call last): Do you know how we can distribute the inference across all available GPUs? |
When I use DataParallel, I see the error below |
@patrickvonplaten I saw the announcement about using opt-30b in a colab notebook by loading the weights from disk. This is pretty cool, but I think some of us (and I believe @sanxchep as well) would be more interested in splitting the weights across multiple GPUs (e.g. 8x16GB GPUs) and running them with tensor parallelism or ZeRO-3, to achieve real-time inference speed. |
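For the multi-GPU question above, one option is Transformers' `device_map="auto"` loading. This is a sketch under assumptions, not something from this repo: it requires the `accelerate` package installed alongside Transformers, the function name `load_opt_sharded` is hypothetical, and note this gives sequential layer-wise placement across devices rather than true tensor parallelism:

```python
# Hedged sketch: spread an OPT checkpoint across all visible GPUs using
# Transformers' device_map="auto" (requires `pip install accelerate`).
# load_opt_sharded is a hypothetical helper name; any facebook/opt-*
# checkpoint should work in place of opt-30b.

def load_opt_sharded(name="facebook/opt-30b"):
    # Imports inside the function so the sketch can be read (and the
    # function defined) without the libraries installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name, use_fast=False)
    # device_map="auto" assigns whole layers to GPUs (spilling to CPU/disk
    # if needed) -- layer-wise placement, not tensor parallelism.
    model = AutoModelForCausalLM.from_pretrained(
        name, device_map="auto", torch_dtype=torch.float16
    )
    return model, tokenizer
```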
Hi, I followed the scripts in https://colab.research.google.com/drive/14wnxMvD9zsiBQo2FtTpxn6w2cpXCcb-7#scrollTo=y8Ne7jJdaF9F&uniqifier=1 on my local machine. However, I always run into a problem when my code executes at: with init_empty_weights(): The error message is: My environment is as follows; could you please help me pinpoint the cause of the problem? Thanks: absl-py 1.0.0 |
Your PyTorch version isn't new enough for meta tensors. |
Thanks, it works after upgrading to PyTorch 1.11 |
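For anyone hitting the same thing: `init_empty_weights` relies on PyTorch's "meta" device, so an old PyTorch fails before any model code runs. A minimal version-check sketch (`supports_meta_tensors` is a hypothetical helper; the 1.10 cutoff is an assumption about when meta-tensor support became usable, and the report above confirms 1.11 works):

```python
# Hypothetical helper: guess whether a torch version string is new enough
# for meta tensors. The >= 1.10 threshold is an assumption, not documented
# fact; 1.11 is known-good per the thread above.

def supports_meta_tensors(version):
    """Return True if a torch version string (e.g. "1.11.0") is >= 1.10."""
    major, minor = (int(x) for x in version.split(".")[:2])
    return (major, minor) >= (1, 10)


print(supports_meta_tensors("1.9.1"))   # -> False
print(supports_meta_tensors("1.11.0"))  # -> True
```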
Hi @patrickvonplaten, do you have any update on the "Big model inference" feature? |
Also see: #164 |
After fixing the conversion script in #164, we re-converted all singleton metaseq checkpoints here: https://huggingface.co/models?other=opt_metasq |
Marking as done - any additional issues on this front to be tracked in a new issue. Thanks for all the hard work here! |
Sorry if I missed an announcement that would justify this issue being closed, but where can I find more information about the "Big model inference" feature? I know this comment is unrelated to the initial issue, so I'm happy to open a new one if that's more appropriate. Right now, unless I'm mistaken, the Hugging Face implementation splits the model sequentially (layer-wise) across the GPUs, which means only one GPU is doing compute at any time. So out of 8 GPUs, 7 are used strictly for memory and 1 for actual compute. In theory, then, we could see a 5-7x improvement on the same hardware if pipeline parallelism were used, which I believe is where the big model inference feature would help: abstracting the hard parts (running DeepSpeed PP and Megatron-LM MP) while achieving much higher hardware utilization. Happy to hear your thoughts on that! |
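The utilization argument above can be put in back-of-envelope form. This is a sketch under idealized assumptions (perfectly balanced stages, zero communication overhead), so the numbers are upper bounds, not measurements:

```python
# Back-of-envelope model of GPU utilization for a simple GPipe-style
# pipeline schedule. Idealized: balanced stages, no communication cost.

def pipeline_utilization(num_stages, num_microbatches):
    """Average fraction of stages busy while the pipeline drains a batch."""
    return num_microbatches / (num_stages - 1 + num_microbatches)


# Sequential layer-wise inference is the one-microbatch case: on 8 GPUs,
# only 1/8 of the hardware is busy at a time.
print(pipeline_utilization(8, 1))   # -> 0.125
# With many microbatches in flight, utilization approaches 1, i.e. up to
# ~7x more useful compute than the sequential case on 8 GPUs.
print(round(pipeline_utilization(8, 56), 3))  # -> 0.889
```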
I agree, I was about to write the same: OPT cannot be used efficiently without parallelism. |
@xhluca @sanxchep for issues with HF Transformers, I think it's best to follow up with the HF team on how they implemented HF inference for OPT-175B (unless there's specific code changes here that you think are necessary for that to happen). There are also other community integrations, like alpa, which may help address some of the concerns here: https://opt.alpa.ai/ |
We'll share a post on Twitter and add a model to the Hub once we've transferred the HF weights to Meta :-) |
You can now use the OPT models in Hugging Face Transformers
Go here for details: https://twitter.com/huggingface/status/1524783489593360385
(Edited by admin. Original post below)
We're working hard at Hugging Face on adding all the checkpoints to Transformers. Thanks to @stephenroller and co., we've now managed to correctly convert the checkpoints. They are all uploaded here: https://huggingface.co/models?other=opt_metasq
If you go into a specific repo, you'll find a detailed explanation on how to run them.