
Running the API #26

Closed
hunterlang opened this issue May 3, 2022 · 5 comments
Labels
question (Further information is requested)

Comments

@hunterlang

Hi,

Following up on #19 and #23 in a separate issue.

So far I've made the following changes to constants.py:

git diff metaseq/service/constants.py
diff --git a/metaseq/service/constants.py b/metaseq/service/constants.py
index da4ff19..5ba4b94 100644
--- a/metaseq/service/constants.py
+++ b/metaseq/service/constants.py
@@ -29,7 +29,7 @@ except ImportError:
     # reshard-model_part-5.pt
     # reshard-model_part-6.pt
     # reshard-model_part-7.pt
-    MODEL_SHARED_FOLDER = ""
+    MODEL_SHARED_FOLDER = "/home/hlang/opt_models/"
     # LOCAL_SSD is optional, but it's assuming you have some sort of local
     # hard disk where we can cache a copy of the weights for faster loading.
     LOCAL_SSD = ""
@@ -46,9 +46,10 @@ BPE_MERGES = os.path.join(MODEL_SHARED_FOLDER, "gpt2-merges.txt")
 BPE_VOCAB = os.path.join(MODEL_SHARED_FOLDER, "gpt2-vocab.json")

 # where to find the raw files on nfs
-CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, "175B", "reshard_no_os")
+#CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, "175B", "reshard_no_os")
+CHECKPOINT_FOLDER = MODEL_SHARED_FOLDER
 # where to store them on SSD for faster loading
-CHECKPOINT_LOCAL = os.path.join(LOCAL_SSD, "175B", "reshard_no_os", "reshard.pt")
+CHECKPOINT_LOCAL = MODEL_SHARED_FOLDER

My /home/hlang/opt_models looks like:

dict.txt
gpt2-merges.txt
gpt2-vocab.json
reshard-model_part-0.pt
reshard-model_part-1.pt

dict.txt is from Stephen's link in #19, and reshard-model_part-0.pt and reshard-model_part-1.pt are from the OPT-125M links.
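
As a quick sanity check, this is what the edited constants resolve to against that folder (paths taken from the diff above; whether metaseq wants CHECKPOINT_LOCAL to be a folder or a specific .pt file is exactly the open question):

import os

# Values from the constants.py diff above.
MODEL_SHARED_FOLDER = "/home/hlang/opt_models/"
CHECKPOINT_FOLDER = MODEL_SHARED_FOLDER
CHECKPOINT_LOCAL = MODEL_SHARED_FOLDER  # upstream default pointed at .../reshard.pt

for path in (
    os.path.join(MODEL_SHARED_FOLDER, "dict.txt"),
    os.path.join(MODEL_SHARED_FOLDER, "gpt2-merges.txt"),
    os.path.join(MODEL_SHARED_FOLDER, "gpt2-vocab.json"),
    os.path.join(CHECKPOINT_FOLDER, "reshard-model_part-0.pt"),
    os.path.join(CHECKPOINT_FOLDER, "reshard-model_part-1.pt"),
):
    print(path, os.path.exists(path))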

I found that I also had to modify checkpoint_utils.py because get_paths_to_load wasn't actually finding those .pt files. So I just directly returned them (maybe this is not the right thing?):

root@node001:/home/hlang/metaseq# git diff metaseq/checkpoint_utils.py
diff --git a/metaseq/checkpoint_utils.py b/metaseq/checkpoint_utils.py
index 1ee8eee..0ea74df 100644
--- a/metaseq/checkpoint_utils.py
+++ b/metaseq/checkpoint_utils.py
@@ -344,6 +344,7 @@ def _is_checkpoint_sharded(checkpoint_files) -> bool:


 def get_paths_to_load(local_path, suffix="rank-"):
+    return ['/home/hlang/opt_models/reshard-model_part-0.pt', '/home/hlang/opt_models/reshard-model_part-1.pt']
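
(For reference, a less brittle version of that hack would glob for the shards instead of hard-coding the two paths. Untested sketch, assuming the reshard-model_part-*.pt naming from the listing above; this is not the upstream implementation:)

import glob
import os

def get_paths_to_load(local_path, suffix="rank-"):
    # Sketch only: look for model-parallel shards next to local_path and
    # return them in part order; fall back to local_path itself.
    folder = local_path if os.path.isdir(local_path) else os.path.dirname(local_path)
    shards = sorted(glob.glob(os.path.join(folder, "reshard-model_part-*.pt")))
    return shards if shards else [local_path]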

Now when I run metaseq-api-local I get:

2022-05-03 19:37:06 | INFO | metaseq.hub_utils | loading model(s) from /home/hlang/opt_models/
2022-05-03 19:37:07 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
  File "/opt/conda/bin/metaseq-api-local", line 33, in <module>
    sys.exit(load_entry_point('metaseq', 'console_scripts', 'metaseq-api-local')())
  File "/home/hlang/metaseq/metaseq_cli/interactive_hosted.py", line 300, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/hlang/metaseq/metaseq/distributed/utils.py", line 226, in call_main
    main(cfg, **kwargs)
  File "/home/hlang/metaseq/metaseq_cli/interactive_hosted.py", line 156, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/hlang/metaseq/metaseq/hub_utils.py", line 485, in load_model
    models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
  File "/home/hlang/metaseq/metaseq/checkpoint_utils.py", line 504, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/home/hlang/metaseq/metaseq/hub_utils.py", line 474, in _build_model
    model = task.build_model(cfg.model).half().cuda()
  File "/home/hlang/metaseq/metaseq/tasks/language_modeling.py", line 164, in build_model
    model = super().build_model(args)
  File "/home/hlang/metaseq/metaseq/tasks/base_task.py", line 560, in build_model
    model = models.build_model(args, self)
  File "/home/hlang/metaseq/metaseq/models/__init__.py", line 89, in build_model
    return model.build_model(cfg, task)
  File "/home/hlang/metaseq/metaseq/model_parallel/models/transformer_lm.py", line 47, in build_model
    embed_tokens = cls.build_embedding(
  File "/home/hlang/metaseq/metaseq/model_parallel/models/transformer_lm.py", line 82, in build_embedding
    embed_tokens = VocabParallelEmbedding(
  File "/home/hlang/Megatron-LM/megatron/mpu/layers.py", line 190, in __init__
    self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 258, in get_tensor_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 215, in get_tensor_model_parallel_group
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized

I tried Stephen's advice from #19 of setting --model-parallel N for N=0, N=1, and N=2, but none of them worked.

hunterlang added the question label May 3, 2022
@stephenroller
Contributor

Hm, that checkpoint change you made makes me think model parallel isn't being picked up. It should be set to 2 for that model.
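
For context, the assertion in the traceback fires because Megatron's tensor-model-parallel group was never created. Roughly, the per-worker setup it expects looks like the sketch below, assuming the Megatron-LM checkout shown in the traceback exposes mpu.initialize_model_parallel and assuming a two-process launch (e.g. torchrun --nproc_per_node=2); this is illustrative, not the metaseq code path:

import torch
from megatron import mpu  # API names assumed from the traceback above

# torch.distributed has to be up before any model-parallel groups exist.
torch.distributed.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(torch.distributed.get_rank())  # single node: rank == local GPU

# Creates _TENSOR_MODEL_PARALLEL_GROUP, which is what the failing assertion checks.
mpu.initialize_model_parallel(2)  # model parallel size 2 for these shards
assert mpu.get_tensor_model_parallel_world_size() == 2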

@hunterlang
Author

I couldn't figure out what combination of settings in constants.py would make get_paths_to_load work correctly with the 125M model. So after switching to the single-shard 350M model and reverting my changes to checkpoint_utils.py, I have the same problem as #23.

Since #31 has the 350M model working but needs to use torchrun, it seems like the torch distributed environment just isn't getting initialized correctly when we run metaseq-api-local?

@thies1006

The 125M checkpoint seems to work on a single node.
I had to remove the distributed port (otherwise it goes into the SLURM init path and I get the same error you got). In addition, I think it only works with 8 GPUs present in the node (even though model parallel and world_size are both set to 2); I tried on a different machine with only 6 GPUs and it didn't work.

@BlackSamorez

I've encountered the same problem and fixed it by forcing utils.py to use _infer_single_node_init (somehow I ended up in _infer_slurm_init, which is not what you want for this task). I couldn't find where the notorious cfg.distributed_port comes from (which leads to the SLURM backend), so I hardcoded it not to take that path.
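
For anyone trying the same thing, the shape of the workaround is roughly the wrapper below. call_main and its arguments come from the traceback earlier in the thread; treating a non-positive distributed_port as "skip SLURM init", and the exact attribute path of distributed_port on cfg, are assumptions:

from metaseq.distributed import utils as dist_utils

def call_main_single_node(cfg, main, **kwargs):
    # Clearing the port should keep dispatch in _infer_single_node_init
    # instead of _infer_slurm_init (the attribute may live under
    # cfg.distributed_training in the structured config).
    cfg.distributed_port = -1
    return dist_utils.call_main(cfg, main, **kwargs)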

@suchenzang
Contributor

Closing this given #88, #78, and #77, which should cover this issue as well.
