
Running the API #26

Closed
hunterlang opened this issue May 3, 2022 · 5 comments
Labels
question (Further information is requested)

Comments

@hunterlang

Hi,

Following up on #19 and #23 in a separate issue.

So far I've made the following changes to constants.py:

git diff metaseq/service/constants.py
diff --git a/metaseq/service/constants.py b/metaseq/service/constants.py
index da4ff19..5ba4b94 100644
--- a/metaseq/service/constants.py
+++ b/metaseq/service/constants.py
@@ -29,7 +29,7 @@ except ImportError:
     # reshard-model_part-5.pt
     # reshard-model_part-6.pt
     # reshard-model_part-7.pt
-    MODEL_SHARED_FOLDER = ""
+    MODEL_SHARED_FOLDER = "/home/hlang/opt_models/"
     # LOCAL_SSD is optional, but it's assuming you have some sort of local
     # hard disk where we can cache a copy of the weights for faster loading.
     LOCAL_SSD = ""
@@ -46,9 +46,10 @@ BPE_MERGES = os.path.join(MODEL_SHARED_FOLDER, "gpt2-merges.txt")
 BPE_VOCAB = os.path.join(MODEL_SHARED_FOLDER, "gpt2-vocab.json")

 # where to find the raw files on nfs
-CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, "175B", "reshard_no_os")
+#CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, "175B", "reshard_no_os")
+CHECKPOINT_FOLDER = MODEL_SHARED_FOLDER
 # where to store them on SSD for faster loading
-CHECKPOINT_LOCAL = os.path.join(LOCAL_SSD, "175B", "reshard_no_os", "reshard.pt")
+CHECKPOINT_LOCAL = MODEL_SHARED_FOLDER

My /home/hlang/opt_models looks like:

dict.txt
gpt2-merges.txt
gpt2-vocab.json
reshard-model_part-0.pt
reshard-model_part-1.pt

dict.txt is from Stephen's link in #19, and reshard-model_part-0.pt and reshard-model_part-1.pt are from the OPT-125M links.
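
As a quick sanity check, this is what the edited constants resolve to against that folder (paths taken from the diff above; whether metaseq wants CHECKPOINT_LOCAL to be a folder or a specific .pt file is exactly the open question):

import os

# Values from the constants.py diff above.
MODEL_SHARED_FOLDER = "/home/hlang/opt_models/"
CHECKPOINT_FOLDER = MODEL_SHARED_FOLDER
CHECKPOINT_LOCAL = MODEL_SHARED_FOLDER  # upstream default pointed at .../reshard.pt

for path in (
    os.path.join(MODEL_SHARED_FOLDER, "dict.txt"),
    os.path.join(MODEL_SHARED_FOLDER, "gpt2-merges.txt"),
    os.path.join(MODEL_SHARED_FOLDER, "gpt2-vocab.json"),
    os.path.join(CHECKPOINT_FOLDER, "reshard-model_part-0.pt"),
    os.path.join(CHECKPOINT_FOLDER, "reshard-model_part-1.pt"),
):
    print(path, os.path.exists(path))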

I found that I also had to modify checkpoint_utils.py because get_paths_to_load wasn't actually finding those .pt files. So I just directly returned them (maybe this is not the right thing?):

root@node001:/home/hlang/metaseq# git diff metaseq/checkpoint_utils.py
diff --git a/metaseq/checkpoint_utils.py b/metaseq/checkpoint_utils.py
index 1ee8eee..0ea74df 100644
--- a/metaseq/checkpoint_utils.py
+++ b/metaseq/checkpoint_utils.py
@@ -344,6 +344,7 @@ def _is_checkpoint_sharded(checkpoint_files) -> bool:


 def get_paths_to_load(local_path, suffix="rank-"):
+    return ['/home/hlang/opt_models/reshard-model_part-0.pt', '/home/hlang/opt_models/reshard-model_part-1.pt']
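
(For reference, a less brittle version of that hack would glob for the shards instead of hard-coding the two paths. Untested sketch, assuming the reshard-model_part-*.pt naming from the listing above; this is not the upstream implementation:)

import glob
import os

def get_paths_to_load(local_path, suffix="rank-"):
    # Sketch only: look for model-parallel shards next to local_path and
    # return them in part order; fall back to local_path itself.
    folder = local_path if os.path.isdir(local_path) else os.path.dirname(local_path)
    shards = sorted(glob.glob(os.path.join(folder, "reshard-model_part-*.pt")))
    return shards if shards else [local_path]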

Now when I run metaseq-api-local I get:

2022-05-03 19:37:06 | INFO | metaseq.hub_utils | loading model(s) from /home/hlang/opt_models/
2022-05-03 19:37:07 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
  File "/opt/conda/bin/metaseq-api-local", line 33, in <module>
    sys.exit(load_entry_point('metaseq', 'console_scripts', 'metaseq-api-local')())
  File "/home/hlang/metaseq/metaseq_cli/interactive_hosted.py", line 300, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/hlang/metaseq/metaseq/distributed/utils.py", line 226, in call_main
    main(cfg, **kwargs)
  File "/home/hlang/metaseq/metaseq_cli/interactive_hosted.py", line 156, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/hlang/metaseq/metaseq/hub_utils.py", line 485, in load_model
    models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
  File "/home/hlang/metaseq/metaseq/checkpoint_utils.py", line 504, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/home/hlang/metaseq/metaseq/hub_utils.py", line 474, in _build_model
    model = task.build_model(cfg.model).half().cuda()
  File "/home/hlang/metaseq/metaseq/tasks/language_modeling.py", line 164, in build_model
    model = super().build_model(args)
  File "/home/hlang/metaseq/metaseq/tasks/base_task.py", line 560, in build_model
    model = models.build_model(args, self)
  File "/home/hlang/metaseq/metaseq/models/__init__.py", line 89, in build_model
    return model.build_model(cfg, task)
  File "/home/hlang/metaseq/metaseq/model_parallel/models/transformer_lm.py", line 47, in build_model
    embed_tokens = cls.build_embedding(
  File "/home/hlang/metaseq/metaseq/model_parallel/models/transformer_lm.py", line 82, in build_embedding
    embed_tokens = VocabParallelEmbedding(
  File "/home/hlang/Megatron-LM/megatron/mpu/layers.py", line 190, in __init__
    self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 258, in get_tensor_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 215, in get_tensor_model_parallel_group
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized

I tried Stephen's advice from #19 of setting --model-parallel N for N=0, N=1, and N=2, but none of them worked.

hunterlang added the question label May 3, 2022
@stephenroller
Contributor

Hm, that checkpoint change you made makes me think model parallel isn't being picked up. It should be set to 2 for that model.
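
For context, the assertion in the traceback fires because Megatron's tensor-model-parallel group was never created. Roughly, the per-worker setup it expects looks like the sketch below, assuming the Megatron-LM checkout shown in the traceback exposes mpu.initialize_model_parallel and assuming a two-process launch (e.g. torchrun --nproc_per_node=2); this is illustrative, not the metaseq code path:

import torch
from megatron import mpu  # API names assumed from the traceback above

# torch.distributed has to be up before any model-parallel groups exist.
torch.distributed.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(torch.distributed.get_rank())  # single node: rank == local GPU

# Creates _TENSOR_MODEL_PARALLEL_GROUP, which is what the failing assertion checks.
mpu.initialize_model_parallel(2)  # model parallel size 2 for these shards
assert mpu.get_tensor_model_parallel_world_size() == 2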

@hunterlang
Author

I couldn't figure out what combination of settings in constants.py would make get_paths_to_load work correctly with the 125M model. So after switching to the single-shard 350M model and reverting my changes to checkpoint_utils.py, I have the same problem as #23.

Since #31 has the 350M model working but needs to use torchrun, it seems like the torch distributed environment just isn't getting initialized correctly when we run metaseq-api-local?

@thies1006

The 125M checkpoint seems to work on a single node.
I had to remove the distributed port (otherwise it goes into the SLURM init path and I get the same error you got). In addition, I think it only works with 8 GPUs present in the node (even though model parallel and world_size are both set to 2); I tried on a different machine with only 6 GPUs and it didn't work.

@BlackSamorez

I've encountered the same problem and fixed it by forcing utils.py to use _infer_single_node_init (somehow I ended up in _infer_slurm_init, which is not what you want for this task). I couldn't find where the notorious cfg.distributed_port comes from (which leads to the SLURM backend), so I hardcoded it not to take that path.
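
For anyone trying the same thing, the shape of the workaround is roughly the wrapper below. call_main and its arguments come from the traceback earlier in the thread; treating a non-positive distributed_port as "skip SLURM init", and the exact attribute path of distributed_port on cfg, are assumptions:

from metaseq.distributed import utils as dist_utils

def call_main_single_node(cfg, main, **kwargs):
    # Clearing the port should keep dispatch in _infer_single_node_init
    # instead of _infer_slurm_init (the attribute may live under
    # cfg.distributed_training in the structured config).
    cfg.distributed_port = -1
    return dist_utils.call_main(cfg, main, **kwargs)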

@suchenzang
Contributor

Closing this given #88, #78, and #77, which should cover this issue as well.
