
AttributeError: 'NoneType' object has no attribute 'get' when running torchrun #86

Closed
aminechraibi opened this issue Mar 3, 2023 · 4 comments


@aminechraibi

aminechraibi commented Mar 3, 2023

I encountered an error when running torchrun command on my system with the following traceback:

Traceback (most recent call last):
  File "/mnt/f/projects/python/git/llama/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 681, in _initialize_workers
    worker_ids = self._start_workers(worker_group)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 271, in _start_workers
    self._pcontext = start_processes(
                     ^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/__init__.py", line 207, in start_processes
    redirs = to_map(redirects, nprocs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 162, in to_map
    map[i] = val_or_map.get(i, Std.NONE)
             ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'

I am using torchrun with the --nproc_per_node 1 option and passing the example.py script as an argument, along with the --ckpt_dir and --tokenizer_path arguments for the script. I have downloaded the 7B files and verified the checksum, and $TARGET_FOLDER has been set. I am not sure what caused this error or how to resolve it.

Here is the command I ran:

$ torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/7B --tokenizer_path $TARGET_FOLDER/tokenizer.model

Can you please help me diagnose the issue and find a solution? Thank you.
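For context, the failure at the bottom of the traceback can be reproduced in isolation. The sketch below is an assumption modeled on the `to_map` frame shown above (the real `Std` enum lives in torch.distributed.elastic.multiprocessing): a `None` redirects value reaches code that expects a `Std` value or a dict.

```python
# Simplified sketch of the failing to_map branch from the traceback above.
# Std and the dict-building loop are assumptions modeled on torch's code,
# not the actual implementation.
from enum import Enum

class Std(Enum):
    NONE = 0
    OUT = 1

def to_map(val_or_map, local_world_size):
    # The failing branch assumes val_or_map is a dict, never None.
    result = {}
    for i in range(local_world_size):
        result[i] = val_or_map.get(i, Std.NONE)
    return result

try:
    to_map(None, 1)  # redirects arrived as None
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'get'
```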

@tsaijamey

The obvious suspect is an input-parameter error: missing or incorrect arguments can make the program fail, so check for those first.

You cannot just copy the sample command. `$TARGET_FOLDER` is a placeholder: replace it with the local folder path where you downloaded the models.

In my case the models are under 'some_path/llama/models/7B', so I would replace $TARGET_FOLDER with './models'.
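Concretely, that means either substituting the path inline or exporting the variable first. The path below is only an example; use wherever you downloaded the weights:

```shell
# Example only: point TARGET_FOLDER at your local download directory.
export TARGET_FOLDER=/mnt/f/projects/python/git/llama/models
echo "$TARGET_FOLDER"
# Then run, for example:
#   torchrun --nproc_per_node 1 example.py \
#       --ckpt_dir $TARGET_FOLDER/7B \
#       --tokenizer_path $TARGET_FOLDER/tokenizer.model
```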

@aminechraibi
Author

@tsaijamey Thank you for your response. I just wanted to clarify that I have already set the $TARGET_FOLDER variable to the correct folder path where the 7B files are located.
I added the following portion of code in the main method to check if the folder is correctly set:

import os
import sys

def main(ckpt_dir: str, tokenizer_path: str, temperature: float = 0.8, top_p: float = 0.95):
    print("ckpt_dir: ", ckpt_dir)
    print("tokenizer_path: ", tokenizer_path)
    # Sanity-check that the paths passed on the command line exist
    if not os.path.isfile(tokenizer_path):
        print(f"{tokenizer_path} does not exist")
    if not os.path.isdir(ckpt_dir):
        print(f"{ckpt_dir} does not exist")
    print("all is fine")
    exit()  # stop here for the check; the original example.py continues below
    local_rank, world_size = setup_model_parallel()
    if local_rank > 0:
        sys.stdout = open(os.devnull, 'w')
    ...

I execute the following command without torchrun, since torchrun gives me the error I already mentioned.

python example.py --ckpt_dir $TARGET_FOLDER/7B --tokenizer_path $TARGET_FOLDER/tokenizer.model

The output:

ckpt_dir:  /mnt/f/projects/python/git/llama/models/7B
tokenizer_path:  /mnt/f/projects/python/git/llama/models/tokenizer.model
all is fine

I hope this helps clarify the issue. If there is anything else that needs to be checked, please let me know.

@markasoftware

It's a PyTorch bug; try Python 3.10 until it is fixed.
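If you are unsure which interpreter your venv's torchrun is actually running under, a quick check (nothing torch-specific here):

```python
# Print the interpreter version; the error above was reported on 3.11,
# and 3.10 is suggested as a workaround.
import sys
print("%d.%d" % sys.version_info[:2])
```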

@fche

fche commented Mar 6, 2023

This may work as a hack for those trying Python 3.11:

--- /home/YOU/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py~	2022-12-07 17:11:01.763871538 -0500
+++ /home/YOU/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py	2023-03-03 22:08:41.714570686 -0500
@@ -159,7 +159,7 @@
     else:
         map = {}
         for i in range(local_world_size):
-            map[i] = val_or_map.get(i, Std.NONE)
+            map[i] = val_or_map.get(i, Std.NONE) if val_or_map else Std.NONE
         return map
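In pure Python, the effect of the patched line is just a None guard. A standalone illustration (the `Std` enum here is a stand-in for torch's, not the real one):

```python
# Stand-in Std enum; the real one lives in torch.distributed.elastic.
from enum import Enum

class Std(Enum):
    NONE = 0

def to_map_patched(val_or_map, local_world_size):
    # Same shape as the patched line above: fall back to Std.NONE when
    # val_or_map is None instead of calling .get on it.
    result = {}
    for i in range(local_world_size):
        result[i] = val_or_map.get(i, Std.NONE) if val_or_map else Std.NONE
    return result

print(to_map_patched(None, 2))
```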
