
AttributeError: 'NoneType' object has no attribute 'get' when running torchrun #86

Closed
aminechraibi opened this issue Mar 3, 2023 · 4 comments


@aminechraibi

aminechraibi commented Mar 3, 2023

I encountered an error when running torchrun command on my system with the following traceback:

Traceback (most recent call last):
  File "/mnt/f/projects/python/git/llama/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 681, in _initialize_workers
    worker_ids = self._start_workers(worker_group)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 271, in _start_workers
    self._pcontext = start_processes(
                     ^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/__init__.py", line 207, in start_processes
    redirs = to_map(redirects, nprocs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 162, in to_map
    map[i] = val_or_map.get(i, Std.NONE)
             ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'

I am using torchrun with the --nproc_per_node 1 option and passing the example.py script as an argument, along with the --ckpt_dir and --tokenizer_path arguments for the script. I have downloaded the 7B files and verified the checksum, and $TARGET_FOLDER has been set. I am not sure what caused this error or how to resolve it.

Here is the command I ran:

$ torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/7B --tokenizer_path $TARGET_FOLDER/tokenizer.model

Can you please help me diagnose the issue and find a solution? Thank you.
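For context, the failure at the bottom of the traceback can be reproduced in isolation. The sketch below is an assumption modeled on the `to_map` frame shown above (the real `Std` enum lives in torch.distributed.elastic.multiprocessing): a `None` redirects value reaches code that expects a `Std` value or a dict.

```python
# Simplified sketch of the failing to_map branch from the traceback above.
# Std and the dict-building loop are assumptions modeled on torch's code,
# not the actual implementation.
from enum import Enum

class Std(Enum):
    NONE = 0
    OUT = 1

def to_map(val_or_map, local_world_size):
    # The failing branch assumes val_or_map is a dict, never None.
    result = {}
    for i in range(local_world_size):
        result[i] = val_or_map.get(i, Std.NONE)
    return result

try:
    to_map(None, 1)  # redirects arrived as None
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'get'
```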

@tsaijamey

The obvious suspect is an input-parameter error: missing or incorrect arguments can make the program fail, so check for those first.

You cannot just copy the sample command. `$TARGET_FOLDER` is a placeholder: replace it with the local folder path where you downloaded the models.

In my case the models are under 'some_path/llama/models/7B', so I would replace $TARGET_FOLDER with './models'.
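Concretely, that means either substituting the path inline or exporting the variable first. The path below is only an example; use wherever you downloaded the weights:

```shell
# Example only: point TARGET_FOLDER at your local download directory.
export TARGET_FOLDER=/mnt/f/projects/python/git/llama/models
echo "$TARGET_FOLDER"
# Then run, for example:
#   torchrun --nproc_per_node 1 example.py \
#       --ckpt_dir $TARGET_FOLDER/7B \
#       --tokenizer_path $TARGET_FOLDER/tokenizer.model
```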

@aminechraibi
Author

@tsaijamey Thank you for your response. I just wanted to clarify that I have already set the $TARGET_FOLDER variable to the correct folder path where the 7B files are located.
I added the following portion of code in the main method to check if the folder is correctly set:

import os
import sys

def main(ckpt_dir: str, tokenizer_path: str, temperature: float = 0.8, top_p: float = 0.95):
    print("ckpt_dir: ", ckpt_dir)
    print("tokenizer_path: ", tokenizer_path)
    # Sanity-check that the paths passed on the command line exist
    if not os.path.isfile(tokenizer_path):
        print(f"{tokenizer_path} does not exist")
    if not os.path.isdir(ckpt_dir):
        print(f"{ckpt_dir} does not exist")
    print("all is fine")
    exit()  # stop here for the check; the original example.py continues below
    local_rank, world_size = setup_model_parallel()
    if local_rank > 0:
        sys.stdout = open(os.devnull, 'w')
    ...

I execute the following command without torchrun, since torchrun gives me the error I already mentioned.

python example.py --ckpt_dir $TARGET_FOLDER/7B --tokenizer_path $TARGET_FOLDER/tokenizer.model

The output:

ckpt_dir:  /mnt/f/projects/python/git/llama/models/7B
tokenizer_path:  /mnt/f/projects/python/git/llama/models/tokenizer.model
all is fine

I hope this helps clarify the issue. If there is anything else that needs to be checked, please let me know.

@markasoftware

It's a PyTorch bug; try Python 3.10 until it is fixed.
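If you are unsure which interpreter your venv's torchrun is actually running under, a quick check (nothing torch-specific here):

```python
# Print the interpreter version; the error above was reported on 3.11,
# and 3.10 is suggested as a workaround.
import sys
print("%d.%d" % sys.version_info[:2])
```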

@fche

fche commented Mar 6, 2023

This may work as a hack for those trying Python 3.11:

--- /home/YOU/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py~	2022-12-07 17:11:01.763871538 -0500
+++ /home/YOU/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py	2023-03-03 22:08:41.714570686 -0500
@@ -159,7 +159,7 @@
     else:
         map = {}
         for i in range(local_world_size):
-            map[i] = val_or_map.get(i, Std.NONE)
+            map[i] = val_or_map.get(i, Std.NONE) if val_or_map else Std.NONE
         return map
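In pure Python, the effect of the patched line is just a None guard. A standalone illustration (the `Std` enum here is a stand-in for torch's, not the real one):

```python
# Stand-in Std enum; the real one lives in torch.distributed.elastic.
from enum import Enum

class Std(Enum):
    NONE = 0

def to_map_patched(val_or_map, local_world_size):
    # Same shape as the patched line above: fall back to Std.NONE when
    # val_or_map is None instead of calling .get on it.
    result = {}
    for i in range(local_world_size):
        result[i] = val_or_map.get(i, Std.NONE) if val_or_map else Std.NONE
    return result

print(to_map_patched(None, 2))
```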
