
Initializing pipeline error #63

Closed
lurenss opened this issue Mar 2, 2023 · 16 comments
Labels: compatibility (issues arising from specific hardware or system configs), documentation (improvements or additions to documentation)

Comments

@lurenss

lurenss commented Mar 2, 2023

Once I had completed the installation and ran example.py as a test with the 8B model, I got the following error:

(base) lorenzo@lorenzo-desktop:~/Desktop/llama$ torchrun --nproc_per_node 1 example.py --ckpt_dir ./model/model_size --tokenizer_path ./model/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/lorenzo/Desktop/llama/example.py", line 72, in <module>
    fire.Fire(main)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/lorenzo/Desktop/llama/example.py", line 62, in main
    generator = load(ckpt_dir, tokenizer_path, local_rank, world_size)
  File "/home/lorenzo/Desktop/llama/example.py", line 36, in load
    world_size == len(checkpoints)
AssertionError: Loading a checkpoint for MP=0 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22343) of binary: /home/lorenzo/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/lorenzo/miniconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_16:17:21
  host      : lorenzo-desktop
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 22343)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
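The assertion at the root of this traceback compares the world size against the number of checkpoint shard files found in `--ckpt_dir`; that count is the model-parallel size (MP). A sketch of that check, assuming names and layout inferred from the traceback rather than the verbatim source: if `--ckpt_dir` points at a nonexistent path (such as the `./model/model_size` placeholder above), the glob finds nothing, MP comes out as 0, and the assertion can never match a world size of 1.

```python
# Hypothetical reconstruction of the failing check in example.py's load().
# The shard-file pattern and message format are assumptions based on the
# traceback, not the repo's exact code.
import os
from pathlib import Path


def checkpoint_shards(ckpt_dir):
    """Return the shard files; their count is the model-parallel size (MP)."""
    return sorted(Path(ckpt_dir).glob("*.pth"))


def check_world_size(ckpt_dir, world_size):
    checkpoints = checkpoint_shards(ckpt_dir)
    # A wrong ckpt_dir yields an empty list, so MP == 0 and this fails.
    assert world_size == len(checkpoints), (
        f"Loading a checkpoint for MP={len(checkpoints)} "
        f"but world size is {world_size}"
    )
    return checkpoints
```

Under this reading, "MP=0" is not a parallelism misconfiguration at all; it simply means no `*.pth` files were found where the script looked.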
@felipehime

I got the same error here: loading a checkpoint for MP=0 but world size is 1.

The checkpoints variable is also empty when I checked, like [].

I don't know what's happening. By the way, is MP the number of GPUs in a single node?

@lurenss

lurenss commented Mar 2, 2023

I found the error. To fix it, you have to point to the actual model directory and the tokenizer, e.g.:
torchrun --nproc_per_node 1 example.py --ckpt_dir ./model/7B --tokenizer_path ./model/tokenizer.model
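Before launching torchrun, the two paths can be sanity-checked up front. A minimal pre-flight sketch, assuming the `./model/7B` and `./model/tokenizer.model` layout used in the command above (adjust to wherever download.sh put the files):

```python
# Pre-flight check for the two paths torchrun will be given.
# The ./model layout is an assumption, not something the repo mandates.
import glob
import os


def preflight(ckpt_dir, tokenizer_path):
    shards = glob.glob(os.path.join(ckpt_dir, "*.pth"))
    if not shards:
        return f"no *.pth shards in {ckpt_dir} (would give the MP=0 error)"
    if not os.path.isfile(tokenizer_path):
        return f"missing tokenizer at {tokenizer_path}"
    return f"ok: {len(shards)} shard(s) found"


if __name__ == "__main__":
    print(preflight("./model/7B", "./model/tokenizer.model"))
```

Running this before torchrun turns a cryptic "MP=0" assertion into a direct statement of which path is wrong.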

@felipehime

Is there a file named tokenizer.model? I only got a params.json.

@lurenss

lurenss commented Mar 2, 2023

In the folder where you downloaded the model, you should have the model directory, e.g. 7B, and also tokenizer.model.

@felipehime

Well... this is odd. I got checklist.chk, consolidate.pth, and params.json there, but no tokenizer.model ;/

@felipehime

Found it! But the problem still persists.

@felipehime

Ok, problem solved! It was a path problem lol

@Masood-Salik

Does anyone know what the problem is with this?
torchrun --nproc_per_node 1 example.py --ckpt_dir ./weights/7B --tokenizer_path ./weights/tokenizer.model
I get only this error, with no details: failed to create process.

@jeonbik

jeonbik commented Mar 2, 2023

Here is how I got things working:
1. As per (#41 (comment)), edit download.sh.
2. Run ./download.sh.
3. Once you have checkpoints for any model, e.g. 7B, run:

torchrun --nproc_per_node 1 example.py --ckpt_dir ./model/7B --tokenizer_path ./model/tokenizer.model

@neuhaus

neuhaus commented Mar 2, 2023

OK, I also cannot get it to run with "torchrun"; I get "failed to create process".

Edit:
It does work on Linux. The workaround for Windows with Python 3.9.x is to run

python -m torch.distributed.run instead of torchrun.

@emmiehine

> Ok, problem solved! It was a path problem lol

@felipehime what was the path issue? I'm getting the same error even when pointing the command explicitly at the directories.

@kiritoyu

I downloaded the model, but there is no 7B folder. Why?

@felipehime

> Ok, problem solved! It was a path problem lol
>
> @felipehime what was the path issue? I'm getting the same error even when pointing the command explicitly at the directories.

Specifically, the path of tokenizer.model.

@ka4on

ka4on commented Apr 7, 2023

> Does anyone know what the problem is with this? torchrun --nproc_per_node 1 example.py --ckpt_dir ./weights/7B --tokenizer_path ./weights/tokenizer.model I get only this error, with no details: failed to create process.

I also have the same problem. Any solutions? Thank you!

@albertodepaola added the documentation and compatibility labels Sep 6, 2023
@albertodepaola

Closing as the original author solved the issue. Feel free to open new issues with specific details on what you are facing for additional guidance. For future reference, check both the llama and llama-recipes repos for getting-started guides.

@ghost

ghost commented Oct 11, 2023

> Does anyone know what the problem is with this? torchrun --nproc_per_node 1 example.py --ckpt_dir ./weights/7B --tokenizer_path ./weights/tokenizer.model I get only this error, with no details: failed to create process.
>
> I also have the same problem. Any solutions? Thank you!

Did you find a solution yet?
