
Initializing pipeline error #63

Closed
lurenss opened this issue Mar 2, 2023 · 16 comments
Labels: compatibility (issues arising from specific hardware or system configs), documentation (improvements or additions to documentation)

Comments

@lurenss

lurenss commented Mar 2, 2023

Once I had completed the installation and ran example.py as a test with the 8B model, I got the following error:

(base) lorenzo@lorenzo-desktop:~/Desktop/llama$ torchrun --nproc_per_node 1 example.py --ckpt_dir ./model/model_size --tokenizer_path ./model/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/lorenzo/Desktop/llama/example.py", line 72, in <module>
    fire.Fire(main)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/lorenzo/Desktop/llama/example.py", line 62, in main
    generator = load(ckpt_dir, tokenizer_path, local_rank, world_size)
  File "/home/lorenzo/Desktop/llama/example.py", line 36, in load
    world_size == len(checkpoints)
AssertionError: Loading a checkpoint for MP=0 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22343) of binary: /home/lorenzo/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/lorenzo/miniconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lorenzo/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_16:17:21
  host      : lorenzo-desktop
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 22343)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
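The assertion at the root of this traceback compares the world size against the number of checkpoint shard files found in `--ckpt_dir`; that count is the model-parallel size (MP). A sketch of that check, assuming names and layout inferred from the traceback rather than the verbatim source: if `--ckpt_dir` points at a nonexistent path (such as the `./model/model_size` placeholder above), the glob finds nothing, MP comes out as 0, and the assertion can never match a world size of 1.

```python
# Hypothetical reconstruction of the failing check in example.py's load().
# The shard-file pattern and message format are assumptions based on the
# traceback, not the repo's exact code.
import os
from pathlib import Path


def checkpoint_shards(ckpt_dir):
    """Return the shard files; their count is the model-parallel size (MP)."""
    return sorted(Path(ckpt_dir).glob("*.pth"))


def check_world_size(ckpt_dir, world_size):
    checkpoints = checkpoint_shards(ckpt_dir)
    # A wrong ckpt_dir yields an empty list, so MP == 0 and this fails.
    assert world_size == len(checkpoints), (
        f"Loading a checkpoint for MP={len(checkpoints)} "
        f"but world size is {world_size}"
    )
    return checkpoints
```

Under this reading, "MP=0" is not a parallelism misconfiguration at all; it simply means no `*.pth` files were found where the script looked.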
@felipehime

I got the same error here: loading a checkpoint for MP=0 but world size is 1.

The checkpoints variable is also empty when I checked, like [].

I don't know what's happening. By the way, is MP the number of GPUs in a single node?

@lurenss

lurenss commented Mar 2, 2023

I found the error. To fix it, you have to point to the actual model directory and the tokenizer, e.g.:
torchrun --nproc_per_node 1 example.py --ckpt_dir ./model/7B --tokenizer_path ./model/tokenizer.model
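Before launching torchrun, the two paths can be sanity-checked up front. A minimal pre-flight sketch, assuming the `./model/7B` and `./model/tokenizer.model` layout used in the command above (adjust to wherever download.sh put the files):

```python
# Pre-flight check for the two paths torchrun will be given.
# The ./model layout is an assumption, not something the repo mandates.
import glob
import os


def preflight(ckpt_dir, tokenizer_path):
    shards = glob.glob(os.path.join(ckpt_dir, "*.pth"))
    if not shards:
        return f"no *.pth shards in {ckpt_dir} (would give the MP=0 error)"
    if not os.path.isfile(tokenizer_path):
        return f"missing tokenizer at {tokenizer_path}"
    return f"ok: {len(shards)} shard(s) found"


if __name__ == "__main__":
    print(preflight("./model/7B", "./model/tokenizer.model"))
```

Running this before torchrun turns a cryptic "MP=0" assertion into a direct statement of which path is wrong.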

@felipehime

Is there a file named tokenizer.model? I only got a params.json.

@lurenss

lurenss commented Mar 2, 2023

In the folder where you downloaded the model, you should have the model directory, e.g. 7B, and also tokenizer.model.

@felipehime

Well... this is odd. I got checklist.chk, consolidate.pth, and params.json there, but no tokenizer.model ;/

@felipehime

Found it! But the problem still persists.

@felipehime

Ok, problem solved! It was a path problem lol

@Masood-Salik

Does anyone know what the problem is with this?
torchrun --nproc_per_node 1 example.py --ckpt_dir ./weights/7B --tokenizer_path ./weights/tokenizer.model
I get only this error, with no details: failed to create process.

@jeonbik

jeonbik commented Mar 2, 2023

Here is how I got things working:
1. As per (#41 (comment)), edit download.sh.
2. Run ./download.sh.
3. Once you have checkpoints for any model, e.g. 7B, run:

torchrun --nproc_per_node 1 example.py --ckpt_dir ./model/7B --tokenizer_path ./model/tokenizer.model

@neuhaus

neuhaus commented Mar 2, 2023

OK, I also cannot get it to run with "torchrun"; I get "failed to create process".

Edit:
It does work on Linux. The workaround for Windows with Python 3.9.x is to run

python -m torch.distributed.run instead of torchrun.

@emmiehine

> Ok, problem solved! It was a path problem lol

@felipehime what was the path issue? I'm getting the same error even when pointing the command explicitly at the directories.

@kiritoyu

I downloaded the model, but there is no 7B folder. Why?

@felipehime

> Ok, problem solved! It was a path problem lol
>
> @felipehime what was the path issue? I'm getting the same error even when pointing the command explicitly at the directories.

Specifically, the path of tokenizer.model.

@ka4on

ka4on commented Apr 7, 2023

> Does anyone know what the problem is with this? torchrun --nproc_per_node 1 example.py --ckpt_dir ./weights/7B --tokenizer_path ./weights/tokenizer.model I get only this error, with no details: failed to create process.

I also have the same problem. Any solutions? Thank you!

@albertodepaola added the documentation and compatibility labels Sep 6, 2023
@albertodepaola

Closing as the original author solved the issue. Feel free to open new issues with specific details on what you are facing for additional guidance. For future reference, check both the llama and llama-recipes repos for getting-started guides.

@ghost

ghost commented Oct 11, 2023

> Does anyone know what the problem is with this? torchrun --nproc_per_node 1 example.py --ckpt_dir ./weights/7B --tokenizer_path ./weights/tokenizer.model I get only this error, with no details: failed to create process.
>
> I also have the same problem. Any solutions? Thank you!

Did you find a solution yet?
