
Loading a checkpoint for MP=0 but world size is 1 #40

Closed
etali opened this issue Mar 2, 2023 · 12 comments


etali commented Mar 2, 2023

[screenshots: the run script and the resulting AssertionError]
It doesn't seem to work. Help!

@2533245542

I'm trying to load 7B with MP=1 too, but I got a memory error. What size of GPU did you use? I couldn't load it with 24 GB.


Raibows commented Mar 2, 2023

Hi, change MODEL_SIZE to $MODEL_SIZE in your torchrun command.
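Without the $, the shell passes the literal string MODEL_SIZE instead of 7B, so --ckpt_dir points at a directory that doesn't exist; the glob in load() then finds zero *.pth checkpoints and the assert reports MP=0. With the layout from the README, the corrected command should look roughly like this (adjust TARGET_FOLDER to wherever you downloaded the weights; the 7B checkpoint is a single shard, so MP=1 and --nproc_per_node 1):

TARGET_FOLDER=./models/llama
MODEL_SIZE=7B
torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/$MODEL_SIZE --tokenizer_path $TARGET_FOLDER/tokenizer.model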


Raibows commented Mar 2, 2023

> I'm trying to load 7B with MP=1 too, but I got a memory error. What size of GPU did you use? I couldn't load it with 24 GB.

At least 32GB. Maybe you can try fp16 for a 24GB card.
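Rough arithmetic on why (weights only, ignoring activations and the KV cache): about 7B parameters × 4 bytes each in FP32 is roughly 26 GB, versus × 2 bytes each in FP16 roughly 13 GB, which is why FP16 can fit on a 24 GB card while FP32 cannot.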


JOHW85 commented Mar 2, 2023

Change the line
model = Transformer(model_args)
to
model = Transformer(model_args).cuda().half()
to use FP16.
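As a minimal, self-contained sketch of what that change does (using a toy nn.Linear as a stand-in for the real Transformer; requires a CUDA GPU):

import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)          # stand-in for Transformer(model_args)
print(next(model.parameters()).dtype)  # torch.float32 -> 4 bytes per weight

model = model.cuda().half()            # move to the GPU and cast the weights to FP16
print(next(model.parameters()).dtype)  # torch.float16 -> 2 bytes per weight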


Deep1994 commented Mar 2, 2023

> [screenshots: the run script and the resulting AssertionError] It doesn't seem to work. Help!

Same here, did you solve it?


JOHW85 commented Mar 2, 2023

> [screenshots: the run script and the resulting AssertionError] It doesn't seem to work. Help!

> Same here, did you solve it?

You need $MODEL_SIZE on line 5 of your script.


Deep1994 commented Mar 2, 2023

It works! Thanks!

@iodine-pku

@Deep1994 May I ask how much VRAM you have?


etali commented Mar 2, 2023

Yes, you guys figured it out!

etali closed this as completed Mar 2, 2023
@kanseaveg

Please help me. I can't run the 33B model with 4× RTX 3090s.

export TARGET_FOLDER=./models/llama
torchrun --nproc_per_node 4 example.py --ckpt_dir $TARGET_FOLDER/33B --tokenizer_path $TARGET_FOLDER/tokenizer.model

The error message is:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 4
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/amax/euan/code/llm/code/llama/example.py", line 119, in <module>
    fire.Fire(main)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/amax/euan/code/llm/code/llama/example.py", line 78, in main
    generator = load(
  File "/home/amax/euan/code/llm/code/llama/example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 4
Traceback (most recent call last):
  File "/home/amax/euan/code/llm/code/llama/example.py", line 119, in <module>
    fire.Fire(main)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/amax/euan/code/llm/code/llama/example.py", line 78, in main
    generator = load(
  File "/home/amax/euan/code/llm/code/llama/example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 4
Traceback (most recent call last):
  File "/home/amax/euan/code/llm/code/llama/example.py", line 119, in <module>
    fire.Fire(main)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
Traceback (most recent call last):
  File "/home/amax/euan/code/llm/code/llama/example.py", line 78, in main
  File "/home/amax/euan/code/llm/code/llama/example.py", line 119, in <module>
    generator = load(
  File "/home/amax/euan/code/llm/code/llama/example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 4
    fire.Fire(main)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/amax/euan/code/llm/code/llama/example.py", line 78, in main
    generator = load(
  File "/home/amax/euan/code/llm/code/llama/example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 4
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22886) of binary: /home/amax/miniconda3/envs/cyy-llama/bin/python
Traceback (most recent call last):
  File "/home/amax/miniconda3/envs/cyy-llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-07-18_13:35:29
  host      : admin.cluster.local
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 22888)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-07-18_13:35:29
  host      : admin.cluster.local
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 22891)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-07-18_13:35:29
  host      : admin.cluster.local
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 22894)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-18_13:35:29
  host      : admin.cluster.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 22886)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Please help me fix this problem. Thank you!
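For anyone else hitting this assertion with multiple GPUs: the assert at example.py line 42 compares the world size against the number of *.pth shard files found in --ckpt_dir, so MP=0 means that directory is missing or empty. If I remember right, the llama_v1 download script names the model folders 7B, 13B, 30B and 65B, so a $TARGET_FOLDER/33B directory may simply not exist. A quick sanity check (paths taken from the command above):

ls $TARGET_FOLDER/33B/*.pth         # a 4-way (MP=4) checkpoint should list consolidated.00.pth through consolidated.03.pth
cat $TARGET_FOLDER/33B/params.json  # should exist alongside the shards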


sdfgsdfgd commented Aug 25, 2023

Sweet Jesus - I applied someone's PR patch for enabling compatibility with M2 MacBooks (I have 16 GB of RAM), and it immediately locked up my system and everything crashed, lol.

This was with the 34B model; I also changed some variables to make the world size 1. I'm not visiting llama land again, I'm happy with GPT-4. Aloha!

@msz12345

Hello everyone! I have also run into this error, but even after modifying my code as suggested above, the error message is unchanged. Can someone help me, please?
Server: GPU RTX A5000 (24 GB), RAM 28 GB, Python 3.8, PyTorch 2.0, Ubuntu 20.04.
The version I am using is: https://github.com/facebookresearch/llama/tree/llama_v1
I chose the 7B model.
My folder:
[screenshot: folder contents (some filenames are in Chinese; please ignore them)]
When I run this instruction:
[screenshot: the run command from the README]
I use the following script to run it:

TARGET_FOLDER=./
MODEL_SIZE=7B
torchrun --nproc_per_node 1 example_small.py --ckpt_dir $TARGET_FOLDER/$MODEL_SIZE --tokenizer_path $TARGET_FOLDER/tokenizer.model

but the result is:

root@autodl-container-07e5119850-d5a71bd1:~/llama-llama_v1# ./bingo.sh

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "example.py", line 119, in <module>
    fire.Fire(main)
  File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 78, in main
    generator = load(
  File "example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1109) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-09_20:52:46
  host      : autodl-container-07e5119850-d5a71bd1
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1109)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

root@autodl-container-07e5119850-d5a71bd1:~/llama-llama_v1#

I also tried changing "model = Transformer(model_args)" to "model = Transformer(model_args).cuda().half()", but it still failed in the same way.
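For what it's worth, the check that fails is roughly the following (paraphrasing load() around line 42 of the llama_v1 example.py; ckpt_dir and world_size below are hypothetical stand-ins for what torchrun and --ckpt_dir actually pass in). MP=0 just means the glob found no *.pth files, so the fix is to make sure --ckpt_dir points at the directory that really contains consolidated.00.pth and params.json:

from pathlib import Path

ckpt_dir = "./7B"   # hypothetical: whatever --ckpt_dir expands to
world_size = 1      # hypothetical: set from --nproc_per_node
checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
assert world_size == len(checkpoints), (
    f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"
)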
