
Loading a checkpoint for MP=0 but world size is 1 #40

Closed
etali opened this issue Mar 2, 2023 · 12 comments


etali commented Mar 2, 2023

[screenshots: the run script and the resulting AssertionError]
It doesn't seem to work. Help!

@2533245542

I'm trying to load 7B with MP=1 too, but I got a memory error. What size of GPU did you use? I couldn't load it with 24 GB.


Raibows commented Mar 2, 2023

Hi, change MODEL_SIZE to $MODEL_SIZE in your torchrun command.
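Without the $, the shell passes the literal string MODEL_SIZE instead of 7B, so --ckpt_dir points at a directory that doesn't exist; the glob in load() then finds zero *.pth checkpoints and the assert reports MP=0. With the layout from the README, the corrected command should look roughly like this (adjust TARGET_FOLDER to wherever you downloaded the weights; the 7B checkpoint is a single shard, so MP=1 and --nproc_per_node 1):

TARGET_FOLDER=./models/llama
MODEL_SIZE=7B
torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/$MODEL_SIZE --tokenizer_path $TARGET_FOLDER/tokenizer.model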


Raibows commented Mar 2, 2023

> I'm trying to load 7B with MP=1 too, but I got a memory error. What size of GPU did you use? I couldn't load it with 24 GB.

At least 32GB. Maybe you can try fp16 for a 24GB card.
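Rough arithmetic on why (weights only, ignoring activations and the KV cache): about 7B parameters × 4 bytes each in FP32 is roughly 26 GB, versus × 2 bytes each in FP16 roughly 13 GB, which is why FP16 can fit on a 24 GB card while FP32 cannot.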


JOHW85 commented Mar 2, 2023

Change the line
model = Transformer(model_args)
to
model = Transformer(model_args).cuda().half()
to use FP16.
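As a minimal, self-contained sketch of what that change does (using a toy nn.Linear as a stand-in for the real Transformer; requires a CUDA GPU):

import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)          # stand-in for Transformer(model_args)
print(next(model.parameters()).dtype)  # torch.float32 -> 4 bytes per weight

model = model.cuda().half()            # move to the GPU and cast the weights to FP16
print(next(model.parameters()).dtype)  # torch.float16 -> 2 bytes per weight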


Deep1994 commented Mar 2, 2023

> [screenshots: the run script and the resulting AssertionError] It doesn't seem to work. Help!

Same here, did you solve it?


JOHW85 commented Mar 2, 2023

> [screenshots: the run script and the resulting AssertionError] It doesn't seem to work. Help!

> Same here, did you solve it?

You need $MODEL_SIZE on line 5 of your script.


Deep1994 commented Mar 2, 2023

It works! Thanks!

@iodine-pku

@Deep1994 May I ask how much VRAM you have?


etali commented Mar 2, 2023

Yes, you guys figured it out!

etali closed this as completed Mar 2, 2023
@kanseaveg

Please help me. I can't run the 33B model with 4× RTX 3090s.

export TARGET_FOLDER=./models/llama
torchrun --nproc_per_node 4 example.py --ckpt_dir $TARGET_FOLDER/33B --tokenizer_path $TARGET_FOLDER/tokenizer.model

The error message is:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 4
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/amax/euan/code/llm/code/llama/example.py", line 119, in <module>
    fire.Fire(main)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/amax/euan/code/llm/code/llama/example.py", line 78, in main
    generator = load(
  File "/home/amax/euan/code/llm/code/llama/example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 4
Traceback (most recent call last):
  File "/home/amax/euan/code/llm/code/llama/example.py", line 119, in <module>
    fire.Fire(main)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/amax/euan/code/llm/code/llama/example.py", line 78, in main
    generator = load(
  File "/home/amax/euan/code/llm/code/llama/example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 4
Traceback (most recent call last):
  File "/home/amax/euan/code/llm/code/llama/example.py", line 119, in <module>
    fire.Fire(main)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
Traceback (most recent call last):
  File "/home/amax/euan/code/llm/code/llama/example.py", line 78, in main
  File "/home/amax/euan/code/llm/code/llama/example.py", line 119, in <module>
    generator = load(
  File "/home/amax/euan/code/llm/code/llama/example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 4
    fire.Fire(main)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/amax/euan/code/llm/code/llama/example.py", line 78, in main
    generator = load(
  File "/home/amax/euan/code/llm/code/llama/example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 4
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22886) of binary: /home/amax/miniconda3/envs/cyy-llama/bin/python
Traceback (most recent call last):
  File "/home/amax/miniconda3/envs/cyy-llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/amax/miniconda3/envs/cyy-llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-07-18_13:35:29
  host      : admin.cluster.local
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 22888)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-07-18_13:35:29
  host      : admin.cluster.local
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 22891)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-07-18_13:35:29
  host      : admin.cluster.local
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 22894)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-18_13:35:29
  host      : admin.cluster.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 22886)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Please help me fix this problem. Thank you!
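For anyone else hitting this assertion with multiple GPUs: the assert at example.py line 42 compares the world size against the number of *.pth shard files found in --ckpt_dir, so MP=0 means that directory is missing or empty. If I remember right, the llama_v1 download script names the model folders 7B, 13B, 30B and 65B, so a $TARGET_FOLDER/33B directory may simply not exist. A quick sanity check (paths taken from the command above):

ls $TARGET_FOLDER/33B/*.pth         # a 4-way (MP=4) checkpoint should list consolidated.00.pth through consolidated.03.pth
cat $TARGET_FOLDER/33B/params.json  # should exist alongside the shards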


sdfgsdfgd commented Aug 25, 2023

Sweet Jesus - I applied someone's PR patch for enabling compatibility with M2 MacBooks (I have 16 GB of RAM), and it immediately locked up my system and everything crashed, lol.

This was with the 34B model; I also changed some variables to make the world size 1. I'm not visiting llama land again, I'm happy with GPT-4. Aloha!

@msz12345

Hello everyone! I have also run into this error, but even after modifying my code as suggested above, the error message is unchanged. Can someone help me, please?
Server: GPU RTX A5000 (24 GB), RAM 28 GB, Python 3.8, PyTorch 2.0, Ubuntu 20.04.
The version I am using is: https://github.com/facebookresearch/llama/tree/llama_v1
I chose the 7B model.
My folder:
[screenshot: folder contents (some filenames are in Chinese; please ignore them)]
When I run this instruction:
[screenshot: the run command from the README]
I use the following script to run it:

TARGET_FOLDER=./
MODEL_SIZE=7B
torchrun --nproc_per_node 1 example_small.py --ckpt_dir $TARGET_FOLDER/$MODEL_SIZE --tokenizer_path $TARGET_FOLDER/tokenizer.model

but the result is:

root@autodl-container-07e5119850-d5a71bd1:~/llama-llama_v1# ./bingo.sh

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "example.py", line 119, in <module>
    fire.Fire(main)
  File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 78, in main
    generator = load(
  File "example.py", line 42, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=0 but world size is 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1109) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-09_20:52:46
  host      : autodl-container-07e5119850-d5a71bd1
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1109)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

root@autodl-container-07e5119850-d5a71bd1:~/llama-llama_v1#

I also tried changing "model = Transformer(model_args)" to "model = Transformer(model_args).cuda().half()", but it still failed in the same way.
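For what it's worth, the check that fails is roughly the following (paraphrasing load() around line 42 of the llama_v1 example.py; ckpt_dir and world_size below are hypothetical stand-ins for what torchrun and --ckpt_dir actually pass in). MP=0 just means the glob found no *.pth files, so the fix is to make sure --ckpt_dir points at the directory that really contains consolidated.00.pth and params.json:

from pathlib import Path

ckpt_dir = "./7B"   # hypothetical: whatever --ckpt_dir expands to
world_size = 1      # hypothetical: set from --nproc_per_node
checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
assert world_size == len(checkpoints), (
    f"Loading a checkpoint for MP={len(checkpoints)} but world size is {world_size}"
)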
