torchrun command not found #553

Closed
Micla-SHL opened this issue Nov 2, 2023 · 5 comments

Micla-SHL commented Nov 2, 2023

Run:
bash dist_trigger_docker.sh hostfile Aquila-chat.yaml aquila-7b aquila_experiment
Error:

[INFO] bmtrain_mgpu.sh: hostfile configfile model_name exp_name exp_version
bmtrain_mgpu.sh: line 84: torchrun: command not found

envs: 1 * 4090

torch                       2.1.0+cu118
torchaudio                  2.1.0
torchmetrics                1.2.0
torchvision                 0.16.0+cu118

hostfile: 192.168.1.5 slots=1
I edited ~/FlagAI/flagai/env_args.py to add:
self.parser.add_argument('--local_rank', default=0, type=int, help='start training from saved checkpoint')
but it had no effect.
What else can I try? I'd rather not downgrade right now. Are there any other options?

Originally posted by @Micla-SHL in #511 (comment)

Micla-SHL changed the title from "Run: torchrun 未找到命令" to "torchrun command not found" on Nov 2, 2023
Collaborator

ftgreat commented Nov 3, 2023

Run which torchrun to check whether torchrun is installed and can be found on your PATH.

P.S. Could you tell us what you are using this script for: pretraining or fine-tuning? Thanks.
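
The check suggested above can also be done from Python, which helps when the shell that runs bmtrain_mgpu.sh uses a different environment than your interactive one. This is only a sketch: the helper name find_torchrun is made up for illustration, and it relies on the fact that the torchrun console script is an entry point for the torch.distributed.run module, so python -m torch.distributed.run can stand in for it when torch is installed but its scripts directory is not on PATH.

```python
import importlib.util
import shutil
import sys


def find_torchrun():
    """Return a command list that launches torchrun, or None if unavailable.

    Hypothetical helper for illustration; not part of FlagAI or PyTorch.
    """
    # Case 1: the torchrun console script is on PATH (what `which torchrun` checks).
    path = shutil.which("torchrun")
    if path is not None:
        return [path]
    # Case 2: torch is importable, but its scripts directory is not on PATH.
    try:
        spec = importlib.util.find_spec("torch.distributed.run")
    except ImportError:
        spec = None  # torch is not installed for this interpreter
    if spec is not None:
        return [sys.executable, "-m", "torch.distributed.run"]
    return None


if __name__ == "__main__":
    cmd = find_torchrun()
    if cmd is None:
        print("torchrun not available for interpreter:", sys.executable)
    else:
        print("launch with:", " ".join(cmd))
```

If this reports the module form only, the environment that has torch installed is probably not the one on PATH when the script runs, for example because the conda or virtualenv environment is not activated in that shell or inside the container.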

Author

[Screenshot: 2023-11-03 10-16-43]
It depends on the script itself. My plan is to follow the README step by step first and get everything working before considering any branching.

Author

Micla-SHL commented Nov 3, 2023

Update:
I've just noticed this:
dist_trigger_docker.sh # script for multi-node, multi-GPU runs; on a single machine you can use local_trigger_docker.sh instead
So I ran:
bash local_trigger_docker.sh hostfile Aquila-chat.yaml aquila-7b aquila_experiment
This produced different errors, so for now I'm assuming it got past the torchrun step. The "torchrun command not found" error is no longer a concern for me.

For users with multiple GPUs: does manually setting slots=1 and running bash dist_trigger_docker.sh hostfile Aquila-chat.yaml aquila-7b aquila_experiment specifically trigger the "torchrun command not found" error?

Author

The problem that just came up:

  File "/Micla/Project/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py", line 15, in <module>
    from flagai.env_trainer_v1 import EnvTrainer
ModuleNotFoundError: No module named 'flagai.env_trainer_v1'

~/examples/Aquila/Aquila-chat/aquila_chat.py

from flagai.auto_model.auto_loader import AutoLoader  # imports OK
from flagai.data.tokenizer import Tokenizer  # imports OK
from flagai.env_args import EnvArgs
from flagai.env_trainer_v1 import EnvTrainer  # fails: No module named 'flagai.env_trainer_v1'
import jsonlines
import numpy as np
import cyg_conversation as conversation_lib
from flagai.model.tools.peft.prepare_lora import lora_transfer  # fails: No module named 'flagai.model.tools'

~/__init__.py is empty. Adding from . import env_trainer_v1 to it had no effect.
Adding import sys and sys.path.append("..") to aquila_chat.py had no effect either.

What more can be done?
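
One way to narrow this down is to check which copy of flagai the interpreter actually imports and whether that copy contains env_trainer_v1. The sketch below uses only the standard library; describe_package is a hypothetical helper name, and the idea that an installed flagai is shadowing the repository checkout is an assumption, not something confirmed in this thread. It may also explain why sys.path.append("..") did not help: the path is resolved against the current working directory rather than the script's location, and appended entries are searched after any copy of flagai found earlier on sys.path.

```python
import importlib.util
import pkgutil


def describe_package(name):
    """Return (locations, submodule names) for an importable package, or None.

    Hypothetical helper for illustration; not part of FlagAI.
    """
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        return None
    if spec is None:
        return None
    # Directories the package is loaded from (empty for plain modules).
    locations = list(spec.submodule_search_locations or [])
    # Names of the direct submodules and subpackages found in those directories.
    submodules = sorted(m.name for m in pkgutil.iter_modules(locations))
    return locations, submodules


if __name__ == "__main__":
    info = describe_package("flagai")
    if info is None:
        print("flagai is not importable from this interpreter")
    else:
        locations, submodules = info
        print("flagai is loaded from:", locations)
        print("env_trainer_v1 present:", "env_trainer_v1" in submodules)
```

If the reported location is a site-packages directory rather than the cloned repository, installing the checkout itself (for example with pip install -e . from the repository root, assuming it ships a setup.py) is one option to try.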

Author

I'm asking here because issues are closed on https://github.com/FlagAI-Open/Aquila2.
