torchrun command not found #553

Micla-SHL · 2023-11-02T12:39:16Z

Run:
bash dist_trigger_docker.sh hostfile Aquila-chat.yaml aquila-7b aquila_experiment
Error:

[INFO] bmtrain_mgpu.sh: hostfile configfile model_name exp_name exp_version
bmtrain_mgpu.sh: line 84: torchrun: command not found

Environment: 1 x 4090 GPU

torch                       2.1.0+cu118
torchaudio                  2.1.0
torchmetrics                1.2.0
torchvision                 0.16.0+cu118

hostfile: 192.168.1.5 slots=1
I edited the following line in ~/FlagAI/flagai/env_args.py:
self.parser.add_argument('--local_rank', default=0, type=int, help='start training from saved checkpoint')
but it had no effect.
What else can I try? I would rather not downgrade right now. Are there any other options?

Originally posted by @Micla-SHL in #511 (comment)


ftgreat · 2023-11-03T01:08:06Z

Run which torchrun to check whether torchrun is installed and can be found on your PATH.

PS: could you tell us what you are using this script for, pretraining or fine-tuning? Thanks.
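
The same check can be done from Python. The sketch below is a minimal illustration, not part of FlagAI: the helper name resolve_torchrun_command is hypothetical. It looks for torchrun on the PATH and, if it is missing, falls back to running the torch.distributed.run module, which is the module the torchrun console script wraps. The fallback only helps if torch is installed in the interpreter being used.

```python
import shutil
import sys


def resolve_torchrun_command():
    """Return an argv prefix for launching torch.distributed.run.

    Hypothetical helper for illustration only.
    """
    path = shutil.which("torchrun")  # same lookup as `which torchrun`
    if path is not None:
        return [path]
    # torchrun not on PATH: call the underlying module with the
    # current interpreter instead (assumes torch is installed there).
    return [sys.executable, "-m", "torch.distributed.run"]


if __name__ == "__main__":
    print(" ".join(resolve_torchrun_command()))
```

If this prints a path ending in torchrun, the launcher is visible to the current shell; if it prints the python -m form, the script that calls torchrun is probably running in an environment whose bin directory is not on the PATH.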

Micla-SHL · 2023-11-03T02:31:34Z

It depends on the script itself. My plan is to follow the README step by step and get everything working before trying anything else.

Micla-SHL · 2023-11-03T04:32:19Z

Update:
I've just noticed this note:
dist_trigger_docker.sh # script for multi-machine, multi-GPU runs; on a single machine, local_trigger_docker.sh can be used instead
So I ran:
bash local_trigger_docker.sh hostfile Aquila-chat.yaml aquila-7b aquila_experiment
That produced different errors, so for now I'm assuming the torchrun error has been bypassed. "torchrun command not found" is no longer a concern for me.

For users with multiple GPUs, does manually setting slots=1 and running bash dist_trigger_docker.sh hostfile Aquila-chat.yaml aquila-7b aquila_experiment specifically trigger the "torchrun command not found" error?
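
The single-machine versus multi-machine choice from the note above can be sketched as follows. This is only an illustration under assumptions: the hostfile format is taken to be one "<host> slots=<n>" entry per line, as in the hostfile quoted earlier, and both function names are hypothetical rather than part of FlagAI.

```python
def parse_hostfile(text):
    """Parse '<host> slots=<n>' lines into (host, slots) pairs.

    Assumed format; blank lines and '#' comments are skipped.
    """
    hosts = []
    for raw in text.splitlines():
        tokens = raw.split("#", 1)[0].split()
        if not tokens:
            continue
        slots = 1  # assumed default when no slots= token is present
        for token in tokens[1:]:
            if token.startswith("slots="):
                slots = int(token[len("slots="):])
        hosts.append((tokens[0], slots))
    return hosts


def pick_trigger_script(hosts):
    """Single host: local script; several hosts: distributed script."""
    if len(hosts) <= 1:
        return "local_trigger_docker.sh"
    return "dist_trigger_docker.sh"


if __name__ == "__main__":
    hosts = parse_hostfile("192.168.1.5 slots=1")
    print(hosts, pick_trigger_script(hosts))
```

The decision depends on the number of hosts, not on the slots value, which matches the note that local_trigger_docker.sh is the single-machine option.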

Micla-SHL · 2023-11-03T04:45:29Z

The problem that just came up:

  File "/Micla/Project/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py", line 15, in <module>
    from flagai.env_trainer_v1 import EnvTrainer
ModuleNotFoundError: No module named 'flagai.env_trainer_v1'

~/examples/Aquila/Aquila-chat/aquila_chat.py

from flagai.auto_model.auto_loader import AutoLoader  # imports OK
from flagai.data.tokenizer import Tokenizer  # imports OK
from flagai.env_args import EnvArgs
from flagai.env_trainer_v1 import EnvTrainer  # fails: No module named 'flagai.env_trainer_v1'
import jsonlines
import numpy as np
import cyg_conversation as conversation_lib
from flagai.model.tools.peft.prepare_lora import lora_transfer  # fails: No module named 'flagai.model.tools'

~/__init__.py is empty; adding from . import env_trainer_v1 to it had no effect.
Adding import sys / sys.path.append("..") to aquila_chat.py had no effect either.

What more can be done?
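
One way to narrow this down is to check which copy of a package Python actually resolves. The sketch below is a generic diagnostic, not a confirmed fix: the helper name locate_module is hypothetical, and the idea that an installed flagai differs from the repository checkout is only an assumption to test.

```python
import importlib.util


def locate_module(name):
    """Return the file a module would be loaded from, or None.

    Hypothetical helper for illustration only.
    """
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        return None
    return None if spec is None else spec.origin


if __name__ == "__main__":
    for name in ("flagai", "flagai.env_trainer_v1", "flagai.model.tools"):
        print(name, "->", locate_module(name))
```

If flagai resolves to a site-packages directory rather than the repository checkout, and the submodules print None, the installed copy does not contain those files, and edits made inside the checkout would not be picked up.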

Micla-SHL · 2023-11-03T10:33:56Z

Closing this issue because of https://github.com/FlagAI-Open/Aquila2.

Micla-SHL changed the title ~~Run: torchrun 未找到命令~~ torchrun command not found Nov 2, 2023

Micla-SHL closed this as completed Nov 3, 2023
