Getting FileNotFoundError: [Errno 2] No such file or directory: 'sbatch': 'sbatch' Error, During training of OPT-125M. #235

Closed
pragnakalpdev11 opened this issue Jul 19, 2022 · 1 comment
Labels
question Further information is requested

Comments

@pragnakalpdev11

❓ Questions and Help

What is your question?

Hello, I need to train OPT-125M on our own dataset.
During training I get this error: FileNotFoundError: [Errno 2] No such file or directory: 'sbatch': 'sbatch'

Where is the sbatch file located?

Code

! python metaseq/launcher/opt_baselines.py \
    -n 1 -g 2 \
    -p test_v0 \
    --model-size 125m \
    --azure \
    --data /content/test_data \
    --checkpoints-dir "/content/ck_point" \

What have you tried?

I tried to train OPT-125M on Google Colab, following the training.md provided in the repo (https://github.com/facebookresearch/metaseq/blob/main/docs/training.md).

I get the error below:

Traceback (most recent call last):
  File "metaseq/launcher/opt_baselines.py", line 342, in <module>
    cli_main()
  File "metaseq/launcher/opt_baselines.py", line 337, in cli_main
    get_grid, postprocess_hyperparams, add_extra_options_func=add_extra_options_func
  File "/usr/local/lib/python3.7/dist-packages/metaseq-0.0.1-py3.7-linux-x86_64.egg/metaseq/launcher/sweep.py", line 378, in main
    backend_main(get_grid, postprocess_hyperparams, args)
  File "/usr/local/lib/python3.7/dist-packages/metaseq-0.0.1-py3.7-linux-x86_64.egg/metaseq/launcher/slurm.py", line 41, in main
    launch_train(args, grid, grid_product, dry_run, postprocess_hyperparams)
  File "/usr/local/lib/python3.7/dist-packages/metaseq-0.0.1-py3.7-linux-x86_64.egg/metaseq/launcher/slurm.py", line 465, in launch_train
    job_id, stdout = run_batch(env, sbatch_cmd_str, sbatch_cmd)
  File "/usr/local/lib/python3.7/dist-packages/metaseq-0.0.1-py3.7-linux-x86_64.egg/metaseq/launcher/slurm.py", line 350, in run_batch
    with subprocess.Popen(sbatch_cmd, stdout=subprocess.PIPE, env=env) as train_proc:
  File "/usr/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sbatch': 'sbatch'

What's your environment?

  • metaseq Version (e.g., 1.0 or master): Forked Repo
  • PyTorch Version (e.g., 1.0) : 1.11.0
  • OS (e.g., Linux): Linux
  • How you installed metaseq (pip, source): source
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version: cuda_11.1.TC455_06.29190527_0
  • GPU models and configuration: 1 Tesla p100
  • Any other relevant information:
@pragnakalpdev11 pragnakalpdev11 added the question Further information is requested label Jul 19, 2022
@stephenroller
Contributor

Our launcher assumes it is running on a Slurm cluster, so it shells out to sbatch, which does not exist on Colab. If you look at what is passed in the --wrap argument of the sbatch command it launches, that is the actual training command (after removing the srun prefix).
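
For anyone hitting the same error, here is a minimal sketch of that idea in plain Python. It is not part of metaseq: the helper names (`slurm_available`, `extract_wrapped_command`, `strip_srun_prefix`) and the example argument list are hypothetical, and the real `sbatch_cmd` built by the launcher may look different. The sketch assumes the training command inside `--wrap` starts at the first `python` token, which may not hold for every setup.

```python
import shlex
import shutil
from typing import List, Optional


def slurm_available() -> bool:
    """Return True if an `sbatch` executable can be found on PATH."""
    return shutil.which("sbatch") is not None


def extract_wrapped_command(sbatch_cmd: List[str]) -> Optional[str]:
    """Return the value passed to --wrap in an sbatch argument list.

    Handles both the `--wrap VALUE` and `--wrap=VALUE` forms.
    Returns None if no --wrap argument is present.
    """
    for i, arg in enumerate(sbatch_cmd):
        if arg == "--wrap" and i + 1 < len(sbatch_cmd):
            return sbatch_cmd[i + 1]
        if arg.startswith("--wrap="):
            return arg[len("--wrap="):]
    return None


def strip_srun_prefix(wrapped: str, entry: str = "python") -> str:
    """Drop a leading `srun ...` prefix from a wrapped command string.

    Assumption: the real command starts at the first token equal to `entry`.
    If the string does not start with `srun`, it is returned unchanged.
    """
    tokens = shlex.split(wrapped)
    if not tokens or tokens[0] != "srun" or entry not in tokens:
        return wrapped
    start = tokens.index(entry)
    return " ".join(shlex.quote(t) for t in tokens[start:])


# Hypothetical argument list, for illustration only.
example_cmd = [
    "sbatch",
    "--job-name", "test_v0",
    "--wrap", "srun --unbuffered python train.py --lr 0.1",
]

if not slurm_available():
    print("sbatch not found on PATH; this machine is not a Slurm cluster.")

wrapped = extract_wrapped_command(example_cmd)
if wrapped is not None:
    print(strip_srun_prefix(wrapped))
```

On a machine without Slurm (such as Colab), `shutil.which("sbatch")` returns `None`, which is the same condition that makes `subprocess.Popen` raise the `FileNotFoundError` in the traceback above. The printed command is what you would then run directly on the machine.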
