[Bug] unrecognized arguments: --local-rank=0 #48

Closed
leon-costa opened this issue May 11, 2023 · 3 comments

@leon-costa

Describe the bug

I followed the installation instructions in https://github.com/Westlake-AI/openmixup/blob/main/docs/en/install.md#install-openmixup and everything went well (except Apex but it's optional).

When I run the first Getting Started example command I get the following error:

$ bash tools/dist_train.sh configs/classification/imagenet/resnet/resnet50_rsb_a3_sz160_8xb256_ep100.py 1 --auto_resume
/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
Please install scikit-image.
Please install scikit-image with PyPi
usage: train.py [-h] [--work_dir WORK_DIR] [--resume_from RESUME_FROM] [--auto_resume] [--pretrained PRETRAINED] [--load_checkpoint LOAD_CHECKPOINT]
                [--gpus GPUS | --gpu_ids GPU_IDS [GPU_IDS ...] | --gpu-id GPU_ID] [--seed SEED] [--diff-seed] [--deterministic]
                [--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK] [--port PORT]
                config
train.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 865532) of binary: /home/leon/.conda/envs/openmixup2/bin/python
Traceback (most recent call last):
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-11_18:06:40
  host      : leon
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 865532)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
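
The usage line in the log lists `--local_rank`, while the launcher passed `--local-rank=0`. Since PyTorch 2.0, `torch.distributed.launch` passes the dashed spelling, so a script that registers only the underscored form rejects it. Below is a minimal sketch of one way to accept both spellings and to prefer the `LOCAL_RANK` environment variable, as the FutureWarning recommends. The names `build_parser`, `parse_args`, and `resolve_local_rank` are hypothetical, and this is not OpenMixup's actual `tools/train.py`.

```python
import argparse
import os


def build_parser():
    # Hypothetical stand-in for a training script's argument parser.
    parser = argparse.ArgumentParser(description="toy train script")
    parser.add_argument("config")
    parser.add_argument("--auto_resume", action="store_true")
    # Register both spellings so old (<2.0) and new (>=2.0) launchers work.
    parser.add_argument(
        "--local-rank", "--local_rank", dest="local_rank", type=int, default=0
    )
    return parser


def parse_args(argv=None):
    return build_parser().parse_args(argv)


def resolve_local_rank(args, env=None):
    # Prefer LOCAL_RANK from the environment (set by torchrun); fall back
    # to the command-line value when the variable is absent.
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", args.local_rank))
```

With this parser, both `--local-rank=0` and `--local_rank=0` parse to `args.local_rank == 0`.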

To Reproduce

Follow the installation instructions and execute the example command as described above.

Post related information

  1. The output of pip list | grep "openmixup\|^torch"
openmixup               0.2.7       /home/leon/projects/openmixup
torch                   2.0.1
torchaudio              2.0.2
torchgeometry           0.1.2
torchvision             0.15.2
  2. Your config file if you modified it or created a new one.

No modified config. I just changed the 8 gpus to 1 gpu in the example command.

Additional context

I initially tried to install everything by following the instructions here, but the last command, python setup.py develop, failed with this error:

AttributeError: module 'cv2' has no attribute '__version__'

leon-costa added the bug (Something isn't working) label on May 11, 2023
@Lupin1998
Member

Hi @leon-costa, sorry for the late reply. I ran bash tools/dist_train.sh configs/classification/imagenet/resnet/resnet50_rsb_a3_sz160_8xb256_ep100.py 1 --auto_resume and could not reproduce the error unrecognized arguments: --local-rank=0. I suggest running OpenMixup with PyTorch<=1.13.1 and checking that you are using the latest OpenMixup source code, with which I have found no errors in installation or DDP training. Currently, OpenMixup has some errors when running with PyTorch==2.0.1. You can try the following script:

conda create -n openmixup python=3.8 pytorch=1.13 cudatoolkit=11.6 torchvision -c pytorch -y
conda activate openmixup
pip install openmim
mim install mmcv-full
pip install opencv-python
git clone https://github.com/Westlake-AI/openmixup.git
cd openmixup
python setup.py develop
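
The PyTorch<=1.13.1 suggestion can be checked before launching a run. The sketch below is illustrative only: `version_tuple`, `is_supported`, and `MAX_SUPPORTED` are hypothetical names, not part of OpenMixup, and the caller supplies the installed version string (for example `torch.__version__`).

```python
import re

# Upper bound taken from the suggestion in this thread (PyTorch<=1.13.1).
MAX_SUPPORTED = (1, 13, 1)


def version_tuple(version):
    # Keep only the leading "major.minor[.patch]" part, so local build
    # suffixes such as "+cu116" are ignored.
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def is_supported(version, max_supported=MAX_SUPPORTED):
    return version_tuple(version) <= max_supported
```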

Lupin1998 self-assigned this on May 15, 2023
@leon-costa
Author

Hi. Thank you for your reply.

Yes I'm on the latest commit on the main branch.

I tried your commands:

  • conda create -n openmixup python=3.8 pytorch=1.13 cudatoolkit=11.6 torchvision -c pytorch -y failed with PackagesNotFoundError: The following packages are not available from current channels: - cudatoolkit=11.6
  • I tried with cudatoolkit=10.1 instead (like in the install.md) and it worked
  • all the other steps worked
  • when I ran bash tools/dist_train.sh configs/classification/imagenet/resnet/resnet50_rsb_a3_sz160_8xb256_ep100.py 1 --auto_resume again it failed with a new error: AttributeError: module 'cv2' has no attribute 'COLOR_BGR2RGB'
  • I tried installing opencv-python 4.5.4.60 as suggested in open-mmlab/mmocr#720 ("Cv2 error") and it fixed the error
  • then when I ran resnet50_rsb_a3_sz160_8xb256_ep100.py I got a ZeroDivisionError: integer division or modulo by zero, which is caused by torch.cuda.device_count() returning 0
  • It looks like this torch version is not compiled with CUDA support (torch.zeros(1).cuda() raises AssertionError: Torch not compiled with CUDA enabled)
  • I tried again by following the PyTorch instructions to install 1.13.1 and combining them with your recommendations and the other fixes:
conda create -n openmixup python=3.8 pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia -y
conda activate openmixup
mim install mmcv-full
pip install opencv-python==4.5.4.60
git clone https://github.com/Westlake-AI/openmixup.git
cd openmixup
python setup.py develop

And it worked, I was able to start a training.
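
Two of the failures above (cv2 missing expected attributes, and torch.cuda.device_count() returning 0) can be detected before starting a run. Below is a small sanity check, written as a sketch: `check_attrs` and `torch_cuda_report` are hypothetical helper names, while `torch.version.cuda`, `torch.cuda.is_available()`, and `torch.cuda.device_count()` are standard PyTorch attributes.

```python
import importlib


def check_attrs(module_name, attrs):
    # Report which attributes a module exposes; None if it cannot be imported.
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return {name: hasattr(module, name) for name in attrs}


def torch_cuda_report():
    # Summarize the CUDA state of the installed torch build, if any.
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    # A broken opencv-python install shows False for these attributes.
    print(check_attrs("cv2", ["__version__", "COLOR_BGR2RGB"]))
    # A device_count of 0 explains the ZeroDivisionError seen above.
    print(torch_cuda_report())
```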

@Lupin1998
Member

Thanks for your detailed solution, @leon-costa! 👍 We will add a reference to this issue in install.md. To summarize, the main problems stem from the PyTorch installation and the version of opencv-python.
