This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

The error when training #39

Closed
daixiaolei623 opened this issue Oct 9, 2021 · 4 comments

Comments

@daixiaolei623

Thank you for your great work.
However, when I train with maskformer_swin_large_IN21k_384_bs16_160k_res640.yaml using the command:
./train_net.py --num-gpus 2 --config-file configs/ade20k-150/swin/maskformer_swin_large_IN21k_384_bs16_160k_res640.yaml

I get the following errors:
`MaskFormer Training Script.

This script is a simplified version of the training script in detectron2/tools.
: No such file or directory
import-im6.q16: not authorized copy' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized itertools' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized logging' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized os' @ error/constitute.c/WriteImage/1037.
from: can't read /var/mail/collections
from: can't read /var/mail/typing
import-im6.q16: not authorized torch' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized comm' @ error/constitute.c/WriteImage/1037.
from: can't read /var/mail/detectron2.checkpoint
from: can't read /var/mail/detectron2.config
from: can't read /var/mail/detectron2.data
from: can't read /var/mail/detectron2.engine
./train_net.py: line 21: syntax error near unexpected token ('
./train_net.py: line 21: from detectron2.evaluation import ('`

Could you please tell me what the problem is and how to solve it?
Thank you very much!

@bowenc0221
Contributor

Add python in front of the command, i.e. run python ./train_net.py ...
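
A note on why the first command fails, as far as the log shows: the messages from import-im6.q16 and from suggest that a shell, not Python, executed the file line by line (import being ImageMagick's screen-capture tool and from a mail utility). This typically happens when a script whose first line is not a #! line is launched as ./script.py. The sketch below is illustrative only; has_shebang and suggested_command are hypothetical helper names, not part of MaskFormer or detectron2.

```python
import os


def has_shebang(path):
    """Return True if the file starts with the two bytes '#!'."""
    with open(path, "rb") as f:
        return f.read(2) == b"#!"


def suggested_command(path):
    """Suggest how to launch a Python script from a shell.

    A script can be run directly (./script.py) only if it has a shebang
    line and is executable; otherwise it needs an explicit interpreter.
    """
    if has_shebang(path) and os.access(path, os.X_OK):
        return path
    return "python " + path


if __name__ == "__main__":
    # Hypothetical usage: check the script from this thread if it is present.
    script = "./train_net.py"
    if os.path.exists(script):
        print(suggested_command(script))
```

For a file that begins with a comment or docstring rather than a shebang, suggested_command returns the python-prefixed form, which matches the advice above.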

@daixiaolei623
Author

@bowenc0221
Thank you.
However, after adding python and installing cuda-11.1, I ran python ./train_net.py --num-gpus 2 --config-file /home/dai/code/semantic_segmentation/27/MaskFormer-master/configs/ade20k-150/swin/maskformer_swin_large_IN21k_384_bs16_160k_res640.yaml and got the following error:

`Command Line Args: Namespace(config_file='/home/dai/code/semantic_segmentation/27/MaskFormer-master/configs/ade20k-150/swin/maskformer_swin_large_IN21k_384_bs16_160k_res640.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
File "./train_net.py", line 270, in <module>
args=(args,),
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/detectron2/engine/launch.py", line 79, in launch
daemon=False,
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/dai/TOOL/anaconda3/envs/maskfromer/lib/python3.7/site-packages/detectron2/engine/launch.py", line 95, in _distributed_worker
assert torch.cuda.is_available(), "cuda is not available. Please check your installation."
AssertionError: cuda is not available. Please check your installation.`
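
The thread does not record how this error was resolved. As a first diagnostic step, one might check what PyTorch itself reports about CUDA before launching the training script. The following is a minimal sketch under that assumption; cuda_report is a hypothetical helper name, and it only uses standard PyTorch attributes (torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count()).

```python
def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup in this environment."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # False here is what triggers the AssertionError in the traceback.
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If cuda_available is False, or device_count is smaller than the value passed to --num-gpus, the multi-GPU launch is likely to fail in the same way as shown above.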

@daixiaolei623
Author

@bowenc0221
Thank you, I have solved the above error. However, my GPU is a 1080Ti, which runs out of memory, so I want to train on the CPU; my CPU has 64G of memory.
Could you please tell me how to train on the CPU?
Thank you.

@bowenc0221
Contributor

You can try adding MODEL.DEVICE 'cpu' at the end of your command, but I have never tested it with CPU.
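
To illustrate how trailing KEY VALUE pairs such as MODEL.DEVICE cpu reach the script: the Namespace printed earlier in this thread contains opts=[], which is where such pairs are collected before being merged into the config. The sketch below uses a simplified stand-in parser written for this example, not detectron2's actual parser, and the config path is a shortened placeholder.

```python
import argparse


def build_parser():
    """Simplified stand-in for a detectron2-style argument parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", default="")
    parser.add_argument("--num-gpus", type=int, default=1)
    # Everything left over after the known flags is collected into opts.
    parser.add_argument("opts", default=None, nargs=argparse.REMAINDER)
    return parser


def opts_to_overrides(opts):
    """Pair up a flat [KEY, VALUE, KEY, VALUE, ...] list into a dict."""
    return dict(zip(opts[0::2], opts[1::2]))


if __name__ == "__main__":
    argv = [
        "--num-gpus", "1",
        "--config-file", "config.yaml",  # placeholder path
        "MODEL.DEVICE", "cpu",
    ]
    args = build_parser().parse_args(argv)
    print(args.opts)
    print(opts_to_overrides(args.opts))
```

With this parsing, the two trailing words end up as a single override for the MODEL.DEVICE key. Whether MaskFormer then trains correctly on the CPU is untested, as noted above.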
