RuntimeError: Creating MTGP constants failed #15

Closed
ivanlengyel opened this issue Nov 29, 2021 · 9 comments

Comments

@ivanlengyel

Hi, I am trying to implement this repo.
I've downloaded the ade20k checkpoints and created a conda env following your yaml file.

When I run the testing command python test.py --name oasis_ade20k --dataset_mode ade20k --gpu_ids 0 --dataroot test_images --batch_size 1, I get the following error:

/opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCTensorScatterGather.cu:176: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [232,0,0], thread: [101,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    generated = model(None, label, "generate", None)
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/azureuser/IM/OASIS/models/models.py", line 72, in forward
    fake = self.netEMA(label)
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/azureuser/IM/OASIS/models/generator.py", line 36, in forward
    z = torch.randn(seg.size(0), self.opt.z_dim, dtype=torch.float32, device=dev)
RuntimeError: Creating MTGP constants failed. at /opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCTensorRandom.cu:35

I am running it on the test_images folder, which contains some ADE20K images.

Any suggestion?
Thanks ;)

@SushkoVadim
Contributor

Hi,

I am not totally sure, but it seems that you are not following the official structure of the Ade20k dataset.
Our dataloader expects the test data to lie under --dataroot/${...}/validation, and also expects to have label maps:

path_lab = os.path.join(self.opt.dataroot, "annotations", mode)

Could you try to re-arrange the folders so that they follow the official Ade20k structure?

@ivanlengyel
Author

ivanlengyel commented Nov 29, 2021

Hi, thanks for the reply ;)
I think I followed the official Ade20k structure:

$ tree test_images -L 2
test_images
├── annotations
│   └── validation
└── images
    └── validation

The .jpg files are in images/validation/ and the .png annotation files are in annotations/validation.

I don't know if this is related, but I am inspecting the outputs of the dataloader in this case:
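
The layout described above can be checked with a short script. This is only a sketch: the function name check_ade20k_layout is hypothetical, and it assumes that each image in images/<mode> has a label map with the same file stem in annotations/<mode>, which the thread does not state explicitly.

```python
import os


def check_ade20k_layout(dataroot, mode="validation"):
    """Return a list of problems found in the folder layout (empty if none)."""
    problems = []
    img_dir = os.path.join(dataroot, "images", mode)
    lab_dir = os.path.join(dataroot, "annotations", mode)
    for d in (img_dir, lab_dir):
        if not os.path.isdir(d):
            problems.append(f"missing directory: {d}")
    if problems:
        return problems
    # Assumption: a .jpg image and its .png label map share the same file stem.
    images = {os.path.splitext(f)[0] for f in os.listdir(img_dir) if f.endswith(".jpg")}
    labels = {os.path.splitext(f)[0] for f in os.listdir(lab_dir) if f.endswith(".png")}
    for stem in sorted(images - labels):
        problems.append(f"image without label map: {stem}.jpg")
    for stem in sorted(labels - images):
        problems.append(f"label map without image: {stem}.png")
    return problems
```

Calling check_ade20k_layout("test_images") returns an empty list when both folders exist and every image has a counterpart. It says nothing about whether the label maps themselves are in the right format.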

OASIS/test.py

Lines 23 to 24 in 6e728ec

for i, data_i in enumerate(dataloader_val):
    _, label = models.preprocess_input(opt, data_i)

and for data_i

np.unique(data_i['label'].cpu().numpy())
array([  0.,  10.,  26.,  39.,  51.,  56.,  70.,  77.,  80.,  83.,  90.,
       102., 116., 128., 153., 179., 181., 204., 230., 255.],
      dtype=float32)

and data_i['label'].cpu().numpy().shape = (1, 3, 256, 256)

I don't know if this is fine or not. I am only mentioning it because of the error Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed

and I read that the dataloader is expecting

opt.label_nc = 150
opt.contain_dontcare_label = True
opt.semantic_nc = 151 # label_nc + unknown

That is, up to 150 labels, so maybe I am somehow providing more?
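
The suspicion above can be turned into a quick sanity check on a batch before it reaches the model. This is a minimal sketch, not code from the repo: diagnose_label_batch is a hypothetical helper, and it assumes that a valid batch has a single channel of integer class indices in the range 0..label_nc (150 classes plus one "unknown" index, matching semantic_nc = 151).

```python
def diagnose_label_batch(shape, unique_values, label_nc=150):
    """Return a list of reasons a label batch would not fit label_nc classes."""
    issues = []
    if len(shape) != 4:
        issues.append(f"expected a 4-D batch, got shape {tuple(shape)}")
    elif shape[1] != 1:
        # Assumption: class indices live in one channel, not in RGB channels.
        issues.append(f"expected 1 channel of class indices, got {shape[1]}")
    # Flag values that are negative, above label_nc, or not whole numbers.
    bad = sorted(v for v in unique_values if v < 0 or v > label_nc or v != int(v))
    if bad:
        issues.append(f"values outside 0..{label_nc}: {bad}")
    return issues


# Shape and a few of the unique values reported above.
for issue in diagnose_label_batch((1, 3, 256, 256), [0.0, 10.0, 153.0, 255.0]):
    print(issue)
```

The arguments would come from data_i['label'].shape and np.unique(data_i['label'].cpu().numpy()). An index beyond the number of classes is the kind of input that would trip the indexValue assertion quoted above.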

@SushkoVadim
Copy link
Contributor

Thanks for trying it out!
Another guess: could it happen that we use different versions of Ade20k?
See here: #7

In our version we had only 150 classes; yours seems to have more?

@ivanlengyel
Author

ivanlengyel commented Nov 29, 2021

It seems to me that the annotations.png images are loaded as regular images, and then we have (batch_size, 3, 256, 256) labels instead of (batch_size, nc, 256, 256).

Do you think that this could be the problem?

Edit: in my case I obtained values between 0 and 255 because of the RGB colors.

@ivanlengyel
Author

ivanlengyel commented Nov 29, 2021

I don't know if there is a color-to-label mapping, or how the labels are supposed to be created from the 3-channel PNG images.

EDIT: I am following this answer to see if I can make it work:
#7 (comment)

@SushkoVadim
Contributor

The output of data_i['label'].shape is torch.Size([1, 1, 256, 256]) for me.
So something is wrong with the label map format (or the PIL version, but I don't think so).

@ivanlengyel
Author

Well, I have good and bad news.

The good news is that after downloading the dataset that you pointed out in #7 (comment) I don't obtain that error any more.
So I guess that the validation annotations should be grayscale (that is, 1-channel) images. I don't know why, but I had downloaded some ADE challenge images in which the annotations are in color. That was the cause of the error.
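
One way to spot color annotations before running the model is to read the color type stored in each PNG header. The following is a sketch using only the standard library. The function names are hypothetical, and it relies on the PNG file layout (8-byte signature followed by the IHDR chunk), not on anything in the OASIS repo.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
COLOR_TYPES = {0: "grayscale", 2: "RGB", 3: "palette", 4: "grayscale+alpha", 6: "RGBA"}


def png_color_type(header):
    """Return the color-type name from the first 26 bytes of a PNG file."""
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG file (bad signature or missing IHDR)")
    # IHDR data: width (4 bytes), height (4), bit depth (1), color type (1).
    _width, _height, _bit_depth, color_type = struct.unpack(">IIBB", header[16:26])
    return COLOR_TYPES.get(color_type, f"unknown ({color_type})")


def annotation_color_type(path):
    """Read only the header of the PNG at `path` and report its color type."""
    with open(path, "rb") as f:
        return png_color_type(f.read(26))
```

A result of "grayscale" matches the format that worked here, while "RGB" or "RGBA" matches the failing case. A "palette" result is ambiguous, since how it decodes depends on the image loader, so it would need a closer look.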

The bad news is that I have a new error XD:

python test.py --name oasis_ade20k --dataset_mode ade20k --gpu_ids 0  \
--dataroot ADEChallengeData2016 --batch_size 1
----------------- Options ---------------
                EMA_decay: 0.9999
               batch_size: 1
               channels_G: 64
          checkpoints_dir: ./checkpoints
                ckpt_iter: best
                 dataroot: ADEChallengeData2016          	[default: ./datasets/cityscapes/]
             dataset_mode: ade20k                        	[default: coco]
                  gpu_ids: 0
                     name: oasis_ade20k                  	[default: label2coco]
               no_3dnoise: False
                   no_EMA: False
                  no_flip: False
         no_spectral_norm: False
           num_res_blocks: 6
          param_free_norm: syncbatch
                    phase: test                          	[default: train]
              results_dir: ./results/
                     seed: 42
                 spade_ks: 3
                    z_dim: 64
----------------- End -------------------
Created Ade20kDataset, size train: 2000, size val: 2000
Created OASIS_Generator with 74314691 parameters
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    generated = model(None, label, "generate", None)
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/azureuser/IM/OASIS/models/models.py", line 72, in forward
    fake = self.netEMA(label)
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/azureuser/IM/OASIS/models/generator.py", line 43, in forward
    x = self.body[i](x, seg)
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/azureuser/IM/OASIS/models/generator.py", line 78, in forward
    dx = self.conv_0(self.activ(self.norm_0(x, seg)))
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 485, in __call__
    hook(self, input)
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/utils/spectral_norm.py", line 100, in __call__
    setattr(module, self.name, self.compute_weight(module, do_power_iteration=module.training))
  File "/anaconda/envs/oasis/lib/python3.7/site-packages/torch/nn/utils/spectral_norm.py", line 86, in compute_weight
    sigma = torch.dot(u, torch.mv(weight_mat, v))
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCBlas.cu:116
(oasis) -----

I guess I'll have to debug this new error. However there is not much info in the error message :S

@ivanlengyel
Author

ivanlengyel commented Nov 29, 2021

It seems that this error is related to certain CUDA/PyTorch version combinations.

I will try to use newer versions of cuda and torch to see if I can fix this error.

However, what intrigues me is that I am using the same versions as you, since I created the env from the provided yaml.
Anyway, I will investigate this and see if I can fix it. If I can, I'll make sure to post the fix here so that other users with the same problem have an answer.
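
When comparing setups for this kind of error, it may help to record the exact PyTorch build, the CUDA version it was compiled against, and the GPU model. The sketch below gathers that information. The function name describe_torch_env is hypothetical, and the thread does not establish which combination is at fault.

```python
def describe_torch_env():
    """Collect PyTorch/CUDA version info as a dict of strings."""
    info = {}
    try:
        import torch
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    # CUDA version the PyTorch binary was built with (None for CPU-only builds).
    info["cuda_build"] = str(torch.version.cuda)
    info["cuda_available"] = str(torch.cuda.is_available())
    if torch.cuda.is_available():
        info["device"] = torch.cuda.get_device_name(0)
        major, minor = torch.cuda.get_device_capability(0)
        info["compute_capability"] = f"{major}.{minor}"
    return info


for key, value in describe_torch_env().items():
    print(f"{key}: {value}")
```

Posting this output alongside the traceback makes it easier for others to see whether their environment differs from the one created by the yaml file.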

@ivanlengyel
Author

I am closing this issue since the original problem was fixed.

🚫 The problem was that the dataloader expects 1-channel grayscale label images, and somehow I had downloaded an ADE test set in which the labels are in color. So my labels were (batch_size, 3, 256, 256) instead of (batch_size, nc, 256, 256).

✅ Using grayscale images fixed my original issue.

Thanks @SushkoVadim for the fast reply and the nice repo.
