
Custom Data Resolution for Training #67

Open
Gaussianer opened this issue Sep 8, 2021 · 10 comments

@Gaussianer

Hello @chenwydj,

We have already asked how to train FasterSeg with custom data, see here. However, we still have a question regarding the image resolution and the adjustments it requires in the code. We have found several places that reference the image resolution or at least appear to depend on it. See here, here, here, here, here, here, here, here, here, here and here.

Do all these values need to be adjusted to the resolution of the data set?
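For context, the places we mean include the resolution-related fields in the config files, roughly as below (a sketch of config_search.py with the Cityscapes default values, not copied verbatim from the repo; the comments are our reading of what each field controls):

# config_search.py -- resolution-related fields (sketch; Cityscapes defaults)
C.down_sampling = 2      # images are downsampled by this factor when loaded for the search
C.image_height = 256     # input height fed to the network during search/pretrain
C.image_width = 512      # input width fed to the network during search/pretrain
C.gt_down_sampling = 8   # ground truth is downsampled by this factor for the loss
C.eval_height = 1024     # evaluation height (Cityscapes full resolution)
C.eval_width = 2048      # evaluation width (Cityscapes full resolution)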

Thank you for providing FasterSeg and for your support.

@ogkdmr

ogkdmr commented Sep 14, 2021

I'm very interested in hearing what @chenwydj has to say about this.

@i-am-nut

Hey @Gaussianer, did you manage to train FasterSeg with a custom dataset following the guidelines in #46?

@Gaussianer
Author

Hey @EmersonJr,
Yes, we have provided a repo for this as well.
Look here: https://github.com/Gaussianer/FasterSeg

@Gaussianer
Author

However, we cannot yet provide any information in the repo about how much of the code has to be adapted to the resolution.
We have trained several models, but we wonder whether the resolution needs to be adjusted to improve the results.

@i-am-nut

i-am-nut commented Dec 1, 2021

No worries @Gaussianer, thanks for replying here :)
I'm also a master's student working with real-time image segmentation; in my case it's aimed at images containing sugar cane and weeds.

I have some questions you could probably help with, since you did custom training. I'm just not sure this is the best place, but anyway...

I basically want to train FasterSeg with a custom dataset as well, but my classes have nothing to do with any of the Cityscapes classes. My classes are: Sugar Cane and Weeds (should I count Background towards the number of classes as well?)
I'm encoding them in the ground truth images (annotations) as follows: Sugar Cane pixels are [0,0,0], Weeds are [1,1,1] and everything else (Background) is [255,255,255]. Here's an example image (the image is 1024x2048 by mistake; I know I'll need to generate 2048x1024 instead):

[example annotation image]

What should I change in your repo's code to train with a dataset containing these images?
Thanks in advance, mate!

@Gaussianer
Author

Gaussianer commented Dec 1, 2021

First of all, you have to create the dataset according to the description. For this you have to generate the provided labelDefinitions.csv according to the template. There you can also see the corresponding attributes for the background (unlabeled). Just try to go through our description; some parts may not be documented yet, so if you run into problems, please contact me. Then I can improve it so that others can benefit from it as well.

@i-am-nut

i-am-nut commented Dec 2, 2021

Thanks @Gaussianer.
So, I've followed the description and also created my own labelDefinitions.csv. Here it is:

name,id,trainId,category,catId,hasInstances,ignoreInEval,color_r,color_g,color_b
unlabeled,0,255,void,0,False,False,0,0,0
sugar cane,1,0,void,0,False,False,100,50,15
weeds,2,1,void,0,False,False,247,103,0

I created it that way because my background (unlabeled) pixels in the _labelTrainIds.png files are [255,255,255], Sugar Cane pixels are [0,0,0] and Weeds are [1,1,1].
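(For reference, that pixel coding corresponds to the trainId column of the CSV above. Here is a minimal numpy sketch of that mapping; it assumes a single-channel trainId output, and the file names are just placeholders:)

import numpy as np
from PIL import Image

rgb = np.array(Image.open("example_gt_color.png"))        # (H, W, 3) RGB annotation
train_ids = np.full(rgb.shape[:2], 255, dtype=np.uint8)   # default everything to unlabeled (255)
train_ids[np.all(rgb == [0, 0, 0], axis=-1)] = 0          # sugar cane -> trainId 0
train_ids[np.all(rgb == [1, 1, 1], axis=-1)] = 1          # weeds -> trainId 1
Image.fromarray(train_ids).save("example_gt_labelTrainIds.png")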

I also edited config_search.py and config_train.py to set C.num_classes = 3 for my case. However, when I run CUDA_VISIBLE_DEVICES=0 python train_search.py I get the error shown below:

root@5be7442709af:/home/FasterSeg/search# CUDA_VISIBLE_DEVICES=0 python train_search.py
use TensorRT for latency test
use TensorRT for latency test
Experiment dir : search-pretrain-256x512_F12.L16_batch3-20211202-161628
02 16:16:28 args = {'seed': 12345, 'repo_name': 'FasterSeg', 'abs_dir': '/home/FasterSeg/search', 'this_dir': 'search', 'root_dir': '/home/FasterSeg', 'dataset_path': '/home/FasterSeg/dataset', 'img_root_folder': '/home/FasterSeg/dataset', 'gt_root_folder': '/home/FasterSeg/dataset', 'train_source': '/home/FasterSeg/dataset/train_mapping_list.txt', 'eval_source': '/home/FasterSeg/dataset/val_mapping_list.txt', 'num_classes': 3, 'background': -1, 'image_mean': array([0.485, 0.456, 0.406]), 'image_std': array([0.229, 0.224, 0.225]), 'down_sampling': 2, 'image_height': 256, 'image_width': 512, 'gt_down_sampling': 8, 'num_train_imgs': 50, 'num_eval_imgs': 25, 'bn_momentum': 0.1, 'bn_eps': 1e-05, 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0005, 'num_workers': 4, 'train_scale_array': [0.75, 1, 1.25], 'eval_stride_rate': 0.8333333333333334, 'eval_scale_array': [1], 'eval_flip': False, 'eval_height': 1024, 'eval_width': 2048, 'grad_clip': 5, 'train_portion': 0.5, 'arch_learning_rate': 0.0003, 'arch_weight_decay': 0, 'layers': 16, 'branch': 2, 'pretrain': True, 'prun_modes': ['max', 'arch_ratio'], 'Fch': 12, 'width_mult_list': [0.3333333333333333, 0.5, 0.6666666666666666, 0.8333333333333334, 1.0], 'stem_head_width': [(1, 1), (0.6666666666666666, 0.6666666666666666)], 'FPS_min': [0, 155.0], 'FPS_max': [0, 175.0], 'batch_size': 3, 'niters_per_epoch': 400, 'latency_weight': [0, 0], 'nepochs': 20, 'save': 'search-pretrain-256x512_F12.L16_batch3-20211202-161628', 'unrolled': False}
02 16:16:36 params = 2.568351MB, FLOPs = 71.064453GB
architect initialized!
using downsampling: 2
Found 25 images
using downsampling: 2
Found 25 images
using downsampling: 2
Found 25 images
  0%|                                                    | 0/20 [00:00<?, ?it/s]02 16:25:11 True
02 16:25:11 search-pretrain-256x512_F12.L16_batch3-20211202-161628
02 16:25:11 lr: 0.02
02 16:25:11 update arch: False
[Epoch 1/20][trTraceback (most recent call last):        | 0/20 [00:00<?, ?it/s]
  File "train_search.py", line 307, in <module>
    main(pretrain=config.pretrain) 
  File "train_search.py", line 137, in main
    train(pretrain, train_loader_model, train_loader_arch, model, architect, ohem_criterion, optimizer, lr_policy, logger, epoch, update_arch=update_arch)
  File "train_search.py", line 246, in train
    loss = model._loss(imgs, target, pretrain)
  File "/home/FasterSeg/search/model_search.py", line 489, in _loss
    logits = self(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/FasterSeg/search/model_search.py", line 287, in forward
    out_prev = [[stem(input), None]] # stem: one cell
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/FasterSeg/search/operations.py", line 127, in forward
    x = self.conv(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py", line 21, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 276, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

By the way, the container I'm running was built via the Dockerfile installation process. If I follow the same steps and run the training command above in your provided image from Docker Hub, it doesn't detect that TensorRT is installed and I get this error:

root@be035b6f0647:/home/FasterSeg/search# CUDA_VISIBLE_DEVICES=0 python train_search.py
/home/FasterSeg/tools/utils/darts_utils.py:179: UserWarning: TensorRT (or pycuda) is not installed. compute_latency_ms_tensorrt() cannot be used.
  warnings.warn("TensorRT (or pycuda) is not installed. compute_latency_ms_tensorrt() cannot be used.")
use PyTorch for latency test
use PyTorch for latency test
Experiment dir : search-pretrain-256x512_F12.L16_batch3-20211202-152200
02 15:22:00 args = {'seed': 12345, 'repo_name': 'FasterSeg', 'abs_dir': '/home/FasterSeg/search', 'this_dir': 'search', 'root_dir': '/home/FasterSeg', 'dataset_path': '/home/FasterSeg/dataset', 'img_root_folder': '/home/FasterSeg/dataset', 'gt_root_folder': '/home/FasterSeg/dataset', 'train_source': '/home/FasterSeg/dataset/train_mapping_list.txt', 'eval_source': '/home/FasterSeg/dataset/val_mapping_list.txt', 'num_classes': 3, 'background': -1, 'image_mean': array([0.485, 0.456, 0.406]), 'image_std': array([0.229, 0.224, 0.225]), 'down_sampling': 2, 'image_height': 256, 'image_width': 512, 'gt_down_sampling': 8, 'num_train_imgs': 0, 'num_eval_imgs': 0, 'bn_momentum': 0.1, 'bn_eps': 1e-05, 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0005, 'num_workers': 4, 'train_scale_array': [0.75, 1, 1.25], 'eval_stride_rate': 0.8333333333333334, 'eval_scale_array': [1], 'eval_flip': False, 'eval_height': 1024, 'eval_width': 2048, 'grad_clip': 5, 'train_portion': 0.5, 'arch_learning_rate': 0.0003, 'arch_weight_decay': 0, 'layers': 16, 'branch': 2, 'pretrain': True, 'prun_modes': ['max', 'arch_ratio'], 'Fch': 12, 'width_mult_list': [0.3333333333333333, 0.5, 0.6666666666666666, 0.8333333333333334, 1.0], 'stem_head_width': [(1, 1), (0.6666666666666666, 0.6666666666666666)], 'FPS_min': [0, 155.0], 'FPS_max': [0, 175.0], 'batch_size': 3, 'niters_per_epoch': 400, 'latency_weight': [0, 0], 'nepochs': 20, 'save': 'search-pretrain-256x512_F12.L16_batch3-20211202-152200', 'unrolled': False}
02 15:22:09 params = 2.568351MB, FLOPs = 71.064453GB
Traceback (most recent call last):
  File "train_search.py", line 306, in <module>
    main(pretrain=config.pretrain) 
  File "train_search.py", line 69, in main
    model = model.cuda()
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 205, in _apply
    self._buffers[key] = fn(buf)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 265, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

So I'm sticking with the first container. Do you have any idea what's happening in this case?

@Gaussianer
Author

Gaussianer commented Dec 3, 2021

@EmersonJr Did you install the Docker NVIDIA container runtime as in the installation description?

Regarding TensorRT: yes, we had to remove TensorRT from the environment because it always led to errors during training.

@i-am-nut

i-am-nut commented Dec 3, 2021

@Gaussianer I noticed that I had missed that step by mistake. I installed it now and retried training, but it's still giving the same error (yes, I did restart the Docker service, rebooted, and even ran a new container). Any ideas?

@Gaussianer
Author

@EmersonJr Have you installed the appropriate graphics card driver as well as CUDA 10.1 and cuDNN? We have provided a guide for CentOS 7 covering the setup with Podman.
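One quick way to check all three from inside the container is a plain PyTorch probe (a generic check, nothing FasterSeg-specific):

# run inside the container with python
import torch
print(torch.__version__)                # PyTorch build
print(torch.version.cuda)               # CUDA version PyTorch was built against (expect 10.1)
print(torch.backends.cudnn.version())   # cuDNN version, or None if cuDNN is missing
print(torch.cuda.is_available())        # False means the container cannot see the NVIDIA driver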
