Program exits unexpectedly #13

nitthilan · 2020-10-19T03:53:31Z

Describe the bug
When I run the code on the Bike dataset, it exits unexpectedly: the dataloader process gets terminated.

To Reproduce
Command I used for running:

python -u train.py ${DATASET} --user-dir fairnr --task single_object_rendering --train-views "0..100" --view-resolution "800x800" --max-sentences 1 --view-per-batch 2 --pixel-per-view 2048 --no-preload --sampling-on-mask 1.0 --no-sampling-at-reader --valid-views "100..200" --valid-view-resolution "400x400" --valid-view-per-batch 1 --transparent-background "1.0,1.0,1.0" --background-stop-gradient --arch nsvf_base --initial-boundingbox ${DATASET}/bbox.txt --raymarching-stepsize-ratio 0.125 --discrete-regularization --color-weight 128.0 --alpha-weight 1.0 --optimizer "adam" --adam-betas "(0.9, 0.999)" --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 --criterion "srn_loss" --clip-norm 0.0 --num-workers 2 --seed 2 --save-interval-updates 500 --max-update 150000 --virtual-epoch-steps 5000 --save-interval 1 --half-voxel-size-at "5000,25000,75000" --reduce-step-size-at "5000,25000,75000" --pruning-every-steps 2500 --keep-interval-updates 5 --keep-last-epochs 5 --log-format simple --log-interval 1 --save-dir ${SAVE} --tensorboard-logdir ${SAVE}/tensorboard | tee -a $SAVE/train.log

Probably this is something minor.

Regards,
K. J. Nitthilan

nitthilan · 2020-10-19T03:58:13Z

Attaching the log file for this run:

train.log

MultiPath · 2020-10-19T03:58:41Z

Hi, your log looks a bit strange to me. Why does it report all the tensors as "torch.float64"? They should be float32.

MultiPath · 2020-10-19T03:59:35Z

Also, I noticed you ran out of memory at the second step. Can you pull the latest code and run with Python 3.7?
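
Both observations can be checked against a saved log with a fixed-string search. The sketch below is only an illustration: `count_log_matches` is a hypothetical helper, not part of the codebase, and it assumes the dtype name and the out-of-memory message appear as literal text in the log (the exact wording of the message is an assumption).

```shell
# Hypothetical helper: count the log lines that contain a fixed string.
# $1 = text to look for, $2 = path to the log file.
count_log_matches() {
    # -F: treat $1 as a literal string, so "." is not a wildcard
    # -i: ignore case, -c: print the number of matching lines
    # grep exits non-zero when the count is 0, hence the "|| true"
    grep -ciF -- "$1" "$2" || true
}
```

For example, `count_log_matches "torch.float64" train.log` and `count_log_matches "out of memory" train.log` each print a line count; a count of 0 means the string never shows up.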

nitthilan · 2020-10-19T04:01:04Z

Not sure why. Is there a parameter I have to set to switch from float64 to another datatype?

Sure, I will try with Python 3.7.
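
Before rerunning, it may help to confirm which interpreter the `python` in the training command resolves to. A minimal sketch, assuming the usual `Python X.Y.Z` version banner; `python_minor` is a hypothetical helper name:

```shell
# Hypothetical helper: reduce a "Python X.Y.Z" banner to "X.Y".
python_minor() {
    v="${1#Python }"          # drop the leading "Python "
    printf '%s\n' "${v%.*}"   # drop the trailing ".Z" patch component
}

# Usage (2>&1 because older interpreters print the banner to stderr):
#   python_minor "$(python --version 2>&1)"
```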

MultiPath · 2020-10-19T04:08:19Z

It should work in float32 automatically, I think.

nitthilan · 2020-10-19T08:48:26Z

The problem seems to be related to memory allocation. With Python 3.7 it no longer seems to happen. I also had to reduce --view-per-batch to 3 instead of 4.

nitthilan · 2020-10-19T10:01:29Z

The issue seems to happen if we set --num-workers 1 instead of 0. I assume a nonzero value would let us use the GPU better, since it preloads data. Is this supported?

Another question: there is a warning that says training would be faster with the --fp16 flag. Does this flag work? When I try it, it throws an error.

nitthilan · 2020-10-19T10:14:32Z

One more question: if I change the training resolution from 800x800 to 400x400, it throws the following error.

How can I reduce the size of the input images?

MultiPath · 2020-10-19T16:45:28Z

> The issue seems to happen if we set --num-workers 1 instead of 0. I assume a nonzero value would let us use the GPU better, since it preloads data. Is this supported?

In my experience, setting --num-workers above 0 makes training much slower in this codebase. I am not sure why, so I always keep it at 0, which reads the data in the main process.

> Another question: there is a warning that says training would be faster with the --fp16 flag. Does this flag work? When I try it, it throws an error.

This is an option to speed up training with mixed precision: it computes in float16 instead of float32. What error did you get? I have not actively tested fp16 recently.

MultiPath · 2020-10-19T16:48:24Z

> One more question: if I change the training resolution from 800x800 to 400x400, it throws the following error.
>
> How can I reduce the size of the input images?

Do you have the script for this?

nitthilan · 2020-10-21T09:20:19Z

The command I used is below:

export DATASET="../../../data/NSVF/Synthetic_NSVF/Bike/"
export SAVE="./bike_chkpts_3/"
python -u train.py ${DATASET} \
    --user-dir fairnr \
    --task single_object_rendering \
    --train-views "0..100" --view-resolution "400x400" \
    --max-sentences 1 --view-per-batch 2 --pixel-per-view 2048 \
    --no-preload \
    --sampling-on-mask 1.0 --no-sampling-at-reader \
    --valid-views "100..200" --valid-view-resolution "400x400" \
    --valid-view-per-batch 1 \
    --transparent-background "1.0,1.0,1.0" --background-stop-gradient \
    --arch nsvf_base \
    --initial-boundingbox ${DATASET}/bbox.txt \
    --use-octree \
    --raymarching-stepsize-ratio 0.125 \
    --discrete-regularization \
    --color-weight 128.0 --alpha-weight 1.0 \
    --optimizer "adam" --adam-betas "(0.9, 0.999)" \
    --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 \
    --criterion "srn_loss" --clip-norm 0.0 \
    --num-workers 0 \
    --seed 2 \
    --save-interval-updates 500 --max-update 150000 \
    --virtual-epoch-steps 5000 --save-interval 1 \
    --half-voxel-size-at "5000,25000,75000" \
    --reduce-step-size-at "5000,25000,75000" \
    --pruning-every-steps 2500 \
    --keep-interval-updates 5 --keep-last-epochs 5 \
    --log-format simple --log-interval 1 \
    --save-dir ${SAVE} \
    --tensorboard-logdir ${SAVE}/tensorboard \
    | tee -a ${SAVE}/train.log

MultiPath · 2020-10-21T18:26:56Z

The script looks OK to me. Does it fail only when changing "800x800" to "400x400"?

nitthilan · 2020-10-21T19:28:15Z

Yes, just from changing 800x800 to 400x400.

MultiPath · 2020-10-22T03:48:31Z

OK, I checked and found a bug in the loss function that I had not noticed before.
Please see the recent commit where I fixed it:
ed039d2

nitthilan closed this as completed Oct 19, 2020

nitthilan reopened this Oct 19, 2020

MultiPath closed this as completed Oct 24, 2020
