This repository has been archived by the owner on Apr 1, 2024. It is now read-only.

Program exits unexpectedly #13

Closed
nitthilan opened this issue Oct 19, 2020 · 14 comments

Comments

@nitthilan

Describe the bug
When I run the training code on the Bike dataset, it exits unexpectedly: the dataloader process gets terminated.

To Reproduce
Command I used for running:

python -u train.py ${DATASET} --user-dir fairnr --task single_object_rendering --train-views "0..100" --view-resolution "800x800" --max-sentences 1 --view-per-batch 2 --pixel-per-view 2048 --no-preload --sampling-on-mask 1.0 --no-sampling-at-reader --valid-views "100..200" --valid-view-resolution "400x400" --valid-view-per-batch 1 --transparent-background "1.0,1.0,1.0" --background-stop-gradient --arch nsvf_base --initial-boundingbox ${DATASET}/bbox.txt --raymarching-stepsize-ratio 0.125 --discrete-regularization --color-weight 128.0 --alpha-weight 1.0 --optimizer "adam" --adam-betas "(0.9, 0.999)" --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 --criterion "srn_loss" --clip-norm 0.0 --num-workers 2 --seed 2 --save-interval-updates 500 --max-update 150000 --virtual-epoch-steps 5000 --save-interval 1 --half-voxel-size-at "5000,25000,75000" --reduce-step-size-at "5000,25000,75000" --pruning-every-steps 2500 --keep-interval-updates 5 --keep-last-epochs 5 --log-format simple --log-interval 1 --save-dir ${SAVE} --tensorboard-logdir ${SAVE}/tensorboard | tee -a $SAVE/train.log

image

Probably this is something minor.

Regards,
K. J. Nitthilan

@nitthilan
Author

Attaching the log file for this run:

train.log

@MultiPath
Contributor

MultiPath commented Oct 19, 2020

Hi, your log looks a bit strange to me. Why does it report that all the tensors are in "torch.float64"? They should be float32.

@MultiPath
Contributor

Also, I noticed that you ran out of memory at the second step. Can you pull the latest code and run it with Python 3.7?
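
As a rough illustration of why float64 tensors matter for the out-of-memory error: a float64 element takes 8 bytes against 4 for float32, so the same tensor shapes need twice the memory. The sketch below uses only the Python standard library. The helper name `tensor_bytes` and the example shape are hypothetical and do not reflect the actual buffers allocated by the training code.

```python
import struct

# Bytes per element for the IEEE 754 formats discussed in this thread.
BYTES_PER_ELEMENT = {
    "float16": struct.calcsize("e"),
    "float32": struct.calcsize("f"),
    "float64": struct.calcsize("d"),
}


def tensor_bytes(shape, dtype):
    """Return the raw storage size in bytes of a dense tensor (hypothetical helper)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_ELEMENT[dtype]


# Illustrative shape only: 2 views x 2048 sampled pixels x 3 color channels,
# borrowed from --view-per-batch 2 and --pixel-per-view 2048 in the command above.
shape = (2, 2048, 3)
for dtype in ("float32", "float64"):
    print(dtype, tensor_bytes(shape, dtype), "bytes")
```

The real memory footprint depends on the model and the number of voxels hit per ray, so this only shows the factor of two between the two dtypes.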

@nitthilan
Author

I am not sure why. Is there a parameter I have to set to switch from float64 to another datatype?

Sure, I will try with Python 3.7.

@MultiPath
Contributor

It should run in float32 automatically, I think.

@nitthilan
Author

The problem seems to be related to memory allocation. With Python 3.7 it no longer seems to happen. Also, I had to reduce --view-per-batch to 3 instead of 4.

@nitthilan reopened this Oct 19, 2020
@nitthilan
Author

The issue seems to happen if we set --num-workers 1 instead of 0. I assume that with more workers we could use the GPU better, since data would be preloaded. Is this supported?

Another question: there is a warning saying that training would be faster with the --fp16 flag. Does this flag work? When I try it, it throws an error.

@nitthilan
Author

Adding another question. If I change the training resolution from 800x800 to 400x400, it throws the following error.

image

How can I reduce the size of the input images?

@MultiPath
Contributor

MultiPath commented Oct 19, 2020

> The issue seems to happen if we set --num-workers 1 instead of 0. I assume that with more workers we could use the GPU better, since data would be preloaded. Is this supported?

In my experience, setting --num-workers greater than 0 makes training much slower in this codebase. I am not sure why, so I always keep it at 0, which reads the data in the main process.

> Another question: there is a warning saying that training would be faster with the --fp16 flag. Does this flag work? When I try it, it throws an error.

This is an option to speed up training with mixed precision: it computes in float16 instead of float32. What error do you get? I have not actively tested fp16 recently.
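
To make the float16 versus float32 trade-off concrete, here is a small sketch using only the standard library's `struct` module, whose format code `e` is IEEE 754 half precision. It is not how the `--fp16` flag is implemented, and the helper `to_float16` is a hypothetical name used only for illustration.

```python
import struct


def to_float16(value):
    """Round-trip a Python float through IEEE 754 half precision (hypothetical helper)."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Half precision stores 2 bytes per value instead of 4 for single precision.
print(struct.calcsize("e"), struct.calcsize("f"))

# Only about 3 significant decimal digits survive the round trip.
print(to_float16(0.001))
print(to_float16(2048.4))

# Values beyond the largest finite half-precision number (65504) do not fit.
try:
    to_float16(70000.0)
except OverflowError as exc:
    print("overflow:", exc)
```

The halved storage is where the speed and memory gains come from. The reduced range and precision are why mixed-precision training can fail in code paths that were only tested in float32.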

@MultiPath
Contributor

> Adding another question. If I change the training resolution from 800x800 to 400x400, it throws the following error.
>
> image
>
> How can I reduce the size of the input images?

Do you have the script for this?

@nitthilan
Author

The command I ran is below:

export DATASET="../../../data/NSVF/Synthetic_NSVF/Bike/"
export SAVE="./bike_chkpts_3/"
python -u train.py ${DATASET} \
    --user-dir fairnr \
    --task single_object_rendering \
    --train-views "0..100" --view-resolution "400x400" \
    --max-sentences 1 --view-per-batch 2 --pixel-per-view 2048 \
    --no-preload \
    --sampling-on-mask 1.0 --no-sampling-at-reader \
    --valid-views "100..200" --valid-view-resolution "400x400" \
    --valid-view-per-batch 1 \
    --transparent-background "1.0,1.0,1.0" --background-stop-gradient \
    --arch nsvf_base \
    --initial-boundingbox ${DATASET}/bbox.txt \
    --use-octree \
    --raymarching-stepsize-ratio 0.125 \
    --discrete-regularization \
    --color-weight 128.0 --alpha-weight 1.0 \
    --optimizer "adam" --adam-betas "(0.9, 0.999)" \
    --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 \
    --criterion "srn_loss" --clip-norm 0.0 \
    --num-workers 0 \
    --seed 2 \
    --save-interval-updates 500 --max-update 150000 \
    --virtual-epoch-steps 5000 --save-interval 1 \
    --half-voxel-size-at "5000,25000,75000" \
    --reduce-step-size-at "5000,25000,75000" \
    --pruning-every-steps 2500 \
    --keep-interval-updates 5 --keep-last-epochs 5 \
    --log-format simple --log-interval 1 \
    --save-dir ${SAVE} \
    --tensorboard-logdir ${SAVE}/tensorboard \
    | tee -a $SAVE/train.log
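
Since only the resolution differs between the working and the failing run, one way to avoid editing the long command by hand is to build the argument list in Python. The sketch below is an assumption about workflow, not part of the repository: `build_train_args` is a hypothetical helper, it covers only a subset of the flags above, and it prints the command instead of running it.

```python
import shlex


def build_train_args(dataset, save_dir, train_resolution="800x800"):
    """Assemble a subset of the train.py arguments shown above (hypothetical helper)."""
    return [
        "python", "-u", "train.py", dataset,
        "--user-dir", "fairnr",
        "--task", "single_object_rendering",
        "--train-views", "0..100",
        "--view-resolution", train_resolution,
        "--valid-views", "100..200",
        "--valid-view-resolution", "400x400",
        "--arch", "nsvf_base",
        "--num-workers", "0",
        "--save-dir", save_dir,
    ]


args = build_train_args(
    "../../../data/NSVF/Synthetic_NSVF/Bike/",
    "./bike_chkpts_3/",
    train_resolution="400x400",
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(args))
```

Keeping the resolution as a single parameter makes it easy to reproduce the 800x800 and 400x400 runs with otherwise identical settings.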

@MultiPath
Contributor

The script looks OK to me. Does it fail only when you change "800x800" to "400x400"?

@nitthilan
Author

Yes, just by changing from 800x800 to 400x400.

@MultiPath
Contributor

MultiPath commented Oct 22, 2020

OK, I checked and found a bug in the loss function that I had not noticed before.
Please see the recent commit in which I fixed it:
ed039d2
