Program exits unexpectedly #13
Comments
Attaching the log file for the same.
Hi, your log looks a bit strange to me. Why does it report that all the tensors are in "torch.float64"? They should be float32.
I also noticed that you ran out of memory at the second step. Can you pull the latest code and run with Python 3.7?
Not sure why. Is there a parameter I have to set to move it from float64 to another datatype? Sure, I will try with Python 3.7.
It should automatically work in float32, I think.
The problem seems to be related to memory allocation. When I used Python 3.7, it did not seem to happen. Also, I had to reduce --view-per-batch to 3 instead of 4.
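For a rough sense of why both the dtype and --view-per-batch matter for memory, here is a back-of-the-envelope sketch. It assumes the number of rays per step is the product of --max-sentences, --view-per-batch and --pixel-per-view, and that per-ray buffers scale linearly with the element size. The helper names and `values_per_ray` are hypothetical and not part of the fairnr code; real usage also depends on the voxel grid and the number of samples per ray.

```python
import struct

# Bytes per element for each IEEE 754 floating-point width.
BYTES_PER_ELEMENT = {
    "float64": struct.calcsize("<d"),  # 8 bytes
    "float32": struct.calcsize("<f"),  # 4 bytes
    "float16": struct.calcsize("<e"),  # 2 bytes
}


def rays_per_batch(view_per_batch, pixel_per_view, max_sentences=1):
    """Assumed ray count per step: scenes x views x sampled pixels."""
    return max_sentences * view_per_batch * pixel_per_view


def per_ray_buffer_bytes(num_rays, values_per_ray, dtype):
    """Size of one dense per-ray buffer (hypothetical, for illustration)."""
    return num_rays * values_per_ray * BYTES_PER_ELEMENT[dtype]


if __name__ == "__main__":
    for views in (4, 3, 2):
        rays = rays_per_batch(views, 2048)
        for dtype in ("float64", "float32"):
            size = per_ray_buffer_bytes(rays, 1, dtype)
            print(f"views={views} rays={rays} dtype={dtype} bytes/value={size}")
```

Under these assumptions, going from 4 views to 3 cuts the ray count by a quarter, while running in float64 instead of float32 doubles every buffer.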
The issue seems to happen if we enable --num-workers 1 instead of 0. I assume this would let us use the GPU better, since it preloads data. Is this supported? Another query: there is a warning which says training would be faster with the --fp16 flag. Does this flag work? When I try it, it throws an error.
In my experience,
This is an option to speed up training with mixed precision. What error did you get? It will use float16 instead of float32 to compute. I have not actively tested fp16 recently.
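As a side note on what float16 changes numerically, the sketch below rounds Python floats to half precision using only the standard library. It is a generic illustration of float16 range and precision, not a diagnosis of the error reported in this thread; `to_float16` is a hypothetical helper.

```python
import struct


def to_float16(x):
    """Round a Python float (float64) to the nearest float16 and back."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the float16 range (max 65504);
        # map them to infinity here, as float16 arithmetic would.
        return float("inf") if x > 0 else float("-inf")


if __name__ == "__main__":
    for value in (1.0, 0.1, 1e-8, 70000.0):
        print(value, "->", to_float16(value))
```

Very small values flush to zero and large ones overflow, which is why mixed-precision training usually relies on loss scaling and why a loss that is fine in float32 can misbehave in float16.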
The command I used to execute is as below: export DATASET="../../../data/NSVF/Synthetic_NSVF/Bike/"
The scripts look OK to me. Does it fail when changing "800x800" to "400x400"?
Yes. Just changing from 800x800 to 400x400.
OK, I checked and found that it is a bug in the loss function which I had not noticed before.
Describe the bug
When I run the code for the Bike dataset, it exits unexpectedly. The dataloader process gets terminated.
To Reproduce
Command I used for running:
python -u train.py ${DATASET} --user-dir fairnr --task single_object_rendering --train-views "0..100" --view-resolution "800x800" --max-sentences 1 --view-per-batch 2 --pixel-per-view 2048 --no-preload --sampling-on-mask 1.0 --no-sampling-at-reader --valid-views "100..200" --valid-view-resolution "400x400" --valid-view-per-batch 1 --transparent-background "1.0,1.0,1.0" --background-stop-gradient --arch nsvf_base --initial-boundingbox ${DATASET}/bbox.txt --raymarching-stepsize-ratio 0.125 --discrete-regularization --color-weight 128.0 --alpha-weight 1.0 --optimizer "adam" --adam-betas "(0.9, 0.999)" --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 --criterion "srn_loss" --clip-norm 0.0 --num-workers 2 --seed 2 --save-interval-updates 500 --max-update 150000 --virtual-epoch-steps 5000 --save-interval 1 --half-voxel-size-at "5000,25000,75000" --reduce-step-size-at "5000,25000,75000" --pruning-every-steps 2500 --keep-interval-updates 5 --keep-last-epochs 5 --log-format simple --log-interval 1 --save-dir ${SAVE} --tensorboard-logdir ${SAVE}/tensorboard | tee -a $SAVE/train.log
Probably this is something minor.
Regards,
K. J. Nitthilan