This repository has been archived by the owner on Apr 1, 2024. It is now read-only.

About apex and the args "--fp16" #33

Closed
yumi-cn opened this issue Dec 11, 2020 · 3 comments

Comments

@yumi-cn

yumi-cn commented Dec 11, 2020

I have already installed the nvidia/apex module in my env (which the project README says is optional).

When I add the arg "--fp16" to the train script:

python -u train.py ${DATASET} \
    ... \
    --fp16 \
    ... \
    --tensorboard-logdir ${SAVE}/tensorboard \
    | tee -a $SAVE/train.log

It raises some errors; the main error report is about c10::Error:

...
terminate called after throwing an instance of 'c10::Error'
...

Something similar to fairseq issue #1683 (closed with no response).

I tried to find ways to solve this, like adding the arg "--ddp-backend=no_c10d", but that causes the same error.

I haven't read all the main code of the project, but I guess you are more familiar with these problems, so I am posting this issue.

Thanks in advance for any reply.

BTW: training without "--fp16" always works fine, and the env is almost the same as the requirements file in the README.

@MultiPath
Contributor

Hi, I am sorry for the late reply, as I was busy with other things.
--fp16 (mixed-precision training) only works on certain GPUs, such as the Nvidia V100. It helps reduce GPU memory usage.
Maybe your GPU does not support it?

@yumi-cn
Author

yumi-cn commented Dec 12, 2020

> Hi, I am sorry for the late reply, as I was busy with other things.
> --fp16 (mixed-precision training) only works on certain GPUs, such as the Nvidia V100. It helps reduce GPU memory usage.
> Maybe your GPU does not support it?

My GPUs are RTX 2080 Ti (11 GB) x 4 in the server docker env. I checked: it is the Turing arch and has Tensor Core support.

Maybe I am using --fp16 in the wrong way in the command, or it is some other env setting problem? Confusing.
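(For context, a sketch of the Tensor Core check mentioned above, not part of the original thread: Tensor Cores require CUDA compute capability 7.0 or higher, i.e. Volta and later; the RTX 2080 Ti is Turing, capability 7.5. With PyTorch installed, the `(major, minor)` pair comes from `torch.cuda.get_device_capability()`.)

```python
# Sketch: decide Tensor Core support from a GPU's CUDA compute capability.
# Tensor Cores exist on compute capability 7.0+ (Volta, Turing, Ampere, ...).
# With PyTorch, obtain the pair via:
#   major, minor = torch.cuda.get_device_capability(device)

def has_tensor_cores(major: int, minor: int) -> bool:
    """Return True if a GPU with this compute capability has Tensor Cores."""
    return (major, minor) >= (7, 0)

# RTX 2080 Ti (Turing) reports compute capability 7.5:
print(has_tensor_cores(7, 5))   # Turing -> True
print(has_tensor_cores(6, 1))   # Pascal, e.g. GTX 1080 Ti -> False
```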

@MultiPath
Contributor

I will check --fp16 soon. I think it should work, as I always used fp16 in my early experiments. However, I am afraid it may cause inaccurate rendering results, so I usually turned it off later.
