
--devnum parsing error when using multi-GPU functionality #152

Closed
Hong-Rui opened this issue Sep 16, 2021 · 4 comments
Labels: bug (Something isn't working)

Comments

@Hong-Rui

Hi developers,

I'm trying to use the new multi-GPU features in AD-GPU 1.5.1, compiled with OVERLAP=ON.
My local machine has 8 GPU cards. When I specify the CUDA device numbers, the command-line parser in main.cpp does not seem to parse the input CUDA IDs correctly:

When I run autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 1, it works correctly and prints the CUDA info:

Running Job #7:
    Device: GeForce RTX 2080 Ti (#1 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_2_isomer_0_conf_0_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2068966
    Local-search chosen method is: ADADELTA (ad)

When I run autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 2, it gives me a CUDA error like

Running Job #13:
    Device: GeForce RTX 2080 Ti (#2 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_8_isomer_0_conf_0_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2068966
    Local-search chosen method is: ADADELTA (ad)
gpu_calc_initpop_kernel an illegal memory access was encountered
autodock_gpu_128wi: ./cuda/kernel1.cu:65: void gpu_calc_initpop(uint32_t, uint32_t, float*, float*): Assertion `0' failed.
[1]    38987 abort (core dumped)  autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 2

The error output is the same for -D 2 through -D 8; every single-device index fails except -D 1.

And when I run autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 2, (note the trailing comma), it seems to parse CUDA index 2 twice, but runs normally:

Cuda device:                              GeForce RTX 2080 Ti (#2 / 8)
Available memory on device:               10439 MB (total: 11019 MB)

CUDA Setup time 0.280641s
Cuda device:                              GeForce RTX 2080 Ti (#2 / 8)
Available memory on device:               8301 MB (total: 11019 MB)

CUDA Setup time 0.000852s

Running Job #14:
    Device: GeForce RTX 2080 Ti (#2 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_9_isomer_0_conf_0_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2068966
    Local-search chosen method is: ADADELTA (ad)

Rest of Setup time 0.036681s
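
For illustration of the tokenization involved (my own minimal C++ sketch, not the actual main.cpp parser; the helper name parse_device_list is invented): a simple delimiter-splitting loop treats "2" and "2," identically, which makes the different behavior observed above look like a parsing quirk:

    // Illustration only (not the AutoDock-GPU parser): split a comma-delimited
    // device string such as "2", "2," or "2,3,7" into integer indices.
    // The helper name parse_device_list is invented for this sketch.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    static std::vector<int> parse_device_list(const std::string& arg)
    {
        std::vector<int> devices;
        std::stringstream ss(arg);
        std::string token;
        while (std::getline(ss, token, ','))   // "2" and "2," both yield one token
            if (!token.empty())                // skip empty tokens from stray commas
                devices.push_back(std::stoi(token));
        return devices;                        // "2,3,7" -> {2, 3, 7}
    }

    int main()
    {
        for (int d : parse_device_list("2,3,7"))
            std::cout << "device " << d << "\n";
        return 0;
    }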

But if I use multiple cards, e.g. autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 2,3,7, it works correctly and all specified GPU cards are detected:

Cuda device:                              GeForce RTX 2080 Ti (#2 / 8)
Available memory on device:               10746 MB (total: 11019 MB)

CUDA Setup time 0.291592s
Cuda device:                              GeForce RTX 2080 Ti (#3 / 8)
Available memory on device:               8883 MB (total: 11019 MB)

CUDA Setup time 0.236695s
Cuda device:                              GeForce RTX 2080 Ti (#7 / 8)
Available memory on device:               9072 MB (total: 11019 MB)

CUDA Setup time 0.212491s

Running Job #24:
    Device: GeForce RTX 2080 Ti (#7 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_17_isomer_0_conf_1_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2280181
    Local-search chosen method is: ADADELTA (ad)

Rest of Setup time 0.021547s

Finally, if I use -D all (autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D all), everything works as expected:

Cuda device:                              GeForce RTX 2080 Ti (#1 / 8)
Available memory on device:               2816 MB (total: 11019 MB)

CUDA Setup time 0.175057s
Cuda device:                              GeForce RTX 2080 Ti (#2 / 8)
Available memory on device:               10441 MB (total: 11019 MB)

CUDA Setup time 0.157140s
Cuda device:                              GeForce RTX 2080 Ti (#3 / 8)
Available memory on device:               7385 MB (total: 11019 MB)

CUDA Setup time 0.229185s
Cuda device:                              GeForce RTX 2080 Ti (#4 / 8)
Available memory on device:               3330 MB (total: 11019 MB)

CUDA Setup time 0.194629s
Cuda device:                              GeForce RTX 2080 Ti (#5 / 8)
Available memory on device:               7873 MB (total: 11019 MB)

CUDA Setup time 0.257307s
Cuda device:                              GeForce RTX 2080 Ti (#6 / 8)
Available memory on device:               8565 MB (total: 11019 MB)

CUDA Setup time 0.204192s
Cuda device:                              GeForce RTX 2080 Ti (#7 / 8)
Available memory on device:               8097 MB (total: 11019 MB)

CUDA Setup time 0.226710s
Cuda device:                              GeForce RTX 2080 Ti (#8 / 8)
Available memory on device:               9240 MB (total: 11019 MB)

CUDA Setup time 0.351348s

Running Job #1:
    Device: GeForce RTX 2080 Ti (#1 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_0_isomer_0_conf_0_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2068966
    Local-search chosen method is: ADADELTA (ad)

Rest of Setup time 0.022616s
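
For reference, the -D all behavior of walking through every visible device matches what a plain CUDA runtime enumeration reports. A minimal, self-contained sketch of such an enumeration (my own code, not taken from AutoDock-GPU) is:

    // Sketch: enumerate all visible CUDA devices and their free/total memory,
    // similar in spirit to the "-D all" output above. Not AutoDock-GPU code.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            std::fprintf(stderr, "No CUDA devices found\n");
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            cudaSetDevice(i);                  // cudaMemGetInfo queries the current device
            size_t free_b = 0, total_b = 0;
            cudaMemGetInfo(&free_b, &total_b);
            std::printf("Cuda device: %s (#%d / %d), %zu MB free of %zu MB\n",
                        prop.name, i + 1, count,
                        free_b >> 20, total_b >> 20);
        }
        return 0;
    }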

I hope this information is helpful for identifying the bug.

Thanks.

@atillack
Member

@Hong-Rui Thank you for reporting this issue. Also a big thank you for going the extra mile and being very thorough :-)

I currently suspect a character is being cut when parsing the last argument (which would explain why -D 1 works but none of the others do, since GPU #1 is used by default), but on a smaller machine (2 OpenCL GPUs) I have so far not been able to reproduce this ...

I'll continue looking into it and hope to be able to reproduce and fix it soon once one of our 8x Cuda machines becomes available.

@atillack
Member

atillack commented Sep 17, 2021

@Hong-Rui The fix for the bug is up as PR #153 and should be merged soon. The bug turned out to be the wrong CUDA device being set (the first one) for some threads, which then led to the crash, as the memory pointer for device #2 wasn't valid on #1.
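
To illustrate the failure mode (my own minimal sketch under stated assumptions, not the AutoDock-GPU code): the CUDA runtime tracks the current device per host thread, so a worker thread that never calls cudaSetDevice launches kernels on device 0, and a buffer allocated while another device was current is then not a valid pointer for that launch (unless peer access has been enabled). The sketch assumes at least two visible GPUs:

    // Sketch (assumption-based, not AutoDock-GPU code): a kernel launched by a
    // thread whose current device is 0 receives a pointer that was allocated on
    // device 1, which typically reports "an illegal memory access".
    #include <cstdio>
    #include <thread>
    #include <cuda_runtime.h>

    __global__ void touch(float* p) { p[0] = 1.0f; }

    int main()
    {
        float* d_buf = nullptr;
        cudaSetDevice(1);                   // allocate the buffer on device 1
        cudaMalloc(&d_buf, sizeof(float));

        std::thread worker([d_buf]() {
            // BUG: this thread never calls cudaSetDevice(1), so the launch goes
            // to device 0, where d_buf is not a valid pointer (no peer access).
            // Fix: call cudaSetDevice(1) here before launching.
            touch<<<1, 1>>>(d_buf);
            cudaError_t err = cudaDeviceSynchronize();
            std::printf("kernel status: %s\n", cudaGetErrorString(err));
        });
        worker.join();

        cudaSetDevice(1);
        cudaFree(d_buf);
        return 0;
    }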

Thank you again for reporting it!

@atillack
Member

@Hong-Rui New release v1.5.2 with the fixes is up.

@Hong-Rui
Author

Hi @atillack, I just tried the fix, and it works!
Thank you for the fix and your quick response!
