
--devnum parsing error when using multi-GPU functionality #152

Closed
Hong-Rui opened this issue Sep 16, 2021 · 4 comments
Labels: bug (Something isn't working)

Comments

@Hong-Rui

Hi developers,

I'm trying to use the new multi-GPU features in AD-GPU 1.5.1, compiled with OVERLAP=ON.
My local machine has 8 GPU cards. When I specify the CUDA device numbers, the command-line parser in main.cpp does not seem to parse the input CUDA IDs correctly:

When I run autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 1, it works correctly and prints the CUDA info:

Running Job #7:
    Device: GeForce RTX 2080 Ti (#1 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_2_isomer_0_conf_0_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2068966
    Local-search chosen method is: ADADELTA (ad)

When I run autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 2, it gives me a CUDA error like

Running Job #13:
    Device: GeForce RTX 2080 Ti (#2 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_8_isomer_0_conf_0_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2068966
    Local-search chosen method is: ADADELTA (ad)
gpu_calc_initpop_kernel an illegal memory access was encountered
autodock_gpu_128wi: ./cuda/kernel1.cu:65: void gpu_calc_initpop(uint32_t, uint32_t, float*, float*): Assertion `0' failed.
[1]    38987 abort (core dumped)  autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 2

The error output is the same for -D 2 through -D 8; every single-device index fails except -D 1.

And when I run autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 2, (note the trailing comma), it seems to parse CUDA index 2 twice, but runs normally:

Cuda device:                              GeForce RTX 2080 Ti (#2 / 8)
Available memory on device:               10439 MB (total: 11019 MB)

CUDA Setup time 0.280641s
Cuda device:                              GeForce RTX 2080 Ti (#2 / 8)
Available memory on device:               8301 MB (total: 11019 MB)

CUDA Setup time 0.000852s

Running Job #14:
    Device: GeForce RTX 2080 Ti (#2 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_9_isomer_0_conf_0_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2068966
    Local-search chosen method is: ADADELTA (ad)

Rest of Setup time 0.036681s
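
For illustration of the tokenization involved (my own minimal C++ sketch, not the actual main.cpp parser; the helper name parse_device_list is invented): a simple delimiter-splitting loop treats "2" and "2," identically, which makes the different behavior observed above look like a parsing quirk:

    // Illustration only (not the AutoDock-GPU parser): split a comma-delimited
    // device string such as "2", "2," or "2,3,7" into integer indices.
    // The helper name parse_device_list is invented for this sketch.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    static std::vector<int> parse_device_list(const std::string& arg)
    {
        std::vector<int> devices;
        std::stringstream ss(arg);
        std::string token;
        while (std::getline(ss, token, ','))   // "2" and "2," both yield one token
            if (!token.empty())                // skip empty tokens from stray commas
                devices.push_back(std::stoi(token));
        return devices;                        // "2,3,7" -> {2, 3, 7}
    }

    int main()
    {
        for (int d : parse_device_list("2,3,7"))
            std::cout << "device " << d << "\n";
        return 0;
    }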

But if I use multiple cards, e.g. autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D 2,3,7, it works correctly and all specified GPU cards are detected:

Cuda device:                              GeForce RTX 2080 Ti (#2 / 8)
Available memory on device:               10746 MB (total: 11019 MB)

CUDA Setup time 0.291592s
Cuda device:                              GeForce RTX 2080 Ti (#3 / 8)
Available memory on device:               8883 MB (total: 11019 MB)

CUDA Setup time 0.236695s
Cuda device:                              GeForce RTX 2080 Ti (#7 / 8)
Available memory on device:               9072 MB (total: 11019 MB)

CUDA Setup time 0.212491s

Running Job #24:
    Device: GeForce RTX 2080 Ti (#7 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_17_isomer_0_conf_1_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2280181
    Local-search chosen method is: ADADELTA (ad)

Rest of Setup time 0.021547s

Finally, if I use -D all (autodock_gpu_128wi -B ligand_conf_batch.dat -n 10 -D all), everything works as expected:

Cuda device:                              GeForce RTX 2080 Ti (#1 / 8)
Available memory on device:               2816 MB (total: 11019 MB)

CUDA Setup time 0.175057s
Cuda device:                              GeForce RTX 2080 Ti (#2 / 8)
Available memory on device:               10441 MB (total: 11019 MB)

CUDA Setup time 0.157140s
Cuda device:                              GeForce RTX 2080 Ti (#3 / 8)
Available memory on device:               7385 MB (total: 11019 MB)

CUDA Setup time 0.229185s
Cuda device:                              GeForce RTX 2080 Ti (#4 / 8)
Available memory on device:               3330 MB (total: 11019 MB)

CUDA Setup time 0.194629s
Cuda device:                              GeForce RTX 2080 Ti (#5 / 8)
Available memory on device:               7873 MB (total: 11019 MB)

CUDA Setup time 0.257307s
Cuda device:                              GeForce RTX 2080 Ti (#6 / 8)
Available memory on device:               8565 MB (total: 11019 MB)

CUDA Setup time 0.204192s
Cuda device:                              GeForce RTX 2080 Ti (#7 / 8)
Available memory on device:               8097 MB (total: 11019 MB)

CUDA Setup time 0.226710s
Cuda device:                              GeForce RTX 2080 Ti (#8 / 8)
Available memory on device:               9240 MB (total: 11019 MB)

CUDA Setup time 0.351348s

Running Job #1:
    Device: GeForce RTX 2080 Ti (#1 / 8)
    Grid map file: protein.maps.fld
    Ligand file: ligand_0_isomer_0_conf_0_split_0.pdbqt
    Using heuristics: (capped) number of evaluations set to 2068966
    Local-search chosen method is: ADADELTA (ad)

Rest of Setup time 0.022616s
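
For reference, the -D all behavior of walking through every visible device matches what a plain CUDA runtime enumeration reports. A minimal, self-contained sketch of such an enumeration (my own code, not taken from AutoDock-GPU) is:

    // Sketch: enumerate all visible CUDA devices and their free/total memory,
    // similar in spirit to the "-D all" output above. Not AutoDock-GPU code.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            std::fprintf(stderr, "No CUDA devices found\n");
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            cudaSetDevice(i);                  // cudaMemGetInfo queries the current device
            size_t free_b = 0, total_b = 0;
            cudaMemGetInfo(&free_b, &total_b);
            std::printf("Cuda device: %s (#%d / %d), %zu MB free of %zu MB\n",
                        prop.name, i + 1, count,
                        free_b >> 20, total_b >> 20);
        }
        return 0;
    }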

I hope this information is helpful for identifying the bug.

Thanks.

@atillack
Member

@Hong-Rui Thank you for reporting this issue. Also a big thank you for going the extra mile and being very thorough :-)

I currently suspect a character is being cut when parsing the last argument (which would explain why -D 1 works but none of the others do, since GPU #1 is used by default), but on a smaller machine (2 OpenCL GPUs) I have so far not been able to reproduce this ...

I'll continue looking into it and hope to be able to reproduce and fix it soon once one of our 8x Cuda machines becomes available.

@atillack
Member

atillack commented Sep 17, 2021

@Hong-Rui The fix for the bug is up as PR #153 and should be merged soon. The bug turned out to be the wrong CUDA device being set (the first one) for some threads, which then led to the crash, as the memory pointer for device #2 wasn't valid on #1.
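
To illustrate the failure mode (my own minimal sketch under stated assumptions, not the AutoDock-GPU code): the CUDA runtime tracks the current device per host thread, so a worker thread that never calls cudaSetDevice launches kernels on device 0, and a buffer allocated while another device was current is then not a valid pointer for that launch (unless peer access has been enabled). The sketch assumes at least two visible GPUs:

    // Sketch (assumption-based, not AutoDock-GPU code): a kernel launched by a
    // thread whose current device is 0 receives a pointer that was allocated on
    // device 1, which typically reports "an illegal memory access".
    #include <cstdio>
    #include <thread>
    #include <cuda_runtime.h>

    __global__ void touch(float* p) { p[0] = 1.0f; }

    int main()
    {
        float* d_buf = nullptr;
        cudaSetDevice(1);                   // allocate the buffer on device 1
        cudaMalloc(&d_buf, sizeof(float));

        std::thread worker([d_buf]() {
            // BUG: this thread never calls cudaSetDevice(1), so the launch goes
            // to device 0, where d_buf is not a valid pointer (no peer access).
            // Fix: call cudaSetDevice(1) here before launching.
            touch<<<1, 1>>>(d_buf);
            cudaError_t err = cudaDeviceSynchronize();
            std::printf("kernel status: %s\n", cudaGetErrorString(err));
        });
        worker.join();

        cudaSetDevice(1);
        cudaFree(d_buf);
        return 0;
    }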

Thank you again for reporting it!

@atillack
Member

@Hong-Rui New release v1.5.2 with the fixes is up.

@Hong-Rui
Author

Hi @atillack, I just tried the fix, and it works!
Thank you for the fix and your quick response!
