
Multi-GPU not working #136

Closed
tpanza opened this issue May 23, 2024 · 1 comment

tpanza commented May 23, 2024

I'm running marker-pdf 0.2.8 on a RHEL 7 machine with 8 GPUs. I am trying to use all 8 of them, but only GPU 0 is being utilized.

Command I am using:

export INFERENCE_RAM=16
export TORCH_DEVICE=cuda
NUM_DEVICES=8 NUM_WORKERS=24 MIN_LENGTH=1000 marker path/to/my/input/dir path/to/my/output/dir

Console output after running the above command:

2024-05-23 12:02:29,517 INFO worker.py:1749 -- Started a local Ray instance.
Loaded detection model vikp/surya_det2 on device cuda with dtype torch.float16
Loaded detection model vikp/surya_layout2 on device cuda with dtype torch.float16
Loaded reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Converting 40217 pdfs in chunk 1/1 with 5 processes, and storing in path/to/my/output/dir
  1%|█▌                                                                                                                                                            | 382/40217 [56:37<80:05:59,  7.24s/it]

Only GPU 0 gets utilized (75%). The other 7 each show just 4 MiB of memory usage, with no utilization and no processes tied to them. Output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           On  | 00000000:08:00.0 Off |                    0 |
| N/A   42C    P0             222W / 250W |  10132MiB / 16384MiB |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   27C    P0              27W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE-16GB           On  | 00000000:0E:00.0 Off |                    0 |
| N/A   38C    P0              26W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE-16GB           On  | 00000000:11:00.0 Off |                    0 |
| N/A   34C    P0              26W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE-16GB           On  | 00000000:16:00.0 Off |                    0 |
| N/A   27C    P0              24W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE-16GB           On  | 00000000:19:00.0 Off |                    0 |
| N/A   28C    P0              29W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE-16GB           On  | 00000000:1C:00.0 Off |                    0 |
| N/A   32C    P0              23W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE-16GB           On  | 00000000:22:00.0 Off |                    0 |
| N/A   32C    P0              25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     37553      C   ray::process_single_pdf                    4026MiB |
|    0   N/A  N/A     37554      C   ray::process_single_pdf                    4026MiB |
|    0   N/A  N/A     38661      C   ray::process_single_pdf                     818MiB |
|    0   N/A  N/A     73566      C   ...conda_envs/py311agentsds/bin/python     1298MiB |
+---------------------------------------------------------------------------------------+
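A manual fallback I could try (hypothetical, not marker's documented multi-GPU path) would be to pre-split the input directory and pin one single-GPU marker process per device via CUDA_VISIBLE_DEVICES. A sketch that just builds and prints the per-device launch lines (`in_shard_N`/`out_shard_N` are illustrative names; the directory split itself is not shown):

```shell
# Hypothetical fallback sketch: one single-GPU marker process per device,
# each pinned with CUDA_VISIBLE_DEVICES and given its own pre-split input
# slice. This only constructs and prints the launch commands; it does not
# run marker.
CMDS=""
for GPU in 0 1 2 3 4 5 6 7; do
  CMDS="${CMDS}CUDA_VISIBLE_DEVICES=$GPU marker in_shard_$GPU out_shard_$GPU\n"
done
printf "$CMDS"
```

Each printed line could then be launched in the background (or via a job scheduler), giving one isolated CUDA device per process.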
@VikParuchuri
Owner

Check the README; you have to use the chunk convert script.
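For context, the README's multi-GPU path goes through the chunk conversion script rather than the plain `marker` entry point; the plain command ignores `NUM_DEVICES` and runs on a single device. Something along these lines (script name per my reading of the marker 0.2.x README; treat as a hedged sketch and check the current README for exact usage):

```shell
# Hedged sketch of the README's multi-GPU invocation: chunk_convert.sh
# shards the input directory across NUM_DEVICES GPU workers. Env var
# values mirror the report above; INFERENCE_RAM/TORCH_DEVICE exports
# still apply.
MIN_LENGTH=1000 NUM_DEVICES=8 NUM_WORKERS=24 \
bash chunk_convert.sh path/to/my/input/dir path/to/my/output/dir
```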
