
Multi-GPU not working #136

Closed
tpanza opened this issue May 23, 2024 · 1 comment

tpanza commented May 23, 2024

I'm running marker-pdf 0.2.8 on a RHEL 7 machine with 8 GPUs. I am trying to use all 8 of them, but only GPU 0 is being utilized.

Command I am using:

export INFERENCE_RAM=16
export TORCH_DEVICE=cuda
NUM_DEVICES=8 NUM_WORKERS=24 MIN_LENGTH=1000 marker path/to/my/input/dir path/to/my/output/dir

Console output after running the above command:

2024-05-23 12:02:29,517 INFO worker.py:1749 -- Started a local Ray instance.
Loaded detection model vikp/surya_det2 on device cuda with dtype torch.float16
Loaded detection model vikp/surya_layout2 on device cuda with dtype torch.float16
Loaded reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Converting 40217 pdfs in chunk 1/1 with 5 processes, and storing in path/to/my/output/dir
  1%|█▌                                                                                                                                                            | 382/40217 [56:37<80:05:59,  7.24s/it]

Only GPU 0 gets utilized (75%). The other 7 each show just 4 MiB of memory usage, with no utilization and no processes tied to them. Output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           On  | 00000000:08:00.0 Off |                    0 |
| N/A   42C    P0             222W / 250W |  10132MiB / 16384MiB |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   27C    P0              27W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE-16GB           On  | 00000000:0E:00.0 Off |                    0 |
| N/A   38C    P0              26W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE-16GB           On  | 00000000:11:00.0 Off |                    0 |
| N/A   34C    P0              26W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE-16GB           On  | 00000000:16:00.0 Off |                    0 |
| N/A   27C    P0              24W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE-16GB           On  | 00000000:19:00.0 Off |                    0 |
| N/A   28C    P0              29W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE-16GB           On  | 00000000:1C:00.0 Off |                    0 |
| N/A   32C    P0              23W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE-16GB           On  | 00000000:22:00.0 Off |                    0 |
| N/A   32C    P0              25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     37553      C   ray::process_single_pdf                    4026MiB |
|    0   N/A  N/A     37554      C   ray::process_single_pdf                    4026MiB |
|    0   N/A  N/A     38661      C   ray::process_single_pdf                     818MiB |
|    0   N/A  N/A     73566      C   ...conda_envs/py311agentsds/bin/python     1298MiB |
+---------------------------------------------------------------------------------------+
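A manual fallback I could try (hypothetical, not marker's documented multi-GPU path) would be to pre-split the input directory and pin one single-GPU marker process per device via CUDA_VISIBLE_DEVICES. A sketch that just builds and prints the per-device launch lines (`in_shard_N`/`out_shard_N` are illustrative names; the directory split itself is not shown):

```shell
# Hypothetical fallback sketch: one single-GPU marker process per device,
# each pinned with CUDA_VISIBLE_DEVICES and given its own pre-split input
# slice. This only constructs and prints the launch commands; it does not
# run marker.
CMDS=""
for GPU in 0 1 2 3 4 5 6 7; do
  CMDS="${CMDS}CUDA_VISIBLE_DEVICES=$GPU marker in_shard_$GPU out_shard_$GPU\n"
done
printf "$CMDS"
```

Each printed line could then be launched in the background (or via a job scheduler), giving one isolated CUDA device per process.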
@VikParuchuri
Owner

Check the README; you have to use the chunk convert script.
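For context, the README's multi-GPU path goes through the chunk conversion script rather than the plain `marker` entry point; the plain command ignores `NUM_DEVICES` and runs on a single device. Something along these lines (script name per my reading of the marker 0.2.x README; treat as a hedged sketch and check the current README for exact usage):

```shell
# Hedged sketch of the README's multi-GPU invocation: chunk_convert.sh
# shards the input directory across NUM_DEVICES GPU workers. Env var
# values mirror the report above; INFERENCE_RAM/TORCH_DEVICE exports
# still apply.
MIN_LENGTH=1000 NUM_DEVICES=8 NUM_WORKERS=24 \
bash chunk_convert.sh path/to/my/input/dir path/to/my/output/dir
```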
