Training model on doclaynet dataset using multi-gpu #222

Shravan-Ganji · 2023-09-05T14:47:44Z

Bug 💥
I am trying to train the model on the DocLayNet dataset using multiple GPUs, but I get the following error: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
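
A minimal sketch of how the hint in the error message could be followed from Python. It assumes the variable is set before any CUDA work starts, for example at the top of the notebook or training script; the helper name `enable_blocking_cuda_launches` is hypothetical and not part of deepdoctection or detectron2.

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 so kernels run synchronously.

    With synchronous launches the Python traceback should point at the
    call that actually triggered the device-side assert.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this before torch initializes CUDA. Worker processes started
# afterwards inherit the environment of the parent process.
enable_blocking_cuda_launches()
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down noticeably, so this setting is only meant for locating the failing call and should be removed afterwards.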

I have changed the code in d2_frcnn_train.py as follows:

def main(num_gpus, path_config_yaml, dataset_train, path_weights, config_overwrite, log_dir, build_train_config, dataset_val, build_val_config, metric_name, metric, pipeline_component_name):
    launch(
        train_d2_faster_rcnn,
        num_gpus,  # num_gpus_per_machine
        1,  # num_machines
        0,  # machine_rank
        "auto",  # dist_url
        args=(
            path_config_yaml,
            dataset_train,
            path_weights,
            config_overwrite,
            log_dir,
            build_train_config,
            dataset_val,
            build_val_config,
            metric_name,
            metric,
            pipeline_component_name,
        ),
    )

I am passing these parameters from Datasets_and_Eval.ipynb.

Expected behavior 🧮
Training on multiple GPUs should reduce the training time compared with a single GPU.

Screenshots 🖼
If possible, please add a screenshot of the error message.

Desktop (please complete the following information, if any other than the one in the install requirements):

OS: Ubuntu 20.04
CUDA: 11.3
torch: 1.12.1 (CUDA enabled)

Additional context 🧬
Training works fine on a single GPU, but I want to train on multiple GPUs.

Everything else has been modified to match these changes, e.g. the __init__.py files across the train and deepdoctection folders.

JaMe76 · 2023-09-05T17:50:06Z

Thanks for reporting.

Does this error occur at the first iteration?

Or can you complete one training loop?

My first guess is that there might be something wrong with the training data, but I do not really have evidence for that ...
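
One way to look for such evidence would be to check that every category id lies in the range the model expects, since out-of-range labels are a common cause of device-side asserts. The following is only a sketch: it assumes the records follow detectron2's standard dataset dict layout (an `annotations` list whose entries carry a `category_id`), and both the helper name `find_out_of_range_labels` and the class count of 11 are illustrative.

```python
def find_out_of_range_labels(records, num_classes):
    """Return (record_index, category_id) pairs outside [0, num_classes - 1]."""
    bad = []
    for idx, record in enumerate(records):
        for ann in record.get("annotations", []):
            cat = ann["category_id"]
            if not 0 <= cat < num_classes:
                bad.append((idx, cat))
    return bad


# Tiny hand-made sample: the second record has a label that is too large.
sample = [
    {"annotations": [{"category_id": 0}, {"category_id": 10}]},
    {"annotations": [{"category_id": 11}]},
]
print(find_out_of_range_labels(sample, num_classes=11))  # [(1, 11)]
```

An empty result on the real training records would suggest the labels are not the cause, at least not in this form.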

Shravan-Ganji · 2023-09-06T03:17:41Z

Yes, it occurs at the first iteration.
The DocLayNet data was downloaded from the link provided in datasets_and_eval.ipynb.
Using this data, I am able to train the model successfully on a single GPU.

JaMe76 added the bug Something isn't working label Sep 8, 2023
