Bus error (core dumped) #174

Reid00 · 2020-12-14T01:44:16Z

DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --dataset patient --batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 24000 --log_interval 100 --batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --gpu 0 1 2 3 --mix_cpu_gpu --data_path ./data/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000

Does this error mean I'm out of memory?

classicsong · 2020-12-14T02:01:34Z

Which DGL version are you using?

Reid00 · 2020-12-14T02:04:58Z

It's 0.4.3, installed following the official guide: https://dglke.dgl.ai/doc/install.html

classicsong · 2020-12-14T02:26:32Z

Can you run dmesg to check the Linux kernel log?
Also, which PyTorch version are you using?

Reid00 · 2020-12-14T02:43:29Z

Thank you. I ran dmesg, result as below, and my PyTorch version is 1.7.0 + CUDA 10.2.

classicsong · 2020-12-14T02:50:41Z

Can you do some simple math to see if you have enough memory to hold the embeddings?
At a minimum, it needs 4 * hidden_dim * number_of_nodes + 16 * number_of_edges bytes to hold the whole graph for training, not counting intermediate memory requirements.
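
The formula above can be turned into a quick check. This is only a sketch: the node and edge counts are hypothetical placeholders, and reading the result as bytes assumes 4 bytes per float32 embedding value.

```shell
# Rough lower bound from the formula above:
#   4 * hidden_dim * number_of_nodes + 16 * number_of_edges
# The result is read as bytes (assumption: 4 bytes per float32 value).
estimate_min_bytes() {
    local hidden_dim=$1
    local num_nodes=$2
    local num_edges=$3
    echo $(( 4 * hidden_dim * num_nodes + 16 * num_edges ))
}

# Hypothetical graph size; replace with the counts for your own dataset.
bytes=$(estimate_min_bytes 400 1000000 5000000)
echo "at least ${bytes} bytes (~$(( bytes / 1024 / 1024 )) MiB)"
```

Compare the result against the output of `free -m`, keeping in mind that it excludes intermediate buffers.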

Reid00 · 2020-12-14T03:07:24Z

I think memory is enough.
I'm not sure why this happens. When I use a single GPU, it runs fine:
DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --dataset patient --batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 24000 --log_interval 100 --batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --gpu 0 --mix_cpu_gpu --data_path ./data/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000
This runs fine.

DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --dataset patient --batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 24000 --log_interval 100 --batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --gpu 0 1 2 3 --mix_cpu_gpu --data_path ./data/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000
This fails with a bus error.

classicsong · 2020-12-14T03:35:38Z

Then it may be related to the multi-GPU implementation. Did you try PyTorch 1.6?
What kind of hardware are you using?

Reid00 · 2020-12-14T03:38:34Z

I will try PyTorch 1.6 later.

classicsong · 2020-12-14T03:48:02Z

Thank you.

Reid00 · 2020-12-14T04:24:05Z

I tried PyTorch 1.6 with CUDA 10.2 and Python 3.8. It's still the same error. What else could be causing this issue?

classicsong · 2020-12-28T16:39:05Z

Can you try running it under GDB and use backtrace to see where the crash happens?
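
One possible way to do this is sketched below. The helper name `run_under_gdb` is made up for illustration, and the sketch assumes `dglke_train` is a Python entry-point script on your PATH. Note that GDB follows the parent process by default, so if the crash happens in a worker process you may also need `-ex 'set follow-fork-mode child'`.

```shell
# Hypothetical helper: run a command under GDB in batch mode and
# print a backtrace once the program stops (e.g. on SIGBUS).
run_under_gdb() {
    gdb -batch -ex run -ex bt --args "$@"
}

# Usage (not executed here): launch the Python interpreter on the
# dglke_train script, followed by the same flags as in the failing command.
#   DGLBACKEND=pytorch run_under_gdb python "$(command -v dglke_train)" \
#       --model_name TransE_l2 --dataset patient --gpu 0 1 2 3 --mix_cpu_gpu ...
```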

VoVAllen · 2021-02-02T16:03:26Z

A bus error usually means your shared memory size is not big enough. If you are using Docker, please pass --shm-size="4G" or a bigger value when running the container.
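
To see how much shared memory is currently available, something like the following may help. It is a minimal sketch that assumes a Linux system where shared memory is mounted at /dev/shm; the function name `shm_avail_kb` is made up for illustration.

```shell
# Print the free space (in KiB) of the filesystem backing a path.
# Defaults to /dev/shm, the usual Linux shared-memory mount (assumption).
shm_avail_kb() {
    local path=${1:-/dev/shm}
    # -P keeps each filesystem on one line; column 4 is "Available".
    df -Pk "$path" | awk 'NR==2 {print $4}'
}

if [ -d /dev/shm ]; then
    echo "free shared memory: $(shm_avail_kb) KiB"
fi

# Inside Docker the limit is fixed when the container starts, e.g.:
#   docker run --shm-size="4G" ... <image> ...
```

If the reported value is small (Docker defaults to 64 MB), multi-process training that shares tensors through /dev/shm can run out of space.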

classicsong closed this as completed Mar 3, 2021
