Bus error (core dumped) #174

Closed
Reid00 opened this issue Dec 14, 2020 · 12 comments

Comments

@Reid00

Reid00 commented Dec 14, 2020

DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --dataset patient --batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 24000 --log_interval 100 --batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --gpu 0 1 2 3 --mix_cpu_gpu --data_path ./data/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000

image

Does this error mean the process ran out of memory?

@classicsong
Contributor

Which DGL version are you using?

@Reid00
Author

Reid00 commented Dec 14, 2020

It's 0.4.3, installed by following the official guide: https://dglke.dgl.ai/doc/install.html

@classicsong
Contributor

Can you run dmesg to check the Linux kernel log?
Also, which PyTorch version are you using?

@Reid00
Author

Reid00 commented Dec 14, 2020

Thank you. I ran dmesg; the result is below. My PyTorch version is 1.7.0 with CUDA 10.2.
image

@classicsong
Contributor

Can you do some simple math to check whether you have enough memory to hold the embeddings?
Training needs at least 4 * hidden_dim * number_of_nodes + 16 * number_of_edges bytes to hold the whole graph, not counting intermediate memory.
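
The estimate above can be worked through with a few lines of Python. This is only a sketch of the arithmetic in the formula: the function names are made up for illustration, the result is assumed to be in bytes, and the node and edge counts are hypothetical placeholders, since the size of the patient dataset is not given in this thread.

```python
def min_training_bytes(hidden_dim, num_nodes, num_edges):
    """Lower bound from the formula above:
    4 * hidden_dim * number_of_nodes + 16 * number_of_edges.
    Intermediate memory is not included."""
    return 4 * hidden_dim * num_nodes + 16 * num_edges


def to_gib(num_bytes):
    """Convert a byte count to GiB for easier reading."""
    return num_bytes / 2**30


if __name__ == "__main__":
    hidden_dim = 400        # taken from --hidden_dim in the command
    num_nodes = 5_000_000   # hypothetical; replace with the real count
    num_edges = 50_000_000  # hypothetical; replace with the real count

    needed = min_training_bytes(hidden_dim, num_nodes, num_edges)
    print(f"at least {to_gib(needed):.2f} GiB needed")
```

Comparing the printed value against the free memory on the machine gives a quick first check before looking for other causes.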

@Reid00
Author

Reid00 commented Dec 14, 2020

I think there is enough memory.
I'm not sure why this happens. With a single GPU it runs fine:
DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --dataset patient --batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 24000 --log_interval 100 --batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --gpu 0 --mix_cpu_gpu --data_path ./data/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000
This runs fine.

DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --dataset patient --batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 24000 --log_interval 100 --batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --gpu 0 1 2 3 --mix_cpu_gpu --data_path ./data/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000
This fails with the bus error.

@classicsong
Contributor

Then it may be related to the multi-GPU implementation. Did you try PyTorch 1.6?
What kind of hardware are you using?

@Reid00
Author

Reid00 commented Dec 14, 2020

I will try PyTorch 1.6 later.
image

@classicsong
Contributor

Thank you.

@Reid00
Author

Reid00 commented Dec 14, 2020

I tried PyTorch 1.6 with CUDA 10.2 and Python 3.8. It's still the same error. What else could cause this issue?

@classicsong
Contributor

Can you try running it under GDB and use backtrace to see where the crash occurs?
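
As a lighter complement to GDB (a different technique, not a replacement for a native backtrace), Python's standard faulthandler module can dump the Python-level traceback when the process receives a fatal signal such as SIGBUS. The sketch below is generic: the helper name and log path are made up for illustration, and where such a call would go inside DGL-KE is not something this thread establishes.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Dump Python tracebacks of all threads to log_path on a fatal signal
    (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).

    The returned file object must stay open for as long as the handler
    is enabled, so the caller should keep a reference to it."""
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical log file name; call this as early as possible in the program.
    crash_log = enable_crash_traces("crash_trace.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

Setting the environment variable PYTHONFAULTHANDLER=1 before the command enables the same handler without code changes, with the traceback written to stderr.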

@VoVAllen
Contributor

VoVAllen commented Feb 2, 2021

A bus error usually means the shared memory size is not big enough. If you are using Docker, please pass --shm-size="4G" or a bigger value when running the container.
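
To check this from inside the container or on the host, the free space on the shared memory mount can be read with Python's standard library. This is a minimal sketch that assumes a Linux system where shared memory is mounted at /dev/shm; the helper names are made up for illustration, and the 4 GiB threshold simply mirrors the value suggested above.

```python
import os
import shutil


def free_bytes(path="/dev/shm"):
    """Free space, in bytes, on the filesystem that contains path."""
    return shutil.disk_usage(path).free


def shm_is_large_enough(required_bytes, path="/dev/shm"):
    """True if the filesystem at path has at least required_bytes free."""
    return free_bytes(path) >= required_bytes


if __name__ == "__main__":
    required = 4 * 2**30  # 4 GiB, mirroring the --shm-size value above
    if os.path.exists("/dev/shm"):
        print(f"/dev/shm free: {free_bytes() / 2**30:.2f} GiB")
        print("large enough:", shm_is_large_enough(required))
    else:
        print("/dev/shm not found on this system")
```

If the reported free space is well below what training needs, increasing the shared memory size is the first thing to try.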
