
'std::bad_alloc' error when evaluating a dataset with a large number of entities. #101

Closed
HenryYihengXu opened this issue May 21, 2020 · 8 comments

@HenryYihengXu

HenryYihengXu commented May 21, 2020

Hello! I'm running into this 'std::bad_alloc' error:

|test|: 17248443
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

I split my dataset into training, validation, and test datasets. I first used dglke_train and passed these three files to --data_files. It finished training successfully. But when I ran dglke_eval with these three files, it yielded this error.

I'm pretty sure I have enough space on the machine. Do you know what could be the possible problem? Also, I'm confused by the command line arguments --dataset and --data_files of dglke_eval. What's the usage of --dataset when running my own dataset? Should I pass the same files to --data_files for evaluation as those for training?

@classicsong
Contributor

Yes, you should pass the same files used for training to evaluation; otherwise the system does not know what the graph looks like. You also need to keep the dataset name the same for train and eval, since it is currently used in naming the output files (e.g., the .npy embedding files).
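For illustration, a matching pair of commands might look like the sketch below. This is an assumption based on the flags discussed in this thread: LJ is a hypothetical value for --dataset, the checkpoint path passed to --model_path depends on your actual run, and the trailing ... stands for the remaining hyperparameters, which should also be kept consistent between the two invocations.

DGLBACKEND=pytorch dglke_train --model_name ComplEx --dataset LJ --data_path ./data --data_files LJ_training.txt LJ_validation.txt LJ_test.txt --format raw_udd_hrt ...
DGLBACKEND=pytorch dglke_eval --model_name ComplEx --dataset LJ --data_path ./data --data_files LJ_training.txt LJ_validation.txt LJ_test.txt --format raw_udd_hrt --model_path ckpts/ComplEx_LJ_0/ ...

The point is that --dataset, --data_path, --data_files, and --format are identical in both commands, so evaluation reconstructs the same graph and looks for the saved embeddings under the same name.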

@HenryYihengXu
Author

OK, thank you!

But I passed the same files, and it still yielded this error. I'm pretty sure I have enough space on the machine. Do you know what could be the possible cause?

@classicsong
Contributor

Can you show me the commands for train and eval, and the call trace where the std::bad_alloc happened?

@HenryYihengXu
Author

HenryYihengXu commented May 21, 2020

Here is my command:

DGLBACKEND=pytorch dglke_train --model_name ComplEx --data_path ./data --data_files LJ_training.txt LJ_validation.txt LJ_test.txt --format raw_udd_hrt --batch_size 200000 --neg_sample_size 1000 --hidden_dim 100 --gamma 19.9 --lr 0.1 --max_step 2400 --log_interval 100 --batch_size_eval 10000 -adv --regularization_coef 1.00E-09 --test --gpu 1 --num_thread 1 --num_proc 1

And this is the output:

Using backend: pytorch
Logs are being recorded at: ckpts/ComplEx_FB15k_10/train.log
Reading train triples....
Finished. Read 62094396 train triples.
Reading valid triples....
Finished. Read 34496887 valid triples.
Reading test triples....
Finished. Read 34496886 test triples.
|Train|: 62094396
/usr/local/lib/python3.6/dist-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
|valid|: 34496887
|test|: 34496886
Total initialize time 611.986 seconds
[proc 0][Train](100/2400) average pos_loss: 0.6891365647315979
[proc 0][Train](100/2400) average neg_loss: 0.6942214441299438
[proc 0][Train](100/2400) average loss: 0.6916790020465851
[proc 0][Train](100/2400) average regularization: 7.632575531232532e-05
[proc 0][Train] 100 steps take 11.511 seconds
[proc 0]sample: 10.400, forward: 0.435, backward: 0.562, update: 0.113
......
......
[proc 0][Train](2400/2400) average pos_loss: 0.29677054792642593
[proc 0][Train](2400/2400) average neg_loss: 0.731871457695961
[proc 0][Train](2400/2400) average loss: 0.5143210029602051
[proc 0][Train](2400/2400) average regularization: 0.0003860547285876237
[proc 0][Train] 100 steps take 1.440 seconds
[proc 0]sample: 0.388, forward: 0.426, backward: 0.499, update: 0.126
proc 0 takes 45.253 seconds
training takes 45.25495719909668 seconds
terminate called after throwing an instance of 'std::bad_allocterminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
Aborted (core dumped)

Thanks!

@classicsong
Contributor

It seems you are using a huge batch_size, especially for evaluation. During evaluation, if you do not specify neg_sample_size_eval, the whole entity set is used as candidate negative nodes, which consumes a lot of memory. Reduce batch_size_eval to something smaller, like 100 or 500, and use, for example, neg_sample_size_eval=10000.

Your dataset has 60M edges, which is much larger than fb15k
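To see why this blows up, a rough back-of-the-envelope sketch (the entity count here is a hypothetical, not taken from this dataset): scoring one evaluation batch against the full entity set needs on the order of batch_size_eval × num_entities floats. With batch_size_eval=10000 and, say, 5M entities, that is 10000 × 5,000,000 × 4 bytes ≈ 200 GB for a single score tensor, which is exactly the kind of allocation that fails with std::bad_alloc. Capping the candidates with neg_sample_size_eval=10000 shrinks it to 10000 × 10000 × 4 bytes ≈ 400 MB. A corresponding eval command might look like this (a sketch; the --dataset value and --model_path are assumptions that must match your training run):

DGLBACKEND=pytorch dglke_eval --model_name ComplEx --dataset LJ --data_path ./data --data_files LJ_training.txt LJ_validation.txt LJ_test.txt --format raw_udd_hrt --hidden_dim 100 --gamma 19.9 --batch_size_eval 500 --neg_sample_size_eval 10000 --model_path ckpts/ComplEx_LJ_0/ --gpu 1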

@HenryYihengXu
Author

Thank you! The error is gone, but evaluation becomes very slow with --batch_size_eval set to 100.

@classicsong
Contributor

How many nodes do you have? If it is large, e.g., millions of nodes, I recommend using neg_sample_size_eval=10000.

@classicsong classicsong changed the title 'std::bad_alloc' error when evaluating my own dataset 'std::bad_alloc' error when evaluating a dataset with a large number of entities. May 21, 2020
@classicsong classicsong self-assigned this May 21, 2020
@classicsong
Contributor

Since the docs are updated, closing this issue.
