Why does my dglke_eval not work? #138

Open
zhenshiqi1996 opened this issue Aug 19, 2020 · 5 comments

Comments

@zhenshiqi1996

When I train TransE using the given command and then evaluate the model, the evaluation function doesn't work, and no error is reported.

image

@classicsong
Contributor

Can you check cpu/gpu usage?

@vlad43210

I have this same issue -- tried it with and without the GPU. CPU usage just quickly drops to 0 and the program hangs there with no error:

DGLBACKEND=pytorch dglke_eval \
> --data_path (data path) --dataset mydataset \
> --data_files entities.tsv relations.tsv all_ctups_2.tsv valid.tsv test.tsv --format udd_hrt \
> --model_name ComplEx \
> --hidden_dim 512 --gamma 175 \
> --num_proc 8 --num_thread 4 \
> --batch_size_eval 1024 --neg_sample_size_eval 10000 --eval_percent 5 \
> --model_path (model path)
Using backend: pytorch
Reading train triples....
Finished. Read 211968 train triples.
Reading valid triples....
Finished. Read 27026 valid triples.
Reading test triples....
Finished. Read 31796 test triples.
/opt/conda/envs/kge/lib/python3.8/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
|valid|: 27026
|test|: 31796

Package versions:

- dgl==0.4.3post2
- dglke==0.1.2
- torch==1.7.1
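
One way to tell a true hang from very slow progress is to timestamp every line the command prints. The following is a minimal sketch using only the Python standard library; `run_with_timestamps` is a hypothetical helper written for this thread, not part of DGL-KE, and it only shows when output stops, not why.

```python
import os
import subprocess
import sys
import time


def run_with_timestamps(cmd, out=sys.stdout):
    """Run cmd and echo each output line prefixed with elapsed seconds.

    Returns (exit_code, lines). Hypothetical helper, not a DGL-KE API.
    """
    env = dict(os.environ)
    # Ask a Python child not to buffer stdout, so timestamps reflect
    # when a line was produced rather than when a buffer was flushed.
    env["PYTHONUNBUFFERED"] = "1"

    start = time.monotonic()
    lines = []
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so warnings are timed too
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
        env=env,
    )
    for line in proc.stdout:
        line = line.rstrip("\n")
        lines.append(line)
        print(f"[{time.monotonic() - start:8.1f}s] {line}", file=out, flush=True)
    return proc.wait(), lines


# Hypothetical usage with the command from the report above:
#   run_with_timestamps(["dglke_eval", "--model_name", "ComplEx", ...])
```

If the last timestamped line is `|test|: 31796` and nothing follows for a long time, checking whether the process still accumulates CPU time (for example with `top`) is the next step.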

@mtoles

mtoles commented Sep 22, 2022

Has anyone discovered a fix? I am experiencing a similar problem.

@classicsong
Contributor

Can you provide more details?

@classicsong reopened this Sep 22, 2022

@PoloWitty

I am hitting the same issue. However, I found that it may simply be because the eval process is very slow.
The command is as follows:

dglke_eval \
    --model_name TransE \
    --data_path ~/KGE/data/BIOS \
    --dataset BIOS \
    --format raw_udd_hrt \
    --data_files train.tsv valid.tsv test.tsv \
    --hidden_dim 512 \
    --batch_size_eval 256 \
    --neg_sample_size_eval 256 \
    --model_path ~/KGE/output/TransE_BIOS_2/ \
    --gpu 0

After the process prints |valid|: 8449749 |test|: 8449806, it appears to have died, but it is actually still running. I don't know why evaluation is so slow when training is so fast. I also tried --batch_size_eval 10000 --neg_sample_size_eval 10000, as mentioned in #106, but it is still very slow and GPU utilization stays as low as 10% (on an A100).
I also tried multi-process evaluation on CPU, but have not seen any improvement yet.

I wonder if there is any solution to this problem?
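
For a sense of scale, here is a rough back-of-the-envelope sketch of the evaluation workload for the test split above. It assumes each triple is scored against `--neg_sample_size_eval` corrupted heads and the same number of corrupted tails, processed in batches of `--batch_size_eval`. That is an assumption about the workload, not something verified against the DGL-KE source, and `eval_workload` is a hypothetical helper.

```python
import math


def eval_workload(num_triples, batch_size, neg_sample_size, sides=2):
    """Return (number of batches, total candidate scores).

    sides=2 assumes both head and tail corruption (an assumption).
    """
    batches = math.ceil(num_triples / batch_size)
    scores = num_triples * neg_sample_size * sides
    return batches, scores


num_test = 8_449_806  # |test| from the output above

# Settings from the command above: 256 / 256
print(eval_workload(num_test, 256, 256))        # (33008, 4326300672)

# Settings tried from #106: 10000 / 10000
print(eval_workload(num_test, 10_000, 10_000))  # (845, 168996120000)
```

Under these assumptions, raising both values to 10000 cuts the batch count but multiplies the number of scores to compute by roughly 39, which might be one reason the larger settings did not feel faster.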
