The code fails when running a custom graph #9

Closed · rcap107 opened this issue Dec 3, 2021 · 6 comments

rcap107 commented Dec 3, 2021

Hello,
I am trying to run the code on a graph I have built that is not included among the choices provided in the repo. The graph I am working with is bipartite with typed edges.

To run the code, I prepared the list of triplets and split it into train, valid, and test sets.
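
Roughly along these lines (a simplified sketch, not my exact code; the split proportions are approximate):

import random

# triples: (head, relation, tail) string tuples built from my bipartite graph
triples = [('idx__518', 'id', '2449.0'), ('idx__519', 'id', '2452.0')]  # ...and so on

random.seed(0)
random.shuffle(triples)
n = len(triples)
splits = {'train': triples[:int(0.65 * n)],
          'valid': triples[int(0.65 * n):int(0.80 * n)],
          'test': triples[int(0.80 * n):]}
for name, rows in splits.items():
    with open(f'{name}.tsv', 'w') as f:
        for h, r, t in rows:
            f.write(f'{h}\t{r}\t{t}\n')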

Triplets are saved in .tsv files:

...
idx__518	id	2449.0
idx__519	id	2452.0
idx__523	id	2469.0
idx__531	id	2484.0
idx__532	id	2487.0
idx__533	id	2494.0
idx__549	id	2545.0
...

To run my dataset, I slightly modified the python scripts in the repo.

In preprocess_dataset.py I added the line datasets = ['mydata'] to read from the folder src/src_data/mydata. I was then able to run the script, which created and filled the folder data/mydata.

DATA_PATH: /content/ssl-relation-prediction/data
Preparing dataset mydata
2681 entities and 9 relations
creating filtering lists
Done processing!
1

In main.py, I modified the list of datasets by adding mydata so that the code wouldn't raise an exception.

Finally, I tried to run the code with the arguments specified in the readme:

! python src/main.py --dataset mydata --score_rel True --model ComplEx --rank 1000 --learning_rate 0.1 --batch_size 1000 --lmbda 0.05 --w_rel 4 --max_epochs 100

Unfortunately, at this point the code fails because there is no train.npy file in the data folder. I assume that the train.npy file should have been created by the preprocessing script, but for some reason that did not happen. The content of the data folder is the following:

total 1320
-rw-r--r-- 1 root root   40662 Dec  3 12:59 ent_id
-rw-r--r-- 1 root root      97 Dec  3 12:59 rel_id
-rw-r--r-- 1 root root   43023 Dec  3 12:59 test.tsv.pickle
-rw-r--r-- 1 root root 1083297 Dec  3 12:59 to_skip.pickle
-rw-r--r-- 1 root root  140727 Dec  3 12:59 train.tsv.pickle
-rw-r--r-- 1 root root   32391 Dec  3 12:59 valid.tsv.pickle

From the readme, it's not clear to me how to run custom-made datasets. Could you help me with that?

@yihong-chen (Contributor)

Hi @rcap107, thanks for the detailed report. I guess if you rename train.tsv.pickle to train.pickle, valid.tsv.pickle to valid.pickle, and test.tsv.pickle to test.pickle, the code should work. Let me know if it still fails.
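
Something along these lines should do the renaming (a rough sketch, assuming the processed files sit in data/mydata as in your listing):

import os

for split in ('train', 'valid', 'test'):
    # e.g. data/mydata/train.tsv.pickle -> data/mydata/train.pickle
    os.rename(f'data/mydata/{split}.tsv.pickle', f'data/mydata/{split}.pickle')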

yihong-chen added a commit that referenced this issue Dec 3, 2021
rcap107 (Author) commented Dec 6, 2021

Hello @yihong-chen, thanks for the quick reply. I realized later that I had a typo in my code, a silly mistake. The code started running after fixing that problem; however, I am now hitting something entirely different.

Loading to_skip file ...
my_data Dataset Stat: (2681, 18, 2681)
Train/Valid/Test 5857/1343/1786
Train/Valid/Test 0.652/0.149/0.199
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose 'Don't visualize my results'
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Git commit ID: b'6ef60266c0f0ddac49c1f80b8bed40919585d001\n'
Creating a sampler of size 11714
  0% 0/11714 [00:00<?, ?it/s]Traceback (most recent call last):
  File "src/main.py", line 135, in <module>
    engine.episode()
  File "/content/ssl-relation-prediction/src/engines.py", line 168, in episode
    l.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [16,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [17,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [19,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [22,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [23,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [25,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:455: nll_loss_backward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.

wandb: Waiting for W&B process to finish, PID 526... (failed 1).
wandb: Run summary:
wandb:   epoch_id 0
wandb:    is_done False
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /content/ssl-relation-prediction/wandb/offline-run-20211206_164639-37oeac1i
wandb: Find logs at: ./wandb/offline-run-20211206_164639-37oeac1i/logs/debug.log
wandb: 

Do you have any insight?

I am using a colab notebook.

Thanks

@yihong-chen (Contributor)

Hi @rcap107, looking at the logs, it seems that the cross-entropy loss l_fit isn't being calculated properly. Some values in input_batch_train[:, 2] or input_batch_train[:, 1] don't fall in the range [0, number of predicted classes), where the number of classes is either the number of entities or the number of relations.
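
A quick way to verify this on your side (a rough sketch, assuming the processed split is a pickled array of (head, relation, tail) integer ids, with the counts taken from your preprocessing log):

import pickle
import numpy as np

with open('data/mydata/train.pickle', 'rb') as f:
    train = np.asarray(pickle.load(f))

n_entities, n_relations = 2681, 9  # from "2681 entities and 9 relations" above
assert train[:, 0].min() >= 0 and train[:, 0].max() < n_entities
assert train[:, 2].min() >= 0 and train[:, 2].max() < n_entities
assert train[:, 1].min() >= 0 and train[:, 1].max() < n_relations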

I would suggest you check the input files. The repo expects inputs formatted as tab-separated triples like

node    relation-type    node

For example,

London    locates-in    UK

or

User1    Interacts-with   Item

If the input was provided as in your comment above,

idx__518 id 2449.0

then the program would expect idx__518 as a node, id as the relation type and 2449.0 as the other node. Is this what you meant them to be?
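
To rule out formatting problems, a quick check like this could help (a rough sketch, with hypothetical paths): every line of the raw files should contain exactly three tab-separated fields, all treated as opaque strings.

for split in ('train', 'valid', 'test'):
    with open(f'src/src_data/mydata/{split}.tsv') as f:  # adjust names/paths to your setup
        for i, line in enumerate(f, 1):
            fields = line.rstrip('\n').split('\t')
            assert len(fields) == 3, f'{split} line {i}: {fields!r}'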

BTW feel free to share the colab link that reproduces this bug. Very happy to look into it.

rcap107 (Author) commented Dec 7, 2021

Hello, the colab notebook is here. There really isn't anything fancy in it: I modify the files I mentioned in the OP manually, then I import the train, valid, and test files from my Drive. I've attached those tsv files to this message. To reply to your previous question, the input is indeed in the format I need, and in the format you described.

I am not completely sure how the algorithm is able to distinguish between positive and negative samples simply from the train/valid splits. Indeed, in all train/valid/test files, all samples would be labeled as "true". Is it done internally?
I've attached the train/valid/test files I am working with (not exactly the same files I tested above, but in the same format).
test.txt
train.txt
valid.txt

Thanks for the help!

yihong-chen added a commit that referenced this issue Dec 7, 2021
@yihong-chen (Contributor)

Hi @rcap107, I checked your colab. It works well on CPU. For GPU, I tested it a bit and found that it works well with PyTorch 1.8.2 (LTS). The default PyTorch in colab is 1.10, which is why the code fails. I created a colab notebook that runs successfully on your data. Feel free to ping me if it still fails.
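
If helpful, you can confirm which build your runtime is using before launching the script (a minimal sketch; the versions reported will depend on the Colab image):

import torch

print(torch.__version__)   # e.g. a 1.10.x build on a default colab runtime at the moment
print(torch.version.cuda)  # CUDA toolkit the installed wheel was built against
# Downgrading means installing the 1.8.2 LTS wheels (matching the runtime's
# CUDA version) before running src/main.py.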

I am not completely sure how the algorithm is able to distinguish between positive and negative samples simply from the train/valid splits. Indeed, in all train/valid/test files, all samples would be labelled as "true". Is it done internally?

Yes, every triple in the train/valid/test files is a "true/positive" example. As for negative examples, we generate them by corrupting the triples. For example, given a triple (h, r, t), the negatives would be 1) (h, r, t'), obtained by replacing t with any other possible entity t' in the graph, or 2) (h, r', t), obtained by replacing r with any other possible relation type r'. Hope this answers your question.
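
A purely illustrative sketch of what corrupting a triple means (not how the code implements it; in practice the candidates are scored in batch rather than enumerated like this):

def corrupt(h, r, t, n_entities, n_relations):
    # tail corruption: (h, r, t') for every other entity t'
    tail_negs = [(h, r, e) for e in range(n_entities) if e != t]
    # relation corruption: (h, r', t) for every other relation type r'
    rel_negs = [(h, rp, t) for rp in range(n_relations) if rp != r]
    return tail_negs + rel_negs

negs = corrupt(h=0, r=1, t=5, n_entities=2681, n_relations=9)
print(len(negs))  # (2681 - 1) + (9 - 1) = 2688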

rcap107 (Author) commented Dec 9, 2021

Hello @yihong-chen, thanks a lot for all the help and for answering my questions. I tried the notebook you provided and it seems to work on my side as well. I do not have any further questions for the moment.

rcap107 closed this as completed Dec 9, 2021