How to do negative sampling with type constraints? #68
Comments
Thank you for using DGL-KE. Many thanks if you can contribute this feature!
Thanks for the feature request. This is definitely something we should support. DGL-KE does joint negative sampling for efficiency. That is, instead of creating negative edges for each positive edge independently, we corrupt the head/tail nodes of a group of edges altogether and replace them with a new set of nodes randomly sampled from the graph. We need to extend joint negative sampling to the type constraint setting. We need to maintain the head/tail entities for each relation type. Potentially, we need to control the number of relations in a batch to achieve good efficiency.
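To make the joint negative sampling idea above concrete, here is a minimal sketch (not DGL-KE's actual code; function and variable names are illustrative): each chunk of positive edges shares one jointly sampled set of candidate nodes, rather than each edge drawing its own negatives.

```python
# Hedged sketch of joint negative sampling: a chunk of positive edges
# shares one set of randomly sampled nodes used to corrupt the tail slot.
import numpy as np

def joint_negative_sample(heads, num_nodes, chunk_size, neg_per_chunk, rng):
    """For each chunk of `chunk_size` positive edges, draw `neg_per_chunk`
    nodes uniformly; every edge in the chunk pairs with the same candidates."""
    num_edges = len(heads)
    num_chunks = (num_edges + chunk_size - 1) // chunk_size
    # One shared candidate set per chunk, reused by all edges in that chunk.
    neg_nodes = rng.integers(0, num_nodes, size=(num_chunks, neg_per_chunk))
    neg_pairs = []
    for c in range(num_chunks):
        lo, hi = c * chunk_size, min((c + 1) * chunk_size, num_edges)
        for h in heads[lo:hi]:
            # Corrupt the tail: pair (h, ?) with every shared candidate.
            neg_pairs.extend((h, t) for t in neg_nodes[c])
    return neg_pairs

rng = np.random.default_rng(0)
heads = np.array([0, 1, 2, 3])
negs = joint_negative_sample(heads, num_nodes=100, chunk_size=2,
                             neg_per_chunk=3, rng=rng)
# 4 positive edges x 3 shared candidates each = 12 corrupted pairs
print(len(negs))  # 12
```

The efficiency win is that a chunk's scores against its shared negatives can be computed with one batched matrix product instead of per-edge gathers.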
@classicsong @zheng-da thanks for the quick response! Yes, I agree that joint negative sampling is more efficient, so ideally doing joint negative sampling with type constraints would be best. There are probably other ways to do it - batching relations together and applying a special sampler for every relation type (one sampler per batch) is one way. I imagine it will take some time for this to be added to the repo - meanwhile on my end, do you think the two-stage procedure suggested above (sampling positive edges first, then sampling negative edges based on the sampled relation types) is a good way, or is there something easier? I spent some time familiarizing myself with your codebase and this seemed the easiest way to do it. Thanks again for the great work.
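A minimal sketch of the two-stage procedure being discussed (purely illustrative; the triple format and the `type_constraints` dict are assumptions, not DGL-KE's API): first sample positive edges, then corrupt each one within its relation's admissible entity set.

```python
# Hedged sketch of two-stage, type-constrained negative sampling.
import random

def two_stage_sample(triples, type_constraints, batch_size, num_neg, rng):
    """triples: list of (head, relation, tail).
    type_constraints: relation -> list of entity ids allowed in the tail slot."""
    # Stage 1: sample a batch of positive edges without replacement.
    batch = rng.sample(triples, k=min(batch_size, len(triples)))
    # Stage 2: for each positive edge, corrupt the tail within its
    # relation's candidate set (sampled with replacement).
    negatives = []
    for h, r, t in batch:
        candidates = type_constraints[r]
        negatives.append([(h, r, rng.choice(candidates))
                          for _ in range(num_neg)])
    return batch, negatives

rng = random.Random(0)
triples = [(0, "works_at", 10), (1, "works_at", 11),
           (2, "born_in", 20), (3, "born_in", 21)]
constraints = {"works_at": [10, 11, 12], "born_in": [20, 21, 22]}
pos, neg = two_stage_sample(triples, constraints, batch_size=4,
                            num_neg=2, rng=rng)
# Every negative tail respects its relation's type constraint.
assert all(nt in constraints[r] for group in neg for (_, r, nt) in group)
```

The inefficiency the thread mentions comes from stage 2: because negatives are drawn per edge rather than jointly per chunk, the scoring step loses the shared-candidate batching.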
@asaluja I agree that the two-stage procedure will work and it's something I have in mind as well. The main thing we need to take care of is how to combine this with joint negative sampling. We might need to control the number of relations in a batch so that joint negative sampling can be effective. Our experience is that if we reduce the number of relations in a batch, the performance of the trained embeddings drops. I think we need some experiments to balance computational efficiency against embedding quality. It'll be great if you can contribute this functionality. Please let us know if you have any questions about the current code base.
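One way to realize "control the number of relations in a batch" is to bucket edges by relation and fill each batch from a bounded number of buckets. This is a hedged sketch under that assumption, not DGL-KE's real batching logic:

```python
# Hedged sketch: cap both batch size and the number of distinct
# relations that appear in any one batch.
from collections import defaultdict

def batches_by_relation(triples, batch_size, max_rels):
    buckets = defaultdict(list)
    for h, r, t in triples:
        buckets[r].append((h, r, t))
    batch, rels = [], set()
    for r, edges in buckets.items():
        for e in edges:
            # Flush when the batch is full or adding a new relation
            # would exceed the per-batch relation budget.
            if len(batch) == batch_size or (r not in rels and len(rels) == max_rels):
                yield batch
                batch, rels = [], set()
            batch.append(e)
            rels.add(r)
    if batch:
        yield batch

triples = [(i, i % 3, i + 10) for i in range(9)]  # 3 relations, 9 edges
out = list(batches_by_relation(triples, batch_size=4, max_rels=2))
# No batch exceeds 4 edges or 2 distinct relations.
assert all(len(b) <= 4 and len({r for _, r, _ in b}) <= 2 for b in out)
```

Fewer relations per batch means larger per-relation chunks (good for joint sampling) but less relation diversity per gradient step, which is the trade-off the comment above describes.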
Hi @zheng-da @asaluja I have the same use case, i.e. sampling negatives with constraints on the type of the head/tail entity. As suggested, I set the seed edges to be the edges that belong to a particular edge type/relation. However, the
Hello, I think you have studied the code in detail, so I want to ask you: in the sampling step, pos_g has 1024 edges and neg_g also has 1024 edges, which would mean every triplet is corrupted once rather than k times as described in the paper. Is that right?
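A possible resolution of the question above, sketched under the assumption that neg_g is a chunked negative graph (illustrative shapes, not DGL-KE's actual code): a chunked graph can store as many entries as the positive graph yet still yield k negative scores per positive, because each chunk's positives are scored against the chunk's shared negative nodes with one batched matrix product.

```python
# Hedged sketch: chunk-wise scoring yields k negatives per positive
# without materializing k * num_edges negative edges.
import numpy as np

chunk_size, num_chunks, k, dim = 4, 2, 3, 8
rng = np.random.default_rng(0)
# Embeddings of positive heads, grouped into chunks.
pos_heads = rng.normal(size=(num_chunks, chunk_size, dim))
# k shared negative-tail embeddings per chunk.
neg_tails = rng.normal(size=(num_chunks, k, dim))
# Batched matmul: every positive in a chunk scores against all k negatives.
scores = pos_heads @ neg_tails.transpose(0, 2, 1)
print(scores.shape)  # (2, 4, 3): k = 3 negative scores per positive edge
```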
Hi, thanks for putting this library together. I will put a feature request together in a similar format to the dgl repo:
🚀 Feature

Negative sampling with type constraints in `dgl.contrib.sampling.EdgeSampler` (via `dataloader.sampler.TrainDataset`).

Motivation

When using `EdgeSampler` to sample negative edges in knowledge graph link prediction, it would be useful to incorporate domain-specific type constraints. For example, edges (relations) in a KG are often typed (only specific entity types can slot into the head or tail positions), so an `EdgeSampler` that samples negative edges by selecting head/tail nodes only from a subset of all possible entities would greatly help.

Alternatives

One idea I had was to create a different `EdgeSampler` object for each relation and then batch the graph by relation. That way, when sampling a mini-batch, we are guaranteed that all facts in the batch have the same relation type and can apply the same `EdgeSampler` object to get negative samples. But it seems doing this requires diving into the C++ sampler code.

Another alternative is a two-step sampling procedure in training where I first a) sample positive edges without replacement and then b) based on the relation types in the positive edges, sample negative edges from the relation-specific `EdgeSampler` with replacement. This seems cleaner but also somewhat inefficient. Are there other disadvantages to this?

Any guidance and tips on how best to implement this would be great. I'd be happy to contribute it back to the repo.

Pitch

Similar functionality to how type constraints work in OpenKE.
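For reference, the OpenKE-style constraint can be approximated by collecting, for each relation, the entities observed in its head and tail slots and restricting corruption to those sets. This is a minimal sketch of that idea (OpenKE itself reads these sets from a `type_constrain.txt` file in its benchmark data; the function names here are illustrative):

```python
# Hedged sketch: build per-relation head/tail candidate sets from the
# training triples and corrupt only within them.
from collections import defaultdict
import random

def build_type_constraints(triples):
    heads_of, tails_of = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        heads_of[r].add(h)
        tails_of[r].add(t)
    return heads_of, tails_of

def constrained_corrupt(triple, heads_of, tails_of, rng):
    h, r, t = triple
    if rng.random() < 0.5:
        # Corrupt the head within the relation's observed head set.
        return (rng.choice(sorted(heads_of[r])), r, t)
    # Corrupt the tail within the relation's observed tail set.
    return (h, r, rng.choice(sorted(tails_of[r])))

triples = [(0, "capital_of", 5), (1, "capital_of", 6), (2, "speaks", 7)]
heads_of, tails_of = build_type_constraints(triples)
rng = random.Random(0)
neg = constrained_corrupt(triples[0], heads_of, tails_of, rng)
# The corrupted triple stays within entities seen for "capital_of".
assert neg[0] in heads_of["capital_of"] and neg[2] in tails_of["capital_of"]
```

Deriving the candidate sets from observed triples is a heuristic; a curated type ontology would give tighter constraints when one is available.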