Using pre-trained WikiData embeddings for nearest neighbor search #85
Comments
Most of what you are asking is explained in the documentation so, rather than copy-pasting it here, I suggest you read it there and let me know if anything is unclear. The one thing that I believe we did not explain in the docs is what we mean by "nearest neighbor" search. You are right in saying that to properly compute how "close" (similar) two entities are, one should apply the proper operators and do the dot product. However, it turns out that once the embeddings are fully trained, their distance in L2 space already captures some semantic similarity and can thus be used to get a rough sense of the neighbors. This is an approximation we made in that example that we should have explained better.

If you want a more exact search, there are a few options. I believe you can tell FAISS to use the dot product instead of the L2 norm, although not all indices support it. You cannot tell FAISS to apply the operator for you, but you can apply the operator to your query yourself before searching in the un-transformed embeddings (if you use "standard" relations, this only allows you to query the nearest left-hand side neighbors of a right-hand side entity; if you use dynamic relations you can do it on either side). If for some reason this doesn't work for you, you can drop FAISS entirely and do a slower but more correct evaluation of the scores between an entity and all other ones, similarly to what is done when ranking.

As mentioned in the page about the Wikidata embeddings, the TSV is almost the same format as the one produced by the `torchbiggraph_export_to_tsv` command. You will also find in the doc that, in addition to TSV, there's a machine-readable format for these embeddings (i.e., a NumPy file with the vectors plus a JSON file with the entity names). Dynamic relations are explained here. The parameters for the left-hand side operators also appear in the TSV file.

Then, the FAISS example should work just the same for the Wikidata embeddings. Due to their size you may want to use a different index type for better performance, but that depends on your application and you should turn to the FAISS developers for help with tuning.
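To make the "apply the operator yourself" option concrete, here is a minimal sketch. It assumes the vectors have already been extracted from the published `.npy` download and that `rel_translation` holds the learned translation parameters of one relation; the file name, the placeholder query row, and the loading step are illustrative assumptions, not something prescribed by PBG:

```python
import numpy as np
import faiss  # the library used in the docs' nearest-neighbor example

# Illustrative inputs: an (N, d) float32 matrix of entity embeddings and
# the learned d-dimensional translation vector of one relation. How you
# obtain them (e.g. from the .npy/.json downloads) is up to you.
vectors = np.load("wikidata_translation_v1_vectors.npy").astype(np.float32)
rel_translation = np.zeros(vectors.shape[1], dtype=np.float32)  # placeholder

# An inner-product index ranks by dot product rather than L2 distance.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Apply the translation operator to the query entity yourself, then
# search the un-transformed embeddings. With "standard" relations the
# query is a right-hand-side entity and the neighbors returned are its
# closest left-hand-side entities.
query = vectors[0] + rel_translation  # row 0 is an arbitrary placeholder
scores, neighbor_ids = index.search(query.reshape(1, -1), 10)

# Without FAISS, the exact (but slower) alternative is to score the
# query against every entity directly:
all_scores = vectors @ query
```

For embeddings as large as Wikidata's, a flat (brute-force) index may not fit comfortably in memory, which is where the alternative FAISS index types mentioned above come in.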
Hello, I would like to work with PyTorch-BigGraph. My aim is, starting from a graph dataset, to find similarities between entities, including entities that have numerical attributes (my data are in RDF format). And in the end, how can I apply the TransEA model with numerical attributes after determining the similarity between entities?
Your questions are very broad: they are basically about how to design a full ML pipeline, which is something that is up to you, rather than about how to employ PBG as one block of it, which is what we're here to help with. I advise you to check out the README and documentation and get back to us if you have specific issues.
Thank you very much for your advice and your feedback. Could you please guide me, since I am a newbie in this field: is it possible to get vectors from an RDF dataset using the PBG tool? If yes, which steps should I follow? Secondly, after getting the vectors, I would like to compare them and find which entities' vectors are similar; is that also possible with PBG? Once again, thanks in advance for your guidance and orientation.
PBG, by itself, doesn't read RDF. You need to convert it to either the native format (explained here) or to TSV (tab-separated values), for which there already is an importer (i.e., the `torchbiggraph_import_from_tsv` command).
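For the RDF-to-TSV step, a minimal sketch along these lines might help; rdflib, the file names, and the column order are illustrative assumptions, not part of PBG:

```python
import csv

from rdflib import Graph  # pip install rdflib

# Parse the RDF data (Turtle here; rdflib also handles N-Triples, RDF/XML, ...).
graph = Graph()
graph.parse("data.ttl", format="turtle")

# Write one (lhs, rel, rhs) edge per line, tab-separated, which is the
# general shape of file a TSV importer consumes.
with open("edges.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for subj, pred, obj in graph:
        writer.writerow([str(subj), str(pred), str(obj)])
```

From there, the importer can turn the TSV into PBG's native format, and the nearest-neighbor comparison described earlier in this thread can be run on the trained vectors.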
Thank you very much for the details... |
Closing this as I think everything has been answered and there have been no follow-ups. If I missed something or new questions arise, please reopen or create a new issue.
I downloaded the 36 GB gzipped file (wikidata_translation_v1.tsv.gz) containing the pre-trained WikiData embeddings listed on this page: https://torchbiggraph.readthedocs.io/en/latest/pretrained_embeddings.html
I unzipped it and I see that the actual TSV file is 100+ GB.
I would like to use these embeddings to perform nearest neighbor search. But to perform this search, I would need to use the same comparator and operator that were used to train them, right?
I see on the page that the comparator used was `dot` and the operator used was `translation`. Is the learned translation vector already added to the embeddings in the TSV file, or do I need to fetch it from some place and manually add it to the embeddings before I perform the dot product?

Also, I see that `dynamic_relations` is set to True. I haven't read the section on Dynamic Relations in the docs in detail yet, but it looks like the operator is applied to right-hand side entities under "normal" circumstances, and `dynamic_relations=True` is not a normal circumstance? If so, in this circumstance, which side should the translation operator be applied to: left, right or both?

Also, there is a small example on Nearest Neighbor Search using the FAISS library in the Downstream Tasks section of the docs, although the example does not use the WikiData embeddings. Given how large the WikiData embeddings file is, it would be great if you could share a variant of the same FAISS example that uses the WikiData embeddings!