
Using pre-trained WikiData embeddings for nearest neighbor search #85

Closed
g-karthik opened this issue Jul 24, 2019 · 7 comments

@g-karthik

I downloaded the 36 GB gzipped file (wikidata_translation_v1.tsv.gz) containing the pre-trained WikiData embeddings listed on this page: https://torchbiggraph.readthedocs.io/en/latest/pretrained_embeddings.html
I unzipped it, and the actual TSV file is 100+ GB.

I would like to use these embeddings to perform nearest neighbor search. But to perform this search, I would need to use the same comparator and operator that were used to train them, right?

I see on the page that the comparator used was dot and the operator used was translation. Is the learned translation vector already added to the embeddings in the TSV file, or do I need to fetch it from somewhere and manually add it to the embeddings before I perform the dot product?

Also, I see that dynamic_relations is set to True. I haven't yet read the section on Dynamic Relations in the docs in detail, but it looks like the operator is applied to right-hand-side entities under "normal" circumstances, and dynamic_relations=True is not a normal circumstance. If so, which side should the translation operator be applied to in this case: left, right, or both?

Also, there is a small example of Nearest Neighbor Search using the FAISS library in the Downstream Tasks section of the docs, although the example does not use the WikiData embeddings. Given how large the WikiData embeddings file is, it would be great if you could share a variant of the same FAISS example that uses them!


lw commented Jul 26, 2019

Most of what you are asking is explained in the documentation so, rather than copy-pasting it here, I suggest you read it there and let me know if anything is unclear.

The one thing that I believe we did not explain in the doc is what we mean by "nearest neighbor" search. You are right in saying that, to properly compute how "close" (similar) two entities are, one should apply the proper operators and do the dot product. However, it turns out that once the embeddings are fully trained, their distance in L2 space already captures some semantic similarity and can thus be used to get a rough sense of the neighbors. This is an approximation we made in that example that we should have explained better.

If you want a more exact search, there are a few options. I believe you can tell FAISS to use the dot product instead of the L2 norm, although not all indices support it. You cannot tell FAISS to apply the operator for you, but you can apply the operator yourself to your query before searching in the un-transformed embeddings (if you use "standard" relations, this only allows you to query the nearest left-hand side neighbors of a right-hand side entity; if you use dynamic relations you can do it on either side). If for some reason this doesn't work for you, you can drop FAISS entirely and do a slower but more correct evaluation of the scores between an entity and all other ones, similarly to what is done when ranking.
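
For concreteness, here is a minimal sketch of that query-transform approach, assuming the entity embeddings are already loaded into a NumPy array (all variable names here are hypothetical, not from our codebase):

```python
import faiss
import numpy as np

def nearest_lhs_neighbors(embeddings, translation, rhs_embedding, k=10):
    """Find the entities most likely to appear on the left-hand side,
    given a right-hand side entity's embedding.

    embeddings: float32 array of shape [num_entities, dim]
    translation: the relation's learned translation parameter, shape [dim]
    """
    # Inner-product index, so FAISS scores with the dot product (the
    # comparator used in training) rather than L2 distance.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(np.ascontiguousarray(embeddings, dtype=np.float32))
    # FAISS can't apply the operator for us, so transform the query:
    # with a translation operator, score(lhs, rhs) = <lhs, rhs + t>.
    query = (rhs_embedding + translation).astype(np.float32).reshape(1, -1)
    scores, ids = index.search(query, k)
    return scores[0], ids[0]
```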

As mentioned on the page about the Wikidata embeddings, the TSV is in almost the same format as the output of the export_to_tsv command, which is explained in the readme. The parameters of the operator of each relation type are at the end of the TSV file. They are not pre-applied to the embeddings (this would be impossible if one had more than one relation type, with different parameters).
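
If you only need a handful of rows (say, a few entities plus the operator parameters), you can stream the TSV rather than loading it all. A rough sketch, assuming each line is an identifier followed by tab-separated floats (check the exact identifiers in the file itself):

```python
import numpy as np

def read_tsv_embeddings(path, wanted):
    """Scan the TSV once, returning {name: vector} for the names in `wanted`."""
    found = {}
    with open(path, "rt", encoding="utf-8") as f:
        for line in f:
            name, *values = line.rstrip("\n").split("\t")
            if name in wanted:
                found[name] = np.array([float(v) for v in values],
                                       dtype=np.float32)
            if len(found) == len(wanted):
                break  # stop early once everything was found
    return found
```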

You will also find in the doc that, in addition to TSV, there's a machine-readable format for these embeddings (i.e., .npy). There we also explain how to load it.
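
As a sketch, assuming the decompressed companion files listed on that page (a JSON list of names and an .npy matrix of vectors; verify the file names against the page before use):

```python
import json
import numpy as np

with open("wikidata_translation_v1_names.json", "rt") as f:
    names = json.load(f)  # list of entity/relation identifiers
# Row i of `vectors` is the embedding of names[i]; memory-mapping avoids
# pulling the whole matrix into RAM at once.
vectors = np.load("wikidata_translation_v1_vectors.npy", mmap_mode="r")
```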

Dynamic relations are explained here. The parameters for the left-hand side operators also appear in the TSV file, with a _reverse_relation suffix.
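
Continuing the TSV sketch above (reusing read_tsv_embeddings), picking the operator parameters for either query direction would look something like this; the relation identifiers are illustrative, not the exact keys in the file:

```python
params = read_tsv_embeddings(
    "wikidata_translation_v1.tsv",
    {"P31", "P31_reverse_relation"},  # hypothetical identifiers
)
rhs_translation = params["P31"]                   # for lhs -> rhs queries
lhs_translation = params["P31_reverse_relation"]  # for rhs -> lhs queries
```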

Then, the FAISS example should work just the same for the Wikidata embeddings. Due to their size you may want to use a different index type for better performance, but that depends on your application and you should turn to the FAISS developers for help with tuning.
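
Just to illustrate the kind of change I mean (the numbers below are placeholders to tune, not recommendations): an IVF index trades some recall for much faster search on collections of this size.

```python
import faiss
import numpy as np

d = embeddings.shape[1]  # `embeddings` as in the sketch above
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 4096, faiss.METRIC_INNER_PRODUCT)
# Train the coarse quantizer on a random sample, then add everything.
sample = embeddings[np.random.choice(len(embeddings), 100_000, replace=False)]
index.train(np.ascontiguousarray(sample, dtype=np.float32))
index.add(np.ascontiguousarray(embeddings, dtype=np.float32))
index.nprobe = 32  # clusters scanned per query; raise for better recall
```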

@kadimaolivier

Hello, I would like to work with PyTorch-BigGraph. My aim is, starting from a graph dataset, to find similarities between entities, and to determine similarity between entities that have numerical attributes (my data are in RDF format). Finally, how can I apply the TransEA model with numerical attributes after determining the similarity between entities?


lw commented Aug 7, 2019

Your questions are very broad and they are basically about how to design a full ML pipeline, which is something that is up to you, rather than how to employ PBG as one block of it, which is what we're here to help with. I advise you to check out the README and documentation and get back to us if you have specific issues.

@kadimaolivier

Thank you very much for your advice and your feedback. Could you please guide me, since I am a newbie in this field: is it possible to get vectors from an RDF dataset using the PBG tool? If yes, which steps should I follow? Secondly, after getting the vectors, I would like to compare them and find which entities' vectors are similar; is that also possible with PBG? Once again, thanks in advance for your guidance.


lw commented Aug 8, 2019

PBG, by itself, doesn't read RDF. You need to convert it to either the native format (explained here) or to TSV (tab-separated values), for which there already is an importer (i.e., torchbiggraph_from_tsv). The N-Triples format of RDF is somewhat similar to TSV, so that may be easiest. Once you have your data in the right format, you can find in the doc explanations on how to train embeddings for its entities.
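
As a rough sketch of that conversion (real N-Triples has more cases — literals with datatypes, blank nodes, escapes — so a proper RDF parser such as rdflib would be more robust):

```python
# Flatten N-Triples statements into three-column TSV (subject, relation,
# object) for the TSV importer. File names here are just examples.
with open("data.nt", "rt", encoding="utf-8") as src, \
        open("data.tsv", "wt", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        # Each statement is "<subject> <predicate> <object> ." — split on
        # whitespace at most twice and drop the trailing dot.
        subj, pred, obj = line.rstrip(" .").split(None, 2)
        dst.write(f"{subj}\t{pred}\t{obj}\n")
```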

@kadimaolivier

Thank you very much for the details...


lw commented Sep 23, 2019

Closing this as I think everything has been answered and there have been no follow-ups. If I missed something or new questions arise, please reopen or create a new issue.

lw closed this as completed Sep 23, 2019