
Using pre-trained WikiData embeddings for nearest neighbor search #85

Closed
g-karthik opened this issue Jul 24, 2019 · 7 comments

@g-karthik

I downloaded the 36 GB gzipped file (wikidata_translation_v1.tsv.gz) containing the pre-trained WikiData embeddings listed on this page: https://torchbiggraph.readthedocs.io/en/latest/pretrained_embeddings.html
I unzipped it, and the actual TSV file is 100+ GB.

I would like to use these embeddings to perform nearest neighbor search. But to perform this search, I would need to use the same comparator and operator that were used to train them, right?

I see on the page that the comparator used was dot and the operator used was translation. Is the learned translation vector already added to the embeddings in the TSV file, or do I need to fetch it from somewhere and manually add it to the embeddings before I perform the dot product?

Also, I see that dynamic_relations is set to True. I haven't yet read the section on Dynamic Relations in the docs in detail, but it looks like the operator is applied to right-hand-side entities under "normal" circumstances, and dynamic_relations=True is not a normal circumstance. If so, which side should the translation operator be applied to in this case: left, right, or both?

Also, there is a small example of Nearest Neighbor Search using the FAISS library in the Downstream Tasks section of the docs, although the example does not use the WikiData embeddings. Given how large the WikiData embeddings file is, it would be great if you could share a variant of the same FAISS example that uses them!


lw commented Jul 26, 2019

Most of what you are asking is explained in the documentation so, rather than copy-pasting it here, I suggest you read it there and let me know if anything is unclear.

The one thing that I believe we did not explain in the doc is what we mean by "nearest neighbor" search. You are right in saying that, to properly compute how "close" (similar) two entities are, one should apply the proper operators and do the dot product. However, it turns out that once the embeddings are fully trained, their distance in L2 space already captures some semantic similarity and can thus be used to get a rough sense of the neighbors. This is an approximation we made in that example that we should have explained better.

If you want a more exact search, there are a few options. I believe you can tell FAISS to use the dot product instead of the L2 norm, although not all indices support it. You cannot tell FAISS to apply the operator for you, but you can apply the operator yourself to your query before searching in the un-transformed embeddings (if you use "standard" relations, this only allows you to query the nearest left-hand side neighbors of a right-hand side entity; if you use dynamic relations you can do it on either side). If for some reason this doesn't work for you, you can drop FAISS entirely and do a slower but more correct evaluation of the scores between an entity and all other ones, similarly to what is done when ranking.
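
For concreteness, here is a minimal sketch of that query-transform approach, assuming the entity embeddings are already loaded into a NumPy array (all variable names here are hypothetical, not from our codebase):

```python
import faiss
import numpy as np

def nearest_lhs_neighbors(embeddings, translation, rhs_embedding, k=10):
    """Find the entities most likely to appear on the left-hand side,
    given a right-hand side entity's embedding.

    embeddings: float32 array of shape [num_entities, dim]
    translation: the relation's learned translation parameter, shape [dim]
    """
    # Inner-product index, so FAISS scores with the dot product (the
    # comparator used in training) rather than L2 distance.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(np.ascontiguousarray(embeddings, dtype=np.float32))
    # FAISS can't apply the operator for us, so transform the query:
    # with a translation operator, score(lhs, rhs) = <lhs, rhs + t>.
    query = (rhs_embedding + translation).astype(np.float32).reshape(1, -1)
    scores, ids = index.search(query, k)
    return scores[0], ids[0]
```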

As mentioned on the page about the Wikidata embeddings, the TSV is in almost the same format as the output of the export_to_tsv command, which is explained in the readme. The parameters of the operator of each relation type are at the end of the TSV file. They are not pre-applied to the embeddings (this would be impossible if one had more than one relation type, with different parameters).
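
If you only need a handful of rows (say, a few entities plus the operator parameters), you can stream the TSV rather than loading it all. A rough sketch, assuming each line is an identifier followed by tab-separated floats (check the exact identifiers in the file itself):

```python
import numpy as np

def read_tsv_embeddings(path, wanted):
    """Scan the TSV once, returning {name: vector} for the names in `wanted`."""
    found = {}
    with open(path, "rt", encoding="utf-8") as f:
        for line in f:
            name, *values = line.rstrip("\n").split("\t")
            if name in wanted:
                found[name] = np.array([float(v) for v in values],
                                       dtype=np.float32)
            if len(found) == len(wanted):
                break  # stop early once everything was found
    return found
```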

You will also find in the doc that, in addition to TSV, there's a machine-readable format for these embeddings (i.e., .npy). There we also explain how to load it.
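
As a sketch, assuming the decompressed companion files listed on that page (a JSON list of names and an .npy matrix of vectors; verify the file names against the page before use):

```python
import json
import numpy as np

with open("wikidata_translation_v1_names.json", "rt") as f:
    names = json.load(f)  # list of entity/relation identifiers
# Row i of `vectors` is the embedding of names[i]; memory-mapping avoids
# pulling the whole matrix into RAM at once.
vectors = np.load("wikidata_translation_v1_vectors.npy", mmap_mode="r")
```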

Dynamic relations are explained here. The parameters for the left-hand side operators also appear in the TSV file, with a _reverse_relation suffix.
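
Continuing the TSV sketch above (reusing read_tsv_embeddings), picking the operator parameters for either query direction would look something like this; the relation identifiers are illustrative, not the exact keys in the file:

```python
params = read_tsv_embeddings(
    "wikidata_translation_v1.tsv",
    {"P31", "P31_reverse_relation"},  # hypothetical identifiers
)
rhs_translation = params["P31"]                   # for lhs -> rhs queries
lhs_translation = params["P31_reverse_relation"]  # for rhs -> lhs queries
```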

Then, the FAISS example should work just the same for the Wikidata embeddings. Due to their size you may want to use a different index type for better performance, but that depends on your application and you should turn to the FAISS developers for help with tuning.
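
Just to illustrate the kind of change I mean (the numbers below are placeholders to tune, not recommendations): an IVF index trades some recall for much faster search on collections of this size.

```python
import faiss
import numpy as np

d = embeddings.shape[1]  # `embeddings` as in the sketch above
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 4096, faiss.METRIC_INNER_PRODUCT)
# Train the coarse quantizer on a random sample, then add everything.
sample = embeddings[np.random.choice(len(embeddings), 100_000, replace=False)]
index.train(np.ascontiguousarray(sample, dtype=np.float32))
index.add(np.ascontiguousarray(embeddings, dtype=np.float32))
index.nprobe = 32  # clusters scanned per query; raise for better recall
```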

@kadimaolivier

Hello, I would like to work with PyTorch-BigGraph. My aim is, starting from a graph dataset, to find similarities between entities, and to determine similarity between entities that have numerical attributes (my data are in RDF format). Finally, how can I apply the TransEA model with numerical attributes after determining the similarity between entities?


lw commented Aug 7, 2019

Your questions are very broad and they are basically about how to design a full ML pipeline, which is something that is up to you, rather than how to employ PBG as one block of it, which is what we're here to help with. I advise you to check out the README and documentation and get back to us if you have specific issues.

@kadimaolivier

Thank you very much for your advice and your feedback. Could you please guide me, since I am a newbie in this field: is it possible to get vectors from an RDF dataset using the PBG tool? If yes, which steps should I follow? Secondly, after getting the vectors, I would like to compare them and find which entities' vectors are similar; is that also possible with PBG? Once again, thanks in advance for your guidance.


lw commented Aug 8, 2019

PBG, by itself, doesn't read RDF. You need to convert it to either the native format (explained here) or to TSV (tab-separated values), for which there already is an importer (i.e., torchbiggraph_from_tsv). The N-Triples format of RDF is somewhat similar to TSV, so that may be easiest. Once you have your data in the right format, you can find in the doc explanations on how to train embeddings for its entities.
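
As a rough sketch of that conversion (real N-Triples has more cases — literals with datatypes, blank nodes, escapes — so a proper RDF parser such as rdflib would be more robust):

```python
# Flatten N-Triples statements into three-column TSV (subject, relation,
# object) for the TSV importer. File names here are just examples.
with open("data.nt", "rt", encoding="utf-8") as src, \
        open("data.tsv", "wt", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        # Each statement is "<subject> <predicate> <object> ." — split on
        # whitespace at most twice and drop the trailing dot.
        subj, pred, obj = line.rstrip(" .").split(None, 2)
        dst.write(f"{subj}\t{pred}\t{obj}\n")
```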

@kadimaolivier

Thank you very much for the details...


lw commented Sep 23, 2019

Closing this as I think everything has been answered and there have been no follow-ups. If I missed something or new questions arise, please reopen or create a new issue.

lw closed this as completed Sep 23, 2019