[Question] About Model Inference and Wikidata SPARQL #13
Comments
Hi @loretoparisi, thanks for your comments. We do not currently have an nn API in the PyTorch-BigGraph model. We'll consider adding it. For big graphs one does need to use things like FAISS to query nearest neighbors efficiently.
I'm a bit wary of adding support for "downstream tasks" (nearest neighbors, classification, clustering, querying, ranking, etc.) directly to PBG. It shouldn't be PBG's business to know what you want to do with the embeddings it produces (except for what concerns the loss function to use). There are plenty of these tasks and every person wants them in a slightly different flavor. It's going to be impossible to support all these tasks in ways that make everyone happy, and trying to do so will put a lot of strain on the PBG team. It will also bloat the code and distract from the core functionality.

What I believe we should do is adopt clear standard "interfaces" (more precisely, file formats) to store embeddings in so they can be consumed by other tools to do other things. PBG's "native" checkpoint format is quite custom but is based on the standard HDF5 format. Its specifications are in the documentation and we intend to keep it backwards compatible. Moreover, we welcome converters to and from other more standard formats. For now we only have them for TSV, but if RDF and SPARQL (which I'm not familiar with) can be used to represent PBG's data I'd love to see converters for them in PBG.

In general, I imagine PBG and the "ecosystem" around it (if I'm allowed to dream a little) as a family of "UNIX-style" tools: self-contained modular utilities that do one job (and do it well) and can be "chained" into a pipeline.
@lerks thanks for the clarification, it makes perfect sense to keep PBG as general as possible. I think having RDF triples through SPARQL or GraphQL could be a good option. SPARQL is typically used by Wikimedia's BlazeGraph and is the de facto standard, while FB's GraphQL is used by more recent graph databases like Dgraph. So I think that having a SPARQL converter could be an option. On that note, could you please provide an example of using PBG and FAISS with the smaller FB15k here? Thank you.
Here's some proof-of-concept of using the embeddings learned by the FB15k script with FAISS. It uses the "native" output format (the
These four entities are respectively:
Which, with the exception of the second, are all actors.
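The idea of the proof-of-concept above can be sketched in plain NumPy (random data stands in for the real FB15k embeddings, which would be loaded from the checkpoint produced by the example script). Note that a brute-force squared-L2 scan is exactly what `faiss.IndexFlatL2` computes; FAISS just does it much faster:

```python
import numpy as np

# Toy stand-in for the FB15k embedding table; the real one would be
# loaded from the checkpoint produced by the example script.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((100, 16)).astype(np.float32)

def nearest(query_idx, k=4):
    # Brute-force squared-L2 search, i.e. what faiss.IndexFlatL2
    # computes (FAISS just does it much faster).
    dists = ((embeddings - embeddings[query_idx]) ** 2).sum(axis=1)
    return np.argsort(dists)[:k]

# The query entity itself always comes back at rank 0 (distance zero).
print(nearest(0))
```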
@lerks amazing, thank you. Basically with the FAISS ANN on
I'm not sure I understand what you mean by "calculate the embedding of a node from the model". The parameters of the model are the vertices' embeddings (plus the relation parameters), and the model allows calculating a score for every

If your question is how to retrieve the embedding of a vertex, then there's either the
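Retrieving the embedding of a known vertex is just a table lookup. A minimal sketch, assuming one has a list of entity names in the same order as the rows of the embedding table (in the thread this ordering comes from the dictionary file produced by the example script; the names and data below are made up):

```python
import numpy as np

# Hypothetical names, in the same order as the rows of the table.
names = ["entity0", "entity1", "entity2"]
rng = np.random.default_rng(1)
embeddings = rng.standard_normal((3, 8)).astype(np.float32)

# Build a name -> row lookup once, then index into the table.
row_of = {name: i for i, name in enumerate(names)}

def embedding_of(name):
    return embeddings[row_of[name]]

print(embedding_of("entity1").shape)  # (8,)
```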
If your question is what to do when your graph changes and new entities are added (you said something about "a potential new candidate node"), then there are a few options. Entities that weren't trained on have no embedding. The "proper" way to give them an embedding is to re-train on the new graph, in which they appear. There's some discussion in #17 about how to re-use some previously-learned embeddings when training on a new graph.

If re-training is too much effort, you may be able to hack together something simpler that could still give decent results, for example using as the embedding of an unseen entity the average of the embeddings of its neighbors. Let me add that from a theoretical point of view this approach doesn't necessarily make sense (PBG's training does not actually try to place adjacent entities close to each other; instead it tries to pull together their embeddings after passing them through the relation type's parameters; this only means that entities with similar neighborhoods may end up being nearby).
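The neighbor-averaging hack described above amounts to a one-line centroid computation. A minimal sketch with made-up data (rows 0, 2 and 3 stand in for the trained embeddings of the unseen entity's known neighbors):

```python
import numpy as np

# Hypothetical: 5 trained entities, and a new entity whose known
# neighbors are rows 0, 2 and 3. Its stand-in embedding is the mean
# (centroid) of those rows.
rng = np.random.default_rng(2)
embeddings = rng.standard_normal((5, 8)).astype(np.float32)
neighbor_rows = [0, 2, 3]

unseen_embedding = embeddings[neighbor_rows].mean(axis=0)
print(unseen_embedding.shape)  # (8,)
```

As the comment in the thread notes, this is a heuristic, not something PBG's training objective guarantees to be meaningful.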
Thank you very much; sorry, I meant both of the cases you pointed out:
Regarding this last point, there is a question on FastText Word2vec regarding online training. Because of new items (in that case, words), the model should support online training or fine-tuning to avoid recalculating the embeddings from scratch. It's pretty interesting the

Back in the graph setting, assuming I have the unseen node, we want to get the embeddings of its neighbors. As you mention, we could compute the centroid by averaging the embeddings of the neighbors. This means that we must know the neighbors beforehand, or we could make a hypothesis about possible relations with other nodes and then calculate the embeddings. At this point, how can we evaluate that prediction of similar neighborhoods against the model? Thanks a lot.
Instead of reading in an h5 file when adding to the index, how can we read in a TSV file? Or, is there a trained PBG model on Wikidata that is in h5 format? Referring to this line:
As far as I know, FAISS (for Python) only accepts NumPy arrays. So you'll have to import such an array from the TSV file. This can be achieved with the
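The elided command was presumably along these lines (the exact filename and column range are assumptions; the thread only tells us that the first column holds the name, that `skiprows=1` drops the first line, and that the first 200 value columns were loaded, i.e. something like `usecols=range(1, 201)` on the real file). A self-contained demo on an in-memory stand-in:

```python
import io
import numpy as np

# Tiny stand-in for the Wikidata TSV: a first line that gets skipped,
# then one row per item with its name followed by the vector components.
tsv = ("header\tto\tbe\tskipped\n"
       "<Q1>\t0.1\t0.2\t0.3\n"
       "<Q2>\t0.4\t0.5\t0.6\n")

# skiprows=1 drops the first line; usecols skips the name column.
arr = np.loadtxt(io.StringIO(tsv), dtype=np.float32,
                 delimiter="\t", skiprows=1, usecols=(1, 2, 3))
print(arr.shape)  # (2, 3)
```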
However, beware that the command above will take a very long time to complete. As mentioned in #15 (comment), we now also provide a more machine-friendly version of the Wikidata embeddings. After de-compressing the
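For a large decompressed `.npy` file, `np.load` with `mmap_mode="r"` avoids pulling the whole array into RAM at once; rows are read from disk only as they are accessed. A sketch on a small locally-written demo array (the real filename from the release is not shown here):

```python
import numpy as np

# Write a small demo array, then load it back memory-mapped; on the
# real decompressed Wikidata .npy file the same mmap_mode="r" call
# avoids reading the whole (very large) array into RAM.
np.save("demo_vectors.npy", np.arange(12, dtype=np.float32).reshape(4, 3))
vectors = np.load("demo_vectors.npy", mmap_mode="r")
print(vectors.shape)  # (4, 3)
print(np.asarray(vectors[2]))  # only this row's slice is touched
```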
Thanks for the reply! Currently I am trying to just test with the FB15k dataset, which is much smaller. I am using the joined_embeddings.tsv file that you get from running the torchbiggraph script. Am I able to just pass in the embedding that you described? I get an assertion error when trying to pass in
And is there a reason why you are adding the skiprows=1 parameter? (I've tried without skiprows but I still get an AssertionError.) I believe the problem may be that the dimension of the entities in FB15k is 400 whereas the dimension of the relations is 200; is there something I can do to load the embeddings from the FB15k TSV file?
Without the exact error message I can't know for sure what's going on, but my guess is that it's a dimension mismatch: the command I posted loads only the first 200 dimensions, because that's how many there are in the Wikidata model, but you probably initialized the FAISS index with 400, that is, the dimension of the FB15k model. To fix that, simply set

Also, the

You raise a good point that the TSV file also contains relation parameters, which you most likely want to ignore when you look for nearest neighbors among the entities. The only way I found to do so is to figure out the total number of entities (for FB15k this should be 14951) and then pass the

Thanks for pointing out that the number of parameters is 200 for each relation: it should be 400 (this is true for the translation, diagonal and complex diagonal operators, not for the linear or affine ones) but I just realized there's a bug in

Finally, observe that if you are training the FB15k model yourself you easily have access to the
Thank you so much! Now it makes sense, since the dimensions of the relations in FB15k were just mismatched. I would appreciate it if you let me know when the export_to_tsv function is updated. What approach would you recommend to separate the entities from the relations in the Wikidata TSV file? According to the documentation, are there exactly 78 million entities and 4131 relations? Also, with so much data, is there a simple way to split the data into chunks and process them into IndexFlatL2 separately so that RAM is not overloaded? Lastly, a bit off topic, but would subtracting entity vectors a, b and searching for that resultant vector within all the relation vectors yield the closest relation vectors to a, b?
The fix to the

In the Wikidata TSV file, the first line (i.e., the one we're ignoring with
So, if you just want to extract the entities you can use
Whereas if you just want the relation parameters you can use
And if you want the parameters of the reverse relation, they come just after
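The layout described above (entity rows first, then the relation parameters, then the parameters of the reverse relations) means the three sections can be separated by simple row slicing once the counts are known. A sketch with made-up small counts standing in for the real ones:

```python
import numpy as np

# Hypothetical counts; on the real file these would be the actual
# numbers of entities and relations. Rows are laid out as: entities,
# then relation parameters, then reverse-relation parameters.
num_entities, num_relations, dim = 5, 2, 3
rng = np.random.default_rng(4)
table = rng.standard_normal((num_entities + 2 * num_relations, dim))

entities = table[:num_entities]
relations = table[num_entities:num_entities + num_relations]
reverse_relations = table[num_entities + num_relations:]
print(entities.shape, relations.shape, reverse_relations.shape)
```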
Again, the above commands will most probably be very slow. You can

Finally, I don't really know how to answer your question about searching for the nearest neighbor of the difference of two embeddings in the space of relation parameters to identify the most likely relation: I've never tried it.
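On the chunking question from earlier: `faiss.IndexFlatL2.add` can simply be called once per chunk, so only one chunk needs to be in RAM at a time. The same idea in plain NumPy (a sketch with random data standing in for the real embeddings), keeping a running best-k across chunks:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 16, 5
query = rng.standard_normal(d).astype(np.float32)

# Process the data chunk by chunk, keeping only the best k candidates
# seen so far; with FAISS the loop body would just be index.add(chunk).
best_dists = np.full(k, np.inf, dtype=np.float32)
best_ids = np.full(k, -1)
for chunk_no in range(4):
    chunk = rng.standard_normal((1000, d)).astype(np.float32)
    dists = ((chunk - query) ** 2).sum(axis=1)
    ids = np.arange(1000) + chunk_no * 1000
    # Merge this chunk's candidates with the running best-k.
    all_d = np.concatenate([best_dists, dists])
    all_i = np.concatenate([best_ids, ids])
    order = np.argsort(all_d)[:k]
    best_dists, best_ids = all_d[order], all_i[order]

print(best_ids)  # indices of the k nearest rows across all chunks
```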
…tions (#37)

Summary: Some operators (affine, complex_diagonal) have more than one parameter; however, export_to_tsv was only exporting the first one. This caused, for example, the ComplEx FB15k export to have only 200 params per relation (the real part), with the other 200 (the imaginary part) being dropped.

- [ ] Docs change / refactoring / dependency upgrade
- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

See #13 (comment)

Tested using the FB15k and (a modified version of) the LiveJournal example scripts.

Pull Request resolved: #37
Differential Revision: D15030457
Pulled By: lerks
fbshipit-source-id: 75c2a148eb2f8e946b4bc5fba1ce284bdcf1620f
How many relations are in the FB15k model? You mentioned that there are 14951 entities, but what about relations? At which indices would I have to slice to separate entities and relations? And I would not need to skip the first row, right?
If you have run the
And, yes,
Thank you so much! In the code you wrote above in the comments:
You used a dictionary.json file that maps the resultant vectors to the entity IDs. Is there a dictionary like this for Wikidata?
It exists, although not exactly in the same format: https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1_names.json.gz contains a JSON-encoded list of strings, which are the names of the entities plus, at the end, the names of the relations and the names of the reverse relations (just like with the TSV file, as discussed above).
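Reading that gzipped JSON list and using it to translate FAISS result indices back to names can be sketched as follows (the list contents below are a tiny made-up stand-in for the real file):

```python
import gzip
import json

# Tiny stand-in for wikidata_translation_v1_names.json.gz: a
# JSON-encoded list of strings, entity names first, relation names
# (and reverse-relation names) at the end.
names = ["<http://www.wikidata.org/entity/Q1>",
         "<http://www.wikidata.org/entity/Q2>",
         "some_relation", "some_relation_reverse"]
with gzip.open("demo_names.json.gz", "wt") as f:
    json.dump(names, f)

with gzip.open("demo_names.json.gz", "rt") as f:
    loaded = json.load(f)

# A FAISS search returns row indices; entry i of this list names
# row i of the vectors, so it maps results back to entities.
result_indices = [1, 0]
print([loaded[i] for i in result_indices])
```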
Thank you for this amazing project and pre-trained model.
Typically when dealing with a Wikidata RDF dump, one will use SPARQL as the interface to the RDF triples model, so that it is possible to get items with their relations, etc.

That said, assuming one can load the Wikidata model you have released, it should be possible to infer an embedding for specified source and target nodes and their relations.

So the question is: having this model, with the neighbor nodes already learned, is it possible to get the nearest nodes directly (like with the nn API of Word2Vec, to be clear), or must an Approximate Nearest Neighbor library like FAISS or Annoy be used to get the closest embedding vectors?
Thank you.