[Question] About Model Inference and Wikidata SPARQL #13

Closed

loretoparisi opened this issue Apr 2, 2019 · 18 comments
Labels
enhancement New feature or request

Comments

@loretoparisi

Thank you for this amazing project and pre-trained model.
Typically, when dealing with Wikidata RDF, one uses SPARQL as the interface to the RDF triples, so that it is possible to get items with their relations, etc.
That said, assuming one can load the Wikidata model you have released, it should be possible to infer an embedding for specified source and target nodes and their relations.
So the question is: given that this model has already learned the neighboring nodes, is it possible to get the nearest nodes directly (like the nn API of Word2Vec, to be clear), or must an approximate nearest neighbor library like FAISS or Annoy be used to get the closest embedding vectors?
Thank you.

@ledw
Contributor

ledw commented Apr 3, 2019

Hi @loretoparisi, thanks for your comments. We do not currently have an nn API for the PyTorch-BigGraph model. We'll consider adding it. For big graphs one does need to use things like FAISS to query nearest neighbors efficiently.

@lw
Contributor

lw commented Apr 4, 2019

I'm a bit wary of adding support for "downstream tasks" (nearest neighbors, classification, clustering, querying, ranking, etc.) directly to PBG. It shouldn't be PBG's business to know what you want to do with the embeddings it produces (except for what concerns the loss function to use). There are plenty of these tasks and every person wants them in a slightly different flavor. It's going to be impossible to support all these tasks in ways that make everyone happy, and trying to do so will put a lot of strain on the PBG team. It will also bloat the code and distract from the core functionality.

What I believe we should do is adopt clear standard "interfaces" (more precisely, file formats) to store embeddings in so they can be consumed by other tools to do other things. PBG's "native" checkpoint format is quite custom but is based on the standard HDF5 format. Its specifications are in the documentation and we intend to keep it backwards compatible. Moreover, we welcome converters to and from other more standard formats. For now we only have them for TSV but if RDF and SPARQL (which I'm not familiar with) can be used to represent PBG's data I'd love to see converters for them in PBG.

In general, I imagine PBG and the "ecosystem" around it (if I'm allowed to dream a little) as a family of "UNIX-style" tools: self-contained modular utilities that do one job (and do it well) that can be "chained" into a pipeline.

@lw lw added the enhancement New feature or request label Apr 4, 2019
@loretoparisi
Author

@lerks thanks for the clarification, it makes perfect sense to keep PBG as general as possible. I think having RDF triples through SPARQL or GraphQL could be a good option. SPARQL is typically used by Wikimedia's Blazegraph and is the de facto standard, while FB's GraphQL is being used by more recent graph databases like Dgraph. I think that having a SPARQL converter could be an option.

On that note, could you please provide an example of using PBG and FAISS with the smaller FB15k?
This would help with using PBG in a real-world application. In my application I'm taking a subset of Wikidata for specific categories, computing the embeddings for this graph through PBG, and calculating the ANN.

Thank you.

@lw
Contributor

lw commented Apr 5, 2019

Here's a proof of concept of using the embeddings learned by the FB15k script with FAISS. It uses the "native" output format (the .h5 files, not the .tsv ones) but it does so directly through h5py, avoiding the need to import any PBG code.

In [1]: import json

In [2]: import numpy as np

In [3]: import h5py

In [4]: import faiss

In [5]: index = faiss.IndexFlatL2(400)

In [6]: with h5py.File("model/fb15k/embeddings_all_0.v50.h5", "r") as hf:
   ...:     index.add(hf["embeddings"][...])
   ...:

In [7]: _, neighbors = index.search(np.random.randn(1, 400).astype('float32'), 4)

In [8]: with open("data/FB15k/dictionary.json", "rt") as f:
   ...:     dictionary = json.load(f)
   ...:

In [9]: dictionary["entities"]["all"][neighbors[0, 0]]
Out[9]: '/m/05nzw6'

In [10]: dictionary["entities"]["all"][neighbors[0, 1]]
Out[10]: '/m/05hf_5'

In [11]: dictionary["entities"]["all"][neighbors[0, 2]]
Out[11]: '/m/01ry0f'

In [12]: dictionary["entities"]["all"][neighbors[0, 3]]
Out[12]: '/m/017yxq'

With the exception of the second, these four entities are all actors.

@loretoparisi
Author

@lerks amazing, thank you. Basically, with the FAISS ANN index.search we find the neighbors of the specified embedding of dim 400. Of course, we have all the embeddings of the current graph, but we could also approximate the embedding of a potential new candidate node. So how do we calculate the embedding of a node from the model?
Thank you.

@lw
Contributor

lw commented Apr 5, 2019

I'm not sure I understand what you mean by "calculate the embedding of a node from the model". The parameters of the model are the vertices' embeddings (plus the relation parameters), and the model allows one to calculate a score for every (source, target, relation type) triplet.

If your question is how to retrieve the embedding of a vertex, then one option is the torchbiggraph_export_to_tsv command (mentioned in the README), which will produce a file that lists all entities by their original identifier and matches them with their embedding (I'm sure you knew this, as you mentioned it in #15). If you want something more "raw" (i.e., to get embeddings as NumPy arrays rather than as text) you can do it as follows:

In [1]: import json

In [2]: import h5py

In [3]: with open("data/FB15k/dictionary.json", "rt") as f:
   ...:     dictionary = json.load(f)
   ...:

In [4]: offset = dictionary["entities"]["all"].index("/m/05hf_5")

In [5]: with h5py.File("model/fb15k/embeddings_all_0.v50.h5", "r") as hf:
   ...:     embedding = hf["embeddings"][offset, :]
   ...:

In [6]: embedding
Out[6]:
array([-1.61627859e-01, -7.48140737e-02, -1.52236715e-01,  3.08571178e-02,
        ...
        5.75154647e-02,  2.42468923e-01,  7.55479559e-02, -1.26004517e-01],
      dtype=float32)

If your question is what to do when your graph changes and new entities are added (you said something about "a potential new candidate node"), then there are a few options. Entities that weren't trained on have no embedding. The "proper" way to give them an embedding is to re-train on the new graph, in which they appear. There's some discussion in #17 about how to re-use some previously-learned embeddings when training on a new graph.

If re-training is too much effort you may be able to hack together something simpler that could still give decent results, for example using as the embedding of an unseen entity the average of the embeddings of its neighbors. Let me add that from a theoretical point of view this approach doesn't necessarily make sense (PBG's training does not actually try to place entities that are adjacent close to each other; instead it tries to pull together their embeddings after passing them through the relation type's parameters; this only means that entities with similar neighborhoods may end up being nearby).
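
For illustration, here's a minimal sketch of that neighbor-averaging heuristic, reusing the FB15k files from the snippets above (the list of neighbors is made up for the sake of the example):

import json

import h5py
import numpy as np

# Hypothetical known neighbors of the unseen entity (these MIDs are just examples).
neighbor_names = ["/m/05hf_5", "/m/01ry0f"]

with open("data/FB15k/dictionary.json", "rt") as f:
    dictionary = json.load(f)

offsets = [dictionary["entities"]["all"].index(name) for name in neighbor_names]

with h5py.File("model/fb15k/embeddings_all_0.v50.h5", "r") as hf:
    neighbor_embeddings = np.stack([hf["embeddings"][offset, :] for offset in offsets])

# Use the centroid of the neighbors' embeddings as a stand-in for the unseen entity.
approx_embedding = neighbor_embeddings.mean(axis=0)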

@loretoparisi
Author

loretoparisi commented Apr 5, 2019

Thank you very much. Sorry, I meant both of the cases that you pointed out:

  • the embedding from the pre-trained model for a given dictionary index, as you show in the first example;
  • adding unseen entities, for which, as you stated, re-training would of course be the best case.

Regarding this last point, there is a similar question for FastText's Word2vec about online training. Because of new items (in that case, words), the model should support online training or fine-tuning to avoid recomputing the embeddings from scratch.

The similar-neighborhoods concept you pointed out is pretty interesting. If we draw an analogy with Doc2Vec, i.e. the centroid of all the embeddings in a text, the nn will return documents similar to an unseen document, because we can calculate the embedding of each word of the new document and then their centroid. We do this by getting the words from the dictionary and throwing out the unknown words.

Back to the graph: assuming I have an unseen node, we want to get the embeddings of its neighbors. As you mention, we could take the centroid by averaging the embeddings of the neighbors. This means that we either need to know the neighbors beforehand, or we could make a hypothesis about possible relations with other nodes and then calculate the embeddings. At this point, how can we evaluate that prediction of similar neighborhoods against the model?

Thanks a lot.

@JasonLLu

JasonLLu commented Apr 20, 2019

Instead of reading in an .h5 file when adding to the index, how can we read in a .tsv file?

Or, is there a trained PBG model on Wikidata that is in .h5 format?

Referring to these lines:

In [5]: with h5py.File("model/fb15k/embeddings_all_0.v50.h5", "r") as hf:
   ...:     embedding = hf["embeddings"][...]

@lw
Contributor

lw commented Apr 21, 2019

As far as I know, FAISS (for Python) only accepts NumPy arrays. So you'll have to import such an array from the TSV file. This can be achieved with the loadtxt function. The following command should work with the Wikidata embeddings:

embedding = np.loadtxt("wikidata_translation_v1.tsv", dtype=np.float32, delimiter="\t", skiprows=1, usecols=range(1, 201), comments=None)

@lw
Contributor

lw commented Apr 21, 2019

However, beware that the command above will take a very long time to complete. As mentioned in #15 (comment), we now also provide a more machine-friendly version of the Wikidata embeddings. After decompressing the .npy.gz file, the data can be loaded simply as:

embedding = np.load("wikidata_translation_v1_vectors.npy")
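
If you then want to feed these into FAISS as in the FB15k example above, a rough sketch would be the following (keep in mind that IndexFlatL2 holds everything it indexes in RAM, the full Wikidata set is large, and, if the vectors file follows the same layout as the TSV, the last rows are relation parameters rather than entities):

import faiss
import numpy as np

embedding = np.load("wikidata_translation_v1_vectors.npy")

index = faiss.IndexFlatL2(embedding.shape[1])  # 200 dimensions for the Wikidata model
index.add(np.ascontiguousarray(embedding, dtype=np.float32))

# Query the 4 nearest neighbors of an arbitrary vector.
_, neighbors = index.search(np.random.randn(1, embedding.shape[1]).astype("float32"), 4)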

@JasonLLu

JasonLLu commented Apr 21, 2019

Thanks for the reply! Currently, I am just trying to test with the FB15k dataset, which is much smaller. I am using the joined_embeddings.tsv file that you get from running the torchbiggraph script. Am I able to just pass in the embedding the way you described? I get an assertion error when trying to pass in:


embedding = np.loadtxt("joined_embeddings.tsv", dtype=np.float32, delimiter="\t", skiprows=1, usecols=range(1, 201), comments=None)

index.add(embedding)

Also, is there a reason why you are adding the skiprows=1 parameter? (I've tried without skiprows but I still get an AssertionError.) I believe the problem may be that the dimension of the entities in FB15k is 400 whereas the dimension of the relations is 200; is there something I can do to load the embeddings from the FB15k TSV file?

@lw
Contributor

lw commented Apr 21, 2019

Without the exact error message I can't know for sure what's going on, but my guess is that it's a dimension mismatch: the command I posted loads only the first 200 dimensions, because that's how many there are in the Wikidata model, but you probably initialized the FAISS index with 400, that is, the dimension of the FB15k model. To fix that, simply set usecols=range(1, 401) (this argument is needed because the first column, which is the entity name, must be ignored).

Also, the skiprows=1 argument is necessary because in the Wikidata TSV file the first line contains a comment with entity and relation counts. In the FB15k TSV file this shouldn't be the case so, indeed, you should remove that argument.

You raise a good point that the TSV file also contains relation parameters, which you most likely want to ignore when you look for nearest neighbors among the entities. The only way I found to do so is to figure out the total number of entities (for FB15k this should be 14951) and then pass the max_rows argument to loadtxt with that value.

Thanks for pointing out that the number of parameters is 200 for each relation: it should be 400 (this is true for the translation, diagonal and complex diagonal operators, not for the linear or affine ones) but I just realized there's a bug in export_to_tsv that causes half of them not to be exported. I will fix it.

Finally, observe that if you are training the FB15k model yourself you easily have access to the .h5 files: they are located in the model/fb15k directory (in general, in whatever directory is specified as the checkpoint_path in the config). Loading these directly, with the procedure outlined above, will allow you to sidestep all the issues with usecols, skiprows, max_rows, etc.

@JasonLLu

JasonLLu commented Apr 21, 2019

Thank you so much! Now it makes sense, since the dimension of the relations in FB15k was just mismatched. I would appreciate it if you let me know when the export_to_tsv function is updated. What approach would you recommend to separate the entities from the relations in the Wikidata TSV file? According to the documentation, are there exactly 78 million entities and 4131 relations? Also, with so much data, is there a simple way to split the data into chunks and process them into IndexFlatL2 separately so that RAM is not overloaded? Lastly, a bit off topic, but would subtracting entity vectors a and b and searching for the resulting vector among all the relation vectors yield the relation vectors closest to a,b?

@lw
Contributor

lw commented Apr 21, 2019

The fix to the export_to_tsv script is in #37.

In the Wikidata TSV file the first line (i.e., the one we're ignoring with skiprows=1) contains the number of entities and of relation types and the dimension:

78404883	4151	200

So, if you just want to extract the entities you can use

embedding = np.loadtxt("wikidata_translation_v1.tsv", dtype=np.float32, delimiter="\t", skiprows=1, max_rows=78404883, usecols=range(1, 201), comments=None)

Whereas if you just want the relation parameters you can use

rel_params = np.loadtxt("wikidata_translation_v1.tsv", dtype=np.float32, delimiter="\t", skiprows=1+78404883, max_rows=4151, usecols=range(1, 201), comments=None)

And if you want the parameters of the reverse relation, they come just after

reverse_rel_params = np.loadtxt("wikidata_translation_v1.tsv", dtype=np.float32, delimiter="\t", skiprows=1+78404883+4151, max_rows=4151, usecols=range(1, 201), comments=None)

Again, the above commands will most probably be veeery slow. You can np.load the .npy file and then just access the range of rows you're interested in: this will definitely be faster. Also, to answer your other question, with the mmap_mode argument of np.load you can have all the data stay on disk and only load into RAM the part you're interested in; take a look.
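
As a rough sketch of that approach (assuming the rows of the .npy file follow the same order as the TSV, i.e. entities first, and keeping in mind that IndexFlatL2 still holds whatever it indexes in RAM):

import faiss
import numpy as np

NUM_ENTITIES = 78404883  # from the first line of the TSV file

# Memory-map the vectors so they stay on disk until accessed.
vectors = np.load("wikidata_translation_v1_vectors.npy", mmap_mode="r")

index = faiss.IndexFlatL2(vectors.shape[1])

# Add only the entity rows, in chunks, to keep the peak extra RAM usage bounded.
chunk_size = 1_000_000
for start in range(0, NUM_ENTITIES, chunk_size):
    end = min(start + chunk_size, NUM_ENTITIES)
    index.add(np.array(vectors[start:end], dtype=np.float32))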

Finally, I don't really know how to answer your question about searching the nearest neighbor of the difference of two embeddings in the space of relation parameters to identify the most likely relation: I've never tried it.

facebook-github-bot pushed a commit that referenced this issue Apr 22, 2019
…tions (#37)

Summary:
Some operators (affine, complex_diagonal) have more than one parameter;
however, export_to_tsv was only exporting the first one. This caused, for
example, the ComplEx FB15k export to have only 200 params per relation
(the real part), with the other 200 (the imaginary part) being dropped.


See #13 (comment)

Tested using the FB15k and (a modified version of) the LiveJournal
example scripts.
Pull Request resolved: #37

Differential Revision: D15030457

Pulled By: lerks

fbshipit-source-id: 75c2a148eb2f8e946b4bc5fba1ce284bdcf1620f
@JasonLLu

JasonLLu commented Apr 22, 2019

How many relations are in the FB15k model? You mentioned that there are 14951 entities, but what about relations? At what indices would I have to slice to separate entities and relations? And I would not need to skip the first row, right?

@lw
Contributor

lw commented Apr 22, 2019

If you have run the examples/fb15k.py script, you will have the count files in the data/FB15k directory. This is what I have:

$ cat data/FB15k/entity_count_all_0.txt
14951
$ cat data/FB15k/dynamic_rel_count.txt
1345

And, yes, export_to_tsv does not put a comment with the counts on the first line of the output TSV, thus you do not have to skip the first line.
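
So, as a rough sketch, pulling just the entity rows out of the FB15k TSV (assuming the same ordering as the Wikidata file, i.e. entities first, followed by the relation parameters) would look like:

import numpy as np

NUM_ENTITIES = 14951  # from data/FB15k/entity_count_all_0.txt

# The entity embeddings are the first 14951 rows (column 0 is the entity name,
# and the FB15k embeddings are 400-dimensional); the relation parameters follow.
entities = np.loadtxt("joined_embeddings.tsv", dtype=np.float32, delimiter="\t",
                      max_rows=NUM_ENTITIES, usecols=range(1, 401), comments=None)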

@JasonLLu

Thank you so much! In the code you wrote above in the comments:

In [6]: with h5py.File("model/fb15k/embeddings_all_0.v50.h5", "r") as hf:
   ...:     index.add(hf["embeddings"][...])
   ...:

In [7]: _, neighbors = index.search(np.random.randn(1, 400).astype('float32'), 4)

In [8]: with open("data/FB15k/dictionary.json", "rt") as f:
   ...:     dictionary = json.load(f)
   ...:

You used a dictionary.json file that mapped the resulting vectors to the entity IDs. Is there a dictionary like this for Wikidata?

@lw
Contributor

lw commented Apr 22, 2019

It exists, although not exactly in the same format: https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1_names.json.gz contains a JSON-encoded list of strings, which are the names of the entities plus, at the end, the names of the relations and the names of the reverse relations (just like with the TSV file as discussed above).
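
For example, a row index returned by the FAISS searches above can be mapped back to its Wikidata identifier roughly like this (reading the gzip directly, without decompressing it first):

import gzip
import json

# Flat JSON list: entity names first, then relation and reverse-relation names,
# in the same row order as the TSV / .npy files.
with gzip.open("wikidata_translation_v1_names.json.gz", "rt") as f:
    names = json.load(f)

row_index = 0  # e.g. an index returned by one of the index.search() calls above
print(names[row_index])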
