
Add Python API #82

Open
zheng-da opened this issue Apr 23, 2020 · 15 comments
Labels
enhancement New feature or request

Comments

@zheng-da
Contributor

The Python API is convenient for many use cases. It allows more customization and is very friendly for Jupyter Notebook users.

@AlexMRuch

AlexMRuch commented May 3, 2020

One thing that would be especially helpful for a Python API would be a model class that, once trained, can do entity and edge prediction (e.g., https://graphvite.io/docs/latest/api/application.html#graphvite.application.KnowledgeGraphApplication). For example, if I have a list of entity nodes and relational edges, I may want to know either 1) what are the most likely (or top-k) destination nodes for a set of source nodes, or 2) what is the probability that a certain type of edge exists between a source and a destination node. Right now, I plan to borrow code for evaluating pre-trained knowledge graph embeddings (https://aws-dglke.readthedocs.io/en/latest/hyper_param.html#evaluation-on-pre-trained-embeddings --> https://github.com/awslabs/dgl-ke/blob/master/python/dglke/eval.py) to try to do this on my own; however, it seems like this would be helpful for downstream tasks for many users. Please let me know if this is something you think would be useful; if I develop such a script, I can share it with you or develop it in a way that works within the dgl-ke library and can be imported. Thanks for your consideration.
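For concreteness, here is a minimal sketch of use case 2, assuming TransE-style embeddings saved as entity.npy and relation.npy (the file names, the gamma margin, and the sigmoid calibration are my assumptions, not dgl-ke's actual API):

# Minimal sketch (my assumptions, not dgl-ke's API): score how plausible a
# (head, relation, tail) edge is, given pre-trained TransE embeddings.
import numpy as np

entity_emb = np.load('entity.npy')      # shape: (num_entities, dim)
relation_emb = np.load('relation.npy')  # shape: (num_relations, dim)

def edge_score(head_id, rel_id, tail_id, gamma=12.0):
    # TransE score: gamma - ||h + r - t||; squash into (0, 1) with a sigmoid
    h, r, t = entity_emb[head_id], relation_emb[rel_id], entity_emb[tail_id]
    score = gamma - np.linalg.norm(h + r - t)
    return 1.0 / (1.0 + np.exp(-score))  # probability-like confidence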

@zheng-da
Contributor Author

zheng-da commented May 3, 2020

This is one of our motivations for creating a Python API. It'll be great if you could contribute to this. Could you share such a script with us once you have it? I think we should definitely work together on this.

We'll share our previous design of the Python API. It'll be great if you can give us feedback.

@AlexMRuch

Happily! Thanks so much for your interest! Should we open another issue for the entity/link prediction ticket? Would you like me to fork dgl-ke, or do you want this to be in a feature branch? Also, if you have any suggestions for how we should go about developing this to make it most functional for the library, please let me know. Thanks in advance!

@zheng-da
Contributor Author

zheng-da commented May 3, 2020

My understanding is that you'd like to contribute a model class that evaluates pre-trained embeddings on various tasks: entity classification and link prediction. Is this right?

We can create another ticket to have more focused discussions. As for development, I think you can fork the repo and make a PR for us. Before that, can we start with a discussion of the API definition? We'd like the API to be stable, so it would be great to finalize the API design before we move to actual code.

@AlexMRuch

Yes, that is correct. It would be wonderful to have a model class that can ingest pre-trained embeddings and then perform entity classification and link prediction similar to what graphvite applications do.

Thanks for opening another ticket. I'll likely have questions throughout the process, so that'll help keep this issue ticket cleaner in the event others have ideas or wish to contribute to the Python API. I will fork the repo and can begin work after you all have discussed the API and settled on a stable definition, as you requested. Thanks for the guidance!

@zheng-da
Contributor Author

zheng-da commented May 3, 2020

The Python API is mainly defined for users to invoke KGE training in the Notebook environment. It doesn’t support distributed training.

Load Data

# Load builtin datasets
kg = dglke.dataset.FB15k()
# Load users' own data (raw or pre-formatted data)
kg = dglke.dataset.load(train=load_rdf('/path/to/train/file'),
                        valid=load_rdf('/path/to/valid/file'),
                        test=None,
                        format='htr')

Model load and creation

When a model is created, it has to be associated with a knowledge graph. Since KGE models are transductive, a model is only valid on the knowledge graph it was trained on.

model = dglke.TransE(dim=400)
model.attach_data(kg)

Model training

When training the model, we train only on the knowledge graph associated with the model and save the model afterwards. When the model is saved to disk, we save only the model embeddings and configurations.

# When training a model, we need to provide the training data and
# specify all hyperparameters.
model.fit(num_epochs=10,
          gpus=[0, 1, 2, 3], batch_size=1000,
          neg_sample_size=400, lr=0.1,
          warm_start=False)
model.save('/path/to/save/model')

Restart model training from a checkpoint

Training knowledge graph embeddings may take a long time. It’s likely that people want to save KGE models periodically and restart the training. We should allow KGE training from a checkpoint.

model = dglke.TransE(dim=400)
model.load('/path/to/trained/model')
model.attach_data(kg)
model.fit() # This will lead to an error if there is no kg

Model evaluation

model.eval(kg.test, filter_edges=kg.train, neg_size=1000,
           neg_sample_strategy='...')

triplets = load_rdf('..', format='htr')
model.link_prediction(triplets)
model.entity_embed          # get the entity embeddings
model.relation_embed        # get the relation embeddings

@zheng-da
Contributor Author

zheng-da commented May 3, 2020

I shared the API we defined a few months ago, but we haven't had time to implement it. I would like to share it with the community and ask for feedback.

@AlexMRuch as a user, do you find this kind of API intuitive? As for the evaluation API, is this what you have in mind? Feel free to propose your ideas and give us feedback on the other APIs. Thanks.

@zheng-da
Contributor Author

zheng-da commented May 3, 2020

@AlexMRuch please feel free to open another ticket to discuss the evaluation API.

@AlexMRuch

AlexMRuch commented May 3, 2020

Wonderful. Thanks! Given the information you posted above, perhaps we can just continue the API setup and evaluation discussion here, as it seems like this will involve creating the objects you mentioned above.

The API seems pretty clear to me and is very similar to what I had in mind; however, a few things are unclear.

  1. model.save('/path/to/save/model') <-- What does this save if the entity.npy, relation.npy, and config.json files for the pre-trained model already exist? It seems like this should only be invoked if the model class is going to be trained and save the *.npy and config.json files; and if that's the case, shouldn't the fit method come between model.attach_data(kg) and model.save('/path/to/save/model')?
  2. The specifics of running a warm-up for the model should be clarified. For example, for how many steps does the model run during warm-up? Also, this may be a good place to add a search function to find the best lr value (e.g., https://docs.fast.ai/callbacks.lr_finder.html).
  3. model.predict() should accept a matrix of canonical tuples as well as individual canonical tuples – correct? Also, does predict here refer to link prediction? If so, maybe it should be renamed link_prediction so there can also be an entity_classification method. Alternatively, the method could take a second argument ("link", "source", or "destination") and expect triples in hrt format (or make the format a third argument that defaults to hrt); see the sketch just below this list.
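To illustrate point 3, here is a hypothetical sketch of the unified signature I'm picturing (the argument names and the model.score / model.topk_* helpers are made up for illustration, not dgl-ke code):

# Hypothetical sketch of a unified predict(); all helper names are made up.
import numpy as np

def predict(model, triples, mode='link', fmt='hrt'):
    triples = np.atleast_2d(np.asarray(triples))  # one tuple or an (N, 3) matrix
    if fmt == 'htr':                              # normalize to (h, r, t) order
        triples = triples[:, [0, 2, 1]]
    if mode == 'link':                            # score each triple
        return model.score(triples)
    if mode == 'destination':                     # rank tails for each (h, r)
        return model.topk_tails(triples[:, 0], triples[:, 1])
    if mode == 'source':                          # rank heads for each (r, t)
        return model.topk_heads(triples[:, 1], triples[:, 2])
    raise ValueError(f'unknown mode: {mode}')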

Is there any interest in adding visualization (e.g., reduction to 2D or 3D with UMAP)? That could be another method; however, that would also add another installation requirement for users and may be hard on memory (vs. running it in a new kernel separately on the *.npy files).

Hope those suggestions help and that you get other useful feedback from the community!

Please let me know when you've heard back from others and when you'd like me to try and contribute some code to this effort. Thanks!

@zheng-da
Contributor Author

zheng-da commented May 4, 2020

  1. The Python API will require users to call model.save() to save models explicitly. I think that's what the confusion was. I have moved save() after fit(). Hopefully it's clearer now.

  2. Here the warm-up is a little different from the warm-up strategy used in model training, although you can use it that way. Here we just want to give users an option to continue training a model from a previously saved checkpoint. I have changed it in the API definition.

  3. Thanks for your suggestion. I have updated the API and called it link prediction. However, how do we do entity classification? Should we train a classification model on top of the embeddings first?

Yes, visualization is definitely desired. Do you have any suggestions for good visualization tools for large graphs?

@AlexMRuch

  1. The Python API will require users to call model.save() to save models explicitly. I think that's what the confusion was. I have moved save() after fit(). Hopefully it's clearer now.

Yes, much clearer! Thank you. I agree that calling model.save() explicitly is what people should do.

  2. Here the warm-up is a little different from the warm-up strategy used in model training, although you can use it that way. Here we just want to give users an option to continue training a model from a previously saved checkpoint. I have changed it in the API definition.

Ah, yes, that's a great idea – restarting training from a checkpoint. This will be very useful for how I plan to use dgl-ke in my work.

If we wanted to add something like lr_finder to the model later, that's an option, but definitely seems like less of a priority compared to the other things that need to be done first.

  3. Thanks for your suggestion. I have updated the API and called it link prediction. However, how do we do entity classification? Should we train a classification model on top of the embeddings first?

What I mean by link prediction is where you have a source entity and a destination entity and you want to predict whether a particular kind of relation edge exists between them (i.e., are two Twitter users connected by a Retweet edge?).

What I mean by entity prediction is where you have a source entity and a relation edge and you want to predict the most likely destination entity for that source-relation pair (i.e., who is a given Twitter user most likely to retweet, or what is the list of top-k most likely destination entities?). So this is not really a "classification" problem – my mistake. It should be called entity_prediction and would probably just be a KNN problem given the source entity and relational edge.

I hope that makes sense and that you agree with these ideas. I believe the idea of using KNN for entity prediction is why graphvite required installing the faiss library.
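For example, here is a rough sketch of entity prediction as a KNN search with faiss, assuming TransE-style embeddings in entity.npy and relation.npy (file names and ids are placeholders, not dgl-ke code):

# Rough sketch (placeholders, not dgl-ke code): entity prediction as KNN
# over entity embeddings, using the TransE translation h + r as the query.
import faiss
import numpy as np

entity_emb = np.load('entity.npy').astype('float32')    # (num_entities, dim)
relation_emb = np.load('relation.npy').astype('float32')

index = faiss.IndexFlatL2(entity_emb.shape[1])  # exact L2 nearest neighbors
index.add(entity_emb)

def predict_destinations(head_id, rel_id, k=10):
    query = (entity_emb[head_id] + relation_emb[rel_id]).reshape(1, -1)
    distances, ids = index.search(query, k)     # k nearest entity embeddings
    return ids[0]                               # top-k destination entity ids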

Yes, visualization is definitely desired. Do you have any suggestions for good visualization tools for large graphs?

In my work I usually do UMAP to knock the embeddings dimensions down to 2 or 3 dimensions and then just use matplotlib or seaborn to visualize the results. I've done this over 5 million nodes embedded with metapath2vec and it worked fine (see Figures 5 and 10 in https://arxiv.org/pdf/2001.01126.pdf).

I haven't visualized knowledge graphs yet, however, so I don't know what extra complexity will exist for their embeddings compared to metapath2vec (e.g., do we need to do anything in particular to account for the relation embeddings if we want to jointly visualize entity and relation embeddings, or should we just map both sets of embeddings separately?).

For what it's worth, I also plan on using HDBSCAN to cluster the entity embeddings in my work (I'll probably reduce the dimensions to < 64 with UMAP first, though, to improve the compute time of the HDBSCAN algorithm). I did that on the same network I'm running the KGE model on now (with 10M entities and 100M edges) and it returned very nice results in addition to nice UMAP 2D and 3D visualizations: https://www.graphika.com/posts/deep-learning-at-graphika-scaling-network-maps-with-heterogeneous-graph-embedding/.
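In case it's useful, here is the rough pipeline I use, assuming the entity embeddings are in entity.npy (umap-learn, hdbscan, and matplotlib are extra dependencies; the parameter values are just starting points):

# Rough sketch of my usual pipeline: UMAP for dimensionality reduction,
# HDBSCAN for clustering, matplotlib for the 2D plot.
import numpy as np
import umap
import hdbscan
import matplotlib.pyplot as plt

emb = np.load('entity.npy')                             # (num_entities, dim)
emb64 = umap.UMAP(n_components=64).fit_transform(emb)   # speeds up HDBSCAN
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(emb64)
emb2d = umap.UMAP(n_components=2).fit_transform(emb)    # 2D layout for plotting

plt.scatter(emb2d[:, 0], emb2d[:, 1], c=labels, s=1, cmap='Spectral')
plt.title('Entity embeddings (UMAP), colored by HDBSCAN cluster')
plt.show()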

@zheng-da
Contributor Author

zheng-da commented May 4, 2020

lr_finder seems useful. I think we'll include it in a future release. I was reading about how to pick the right learning rate (for a very different purpose) and came across this function. I'm wondering how effective it has been in your experience for KGE training?

Thanks for your suggestions on visualization. Your visualization looks very cool. The team will investigate visualization tools and try out the ones you suggested. I think we'll need your help.

@zheng-da zheng-da added the enhancement New feature or request label May 4, 2020
@AlexMRuch

lr_finder seems useful. I think we'll include it in a future release. I was reading about how to pick the right learning rate (for a very different purpose) and came across this function. I'm wondering how effective it has been in your experience for KGE training?

Sounds great! I haven't used it for KGE before but have used it for other tasks (e.g., multi-label NLP classification). I presume it should port over relatively easily to KG tasks, given that you can define the problem the same way: which learning rate best minimizes the loss.
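For reference, the underlying idea is just a short range test, something like this generic PyTorch sketch (model, loss_fn, and the batch iterator are placeholders; the lr bounds are conventional defaults, not tuned for KGE):

# Generic LR range test sketch (placeholders, not dgl-ke code): sweep the
# learning rate exponentially and record the loss; a good lr usually sits
# just before the loss starts to blow up.
import torch

def lr_range_test(model, loss_fn, batches, lr_min=1e-7, lr_max=1.0, steps=200):
    opt = torch.optim.SGD(model.parameters(), lr=lr_min)
    mult = (lr_max / lr_min) ** (1.0 / steps)
    history = []
    for _, batch in zip(range(steps), batches):
        opt.zero_grad()
        loss = loss_fn(model, batch)
        loss.backward()
        opt.step()
        history.append((opt.param_groups[0]['lr'], loss.item()))
        opt.param_groups[0]['lr'] *= mult        # exponential lr schedule
    return history                               # plot lr vs. loss to pick lr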

Thanks for your suggestions on visualization. Your visualization looks very cool. The team will investigate visualization tools and try out the ones you suggested. I think we'll need your help.

You're very welcome! Very happy to help where I can. Please don't hesitate to reach out!!

@AlexMRuch

AlexMRuch commented Jul 15, 2021 via email

@ChrisDelClea

Hey guys,

I couldn't figure out whether the API has been released yet; I assume not. I really like how you defined the API above.
I would also suggest that, for training on a user's own knowledge graph, the input could be an RDF graph directly, via RDFLib.
Would that be possible, and when can one expect the API to be released?

Best regards
Chris

