-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Model Wrapper for GCNPredictor from DGL-LifeSci #2249
Conversation
This is looking very nice! It would be good for the docstring for GCNModel to explain how this model is different from GraphConvModel, since they do very similar things. It also would be good if, as much as possible, the names and order of the constructor arguments match GraphConvModel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! +1 for @peastman'scomments. Have one additional minor comment about docs as well
Basically, I think DGLXXXX or XXXXWithDGL (ex: DGLGCNModel or GCNModelWithDGL) is a better name. It clears which framework the model depends on. The situation that DeepChem has both DGL and PyG implementatins is not good because it makes users confuse, like which model is better or correct. And then, the performance of GATModel based on the dgllife-sci model is better compared with the current GATModel based on PyG. If you want to add GATModel based on dgllife-sci model, I think it is better to replace the current GATModel to your GATModel based on dgllife-sci model. My motivation of creating the current GATModel is just to provide the sample model by using PyG, so there is no need that the model is GAT. Currently, my CGCNNModel based on DGL is not working, so I think it is better to keep providing the sample model based on PyG by rewriting CGCNN with PyG. |
Expanding on Daiki's comments, our general approach is that DeepChem provides the user a choice of models, not a choice of implementations. What framework a model is implemented with should be an internal detail that users mostly don't need to think about. The user will say, "I want a GAT model," so they'll use the GATModel class. They may not even know or care whether it's implemented with PyG, DGL, TensorFlow, or whatever. |
I've added an explanation of their differences in |
Make sense. I can work on
Got it, make sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed! Thank you for your work!
I think this PR will be ready to merge after you add GCNModel test and rewrite the examples of docstrings.
The current examples are based on crystal data, so you need to convert them to the examples based on molecule data. I think the GATModel docstrings are really helpful. (GATModel test is also helpful)
GATModel : https://github.com/deepchem/deepchem/blob/master/deepchem/models/torch_models/gat.py
deepchem/models/torch_models/gcn.py
Outdated
>>> from deepchem.models import GCN | ||
>>> lattice = mg.Lattice.cubic(4.2) | ||
>>> structure = mg.Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]) | ||
>>> featurizer = dc.feat.CGCNNFeaturizer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MolGraphConvFeaturizer is better compared with CGCNNFeaturizer.
The CGCNNFeaturizer has some possibilities to have some bug #2160 and basically DeepChem users are more interested in molecules than inorganic crystals
Sample implementation is here
https://colab.research.google.com/drive/1NWPXbnMMe8vsqouE60hBBF8ouXDIxgMM?usp=sharing
MolGraphConvFeaturizer
https://github.com/deepchem/deepchem/blob/master/deepchem/feat/molecule_featurizers/mol_graph_conv_featurizer.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful to have both featurizations as examples. I think DeepChem is starting to pick up more users from the materials science communities so both examples would be great :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agreed that it is better to support both crystal graph (CGCNNFeaturizer) and molecule graph (MolGraphConvFeaturizer) ideally.
But, as I mentioned in #2249 (comment), we should take care of some points for supporting them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've replaced CGCNNFeaturizer
with MolGraphConvFeaturizer
in the examples.
deepchem/models/torch_models/gcn.py
Outdated
raise ImportError('This class requires dgl.') | ||
|
||
inputs, labels, weights = batch | ||
dgl_graphs = [graph.to_dgl_graph() for graph in inputs[0]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using MolGraphConvFeaturizer
, we need to add self-loop connection explicitly.
The graph data built by MolGraphConvFeaturizer
doesn't have the self-loop connection. This design is inspired by PyG style. The self-loop connection is added in the forward function of GCN class.
Ref : https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html#implementing-the-gcn-layer
dgl.add_self_loop(graph.to_dgl_graph())
And then, considering both CGCNNFeaturizer and MolGraphConvFeaturizer, it is needed that the GCNModel receive the self_loop
flag and add the self-loop connection when self_loop
== True. The CGCNNFeaturizer often return the graph have the multi self-loop connection. This is the character of the crystal graph. However, i think the self_loop
flag is a little difficult for GNN beginners which are most of DeepChem users. In addtion to this, the CGCNNFeaturizer may have some bugs. (#2160) So, basically, I think the GCNModel should focus on supporting MolGraphConvFeaturizer
.
Ideally, DGL graphs have the API which judge whether the graph has the self-loop connection or not. Do you know whether DGL graph have such a API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I've added a
self_loop
flag toGCNModel
as well asto_dgl_graph
. When one instantiates aGCNModel
instance withself_loop = True
, the object will add self loops to DGLGraphs in_prepare_batch
. - Why do you want an API that checks whether the graph already has self-loops? Do you simply want to ensure that each node has a single self loop? If so, we can first call
g = dgl.remove_self_loop(g)
and then callg = dgl.add_self_loop(g)
. To check whether a graph has self loops for all nodes, you can use has_edges_between.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a self_loop flag to GCNModel as well as to_dgl_graph. When one instantiates a GCNModel instance with self_loop = True, the object will add self loops to DGLGraphs in _prepare_batch.
The modification looks good to me!
Why do you want an API that checks whether the graph already has self-loops? Do you simply want to ensure that each node has a single self loop? If so, we can first call g = dgl.remove_self_loop(g) and then call g = dgl.add_self_loop(g). To check whether a graph has self loops for all nodes, you can use has_edges_between.
The reason that I want an API that checks whether the graph already has self-loops is that it is better to replace the self_loop
flag to the internal logic in _prepare_batch
. It is a little difficult for many new comers to judge whether the model (like GCN or GAT) requires the self-loop or not.
So my image is like thtat
def check_self_loop(dgl_graph):
if ....
return True // more than one self-loop
else:
return False // no self-loop
class GCNModel
.....
def _prepare_batch(self, batch);
inputs, labels, weights = batch
dgl_graphs = [graph.to_dgl_graph() for graph in inputs[0]]
dgl_graphs = [dgl.add_self_loop(graph) if check_self_loop(graph) else graph for graph in dgl_graphs]
inputs = dgl.batch(dgl_graphs).to(self.device)
_, labels, weights = super(GCNModel, self)._prepare_batch(([], labels, weights))
return inputs, labels, weights
This style doesn't require that users knows whether the model needs a self loop or not.
docs/models.rst
Outdated
@@ -126,6 +126,9 @@ read off what's needed to train the model from the table below. | |||
| :code:`GATModel` | Classifier/| :code:`GraphData` | | :code:`MolGraphConvFeaturizer` | :code:`fit` | | |||
| | Regressor | | | | | | |||
+----------------------------------------+------------+----------------------+------------------------+----------------------------------------------------------------+----------------------+ | |||
| :code:`GCNModel` | Classifier/| :code:`GraphData` | | :code:`CGCNNFeaturizer` | :code:`fit` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, the GCNModel should focus on MolGraphConvFeaturizer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 that MolGraphConvFeaturizer
would probably be the more standard featurization here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Considering about peatsman comments (#2249 (comment)) and unifying the graph data #1942, I seem it is better to swap the current legacy GraphConvModel, MPNNModel and WeaveModel to the models based on DGL-LifeSci.
The current WeaveModel have some known issues, like deepchem/deepchem/models/graph_models.py Lines 66 to 71 in 7e745b9
And, how the legacy models handle graph data is really complex, so it is hard to maintain them. |
Just catching up on the discussions here. I'm generally in favor of migrating the backends to DGL-lifesci, but we should get MoleculeNet benchmark numbers up so we know that the new implementation is at par or better than the legacy implementation. This might be a good thing to do over a few PRs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of minor comments added but nothing big
deepchem/models/torch_models/gcn.py
Outdated
>>> from deepchem.models import GCN | ||
>>> lattice = mg.Lattice.cubic(4.2) | ||
>>> structure = mg.Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]) | ||
>>> featurizer = dc.feat.CGCNNFeaturizer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful to have both featurizations as examples. I think DeepChem is starting to pick up more users from the materials science communities so both examples would be great :)
docs/models.rst
Outdated
@@ -126,6 +126,9 @@ read off what's needed to train the model from the table below. | |||
| :code:`GATModel` | Classifier/| :code:`GraphData` | | :code:`MolGraphConvFeaturizer` | :code:`fit` | | |||
| | Regressor | | | | | | |||
+----------------------------------------+------------+----------------------+------------------------+----------------------------------------------------------------+----------------------+ | |||
| :code:`GCNModel` | Classifier/| :code:`GraphData` | | :code:`CGCNNFeaturizer` | :code:`fit` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 that MolGraphConvFeaturizer
would probably be the more standard featurization here
It seems that @nd-02110114 has some concerns for that?
That has been updated. @nd-02110114 How does DeepChem handle tests for model classes? Where should I add a test for GCNModel? |
I think https://github.com/deepchem/deepchem/blob/master/deepchem/models/tests/test_gat.py is helpful for you. |
I've added a test file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding the test! I added some comments.
I think this PR will be ready to merge after fixing the errors in CI.
About coding style or type annotation, you could check like this.
$ pip install yapf==0.22.0
$ yapf -i -r file_or_directory_you_modified
$ mypy -p deepchem
And then, your tests doesn't pass in CI. (The current CI errors includes some errors which is not related to this PR) You should check the tests which is related to the codes you modified like this.
$ pytest deepchem/feat/tests/test_graph_data.py
$ pytest deepchem/models/tests/test_gcn.py
And, recently I fixed the CI failure in master branch,
so you merge master branch updates to this branch.
After that, all CI errors depends on your modifications
deepchem/feat/graph_data.py
Outdated
@@ -123,29 +123,36 @@ def to_pyg_graph(self): | |||
edge_attr=edge_features, | |||
pos=node_pos_features) | |||
|
|||
def to_dgl_graph(self): | |||
def to_dgl_graph(self, self_loop=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add type annotation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, can you set the default value as False?
This modification leads to many test errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
deepchem/models/torch_models/gcn.py
Outdated
|
||
>>> import deepchem as dc | ||
>>> from deepchem.models import GCNModel | ||
>>> featurizer = dc.feat.MolGraphConvFeaturizer() | ||
>>> tasks, datasets, transformers = dc.molnet.load_tox21( | ||
... reload=False, featurizer=featurizer, transformers=[]) | ||
>>> train, valid, test = datasets | ||
>>> model = dc.models.GCNModel(mode='classification', n_tasks=len(tasks), | ||
... number_atom_features=30, batch_size=32, learning_rate=0.001) | ||
>>> model.fit(train, nb_epoch=50) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change the example like this?
We tests all examples written in docstring. But load_tox21
is slow, so we should skip this example.
>>>
>> import deepchem as dc
>> from deepchem.models import GCNModel
>> featurizer = dc.feat.MolGraphConvFeaturizer()
>> tasks, datasets, transformers = dc.molnet.load_tox21(
.. reload=False, featurizer=featurizer, transformers=[])
>> train, valid, test = datasets
>> model = dc.models.GCNModel(mode='classification', n_tasks=len(tasks),
.. number_atom_features=30, batch_size=32, learning_rate=0.001)
>> model.fit(train, nb_epoch=50)
This style renders the example correctly in docs and skips the doctest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@nd-02110114 I should have fixed all the issues and the code has passed both the style check and CI locally. |
Thanks! But, the CI still failed by the style check. You need to run these commands.
|
The updates looks good to me!
Please add DGL-LifeSci dependencies in After adding these, I think this PR is ready to merge 🎉 |
Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a pass and I think this is good to merge as well! I believe @nd-02110114's review is done as well so I'm going to mark this as approved. Feel free to merge when ready!
@rbharath @nd-02110114 Awesome! I don't have write access to merge it and you may go ahead to merge it. Thank you! |
@mufeili Thanks for the contribution! This will be very valuable to our users :). Looking forward to your next contributions! |
@rbharath This PR serves as a proof of concept for wrappers of models from DGL-LifeSci. I basically follow the design by @nd-02110114. The PR assumes the latest version of DGL (0.5.2) and DGL-LifeSci (0.2.5).
Previously the GAT model was implemented in PyG and GCN in this PR is implemented in DGL. What if there are both DGL and PyG implementations for a model? How should we arrange the name space?