
Model Wrapper for GCNPredictor from DGL-LifeSci #2249

Merged
merged 21 commits from mufeili:master into deepchem:master on Nov 3, 2020

Conversation

@mufeili (Contributor) commented Oct 26, 2020

@rbharath This PR serves as a proof of concept for wrappers of models from DGL-LifeSci. I basically followed the design by @nd-02110114. The PR assumes the latest versions of DGL (0.5.2) and DGL-LifeSci (0.2.5).

Previously the GAT model was implemented in PyG, and the GCN in this PR is implemented in DGL. What if there are both DGL and PyG implementations for a model? How should we arrange the namespace?

@peastman (Contributor)

This is looking very nice!

It would be good for the docstring for GCNModel to explain how this model is different from GraphConvModel, since they do very similar things. It also would be good if, as much as possible, the names and order of the constructor arguments match GraphConvModel.

@rbharath (Member) left a comment

Looking good! +1 for @peastman's comments. I have one additional minor comment about the docs as well.

deepchem/models/torch_models/gcn.py (review thread resolved)

@nissy-dev (Member) commented Oct 27, 2020

Previously the GAT model was implemented in PyG, and the GCN in this PR is implemented in DGL. What if there are both DGL and PyG implementations for a model? How should we arrange the namespace?

Basically, I think DGLXXXX or XXXXWithDGL (ex: DGLGCNModel or GCNModelWithDGL) is a better name. It makes clear which framework the model depends on.

Having both DGL and PyG implementations in DeepChem is not good because it confuses users about which model is better or correct. Also, the performance of a GATModel based on the DGL-LifeSci model is better than that of the current GATModel based on PyG. If you want to add a GATModel based on the DGL-LifeSci model, I think it is better to replace the current GATModel with it.

My motivation for creating the current GATModel was just to provide a sample model using PyG, so the model doesn't need to be GAT. Currently, my DGL-based CGCNNModel is not working, so I think it is better to keep providing a PyG-based sample model by rewriting CGCNN with PyG.

@peastman (Contributor)

Expanding on Daiki's comments, our general approach is that DeepChem provides the user a choice of models, not a choice of implementations. What framework a model is implemented with should be an internal detail that users mostly don't need to think about. The user will say, "I want a GAT model," so they'll use the GATModel class. They may not even know or care whether it's implemented with PyG, DGL, TensorFlow, or whatever.

@mufeili (Contributor, Author) commented Oct 28, 2020

This is looking very nice!

It would be good for the docstring for GCNModel to explain how this model is different from GraphConvModel, since they do very similar things. It also would be good if, as much as possible, the names and order of the constructor arguments match GraphConvModel.

I've added an explanation of their differences in Notes and made some adjustments to the order of the arguments. Meanwhile, there are still many differences between these two models, so it might be difficult to have exactly the same arguments.

@mufeili (Contributor, Author) commented Oct 28, 2020

Previously the GAT model was implemented in PyG, and the GCN in this PR is implemented in DGL. What if there are both DGL and PyG implementations for a model? How should we arrange the namespace?

Basically, I think DGLXXXX or XXXXWithDGL (ex: DGLGCNModel or GCNModelWithDGL) is a better name. It makes clear which framework the model depends on.

Having both DGL and PyG implementations in DeepChem is not good because it confuses users about which model is better or correct. Also, the performance of a GATModel based on the DGL-LifeSci model is better than that of the current GATModel based on PyG. If you want to add a GATModel based on the DGL-LifeSci model, I think it is better to replace the current GATModel with it.

My motivation for creating the current GATModel was just to provide a sample model using PyG, so the model doesn't need to be GAT. Currently, my DGL-based CGCNNModel is not working, so I think it is better to keep providing a PyG-based sample model by rewriting CGCNN with PyG.

Makes sense. I can work on GAT in the next PR.

Expanding on Daiki's comments, our general approach is that DeepChem provides the user a choice of models, not a choice of implementations. What framework a model is implemented with should be an internal detail that users mostly don't need to think about. The user will say, "I want a GAT model," so they'll use the GATModel class. They may not even know or care whether it's implemented with PyG, DGL, TensorFlow, or whatever.

Got it, makes sense.

@nissy-dev (Member) left a comment

I reviewed! Thank you for your work!

I think this PR will be ready to merge after you add a GCNModel test and rewrite the docstring examples.

The current examples are based on crystal data, so you need to convert them to examples based on molecule data. I think the GATModel docstrings are really helpful. (The GATModel test is also helpful.)

GATModel : https://github.com/deepchem/deepchem/blob/master/deepchem/models/torch_models/gat.py

>>> from deepchem.models import GCN
>>> lattice = mg.Lattice.cubic(4.2)
>>> structure = mg.Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
>>> featurizer = dc.feat.CGCNNFeaturizer()

@nissy-dev (Member) commented Oct 28, 2020

MolGraphConvFeaturizer is better than CGCNNFeaturizer here.
CGCNNFeaturizer possibly has some bugs (#2160), and DeepChem users are generally more interested in molecules than in inorganic crystals.

Sample implementation is here
https://colab.research.google.com/drive/1NWPXbnMMe8vsqouE60hBBF8ouXDIxgMM?usp=sharing

MolGraphConvFeaturizer
https://github.com/deepchem/deepchem/blob/master/deepchem/feat/molecule_featurizers/mol_graph_conv_featurizer.py
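
For reference, a minimal sketch of featurizing a molecule with MolGraphConvFeaturizer and converting it to a DGL graph; the SMILES string is only an illustrative choice, not code from this PR:

import deepchem as dc

# Featurize a single molecule into a GraphData object
featurizer = dc.feat.MolGraphConvFeaturizer()
graphs = featurizer.featurize(["CCO"])  # ethanol, chosen only for illustration

# Convert to a DGL graph (self-loop handling is discussed later in this thread)
g = graphs[0].to_dgl_graph()
print(g.num_nodes(), g.num_edges())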

Member

It might be useful to have both featurizations as examples. I think DeepChem is starting to pick up more users from the materials science communities so both examples would be great :)

Member

I agree that ideally it would be better to support both crystal graphs (CGCNNFeaturizer) and molecule graphs (MolGraphConvFeaturizer).

But, as I mentioned in #2249 (comment), we should take care of some points before supporting both.

Contributor Author

I've replaced CGCNNFeaturizer with MolGraphConvFeaturizer in the examples.

raise ImportError('This class requires dgl.')

inputs, labels, weights = batch
dgl_graphs = [graph.to_dgl_graph() for graph in inputs[0]]

@nissy-dev (Member) commented Oct 29, 2020

When using MolGraphConvFeaturizer, we need to add the self-loop connections explicitly.
The graph data built by MolGraphConvFeaturizer doesn't have self-loop connections. This design is inspired by the PyG style: the self-loop connections are added in the forward function of the GCN class.
Ref: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html#implementing-the-gcn-layer

dgl.add_self_loop(graph.to_dgl_graph())

Also, to support both CGCNNFeaturizer and MolGraphConvFeaturizer, GCNModel would need to receive a self_loop flag and add the self-loop connections when self_loop == True. CGCNNFeaturizer often returns graphs with multiple self-loop connections; this is a characteristic of crystal graphs. However, I think the self_loop flag is a little difficult for GNN beginners, who are most of DeepChem's users. In addition, CGCNNFeaturizer may have some bugs (#2160). So, basically, I think GCNModel should focus on supporting MolGraphConvFeaturizer.

Ideally, DGL graphs would have an API that judges whether a graph already has self-loop connections. Do you know whether DGLGraph has such an API?

@mufeili (Contributor, Author) commented Nov 1, 2020

  1. I've added a self_loop flag to GCNModel as well as to_dgl_graph. When one instantiates a GCNModel instance with self_loop = True, the object will add self loops to DGLGraphs in _prepare_batch.
  2. Why do you want an API that checks whether the graph already has self-loops? Do you simply want to ensure that each node has a single self loop? If so, we can first call g = dgl.remove_self_loop(g) and then call g = dgl.add_self_loop(g). To check whether a graph has self loops for all nodes, you can use has_edges_between.
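
For illustration, a minimal sketch of this normalize-then-check pattern on a toy DGLGraph; the graph construction is only an example, not code from this PR:

import dgl
import torch

# Toy directed graph used only for illustration: 3 nodes, edges 0->1 and 1->2
g = dgl.graph((torch.tensor([0, 1]), torch.tensor([1, 2])), num_nodes=3)

# Ensure each node has exactly one self-loop:
# drop any existing self-loops first, then add one per node
g = dgl.remove_self_loop(g)
g = dgl.add_self_loop(g)

# Check that every node now has a self-loop edge
nodes = g.nodes()
print(g.has_edges_between(nodes, nodes).all())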

Member

I've added a self_loop flag to GCNModel as well as to_dgl_graph. When one instantiates a GCNModel instance with self_loop = True, the object will add self loops to DGLGraphs in _prepare_batch.

The modification looks good to me!

Why do you want an API that checks whether the graph already has self-loops? Do you simply want to ensure that each node has a single self loop? If so, we can first call g = dgl.remove_self_loop(g) and then call g = dgl.add_self_loop(g). To check whether a graph has self loops for all nodes, you can use has_edges_between.

The reason I want an API that checks whether the graph already has self-loops is that it would be better to replace the self_loop flag with internal logic in _prepare_batch. It is a little difficult for many newcomers to judge whether a model (like GCN or GAT) requires self-loops or not.
So my idea is something like this:

def check_self_loop(dgl_graph):
  # True if every node already has at least one self-loop edge
  nodes = dgl_graph.nodes()
  return bool(dgl_graph.has_edges_between(nodes, nodes).all())

class GCNModel(TorchModel):
  # .....

  def _prepare_batch(self, batch):
    inputs, labels, weights = batch
    dgl_graphs = [graph.to_dgl_graph() for graph in inputs[0]]
    # Only add self-loops to graphs that don't already have them
    dgl_graphs = [
        graph if check_self_loop(graph) else dgl.add_self_loop(graph)
        for graph in dgl_graphs
    ]
    inputs = dgl.batch(dgl_graphs).to(self.device)
    _, labels, weights = super(GCNModel, self)._prepare_batch(([], labels, weights))
    return inputs, labels, weights

This style doesn't require users to know whether the model needs a self-loop or not.

docs/models.rst Outdated
@@ -126,6 +126,9 @@ read off what's needed to train the model from the table below.
| :code:`GATModel` | Classifier/| :code:`GraphData` | | :code:`MolGraphConvFeaturizer` | :code:`fit` |
| | Regressor | | | | |
+----------------------------------------+------------+----------------------+------------------------+----------------------------------------------------------------+----------------------+
| :code:`GCNModel` | Classifier/| :code:`GraphData` | | :code:`CGCNNFeaturizer` | :code:`fit` |

Member

Basically, the GCNModel should focus on MolGraphConvFeaturizer

Member

+1 that MolGraphConvFeaturizer would probably be the more standard featurization here

Contributor Author

Done.

@nissy-dev (Member)

Considering @peastman's comments (#2249 (comment)) and unifying the graph data (#1942), it seems better to swap the current legacy GraphConvModel, MPNNModel and WeaveModel for models based on DGL-LifeSci:

  • GraphConvModel -> DGL-LifeSci GCN wrapper
  • MPNNModel -> DGL-LifeSci MPNN wrapper
  • WeaveModel -> DGL-LifeSci WeaveNet wrapper

The current WeaveModel has some known issues, like:

Note
----
In general, the use of batch normalization can cause issues with NaNs. If
you're having trouble with NaNs while using this model, consider setting
`batch_normalize_kwargs={"trainable": False}` or turning off batch
normalization entirely with `batch_normalize=False`.

Also, the way the legacy models handle graph data is really complex, so they are hard to maintain.

@rbharath (Member)

Just catching up on the discussions here. I'm generally in favor of migrating the backends to DGL-LifeSci, but we should get MoleculeNet benchmark numbers up so we know that the new implementation is on par with or better than the legacy implementation. This might be a good thing to do over a few PRs.

@rbharath (Member) left a comment

Couple of minor comments added but nothing big


@mufeili (Contributor, Author) commented Nov 1, 2020

@rbharath

It might be useful to have both featurizations as examples. I think DeepChem is starting to pick up more users from the materials science communities so both examples would be great :)

It seems that @nd-02110114 has some concerns about that?

+1 that MolGraphConvFeaturizer would probably be the more standard featurization here

That has been updated.

@nd-02110114

How does DeepChem handle tests for model classes? Where should I add a test for GCNModel?

@nissy-dev (Member) commented Nov 1, 2020

How does DeepChem handle tests for model classes? Where should I add a test for GCNModel?

I think https://github.com/deepchem/deepchem/blob/master/deepchem/models/tests/test_gat.py will be helpful for you.
Basically, we write overfit tests for regression and classification tasks using small datasets (10-20 samples). These docs are also helpful: https://deepchem.readthedocs.io/en/latest/coding.html#testing-machine-learning-models
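
For reference, a minimal sketch of such an overfit test for GCNModel, using the constructor arguments shown elsewhere in this PR; the SMILES strings, epoch count, and threshold are illustrative assumptions:

import numpy as np
import deepchem as dc

def test_gcn_regression_overfit():
  # Tiny illustrative dataset: 10 SMILES strings with random regression targets
  smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN", "CCC",
            "CO", "CCCl", "CCBr", "CC=O", "CCCO"]
  y = np.random.rand(len(smiles), 1)

  featurizer = dc.feat.MolGraphConvFeaturizer()
  X = featurizer.featurize(smiles)
  dataset = dc.data.NumpyDataset(X=X, y=y)

  model = dc.models.GCNModel(mode='regression', n_tasks=1,
                             number_atom_features=30, batch_size=10,
                             learning_rate=0.001)
  model.fit(dataset, nb_epoch=300)

  # The model should be able to overfit this tiny dataset
  metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)
  scores = model.evaluate(dataset, [metric])
  assert scores['mean_absolute_error'] < 0.1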

@mufeili (Contributor, Author) commented Nov 1, 2020

How does DeepChem handle tests for model classes? Where should I add a test for GCNModel?

I think https://github.com/deepchem/deepchem/blob/master/deepchem/models/tests/test_gat.py will be helpful for you.
Basically, we write overfit tests for regression and classification tasks using small datasets (10-20 samples). These docs are also helpful: https://deepchem.readthedocs.io/en/latest/coding.html#testing-machine-learning-models

I've added a test file test_gcn.py.

@nissy-dev (Member) left a comment

Thank you for adding the test! I added some comments.
I think this PR will be ready to merge after fixing the errors in CI.

For coding style and type annotations, you can check them like this:

$ pip install yapf==0.22.0
$ yapf -i -r file_or_directory_you_modified
$ mypy -p deepchem

Also, your tests don't pass in CI. (The current CI errors include some errors that are not related to this PR.) You should check the tests related to the code you modified, like this:

$ pytest deepchem/feat/tests/test_graph_data.py
$ pytest deepchem/models/tests/test_gcn.py

Also, I recently fixed the CI failure on the master branch, so please merge the master branch updates into this branch.
After that, all remaining CI errors will depend on your modifications.

@@ -123,29 +123,36 @@ def to_pyg_graph(self):
edge_attr=edge_features,
pos=node_pos_features)

-  def to_dgl_graph(self):
+  def to_dgl_graph(self, self_loop=True):

Member

Can you add a type annotation?

Member

Also, can you set the default value to False?
The current modification (defaulting to True) leads to many test errors.
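
Combining both requests, the annotated signature might look roughly like this (a sketch only; the return-type annotation is an assumption, not taken from this PR):

# Sketch of the requested signature: typed self_loop argument defaulting to False
def to_dgl_graph(self, self_loop: bool = False) -> 'dgl.DGLGraph':
  ...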

Contributor Author

Done.

Comment on lines 209 to 218

>>> import deepchem as dc
>>> from deepchem.models import GCNModel
>>> featurizer = dc.feat.MolGraphConvFeaturizer()
>>> tasks, datasets, transformers = dc.molnet.load_tox21(
... reload=False, featurizer=featurizer, transformers=[])
>>> train, valid, test = datasets
>>> model = dc.models.GCNModel(mode='classification', n_tasks=len(tasks),
... number_atom_features=30, batch_size=32, learning_rate=0.001)
>>> model.fit(train, nb_epoch=50)

@nissy-dev (Member) commented Nov 2, 2020

Can you change the example like this?
We test all examples written in docstrings, but load_tox21 is slow, so we should skip this example.

>>>
>> import deepchem as dc
>> from deepchem.models import GCNModel
>> featurizer = dc.feat.MolGraphConvFeaturizer()
>> tasks, datasets, transformers = dc.molnet.load_tox21(
..     reload=False, featurizer=featurizer, transformers=[])
>> train, valid, test = datasets
>> model = dc.models.GCNModel(mode='classification', n_tasks=len(tasks),
..                            number_atom_features=30, batch_size=32, learning_rate=0.001)
>> model.fit(train, nb_epoch=50)

This style renders the example correctly in docs and skips the doctest.

Contributor Author

Done.

@mufeili (Contributor, Author) commented Nov 2, 2020

@nd-02110114 I should have fixed all the issues, and the code has passed both the style check and CI locally.

@nissy-dev (Member)

Thanks! But the CI still fails on the style check.

You need to run these commands.

$ pip install yapf==0.22.0
$ yapf -i -r deepchem/feat
$ yapf -i -r deepchem/models

@nissy-dev (Member) commented Nov 2, 2020

The updates look good to me!

It seems DGL/DGL-LifeSci are not installed?

Please add the DGL-LifeSci dependencies to scripts/install_deepchem_conda.sh (and .ps1) and docs/requirements.rst.
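
For instance, the added install lines might look roughly like this (pip package names based on the standard DGL/DGL-LifeSci distributions; not taken from this PR):

$ pip install dgl
$ pip install dgllife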

After adding these, I think this PR is ready to merge 🎉

@mufeili (Contributor, Author) commented Nov 2, 2020

The updates look good to me!

It seems DGL/DGL-LifeSci are not installed?

Please add the DGL-LifeSci dependencies to scripts/install_deepchem_conda.sh (and .ps1) and docs/requirements.rst.

After adding these, I think this PR is ready to merge 🎉

Done

@coveralls commented Nov 2, 2020

Coverage Status

Coverage increased (+0.05%) to 80.665% when pulling 70abc0a on mufeili:master into 53c3b55 on deepchem:master.

@rbharath (Member) left a comment

Did a pass and I think this is good to merge as well! I believe @nd-02110114's review is done as well so I'm going to mark this as approved. Feel free to merge when ready!

@mufeili (Contributor, Author) commented Nov 3, 2020

@rbharath @nd-02110114 Awesome! I don't have write access to merge it, so you may go ahead and merge it. Thank you!

@rbharath rbharath merged commit a1385f3 into deepchem:master Nov 3, 2020

@rbharath (Member) commented Nov 3, 2020

@mufeili Thanks for the contribution! This will be very valuable to our users :). Looking forward to your next contributions!
