
Model Wrapper for GCNPredictor from DGL-LifeSci #2249

Merged
merged 21 commits from mufeili:master into deepchem:master on Nov 3, 2020

Conversation

@mufeili (Contributor) commented Oct 26, 2020

@rbharath This PR serves as a proof of concept for wrappers of models from DGL-LifeSci. I basically followed the design by @nd-02110114. The PR assumes the latest versions of DGL (0.5.2) and DGL-LifeSci (0.2.5).

Previously the GAT model was implemented in PyG, and the GCN in this PR is implemented in DGL. What if there are both DGL and PyG implementations for a model? How should we arrange the namespace?

@peastman (Contributor)

This is looking very nice!

It would be good for the docstring for GCNModel to explain how this model is different from GraphConvModel, since they do very similar things. It also would be good if, as much as possible, the names and order of the constructor arguments match GraphConvModel.

@rbharath (Member) left a comment

Looking good! +1 for @peastman's comments. I have one additional minor comment about the docs as well.

deepchem/models/torch_models/gcn.py (review thread resolved)

@nissy-dev (Member) commented Oct 27, 2020

Previously the GAT model was implemented in PyG, and the GCN in this PR is implemented in DGL. What if there are both DGL and PyG implementations for a model? How should we arrange the namespace?

Basically, I think DGLXXXX or XXXXWithDGL (ex: DGLGCNModel or GCNModelWithDGL) is a better name. It makes clear which framework the model depends on.

Having both DGL and PyG implementations in DeepChem is not good because it confuses users about which model is better or correct. Also, the performance of a GATModel based on the DGL-LifeSci model is better than that of the current GATModel based on PyG. If you want to add a GATModel based on the DGL-LifeSci model, I think it is better to replace the current GATModel with it.

My motivation for creating the current GATModel was just to provide a sample model using PyG, so the model doesn't need to be GAT. Currently, my DGL-based CGCNNModel is not working, so I think it is better to keep providing a PyG-based sample model by rewriting CGCNN with PyG.

@peastman (Contributor)

Expanding on Daiki's comments, our general approach is that DeepChem provides the user a choice of models, not a choice of implementations. What framework a model is implemented with should be an internal detail that users mostly don't need to think about. The user will say, "I want a GAT model," so they'll use the GATModel class. They may not even know or care whether it's implemented with PyG, DGL, TensorFlow, or whatever.

@mufeili (Contributor, Author) commented Oct 28, 2020

This is looking very nice!

It would be good for the docstring for GCNModel to explain how this model is different from GraphConvModel, since they do very similar things. It also would be good if, as much as possible, the names and order of the constructor arguments match GraphConvModel.

I've added an explanation of their differences in Notes and made some adjustments to the order of the arguments. Meanwhile, there are still many differences between these two models, so it might be difficult to have exactly the same arguments.

@mufeili (Contributor, Author) commented Oct 28, 2020

Previously the GAT model was implemented in PyG, and the GCN in this PR is implemented in DGL. What if there are both DGL and PyG implementations for a model? How should we arrange the namespace?

Basically, I think DGLXXXX or XXXXWithDGL (ex: DGLGCNModel or GCNModelWithDGL) is a better name. It makes clear which framework the model depends on.

Having both DGL and PyG implementations in DeepChem is not good because it confuses users about which model is better or correct. Also, the performance of a GATModel based on the DGL-LifeSci model is better than that of the current GATModel based on PyG. If you want to add a GATModel based on the DGL-LifeSci model, I think it is better to replace the current GATModel with it.

My motivation for creating the current GATModel was just to provide a sample model using PyG, so the model doesn't need to be GAT. Currently, my DGL-based CGCNNModel is not working, so I think it is better to keep providing a PyG-based sample model by rewriting CGCNN with PyG.

Makes sense. I can work on GAT in the next PR.

Expanding on Daiki's comments, our general approach is that DeepChem provides the user a choice of models, not a choice of implementations. What framework a model is implemented with should be an internal detail that users mostly don't need to think about. The user will say, "I want a GAT model," so they'll use the GATModel class. They may not even know or care whether it's implemented with PyG, DGL, TensorFlow, or whatever.

Got it, makes sense.

@nissy-dev (Member) left a comment

I reviewed! Thank you for your work!

I think this PR will be ready to merge after you add a GCNModel test and rewrite the docstring examples.

The current examples are based on crystal data, so you need to convert them to examples based on molecule data. I think the GATModel docstrings are really helpful. (The GATModel test is also helpful.)

GATModel : https://github.com/deepchem/deepchem/blob/master/deepchem/models/torch_models/gat.py

>>> from deepchem.models import GCN
>>> lattice = mg.Lattice.cubic(4.2)
>>> structure = mg.Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
>>> featurizer = dc.feat.CGCNNFeaturizer()

@nissy-dev (Member) commented Oct 28, 2020

MolGraphConvFeaturizer is better than CGCNNFeaturizer here.
CGCNNFeaturizer possibly has some bugs (#2160), and DeepChem users are generally more interested in molecules than in inorganic crystals.

Sample implementation is here
https://colab.research.google.com/drive/1NWPXbnMMe8vsqouE60hBBF8ouXDIxgMM?usp=sharing

MolGraphConvFeaturizer
https://github.com/deepchem/deepchem/blob/master/deepchem/feat/molecule_featurizers/mol_graph_conv_featurizer.py
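
For reference, a minimal sketch of featurizing a molecule with MolGraphConvFeaturizer and converting it to a DGL graph; the SMILES string is only an illustrative choice, not code from this PR:

import deepchem as dc

# Featurize a single molecule into a GraphData object
featurizer = dc.feat.MolGraphConvFeaturizer()
graphs = featurizer.featurize(["CCO"])  # ethanol, chosen only for illustration

# Convert to a DGL graph (self-loop handling is discussed later in this thread)
g = graphs[0].to_dgl_graph()
print(g.num_nodes(), g.num_edges())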

Member

It might be useful to have both featurizations as examples. I think DeepChem is starting to pick up more users from the materials science communities so both examples would be great :)

Member

I agree that ideally it would be better to support both crystal graphs (CGCNNFeaturizer) and molecule graphs (MolGraphConvFeaturizer).

But, as I mentioned in #2249 (comment), we should take care of some points before supporting both.

Contributor Author

I've replaced CGCNNFeaturizer with MolGraphConvFeaturizer in the examples.

raise ImportError('This class requires dgl.')

inputs, labels, weights = batch
dgl_graphs = [graph.to_dgl_graph() for graph in inputs[0]]

@nissy-dev (Member) commented Oct 29, 2020

When using MolGraphConvFeaturizer, we need to add the self-loop connections explicitly.
The graph data built by MolGraphConvFeaturizer doesn't have self-loop connections. This design is inspired by the PyG style: the self-loop connections are added in the forward function of the GCN class.
Ref: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html#implementing-the-gcn-layer

dgl.add_self_loop(graph.to_dgl_graph())

Also, to support both CGCNNFeaturizer and MolGraphConvFeaturizer, GCNModel would need to receive a self_loop flag and add the self-loop connections when self_loop == True. CGCNNFeaturizer often returns graphs with multiple self-loop connections; this is a characteristic of crystal graphs. However, I think the self_loop flag is a little difficult for GNN beginners, who are most of DeepChem's users. In addition, CGCNNFeaturizer may have some bugs (#2160). So, basically, I think GCNModel should focus on supporting MolGraphConvFeaturizer.

Ideally, DGL graphs would have an API that judges whether a graph already has self-loop connections. Do you know whether DGLGraph has such an API?

@mufeili (Contributor, Author) commented Nov 1, 2020

  1. I've added a self_loop flag to GCNModel as well as to_dgl_graph. When one instantiates a GCNModel instance with self_loop = True, the object will add self loops to DGLGraphs in _prepare_batch.
  2. Why do you want an API that checks whether the graph already has self-loops? Do you simply want to ensure that each node has a single self loop? If so, we can first call g = dgl.remove_self_loop(g) and then call g = dgl.add_self_loop(g). To check whether a graph has self loops for all nodes, you can use has_edges_between.
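
For illustration, a minimal sketch of this normalize-then-check pattern on a toy DGLGraph; the graph construction is only an example, not code from this PR:

import dgl
import torch

# Toy directed graph used only for illustration: 3 nodes, edges 0->1 and 1->2
g = dgl.graph((torch.tensor([0, 1]), torch.tensor([1, 2])), num_nodes=3)

# Ensure each node has exactly one self-loop:
# drop any existing self-loops first, then add one per node
g = dgl.remove_self_loop(g)
g = dgl.add_self_loop(g)

# Check that every node now has a self-loop edge
nodes = g.nodes()
print(g.has_edges_between(nodes, nodes).all())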

Member

I've added a self_loop flag to GCNModel as well as to_dgl_graph. When one instantiates a GCNModel instance with self_loop = True, the object will add self loops to DGLGraphs in _prepare_batch.

The modification looks good to me!

Why do you want an API that checks whether the graph already has self-loops? Do you simply want to ensure that each node has a single self loop? If so, we can first call g = dgl.remove_self_loop(g) and then call g = dgl.add_self_loop(g). To check whether a graph has self loops for all nodes, you can use has_edges_between.

The reason I want an API that checks whether the graph already has self-loops is that it would be better to replace the self_loop flag with internal logic in _prepare_batch. It is a little difficult for many newcomers to judge whether a model (like GCN or GAT) requires self-loops or not.
So my idea is something like this:

def check_self_loop(dgl_graph):
  # True if every node already has at least one self-loop edge
  nodes = dgl_graph.nodes()
  return bool(dgl_graph.has_edges_between(nodes, nodes).all())

class GCNModel(TorchModel):
  # .....

  def _prepare_batch(self, batch):
    inputs, labels, weights = batch
    dgl_graphs = [graph.to_dgl_graph() for graph in inputs[0]]
    # Only add self-loops to graphs that don't already have them
    dgl_graphs = [
        graph if check_self_loop(graph) else dgl.add_self_loop(graph)
        for graph in dgl_graphs
    ]
    inputs = dgl.batch(dgl_graphs).to(self.device)
    _, labels, weights = super(GCNModel, self)._prepare_batch(([], labels, weights))
    return inputs, labels, weights

This style doesn't require users to know whether the model needs a self-loop or not.

docs/models.rst Outdated
@@ -126,6 +126,9 @@ read off what's needed to train the model from the table below.
| :code:`GATModel` | Classifier/| :code:`GraphData` | | :code:`MolGraphConvFeaturizer` | :code:`fit` |
| | Regressor | | | | |
+----------------------------------------+------------+----------------------+------------------------+----------------------------------------------------------------+----------------------+
| :code:`GCNModel` | Classifier/| :code:`GraphData` | | :code:`CGCNNFeaturizer` | :code:`fit` |

Member

Basically, the GCNModel should focus on MolGraphConvFeaturizer

Member

+1 that MolGraphConvFeaturizer would probably be the more standard featurization here

Contributor Author

Done.

@nissy-dev (Member)

Considering @peastman's comments (#2249 (comment)) and unifying the graph data (#1942), it seems better to swap the current legacy GraphConvModel, MPNNModel and WeaveModel for models based on DGL-LifeSci:

  • GraphConvModel -> DGL-LifeSci GCN wrapper
  • MPNNModel -> DGL-LifeSci MPNN wrapper
  • WeaveModel -> DGL-LifeSci WeaveNet wrapper

The current WeaveModel has some known issues, like:

Note
----
In general, the use of batch normalization can cause issues with NaNs. If
you're having trouble with NaNs while using this model, consider setting
`batch_normalize_kwargs={"trainable": False}` or turning off batch
normalization entirely with `batch_normalize=False`.

Also, the way the legacy models handle graph data is really complex, so they are hard to maintain.

@rbharath (Member)

Just catching up on the discussions here. I'm generally in favor of migrating the backends to DGL-LifeSci, but we should get MoleculeNet benchmark numbers up so we know that the new implementation is on par with or better than the legacy implementation. This might be a good thing to do over a few PRs.

@rbharath (Member) left a comment

Couple of minor comments added but nothing big


@mufeili (Contributor, Author) commented Nov 1, 2020

@rbharath

It might be useful to have both featurizations as examples. I think DeepChem is starting to pick up more users from the materials science communities so both examples would be great :)

It seems that @nd-02110114 has some concerns about that?

+1 that MolGraphConvFeaturizer would probably be the more standard featurization here

That has been updated.

@nd-02110114

How does DeepChem handle tests for model classes? Where should I add a test for GCNModel?

@nissy-dev (Member) commented Nov 1, 2020

How does DeepChem handle tests for model classes? Where should I add a test for GCNModel?

I think https://github.com/deepchem/deepchem/blob/master/deepchem/models/tests/test_gat.py will be helpful for you.
Basically, we write overfit tests for regression and classification tasks using small datasets (10-20 samples). These docs are also helpful: https://deepchem.readthedocs.io/en/latest/coding.html#testing-machine-learning-models
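
For reference, a minimal sketch of such an overfit test for GCNModel, using the constructor arguments shown elsewhere in this PR; the SMILES strings, epoch count, and threshold are illustrative assumptions:

import numpy as np
import deepchem as dc

def test_gcn_regression_overfit():
  # Tiny illustrative dataset: 10 SMILES strings with random regression targets
  smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN", "CCC",
            "CO", "CCCl", "CCBr", "CC=O", "CCCO"]
  y = np.random.rand(len(smiles), 1)

  featurizer = dc.feat.MolGraphConvFeaturizer()
  X = featurizer.featurize(smiles)
  dataset = dc.data.NumpyDataset(X=X, y=y)

  model = dc.models.GCNModel(mode='regression', n_tasks=1,
                             number_atom_features=30, batch_size=10,
                             learning_rate=0.001)
  model.fit(dataset, nb_epoch=300)

  # The model should be able to overfit this tiny dataset
  metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)
  scores = model.evaluate(dataset, [metric])
  assert scores['mean_absolute_error'] < 0.1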

@mufeili (Contributor, Author) commented Nov 1, 2020

How does DeepChem handle tests for model classes? Where should I add a test for GCNModel?

I think https://github.com/deepchem/deepchem/blob/master/deepchem/models/tests/test_gat.py will be helpful for you.
Basically, we write overfit tests for regression and classification tasks using small datasets (10-20 samples). These docs are also helpful: https://deepchem.readthedocs.io/en/latest/coding.html#testing-machine-learning-models

I've added a test file test_gcn.py.

@nissy-dev (Member) left a comment

Thank you for adding the test! I added some comments.
I think this PR will be ready to merge after fixing the errors in CI.

For coding style and type annotations, you can check them like this:

$ pip install yapf==0.22.0
$ yapf -i -r file_or_directory_you_modified
$ mypy -p deepchem

Also, your tests don't pass in CI. (The current CI errors include some errors that are not related to this PR.) You should check the tests related to the code you modified, like this:

$ pytest deepchem/feat/tests/test_graph_data.py
$ pytest deepchem/models/tests/test_gcn.py

Also, I recently fixed the CI failure on the master branch, so please merge the master branch updates into this branch.
After that, all remaining CI errors will depend on your modifications.

@@ -123,29 +123,36 @@ def to_pyg_graph(self):
edge_attr=edge_features,
pos=node_pos_features)

-  def to_dgl_graph(self):
+  def to_dgl_graph(self, self_loop=True):

Member

Can you add a type annotation?

Member

Also, can you set the default value to False?
The current modification (defaulting to True) leads to many test errors.
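
Combining both requests, the annotated signature might look roughly like this (a sketch only; the return-type annotation is an assumption, not taken from this PR):

# Sketch of the requested signature: typed self_loop argument defaulting to False
def to_dgl_graph(self, self_loop: bool = False) -> 'dgl.DGLGraph':
  ...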

Contributor Author

Done.

Comment on lines 209 to 218

>>> import deepchem as dc
>>> from deepchem.models import GCNModel
>>> featurizer = dc.feat.MolGraphConvFeaturizer()
>>> tasks, datasets, transformers = dc.molnet.load_tox21(
... reload=False, featurizer=featurizer, transformers=[])
>>> train, valid, test = datasets
>>> model = dc.models.GCNModel(mode='classification', n_tasks=len(tasks),
... number_atom_features=30, batch_size=32, learning_rate=0.001)
>>> model.fit(train, nb_epoch=50)

@nissy-dev (Member) commented Nov 2, 2020

Can you change the example like this?
We test all examples written in docstrings, but load_tox21 is slow, so we should skip this example.

>>>
>> import deepchem as dc
>> from deepchem.models import GCNModel
>> featurizer = dc.feat.MolGraphConvFeaturizer()
>> tasks, datasets, transformers = dc.molnet.load_tox21(
..     reload=False, featurizer=featurizer, transformers=[])
>> train, valid, test = datasets
>> model = dc.models.GCNModel(mode='classification', n_tasks=len(tasks),
..                            number_atom_features=30, batch_size=32, learning_rate=0.001)
>> model.fit(train, nb_epoch=50)

This style renders the example correctly in docs and skips the doctest.

Contributor Author

Done.

@mufeili (Contributor, Author) commented Nov 2, 2020

@nd-02110114 I should have fixed all the issues, and the code has passed both the style check and CI locally.

@nissy-dev (Member)

Thanks! But the CI still fails on the style check.

You need to run these commands.

$ pip install yapf==0.22.0
$ yapf -i -r deepchem/feat
$ yapf -i -r deepchem/models

@nissy-dev (Member) commented Nov 2, 2020

The updates look good to me!

It seems DGL/DGL-LifeSci are not installed?

Please add the DGL-LifeSci dependencies to scripts/install_deepchem_conda.sh (and .ps1) and docs/requirements.rst.
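
For instance, the added install lines might look roughly like this (pip package names based on the standard DGL/DGL-LifeSci distributions; not taken from this PR):

$ pip install dgl
$ pip install dgllife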

After adding these, I think this PR is ready to merge 🎉

@mufeili (Contributor, Author) commented Nov 2, 2020

The updates look good to me!

It seems DGL/DGL-LifeSci are not installed?

Please add the DGL-LifeSci dependencies to scripts/install_deepchem_conda.sh (and .ps1) and docs/requirements.rst.

After adding these, I think this PR is ready to merge 🎉

Done

@coveralls commented Nov 2, 2020

Coverage Status

Coverage increased (+0.05%) to 80.665% when pulling 70abc0a on mufeili:master into 53c3b55 on deepchem:master.

@rbharath (Member) left a comment

Did a pass and I think this is good to merge as well! I believe @nd-02110114's review is done as well so I'm going to mark this as approved. Feel free to merge when ready!

@mufeili (Contributor, Author) commented Nov 3, 2020

@rbharath @nd-02110114 Awesome! I don't have write access to merge it, so you may go ahead and merge it. Thank you!

@rbharath rbharath merged commit a1385f3 into deepchem:master Nov 3, 2020

@rbharath (Member) commented Nov 3, 2020

@mufeili Thanks for the contribution! This will be very valuable to our users :). Looking forward to your next contributions!
