
[Model Zoo] Molecule Regression #779

Merged: 20 commits merged into dmlc:master on Aug 26, 2019
Conversation

@geekinglcq (Contributor) commented Aug 20, 2019

Description

An implementation of the Message Passing Neural Network (MPNN) and sample training code for the Alchemy dataset.

Reference:
Gilmer, Justin, et al. "Neural message passing for quantum chemistry." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.

Thanks @MilkshakeForReal for providing the draft-version implementation.
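
For readers unfamiliar with the model, here is a minimal, self-contained sketch of the MPNN idea from the paper above (edge-conditioned messages followed by GRU node updates). It is illustrative only and not the code added in this PR: all layer sizes are placeholders, and it assumes a DGL version that provides `fn.copy_e` and `g.local_scope`.

```python
import torch
import torch.nn as nn
import dgl
import dgl.function as fn

class TinyMPNNLayer(nn.Module):
    """One message-passing step in the spirit of Gilmer et al. (2017)."""
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        # The edge network maps edge features to a (node_dim x node_dim)
        # weight matrix for every edge.
        self.edge_net = nn.Sequential(
            nn.Linear(edge_dim, 64), nn.ReLU(),
            nn.Linear(64, node_dim * node_dim))
        self.gru = nn.GRUCell(node_dim, node_dim)
        self.node_dim = node_dim

    def forward(self, g, h, e):
        with g.local_scope():
            g.ndata['h'] = h
            g.edata['W'] = self.edge_net(e).view(-1, self.node_dim, self.node_dim)
            # Message on each edge: W_e @ h_src.
            g.apply_edges(lambda edges: {
                'm': torch.bmm(edges.data['W'],
                               edges.src['h'].unsqueeze(-1)).squeeze(-1)})
            # Aggregate messages, then update node states with a GRU.
            g.update_all(fn.copy_e('m', 'm'), fn.sum('m', 'agg'))
            return self.gru(g.ndata['agg'], h)
```

A full MPNN stacks several such steps and feeds the final node states into a Set2Set readout for graph-level regression.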

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To my best knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • The related issue is referenced in this PR

Changes

  • Model MPNN
@yzh119 (Member) commented Aug 20, 2019

Thanks for your contribution! I'll review this PR soon.
It seems it takes longer to train an MPNN than a SchNet (240s/epoch vs 80s/epoch), and the performance of MPNN is also not as good as SchNet's (0.092 vs 0.065). May I ask about the performance of your PyG implementation of MPNN? I just want to verify that our pooling layer is correct and efficient.

@mufeili self-requested a review Aug 20, 2019

@mufeili (Member) commented Aug 20, 2019

Hi @geekinglcq , thanks a lot for your contributions!

We are trying to put all chemistry related models in a dgl model zoo for chemistry. This reorganization has two advantages:

  1. Users who are only interested in chemistry-related applications can focus their attention there.
  2. Many utilities and training scripts can be shared. For example, MPNN, MGCN and SchNet can probably use one single training script (see the sketch after this comment).

If you are fine with this reorganization, I would suggest a reorganization of your code as follows:

  1. Put your dataset under https://github.com/dmlc/dgl/tree/master/python/dgl/data/chem with a file called alchemy.py.
  2. Put your mpnn.py under https://github.com/dmlc/dgl/tree/master/python/dgl/model_zoo/chem. You may want to add a short description of MPNN in the README file.
  3. Add a regression.py under https://github.com/dmlc/dgl/tree/master/examples/pytorch/model_zoo/chem/property_prediction, which is basically just your train.py. Update README accordingly.

Also, would you mind refactoring MGCN and SchNet as well?
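
To illustrate the "one training script for several models" point, here is a rough sketch of what a shared regression loop in regression.py could look like. The model, the data loader, and the feature keys ('n_feat', 'e_feat') are placeholders, not the final layout of this PR.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, train_loader, optimizer, device):
    """One epoch of graph-level regression; works for any model that maps
    (batched graph, node feats, edge feats) -> predictions."""
    model.train()
    loss_fn = nn.L1Loss()  # MAE, the metric quoted in this thread
    total, count = 0.0, 0
    for bg, labels in train_loader:   # bg: a batched DGLGraph
        bg, labels = bg.to(device), labels.to(device)
        pred = model(bg, bg.ndata['n_feat'], bg.edata['e_feat'])
        loss = loss_fn(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item() * len(labels)
        count += len(labels)
    return total / count
```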

@geekinglcq (Contributor, Author) commented Aug 21, 2019

Hi @mufeili, the reorganization is a good idea. Maybe we can directly use the pre-processed Alchemy dataset in DGL format so that users can avoid installing RDKit and save time.

@yzh119, the numerical result of the PyG version is about MAE = 0.05 at 240s/epoch. It is weird that its MAE is much lower than this version's.

@yzh119 (Member) commented Aug 21, 2019

@geekinglcq Thank you for letting me know; I'll check the implementation.

@mufeili (Member) commented Aug 21, 2019

@geekinglcq I'm OK with the additional dependency on RDKit since it is unavoidable for generative models, and users may want to know how exactly the featurization is performed. Meanwhile, I'd suggest saving the DGLGraphs with pickle after the first-time construction.
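
A minimal sketch of that "featurize once with RDKit, then cache with pickle" pattern; `build_graphs_with_rdkit`, the file names, and the cache location are placeholders, not the code in this PR.

```python
import os
import pickle

def load_alchemy_graphs(sdf_dir, cache_path='alchemy_graphs.pkl'):
    # Fast path: reuse the DGLGraphs constructed on a previous run.
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    # Slow path: parse the SDF files and featurize with RDKit (placeholder).
    graphs = build_graphs_with_rdkit(sdf_dir)
    with open(cache_path, 'wb') as f:
        pickle.dump(graphs, f)
    return graphs
```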

@jermainewang (Member) commented Aug 21, 2019

Comments about data loading:

  • I recommend downloading and unzipping the data to a directory outside the source tree; too many data files in the source tree will lag many editors that search the project directory. Check out the code in the dgl.data module (see the sketch after this list).
  • The data loads quite slowly. I ran the script on a p3 machine and it ran for 5+ minutes without showing anything. Is it possible to include pre-processed data? @VoVAllen this is a good example for our data format.
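
A sketch of that download-to-a-cache-directory pattern via dgl.data.utils; the URL is a placeholder, and exact helper signatures may differ across DGL versions.

```python
import os
from dgl.data.utils import download, extract_archive, get_download_dir

_URL = 'https://example.com/alchemy.zip'   # placeholder, not the real URL

def fetch_alchemy():
    # get_download_dir() points at a per-user cache (e.g. ~/.dgl),
    # keeping the large data files out of the source tree.
    root = get_download_dir()
    zip_path = os.path.join(root, 'alchemy.zip')
    data_dir = os.path.join(root, 'alchemy')
    if not os.path.exists(data_dir):
        download(_URL, path=zip_path)
        extract_archive(zip_path, data_dir)
    return data_dir
```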
@geekinglcq (Contributor, Author) commented Aug 22, 2019

Hi @jermainewang. To solve the data loading problem, I have come up with two ideas:

  1. We could provide a single big SDF file instead of 10000+ small files.
  2. We could provide a pre-processed data file. However, @mufeili suggests keeping the featurization part and saving the pre-processed results after the first run.

I am not sure which way is more suitable.

@mufeili (Member) commented Aug 22, 2019

@geekinglcq @jermainewang How about the following (a rough sketch in code follows this list):

  1. Download both the raw data and the preprocessed DGLGraphs when running the training script.
  2. Use the preprocessed DGLGraphs in the training script for faster data loading.
  3. Keep the code for DGLGraph construction and leave a note in the README for readers who want to know the data preprocessing steps.
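
Sketched below, the dataset could expose that choice as a single flag; the flag name `from_raw`, the helper functions, and the file names are assumptions, not the final API. The rough timings come from the measurements reported later in this thread.

```python
class AlchemyDataset(object):
    """Option 1 (default): load preprocessed DGLGraphs (~1 min).
    Option 2: rebuild everything from the raw SDF file (~20-30 min)."""

    def __init__(self, from_raw=False):
        if from_raw:
            # Kept for readers who want to see the featurization steps.
            self.graphs, self.labels = construct_from_sdf('alchemy.sdf')
        else:
            self.graphs, self.labels = load_preprocessed('alchemy_graphs.pkl')

    def __getitem__(self, idx):
        return self.graphs[idx], self.labels[idx]

    def __len__(self):
        return len(self.graphs)
```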
@mufeili referenced this pull request Aug 22, 2019
@mufeili (Member) commented Aug 22, 2019

@yzh119 Have you checked the differences between PyG and DGL implementations? If not, I will do it today.

@jermainewang (Member) commented Aug 22, 2019

+1 with @mufeili. I also like @geekinglcq's idea of having each file store many SDFs, as long as the preprocessing is not too tedious. I think we should do both.

@yzh119 (Member) commented Aug 22, 2019

@mufeili I'm working on it.

@geekinglcq (Contributor, Author) commented Aug 23, 2019

Some information about refactoring the dataloader:

| Method | Time cost |
| --- | --- |
| construct from raw (multiple) files | 28 min |
| construct from raw (single) file | 22 min |
| load pre-processed file | 1 min |

| File | File size |
| --- | --- |
| single SDF | 179M |
| pre-processed pickle file | 2.2G |

Tested on an Intel(R) Xeon(R) Gold 61xx CPU @ 2.5GHz, single core.

So, in the current plan, we provide two options to users:

  1. Load the pre-processed file (default).
  2. Construct from the raw file (single SDF).

Am I right? @mufeili @jermainewang
@mufeili (Member) commented Aug 23, 2019


@geekinglcq I'm fine with this proposal. You can simply add an option in the dataset class.

Also, do you have any comments or alternative ideas on my previous proposal for refactoring the model zoo?

@yzh119 (Member) commented Aug 23, 2019

There are three major differences between this implementation and the PyG version:

  1. Adam learning rate: 1e-3 vs 1e-4.
  2. Number of Set2Set layers: 1 vs 3.
  3. Aggregator type: sum vs mean.

I'll provide an fn.mean reducer soon.
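
For reference, switching the aggregator becomes a one-line change with DGL's built-in reducers once fn.mean is wired in. A sketch, assuming a DGL version that provides fn.copy_e and fn.mean; the feature names are placeholders.

```python
import dgl.function as fn

def aggregate_messages(g, aggregator='mean'):
    # 'm' is assumed to hold precomputed per-edge messages.
    reducer = fn.mean if aggregator == 'mean' else fn.sum
    g.update_all(fn.copy_e('m', 'm'), reducer('m', 'h_agg'))
    return g.ndata['h_agg']
```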

geekinglcq and others added 3 commits Aug 23, 2019

- **SchNet**: SchNet is a novel deep learning architecture modeling quantum interactions in molecules which utilize
the continuous-filter convolutional layers [3].
- **Multilevel Graph Convolutional neural Network**: Multilevel Graph Convolutional neural Network (MGCN) is a well-designed

@mufeili (Member) commented Aug 23, 2019

The first "n" in "neural" should be capitalized.

@mufeili (Member) commented Aug 24, 2019

@geekinglcq Very nice!

In addition to the minor comments made, several issues need to be addressed:

  1. Do we still need to keep dgl/examples/pytorch/mpnn and dgl/examples/pytorch/schnet? Since they have exactly the same content as regression.py, probably we can just drop them?
  2. I assume different models need different hyperparameters to achieve the best performance, but it seems that the refactored file is now using the same suite of hyperparameters. Have you tested the model performance with regression.py?
  3. We want to provide both training scripts and pre-trained models to users for reproducible results. Do you have pre-trained models saved?
  4. One common pain point for writing DGL code is that, since we support multiple frameworks, we need to make our code framework-agnostic. For example, we do not use PyTorch inside the data files (a sketch of this pattern follows the list). I can take that part once the rest is completed.
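
A sketch of the framework-agnostic pattern from point 4: the data module returns NumPy arrays, and only the training script converts them to framework tensors. The featurizer below is illustrative, not the featurization used in this PR.

```python
import numpy as np

def atom_features(atom_symbols):
    # Pure NumPy: no torch/mxnet import inside the data module.
    vocab = ['H', 'C', 'N', 'O', 'F', 'S', 'Cl']
    feats = np.zeros((len(atom_symbols), len(vocab)), dtype=np.float32)
    for i, symbol in enumerate(atom_symbols):
        if symbol in vocab:
            feats[i, vocab.index(symbol)] = 1.0
    return feats

# In the PyTorch training script only:
#   import torch
#   n_feat = torch.from_numpy(atom_features(symbols))
```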

@mufeili changed the title from [Model] Molecule Regression to [Model Zoo] Molecule Regression on Aug 24, 2019

@geekinglcq (Contributor, Author) commented Aug 24, 2019

Replying to the questions above in order:

  1. Yes, we can simply drop them.
  2. The default setting in each model was tested to be one of the best settings on our dataset. However, I am not sure the best hyperparameters won't change when using another dataset. The shared hyperparameters (e.g. the learning rate) have not been tested yet.
  3. Currently we don't have a pre-trained model for the Alchemy dataset.
  4. I am fine with that, thank you~
@jermainewang (Member) commented Aug 24, 2019

A pickled file is generally not a good data format due to (1) its large size and (2) compatibility issues. For this PR, I'm fine with pickled files, but they should be replaced by our own data format in the future.
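
For later reference, a sketch of what replacing pickle with DGL's own graph serialization could look like; whether save_graphs/load_graphs are available depends on the installed DGL version, and the label key is a placeholder.

```python
from dgl.data.utils import save_graphs, load_graphs

def save_dataset(path, graphs, label_tensor):
    # graphs: list of DGLGraph; label_tensor: regression targets.
    save_graphs(path, graphs, {'labels': label_tensor})

def load_dataset(path):
    graphs, label_dict = load_graphs(path)
    return graphs, label_dict['labels']
```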

@mufeili (Member) commented Aug 24, 2019

@jermainewang Got it.

@mufeili (Member) commented Aug 24, 2019

@geekinglcq Could you please also add code for evaluation on the test set? I assume the test set is available now?

@MilkshakeForReal (Contributor) commented Aug 25, 2019

I'd like to thank you all for this important work. One thing I need to mention is that the reset_parameters function only works when edge_net is an MLP, while edge_net could technically be anything that maps the number of edge attributes to input channels times output channels. Two suggestions:

  1. Make the edge_net output dimension clear (in the docs) and raise an error when it is wrong (see the sketch after this list).
  2. Extend the scope of application of the reset_parameters function (at least to convolutional networks).
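
A sketch of the check suggested in point 1: probe edge_net once and fail fast if its output dimension is not in_channels * out_channels. The function name and error message are illustrative only.

```python
import torch

def check_edge_net(edge_net, num_edge_feats, in_channels, out_channels):
    # Run a dummy edge-feature tensor through edge_net and inspect the output.
    probe = torch.zeros(1, num_edge_feats)
    out_dim = edge_net(probe).shape[-1]
    expected = in_channels * out_channels
    if out_dim != expected:
        raise ValueError(
            'edge_net must map {} edge features to {} values '
            '(in_channels * out_channels), got {}'.format(
                num_edge_feats, expected, out_dim))
```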
@yzh119 (Member) commented Aug 25, 2019

@MilkshakeForReal I think the initialization of edge_nn should happen outside of the NNConv module. I know it's the design of PyG, but I don't agree that DGL NN modules should take care of the initialization of "outside" modules like edge_nn.

We will incorporate NNConv into dgl.*.nn.conv. I think users have the right to write their own initializers for these components, and those could be overwritten by reset_parameters, which is harmful.
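
In that design, the initialization simply lives with the user. A sketch, where the NNConv call in the final comment is an assumption about the eventual dgl.nn API and all sizes are placeholders.

```python
import torch.nn as nn
import torch.nn.init as init

in_dim, out_dim, edge_dim = 64, 64, 5   # placeholder sizes

# edge_nn maps edge features to in_dim * out_dim values per edge.
edge_nn = nn.Sequential(
    nn.Linear(edge_dim, 128), nn.ReLU(),
    nn.Linear(128, in_dim * out_dim))

# The user owns the initialization; the conv module never overwrites it.
for layer in edge_nn:
    if isinstance(layer, nn.Linear):
        init.xavier_uniform_(layer.weight)
        init.zeros_(layer.bias)

# conv = NNConv(in_dim, out_dim, edge_nn, aggregator_type='mean')  # assumed signature
```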

@MilkshakeForReal (Contributor) commented Aug 25, 2019


Highly agreed.

mufeili and others added 6 commits Aug 25, 2019
@mufeili (Member) commented Aug 26, 2019

@geekinglcq @yzh119 FYI, I got a training MAE of about 0.106 with the previous hyperparameters, and a training MAE of about 0.056 after changing the learning rate from 1e-3 to 1e-4 and the number of Set2Set layers from 1 to 3.

@yzh119 (Member) commented Aug 26, 2019

@yzh119 Thanks for letting me know; I think 0.056 is an acceptable result.

@mufeili (Member) commented Aug 26, 2019

@geekinglcq Thanks for the brilliant contributions. Let's merge this PR first and I may perform some minor adjustments.

@mufeili merged commit 3bc7393 into dmlc:master on Aug 26, 2019

1 check passed: continuous-integration/jenkins/pr-merge (This commit looks good)