Basic MolGAN model #2426
Codecov Report
@@ Coverage Diff @@
## master #2426 +/- ##
==========================================
+ Coverage 85.65% 85.75% +0.09%
==========================================
Files 307 309 +2
Lines 27196 27353 +157
==========================================
+ Hits 23295 23456 +161
+ Misses 3901 3897 -4
Continue to review full report at Codecov.
Issue not connected with my code.
Did a first round of review. Looks good overall! I've made a few comments about docs mainly below.
@peastman Could you do a review as well? This model would be the first molecular GAN in the library and we might want to make sure the API is one we like
deepchem/models/molgan.py (outdated)
Returns
-------
keras.Model
I'm not sure if this will render correctly in the docs. Could you try building the docs locally and seeing if this looks good?
Do you mean running doctest?
Hi, I tried running Sphinx but am struggling to get it working.
Should I just remove the Returns part?
Is it possible to add a test for correct operation? The current tests just check that certain parameters were recorded correctly. They don't tell you whether the model works. For example, can you train a small model, have it predict some molecules, and verify that they're valid?
The problem is that training does not always produce a model that can generate valid molecules.
Here is an example of a GAN test case: deepchem/deepchem/models/tests/test_gan.py Lines 133 to 148 in 7ed0125
It trains a GAN based on a very simple data distribution, and checks whether the generated data is reasonably close. Can you do something similar for this class? Train it on a very simple dataset, and see whether it has learned the basic properties of the training data?
I am not sure, as it has to generate molecules and does not always learn to do so.
OK, so there seems to be a corner case where yapf and flake8 conflict.
How do you mean? It's just complaining about the indentation of one line. On the other hand, there seems to be an actual test failure on Windows (but not Linux):
The new test case looks good. Just out of curiosity, could you post a few examples of molecules it generates? I'm just wondering what sorts of molecules it produces after training on that small number of examples. Has it memorized members of the training set? Or is it just outputting strings of 'C's?
Actually, I was surprised to see that it did not recreate a single molecule from the original set.
Example model outputs (sample of 1000 generated each time):
The model was trained in a loop of 10, with each iteration creating a new model. I also noticed that with more compounds a low number of iterations is favourable (8-10), whereas for a small number of compounds (as in the test set) you need a much larger number (1000-5000).
Model 1: 804 valid, 6 unique, first training attempt successful
Model 2: 60 valid, 10 unique, first training attempt successful
Model 3: 13 valid, 7 unique, first training attempt successful
Model 4: 17 valid, 3 unique, first training attempt successful
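For readers following along, the "valid" and "unique" counts quoted above can be computed roughly as follows. This is a minimal sketch, not the code used in this PR: it assumes each generated graph has already been converted to a canonical SMILES string (for example with RDKit), with `None` marking a sample that could not be converted. The function name `summarize_samples` and the example batch are hypothetical.

```python
from typing import Iterable, Optional


def summarize_samples(canonical_smiles: Iterable[Optional[str]]) -> dict:
    """Count valid and unique molecules in a batch of generated samples.

    Each entry is assumed to be a canonical SMILES string, or None when the
    generated graph could not be converted into a valid molecule.
    """
    samples = list(canonical_smiles)
    valid = [s for s in samples if s]  # drop None and empty strings
    unique = set(valid)                # identical SMILES count once
    total = len(samples)
    return {
        "total": total,
        "valid": len(valid),
        "unique": len(unique),
        "validity": len(valid) / total if total else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
    }


# Hypothetical batch: three valid samples (two identical) and two failures.
stats = summarize_samples(["CCO", "CCO", None, "CCN", None])
print(stats)
```

Uniqueness is reported here as a fraction of the valid samples; the thread reports raw counts, so the two are not directly comparable without the denominators.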
Those actually aren't too bad. Each model produces a decent amount of diversity. It does seem to have a tendency to produce the same molecules over and over, but that's not surprising after overfitting to a tiny dataset.
If it usually produces valid molecules on the first attempt, maybe we should reduce the number of iterations? Letting it fail nine times before succeeding could cause the test not to notice if a future change makes the code stop working as well.
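One way to act on this suggestion is to have the test record how many attempts were needed and assert a tight bound on it. The sketch below is illustrative only: `train_with_retries` and `fake_training_run` are hypothetical names, and the stand-in training function just replays a fixed sequence of outcomes instead of training a real model.

```python
from typing import Callable, Tuple


def train_with_retries(train_once: Callable[[], bool],
                       max_attempts: int) -> Tuple[bool, int]:
    """Call train_once until it succeeds or max_attempts is reached.

    Returns (succeeded, attempts_used) so a test can assert on both.
    """
    for attempt in range(1, max_attempts + 1):
        if train_once():
            return True, attempt
    return False, max_attempts


# Hypothetical stand-in for "train a fresh model and check that it generates
# at least one valid molecule": fails once, then succeeds.
outcomes = iter([False, True, True])


def fake_training_run() -> bool:
    return next(outcomes)


succeeded, attempts = train_with_retries(fake_training_run, max_attempts=3)
assert succeeded, "no attempt produced a usable model"
assert attempts <= 3, "training needed more attempts than the test allows"
print(succeeded, attempts)
```

Keeping `max_attempts` small means a regression that makes training fail more often shows up as a test failure rather than being absorbed by extra retries.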
These don't look bad at all! I'd be curious about how these compare to molecules generated by the normalizing flows. CC @ncfrey who might have some insight there
Looks really cool @MiloszGrabski! The validity and uniqueness scores are pretty low, but that might not matter too much. About how long does it take to generate 1000 samples?
@ncfrey It is almost instantaneous, maybe 1 second. I never paid it much attention as it was so fast.
@ncfrey @peastman @rbharath The worst part is that I am unable to test changes on my machine, so every time I have to wait 40 minutes to find out whether a fix works.
One thought off hand is perhaps it's an issue with different randomness? Perhaps setting the random seed to the same value could help here.
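A minimal sketch of what fixing the seed could look like in a test. The helper name `seed_everything` is hypothetical; it seeds Python's `random` module and NumPy if it is installed. With TensorFlow 2 one would also call `tf.random.set_seed(seed)`, which is left as a comment so the example runs without TensorFlow. Fixing seeds makes repeated runs on one machine comparable, but it may not remove platform-specific numerical differences.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the random number generators that are available."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is not installed; nothing more to seed here
    np.random.seed(seed)
    # With TensorFlow 2 installed, one would also call:
    #   tf.random.set_seed(seed)


seed_everything(123)
first = [random.random() for _ in range(3)]
seed_everything(123)
second = [random.random() for _ in range(3)]
assert first == second  # same seed, same sequence
```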
The current errors seem to be connected with the GAN itself:
OK, I get it. It is because there are no inputs 3 and 4 in the real data. The whole concept seems to crumble just because I have to add the "fake" inputs to the discriminator to sort out the initial issue. I have reverted to the old version, which gives an input mismatch.
I was thinking about creating a simple layer that will decide which type of output to provide.
I don't think that design is going to work very well. As the documentation for deepchem/deepchem/models/gan.py Lines 242 to 244 in ab5fa05
If you violate that requirement, you're going to create all sorts of problems for yourself. Can't you do everything you need just by overriding |
@peastman I was looking for a way but do not see how. I have decided to go with subclassing and modifying the call method, as it keeps everything clean and easy to understand. When predicting a sample for the first time, I get this message: If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2. To change all layers to have dtype float64 by default, call
My apologies for the slow review turnaround! I think this is good to merge on my end
Once @peastman is done with his review, we can go ahead and merge this in
Wanted to check in on this PR. I plan to go ahead and merge tomorrow if there are no further comments.
Once you merge, I will make a few small tweaks and create a tutorial in JupyterLab.
@MiloszGrabski congrats on the merge! Major new feature added to DeepChem :). I'll be sure to give this a shout-out in the DeepChem 2.6 release notes (tentative for Mid May) |
@MiloszGrabski thank you for this library. Could you create a notebook covering all the options in molgan.py and test_molgan? Thank you!
BasicMolGAN model
Description
Final stage of the implementation of a basic version of MolGAN, originally created by Nicola De Cao and Thomas Kipf.
This is a generative graph convolution model based on the WGAN architecture that enables generation of small molecules.
SMILES strings are converted into model input by the MolGANFeaturizer class (tested on the MoleculeNet QM9 dataset).
Currently, training is somewhat unstable and sometimes requires a few separate training attempts to produce a valid model.
This is something I hope to fix in a future release.
A Jupyter notebook tutorial will be provided in a separate PR.
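To illustrate the kind of graph input a MolGAN-style model consumes, here is a minimal sketch that encodes a molecule as a padded adjacency matrix of bond types plus a padded vector of atomic numbers. It is not the MolGANFeaturizer implementation: the function `encode_graph`, the `BOND_CODES` mapping, and the padding convention are assumptions made for this example, and the molecule is written out by hand instead of being parsed from SMILES.

```python
from typing import List, Sequence, Tuple

# Assumed encoding for this sketch only: 0 means "no atom" or "no bond".
BOND_CODES = {"single": 1, "double": 2, "triple": 3}


def encode_graph(atomic_numbers: Sequence[int],
                 bonds: Sequence[Tuple[int, int, str]],
                 max_atoms: int) -> Tuple[List[List[int]], List[int]]:
    """Return (adjacency matrix, node vector), both padded to max_atoms."""
    n = len(atomic_numbers)
    if n > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    adjacency = [[0] * max_atoms for _ in range(max_atoms)]
    for i, j, kind in bonds:
        code = BOND_CODES[kind]
        adjacency[i][j] = code
        adjacency[j][i] = code  # bonds are undirected, so keep it symmetric
    nodes = list(atomic_numbers) + [0] * (max_atoms - n)
    return adjacency, nodes


# Ethanol heavy atoms (SMILES "CCO"): C-C-O joined by two single bonds.
adjacency, nodes = encode_graph(
    [6, 6, 8], [(0, 1, "single"), (1, 2, "single")], max_atoms=4)
print(nodes)
for row in adjacency:
    print(row)
```

Padding every molecule to the same `max_atoms` gives fixed-size inputs, which is what lets a generator emit a whole graph in one shot.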
Type of change
Please check the option that is related to your PR.
Checklist
yapf -i <modified file> and check no errors (yapf version must be 0.22.0)
mypy -p deepchem and check no errors
flake8 <modified file> --count and check no errors