
Basic MolGAN model #2426

Merged
merged 22 commits into from Apr 21, 2021

Conversation

MiloszGrabski
Contributor

@MiloszGrabski MiloszGrabski commented Mar 6, 2021

BasicMolGAN model

Description

Final stage of the implementation of the basic version of MolGAN, originally created by Nicola De Cao and Thomas Kipf.
This is a generative graph convolutional model based on the WGAN architecture that enables the generation of small molecules.
SMILES strings are converted into model input by the MolGANFeaturizer class (tested on the MoleculeNet QM9 dataset).
Currently, training is somewhat unstable and sometimes requires a few separate training attempts to produce a valid model.
This is something I hope to fix in a future release.
A Jupyter notebook tutorial will be provided in a separate PR.
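For orientation, a minimal usage sketch. The class names follow this PR's description and title; the constructor arguments, the iterbatches batch generator, and the defeaturize call are illustrative assumptions, not the final API:

import deepchem as dc
from deepchem.feat import MolGANFeaturizer   # name per the description above; exact export may differ
from deepchem.models import BasicMolGAN      # name per the PR title; exact export may differ

featurizer = MolGANFeaturizer()
graphs = featurizer.featurize(["CCO", "CC(=O)O", "CCN"])  # SMILES -> graph inputs

gan = BasicMolGAN(learning_rate=0.001)       # illustrative arguments
gan.fit_gan(iterbatches(graphs))             # iterbatches: hypothetical batch generator
samples = gan.predict_gan_generator(1000)    # sample candidate graphs
molecules = featurizer.defeaturize(samples)  # assumed inverse of featurize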

Type of change

Please check the option related to your PR.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • In this case, we recommend discussing your modification in a GitHub issue before creating the PR
  • Documentation (modification of documents)

Checklist

  • My code follows the style guidelines of this project
    • Run yapf -i <modified file> and check no errors (yapf version must be 0.22.0)
    • Run mypy -p deepchem and check no errors
    • Run flake8 <modified file> --count and check no errors
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New unit tests pass locally with my changes
  • I have checked my code and corrected any misspellings

@codecov-io

codecov-io commented Mar 6, 2021

Codecov Report

Merging #2426 (4fdd415) into master (ab5fa05) will increase coverage by 0.09%.
The diff coverage is 98.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2426      +/-   ##
==========================================
+ Coverage   85.65%   85.75%   +0.09%     
==========================================
  Files         307      309       +2     
  Lines       27196    27353     +157     
==========================================
+ Hits        23295    23456     +161     
+ Misses       3901     3897       -4     
Impacted Files Coverage Δ
deepchem/models/tests/test_molgan_model.py 98.52% <98.52%> (ø)
deepchem/models/molgan.py 98.86% <98.86%> (ø)
deepchem/models/__init__.py 94.28% <100.00%> (+0.16%) ⬆️
deepchem/models/gan.py 87.36% <0.00%> (+1.05%) ⬆️
...hem/feat/molecule_featurizers/molgan_featurizer.py 84.61% <0.00%> (+4.39%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ab5fa05...4fdd415.

@MiloszGrabski
Contributor Author

The failing test is not connected with my code:
===End Flaky Test Report===
=========================== short test summary info ===========================
FAILED deepchem/splits/tests/test_scaffold_splitter.py::TestScaffoldSplitter::test_scaffolds
= 1 failed, 688 passed, 6 skipped, 35 deselected, 695 warnings in 2406.46s (0:40:06) =

Member

@rbharath rbharath left a comment

Did a first round of review. Looks good overall! I've made a few comments, mainly about the docs, below.

@peastman Could you do a review as well? This model would be the first molecular GAN in the library, and we might want to make sure the API is one we like.

deepchem/models/molgan.py

Returns
-------
keras.Model
Member

I'm not sure if this will render correctly in the docs. Could you try building the docs locally and seeing if this looks good?

Contributor Author

Do you mean running doctest?

Contributor Author

Hi, I tried running Sphinx but I'm struggling to get it working.
Should I just remove the Returns section?

@peastman
Contributor

Is it possible to add a test for correct operation? The current tests just check that certain parameters were recorded correctly. They don't tell you whether the model works. For example, can you train a small model, have it predict some molecules, and verify that they're valid?

@MiloszGrabski
Contributor Author

The problem is that it does not always train a model that can generate valid molecules.
I'm not sure how to accommodate that fact in a test.

@peastman
Contributor

Here is an example of a GAN test case:

def test_mix_gan():
  """Test a GAN with multiple generators and discriminators."""
  gan = ExampleGAN(n_generators=2, n_discriminators=2, learning_rate=0.01)
  gan.fit_gan(
      generate_data(gan, 1000, 100), generator_steps=0.5, checkpoint_interval=0)
  # See if it has done a plausible job of learning the distribution.
  means = 10 * np.random.random([1000, 1])
  for i in range(2):
    values = gan.predict_gan_generator(
        conditional_inputs=[means], generator_index=i)
    deltas = values - means
    assert abs(np.mean(deltas)) < 1.0
    assert np.std(deltas) > 1.0

It trains a GAN based on a very simple data distribution, and checks whether the generated data is reasonably close. Can you do something similar for this class? Train it on a very simple dataset, and see whether it has learned the basic properties of the training data?
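(For the validity check itself, a small RDKit helper along these lines would do; generated_smiles stands in for whatever list of SMILES the model produces:)

from rdkit import Chem

def fraction_valid(generated_smiles):
  """Fraction of SMILES strings that RDKit can parse into a molecule."""
  valid = sum(1 for s in generated_smiles if Chem.MolFromSmiles(s) is not None)
  return valid / max(len(generated_smiles), 1)

# In a test: assert fraction_valid(generated_smiles) > 0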

@MiloszGrabski
Contributor Author

I am not sure, as it generates molecules but does not always learn to do so.
However, I have not tried it on a very small set of compounds.
I will check whether this might work.

@MiloszGrabski
Contributor Author

MiloszGrabski commented Mar 21, 2021

I have run pytest on test_molgan_model and it worked fine, so I'm not sure what the issue is here.
The only issue I see is this: [image]
But I'm not sure why CI is picking it up.

@MiloszGrabski
Contributor Author

OK, so there seems to be a corner case where yapf and flake8 conflict.
Any ideas for a workaround?

@peastman
Contributor

OK, so there seems to be a corner case where yapf and flake8 conflict.

How do you mean? It's just complaining about the indentation of one line.

On the other hand, there seems to be an actual test failure on Windows (but not Linux):

FAILED deepchem/models/tests/test_molgan_model.py::test_molgan_model::test_training
= 1 failed, 690 passed, 6 skipped, 35 deselected, 696 warnings in 2863.04s (0:47:43) =
Exception ignored in: <function Model.__del__ at 0x000001BFDC1B34C8>
Traceback (most recent call last):
  File "D:\a\deepchem\deepchem\deepchem\models\models.py", line 61, in __del__
    shutil.rmtree(self.model_dir)
  File "c:\miniconda3\envs\deepchem\lib\shutil.py", line 516, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "c:\miniconda3\envs\deepchem\lib\shutil.py", line 404, in _rmtree_unsafe
    onerror(os.rmdir, path, sys.exc_info())
  File "c:\miniconda3\envs\deepchem\lib\shutil.py", line 402, in _rmtree_unsafe
    os.rmdir(path)
OSError: [WinError 145] The directory is not empty: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmpkauecabw'
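(Illustrative only, not this PR's fix: a retrying cleanup along these lines sidesteps that WinError 145 race, which happens when Windows briefly holds handles inside a directory being deleted:)

import shutil

def safe_rmtree(path):
  # Retry with ignore_errors: transient handles from antivirus/indexing usually clear quickly.
  try:
    shutil.rmtree(path)
  except OSError:
    shutil.rmtree(path, ignore_errors=True)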

@peastman
Contributor

The new test case looks good. Just out of curiosity, could you post a few examples of molecules it generates? I'm just wondering what sorts of molecules it produces after training on that small number of examples. Has it memorized members of the training set? Or is it just outputting strings of 'C's?

@MiloszGrabski
Contributor Author

MiloszGrabski commented Mar 25, 2021

Actually, I was surprised to see that it did not recreate a single molecule from the original set.
Some of them were odd-looking, some of them were fine.
There has been no case so far that was just C's.
In terms of indentation, this is precisely the case where flake8 wants different indentation than yapf (I searched on Google and I am not the only one experiencing this issue).
So when I fixed it for flake8, yapf tripped over it. I will try to rewrite the code somewhat. Maybe it is worth considering switching the CI to run flake8 and yapf separately; I do not have much experience with automated code checking, so I'm not sure if this is a good idea.
I will post some molecules once I have a moment to run Jupyter.

@MiloszGrabski
Contributor Author

MiloszGrabski commented Mar 26, 2021

I'm not sure what the issue is here; I ran pytest on my machine and it passed:
pytest -v -m "not slow" D:\deepchem\deepchem\models\tests\test_molgan_model.py
[image]

FYI: I ran the test several times to be sure; all of them passed.

@MiloszGrabski
Contributor Author

Example model outputs (sample of 1000 generated each time):

The model was trained in a loop of 10 attempts, with each iteration creating a new model.
The loop breaks when valid molecules are generated, so the first successful model is retained.
Luckily, for these attempts a valid model was generated on the first attempt each time.
As you can see, it has quite low validity/uniqueness (based on SMILES) scores.
The authors solved this by implementing a few things which I would like to add in a future release.
One of them, which seems the most straightforward, is guiding via model checking after each training session:
if the model performs worse, go back to the previous checkpoint; otherwise, keep the current one.
The infrastructure still needs a lot of work to be useful, but the base seems solid enough.

What I also noticed is that with more compounds a low number of iterations is favourable (8-10), whereas for a small number (as in the test set) you need a much larger number (1000-5000).
I suppose this makes sense, as with large sets you get much more exposure per training loop.
When I plotted the losses, around 1000 iterations was generally enough for them to stabilize.
But the loss itself cannot be used as a measure of model performance, because even with small losses the model
can generate a lot of invalid molecules, i.e. I have observed models with worse losses that provided better metrics.
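A rough sketch of that checkpoint-guiding loop (fraction_valid and sample_smiles are hypothetical helpers; save_checkpoint/restore are the standard DeepChem model methods, and iterbatches is a user-written batch generator):

n_sessions = 10
best_score = 0.0
for session in range(n_sessions):
  gan.fit_gan(iterbatches(graphs), checkpoint_interval=0)  # one training session
  score = fraction_valid(sample_smiles(gan, 1000))         # hypothetical metric helper
  if score >= best_score:
    gan.save_checkpoint()  # keep the improved model
    best_score = score
  else:
    gan.restore()          # roll back to the last good checkpoint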

Model 1, 804 valid, 6 unique, first training attempt successful:

[image]

Model 2, 60 valid, 10 unique, first training attempt successful:

[image]

Model 3, 13 valid, 7 unique, first training attempt successful:

[image]

Model 4, 17 valid, 3 unique, first training attempt successful:

[image]

@peastman
Contributor

Those actually aren't too bad. Each model produces a decent amount of diversity. It does seem to have a tendency to produce the same molecules over and over, but that's not surprising after overfitting to a tiny dataset.

Luckily, for these attempts a valid model was generated on the first attempt each time.

If it usually produces valid molecules on the first attempt, maybe we should reduce the number of iterations? Letting it fail nine times before succeeding could cause the test not to notice if a future change makes the code stop working as well.

@rbharath
Member

These don't look bad at all! I'd be curious about how these compare to molecules generated by the normalizing flows. CC @ncfrey who might have some insight there

@ncfrey
Contributor

ncfrey commented Mar 30, 2021

Looks really cool @MiloszGrabski! The validity and uniqueness scores are pretty low, but that might not matter too much. About how long does it take to generate 1000 samples?

@MiloszGrabski
Contributor Author

MiloszGrabski commented Apr 1, 2021

@ncfrey It is almost instantaneous, I would say 1 second maybe; I never paid it much attention because it was so fast.
That is why in a future version I want to implement a guiding system, in the form of selecting which model progresses to the next epoch.
This will require sample generation and applying a metric after each epoch, which would be prohibitive if it were not fast enough.
@peastman I have thought about what you said and will modify the code slightly; let me know what you think of the change.
I have reduced the number of attempts to 6, and to speed up the process I have introduced a break: given that we only care whether it manages to pass once (at least for now), there is no point in continuing training once it has already produced a valid model.
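(The shape of that test loop, sketched with hypothetical helpers build_and_train and sample_smiles, reusing the fraction_valid helper sketched earlier:)

for attempt in range(6):  # reduced from 10 to 6
  gan = build_and_train()  # fresh model + one training run
  if fraction_valid(sample_smiles(gan, 1000)) > 0:
    break                  # a valid model was produced; stop retrying
else:
  raise AssertionError("no attempt produced a valid model")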

@MiloszGrabski
Contributor Author

I don't know what is going on here; it works fine on my machine. Any ideas?
[image]
The unit test also worked just fine.
From the error it looks like an input mismatch, but I have no idea why it is happening.
[image]
I would welcome any suggestions as to why this might be.

@MiloszGrabski
Contributor Author

MiloszGrabski commented Apr 1, 2021

@ncfrey @peastman @rbharath
I think I know what the issue is, albeit I do not understand why it works on my machine but not in CI.
My generator output is [a, b, c, d].
My discriminator input is [a, b].
Normally the additional outputs are just ignored, but in the test environment it seems to behave differently. Any clue?

The worst part is that I am unable to reproduce the failure on my machine, so every time I have to wait 40 minutes to find out whether a fix works.

@rbharath
Member

rbharath commented Apr 6, 2021

One off-hand thought: perhaps it's an issue with different randomness? Setting the random seed to a fixed value might help here.
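(For example, fixing both sources of randomness that typically differ between machines:)

import numpy as np
import tensorflow as tf

np.random.seed(123)      # NumPy-side randomness (e.g. noise batches)
tf.random.set_seed(123)  # TensorFlow-side randomness (e.g. weight initialization)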

@MiloszGrabski
Contributor Author

MiloszGrabski commented Apr 6, 2021

The current error seems to be connected with the GAN itself:
Windows test, line 847.
@peastman, any clue?

    # Train the discriminator.
    inputs = [self.get_noise_batch(self.batch_size)]
    for input in self.data_input_layers:
>     inputs.append(feed_dict[input.ref()])
E     KeyError: <Reference wrapping <KerasTensor: shape=(None, 9, 9) dtype=float32 (created by layer 'input_4')>>

deepchem\models\gan.py:348: KeyError

OK, I get it: it is because there are no inputs 3 and 4 in the real data. The whole concept crumbles just because I have to add the "fake" inputs to the discriminator to sort out the initial issue.
The problem is that the generator outputs differ between normal training and sample generation. The generator therefore has 4 outputs: the first 2 for training, the last 2 for generation of real samples. Because of the initial issue (which does not occur on my machine but only in CI), I added fake (unused) inputs to the discriminator, but that means I now also have to create fake data in the dataset. It would be best to be able to specify in the generator which sort of output to produce (training or sample generation); that would simplify everything. Any clue on how to do that, given that a simple "if" does not work here?
[image]

I have reverted back to the old version, which is giving the input mismatch.
There are two options: either I rewrite the custom training and sample-generation loops from WGAN, or we could change
if len(inputs) != len(input_spec) to if len(inputs) < len(input_spec). The latter would be simpler and should still work as intended, given that excess inputs will be ignored.
This is the current process:
[image]
I do not see how I can modify either the generator or the discriminator to recognize whether this is the training or the sample-generation version of the outputs.

@MiloszGrabski
Contributor Author

MiloszGrabski commented Apr 7, 2021

I was thinking about creating a simple layer that decides which type of output to provide,
but I am struggling to find a good method for passing a signal into the layers,
something similar to the training flag. Any ideas?
Is there a simpler way than subclassing keras.Model and modifying the call method to accept an additional option parameter?
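(A sketch of that subclassing idea; the name SimpleMolGANGenerator matches the layer mentioned later in this thread, and core is a hypothetical submodel producing (edges_logits, nodes_logits):)

import tensorflow as tf
from tensorflow import keras

class SimpleMolGANGenerator(keras.Model):

  def __init__(self, core, **kwargs):
    super().__init__(**kwargs)
    self.core = core  # hypothetical submodel producing (edges_logits, nodes_logits)

  def call(self, inputs, training=False):
    edges_logits, nodes_logits = self.core(inputs)
    if training:
      # soft, differentiable outputs for the discriminator
      return tf.nn.softmax(edges_logits), tf.nn.softmax(nodes_logits)
    # hard, discrete outputs for molecule generation
    return (tf.argmax(edges_logits, axis=-1),
            tf.argmax(nodes_logits, axis=-1))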

@peastman
Contributor

peastman commented Apr 7, 2021

I don't think that design is going to work very well. As the documentation for create_generator() says,

...by any conditional inputs. The number and shapes of its outputs must match
the return value from get_data_input_shapes(), since generated data must
have the same form as training data.

If you violate that requirement, you're going to create all sorts of problems for yourself.

Can't you do everything you need just by overriding predict_gan_generator()? Given the softmax outputs, it looks like it ought to be easy to calculate the argmax ones, and the only time you need them is when you're predicting molecules.
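(Roughly, an override along these lines, assuming the base class returns the soft softmax outputs as NumPy arrays; the base class and return type are assumptions about deepchem/models/gan.py:)

import numpy as np

from deepchem.models.gan import GAN  # the PR's model builds on WGAN from this module

class BasicMolGAN(GAN):  # illustrative subclass, not the PR's final code

  def predict_gan_generator(self, *args, **kwargs):
    # Convert the soft outputs to discrete indices only at prediction time.
    probs = super().predict_gan_generator(*args, **kwargs)
    return [np.argmax(p, axis=-1) for p in probs]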

@MiloszGrabski
Contributor Author

MiloszGrabski commented Apr 8, 2021

@peastman I was looking for a way but I do not see how.
The problem is that the real data goes through one-hot encoding, while the fake-data logits go through softmax for the discriminator, or Gumbel-softmax and then argmax for sample generation.
I cannot simply return logits, as the discriminator would then have to handle the conversion and differentiate between real and fake data.
The only option would be to reverse the softmax in order to recover logits in predict_gan_generator(). Do you know any built-in method to do so?
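(For what it's worth, taking the log of the probabilities yields usable pseudo-logits: the true logits differ from log(probs) only by a per-row additive constant, which neither argmax nor Gumbel-max sampling can see:)

import tensorflow as tf

probs = tf.constant([[0.7, 0.2, 0.1]])
pseudo_logits = tf.math.log(probs)  # softmax(log p) == p, so these behave as logits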

I have decided to go with subclassing and modifying the call method, as it makes everything clean and easy to understand.
It works quite well and should eliminate the issue.
I am cleaning up the code now.
These are samples after training (on the same set as before):
[image]

While predicting samples, I get this message the first time:
"WARNING:tensorflow:Layer SimpleMolGANGenerator is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because its dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call tf.keras.backend.set_floatx('float64'). To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor."
This comes from the NumPy random noise; what is the standard in DeepChem, float64 or float32?
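(NumPy draws float64 by default, while TensorFlow's default floatx is float32; casting the noise batch silences the warning. The shape below is illustrative:)

import numpy as np

noise = np.random.normal(size=(32, 100)).astype(np.float32)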

@MiloszGrabski
Contributor Author

MiloszGrabski commented Apr 8, 2021

@rbharath @peastman
It seems these failures are not connected with my PR:
FAILED deepchem/feat/tests/test_mol2vec_fingerprint.py::TestMol2VecFingerprint::test_mol2vec_fingerprint
FAILED deepchem/models/tests/test_mpnn.py::test_mpnn_classification - assert ...

Member

@rbharath rbharath left a comment

My apologies for the slow review turnaround! I think this is good to merge on my end.

Once @peastman is done with his review, we can go ahead and merge this in.


@rbharath
Member

Wanted to check in on this PR. I plan to go ahead and merge tomorrow if there are no further comments.

@MiloszGrabski
Contributor Author

Once you merge, I will make a few small tweaks and create a tutorial in JupyterLab.

@rbharath rbharath merged commit 2b136d2 into deepchem:master Apr 21, 2021
@rbharath
Member

@MiloszGrabski congrats on the merge! Major new feature added to DeepChem :). I'll be sure to give this a shout-out in the DeepChem 2.6 release notes (tentatively mid-May).

@atreyamaj atreyamaj mentioned this pull request Apr 21, 2021
@PARODBE

PARODBE commented Apr 21, 2021

@MiloszGrabski thank you for this contribution; could you create a notebook covering all the options in molgan.py and test_molgan?

Thank you!
