Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maygen #811

Merged
merged 25 commits into from Feb 6, 2022
Merged

Maygen #811

merged 25 commits into from Feb 6, 2022

Conversation

johnmay
Copy link
Member

@johnmay johnmay commented Feb 1, 2022

Rebased - maygen PR.

@@ -2594,21 +2590,8 @@ public void structureGenerator(String localFormula) {
List<int[]> newDegrees = distributeHydrogens();

if (multiThread) {
try {
new ForkJoinPool(size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size is important, it can't be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain, I asked @MehmetAzizYirik. My understanding this code is not doing anything since the parellel stream will not use the forjoin pool

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay see here it really does work that way but is quite opaque/brittle - https://stackoverflow.com/questions/30802463/how-many-threads-are-spawned-in-parallelstream-in-java-8. "It should be noted that the Stream APIs use of ForkJoinPool is an implementation detail. Thus, both solutions work with Oracle’s current implementation but are not guaranteed to work everywhere."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why size and not -

ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());

Copy link
Member Author

@johnmay johnmay Feb 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And that is the default anyways for ForkJoinPool in ParallelStream which is cleaner IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnmay we had that conversation with Valentyn about the size parameter, but I believe this needs to be changed like you explained we can use that other option you recommended.

Also, this parallelization actually needs to be enhanced but further tasks not for now. Thus, we dont need to keep it with the ForkJoinPool but can change it later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain, I asked @MehmetAzizYirik. My understanding this code is not doing anything since the parellel stream will not use the forjoin pool

It is related for the bug MehmetAzizYirik/MAYGEN#48

In some configuration default value is 79. We need to control it not use default.

import java.util.logging.Level;
import java.util.logging.Logger;

public class MaygenCLI {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class may be removed. It is useless in the test folder.


@Test
public void detectAllenes() {
assertFalse(BoundaryConditions.detectAllenes(new int[][] {}, new String[] {}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests may be improved.

<dependency>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
<version>1.5.0</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this dependency.


@Test
public void testExpectedCountMultithreaded() throws CDKException, IOException, CloneNotSupportedException {
Maygen maygen = new Maygen(SilentChemObjectBuilder.getInstance());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to define SilentChemObjectBuilder.getInstance() in constructor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the structgen module is independent of implementation. Same reason it is provided to SmilesParser. Silent is faster than default which sends out notifications

Copy link
Contributor

@javadev javadev Feb 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be set parameter for class and default is SilentChemObjectBuilder.getInstance(). It is easier to create object with constructor without arguments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module does not and should not depend on data/silent. It’s just the way CDK is - I would not write it this way if we started again but “this is the way”

* this is only used for formula validating and normalizing a parser would
* be better. */
private static String replaceEach(String str, String[] from, String[] to) {
if (from.length != to.length)
Copy link
Contributor

@javadev javadev Feb 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be better to use other implementation with better error handling.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to not use this function at all. Using string munging to validate a formula is not nice. Adding in a third party library for a simple function - no

@johnmay
Copy link
Member Author

johnmay commented Feb 4, 2022

@MehmetAzizYirik Why is the D2 graph not supported in SMILES?

        if (writeSDF || printSDF) {
            IAtomContainer ac = buildContainer4SDF(mat);
            if (coordinates) new StructureDiagramGenerator().generateCoordinates(ac);
            sdfOut.write(ac);
        }
        if (writeSMILES || printSMILES) {
            smilesOut.write("Formula is not supported" + "\n");
        }

@javadev
Copy link
Contributor

javadev commented Feb 4, 2022

@MehmetAzizYirik Why is the D2 graph not supported in SMILES?

        if (writeSDF || printSDF) {
            IAtomContainer ac = buildContainer4SDF(mat);
            if (coordinates) new StructureDiagramGenerator().generateCoordinates(ac);
            sdfOut.write(ac);
        }
        if (writeSMILES || printSMILES) {
            smilesOut.write("Formula is not supported" + "\n");
        }

It caused an error as I remember.

@johnmay
Copy link
Member Author

johnmay commented Feb 4, 2022

I am reading this correct that the way the fuzzy formulae are handled is by enumerating non-fuzzy formula? Hence why there seems to be some vestigial stats displayed to TSV.

@MehmetAzizYirik
Copy link
Contributor

@MehmetAzizYirik Why is the D2 graph not supported in SMILES?

        if (writeSDF || printSDF) {
            IAtomContainer ac = buildContainer4SDF(mat);
            if (coordinates) new StructureDiagramGenerator().generateCoordinates(ac);
            sdfOut.write(ac);
        }
        if (writeSMILES || printSMILES) {
            smilesOut.write("Formula is not supported" + "\n");
        }

@johnmay As far as I remember, one of our colleagues recommended not to print SMILES for D2 graphs but I do not remember now the chemical reason she explained about the SMILES of D2 graphs. I am not sure whether there was an error or not.

@MehmetAzizYirik
Copy link
Contributor

I am reading this correct that the way the fuzzy formulae are handled is by enumerating non-fuzzy formula? Hence why there seems to be some vestigial stats displayed to TSV.

@johnmay Yes, we needed the fuzzy formulae for another project, then I coded it like generating al possible standard formulae for given fuzzy formula. Can you please explain the thing about the TSV ? I did not get it.

@johnmay
Copy link
Member Author

johnmay commented Feb 4, 2022

Just in the process of refactoring the IO - to me it looks like the TSVOUTPUT mode was probably just for the manuscript. I've left it as it is for now.

Is a D2 graph just one with all atoms degree 2 right? So in SMILES

C1CC1
C1CCC1
C1CCCC1
C1CCCCC1
C1CCCCCC1

but of course you could have other elements.

@MehmetAzizYirik
Copy link
Contributor

Just in the process of refactoring the IO - to me it looks like the TSVOUTPUT mode was probably just for the manuscript. I've left it as it is for now.

Is a D2 graph just one with all atoms degree 2 right? So in SMILES

C1CC1 C1CCC1 C1CCCC1 C1CCCCC1 C1CCCCCC1

but of course you could have other elements.

@johnmay Just to be sure, I went through the code. Yes the SMILES could be given as string array as what we do in OnSm matrices. But I remember now the comment: "it is just a long list of S or O, and chemists are not interested in these molecules". I left it as how it is when I submitted the paper but in the future later we can change it like what we do with OnSm matrices. Same string array.

Also TSV, I remember for benchmarking while writing the paper, Chris recommended it would be better to have TSV option and he added it. That is all about TSV.

@johnmay
Copy link
Member Author

johnmay commented Feb 4, 2022

Just in the process of refactoring the IO - to me it looks like the TSVOUTPUT mode was probably just for the manuscript. I've left it as it is for now.
Is a D2 graph just one with all atoms degree 2 right? So in SMILES
C1CC1 C1CCC1 C1CCCC1 C1CCCCC1 C1CCCCCC1
but of course you could have other elements.

@johnmay Just to be sure, I went through the code. Yes the SMILES could be given as string array as what we do in OnSm matrices. But I remember now the comment: "it is just a long list of S or O, and chemists are not interested in these molecules". I left it as how it is when I submitted the paper but in the future later we can change it like what we do with OnSm matrices. Same string array.

Also TSV, I remember for benchmarking while writing the paper, Chris recommended it would be better to have TSV option and he added it. That is all about TSV.

Okay… but then why is it useful in SDF?

@MehmetAzizYirik
Copy link
Contributor

Just in the process of refactoring the IO - to me it looks like the TSVOUTPUT mode was probably just for the manuscript. I've left it as it is for now.
Is a D2 graph just one with all atoms degree 2 right? So in SMILES
C1CC1 C1CCC1 C1CCCC1 C1CCCCC1 C1CCCCCC1
but of course you could have other elements.

@johnmay Just to be sure, I went through the code. Yes the SMILES could be given as string array as what we do in OnSm matrices. But I remember now the comment: "it is just a long list of S or O, and chemists are not interested in these molecules". I left it as how it is when I submitted the paper but in the future later we can change it like what we do with OnSm matrices. Same string array.
Also TSV, I remember for benchmarking while writing the paper, Chris recommended it would be better to have TSV option and he added it. That is all about TSV.

Okay… but then why is it useful in SDF?

@johnmay There is no TSV usage in SDF, I checked the code but I think I did not understood what you meant. Can you please explain ?

@johnmay
Copy link
Member Author

johnmay commented Feb 4, 2022

That was unrelated to the TSV comment sorry. What I'm trying to understand is

if (onlyDegree2) {
  if (oxygen == 0 || sulfur == 0) {
      degree2graph(); // when is this called, why does it only output to SDF and not SMILES?
  } else {
      distributeSulfurOxygen(normalizedLocalFormula);
  }
  displayStatistic(startTime, normalizedLocalFormula);
}

I presume this is only for custom valences since those are the only valence=2 default elements (well S should really be 2,4,6 but anyways). So my thoughts are something like this:
``
-f C(val=2)5 -smi -setElements


but that doesn't give me anything for neither SDF or SMILES.

@johnmay
Copy link
Member Author

johnmay commented Feb 5, 2022

Remaining code smells would need a lot of work (i.e. wrapping up methods with larger number of params in a state object). One I can't work out is where the duplicated block is: https://sonarcloud.io/project/issues?id=cdk_cdk&pullRequest=811&resolved=false&rules=common-java%3ADuplicatedBlocks&types=CODE_SMELL

@javadev
Copy link
Contributor

javadev commented Feb 5, 2022

It looks good except CLI class in test package and commons-cli dependency. They may be removed.

@johnmay
Copy link
Member Author

johnmay commented Feb 5, 2022

That’s fine for now - intention is these will go back into the maygen project repo (preferred) or we can add an assembly descriptor. TBD

@MehmetAzizYirik
Copy link
Contributor

That’s fine for now - intention is these will go back into the maygen project repo (preferred) or we can add an assembly descriptor. TBD

@johnmay we can just do whatever you prefer like you explained above.

I am working on the test changes now and will do the pull request to the maygen branch.

And about parallelization, yes it needs changes like we talked but for now it can stay as how it is, I guess.

@MehmetAzizYirik
Copy link
Contributor

Okay @egonw this is now more or less good to merge, I am going to have a look through SonarCloud and maybe fix any obvious smells - logging for example.

Not having to specify fuzzy formula and generally improved parsing and more decoupling would be good (main class is still 4000+ lines) would be good but I think it's time to get this merged. Oh also to fix the issue with ForkJoin I would propose making the multithreaded option an integer - e.g. on the command line "-j 8" or "-t 8".

@MehmetAzizYirik I can add a maven assembly descriptor to the POM so we can build the command line but I think a better option is once this is in a release you can rebase the MAYGEN repo to just contain the command line parts?

@johnmay yes john, as you recommended, I will do the changes after the release.

@javadev
Copy link
Contributor

javadev commented Feb 5, 2022

These changes ruins our logic btw.

We need to support smiles and sdf in the same time, not just sdf or smi.

It will be better to restore old logic.

IMG_20220205_230056

@MehmetAzizYirik
Copy link
Contributor

These changes ruins our logic btw.

We need to support smiles and sdf in the same time, not just sdf or smi.

It will be better to restore old logic.

IMG_20220205_230056

@javadev if @johnmay thinks it is better not to return the both at the same time, then we keep it as how he changed the code.

@MehmetAzizYirik
Copy link
Contributor

..... final thing to do is fix the tests writing data to disk.

@johnmay I checked now the structgen package and run the mvn test but no data was written to disk I checked. Did I misunderstand what you meant ?

@johnmay
Copy link
Member Author

johnmay commented Feb 5, 2022

@MehmetAzizYirik no I did it all in the end. If you need SDF and SMILES you just write a new consumer that does both. I didn’t do this because on the command line tool it looked like smiles output would only activate if sdf wasn’t specified.

i’m very worn down with this PR now and still not 100% happy but it’s good enough now.

@javadev
Copy link
Contributor

javadev commented Feb 5, 2022

the command line tool it looked like smiles output would only activate if sdf wasn’t specified.

Support both formats at the same time is important.

@johnmay
Copy link
Member Author

johnmay commented Feb 5, 2022

You can literally convert one in to the other so no it is really NOT important and a stupid design. But yea it’s still possible to do as I explained.

@johnmay
Copy link
Member Author

johnmay commented Feb 5, 2022

I’m turning off notifications on this now - @egonw will take over.

@MehmetAzizYirik
Copy link
Contributor

I’m turning off notifications on this now - @egonw will take over.

Thanks a lot for this detailed and helpful PR.

@javadev
Copy link
Contributor

javadev commented Feb 5, 2022

You can literally convert one in to the other

Sdf also supports coordinates.

I think the console application will behave unexpectedly in case you will use both keys sdf and smiles.

@MehmetAzizYirik
Copy link
Contributor

You can literally convert one in to the other

Sdf also supports coordinates.

I think the console application will behave unexpectedly in case you will use both keys sdf and smiles.

@javadev as I wrote above, if John changed it, then done.

I did all my changes before the integration pull request, and after that, I left the code to CDK developers. It is a great pleasure for me if they do any changes on it and no need of any other changes after them.

@javadev
Copy link
Contributor

javadev commented Feb 6, 2022

I added support for the sdf and smi generator #818

@MehmetAzizYirik
Copy link
Contributor

MehmetAzizYirik commented Feb 6, 2022

I added support for the sdf and smi generator #818

@javadev I as the main developer of MAYGEN , I dont want any other changes from someone else while CDK developers are working on it. And I prefer to keep it as how John changed! As it was your task in MAYGEN to clean code or format, if there is a need, we will let you know but besides that please do not resist to the changes of CDK developers, especially you are not in the position to criticize cheminformatics related work. I left the code to them after my changes and it is a pleasure if they enhance the code.

Added support for the smiles and sdf generator.
@javadev
Copy link
Contributor

javadev commented Feb 6, 2022

Hi @MehmetAzizYirik don't worry. These changes were mentioned by @johnmay in comments.

Changes have been approved btw.

@sonarcloud
Copy link

sonarcloud bot commented Feb 6, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 22 Code Smells

0.0% 0.0% Coverage
0.5% 0.5% Duplication

@egonw egonw merged commit f9ecd05 into master Feb 6, 2022
@egonw
Copy link
Member

egonw commented Feb 6, 2022

All, thank you very much for all your hard work, peer review, and positiveness. Very much appreciated!

@MehmetAzizYirik
Copy link
Contributor

All, thank you very much for all your hard work, peer review, and positiveness. Very much appreciated!

@egonw Thanks to John and you for the guidance, this PR and all the changes you did. So happy that MAYGEN is integrated to CDK as planned in my PhD.

@johnmay johnmay deleted the maygen branch February 9, 2022 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants