Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with hose code nesting #828

Merged
merged 14 commits into from Sep 12, 2022
Merged

Fix issue with hose code nesting #828

merged 14 commits into from Sep 12, 2022

Conversation

johnmay
Copy link
Member

@johnmay johnmay commented Feb 8, 2022

Fixes an issue with Hose Code nesting via #588.
We can actually see from another test that the output was wrong since it didn't make sense.

example:

image
image
image

So

C,
C,C,C
CC,CC,CO

but we had

C,
C,C,C
CC,CC,C,O

It's logically inconsistent to have 4 in the second sphere and 3 in the first.

@johnmay
Copy link
Member Author

johnmay commented Feb 8, 2022

I may still rewrite this code as it appears to be doing a lot more than it needs to. In my mind it should just be a simple breath-first order traversal using a comparator but it's going around the houses making temporary objects and comparison keys (maybe this was pre-generics).

@johnmay
Copy link
Member Author

johnmay commented Feb 8, 2022

Doh, there is a QSAR model trained on the broken hose codes :-)

@johnmay
Copy link
Member Author

johnmay commented Feb 8, 2022

Cl-1;C(*C*C/H,H,*C,*C/H,H,*C,*&)H*&////// 753.1 (was)
vs
Cl-1;C(*C*C/H*C,H*C/H*C,H*&)H*&////// 753.1 (now)

Again completely wrong since *C*C/H,H,*C,*C doesn't make sense. Now do I write a hose code validator...

@johnmay
Copy link
Member Author

johnmay commented Feb 8, 2022

@egonw I can go through and mark the incorrect HOSE codes but I think it's impossible to fix completely unless we know how these DB's were generated - for which there are hints of NIST but not much more.

@egonw
Copy link
Member

egonw commented Feb 9, 2022

@steinbeck, can you have a look at this? I know the idea, but never actively used HOSE codes... or know someone who is willing to check this patch?

@johnmay
Copy link
Member Author

johnmay commented Feb 9, 2022

One option to have both options is we can have an option to generate old (incorrect) CDK hose codes and the fixed ones. There is a strong overlap between them so mostly they are compatible. The two QSAR descriptors are most problematic since they use many layers.

Then existing usages can use the old ones if needed. The tricky part here is a lot of the usages are very old - circa 2004 - and probably bit-rotted quite heavily all-ready. In which case I would favour just deprecating and moving things to cdk-legacy.

@sonarcloud
Copy link

sonarcloud bot commented Feb 9, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

0.0% 0.0% Coverage
0.0% 0.0% Duplication

@egonw
Copy link
Member

egonw commented Feb 9, 2022

Maybe not a bad idea (both options), but ideally the descriptors are recalculated indeed. We have a cdk.cite javadoc tag which was meant to allow proper documents, but we were all learning, I guess. Which two were these again? Maybe from Miquel?

@johnmay
Copy link
Member Author

johnmay commented Feb 9, 2022

Yep - there isn't much info "Weka(J48)" (a classifier) and "NIST"

@egonw
Copy link
Member

egonw commented Feb 14, 2022

@johnmay, this PR is now conflicting too :/

@johnmay
Copy link
Member Author

johnmay commented Feb 14, 2022

Yes need to rebase - it's fine need some more thought anyways

@egonw
Copy link
Member

egonw commented May 16, 2022

@johnmay, got an update?

@javadev
Copy link
Contributor

javadev commented Jun 30, 2022

This branch has conflicts that must be resolved

@johnmay johnmay marked this pull request as draft June 30, 2022 09:33
@johnmay
Copy link
Member Author

johnmay commented Jun 30, 2022

Marked as draft, these changes would break other parts (models trained on the wrong values) so it is difficult to merge in for now.

@johnmay johnmay changed the base branch from master to main September 12, 2022 07:45
…tor chaining here is a little harder to read so we do it the old-fassioned way. Another test need updating but we can see the expected output is clear wrong. For example "C-3;*C*C*C(*C*C,*C,*CC,O/..." is wrong since we have three nodes in the first sphere C,C,C then 4 in the second *C*C,*C,*CC,O. This is now correctly encoded as *C*C,*CO,*CC.
…o be a hack to try and fix the nesting issue and can't work since the parents may be idential. Now we have a proper comparator we can sory in the correct places.
…is descriptor unless we can regenerate the DBs."

This reverts commit 4525a9a.
@johnmay johnmay marked this pull request as ready for review September 12, 2022 12:19
@johnmay
Copy link
Member Author

johnmay commented Sep 12, 2022

A new constructor parameter (flags) has been added, all existing usages use new HOSECodeGen(LEGACY_MODE); to keep compatible behaviour.

This was referenced Sep 12, 2022
@sonarcloud
Copy link

sonarcloud bot commented Sep 12, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 5 Code Smells

26.9% 26.9% Coverage
9.6% 9.6% Duplication

@johnmay
Copy link
Member Author

johnmay commented Sep 12, 2022

I'm OK with the 5 code smells.

@egonw egonw self-requested a review September 12, 2022 15:20
Copy link
Member

@egonw egonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. With this low level changes, the impact can be big, and I am not sure if all use cases still work, but at least some are that have tests too, and since there are no regressions, I am happy to apply.

@johnmay johnmay merged commit 40b552e into main Sep 12, 2022
@johnmay johnmay deleted the patch/hose-codes branch September 12, 2022 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants