Fix issue with hose code nesting #828
I may still rewrite this code, as it appears to be doing a lot more than it needs to. In my mind it should just be a simple breadth-first traversal using a comparator, but it's going around the houses making temporary objects and comparison keys (maybe this was pre-generics).
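As a rough illustration of the "breadth-first traversal using a comparator" idea, here is a minimal sketch. It is not the CDK implementation: the class `SphereBfs`, the adjacency-array graph, and the alphabetical ordering are all assumptions made for the example, and ring closures and bond orders are ignored. The point it shows is that the children of each parent are sorted within that parent, so the nesting is preserved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of CDK.
final class SphereBfs {

    /** Returns atom indices sphere by sphere, children grouped under their parent. */
    static List<List<Integer>> spheres(int[][] adj, String[] symbols, int root, int maxSpheres) {
        // Placeholder ordering (assumption): alphabetical by symbol, ties broken by index.
        // A real HOSE ranking uses element and bond priorities instead.
        Comparator<Integer> order = Comparator.comparing((Integer a) -> symbols[a])
                                              .thenComparingInt(a -> a);
        boolean[] visited = new boolean[adj.length];
        visited[root] = true;
        List<List<Integer>> result = new ArrayList<>();
        List<Integer> current = List.of(root);
        for (int depth = 0; depth < maxSpheres && !current.isEmpty(); depth++) {
            List<Integer> next = new ArrayList<>();
            for (int parent : current) {          // parents are already in sorted order
                List<Integer> children = new ArrayList<>();
                for (int nbr : adj[parent]) {
                    if (!visited[nbr]) {          // already-seen atoms are skipped in this sketch
                        visited[nbr] = true;
                        children.add(nbr);
                    }
                }
                children.sort(order);             // sort only within this parent's group
                next.addAll(children);
            }
            if (!next.isEmpty()) result.add(next);
            current = next;
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] adj = {{1, 2, 3}, {0, 6}, {0, 4, 5}, {0}, {2}, {2}, {1}};
        String[] symbols = {"C", "O", "C", "N", "N", "C", "C"};
        System.out.println(spheres(adj, symbols, 0, 2)); // prints [[2, 3, 1], [5, 4, 6]]
    }
}
```

No temporary comparison keys are built here; the comparator reads the symbols directly.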
Doh, there is a QSAR model trained on the broken HOSE codes :-)
Again completely wrong since
@egonw I can go through and mark the incorrect HOSE codes, but I think it's impossible to fix completely unless we know how these DBs were generated, for which there are hints of NIST but not much more.
@steinbeck, can you have a look at this? I know the idea, but never actively used HOSE codes... or do you know someone who is willing to check this patch?
One way to support both is to add an option to generate either the old (incorrect) CDK HOSE codes or the fixed ones. There is a strong overlap between them, so mostly they are compatible. The two QSAR descriptors are the most problematic since they use many layers. Existing usages can then use the old codes if needed. The tricky part is that a lot of the usages are very old (circa 2004) and have probably bit-rotted quite heavily already. In that case I would favour just deprecating them and moving things to cdk-legacy.
c03a31b to 8942771
Kudos, SonarCloud Quality Gate passed!
Maybe not a bad idea (both options), but ideally the descriptors are indeed recalculated. We have a
Yep, there isn't much info, just "Weka(J48)" (a classifier) and "NIST".
@johnmay, this PR is now conflicting too :/
Yes, it needs a rebase. That's fine, it needs some more thought anyway.
@johnmay, got an update?
Marked as draft: these changes would break other parts (models trained on the wrong values), so it is difficult to merge in for now.
…ant loop variable.
…only compares the "stringScore"
…tor chaining here is a little harder to read so we do it the old-fashioned way. Another test needs updating, but we can see the expected output is clearly wrong. For example "C-3;*C*C*C(*C*C,*C,*CC,O/..." is wrong since we have three nodes in the first sphere (C,C,C) then 4 in the second (*C*C,*C,*CC,O). This is now correctly encoded as *C*C,*CO,*CC.
…iptor unless we can regenerate the DBs.
…o be a hack to try and fix the nesting issue and can't work since the parents may be identical. Now we have a proper comparator we can sort in the correct places.
8942771 to 47e46a9
…is descriptor unless we can regenerate the DBs." This reverts commit 4525a9a.
A new constructor parameter (flags) has been added; all existing usages use new HOSECodeGen(LEGACY_MODE); to keep compatible behaviour.
Kudos, SonarCloud Quality Gate passed!
I'm OK with the 5 code smells.
Looks good to me. With these low-level changes the impact can be big, and I am not sure whether all use cases still work, but at least those that have tests do, and since there are no regressions, I am happy to apply.
Fixes an issue with Hose Code nesting via #588.
We can actually see from another test that the output was wrong since it didn't make sense.
example:
So
but we had
It's logically inconsistent to have 4 in the second sphere and 3 in the first.
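The consistency argument can be sketched as a quick check, using the sphere fragments quoted in the commit message above. This is only an illustration under stated assumptions: `SphereCheck` is a hypothetical helper, it assumes each atom in a sphere fragment is written as a single uppercase letter, and it assumes the second sphere holds one comma-separated group per first-sphere atom.

```java
// Hypothetical helper, not part of CDK.
final class SphereCheck {

    /** Counts atoms in a sphere fragment, assuming one uppercase letter per atom. */
    static int countAtoms(String sphere) {
        int n = 0;
        for (char c : sphere.toCharArray()) {
            if (Character.isUpperCase(c)) n++;
        }
        return n;
    }

    /** Counts comma-separated groups, keeping empty groups (limit -1). */
    static int countGroups(String sphere) {
        return sphere.split(",", -1).length;
    }

    /** True when the next sphere has exactly one group per atom of the previous sphere. */
    static boolean isConsistent(String previous, String next) {
        return countGroups(next) == countAtoms(previous);
    }

    public static void main(String[] args) {
        String first = "*C*C*C";                                     // three atoms
        System.out.println(isConsistent(first, "*C*C,*C,*CC,O"));    // prints false (4 groups)
        System.out.println(isConsistent(first, "*C*C,*CO,*CC"));     // prints true  (3 groups)
    }
}
```

A check along these lines could help flag stored HOSE codes that were generated with the old nesting, though it would not say how to repair them.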