CxSMILES compatibility #7

schymane · 2018-01-09T07:39:59Z

We're trying to get our CxSMILES compatible between ChemAxon and CDK, testing visually with CDK Depict. We've run into this - the top example works, but none of the others (of 27 we've tried) do. Below is a selection of the smallest. All 26 SMILES that fail have this "lp" field in the extended SMILES part; the one that works doesn't. If I select "title", the extended part is just printed out as the title, i.e. it's not being recognised somehow.
C*.CC1=CC=CC=C1 |c:4,6,t:2,m:1:4.5.6|
C*.S=C1NC2=CC=CC=C2N1 |c:6,8,t:4,lp:2:2,4:1,11:1,m:1:8.9|
CCC1=CC=CC=C1.Br*.Br* |c:4,6,t:2,lp:8:3,10:3,m:9:3.4,11:4.5.6.7|
Cl*.Cl*.ClCC1=CC=CC=C1 |c:6,8,t:4,lp:0:3,2:3,4:3,m:1:7.8,3:8.9.10.11|
*C=O.C1CC2C3CCC(C3)C2C1 |lp:2:2,m:0:3.4.5.6.7.8.9.10.11.12|

johnmay · 2018-01-09T08:11:19Z

What do you think is wrong with the first one?

C*.CC1=CC=CC=C1 |c:4,6,t:2,m:1:4.5.6|

schymane · 2018-01-09T08:19:10Z

Nothing - it's the only one in the set that works :-)
If I get rid of the lp manually all look good. I also don't see the need for a lone pair definition for these structures...
C*.CC1=CC=CC=C1 |c:4,6,t:2,m:1:4.5.6|
C*.S=C1NC2=CC=CC=C2N1 |c:6,8,t:4,m:1:8.9|
CCC1=CC=CC=C1.Br*.Br* |c:4,6,t:2,m:9:3.4,11:4.5.6.7|
Cl*.Cl*.ClCC1=CC=CC=C1 |c:6,8,t:4,m:1:7.8,3:8.9.10.11|
*C=O.C1CC2C3CCC(C3)C2C1 |m:0:3.4.5.6.7.8.9.10.11.12|

schymane · 2018-01-09T08:37:15Z

Looking much better now - here's one that's totally broken once the lp part is removed:

Original:
[H]CC(C[H])C1=CC=C(C=C1)S(O)(=O)=O |c:7,9,t:5,lp:12:2,13:2,14:2,Sg:n:1:x:ht,Sg:n:3:y:ht| DTXCID701284951

Hand edited:
[H]CC(C[H])C1=CC=C(C=C1)S(O)(=O)=O |c:7,9,t:5,Sg:n:1:x:ht,Sg:n:3:y:ht| DTXCID701284951

schymane · 2018-01-09T08:45:58Z

...and easily fixed by turning Hs on

schymane · 2018-01-09T09:08:31Z

So ... getting there ... here's what it looks like in development (trying out a few different options) vs Depict now - I'm personally not a big fan of this third representation, but these are the only examples that still don't look quite the same (ignoring the shading for the moment - although in the 3rd case the missing shading means loss of information).
Apart from the depiction, what are your thoughts on the best representation(s)? Obviously this sometimes depends on the definition, or lack thereof, of the substances involved, and these three examples do not represent exactly the same thing (as you can see from the names in the first screenshot).
My ideal aim would be something we can depict properly and expand in R to use our mass spec workflows - i.e. to generate valid SMILES from structures stored (like these examples) in the Dashboard that we can then manipulate in (r)cdk.

Some SMILES:

[H]CC(C[H])C1=CC=C(C=C1)S(O)(=O)=O |c:7,9,t:5,Sg:n:1:x:ht,Sg:n:3:y:ht| DTXCID701284951
OS(=O)(=O)C1=CC=CC=C1.** |$;;;;;;;;;;Alkyl_p;$,c:6,8,t:4,m:11:7.8.9| DTXCID301079750           
CCCCCCCCCC.OS(=O)(=O)C1=CC=C(*)C=C1 |c:18,t:13,15,m:18:1.2.3.4| DTXCID001079751           
CCCCCCCCCCC.OS(=O)(=O)C1=CC=C(*)C=C1 |c:19,t:14,16,m:19:5.6.7.8.9| DTXCID701079752           
CCCCCCCCCCCC.OS(=O)(=O)C1=CC=C(*)C=C1 |c:20,t:15,17,m:20:1.2.3.4.5| DTXCID401079753           
CCCCCCC(O)=O.OS(=O)(=O)C1=CC=C(*)C=C1 |c:17,t:12,14,m:17:1.2.3.4.5| DTXCID101079754

johnmay · 2018-01-09T10:38:54Z

Have patched the lp: for you so these are now ignored (see Ignore lp: designations. cdk#410). The reason we ignore lp:, c:, t: etc. is that they don't change the meaning, and keeping them causes issues with canonical labelling/registration. For example, in pyrrole you would want these to have the same canonical identifier:

N1C=CC=C1
N1C=CC=C1 |c:2,3|
N1C=CC=C1 |c:2,3,lp:0:1|
N1C=CC=C1 |lp:0:1|
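To make the point concrete, here is a minimal Python sketch (not CDK's parser) that drops the c:, t: and lp: fields before comparing strings. The helper names `split_fields` and `drop_cosmetic` are hypothetical, and the sketch assumes a simple extension block with no `$...$` atom labels and no nested R-group definitions.

```python
import re

# Field names treated as cosmetic in this sketch (assumption taken from the
# comment above: c:, t: and lp: do not change the meaning of the structure).
COSMETIC = {"c", "t", "lp"}


def split_fields(ext):
    """Split the text between the | | delimiters into top-level fields.

    A comma starts a new field only when the next chunk begins with a
    'name:' prefix; otherwise it continues the previous field ('c:2,3').
    Atom-label blocks ($...$) and nested {...} definitions are not handled.
    """
    fields = []
    for chunk in ext.split(","):
        if re.match(r"[A-Za-z]+:", chunk) or not fields:
            fields.append(chunk)
        else:
            fields[-1] += "," + chunk
    return fields


def drop_cosmetic(cxsmiles):
    """Return the CXSMILES with c:, t: and lp: fields removed."""
    m = re.match(r"^(\S+)\s+\|([^|]*)\|(.*)$", cxsmiles)
    if not m:
        return cxsmiles  # plain SMILES, nothing to strip
    smiles, ext, rest = m.groups()
    kept = [f for f in split_fields(ext) if f.split(":", 1)[0] not in COSMETIC]
    out = smiles + (" |" + ",".join(kept) + "|" if kept else "")
    return out + rest


pyrroles = [
    "N1C=CC=C1",
    "N1C=CC=C1 |c:2,3|",
    "N1C=CC=C1 |c:2,3,lp:0:1|",
    "N1C=CC=C1 |lp:0:1|",
]
# All four inputs collapse to the same string once the cosmetic fields go.
print({drop_cosmetic(s) for s in pyrroles})
```

Fields that do carry meaning, such as m: and Sg:, pass through unchanged, so the same helper can be used on the failing examples earlier in the thread.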

The hydrogen case is easy to address. Essentially, as you guessed, the hydrogen gets removed, so the brackets can't be drawn and an exception is raised. Will add a patch for that.

"still don't look quite the same" - that's not the goal; my goal was to get them looking like they do in patents. All the information is captured and so can always be rendered manually if desired. I don't like the shading as it's not clear when the shaded regions overlap:

CC1=CC=CC=C1.N*.O*.Cl* |c:3,5,t:1,m:8:4.3.2,10:3.4.5,12:2.3.4.5|
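The overlap can be listed explicitly. The following Python sketch (hypothetical helper names, regex-based, not CDK code) reads the m: field as "placeholder atom : candidate attachment atoms" and reports which ring atoms are shared between each pair of variable attachments. It assumes a single m: field and a single extension block.

```python
import re
from itertools import combinations

# One m: field: comma-separated 'placeholder:atom.atom.atom' entries.
M_FIELD = re.compile(r"(?:^|,)m:(\d+:\d+(?:\.\d+)*(?:,\d+:\d+(?:\.\d+)*)*)")


def parse_positional_variation(cxsmiles):
    """Map each placeholder atom index to the set of atoms it may attach to."""
    ext = cxsmiles.split("|")[1] if "|" in cxsmiles else ""
    found = M_FIELD.search(ext)
    if not found:
        return {}
    variation = {}
    for entry in found.group(1).split(","):
        start, ends = entry.split(":")
        variation[int(start)] = {int(i) for i in ends.split(".")}
    return variation


def overlaps(variation):
    """Return the atoms shared by each pair of variable attachments."""
    shared = {}
    for a, b in combinations(sorted(variation), 2):
        common = variation[a] & variation[b]
        if common:
            shared[(a, b)] = common
    return shared


cx = "CC1=CC=CC=C1.N*.O*.Cl* |c:3,5,t:1,m:8:4.3.2,10:3.4.5,12:2.3.4.5|"
print(parse_positional_variation(cx))
print(overlaps(parse_positional_variation(cx)))
```

For this example every pair of attachments shares at least two ring atoms, which is exactly the situation where shaded regions become hard to tell apart.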

CAS have a better semantic depiction, but again, in the same way that we don't draw the electrons, the goal is not to draw exactly what is semantically captured underneath.

CDK draws these small brackets for single atom link nodes, if you look through patents this is very common. It's a simple boolean switch to draw the larger (and IMO uglier) brackets in these cases. If statement is here: StandardSgroupGenerator:L531
Attachment is ortho/meta/para and therefore can go anywhere valence is okay on this ring.
Yeah there is no good way to draw this, I think even IUPAC admits this, the CAS may be okay here. I'll agree the ChemAxon depiction is clearer here.

schymane · 2018-01-09T10:55:52Z

Thanks! That's awesome. I actually much prefer your depiction for case 1 and 2, they are simpler and cleaner. Case 3 is the tricky one, like you say :-) [and good point about the shading].
Case 3 could be captured (for my purposes) using Case 1 and being able to define the range of x and y. This is all I need and structure-wise enables you to collapse down the long chains and get a much clearer depiction.

Is there a way to store and display that range properly? So far my non-ideal workaround has been via the title field...

[H]CC(C[H])C1=CC=C(C=C1)S(O)(=O)=O |c:7,9,t:5,Sg:n:1:x:ht,Sg:n:3:y:ht| DTXCID701284951 (x+y=1-10)

johnmay · 2018-01-09T11:34:47Z

Not in CDK's handling, and I don't think CXSMILES supports it either. You can put constraints on R groups (R group logic), but I'm not sure about the repeat variation. It completely makes sense though, and you'd want it for a general Markush match.

schymane · 2018-01-10T15:37:50Z

This one is an interesting corner case ...
NC1=CC=C(O)C2=C1C(=O)C1=C(O)C=CC(N)=C1C2=O.Br* |c:6,11,14,17,t:1,3,lp:0:1,5:2,9:2,12:2,16:1,19:2,20:3,m:21:13.14| DTXCID301079550

(are there any plans to be able to adjust orientation if desired at some point, in general?)

johnmay · 2018-01-10T16:32:02Z

Let's remove the junk first:

NC1=CC=C(O)C2=C1C(=O)C1=C(O)C=CC(N)=C1C2=O.Br* |m:21:13.14|

When it crosses a single bond, it doesn't seem to do the check for which side to put it on:

NC1=CC=C(O)C2=C1C(=O)C1=C(O)C=CC(N)=C1C2=O.Br* |m:21:13.14.15|

schymane · 2018-01-10T17:27:37Z

Not just the bond order? This also ends up on the wrong side:

NC1=CC=C(O)C2=C1C(=O)C1=C(O)C=CC(N)=C1C2=O.Br* |m:21:13.14.15.16|

(debugging on my phone so it's just a wild guess - which side is the issue? ChemAxon or CDK?)

johnmay · 2018-01-10T18:54:09Z

CDK issue

johnmay · 2018-01-13T15:12:40Z

I believe I've come up with a fix for the sidedness of the positional variation. The algorithm currently looks at the atoms where the bond can go and computes the centre from those. When there is a single bond there is no sidedness, and so the choice is arbitrary. You can see the 'centre of mass' marked with an 'X' in the following examples. I used the existing APIs to add the colours so we can see where it's attaching.

|13.14|

|13.14.15|

|13.14.15.16|
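One way to read the "no sidedness" remark is geometric, sketched below in Python. This is an illustration only, not the CDK implementation: the coordinates are a hypothetical regular hexagon, the indices 0..5 are ring positions rather than the atom indices in the SMILES, and `centroid` and `side_of_bond` are made-up helper names.

```python
from math import cos, sin, pi


def centroid(points):
    """Arithmetic mean of 2D points, standing in for the 'centre of mass'."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def side_of_bond(a, b, p, eps=1e-9):
    """Sign of the cross product (b - a) x (p - a).

    Returns +1 or -1 for the two sides of the line through a and b,
    and 0 when p lies on that line (within eps).
    """
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if abs(cross) < eps:
        return 0
    return 1 if cross > 0 else -1


# Hypothetical layout: regular hexagon, unit bond length, centred on (0, 0).
ring = [(cos(pi / 3 * k), sin(pi / 3 * k)) for k in range(6)]
ring_centre = (0.0, 0.0)

two = centroid(ring[0:2])    # two candidate atoms: one ring bond
three = centroid(ring[0:3])  # three candidate atoms

# With two atoms the centroid is the bond midpoint, so it is on neither side.
print(side_of_bond(ring[0], ring[1], two))
# With three atoms the centroid falls on the same side as the ring centre.
print(side_of_bond(ring[0], ring[1], three),
      side_of_bond(ring[0], ring[1], ring_centre))
```

With only two candidate atoms the centroid carries no information about which side of the bond to approach from, so some additional rule is needed to pick a side.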

schymane · 2018-03-12T07:29:37Z

I'm adding comments to this as we have a new case:
https://comptox.epa.gov/dashboard/dsstoxdb/results?search=PCBs
[*]C1=C([*])C([*])=C(C([*])=C1[*])C1=C([*])C([*])=C([*])C([*])=C1[*] |$_R1;;;_R1;;_R1;;;_R1;;_R1;;;_R1;;_R1;;_R1;;_R1;;_R1$,c:1,5,8,12,20,t:16,RG:_R1={Cl* |$;_AP1$,lp:0:2|},LOG={_R1:;H;>0}|

which does not work in CDK Depict. I am also not yet a big fan of this style of representing the problem.
This works as an alternative:

Cl*.Cl*.c1ccccc1-c1ccccc1 |m:1:4.5.6.7.8.9,3:10.11.12.13.14.15,Sg:n:2:x:ht,Sg:n:0:y:ht| polychlorinated biphenyls, x+y>1

What are your thoughts/suggestions? Thanks!

johnmay · 2018-03-12T13:35:13Z

Yes R groups are not supported. I have some internal code at NextMove that does handle them but the issue is the CDK has an explicit type (RGroupQuery) for handling this. Trying to convert the ChemAxon definitions into this is tricky as the concepts don't match exactly (RGroupQuery doesn't store attachments explicitly for example) - Hence for my internal code I just store them on a property of a molecule.

schymane · 2018-06-26T13:24:03Z

Here is a new case that doesn't seem to display optimally ... an undefined location within a repeater unit:
[H]OCCO.C* |lp:1:2,4:2,m:6:3.2,Sg:n:5,1,2,3::ht|

http://www.simolecule.com/cdkdepict/depict/bow/svg?smi=[H]OCCO.C*%20%7Clp%3A1%3A2%2C4%3A2%2Cm%3A6%3A3.2%2CSg%3An%3A5%2C1%2C2%2C3%3A%3Aht%7C&abbr=on&hdisp=bridgehead&showtitle=false&zoom=1.6&annotate=none

Pic from ChemAxon (source of extended SMILES):

I'm not yet sure I'm a fan of the ChemAxon representation ... but we don't currently see another option.

schymane · 2018-06-26T13:33:45Z

[H]OCCO.C* |m:6:3.2,Sg:n:1,2,3::ht| looks better

johnmay · 2018-06-27T11:23:37Z

I think this is possibly a bug in the ChemAxon export. Since the positional variation is part of the repeat, the whole thing needs to be included in the repeat, not just one of the atoms. Of course this depends on how you represent the attachment, but since it is present as a real node (*) in the SMILES, it's logical that it should be in the repeat - so the Sg:n should include atom indices 1-6.

[H]OCCO.C* |m:6:3.2,Sg:n:1,2,3,5,6::ht|
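A check for this situation can be sketched in a few lines of Python. This is a hypothetical helper, not part of CDK or ChemAxon: it flags any m: placeholder atom that attaches to atoms inside an Sg:n repeat while itself being left out of that repeat. It assumes a single extension block and the field layouts seen in this thread.

```python
import re

# 'placeholder:atom.atom' entries of a single m: field.
M_FIELD = re.compile(r"(?:^|,)m:(\d+:\d+(?:\.\d+)*(?:,\d+:\d+(?:\.\d+)*)*)")
# Atom list of each Sg:n field, e.g. 'Sg:n:5,1,2,3::ht' gives '5,1,2,3'.
SG_N_ATOMS = re.compile(r"Sg:n:(\d+(?:,\d+)*):")


def placeholders_outside_repeat(cxsmiles):
    """Return placeholder atoms that attach into a repeat but are not in it."""
    ext = cxsmiles.split("|")[1] if "|" in cxsmiles else ""
    repeats = [
        {int(i) for i in atoms.split(",")}
        for atoms in SG_N_ATOMS.findall(ext)
    ]
    found = M_FIELD.search(ext)
    if not found:
        return []
    flagged = []
    for entry in found.group(1).split(","):
        start, ends = entry.split(":")
        placeholder = int(start)
        targets = {int(i) for i in ends.split(".")}
        for repeat in repeats:
            # Attachment lands inside the repeat, placeholder does not.
            if targets & repeat and placeholder not in repeat:
                flagged.append(placeholder)
                break
    return flagged


exported = "[H]OCCO.C* |lp:1:2,4:2,m:6:3.2,Sg:n:5,1,2,3::ht|"
corrected = "[H]OCCO.C* |m:6:3.2,Sg:n:1,2,3,5,6::ht|"
print(placeholders_outside_repeat(exported))   # placeholder atom 6 is flagged
print(placeholders_outside_repeat(corrected))  # nothing flagged
```

Running it over a batch of exported CXSMILES would be one way to find records that need the repeat extended before depiction.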

I was talking to Greg about this at the ICCS, CXSMILES really is poorly designed. It can't for example differentiate spiro vs linear repeats because it only stores the atoms and not the crossing bonds:

schymane · 2018-06-27T11:42:23Z

Agree re: design ... Chris and I were swapping some last night that enumerate correctly according to the technical definitions but are visually incomprehensible.
OS(=O)(=O)c1ccc2c(c1)C(CCC2CC)CC |Sg:n:14:m:ht,Sg:n:16:n:ht| C6-C10DATS; n+m=0-4

vs (n and m may not match exactly, don't have the ChemAxon CxSMILES for this yet)

johnmay · 2022-02-09T11:03:57Z

As far as I can tell there is one still broken - will open a separate issue for that.

johnmay mentioned this issue Jan 13, 2018

Patch/sgrouphydrogens cdk/cdk#411

Merged

johnmay mentioned this issue Jan 14, 2018

Improved placement of positional variation bonds. cdk/cdk#412

Merged

johnmay closed this as completed Jan 14, 2018

johnmay reopened this Mar 12, 2018

johnmay closed this as completed Feb 9, 2022
