CxSMILES compatibility #7

Closed
schymane opened this issue Jan 9, 2018 · 20 comments

@schymane

schymane commented Jan 9, 2018

We're trying to get our CxSMILES compatible between ChemAxon and CDK, testing visually with CDK Depict. We've run into this - the top example works, but no others (of 27 we've tried) do. Below a selection of the smallest. All 26 SMILES that fail have this "lp" field in the extended SMILES bit, the one that works doesn't. If I select "title" the extended part is just printed out as the title, i.e. it's not recognising it somehow.
C*.CC1=CC=CC=C1 |c:4,6,t:2,m:1:4.5.6|
C*.S=C1NC2=CC=CC=C2N1 |c:6,8,t:4,lp:2:2,4:1,11:1,m:1:8.9|
CCC1=CC=CC=C1.Br*.Br* |c:4,6,t:2,lp:8:3,10:3,m:9:3.4,11:4.5.6.7|
Cl*.Cl*.ClCC1=CC=CC=C1 |c:6,8,t:4,lp:0:3,2:3,4:3,m:1:7.8,3:8.9.10.11|
*C=O.C1CC2C3CCC(C3)C2C1 |lp:2:2,m:0:3.4.5.6.7.8.9.10.11.12|

image

@johnmay
Member

johnmay commented Jan 9, 2018

What do you think is wrong with the first one?

C*.CC1=CC=CC=C1 |c:4,6,t:2,m:1:4.5.6|

@schymane
Author

schymane commented Jan 9, 2018

Nothing - it's the only one in the set that works :-)
If I get rid of the lp manually all look good. I also don't see the need for a lone pair definition for these structures...
C*.CC1=CC=CC=C1 |c:4,6,t:2,m:1:4.5.6|
C*.S=C1NC2=CC=CC=C2N1 |c:6,8,t:4,m:1:8.9|
CCC1=CC=CC=C1.Br*.Br* |c:4,6,t:2,m:9:3.4,11:4.5.6.7|
Cl*.Cl*.ClCC1=CC=CC=C1 |c:6,8,t:4,m:1:7.8,3:8.9.10.11|
*C=O.C1CC2C3CCC(C3)C2C1 |m:0:3.4.5.6.7.8.9.10.11.12|
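The hand edit above (dropping every lp: field) can be scripted. This is a minimal sketch, not how CDK or ChemAxon parse CXSMILES. It assumes the extension is a single |...| block with no nested blocks (so not the RG: style), and that any comma-separated token starting with a digit continues the previous field. The function name strip_lp is made up for illustration.

```python
import re

def strip_lp(line: str) -> str:
    """Drop lp: (lone pair) fields from the |...| block of one CXSMILES line.

    Assumption: one un-nested |...| block; tokens that start with a digit
    belong to the field before them (e.g. "lp:2:2,4:1,11:1" is one field).
    """
    match = re.match(r"^(\S+)\s+\|([^|]*)\|(.*)$", line)
    if match is None:
        return line  # no extension block, nothing to do
    smiles, ext, rest = match.groups()

    # Regroup comma-separated tokens into whole fields.
    fields = []
    for token in ext.split(","):
        if token[:1].isdigit() and fields:
            fields[-1] += "," + token
        else:
            fields.append(token)

    kept = [f for f in fields if not f.startswith("lp:")]
    if not kept:
        return smiles + rest  # nothing left in the extension block
    return f"{smiles} |{','.join(kept)}|{rest}"


print(strip_lp("C*.S=C1NC2=CC=CC=C2N1 |c:6,8,t:4,lp:2:2,4:1,11:1,m:1:8.9|"))
```

Applied to the failing lines in the first comment, this reproduces the hand-edited lines listed here; any trailing title text after the closing | is kept as-is.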

image

@schymane
Author

schymane commented Jan 9, 2018

Looking much better now - here's one that's totally broken once the lp part is removed:

Original:
[H]CC(C[H])C1=CC=C(C=C1)S(O)(=O)=O |c:7,9,t:5,lp:12:2,13:2,14:2,Sg:n:1:x:ht,Sg:n:3:y:ht| DTXCID701284951

Hand edited:
[H]CC(C[H])C1=CC=C(C=C1)S(O)(=O)=O |c:7,9,t:5,Sg:n:1:x:ht,Sg:n:3:y:ht| DTXCID701284951

image

@schymane
Author

schymane commented Jan 9, 2018

...and easily fixed by turning Hs on
image

@schymane
Author

schymane commented Jan 9, 2018

So ... getting there ... here's what it looks like in development (trying out a few different options) vs Depict now - I'm personally not a big fan of this third representation, but these are the only examples that still don't look quite the same (ignoring the shading for the moment - although in the 3rd case the missing shading means loss of information).
Apart from the depiction, what are your thoughts on the best representation(s)? Obviously this sometimes depends on the definition, or lack thereof, of the substances involved, and these three examples are not representing exactly the same thing (as you can see from the names in the first screenshot).
My ideal aim would be something we can depict properly and expand in R to use our mass spec workflows - i.e. to generate valid SMILES from structures stored (like these examples) in the Dashboard that we can then manipulate in (r)cdk.

image

image

Some SMILES:

[H]CC(C[H])C1=CC=C(C=C1)S(O)(=O)=O |c:7,9,t:5,Sg:n:1:x:ht,Sg:n:3:y:ht| DTXCID701284951
OS(=O)(=O)C1=CC=CC=C1.** |$;;;;;;;;;;Alkyl_p;$,c:6,8,t:4,m:11:7.8.9| DTXCID301079750
CCCCCCCCCC.OS(=O)(=O)C1=CC=C(*)C=C1 |c:18,t:13,15,m:18:1.2.3.4| DTXCID001079751
CCCCCCCCCCC.OS(=O)(=O)C1=CC=C(*)C=C1 |c:19,t:14,16,m:19:5.6.7.8.9| DTXCID701079752
CCCCCCCCCCCC.OS(=O)(=O)C1=CC=C(*)C=C1 |c:20,t:15,17,m:20:1.2.3.4.5| DTXCID401079753
CCCCCCC(O)=O.OS(=O)(=O)C1=CC=C(*)C=C1 |c:17,t:12,14,m:17:1.2.3.4.5| DTXCID101079754

@johnmay
Member

johnmay commented Jan 9, 2018

  • Have patched the lp: for you so these are now ignored (see Ignore lp: designations. cdk#410). The reason we ignore lp:, c:, t: etc. is that they don't change the meaning, and you run into issues with canonical labelling/registration. For example, in pyrrole you would want these to have the same canonical identifier:
N1C=CC=C1
N1C=CC=C1 |c:2,3|
N1C=CC=C1 |c:2,3,lp:0:1|
N1C=CC=C1 |lp:0:1|
  • The hydrogen case is easy to address. As you guessed, the hydrogen gets removed, so the brackets can't be drawn and an exception is raised. Will add a patch for that.
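To illustrate the pyrrole point in the first bullet: if the c:, t: and lp: fields are treated as carrying no meaning, the four lines collapse to one string. This is only a string-level sketch under that assumption, not CDK's canonical labelling; a real registration step would also canonicalise the SMILES itself with a toolkit. The names IGNORED_KEYS and core_form are made up for illustration.

```python
# Field keys treated as non-semantic, following the comment above.
# Assumption: only these three; other CXSMILES fields are left alone.
IGNORED_KEYS = ("c:", "t:", "lp:")

def core_form(line: str) -> str:
    """Return the line with ignored fields removed from its |...| block."""
    smiles, sep, tail = line.partition(" |")
    if not sep:
        return line.strip()  # plain SMILES, no extension block
    ext = tail.split("|", 1)[0]

    # Tokens starting with a digit continue the previous field.
    fields = []
    for token in ext.split(","):
        if token[:1].isdigit() and fields:
            fields[-1] += "," + token
        else:
            fields.append(token)

    kept = [f for f in fields if not f.startswith(IGNORED_KEYS)]
    return smiles if not kept else f"{smiles} |{','.join(kept)}|"


variants = [
    "N1C=CC=C1",
    "N1C=CC=C1 |c:2,3|",
    "N1C=CC=C1 |c:2,3,lp:0:1|",
    "N1C=CC=C1 |lp:0:1|",
]
distinct = {core_form(v) for v in variants}
print(distinct)
```

All four variants reduce to the same string, which is the behaviour you want before computing any identifier.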

"still don't look quite the same" - that's not the goal. My goal was to get them looking like they do in patents; all the information is captured and so can always be rendered manually if desired. I don't like the shading as it's not clear when the shaded regions overlap:

image
CC1=CC=CC=C1.N*.O*.Cl* |c:3,5,t:1,m:8:4.3.2,10:3.4.5,12:2.3.4.5|
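The overlap problem can be read straight off the m: field. A minimal sketch, assuming each m: group has the form attachment:atom.atom... with 0-based atom indices and groups separated by commas; parse_positional_variation is a made-up helper name.

```python
from itertools import combinations

def parse_positional_variation(m_field: str) -> dict:
    """Map each attachment (*) atom index to the set of atoms it may bond to."""
    body = m_field[2:] if m_field.startswith("m:") else m_field
    groups = {}
    for group in body.split(","):
        star, targets = group.split(":")
        groups[int(star)] = {int(t) for t in targets.split(".")}
    return groups


groups = parse_positional_variation("m:8:4.3.2,10:3.4.5,12:2.3.4.5")

# Ring atoms shared by each pair of substituents, i.e. where shading would overlap.
overlaps = {
    (a, b): sorted(groups[a] & groups[b])
    for a, b in combinations(sorted(groups), 2)
}
for pair, atoms in overlaps.items():
    print(pair, atoms)
```

For this example every pair of substituents shares at least two ring atoms, so three shaded regions would sit on top of each other.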

CAS have a better semantic depiction, but in the same way that we don't draw the electrons, the goal is not to draw exactly what is semantically captured underneath.

  1. CDK draws these small brackets for single atom link nodes, if you look through patents this is very common. It's a simple boolean switch to draw the larger (and IMO uglier) brackets in these cases. If statement is here: StandardSgroupGenerator:L531
  2. Attachment is ortho/meta/para and therefore can go anywhere valence is okay on this ring.
  3. Yeah there is no good way to draw this, I think even IUPAC admits this, the CAS may be okay here. I'll agree the ChemAxon depiction is clearer here.

@schymane
Author

schymane commented Jan 9, 2018

Thanks! That's awesome. I actually much prefer your depiction for case 1 and 2, they are simpler and cleaner. Case 3 is the tricky one, like you say :-) [and good point about the shading].
Case 3 could be captured (for my purposes) using Case 1 and being able to define the range of x and y. This is all I need and structure-wise enables you to collapse down the long chains and get a much clearer depiction.

Is there a way to store and display that range properly? So far my non-ideal workaround has been via the title field...

[H]CC(C[H])C1=CC=C(C=C1)S(O)(=O)=O |c:7,9,t:5,Sg:n:1:x:ht,Sg:n:3:y:ht| DTXCID701284951 (x+y=1-10)
image
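For the mass spec use case, one reading of the "(x+y=1-10)" note is a plain enumeration. This is a hedged sketch: the title note is an informal convention, not something CXSMILES or CDK interprets, and the sketch assumes x and y are repeat counts of the two CH2 units (atoms 1 and 3), that either may be zero, and that swapping x and y gives the same compound. The explicit [H] atoms are left implicit, and enumerate_expansions is a made-up name.

```python
def enumerate_expansions(total_min: int = 1, total_max: int = 10):
    """List (x, y, smiles) for H-(CH2)x-CH(-(CH2)y-H)-C6H4-SO3H.

    Only x >= y is generated, since the two chains are interchangeable.
    """
    ring = "C1=CC=C(C=C1)S(O)(=O)=O"
    results = []
    for total in range(total_min, total_max + 1):
        for y in range(0, total // 2 + 1):
            x = total - y
            branch = f"({'C' * y})" if y else ""  # no empty "()" when y == 0
            results.append((x, y, "C" * x + "C" + branch + ring))
    return results


expansions = enumerate_expansions()
for x, y, smi in expansions[:3]:
    print(x, y, smi)
```

The resulting plain SMILES could then be read by any SMILES parser, for example in (r)cdk, without needing the Sg: fields.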

@johnmay
Member

johnmay commented Jan 9, 2018

Not in CDK's handling, and I don't think in CXSMILES either. You can put constraints on R groups (R group logic), but I'm not sure about the repeat variation. It completely makes sense though, and you'd want it for a general Markush match.

@schymane
Author

This one is an interesting corner case ...
NC1=CC=C(O)C2=C1C(=O)C1=C(O)C=CC(N)=C1C2=O.Br* |c:6,11,14,17,t:1,3,lp:0:1,5:2,9:2,12:2,16:1,19:2,20:3,m:21:13.14| DTXCID301079550
image
(are there any plans to be able to adjust orientation if desired at some point, in general?)

@johnmay
Member

johnmay commented Jan 10, 2018

Let's remove the junk first:

NC1=CC=C(O)C2=C1C(=O)C1=C(O)C=CC(N)=C1C2=O.Br* |m:21:13.14|

When it crosses a single bond, it doesn't seem to check which side to put it on:

NC1=CC=C(O)C2=C1C(=O)C1=C(O)C=CC(N)=C1C2=O.Br* |m:21:13.14.15|

@schymane
Author

schymane commented Jan 10, 2018 via email

@johnmay
Member

johnmay commented Jan 10, 2018

CDK issue

@johnmay
Member

johnmay commented Jan 13, 2018

I believe I've come up with a fix for the sided-ness of the positional variation. The algorithm currently looks at the atoms where it can go and computes the centre from that. When there is a single bond there is no sided-ness and so it's arbitrary. You can see the 'center of mass' marked in the following examples with an 'X'. I used the existing APIs to add the colors so we can see where it's attaching.

|13.14|

image

|13.14.15|

image

|13.14.15.16|

image
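The geometry behind the arbitrary side can be shown with plain 2D points. This is a sketch of the idea only, not the CDK code: the coordinates are made up, and centroid and side_of_bond are hypothetical helper names.

```python
def centroid(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def side_of_bond(a, b, p, eps=1e-9):
    """Return +1 or -1 for the side of the line a->b that p lies on, 0 if on it."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if abs(cross) < eps:
        return 0
    return 1 if cross > 0 else -1


# Hypothetical layout coordinates for three adjacent ring atoms.
atom13, atom14, atom15 = (0.0, 0.0), (1.0, 0.0), (1.5, 0.866)

# Two candidate atoms: the centre is the bond midpoint, so it picks no side.
two = side_of_bond(atom13, atom14, centroid([atom13, atom14]))
# Three candidate atoms: the centre moves off the bond and gives a side.
three = side_of_bond(atom13, atom14, centroid([atom13, atom14, atom15]))
print(two, three)
```

With only two candidate atoms the centre always sits on the bond itself, which is why the |13.14| case needs an extra rule to choose a side.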

@schymane
Author

I'm adding comments to this as we have a new case:
https://comptox.epa.gov/dashboard/dsstoxdb/results?search=PCBs
[*]C1=C([*])C([*])=C(C([*])=C1[*])C1=C([*])C([*])=C([*])C([*])=C1[*] |$_R1;;;_R1;;_R1;;;_R1;;_R1;;;_R1;;_R1;;_R1;;_R1;;_R1$,c:1,5,8,12,20,t:16,RG:_R1={Cl* |$;_AP1$,lp:0:2|},LOG={_R1:;H;>0}|
image

which does not work in CDK Depict. I am also not yet a big fan of this style of representing the problem.
This works as an alternative:
image

Cl*.Cl*.c1ccccc1-c1ccccc1 |m:1:4.5.6.7.8.9,3:10.11.12.13.14.15,Sg:n:2:x:ht,Sg:n:0:y:ht| polychlorinated biphenyls, x+y>1

What are your thoughts/suggestions? Thanks!

@johnmay
Member

johnmay commented Mar 12, 2018

Yes R groups are not supported. I have some internal code at NextMove that does handle them but the issue is the CDK has an explicit type (RGroupQuery) for handling this. Trying to convert the ChemAxon definitions into this is tricky as the concepts don't match exactly (RGroupQuery doesn't store attachments explicitly for example) - Hence for my internal code I just store them on a property of a molecule.

@johnmay johnmay reopened this Mar 12, 2018
@schymane
Author

Here is a new case that doesn't seem to display optimally ... an undefined location within a repeater unit:
[H]OCCO.C* |lp:1:2,4:2,m:6:3.2,Sg:n:5,1,2,3::ht|
image

http://www.simolecule.com/cdkdepict/depict/bow/svg?smi=[H]OCCO.C*%20%7Clp%3A1%3A2%2C4%3A2%2Cm%3A6%3A3.2%2CSg%3An%3A5%2C1%2C2%2C3%3A%3Aht%7C&abbr=on&hdisp=bridgehead&showtitle=false&zoom=1.6&annotate=none
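A URL like the one above can be built with the standard library rather than encoded by hand. A small sketch: the base URL and query parameters are copied from the link above, whether that server still responds is not something this checks, and depict_url is a made-up helper name.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Base URL and parameter values taken from the link posted above.
BASE = "http://www.simolecule.com/cdkdepict/depict/bow/svg"

def depict_url(cxsmiles: str, base: str = BASE) -> str:
    """Build a CDK Depict request URL with the CXSMILES percent-encoded."""
    params = {
        "smi": cxsmiles,
        "abbr": "on",
        "hdisp": "bridgehead",
        "showtitle": "false",
        "zoom": "1.6",
        "annotate": "none",
    }
    # quote_via=quote encodes spaces as %20 instead of "+".
    return base + "?" + urlencode(params, quote_via=quote)


original = "[H]OCCO.C* |lp:1:2,4:2,m:6:3.2,Sg:n:5,1,2,3::ht|"
url = depict_url(original)
print(url)

# Decoding the query string gives back the original CXSMILES.
decoded = parse_qs(urlsplit(url).query)["smi"][0]
print(decoded == original)
```

This encodes more characters than the hand-written link (brackets and * as well), which decodes to the same string on the server side.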

Pic from ChemAxon (source of extended SMILES):
ppgn_chemaxon

I'm not yet sure I'm a fan of the ChemAxon representation ... but we don't currently see another option.

@schymane
Author

[H]OCCO.C* |m:6:3.2,Sg:n:1,2,3::ht| looks better

@johnmay
Member

johnmay commented Jun 27, 2018

I think this is possibly a bug in the ChemAxon export. Since the positional variation is part of the repeat, the whole thing needs to be included in the repeat, not just one of the atoms. Of course this depends on how you represent the attachment, but since it is present as a real node in the SMILES (*), it's logical that it should be in the repeat - so the Sg:n should include atom indices 1-6.

[H]OCCO.C* |m:6:3.2,Sg:n:1,2,3,5,6::ht|
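The condition described above can be checked mechanically. A sketch under stated assumptions: fields are split with the same "digit continues the previous field" rule used for hand editing, Sg:n fields have the form Sg:n:atoms:subscript:connectivity, and only the * atom itself is checked. Finding the substituent atoms bonded to it (atom 5 here) would need the molecular graph, which this sketch does not build. The function name is made up.

```python
def unenclosed_attachments(ext: str) -> list:
    """Return * atom indices whose target atoms lie in a repeat that omits the * atom."""
    # Regroup comma-separated tokens into whole fields.
    fields = []
    for token in ext.split(","):
        if token[:1].isdigit() and fields:
            fields[-1] += "," + token
        else:
            fields.append(token)

    repeats = []    # one set of atom indices per Sg:n field
    variation = {}  # * atom index -> set of atoms it may attach to
    for field in fields:
        if field.startswith("Sg:n:"):
            atoms = field.split(":")[2]
            repeats.append({int(a) for a in atoms.split(",")})
        elif field.startswith("m:"):
            for group in field[2:].split(","):
                star, targets = group.split(":")
                variation[int(star)] = {int(t) for t in targets.split(".")}

    missing = []
    for star, targets in variation.items():
        if any(targets & atoms and star not in atoms for atoms in repeats):
            missing.append(star)
    return sorted(missing)


print(unenclosed_attachments("m:6:3.2,Sg:n:1,2,3::ht"))
print(unenclosed_attachments("m:6:3.2,Sg:n:1,2,3,5,6::ht"))
```

The exported form flags atom 6, while the corrected form above passes the check.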

I was talking to Greg about this at the ICCS, CXSMILES really is poorly designed. It can't for example differentiate spiro vs linear repeats because it only stores the atoms and not the crossing bonds:

image

@schymane
Author

Agree re: design ... Chris and I were swapping some examples last night that enumerate correctly according to the technical definitions but are visually incomprehensible.
OS(=O)(=O)c1ccc2c(c1)C(CCC2CC)CC |Sg:n:14:m:ht,Sg:n:16:n:ht| C6-C10DATS; n+m=0-4
image

vs (n and m may not match exactly, don't have the ChemAxon CxSMILES for this yet)

image

@johnmay
Member

johnmay commented Feb 9, 2022

As far as I can tell there is one still broken - will open a separate issue for that.

@johnmay johnmay closed this as completed Feb 9, 2022