The drawbacks of backtracking #2

Closed
dan2097 opened this issue Apr 15, 2011 · 4 comments
Labels
bug Something isn't working major

Comments

@dan2097
Owner

dan2097 commented Apr 15, 2011

Original report by Steve Chapman (Bitbucket: isomerdesign).


Synthetic cannabinoid JWH-251 is commonly named "1-pentyl-3-(2-methylphenylacetyl)indole" which is parsed as "1-pentyl-3-(2-methyl-2-phenylacetyl)indole" rather than "1-pentyl-3-[(2-methylphenyl)acetyl]indole." The latter name evokes the correct depiction of JWH-251. (Credit Lee Fadness for this discovery)

However, "1-pentyl-3-(3-methylphenylacetyl)indole" is parsed as "1-pentyl-3-[(3-methylphenyl)acetyl]indole," presumably because the parse that worked for the prior name fails for this one. Pragmatic but inconsistent.

Also, this variant parses correctly and consistently regardless of the locant: "1-pentyl-3-(2-methylphenacyl)indole".

@dan2097
Owner Author

dan2097 commented Apr 16, 2011

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


The common name is in my opinion formally ambiguous. Unfortunately OPSIN does not currently detect ambiguity in chemical names and instead tends to stop looking for possibilities as soon as it has found a sensible outcome.

The heuristic that is employed is to start from the rightmost group in the bracket and work right to left checking whether the group has the desired locant.
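A minimal sketch of that right-to-left search, written in Python for illustration. It is not OPSIN's code: the `GROUP_LOCANTS` table and the function name `pick_group_right_to_left` are hypothetical, and the locant sets are toy assumptions chosen only to reproduce the behaviour described in the report.

```python
# Toy locant tables (assumptions for illustration, not OPSIN data).
GROUP_LOCANTS = {
    "phenyl": {"1", "2", "3", "4", "5", "6"},
    "acetyl": {"1", "2"},
}


def pick_group_right_to_left(locant, groups):
    """Return the rightmost group in the bracket that offers `locant`.

    `groups` lists the groups in the order they appear in the name,
    e.g. ["phenyl", "acetyl"] for "(2-methylphenylacetyl)".
    Returns None when no group has the locant.
    """
    for group in reversed(groups):
        if locant in GROUP_LOCANTS[group]:
            return group
    return None


# Under these toy tables, locant 2 is claimed by acetyl before phenyl is
# ever considered, while locant 3 falls through to phenyl.
two = pick_group_right_to_left("2", ["phenyl", "acetyl"])
three = pick_group_right_to_left("3", ["phenyl", "acetyl"])
```

With these assumed tables the sketch shows why the 2-methyl and 3-methyl names from the report were treated differently: the search stops at the first group, counting from the right, that has the requested locant.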

I have changed this heuristic to first check the adjacent group when the following criteria are satisfied:

  • The locant is of the form \d+[a-z]?'* i.e. numeric
  • Neither a hyphen nor a locant is present between the substituent and the adjacent group, so e.g. 1-pentyl-3-(2-methyl-phenylacetyl)indole, 1-pentyl-3-(2-methyl2phenylacetyl)indole and 1-pentyl-3-(2-methyl-2-phenylacetyl)indole will retain OPSIN's original interpretation.
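The two criteria can be sketched as a pair of checks. This is an illustrative Python sketch, not OPSIN's implementation: the function names and the idea of passing in "the text that follows the substituent" are assumptions made to keep the example small. Only the locant pattern `\d+[a-z]?'*` is taken from the comment above.

```python
import re

# Pattern quoted in the comment: digits, an optional letter, optional primes.
NUMERIC_LOCANT = re.compile(r"\d+[a-z]?'*")


def is_numeric_locant(locant):
    """True when the whole locant matches the numeric pattern."""
    return NUMERIC_LOCANT.fullmatch(locant) is not None


def checks_adjacent_group_first(locant, text_after_substituent):
    """Decide whether the adjacent group should be tried first.

    `text_after_substituent` is whatever follows the substituent inside
    the bracket, e.g. "phenylacetyl" for "2-methylphenylacetyl"
    (a hypothetical simplification of real tokenisation).
    """
    if not is_numeric_locant(locant):
        return False
    # A hyphen or a digit right after the substituent means the name
    # keeps the original right-to-left interpretation.
    return re.match(r"-|\d", text_after_substituent) is None
```

For example, `checks_adjacent_group_first("2", "phenylacetyl")` is true, whereas the hyphenated and explicitly located forms (`"-phenylacetyl"`, `"2phenylacetyl"`, `"-2-phenylacetyl"`) are false, matching the three names listed as retaining the original interpretation.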

In my regression sets, especially a set of polymer names, this change is a uniformly positive improvement, although it only affects names that are formally ambiguous.

Thanks for the bug report. The fixed version is up on the web service. Let me know if it doesn't perform as expected.

Daniel

P.S. OPSIN actually generates only one parse for this name, as there is no ambiguity in tokenizing this name or assigning meaning to the tokens.

@dan2097
Owner Author

dan2097 commented Apr 16, 2011

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


1-pentyl-3-(1-methylphenylacetyl)indole is probably interpreted incorrectly (as in it produces a structure rather than a valency error). OPSIN currently doesn't treat phenyl as being explicitly phen-1-yl, and it really should do.

@dan2097
Owner Author

dan2097 commented Apr 16, 2011

Original comment by Anonymous.


Outstanding! Fastest Debug Ever.

Thanks for the explanation as well, Daniel. I had just read your paper and had alternative parses on my mind, but in future I'll stick to reporting the symptoms and leave the diagnosis to you.

The problem is solved, but just fyi, here are a few more curious/degenerate cases:

  • 1-pentyl-3-(methylphenylacetyl)indole = 1-pentyl-3-(2-methyl-2-phenylacetyl)indole
  • 1-pentyl-3-(1-methylphenylacetyl)indole = 1-pentyl-3-(2-methylphenylacetyl)indole
  • 1-pentyl-3-(3-methyl-2-phenylacetyl)indole = 1-pentyl-3-(3-methylphenylacetyl)indole

All the best, and props to the OPSIN team for this extraordinarily useful service.

/Steve

@dan2097
Owner Author

dan2097 commented Apr 19, 2011

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


Yeah, alternative parses are an important part of OPSIN, but as you can see from the pie chart in that paper, multiple parses are fortunately moderately rare. Ideally, multiple parses should only happen if there are multiple ways of interpreting something that are either non-trivial or impossible to disambiguate, e.g. 2-methylthiophenyl, which could be 2-(methylthio)phenyl or 2-methyl-thiophen-yl.
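To illustrate how one name can tokenize in more than one way, here is a small Python sketch that enumerates every segmentation of a string over a fixed vocabulary. The vocabulary and the function name `segmentations` are hypothetical and far simpler than a real nomenclature grammar; note that "phen" is deliberately left out of the vocabulary, so "phenyl" is never split.

```python
def segmentations(text, vocabulary):
    """Return every way of splitting `text` into tokens from `vocabulary`."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for token in vocabulary:
        if text.startswith(token):
            for rest in segmentations(text[len(token):], vocabulary):
                results.append([token] + rest)
    return results


# Toy vocabulary (an assumption for illustration only).
VOCABULARY = ["methyl", "thio", "phenyl", "thiophen", "yl", "acetyl"]

ambiguous = segmentations("methylthiophenyl", VOCABULARY)
unambiguous = segmentations("methylphenylacetyl", VOCABULARY)
```

Under this toy vocabulary `ambiguous` holds two token sequences (methyl/thio/phenyl and methyl/thiophen/yl), while `unambiguous` holds a single one. That mirrors the distinction drawn in this thread between a name with genuinely different tokenizations and a name like methylphenylacetyl, where the ambiguity lies only in where the locant applies.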

  • 1-pentyl-3-(methylphenylacetyl)indole acting as 1-pentyl-3-(2-methyl-2-phenylacetyl)indole I think is working as intended (although the name is clearly ambiguous)
  • 1-pentyl-3-(1-methylphenylacetyl)indole was being incorrectly interpreted as 1-pentyl-3-(1-methylphen-2-ylacetyl)indole, which clearly makes no sense as the position of the radical on phenyl is always at locant 1. Its current interpretation isn't much better, but at least it is now clearly a case of garbage in, garbage out.
  • Interpreting 1-pentyl-3-(3-methyl-2-phenylacetyl)indole in that way is the only way to generate a structure so I think it is working as intended.

The change which stops phenyl being broken down into phen and yl is now live.

Thanks for the feedback

@dan2097 dan2097 closed this as completed Apr 20, 2011
@dan2097 dan2097 added major bug Something isn't working labels Apr 5, 2020