Discrepancy between sentence segmentation and parse trees #46

Closed
florijanstamenkovic opened this issue Apr 7, 2015 · 3 comments

@florijanstamenkovic
Contributor

I've run into a case where the sentence segmentation given by Tokens.sents doesn't match the parse trees: Tokens.sents does not split the text at the boundaries between independent parse trees. The behavior can be observed on the following text:

"It was a mere past six when we left Baker Street, and it still wanted ten minutes to the hour when we found ourselves in Serpentine Avenue."

I assume this is because sentence segmentation is done by a separate classifier, so I wouldn't call it a bug, but it can be a usage problem, so I am reporting it. In the few examples I've checked manually, the parse trees give better sentence segmentation than whatever Tokens.sents is based on.

My current workaround idea is to follow each token's dependency tree path all the way up to the root, and then use the resulting array of root nodes as sentence labels. This is, however, crude, ugly and inefficient; it would be nice to have a better solution.
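A minimal sketch of that workaround, written against a plain list of head indices, not against the library's API. It assumes each token's head is given as an integer index, that a root token is its own head, and that the heads form well-formed trees with no cycles. The names `root_labels` and `sentences_from_roots` are made up for illustration. Caching each token's root as it is found means every token is climbed past at most once, which addresses the inefficiency of walking each full path separately.

```python
def root_labels(heads):
    """Map each token index to the index of the root of its parse tree.

    Assumption: heads[i] is the index of token i's head, and a root
    token is its own head (heads[r] == r). Cycles are not handled.
    """
    roots = [None] * len(heads)
    for i in range(len(heads)):
        path = []
        j = i
        # Climb until we reach a root or a token whose root is already cached.
        while roots[j] is None and heads[j] != j:
            path.append(j)
            j = heads[j]
        root = roots[j] if roots[j] is not None else j
        roots[j] = root
        # Cache the root for every token visited on the way up.
        for k in path:
            roots[k] = root
    return roots


def sentences_from_roots(heads):
    """Group token indices by root label, ordered by each group's first token."""
    groups = {}
    for i, root in enumerate(root_labels(heads)):
        groups.setdefault(root, []).append(i)
    return sorted(groups.values(), key=lambda group: group[0])


if __name__ == "__main__":
    # Hypothetical heads for six tokens forming two independent trees,
    # rooted at token 1 and token 4.
    heads = [1, 1, 1, 4, 5, 4]
    print(root_labels(heads))
    print(sentences_from_roots(heads))
```

Grouping by root label does not require a tree's tokens to be contiguous, so a non-projective parse could yield interleaved groups; whether that is acceptable depends on how the segments are used.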

@honnibal
Member

honnibal commented Apr 7, 2015

Thanks for the report.

This is fixed in the next version; the orthographic heuristics were only a quick hack. The plan has always been to segment the document using the parse tree: the document is parsed all at once, since the process is linear time. Ultimately the sentences will be joined with meaningful dependencies that indicate the discourse structure.

The next version will be a large update, including preliminary named entity recognition. I hope to have it out within a week.

@florijanstamenkovic
Contributor Author

Excellent, looking forward to it.

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018