I've run into an issue where sentence segmentation (as given by Tokens.sents) doesn't match the parse trees: Tokens.sents does not separate independent parse trees. This behavior can be observed on the following text:
"It was a mere past six when we left Baker Street, and it still wanted ten minutes to the hour when we found ourselves in Serpentine Avenue."
I assume this is because sentence segmentation is done by a separate classifier, so I wouldn't call it a bug, but it can be a usage problem, which is why I'm reporting it. The few examples I've checked manually indicate that the parse trees give better sentence segmentation than whatever Tokens.sents is based on.
My current workaround idea is to follow each token's dependency tree path all the way up to the root, and then use the resulting array of root nodes as sentence labels. This is crude, ugly, and inefficient, however; it would be nice to have a better solution.
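The workaround above can be sketched in plain Python with a hypothetical head-index array standing in for the dependency parse (this is illustrative data, not spaCy's actual API — the array name and convention that a root points to itself are assumptions for the sketch):

```python
def sentence_labels(heads):
    """Label each token with the root of its dependency tree.

    `heads` maps each token index to the index of its head; a root
    token points to itself. Tokens that share a root belong to the
    same parse tree, and therefore to the same sentence. Assumes the
    parse is a well-formed forest (no cycles).
    """
    labels = []
    for i in range(len(heads)):
        node = i
        # Walk the head chain until we reach a self-attached root.
        while heads[node] != node:
            node = heads[node]
        labels.append(node)
    return labels


# Two independent trees: tokens 0-2 rooted at 1, tokens 3-4 rooted at 3,
# so the text splits into two sentences.
heads = [1, 1, 1, 3, 3]
print(sentence_labels(heads))  # [1, 1, 1, 3, 3]
```

Consecutive tokens with the same label then form one sentence; this walk is O(n · depth) per document, which is part of why it feels inefficient compared to a single pass over the tree.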
This is fixed in the next version; the orthographic heuristics were only a quick hack. The plan has always been to segment the document using the parse tree: since parsing is linear-time, the whole document can be parsed at once. Ultimately the sentences will be joined with meaningful dependencies that indicate the discourse structure.
The next version will be a large update, including preliminary named entity recognition. I hope to have it out within a week.