
Sentence tokenization in spaCy? #93

Closed
c0nn3r opened this issue Sep 9, 2015 · 12 comments

Comments


c0nn3r commented Sep 9, 2015

While doing sentence tokenization in spaCy, I ran into the following problem:

from __future__ import unicode_literals, print_function
from spacy.en import English
nlp = English()
doc = nlp('Hello, world. Here are two sentences.')
sentence = doc.sents.next()

It was unclear to me how to get the text of the sentence object. I tried using dir() to find a method that would allow this and was unsuccessful. Any code that I have found from others trying to do sentence tokenization doesn't seem to function properly.


honnibal commented Sep 9, 2015

The sentence is a Span object: http://spacy.io/docs/#api

The attributes on the Span object are a slightly rough part of the API at the moment. A Span.text attribute will be in upcoming versions.

For now, the best way is to do:

''.join(token.string for token in sentence)

The .string attribute includes trailing whitespace, so this will give you the original text of the sentence, verbatim.

There's also a sentence.string attribute. However, this also includes the trailing whitespace --- so you'll need to do sentence.string.strip() to get just the string.

If this all seems really weird: the over-arching idea is that you should usually only need the string for output. That's why the padding is there --- to make it easy to construct the original string from the tokens, so that mark-up can be attached to the original string.

The thing that I've tried to make really easy in the API is using the token objects. So, if your use case really needs the sentence string, and can't work with the tokens, I'd be interested to understand it a bit better.
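The round-trip property described above (trailing whitespace stays attached to each token, so concatenation reconstructs the input) can be illustrated without spaCy. A minimal sketch, using only the standard library; `tokenize_with_ws` is a toy stand-in for spaCy's tokenizer, not its actual implementation:

```python
import re

def tokenize_with_ws(text):
    # Each token keeps its trailing whitespace, mirroring the
    # behaviour of token.string described above.
    return [m.group(0) for m in re.finditer(r"\S+\s*", text)]

raw = "Hello, world. Here are two sentences."
tokens = tokenize_with_ws(raw)

# Concatenating the padded tokens reconstructs the input verbatim,
# which is what makes attaching mark-up to the original string easy.
assert "".join(tokens) == raw
```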


c0nn3r commented Sep 9, 2015

My goal currently is to use spaCy to parse this:

raw_text = 'Hello, world. Here are two sentences.'

to this resulting output:

sentences = [ 'Hello, world.', 'Here are two sentences.' ]


honnibal commented Sep 9, 2015

from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc]


c0nn3r commented Sep 9, 2015

My result was:

[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']

Is there further work needed to find the resulting sentences?


honnibal commented Sep 9, 2015

Did you install the data?

python -m spacy.en.download all


c0nn3r commented Sep 9, 2015

I tried reinstalling the data, but the code still returns the same result.


widdakay commented Sep 9, 2015

I'm also running into the same problem. Originally I had not downloaded the dataset, but downloading it didn't fix anything.
It still responds with the same result:

[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']


honnibal commented Sep 9, 2015

Argh.

The last snippet I sent you was wrong --- sorry, it's late and I was hasty.

from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

The Doc object has an attribute, sents, which gives you Span objects for the sentences.
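For comparison only, the same output on this toy input can be produced by a naive regex splitter. This is not how spaCy segments sentences (hence the need for the downloaded model data), but it shows the target behaviour concretely:

```python
import re

def naive_sentences(text):
    # Toy splitter: break after sentence-final punctuation followed
    # by whitespace. Unlike spaCy's model-based segmentation, this
    # breaks on abbreviations like "e.g." or "Dr.".
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_sentences("Hello, world. Here are two sentences."))
# → ['Hello, world.', 'Here are two sentences.']
```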


widdakay commented Sep 9, 2015

Thank you! That worked.

Now it responds with this, as expected.

[u'Hello, world.', u'Here are two sentences.']


c0nn3r commented Sep 9, 2015

👍 Thank you so much! It works!

c0nn3r closed this as completed Sep 9, 2015
lynochka commented

I have the same problem with spaCy 2.0.7:

def sentence_tokenize(text):
    doc = nlp(text)
    return [sent.string.strip() for sent in doc.sents]

sentence_tokenize("Navn den er kjøpt i er Sigrid Trasti")

['Navn den', 'er kjøpt', 'i er', 'Sigrid Trasti']


lock bot commented May 7, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 7, 2018