
Sentence tokenization in spaCy? #93

Closed
c0nn3r opened this issue Sep 9, 2015 · 12 comments

Comments


c0nn3r commented Sep 9, 2015

While doing sentence tokenization in spaCy, I ran into the following problem:

from __future__ import unicode_literals, print_function
from spacy.en import English
nlp = English()
doc = nlp('Hello, world. Here are two sentences.')
sentence = doc.sents.next()

It was unclear to me how to get the text of the sentence object. I tried using dir() to find a method that would allow this and was unsuccessful. Any code that I have found from others trying to do sentence tokenization doesn't seem to function properly.


honnibal commented Sep 9, 2015

The sentence is a Span object: http://spacy.io/docs/#api

The attributes on the Span object are a slightly rough part of the API at the moment. A Span.text attribute will be in upcoming versions.

For now, the best way is to do:

''.join(token.string for token in sentence)

The .string attribute includes trailing whitespace, so this will give you the original text of the sentence, verbatim.

There's also a sentence.string attribute. However, this also includes the trailing whitespace --- so you'll need to do sentence.string.strip() to get just the string.

If this all seems really weird: the over-arching idea is that you should usually only need the string for output. That's why the padding is there --- to make it easy to construct the original string from the tokens, so that mark-up can be attached to the original string.

The thing that I've tried to make really easy in the API is using the token objects. So, if your use case really needs the sentence string, and can't work with the tokens, I'd be interested to understand it a bit better.
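The round-trip property described above (trailing whitespace stays attached to each token, so concatenation reconstructs the input) can be illustrated without spaCy. A minimal sketch, using only the standard library; `tokenize_with_ws` is a toy stand-in for spaCy's tokenizer, not its actual implementation:

```python
import re

def tokenize_with_ws(text):
    # Each token keeps its trailing whitespace, mirroring the
    # behaviour of token.string described above.
    return [m.group(0) for m in re.finditer(r"\S+\s*", text)]

raw = "Hello, world. Here are two sentences."
tokens = tokenize_with_ws(raw)

# Concatenating the padded tokens reconstructs the input verbatim,
# which is what makes attaching mark-up to the original string easy.
assert "".join(tokens) == raw
```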


c0nn3r commented Sep 9, 2015

My goal currently is to use spaCy to parse this:

raw_text = 'Hello, world. Here are two sentences.'

to this resulting output:

sentences = [ 'Hello, world.', 'Here are two sentences.' ]


honnibal commented Sep 9, 2015

from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc]


c0nn3r commented Sep 9, 2015

My result was:

[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']

Is there further work needed to find the resulting sentences?


honnibal commented Sep 9, 2015

Did you install the data?

python -m spacy.en.download all


c0nn3r commented Sep 9, 2015

I tried reinstalling the data, but the code still returns the same result.


widdakay commented Sep 9, 2015

I'm also running into the same problem. Originally I had not downloaded the dataset, but downloading it didn't fix anything.
It still responds with the same result:

[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']


honnibal commented Sep 9, 2015

Argh.

The last snippet I sent you was wrong --- sorry, it's late and I was hasty.

from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

The Doc object has an attribute, sents, which gives you Span objects for the sentences.
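For comparison only, the same output on this toy input can be produced by a naive regex splitter. This is not how spaCy segments sentences (hence the need for the downloaded model data), but it shows the target behaviour concretely:

```python
import re

def naive_sentences(text):
    # Toy splitter: break after sentence-final punctuation followed
    # by whitespace. Unlike spaCy's model-based segmentation, this
    # breaks on abbreviations like "e.g." or "Dr.".
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_sentences("Hello, world. Here are two sentences."))
# → ['Hello, world.', 'Here are two sentences.']
```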


widdakay commented Sep 9, 2015

Thank you! That worked.

Now it responds with this, as expected.

[u'Hello, world.', u'Here are two sentences.']


c0nn3r commented Sep 9, 2015

👍 Thank you so much! It works!

c0nn3r closed this as completed Sep 9, 2015
lynochka commented

I have the same problem with spaCy 2.0.7:

def sentence_tokenize(text):
    doc = nlp(text)
    return [sent.string.strip() for sent in doc.sents]

sentence_tokenize("Navn den er kjøpt i er Sigrid Trasti")

['Navn den', 'er kjøpt', 'i er', 'Sigrid Trasti']


lock bot commented May 7, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 7, 2018