Sentence tokenization in spaCy? #93
The sentence is a Span object (see http://spacy.io/docs/#api). The attributes on the Span object are a slightly rough part of the API at the moment; a Span.text attribute will be added in an upcoming version. For now, the best way is the following:
The .string attribute includes trailing whitespace, so joining the token strings gives you the original text of the sentence, verbatim. There's also a sentence.string attribute; however, it too includes the trailing whitespace, so you'll need sentence.string.strip() to get just the string.

If this all seems weird, the overarching idea is that you should usually only need the string for output. That's why the padding is there: it makes it easy to reconstruct the original string from the tokens, so that mark-up can be attached to the original text. The thing I've tried to make really easy in the API is working with the token objects. So if your use case really needs the sentence string and can't work with the tokens, I'd be interested to understand it a bit better.
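The whitespace-padding design described above can be illustrated with a small sketch in plain Python (not spaCy itself): each token keeps the whitespace that followed it, so concatenating the padded token strings reproduces the original text verbatim. The `Token` class and `.string` property here are stand-ins for illustration, not spaCy's actual implementation.

```python
class Token:
    """Toy token that remembers its trailing whitespace."""

    def __init__(self, text, trailing_ws):
        self.text = text
        self.trailing_ws = trailing_ws  # "" or " ", the padding after the token

    @property
    def string(self):
        # Token text plus trailing padding, mirroring the old spaCy .string idea.
        return self.text + self.trailing_ws


raw_text = "Hello, world."
tokens = [
    Token("Hello", ""),
    Token(",", " "),
    Token("world", ""),
    Token(".", ""),
]

# Joining the padded strings reconstructs the original text exactly.
sentence_string = "".join(tok.string for tok in tokens)
assert sentence_string == raw_text
print(sentence_string.strip())
```

This is why stripping is needed for display: the padding exists so round-tripping from tokens back to the source text is lossless.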
My goal currently is to use spaCy to parse this:

```python
raw_text = 'Hello, world. Here are two sentences.'
```

into this resulting output:

```python
sentences = ['Hello, world.', 'Here are two sentences.']
```
My result was:

```python
[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']
```

Is there further work needed to find the resulting sentences?
Did you install the data?
I tried reinstalling the data, but the code still returns the same result.
I am also running into the same problem. Originally I had not downloaded the dataset, but downloading it didn't fix the problem:

```python
[u'Hello', u',', u'world', u'.', u'Here', u'are', u'two', u'sentences', u'.']
```
Argh. The last snippet I sent you was wrong; sorry, it's late and I was hasty.
The Doc object has an attribute, sents, which gives you Span objects for the sentences.
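For readers landing on this old thread: in current spaCy, sentence spans expose a `.text` attribute directly, so no stripping is needed. A minimal sketch, assuming spaCy v3 is installed; it uses `spacy.blank` plus the rule-based `sentencizer` component so that no trained model download is required:

```python
import spacy

# Blank English pipeline; the rule-based sentencizer sets sentence
# boundaries on punctuation, so no trained model is needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Hello, world. Here are two sentences.")
sentences = [sent.text for sent in doc.sents]
print(sentences)  # ['Hello, world.', 'Here are two sentences.']
```

With a full trained pipeline (e.g. `en_core_web_sm`), the dependency parser sets the sentence boundaries instead, which handles harder cases than punctuation rules alone.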
Thank you! That worked. Now it responds with this, as expected:

```python
[u'Hello, world.', u'Here are two sentences.']
```
👍 Thank you so much! It works!
I have the same problem with spaCy 2.0.7:

```python
def sentence_tokenize(text):
    ...

sentence_tokenize("Navn den er kjøpt i er Sigrid Trasti")
# ['Navn den', 'er kjøpt', 'i er', 'Sigrid Trasti']
```
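The spurious mid-sentence splits in the report above suggest boundaries coming from a model not suited to the text. A hypothetical workaround sketch using current spaCy (v3 API; in v2.0.x the equivalent was `nlp.add_pipe(nlp.create_pipe("sentencizer"))`): a blank Norwegian Bokmål pipeline with the rule-based sentencizer splits only on sentence-final punctuation. The second sentence below is an invented example ("This is a new sentence.") added for illustration.

```python
import spacy

# Blank Norwegian Bokmål pipeline with punctuation-based sentence splitting;
# no trained model download required.
nlp = spacy.blank("nb")
nlp.add_pipe("sentencizer")

doc = nlp("Navn den er kjøpt i er Sigrid Trasti. Dette er en ny setning.")
print([sent.text for sent in doc.sents])
```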
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
While trying to do sentence tokenization in spaCy, I ran into the following problem: it was unclear to me how to get the text of the sentence object. I tried using dir() to find a method that would allow this, without success. Any code that I have found from others trying to do sentence tokenization doesn't seem to function properly.