segment sentences on newlines #2299

Closed
ghost opened this issue May 5, 2018 · 2 comments
Labels
usage General spaCy usage

Comments

ghost commented May 5, 2018

Is there a way to treat each newline in a document as a sentence boundary? I did not find this in the docs.
For example:

import spacy

nlp = spacy.load('en')
text = " I went to the store and got beck. he said.\n well,  thats should be ok."

doc = nlp(text)
for sent in doc.sents:
    print(sent)

The result is:
I went to the store and got beck.
he said.
well, thats should be ok.

whereas the desired result is:
I went to the store and got beck. he said.
well, thats should be ok.

Thanks

ines added the usage General spaCy usage label May 7, 2018
ines (Member) commented May 7, 2018

Sure – if that's what you want, you can implement a custom sentence segmentation strategy using the SentenceSegmenter hook. This isn't documented well at the moment, but I'll put this on my list for examples to add to the docs 😊

The SentenceSegmenter is a pipeline component that can be initialised with a strategy argument (the function used to compute the boundaries). For example:

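import spacy
from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    # Yield one Span per line; whitespace tokens may hold more than the newline itself.
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline and not word.is_space:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif '\n' in word.text:
            seen_newline = True
    if start < len(doc):
        yield doc[start:len(doc)]

nlp = spacy.load('en')
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd, first=True)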

Setting first=True adds it before all other pipeline components, including the parser (which normally sets the sentence boundaries). The parser should respect boundaries that are already set, and ideally you might also see better predictions. However, if you find that the POS tags and dependency labels are less accurate this way, you can add it last in the pipeline by setting last=True instead.
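
To sanity-check the boundary rule a newline strategy needs, without loading a model, the logic can be sketched in plain Python. This is only an illustration under stated assumptions: `newline_spans` is a hypothetical helper (not part of spaCy), and the token list is hand-written rather than produced by spaCy's tokenizer.

```python
def newline_spans(tokens):
    """Return (start, end) index pairs, one pair per line of tokens.

    Hypothetical helper for illustration; it is not a spaCy API.
    """
    spans = []
    start = 0
    seen_newline = False
    for i, tok in enumerate(tokens):
        if seen_newline and not tok.isspace():
            # First non-whitespace token after a newline opens a new span.
            spans.append((start, i))
            start = i
            seen_newline = False
        elif "\n" in tok:
            seen_newline = True
    if start < len(tokens):
        # Whatever remains after the last newline forms the final span.
        spans.append((start, len(tokens)))
    return spans


# Hand-written tokens (an assumption, not real tokenizer output).
tokens = ["I", "went", "to", "the", "store", ".", "he", "said", ".",
          "\n", "well", ",", "ok", "."]
spans = newline_spans(tokens)
lines = [" ".join(tokens[s:e]).strip() for s, e in spans]
print(spans)  # [(0, 10), (10, 14)]
print(lines)
```

Each pair corresponds to the slice a strategy function would yield as `doc[start:end]`, so the two periods before the newline stay inside the first span instead of ending a sentence.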

Edit: https://spacy.io/usage/linguistic-features#sbd-custom

ines closed this as completed May 7, 2018
lock bot commented Jun 10, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018