Is there any way to use "doc.spans" in 01_parse.py? #142

Open

nonstoprunning opened this issue Aug 4, 2021 · 0 comments

nonstoprunning commented Aug 4, 2021

Hi,
I am trying to build a sense2vec model with new data. I have made a few changes to 01_parse.py.
First, I removed the default ner pipe that comes with "en_core_web_lg".
Then I added a new Language.component in which I identify Spans associated with new entities (new labels) in a doc.
Sometimes I would like to assign the same Span[x, y] to more than one entity, but I cannot.
My question:
I have read about the changes in spaCy v3.1. Is there a way to use "doc.spans" (or something similar) in 01_parse.py so that spaCy's internal algorithms take overlapping Spans into account?

from spacy.language import Language
from spacy.tokens import Span

# `matcher` is a spacy.matcher.Matcher created elsewhere in the script
@Language.component("name_comp")
def my_component(doc):
    matches = matcher(doc)
    seen_tokens = set()
    new_entities = []
    entities = doc.ents
    for match_id, start, end in matches:
        # `end` is exclusive, so end - 1 is the last token of the match
        if start not in seen_tokens and end - 1 not in seen_tokens:
            new_entities.append(Span(doc, start, end, label=match_id))
            # drop existing entities that overlap the new span
            entities = [
                e for e in entities if not (e.start < end and e.end > start)
            ]
            seen_tokens.update(range(start, end))
    doc.ents = tuple(entities) + tuple(new_entities)
    return doc
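
To illustrate what I mean by "doc.spans", here is a minimal sketch of the kind of component I have in mind. It keeps every match, including overlapping ones, in a span group instead of doc.ents. The component name "overlapping_spans", the span key "overlapping", and the two patterns are made up for this example, and I use a blank English pipeline so that no model download is needed. It only shows that doc.spans can hold the same tokens under several labels; it does not show that 01_parse.py or sense2vec reads these spans.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")

# Two hypothetical patterns that match exactly the same token,
# so every match overlaps with a match from the other pattern.
matcher = Matcher(nlp.vocab)
matcher.add("FRUIT", [[{"LOWER": "apple"}]])
matcher.add("COMPANY", [[{"LOWER": "apple"}]])


@Language.component("overlapping_spans")  # hypothetical component name
def overlapping_spans(doc):
    # Keep every match; a span group in doc.spans may contain overlaps,
    # whereas doc.ents allows each token to belong to one entity at most.
    doc.spans["overlapping"] = [
        Span(doc, start, end, label=match_id)
        for match_id, start, end in matcher(doc)
    ]
    return doc


nlp.add_pipe("overlapping_spans")
doc = nlp("Apple sells apple juice")

for span in doc.spans["overlapping"]:
    print(span.start, span.end, span.text, span.label_)
```

With this, the same token range is stored once per label, and doc.ents is left untouched.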

Thanks in advance,
Paula
