append the noun_chunk generator object #3856

Closed
Fourthought opened this issue Jun 17, 2019 · 3 comments


Fourthought commented Jun 17, 2019

Feature description

Is there a way to append to the doc.noun_chunks generator object in the way it's possible to append to the doc.ents tuple?

I can see that doc.ents can be extended with doc.ents += (new_entity,), but I've been unable to recreate this with itertools.chain() for doc.noun_chunks.
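
For reference, a minimal sketch of the doc.ents pattern referred to above (the OUTGROUP label and span indices are illustrative only):

import spacy
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')
doc = nlp('the enemy of America is')

# doc.ents has a setter, so a new Span can be appended to the tuple
new_entity = Span(doc, 0, 2, label='OUTGROUP')  # the span "the enemy"
doc.ents += (new_entity,)

# doc.noun_chunks has no setter; itertools.chain() only builds a new
# iterator and does not modify anything on the Doc itself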

The new noun_chunks are based on patterns identified by the pattern matcher.

Reproducing example code

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
doc = nlp('the enemy of America is')

print(*doc.noun_chunks)
# >> the enemy
# >> America

pattern = [
    [{'POS': 'DET', 'OP': '?'}, {'LEMMA': 'enemy'}, {'LEMMA': 'of'},
     {'POS': 'DET', 'OP': '?'}, {'POS': 'PROPN', 'OP': '+'}]
]

matcher = Matcher(nlp.vocab)
matcher.add('OUTGROUP', None, *pattern)

for match_id, start, end in matcher(doc):
    print(doc[start:end])
# >> the enemy of America
# >> enemy of America

In this case, would it be possible to add "the enemy of America" to doc.noun_chunks?


ines commented Jun 18, 2019

The doc.noun_chunks iterator is read-only, because it's computed by a getter function that uses the tokens' dependencies and part-of-speech tags. See lang/en/syntax_iterators.py for an example of this.
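
Roughly speaking (a greatly simplified sketch, not the actual contents of that file), such a getter walks the parse and yields one span per noun-like token acting as a nominal argument:

def simple_noun_chunks(doc):
    # Simplified illustration of a syntax iterator: yield a span for each
    # noun/proper noun/pronoun whose dependency label marks it as nominal.
    nominal_deps = ('nsubj', 'dobj', 'pobj', 'attr', 'conj', 'ROOT')
    for token in doc:
        if token.pos_ in ('NOUN', 'PROPN', 'PRON') and token.dep_ in nominal_deps:
            yield doc[token.left_edge.i : token.i + 1]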

However, you could use a custom extension attribute to create your own custom noun chunks property on the Doc, and then make it return the default noun chunks plus your own spans:

from spacy.tokens import Doc

def get_custom_noun_chunks(doc):
    default_noun_chunks = list(doc.noun_chunks)
    # Add your logic with the matcher etc. here
    custom_noun_chunks = get_your_custom_chunks(doc)
    return default_noun_chunks + custom_noun_chunks

Doc.set_extension("custom_noun_chunks", getter=get_custom_noun_chunks)

You can then access doc._.custom_noun_chunks and it should return a list of the combined spans.
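
As a rough sketch of what get_your_custom_chunks could look like, reusing the OUTGROUP pattern from the question (the function name is just the placeholder used in the snippet above):

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
matcher.add('OUTGROUP', None,
            [{'POS': 'DET', 'OP': '?'}, {'LEMMA': 'enemy'}, {'LEMMA': 'of'},
             {'POS': 'DET', 'OP': '?'}, {'POS': 'PROPN', 'OP': '+'}])

def get_your_custom_chunks(doc):
    # One Span per matcher hit, so the getter above can concatenate them
    # with the default noun chunks
    return [doc[start:end] for match_id, start, end in matcher(doc)]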

ines added the feat / doc (Feature: Doc, Span and Token objects) and usage (General spaCy usage) labels on Jun 18, 2019
ines closed this as completed on Jun 18, 2019
Fourthought (Author)

Great, thank you Ines
