
Why does chapter 2 section 7 solution use iteration instead of matching? #59

Closed
ab-10 opened this issue May 30, 2020 · 2 comments
Labels
content Issues and PRs related to course content

Comments


ab-10 commented May 30, 2020

Could someone please explain why the section on data structures best practices uses iteration instead of matching to find a proper noun before a verb? Is it simply to make the pairing between the naive solution and the recommended one more direct, or is there additional rationale on when to use iteration instead of the excellent matching functionality provided by spaCy?

Here's the relevant solution code snippet provided:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

And here's an example of how I would construct a matcher to extract a proper noun followed by a verb:

from spacy.matcher import Matcher

doc = nlp("Berlin is a nice city")
matcher = Matcher(nlp.vocab)
matcher.add("Proper nouns", None, [{"POS": "PROPN"}, {"POS": "VERB"}])

matches = matcher(doc)
# Each match is a (match_id, start, end) tuple of token indices
for match_id, start, end in matches:
    print("Found proper noun before a verb:", doc[start])
@ines ines added the content Issues and PRs related to course content label Jun 1, 2020
Member

ines commented Jun 1, 2020

Hi! This is a good question and definitely a valid point. There's no particular reason and the Matcher would definitely be the more elegant and scalable solution in this case. There are other cases where you might still want to iterate over the tokens – for instance, if you're working with the dependency tree or if you're extracting additional information after matching (e.g. check the previous sentence for something).

This particular exercise is based on a real example I once saw on Stack Overflow and I wanted to get the point of using the Doc as the "single source of truth" across. And since rewrite exercises are often a bit more challenging than "fill in the gaps", I didn't want to introduce any new concepts or new token attributes here, and keep the result a bit closer to the original code.

tl;dr: No particular reason, mostly to make the rewrite exercise more straightforward and not ask for too much at once. Matching is perfectly reasonable, too 🙂

Author

ab-10 commented Jun 1, 2020

Hi @ines, thank you for the reply (and the exercises too); that clears things up and adds more context to the tradeoffs between the two methods.

@ab-10 ab-10 closed this as completed Jun 1, 2020