Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span Finder Suggester #10

Merged
merged 66 commits into from Jun 7, 2022

Conversation

thomashacker
Copy link
Contributor

@thomashacker thomashacker commented Mar 28, 2022

This PR adds a new experimental component for learning span boundaries and a custom suggester function for spancat.
It further adds a spaCy project showcasing how to use the SpanFinder component on 3 different datasets (Healthsea, ToxicSpans, Genia) with 2 configurations (tok2vec & transformer). The project also provides the possibility to train spancat with ngram and compare it to SpanFinder with a custom evaluation script that calculates the performance and overall coverage of the suggester functions.

Features

  • spaCy project for comparing SpanFinder vs Ngram
  • SpanFinder model
  • SpanFinder component
  • SpanFinder suggester
  • Unit tests for component, model and suggester

@thomashacker thomashacker added the enhancement New feature or request label Mar 28, 2022
@adrianeboyd
Copy link
Contributor

Very nice! This is going to be annoying, but I think that we need to avoid the abbreviation sbd because it's used too much for "sentence boundary detection".

@thomashacker
Copy link
Contributor Author

@adrianeboyd how about instead of "sbd" -> "span_bd"? 😄

Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great Edi!

From a practical point, I'm wondering whether we need all the different config files in all the different subdirectories? It's great to support various different datasets, but what if we could make the decision very early on:

  • You process one of the three available datasets, with one of the three specific commands (cf below)
  • The processed .spacy output is stored in the same subdir of the project, regardless of what dataset had been processed
  • Now, all consequent steps don't have to bother about where the data came from originally

Is that feasible? Or are there things you're doing different in the scripts, depending on what dataset it was (besides preprocessing)

projects/span_boundary_detection/project.yml Outdated Show resolved Hide resolved
projects/span_boundary_detection/README.md Outdated Show resolved Hide resolved
@thomashacker
Copy link
Contributor Author

Alright, I've renamed every reference to SpanFinder 😄 I've rebuilt the set_annotation() logic so that the SpanFinder component directly produces spans and saves them to doc.spans["span_finder_candidates"] instead of having an additional token attribute. The suggester uses doc.spans["span_finder_candidates"] to produce the Ragged for the spancat.

I've also adjusted the spaCy project, added more commands, and reduced the number of configs, so that now there is only one config for all datasets.

@thomashacker thomashacker changed the title Span Boundary Detection Span Finder Suggester Apr 13, 2022
@adrianeboyd
Copy link
Contributor

As the next minor adjustment, I think that span_finder_candidates should be a configurable setting.

@svlandeg svlandeg self-requested a review April 20, 2022 12:31
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
@adrianeboyd
Copy link
Contributor

Can you rename:

  • candidates_key -> predicted_key
  • reference_key -> training_key

In the project, I'd also like to have paths.spans_key as vars.spans_key or something else instead, since paths seems kind of confusing.

Copy link
Contributor

@adrianeboyd adrianeboyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I found everything, but can you do another pass through the docs, project, and tests?

README.md Outdated Show resolved Hide resolved
projects/span_finder/README.md Outdated Show resolved Hide resolved
projects/span_finder/configs/span_finder/config_trf.cfg Outdated Show resolved Hide resolved
spacy_experimental/span_finder/tests/test_span_finder.py Outdated Show resolved Hide resolved
spacy_experimental/span_finder/tests/test_span_finder.py Outdated Show resolved Hide resolved
spacy_experimental/span_finder/tests/test_span_finder.py Outdated Show resolved Hide resolved
@adrianeboyd
Copy link
Contributor

The remaining issues:

  • keep candidates_key as the key name for the spans suggester
  • one more pass through the docs and project for the other key renaming
  • a full test of the project with various option combinations after the renaming

@thomashacker
Copy link
Contributor Author

Updated the key in the suggester and looked over the docs and the project. I also checked the spaCy project and ran the workflows with different configurations.

@svlandeg svlandeg merged commit 10b2178 into explosion:master Jun 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
5 participants