Span Finder Suggester #10

thomashacker · 2022-03-28T12:22:42Z

This PR adds a new experimental component for learning span boundaries and a custom suggester function for spancat.
It also adds a spaCy project that shows how to use the SpanFinder component on three datasets (Healthsea, ToxicSpans, Genia) with two configurations (tok2vec and transformer). The project can also train spancat with the ngram suggester and compare it to SpanFinder, using a custom evaluation script that calculates the performance and overall coverage of the suggester functions.
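To make "coverage" concrete, here is a minimal, library-free sketch. It assumes coverage means the fraction of gold spans that appear among a suggester's candidates; the evaluation script's actual definition is not shown in this conversation. The names ngram_candidates and coverage are hypothetical, and spans are plain (start, end) token offsets rather than spaCy objects.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def ngram_candidates(n_tokens: int, sizes: List[int]) -> Set[Span]:
    """Enumerate every n-gram span of the given sizes over a document."""
    return {
        (start, start + size)
        for size in sizes
        for start in range(n_tokens - size + 1)
    }


def coverage(gold: Set[Span], candidates: Set[Span]) -> float:
    """Fraction of gold spans found among the candidates (assumed definition)."""
    if not gold:
        return 1.0  # nothing to cover
    return len(gold & candidates) / len(gold)


# Hypothetical 6-token document with two gold spans.
gold_spans = {(0, 2), (1, 5)}
candidates = ngram_candidates(6, sizes=[1, 2, 3])
print(len(candidates), coverage(gold_spans, candidates))  # 15 0.5
```

The four-token gold span (1, 5) is missed because the n-gram sizes stop at 3. This is the trade-off the comparison is about: an n-gram suggester reaches longer spans only by enumerating many more candidates.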

Features

spaCy project for comparing SpanFinder vs Ngram
SpanFinder model
SpanFinder component
SpanFinder suggester
Unit tests for component, model and suggester

spacy_experimental/span_boundary_detection/sbd_component.py

adrianeboyd · 2022-03-28T12:44:41Z

Very nice! This is going to be annoying, but I think that we need to avoid the abbreviation sbd because it's used too much for "sentence boundary detection".

spacy_experimental/span_boundary_detection/sbd_component.py

spacy_experimental/span_boundary_detection/sbd_model.py

spacy_experimental/span_boundary_detection/sbd_suggester.py

thomashacker · 2022-03-30T12:09:20Z

@adrianeboyd how about "span_bd" instead of "sbd"? 😄

svlandeg

This is great, Edi!

From a practical point of view, I'm wondering whether we need all the different config files in all the different subdirectories. It's great to support various datasets, but what if we could make the decision very early on:

You process one of the three available datasets, with one of the three specific commands (cf. below)
The processed .spacy output is stored in the same subdir of the project, regardless of which dataset was processed
Now, all subsequent steps don't have to care where the data came from originally

Is that feasible? Or are there things you're doing differently in the scripts, depending on which dataset it is (besides preprocessing)?

projects/span_boundary_detection/project.yml

projects/span_boundary_detection/README.md

thomashacker · 2022-04-12T12:37:21Z

Alright, I've renamed every reference to SpanFinder 😄 I've rebuilt the set_annotation() logic so that the SpanFinder component directly produces spans and saves them to doc.spans["span_finder_candidates"] instead of having an additional token attribute. The suggester uses doc.spans["span_finder_candidates"] to produce the Ragged for the spancat.
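As a rough illustration of a component producing spans directly from boundary predictions, the sketch below pairs predicted start tokens with predicted end tokens. The pairing rule, the threshold, the max_length parameter and the name pair_boundaries are all assumptions made for this example; the component's actual set_annotation() logic is not shown in this conversation. A plain dict stands in for doc.spans.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def pair_boundaries(
    start_scores: List[float],
    end_scores: List[float],
    threshold: float = 0.5,
    max_length: Optional[int] = None,
) -> List[Span]:
    """Pair each likely start token with every likely end token at or after it."""
    starts = [i for i, score in enumerate(start_scores) if score >= threshold]
    ends = [i for i, score in enumerate(end_scores) if score >= threshold]
    spans = []
    for start in starts:
        for end in ends:
            if end < start:
                continue  # an end token cannot precede its start token
            if max_length is not None and end - start + 1 > max_length:
                continue  # skip spans longer than the allowed length
            spans.append((start, end + 1))  # convert to an exclusive end offset
    return spans


# Hypothetical per-token scores for a 4-token document.
start_scores = [0.9, 0.1, 0.8, 0.2]
end_scores = [0.1, 0.7, 0.2, 0.95]

# Stand-in for doc.spans: the candidates are stored under a key.
spans_by_key: Dict[str, List[Span]] = {
    "span_finder_candidates": pair_boundaries(start_scores, end_scores)
}
print(spans_by_key["span_finder_candidates"])  # [(0, 2), (0, 4), (2, 4)]
```

Storing the result under a spans key, rather than on a token attribute, lets any later step read the candidates the same way it reads other span groups.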

I've also adjusted the spaCy project, added more commands, and reduced the number of configs, so that now there is only one config for all datasets.

adrianeboyd · 2022-04-14T09:44:25Z

As the next minor adjustment, I think that span_finder_candidates should be a configurable setting.
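One way to make the key configurable is to build the suggester through a function that takes the key as an argument. The sketch below is library-free and only models the idea: build_suggester is a hypothetical name, each document is a plain dict standing in for doc.spans, and the returned (data, lengths) pair mirrors the layout of a Ragged array (one flat list of [start, end] rows plus the number of rows per document) rather than creating a real one.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]             # (start, end) token offsets
SpansByKey = Dict[str, List[Span]]  # stand-in for doc.spans
RaggedLike = Tuple[List[List[int]], List[int]]  # (flat rows, rows per doc)


def build_suggester(
    candidates_key: str = "span_finder_candidates",
) -> Callable[[List[SpansByKey]], RaggedLike]:
    """Return a suggester that reads candidates from the configured key."""

    def suggester(docs: List[SpansByKey]) -> RaggedLike:
        data: List[List[int]] = []
        lengths: List[int] = []
        for spans_by_key in docs:
            # A document without the key simply contributes zero candidates.
            spans = spans_by_key.get(candidates_key, [])
            data.extend([start, end] for start, end in spans)
            lengths.append(len(spans))
        return data, lengths

    return suggester


# Usage with a non-default key.
suggester = build_suggester("my_candidates")
docs = [{"my_candidates": [(0, 2), (3, 4)]}, {}, {"my_candidates": [(1, 3)]}]
print(suggester(docs))  # ([[0, 2], [3, 4], [1, 3]], [2, 0, 1])
```

Because the key is captured when the suggester is built, the same value can be set once in a config and shared with the component that writes the candidates.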

spacy_experimental/span_finder/span_finder_component.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

spacy_experimental/span_finder/span_finder_component.py

adrianeboyd · 2022-06-01T11:50:34Z

Can you rename:

candidates_key -> predicted_key
reference_key -> training_key

In the project, I'd also like to have paths.spans_key as vars.spans_key or something else instead, since paths seems kind of confusing.

adrianeboyd

I'm not sure I found everything, but can you do another pass through the docs, project, and tests?

README.md

projects/span_finder/README.md

projects/span_finder/configs/span_finder/config_tok2vec.cfg

projects/span_finder/configs/span_finder/config_trf.cfg

spacy_experimental/span_finder/tests/test_span_finder.py

adrianeboyd · 2022-06-02T12:10:21Z

The remaining issues:

keep candidates_key as the key name for the spans suggester
one more pass through the docs and project for the other key renaming
a full test of the project with various option combinations after the renaming

thomashacker · 2022-06-03T09:02:25Z

Updated the key in the suggester and looked over the docs and the project. I also checked the spaCy project and ran the workflows with different configurations.

projects/span_finder/README.md

projects/span_finder/scripts/preprocessing/preprocess_healthsea.py

thomashacker and others added 15 commits February 10, 2022 20:39

Init

819ce31

Add Healthsea experiments

448cc19

Add suggester function to registry

268b75b

Change transformer config

c35a125

Update sbd model architecture

00fa069

Add spancat configs

d6c6940

Add evaluation script

1b603cd

Add toxicspans dataset

84c5ded

Add toxic configs

c0b5cc4

Add unit tests

a29eb04

Add genia dataset

89f204b

Add configs and normalize preprocessing

e56939a

Adjust configs

9b36200

Adjust readme

b6c6974

Merge branch 'master' into feature/spanboundarydetection

3e305e5

thomashacker added the "enhancement" label (New feature or request) Mar 28, 2022

adrianeboyd reviewed Mar 28, 2022

spacy_experimental/span_boundary_detection/sbd_component.py (outdated, resolved)

kadarakos reviewed Mar 28, 2022

Adjust naming and code

e8bd729

svlandeg reviewed Mar 31, 2022

projects/span_boundary_detection/project.yml (outdated, resolved)

projects/span_boundary_detection/README.md (outdated, resolved)

Rename to spanfinder

773aa05

update readme

9074b0e

thomashacker changed the title from "Span Boundary Detection" to "Span Finder Suggester" Apr 13, 2022

Add shared embedding config

625dc6d

svlandeg self-requested a review April 20, 2022 12:31

Add candidates_key to suggester

d0bf6af

thomashacker and others added 9 commits May 30, 2022 22:48

Start alignment method

f4205ee

Add alignment validation for loss calculation

18dbe4d

Increase spaCy version

ff6c741

Update spacy requirements in pyproject.toml

a863995

More alignment

1a92eb0

Merge branch 'feature/spanboundarydetection' of https://github.com/thomashacker/spacy-experimental into feature/spanboundarydetection

3a4c8e2

Update and add xfailed tests for reference alignment

e545231

Add TODO in component

35e9d60

update alignment function to use characters

74f3106

adrianeboyd reviewed Jun 1, 2022

spacy_experimental/span_finder/span_finder_component.py (outdated, resolved)

Update spacy_experimental/span_finder/span_finder_component.py

ba56150

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

adrianeboyd reviewed Jun 1, 2022

spacy_experimental/span_finder/span_finder_component.py (outdated, resolved)

Update spacy_experimental/span_finder/span_finder_component.py

49c188d

adrianeboyd reviewed Jun 1, 2022

spacy_experimental/span_finder/span_finder_component.py (resolved)

adrianeboyd added 2 commits June 1, 2022 13:10

Update spacy_experimental/span_finder/span_finder_component.py

6b509d3

Temporarily fix test

dc3cedd

Rename keys

97c1a40

adrianeboyd reviewed Jun 1, 2022

thomashacker and others added 2 commits June 1, 2022 15:51

Rename keys in readme

3fcd7da

Sync project README

74aa9cd

Update keys and docs

7f8092d

adrianeboyd reviewed Jun 3, 2022

projects/span_finder/README.md (outdated, resolved)

adrianeboyd reviewed Jun 3, 2022

projects/span_finder/scripts/preprocessing/preprocess_healthsea.py (outdated, resolved)

thomashacker added 3 commits June 3, 2022 12:36

Fix tests

390af06

Readme adjustment

36654a3

Fix split in healthsea preprocess

4211396

svlandeg merged commit 10b2178 into explosion:master Jun 7, 2022
