Add beam_parser and beam_ner components for v3 #6369

honnibal · 2020-11-10T05:20:36Z

~~Still testing.~~

Okay this should be mergeable now. The config usage is like this:

[components.parser]
factory = "beam_parser"
beam_density = 0.1
beam_update_prob = 0.5
beam_width = 8
learn_tokens = false
min_action_freq = 1
moves = null
update_with_oracle_cut_size = 100

[components.ner]
factory = "beam_ner"
beam_density = 0.1
beam_update_prob = 0.5
beam_width = 8
learn_tokens = false
min_action_freq = 1
moves = null
update_with_oracle_cut_size = 100
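
The values in these blocks read like JSON literals (`false`, `null`, quoted strings, bare numbers). As a rough illustration of the type each setting carries, here is a standard-library sketch that decodes the parser block. It is not spaCy's own config loader, and `CONFIG_TEXT` and `parse_section` are names made up for this example.

```python
import configparser
import json

# The parser block from the description above, copied verbatim.
CONFIG_TEXT = """
[components.parser]
factory = "beam_parser"
beam_density = 0.1
beam_update_prob = 0.5
beam_width = 8
learn_tokens = false
min_action_freq = 1
moves = null
update_with_oracle_cut_size = 100
"""


def parse_section(text, section):
    """Read one config section and decode every value as a JSON literal."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys exactly as written
    parser.read_string(text)
    return {key: json.loads(value) for key, value in parser[section].items()}


settings = parse_section(CONFIG_TEXT, "components.parser")
for key, value in settings.items():
    print(f"{key}: {value!r} ({type(value).__name__})")
```

Decoded this way, `beam_width` is an integer, `beam_density` and `beam_update_prob` are floats, `learn_tokens` is a boolean, and `moves` is `None`.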

The affordances for getting probabilities out of the beam aren't really there at the moment; I want to build them in a second PR.

The PR has a lot of incidental changes to the parser, and requires models to be retrained. The incidental changes were needed because of problems with the transition system and state class that made beam parsing slow and ineffective. Specifically, the parser relied on a mechanism where we would "fast forward" through states that had only one valid action. This fast-forwarding isn't correct for the beam objective, since states still need to be scored under the global model, even if there's only one next action.
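
To see why skipping forced actions matters, consider a toy case where a hypothesis is ranked by the sum of its per-action scores. This is a minimal sketch with invented numbers, not the parser's actual scoring code; `sequence_score`, `hyp_a` and `hyp_b` are hypothetical names.

```python
def sequence_score(steps, skip_forced):
    """Sum per-action scores for one hypothesis.

    Each step is (score, number_of_valid_actions). With skip_forced=True,
    steps that had a single valid action are ignored, which mimics
    fast-forwarding through them.
    """
    total = 0.0
    for score, n_valid in steps:
        if skip_forced and n_valid == 1:
            continue
        total += score
    return total


# Invented scores: hypothesis A passes through one forced, low-scoring action.
hyp_a = [(1.0, 3), (-2.0, 1), (0.5, 2)]
hyp_b = [(0.8, 3), (0.2, 2), (-0.1, 2)]

for skip in (True, False):
    a = sequence_score(hyp_a, skip)
    b = sequence_score(hyp_b, skip)
    best = "A" if a > b else "B"
    print(f"skip_forced={skip}: A={a:.2f} B={b:.2f} best={best}")
```

With fast-forwarding, hypothesis A looks better because its forced step is never charged to it. Scoring every state puts B ahead, which is the ranking a globally normalized objective needs.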

I've also cleaned up the definition of the Break transition to be simpler and more consistent. It now inserts a sentence break beginning at B[1], i.e. the first word of the buffer. Previously the break was inserted at the leftmost edge of B[0]. The new definition lets the parser see both the last word of the sentence and the first word of the next sentence in the state. It also reduces the interaction with the other actions, and makes it easier to respect preset sentence boundaries.

The StateC data structure has also been revised considerably, to reduce the expense of copy operations that made the beam slow on long inputs. We now don't copy the TokenC* array, and the parse is now quicker to copy, especially for states near the beginning.
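
The idea of not copying the token array can be sketched in plain Python. This is only an illustration of the design, assuming a state that splits into read-only input and a mutable parse history; `ToyState` and its fields are hypothetical and do not mirror the real StateC layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyState:
    tokens: Tuple[str, ...]  # read-only input, shared by every copy
    stack: List[int] = field(default_factory=list)  # mutable parse history
    heads: List[int] = field(default_factory=list)  # mutable parse history

    def clone(self) -> "ToyState":
        # Copy only the mutable parts; reuse the same tokens object.
        return ToyState(self.tokens, list(self.stack), list(self.heads))


tokens = tuple(f"w{i}" for i in range(10000))
start = ToyState(tokens)
start.stack.append(0)

branch = start.clone()
branch.stack.append(1)

print(branch.tokens is start.tokens)  # the input is shared, not duplicated
print(start.stack, branch.stack)  # the histories diverge independently
```

Because only the history is copied, the cost of a clone grows with how much has been parsed so far and not with document length, so states near the beginning of a long input are cheap to branch.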


bratao · 2020-12-09T14:35:26Z

Just a +1. I trained my own NER in v3 using this PR with a small beam size, and I got some significant gains over the greedy NER version 👏

honnibal · 2020-12-11T01:06:24Z

@bratao Oh that's really nice to hear, thanks! Can you share:

Number of tokens in train set?
Number of sentences in train set?
Number of entity types?
Accuracy with and without the beam?

bratao · 2020-12-11T01:43:55Z

Hello @honnibal,
I'm doing NER on very long documents. This dataset has 5 entity types. There is no token without an entity (no O).

It is composed of 28 documents. The largest has 72k tokens, and the average is 12k tokens per document. I used batches of up to 64k tokens.
Here are the stats I collected, though I'm still working on my pipeline.
Adam(0.001), MultiHashEmbed with width=128

Regular NER:
Max Memory: 2GB
Run time: 61 minutes
F1: 0.9805

Beam NER (beam size of 3):
Max Memory: 17GB
Run time: 696 minutes
F1: 0.9901

For comparison, crfsuite:
Run time: 25 minutes
F1: 0.979591836734694
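
To put the greedy and beam figures side by side, the ratios can be worked out directly from the numbers reported above. This is plain arithmetic on those values, with error taken as 1 - F1; the variable names are only for illustration.

```python
# Figures as reported above for the greedy and beam NER runs.
greedy = {"memory_gb": 2, "minutes": 61, "f1": 0.9805}
beam = {"memory_gb": 17, "minutes": 696, "f1": 0.9901}

slowdown = beam["minutes"] / greedy["minutes"]
memory_ratio = beam["memory_gb"] / greedy["memory_gb"]

# Treat 1 - F1 as the error and measure how much of it the beam removes.
greedy_error = 1.0 - greedy["f1"]
beam_error = 1.0 - beam["f1"]
error_reduction = (greedy_error - beam_error) / greedy_error

print(f"training time ratio: {slowdown:.1f}x")
print(f"peak memory ratio:   {memory_ratio:.1f}x")
print(f"error reduction:     {error_reduction:.1%}")
```

On these numbers the beam run takes about 11.4 times as long and 8.5 times the memory, and removes roughly 49% of the remaining error.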

honnibal · 2020-12-11T03:14:48Z

Thanks for the details! It's a shame about the speed. If you want to make it a bit faster (possibly at the expense of accuracy), you could try use_upper = false.

honnibal added 7 commits November 10, 2020 11:03

Get basic beam tests working

59629db

Get basic beam tests working

f0043b6

Compile _beam_utils

b250be3

Remove prints

0ca2cb7

Test beam density

5a2e3d4

Beam parser seems to train

773fa17

Draft beam NER

cf187b8

svlandeg added enhancement Feature requests and improvements feat / ner Feature: Named Entity Recognizer v3.0 Related to v3.0 labels Nov 10, 2020

honnibal added 20 commits November 11, 2020 12:02

Upd beam

59841c5

Merge branch 'develop' into feature/v3-beam

57b005a

Add hypothesis as dev dependency

37f5ff9

Implement missing is-gold-parse method

8ccead0

Implement early update

b5d1b58

Fix state hashing

bc747e4

Fix test

5db7662

Fix test

45797ea

Default to non-beam in parser constructor

db2a20e

Improve oracle for beam

80b4777

Start refactoring beam

b5481f1

Update test

469aed4

Refactor beam

a9b0735

Update nn

660d597

Refactor beam and weight by cost

f39c8a2

Update ner beam settings

a1235c4

Update test

c885623

Add __init__.pxd

f19dbc2

Upd test

e58a77d

Fix test

7337f55

honnibal added 16 commits November 25, 2020 11:24

Improve state class

96f3155

Refactor parser oracles

0e26f80

Fix arc eager oracle

3eef433

Fix arc eager oracle

0abe335

Use a vector to implement the stack

03a8585

Refactor state data structure

ea87aeb

Merge branch 'develop' into feature/v3-beam

daf9019

Fix alignment of sent start

588baba

Add get_aligned_sent_starts method

4ee515d

Add test for ae oracle when bad sentence starts

c6790b9

Merge branch 'feature/v3-beam' of https://github.com/explosion/spaCy …

e4c3c1f

…into feature/v3-beam

Fix sentence segment handling

8095c20

Avoid Reduce that inserts illegal sentence

a8faff0

Update preset SBD test

34c2aa4

Fix test

2a82eea

Remove prints

1a45a28

honnibal added 4 commits December 11, 2020 09:20

Fix sent starts in Example

df5ca9a

Improve python API of StateClass

67c89bc

Tweak comments and debug output of arc eager

cc36e0b

Merge branch 'develop' into feature/v3-beam

18615dd

honnibal added 3 commits December 11, 2020 16:41

Upd test

1c2cf41

Fix state test

e7fab2c

Fix state test

2075b2d

honnibal merged commit 8656a08 into develop Dec 13, 2020

svlandeg deleted the feature/v3-beam branch December 15, 2020 16:35

svlandeg mentioned this pull request Dec 7, 2022

Add in errors used in the beam code that were removed at some point #11935

Merged

3 tasks
