Add beam_parser and beam_ner components for v3 #6369

Merged: 66 commits merged into develop on Dec 13, 2020

honnibal (Member) commented Nov 10, 2020

Still testing.

Okay, this should be mergeable now. The config usage looks like this:

[components.parser]
factory = "beam_parser"
beam_density = 0.1
beam_update_prob = 0.5
beam_width = 8
learn_tokens = false
min_action_freq = 1
moves = null
update_with_oracle_cut_size = 100

[components.ner]
factory = "beam_ner"
beam_density = 0.1
beam_update_prob = 0.5
beam_width = 8
learn_tokens = false
min_action_freq = 1
moves = null
update_with_oracle_cut_size = 100
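
For a rough intuition of two of the settings above, here is a toy sketch of beam pruning in plain Python. It assumes that `beam_width` caps the number of candidates kept and that `beam_density` is a minimum ratio between a candidate's probability and the best candidate's probability. The `prune_beam` function and the example probabilities are hypothetical, and the actual implementation in the parser may differ in its details.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (hypothetical analysis id, probability)


def prune_beam(
    candidates: List[Candidate], beam_width: int, beam_density: float
) -> List[Candidate]:
    """Keep at most `beam_width` candidates, dropping any whose probability
    is below `beam_density` times the best candidate's probability."""
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_prob = ranked[0][1]
    # Density filter: discard candidates that are too far behind the best one.
    dense_enough = [c for c in ranked if c[1] >= beam_density * best_prob]
    # Width filter: never keep more than beam_width candidates.
    return dense_enough[:beam_width]


# Made-up probabilities, for illustration only.
candidates = [("A", 0.50), ("B", 0.30), ("C", 0.15), ("D", 0.04), ("E", 0.01)]

print(prune_beam(candidates, beam_width=8, beam_density=0.1))
print(prune_beam(candidates, beam_width=2, beam_density=0.1))
```

Under these assumptions, a higher `beam_density` trades search depth for speed, since weak candidates are dropped before they are extended.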

The affordances for getting probabilities out of the beam aren't really there at the moment; I want to build them in a second PR.

The PR has a lot of incidental changes to the parser, and requires models to be retrained. The incidental changes were needed because of problems with the transition system and state class that made beam parsing slow and ineffective. Specifically, the parser relied on a mechanism where we would "fast forward" through states that had only one valid action. This fast-forwarding isn't correct for the beam objective, since states still need to be scored under the global model, even if there's only one next action.
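
The fast-forwarding problem can be illustrated with a toy example in plain Python. This is only a sketch: the `Step` type, the action names, and the scores are made up, and it is not the parser's actual code. It shows how skipping the score of a forced action can change which beam candidate ranks highest.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    action: str   # hypothetical transition name
    score: float  # made-up score assigned by a global model
    forced: bool  # True if this was the only valid action in its state


def sequence_score(steps: List[Step], skip_forced: bool) -> float:
    """Sum the action scores of one beam candidate.

    With skip_forced=True, forced actions are applied without being scored,
    which mimics fast-forwarding through single-action states."""
    return sum(s.score for s in steps if not (skip_forced and s.forced))


candidate_x = [
    Step("SHIFT", 1.0, False),
    Step("REDUCE", -2.0, True),  # only one valid action, but the model dislikes it
    Step("LEFT", 0.5, False),
]
candidate_y = [
    Step("SHIFT", 1.0, False),
    Step("RIGHT", 0.25, False),
    Step("REDUCE", -0.125, False),
]

for skip in (True, False):
    x = sequence_score(candidate_x, skip)
    y = sequence_score(candidate_y, skip)
    best = "x" if x > y else "y"
    print(f"skip_forced={skip}: x={x}, y={y}, best={best}")
```

In this toy setup, candidate x wins only when the forced action goes unscored. Once every state is scored, candidate y ranks higher, which is the comparison a globally normalized objective needs.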

I've also cleaned up the definition of the Break transition to be simpler and more consistent. It now inserts a sentence break beginning at B[1], i.e. the first word of the buffer. Previously the break was inserted at the leftmost edge of B[0]. The new definition lets the parser see both the last word of the sentence and the first word of the next sentence in the state. It also reduces the interaction with the other actions, and makes it easier to respect preset sentence boundaries.

The StateC data structure has also been revised considerably, to reduce the expense of copy operations that made the beam slow on long inputs. We no longer copy the TokenC* array, and the parse is now quicker to copy, especially for states near the beginning.
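
The idea of sharing the input between states instead of copying it can be sketched in plain Python. `ToyState` below is a hypothetical stand-in and does not reflect the real StateC layout. It only assumes that the token array is read-only during parsing, so copies can hold a reference to it while duplicating just the mutable parse data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyState:
    # Read-only input, shared by reference between all copies.
    tokens: Tuple[str, ...]
    # Mutable per-state data; these are the only parts that get copied.
    stack: List[int] = field(default_factory=list)
    arcs: List[Tuple[int, int]] = field(default_factory=list)  # (head, child)

    def copy(self) -> "ToyState":
        # Copy cost depends on the parse built so far, not on the input length,
        # so states near the beginning of the parse are the cheapest to copy.
        return ToyState(self.tokens, list(self.stack), list(self.arcs))


tokens = tuple("the beam keeps several candidate parses alive".split())
state = ToyState(tokens)
state.stack.append(0)

clone = state.copy()
clone.stack.append(1)
clone.arcs.append((1, 0))

print(clone.tokens is state.tokens)  # the input itself is not duplicated
print(state.stack, clone.stack)      # the mutable parts are independent
```

With a beam, every surviving candidate is copied at each step, so keeping the per-copy cost independent of the document length matters most on long inputs.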

svlandeg added the enhancement, feat / ner, and v3.0 labels Nov 10, 2020
bratao (Contributor) commented Dec 9, 2020

Just a +1. I trained my own NER in v3 using this PR with a small beam size, and I got significant gains over the greedy NER version 👏

honnibal (Member, Author) commented

@bratao Oh that's really nice to hear, thanks! Can you share:

  • Number of tokens in train set?
  • Number of sentences in train set?
  • Number of entity types?
  • Accuracy with and without the beam?

bratao (Contributor) commented Dec 11, 2020

Hello @honnibal,
I'm doing NER on very long documents. This dataset has 5 entity types. There is no token without an entity (no O).

It is composed of 28 documents. The biggest has 72k tokens. The average number of tokens per document is 12k. I used batches of up to 64k tokens.
Here are the stats I collected, but I'm still working on my pipeline.
Adam(0.001), MultiHashEmbed with width=128

Regular NER:
Max Memory: 2GB
Run time: 61 minutes
F1: 0.9805

Beam NER (beam size of 3):
Max Memory: 17GB
Run time: 696 minutes
F1: 0.9901

For comparison, crfsuite:
Run time: 25 minutes
F1: 0.979591836734694
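
To put the greedy and beam numbers above side by side, the short Python sketch below computes the ratios. It uses only the figures reported in this comment and treats `1 - F1` as the remaining error, which is an assumption made here for illustration.

```python
# Figures copied from the stats above.
greedy = {"memory_gb": 2, "minutes": 61, "f1": 0.9805}
beam = {"memory_gb": 17, "minutes": 696, "f1": 0.9901}

slowdown = beam["minutes"] / greedy["minutes"]
memory_ratio = beam["memory_gb"] / greedy["memory_gb"]

# Treat 1 - F1 as the remaining error (an illustrative assumption).
greedy_error = 1 - greedy["f1"]
beam_error = 1 - beam["f1"]
relative_error_reduction = (greedy_error - beam_error) / greedy_error

print(f"Run time ratio (beam / greedy): {slowdown:.1f}x")
print(f"Memory ratio (beam / greedy): {memory_ratio:.1f}x")
print(f"Relative error reduction: {relative_error_reduction:.1%}")
```

On these figures, the beam run takes roughly 11 times as long and uses 8.5 times the memory, while removing about half of the remaining F1 error.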

honnibal (Member, Author) commented

Thanks for the details! It's a shame about the speed. If you want to make it a bit faster (possibly at the expense of accuracy), you could try use_upper = false.

honnibal merged commit 8656a08 into develop Dec 13, 2020
svlandeg deleted the feature/v3-beam branch December 15, 2020 16:35