This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Bert srl #2854

Merged: 15 commits into allenai:master on Jun 7, 2019

Conversation

@DeNeutoy (Contributor):

A new model which uses BERT for SRL. I want this to replace our current SRL model, because:

  • It has better performance: ~86.4 F1 vs. 84.9 F1 for the old model.
  • It's very simple.
  • It trains 15-20x faster.
  • People will actually be able to reproduce it (training our current SRL model requires PyTorch 0.4.1, custom CUDA code which isn't in the repo anymore, and some massive hacks to get the model archive to actually work).

This might be a little controversial for a couple of reasons:

  1. It doesn't use the allennlp TextFieldEmbedder for BERT (in the same way that Joel's BertClassifier doesn't; see Bert for classification #2787). This is necessary, however, because I am modifying the input to BERT directly, repurposing the sentence (segment) indices as the predicate indicator (a sketch of the idea follows this list).

  2. It has a moderate amount of duplicated code from the original SRL model. I'm not particularly motivated to have this model inherit from the original one, because a few things make that difficult and we've previously found that model composition works better than inheritance.
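
A minimal sketch of the segment-id trick in (1), for illustration only; it is not the PR's code and uses the modern transformers API for brevity (the PR itself targets the older pytorch-pretrained-bert package):

import torch
from transformers import BertModel, BertTokenizer

# BERT's token_type_ids normally distinguish sentence A from sentence B;
# here they are repurposed as a binary predicate indicator, so the segment
# embeddings tell the model which wordpieces belong to the verb.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

wordpieces = ["[CLS]", "the", "keys", "are", "in", "the", "drawer", "[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(wordpieces)])
verb_indicator = torch.tensor([[0, 0, 0, 1, 0, 0, 0, 0]])  # 1 on the predicate ("are")
attention_mask = torch.ones_like(input_ids)

outputs = bert(input_ids=input_ids,
               attention_mask=attention_mask,
               token_type_ids=verb_indicator)  # segment ids carry the verb indicator
contextual = outputs.last_hidden_state  # (1, num_wordpieces, hidden_size), fed to a tag projection layer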

This still needs a little bit of cleanup, but if anyone has really strong opinions about this, comments are appreciated!

@matt-gardner (Contributor):

On (1), the abstractions exist to help you, not the other way around. If they aren't useful for what you're doing, don't use them, that's fine. The fact that they're not useful here suggests that maybe we should update our abstractions somehow, but that's not an issue for this PR.

@joelgrus (Contributor):

I don't mind not using the text field embedder (although maybe we should think about whether a different abstraction is needed here, since we're having to do this multiple times).

I'm not entirely thrilled with the idea of an "srl-bert" dataset reader; in theory you'd just want to change the token_indexer in the existing SRL dataset reader, but I haven't looked too deeply into why that doesn't work here.

@omerarshad:

Can I get this pretrained model? And any instructions on how to use it?

@DeNeutoy (Contributor, Author):

@omerarshad sorry, I haven't actually trained it yet; it should be done within the next week or so!

@omerarshad:

So will it replace the current SRL model?

@DeNeutoy (Contributor, Author):

Yes?

@DeNeutoy (Contributor, Author):

There might be some cleanup to do from this PR, but I'll wait until you think the "big picture" looks decent.

@matt-gardner (Contributor) left a comment:

Big picture looks OK to me; I'll leave the rest to @joelgrus, if you both are OK with that.

lazy: bool = False) -> None:

if token_indexers:
raise ConfigurationError("The SrlBertReader has a fixed input "
Contributor:

Why have this as a constructor parameter at all, then? Just pass the right thing to the superclass yourself.

Also, docstring needs updating.
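
A hypothetical sketch of that suggestion (class and argument names are illustrative, not the PR's code): drop the parameter entirely and hand the one supported indexer straight to the superclass.

from allennlp.data.dataset_readers.semantic_role_labeling import SrlReader
from allennlp.data.token_indexers import SingleIdTokenIndexer

class SrlBertReader(SrlReader):
    # Illustrative only: instead of accepting (and then rejecting) a
    # token_indexers argument, always construct the single indexer this
    # reader supports and pass it to the superclass.
    def __init__(self, bert_model_name: str, lazy: bool = False) -> None:
        super().__init__(token_indexers={"tokens": SingleIdTokenIndexer()},
                         lazy=lazy)
        self.bert_model_name = bert_model_name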

new_verbs = [1 if "-V" in tag else 0 for tag in new_tags]
# In order to override the indexing mechanism, we need to set the `text_id`
# attribute directly. This causes the indexing to use this id.
token_field = TextField([Token(t, text_id=self.bert_tokenizer.vocab[t]) for t in wordpieces],
Contributor:

Don't we have a BERT token indexer? You're not using it because it groups wordpieces into tokens? What if we had a BERT wordpiece indexer that just uses this vocabulary and doesn't do any vocab counting, etc.? Then you wouldn't need this hack, and you could just manually set self._token_indexers = BertWordpieceIndexer in the constructor.

Contributor:

Actually, sorry, I think the right thing to do here, instead of having a different indexer, is to have a BertWordpieceTokenizer that does the tokenization as you have it and sets the text id for all wordpieces. Then you just use the standard single id indexer, and everything is fine.
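
A hedged sketch of that proposal (hypothetical class, not code from the PR): a tokenizer that emits wordpieces with their vocabulary ids already attached via text_id, so the plain single-id indexer needs no BERT-specific logic.

from typing import List
from allennlp.data.tokenizers import Token, Tokenizer
from pytorch_pretrained_bert.tokenization import BertTokenizer

class BertWordpieceTokenizer(Tokenizer):
    # Hypothetical class sketching the suggestion above; the merged PR keeps
    # the wordpiece logic inside the dataset reader instead.
    def __init__(self, pretrained_model: str = "bert-base-uncased") -> None:
        self.bert_tokenizer = BertTokenizer.from_pretrained(pretrained_model)

    def tokenize(self, text: str) -> List[Token]:
        wordpieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(text)
        # Setting text_id makes the standard single-id indexer use the BERT
        # vocabulary id directly, with no vocab counting of its own.
        return [Token(piece, text_id=self.bert_tokenizer.vocab[piece])
                for piece in wordpieces]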

Contributor (Author):

This doesn't work because I need to get back the offsets of the wordpieces, which doesn't fit the Tokenizer.tokenize API. Obviously I could re-compute these, or override the type. What do you suggest?

Contributor:

The spacy tokenizer returns offsets as a field on each token. Can't you do the same?

Contributor (Author):

I could do that, but 1) it doesn't achieve the objective of being able to swap out tokenizers, because that offset must be set for what I am doing, so the only reason to do it would be that we imagine this being quite general, and 2) it would have different semantics from the offset field in spacy, which would be confusing. I can do it if you want, but it seems a bit like jumping through hoops for something we don't know people will use extensively (and if they did want to use it extensively, we should really consider a fuller rewrite which places less importance on TokenIndexers).

Contributor:

You have a lot more context here than I do; if what I suggested doesn't make sense, do what makes sense. I thought the offsets you needed were the same as what spacy gives (and the SQuAD models also require those offsets); I must have missed something.

return wordpieces, offsets

@staticmethod
def _convert_tags_to_wordpiece_tags(tags: List[str], offsets: List[int]):
Contributor:

Feels like this should be in a util somewhere.
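
For readers following along, a hedged sketch of roughly what such a utility does (not necessarily the PR's exact implementation): expand per-token BIO tags to wordpiece level, given each token's cumulative wordpiece end offset, and pad with "O" for [CLS] and [SEP].

from typing import List

def convert_tags_to_wordpiece_tags(tags: List[str], end_offsets: List[int]) -> List[str]:
    # A "B-ARG0" on a token that splits into three wordpieces becomes
    # ["B-ARG0", "I-ARG0", "I-ARG0"]; "O" and "I-" tags are simply repeated
    # once per wordpiece; the surrounding [CLS]/[SEP] positions get "O".
    wordpiece_tags: List[str] = []
    previous_end = 0
    for tag, end in zip(tags, end_offsets):
        num_pieces = end - previous_end
        previous_end = end
        if tag.startswith("B-"):
            wordpiece_tags.append(tag)
            wordpiece_tags.extend(["I-" + tag[2:]] * (num_pieces - 1))
        else:
            wordpiece_tags.extend([tag] * num_pieces)
    return ["O"] + wordpiece_tags + ["O"]

# e.g. tags=["B-ARG1", "O"], end_offsets=[3, 4]
# -> ["O", "B-ARG1", "I-ARG1", "I-ARG1", "O", "O"]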

@joelgrus (Contributor) left a comment:

I still don't like the idea of an "srl_bert_reader". If I'm understanding correctly, the reason for it is that the verb indicators and labels need to be "expanded" to correspond to the wordpieces, but there's no way for the SequenceLabelField to "know" about this expansion.

I am trying to think of a clean solution for this.

model : ``Union[str, BertModel]``, required.
A string describing the BERT model to load or an already constructed BertModel.
bert_dim : int, required.
The dimension of the contextual representations from BERT.
Contributor:

can't you just get this as bert_model.config.hidden_size?
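
A small sketch of that suggestion (illustrative; a later commit does drop the bert_dim parameter): read the hidden size off the loaded model rather than asking the user for it.

import torch
from pytorch_pretrained_bert import BertModel

bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_dim = bert_model.config.hidden_size  # 768 for bert-base models
num_labels = 100  # placeholder; really the size of the label vocabulary
tag_projection_layer = torch.nn.Linear(bert_dim, num_labels)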

@matt-gardner (Contributor):

@joelgrus, to solve that problem, can't you just have an optional use_wordpieces flag on the SRL reader and expand the indicators and labels accordingly? You could probably even detect this automatically if you go with the tokenizer approach I suggested: if you have more tokens than you have labels, you need to do some expansion, and you call the existing method.

@matt-gardner (Contributor):

Oh, except the data is typically already tokenized... So, yeah, you'd want a flag that triggers all three: wordpiece tokenization, indicator expansion, and label expansion for each token.
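
As a concrete illustration of the "indicator expansion" step that flag would trigger, a hedged sketch (names are illustrative, not the PR's code):

from typing import List

def convert_verb_indices_to_wordpiece_indices(verb_indices: List[int],
                                              end_offsets: List[int]) -> List[int]:
    # The per-token binary verb indicator is repeated once per wordpiece of
    # each token, with 0s added for the [CLS] and [SEP] positions.
    wordpiece_indicator: List[int] = []
    previous_end = 0
    for flag, end in zip(verb_indices, end_offsets):
        wordpiece_indicator.extend([flag] * (end - previous_end))
        previous_end = end
    return [0] + wordpiece_indicator + [0]

# e.g. verb_indices=[0, 1, 0], end_offsets=[1, 3, 4]
# -> [0, 0, 1, 1, 0, 0]  (the verb's two wordpieces are both flagged)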

@joelgrus (Contributor) commented Jun 5, 2019:

oh man, you're going to hate this idea:

  1. in Instance.index_fields, sort the fields "topologically" (in the sense that sequence label fields come after their corresponding sequence fields; possibly other orderings make sense too)
  2. in WordpieceIndexer, generate an additional index, piece-offsets (i.e. the index of the token that each wordpiece came from)
  3. modify SequenceLabelField.index() so that if its parent field (which has necessarily already been indexed) has a "piece-offsets" index (and possibly if some other flag is set), then the labels get expanded to compensate
  4. which means that the SequenceLabelField has to be aware of whether it is IOB / BIOUL / neither and handle the "expansion" accordingly

@DeNeutoy (Contributor, Author) commented Jun 5, 2019:

Correct, I hate that idea.

@joelgrus (Contributor) commented Jun 5, 2019:

I mean, there's a sense in which it solves your problem somewhat cleanly (and I think it should work with any dataset reader / wordpiece token-indexing scheme).

@DeNeutoy (Contributor, Author) commented Jun 5, 2019:

That proposal is not a good idea because:

  • It expands the scope of index() beyond just indexing and padding tokens.
  • SequenceLabelFields are not exclusively BIO, so we'd have to enumerate 1000 functions which convert between various formats (case in point: the verb_indicator field in the SRL model, which is a binary flag).
  • It's so opaque that no one will use it, because they won't be able to understand how it works.

@DeNeutoy (Contributor, Author) commented Jun 6, 2019:

@joelgrus, ready for another pass / to discuss your idea whenever you're ready.

@joelgrus (Contributor) left a comment:

This looks good. I have a bunch of small comments, mostly around naming and documentation.


def tearDown(self):
self.monkeypatch.undo()
self.monkeypatch.undo()
Contributor:

why are there 2 undos here?

actually, the monkeypatch documentation claims that manually calling undo is not usually necessary, why do we need to do that here?

Contributor (Author):

I thought it was because two things are monkeypatched and they keep a stack of the patches. I also thought it was unnecessary, but without it several tests failed because the monkeypatch was still in place, and with it they don't 🤷‍♂
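
For context, a toy reproduction of the pattern (not the PR's actual test): when a MonkeyPatch is built by hand in setUp rather than received as pytest's monkeypatch fixture, nothing reverts it automatically, so tearDown has to call undo() itself.

import os
import unittest
from _pytest.monkeypatch import MonkeyPatch  # exported as pytest.MonkeyPatch in newer pytest releases

class TestWithManualMonkeypatch(unittest.TestCase):
    def setUp(self):
        # Constructed by hand, so pytest will not undo these patches for us.
        self.monkeypatch = MonkeyPatch()
        self.monkeypatch.setattr(os, "getcwd", lambda: "/tmp/fake-cwd")

    def test_patched(self):
        assert os.getcwd() == "/tmp/fake-cwd"

    def tearDown(self):
        # A single undo() reverts every change made through this instance;
        # without it, the patch would leak into later tests.
        self.monkeypatch.undo()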

@@ -14,6 +15,54 @@

logger = logging.getLogger(__name__) # pylint: disable=invalid-name

START_TOKEN = "[CLS]"
Contributor:

since these are only used once, is there value in pulling them out into module-level variables?

START_TOKEN = "[CLS]"
SEP_TOKEN = "[SEP]"

def _convert_tags_to_wordpiece_tags(tags: List[str], offsets: List[int]):
Contributor:

I would add a docstring explaining why this is here and mentioning that it will only get used if you're using BERT

bert_model_name : ``Optional[str]``, (default = None)
The BERT model to be wrapped, which we use to wordpiece-tokenize the data correctly.
If a bert model name is passed, we will also convert the BIO tags and verb indices
to be faithful to the wordpieces generated from the tokenizer.
Contributor:

I feel like this comment should make it much more explicit that

  • if you specify a bert_model here, then we will assume you want to use BERT throughout, will use the bert tokenizer, and will expand your tags and verb indicators accordingly
  • if you don't specify a bert_model here, then it's just business as usual

self.bert_tokenizer = None
self.lowercase_input = False

def _tokenize_input(self, tokens: List[str]) -> Tuple[List[str], List[int]]:
Contributor:

again, add a docstring here indicating what this does and that it's only used in the bert case.

also, I'm not convinced this is the best name; maybe _convert_tokens_to_wordpieces?

word_pieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(token)
offsets.append(offsets[-1] + len(word_pieces))
word_piece_tokens.extend(word_pieces)
del offsets[0]
Contributor:

add a comment about this logic

Contributor (Author):

it was dumb, removed
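
For completeness, an illustrative version of the helper under discussion after that cleanup (not necessarily the merged code): wordpiece-tokenize each already-tokenized token and record the cumulative end offset of its last wordpiece, which is what the tag and indicator expansion sketches above consume.

from typing import List, Tuple
from pytorch_pretrained_bert.tokenization import BertTokenizer

def convert_tokens_to_wordpieces(tokens: List[str],
                                 bert_tokenizer: BertTokenizer,
                                 lowercase_input: bool = True) -> Tuple[List[str], List[int]]:
    # For each original token, run only the wordpiece tokenizer (the text is
    # already tokenized) and remember where that token's pieces end.
    word_piece_tokens: List[str] = []
    end_offsets: List[int] = []
    cumulative = 0
    for token in tokens:
        if lowercase_input:
            token = token.lower()
        word_pieces = bert_tokenizer.wordpiece_tokenizer.tokenize(token)
        cumulative += len(word_pieces)
        end_offsets.append(cumulative)
        word_piece_tokens.extend(word_pieces)
    # [CLS] and [SEP] are added here; the expansion helpers account for them
    # by padding their outputs with "O" / 0 at both ends.
    return ["[CLS]"] + word_piece_tokens + ["[SEP]"], end_offsets

# Usage (assuming a standard uncased BERT vocabulary):
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# wordpieces, offsets = convert_tokens_to_wordpieces(["The", "keys", "are", "here"], tokenizer)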

@DeNeutoy merged commit 5f37783 into allenai:master on Jun 7, 2019.
@omerarshad:

Any code example of how to use it?

@matt-gardner (Contributor):

@DeNeutoy, do you have a model saved anywhere for this? Or a plan to put it in pretrained.py?

reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
* initial working dataset reader for bert srl

* tests for everything, all good

* pylint, mypy

* clean up after monkeypatch

* docs, rename model

* import

* get names the right way around

* shift everything over to the srl reader

* docstring, update model to not take bert_dim param

* I love to lint

* sneaky configuration test

* joel's comments
TalSchuster pushed a commit to TalSchuster/allennlp-MultiLang that referenced this pull request Feb 20, 2020