This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Bert srl #2854

Merged: 15 commits into allenai:master on Jun 7, 2019

Conversation

@DeNeutoy (Contributor):

A new model which uses BERT for SRL. I want this to replace our current SRL model, because:

  • It has better performance: ~86.4 F1 vs. 84.9 F1 for the old model.
  • It's very simple.
  • It trains 15-20x faster.
  • People will actually be able to reproduce it (training our current SRL model requires PyTorch 0.4.1, custom CUDA code which isn't in the repo anymore, and some massive hacks to get the model archive to actually work).

This might be a little controversial for a couple of reasons:

  1. It doesn't use the allennlp TextFieldEmbedder for BERT (in the same way that Joel's BertClassifier doesn't; see Bert for classification #2787). This is necessary, however, because I am modifying the input to BERT directly, repurposing the sentence (segment) indices as the predicate indicator (a sketch of the idea follows this list).

  2. It has a moderate amount of duplicated code from the original SRL model. I'm not particularly motivated to have this model inherit from the original one, because a few things make that difficult and we've previously found that model composition works better than inheritance.
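
A minimal sketch of the segment-id trick in (1), for illustration only; it is not the PR's code and uses the modern transformers API for brevity (the PR itself targets the older pytorch-pretrained-bert package):

import torch
from transformers import BertModel, BertTokenizer

# BERT's token_type_ids normally distinguish sentence A from sentence B;
# here they are repurposed as a binary predicate indicator, so the segment
# embeddings tell the model which wordpieces belong to the verb.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

wordpieces = ["[CLS]", "the", "keys", "are", "in", "the", "drawer", "[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(wordpieces)])
verb_indicator = torch.tensor([[0, 0, 0, 1, 0, 0, 0, 0]])  # 1 on the predicate ("are")
attention_mask = torch.ones_like(input_ids)

outputs = bert(input_ids=input_ids,
               attention_mask=attention_mask,
               token_type_ids=verb_indicator)  # segment ids carry the verb indicator
contextual = outputs.last_hidden_state  # (1, num_wordpieces, hidden_size), fed to a tag projection layer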

This still needs a little bit of cleanup, but if anyone has really strong opinions about this, comments are appreciated!

@matt-gardner (Contributor):

On (1), the abstractions exist to help you, not the other way around. If they aren't useful for what you're doing, don't use them, that's fine. The fact that they're not useful here suggests that maybe we should update our abstractions somehow, but that's not an issue for this PR.

@joelgrus (Contributor):

I don't mind not using the text field embedder (although maybe we should think about whether a different abstraction is needed here, since we're having to do this multiple times).

I'm not entirely thrilled with the idea of an "srl-bert" dataset reader; in theory you'd just want to change the token_indexer in the existing SRL dataset reader, but I haven't looked too deeply into why that doesn't work here.

@omerarshad:

Can I get this pretrained model? And any instructions on how to use it?

@DeNeutoy (Contributor, Author):

@omerarshad sorry, I haven't actually trained it yet; it should be done within the next week or so!

@omerarshad:

So will it replace the current SRL model?

@DeNeutoy (Contributor, Author):

Yes?

@DeNeutoy (Contributor, Author):

There might be some cleanup to do from this PR, but I'll wait until you think the "big picture" looks decent.

@matt-gardner (Contributor) left a comment:

Big picture looks OK to me; I'll leave the rest to @joelgrus, if you both are OK with that.

lazy: bool = False) -> None:

if token_indexers:
raise ConfigurationError("The SrlBertReader has a fixed input "
Contributor:

Why have this as a constructor parameter at all, then? Just pass the right thing to the superclass yourself.

Also, docstring needs updating.
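
A hypothetical sketch of that suggestion (class and argument names are illustrative, not the PR's code): drop the parameter entirely and hand the one supported indexer straight to the superclass.

from allennlp.data.dataset_readers.semantic_role_labeling import SrlReader
from allennlp.data.token_indexers import SingleIdTokenIndexer

class SrlBertReader(SrlReader):
    # Illustrative only: instead of accepting (and then rejecting) a
    # token_indexers argument, always construct the single indexer this
    # reader supports and pass it to the superclass.
    def __init__(self, bert_model_name: str, lazy: bool = False) -> None:
        super().__init__(token_indexers={"tokens": SingleIdTokenIndexer()},
                         lazy=lazy)
        self.bert_model_name = bert_model_name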

new_verbs = [1 if "-V" in tag else 0 for tag in new_tags]
# In order to override the indexing mechanism, we need to set the `text_id`
# attribute directly. This causes the indexing to use this id.
token_field = TextField([Token(t, text_id=self.bert_tokenizer.vocab[t]) for t in wordpieces],
Contributor:

Don't we have a BERT token indexer? You're not using it because it groups wordpieces into tokens? What if we had a BERT wordpiece indexer that just uses this vocabulary and doesn't do any vocab counting, etc.? Then you wouldn't need this hack, and you could just manually set self._token_indexers = BertWordpieceIndexer in the constructor.

Contributor:

Actually, sorry, I think the right thing to do here, instead of having a different indexer, is to have a BertWordpieceTokenizer that does the tokenization as you have it and sets the text id for all wordpieces. Then you just use the standard single id indexer, and everything is fine.
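
A hedged sketch of that proposal (hypothetical class, not code from the PR): a tokenizer that emits wordpieces with their vocabulary ids already attached via text_id, so the plain single-id indexer needs no BERT-specific logic.

from typing import List
from allennlp.data.tokenizers import Token, Tokenizer
from pytorch_pretrained_bert.tokenization import BertTokenizer

class BertWordpieceTokenizer(Tokenizer):
    # Hypothetical class sketching the suggestion above; the merged PR keeps
    # the wordpiece logic inside the dataset reader instead.
    def __init__(self, pretrained_model: str = "bert-base-uncased") -> None:
        self.bert_tokenizer = BertTokenizer.from_pretrained(pretrained_model)

    def tokenize(self, text: str) -> List[Token]:
        wordpieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(text)
        # Setting text_id makes the standard single-id indexer use the BERT
        # vocabulary id directly, with no vocab counting of its own.
        return [Token(piece, text_id=self.bert_tokenizer.vocab[piece])
                for piece in wordpieces]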

Contributor (Author):

This doesn't work because I need to get back the offsets of the wordpieces, which doesn't fit the Tokenizer.tokenize API. Obviously I could re-compute these, or override the type. What do you suggest?

Contributor:

The spacy tokenizer returns offsets as a field on each token. Can't you do the same?

Contributor (Author):

I could do that, but 1) it doesn't achieve the objective of being able to swap out tokenizers, because that offset must be set for what I am doing, so the only reason to do it would be that we imagine this being quite general, and 2) it would have different semantics from the offset field in spacy, which would be confusing. I can do it if you want, but it seems a bit like jumping through hoops for something we don't know people will use extensively (and if they did want to use it extensively, we should really consider a fuller rewrite which places less importance on TokenIndexers).

Contributor:

You have a lot more context here than I do; if what I suggested doesn't make sense, do what makes sense. I thought the offsets you needed were the same as what spacy gives (and the SQuAD models also require those offsets); I must have missed something.

return wordpieces, offsets

@staticmethod
def _convert_tags_to_wordpiece_tags(tags: List[str], offsets: List[int]):
Contributor:

Feels like this should be in a util somewhere.
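
For readers following along, a hedged sketch of roughly what such a utility does (not necessarily the PR's exact implementation): expand per-token BIO tags to wordpiece level, given each token's cumulative wordpiece end offset, and pad with "O" for [CLS] and [SEP].

from typing import List

def convert_tags_to_wordpiece_tags(tags: List[str], end_offsets: List[int]) -> List[str]:
    # A "B-ARG0" on a token that splits into three wordpieces becomes
    # ["B-ARG0", "I-ARG0", "I-ARG0"]; "O" and "I-" tags are simply repeated
    # once per wordpiece; the surrounding [CLS]/[SEP] positions get "O".
    wordpiece_tags: List[str] = []
    previous_end = 0
    for tag, end in zip(tags, end_offsets):
        num_pieces = end - previous_end
        previous_end = end
        if tag.startswith("B-"):
            wordpiece_tags.append(tag)
            wordpiece_tags.extend(["I-" + tag[2:]] * (num_pieces - 1))
        else:
            wordpiece_tags.extend([tag] * num_pieces)
    return ["O"] + wordpiece_tags + ["O"]

# e.g. tags=["B-ARG1", "O"], end_offsets=[3, 4]
# -> ["O", "B-ARG1", "I-ARG1", "I-ARG1", "O", "O"]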

@joelgrus (Contributor) left a comment:

I still don't like the idea of an "srl_bert_reader". If I'm understanding correctly, the reason for it is that the verb indicators and labels need to be "expanded" to correspond to the wordpieces, but there's no way for the SequenceLabelField to "know" about this expansion.

I am trying to think of a clean solution for this.

model : ``Union[str, BertModel]``, required.
A string describing the BERT model to load or an already constructed BertModel.
bert_dim : int, required.
The dimension of the contextual representations from BERT.
Contributor:

can't you just get this as bert_model.config.hidden_size?
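
A small sketch of that suggestion (illustrative; a later commit does drop the bert_dim parameter): read the hidden size off the loaded model rather than asking the user for it.

import torch
from pytorch_pretrained_bert import BertModel

bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_dim = bert_model.config.hidden_size  # 768 for bert-base models
num_labels = 100  # placeholder; really the size of the label vocabulary
tag_projection_layer = torch.nn.Linear(bert_dim, num_labels)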

@matt-gardner (Contributor):

@joelgrus, to solve that problem, can't you just have an optional use_wordpieces flag on the SRL reader and expand the indicators and labels accordingly? You could probably even detect this automatically if you go with the tokenizer approach I suggested: if you have more tokens than you have labels, you need to do some expansion, and you call the existing method.

@matt-gardner (Contributor):

Oh, except the data is typically already tokenized... So, yeah, you'd want a flag that triggers all three: wordpiece tokenization, indicator expansion, and label expansion for each token.
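
As a concrete illustration of the "indicator expansion" step that flag would trigger, a hedged sketch (names are illustrative, not the PR's code):

from typing import List

def convert_verb_indices_to_wordpiece_indices(verb_indices: List[int],
                                              end_offsets: List[int]) -> List[int]:
    # The per-token binary verb indicator is repeated once per wordpiece of
    # each token, with 0s added for the [CLS] and [SEP] positions.
    wordpiece_indicator: List[int] = []
    previous_end = 0
    for flag, end in zip(verb_indices, end_offsets):
        wordpiece_indicator.extend([flag] * (end - previous_end))
        previous_end = end
    return [0] + wordpiece_indicator + [0]

# e.g. verb_indices=[0, 1, 0], end_offsets=[1, 3, 4]
# -> [0, 0, 1, 1, 0, 0]  (the verb's two wordpieces are both flagged)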

@joelgrus (Contributor) commented Jun 5, 2019:

oh man, you're going to hate this idea:

  1. in Instance.index_fields, sort the fields "topologically" (in the sense that sequence label fields come after their corresponding sequence fields; possibly other orderings make sense too)
  2. in WordpieceIndexer, generate an additional index, piece-offsets (i.e. the index of the token that each wordpiece came from)
  3. modify SequenceLabelField.index() so that if its parent field (which has necessarily already been indexed) has a "piece-offsets" index (and possibly if some other flag is set), then the labels get expanded to compensate
  4. which means that the SequenceLabelField has to be aware of whether it is IOB / BIOUL / neither and handle the "expansion" accordingly

@DeNeutoy (Contributor, Author) commented Jun 5, 2019:

Correct, I hate that idea.

@joelgrus (Contributor) commented Jun 5, 2019:

I mean, there's a sense in which it solves your problem somewhat cleanly (and I think it should work with any dataset reader / wordpiece token-indexing scheme).

@DeNeutoy (Contributor, Author) commented Jun 5, 2019:

That proposal is not a good idea because:

  • It expands the scope of index() beyond just indexing and padding tokens.
  • SequenceLabelFields are not exclusively BIO, so we'd have to enumerate 1000 functions which convert between various formats (case in point: the verb_indicator field in the SRL model, which is a binary flag).
  • It's so opaque that no one will use it, because they won't be able to understand how it works.

@DeNeutoy (Contributor, Author) commented Jun 6, 2019:

@joelgrus, ready for another pass / to discuss your idea whenever you're ready.

@joelgrus (Contributor) left a comment:

This looks good. I have a bunch of small comments, mostly around naming and documentation.


def tearDown(self):
self.monkeypatch.undo()
self.monkeypatch.undo()
Contributor:

why are there 2 undos here?

actually, the monkeypatch documentation claims that manually calling undo is not usually necessary, why do we need to do that here?

Contributor (Author):

I thought it was because two things are monkeypatched and they keep a stack of the patches. I also thought it was unnecessary, but without it several tests failed because the monkeypatch was still in place, and with it they don't 🤷‍♂
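
For context, a toy reproduction of the pattern (not the PR's actual test): when a MonkeyPatch is built by hand in setUp rather than received as pytest's monkeypatch fixture, nothing reverts it automatically, so tearDown has to call undo() itself.

import os
import unittest
from _pytest.monkeypatch import MonkeyPatch  # exported as pytest.MonkeyPatch in newer pytest releases

class TestWithManualMonkeypatch(unittest.TestCase):
    def setUp(self):
        # Constructed by hand, so pytest will not undo these patches for us.
        self.monkeypatch = MonkeyPatch()
        self.monkeypatch.setattr(os, "getcwd", lambda: "/tmp/fake-cwd")

    def test_patched(self):
        assert os.getcwd() == "/tmp/fake-cwd"

    def tearDown(self):
        # A single undo() reverts every change made through this instance;
        # without it, the patch would leak into later tests.
        self.monkeypatch.undo()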

@@ -14,6 +15,54 @@

logger = logging.getLogger(__name__) # pylint: disable=invalid-name

START_TOKEN = "[CLS]"
Contributor:

since these are only used once, is there value in pulling them out into module-level variables?

START_TOKEN = "[CLS]"
SEP_TOKEN = "[SEP]"

def _convert_tags_to_wordpiece_tags(tags: List[str], offsets: List[int]):
Contributor:

I would add a docstring explaining why this is here and mentioning that it will only get used if you're using BERT

bert_model_name : ``Optional[str]``, (default = None)
The BERT model to be wrapped, which we use to wordpiece-tokenize the data correctly.
If a bert model name is passed, we will also convert the BIO tags and verb indices
to be faithful to the wordpieces generated from the tokenizer.
Contributor:

I feel like this comment should make it much more explicit that

  • if you specify a bert_model here, then we will assume you want to use BERT throughout, will use the bert tokenizer, and will expand your tags and verb indicators accordingly
  • if you don't specify a bert_model here, then it's just business as usual

self.bert_tokenizer = None
self.lowercase_input = False

def _tokenize_input(self, tokens: List[str]) -> Tuple[List[str], List[int]]:
Contributor:

again, add a docstring here indicating what this does and that it's only used in the bert case.

also, I'm not convinced this is the best name; maybe _convert_tokens_to_wordpieces?

word_pieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(token)
offsets.append(offsets[-1] + len(word_pieces))
word_piece_tokens.extend(word_pieces)
del offsets[0]
Contributor:

add a comment about this logic

Contributor (Author):

it was dumb, removed
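
For completeness, an illustrative version of the helper under discussion after that cleanup (not necessarily the merged code): wordpiece-tokenize each already-tokenized token and record the cumulative end offset of its last wordpiece, which is what the tag and indicator expansion sketches above consume.

from typing import List, Tuple
from pytorch_pretrained_bert.tokenization import BertTokenizer

def convert_tokens_to_wordpieces(tokens: List[str],
                                 bert_tokenizer: BertTokenizer,
                                 lowercase_input: bool = True) -> Tuple[List[str], List[int]]:
    # For each original token, run only the wordpiece tokenizer (the text is
    # already tokenized) and remember where that token's pieces end.
    word_piece_tokens: List[str] = []
    end_offsets: List[int] = []
    cumulative = 0
    for token in tokens:
        if lowercase_input:
            token = token.lower()
        word_pieces = bert_tokenizer.wordpiece_tokenizer.tokenize(token)
        cumulative += len(word_pieces)
        end_offsets.append(cumulative)
        word_piece_tokens.extend(word_pieces)
    # [CLS] and [SEP] are added here; the expansion helpers account for them
    # by padding their outputs with "O" / 0 at both ends.
    return ["[CLS]"] + word_piece_tokens + ["[SEP]"], end_offsets

# Usage (assuming a standard uncased BERT vocabulary):
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# wordpieces, offsets = convert_tokens_to_wordpieces(["The", "keys", "are", "here"], tokenizer)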

@DeNeutoy merged commit 5f37783 into allenai:master on Jun 7, 2019.
@omerarshad:

Any code example of how to use it?

@matt-gardner (Contributor):

@DeNeutoy, do you have a model saved anywhere for this? Or a plan to put it in pretrained.py?

reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
* initial working dataset reader for bert srl

* tests for everything, all good

* pylint, mypy

* clean up after monkeypatch

* docs, rename model

* import

* get names the right way around

* shift everything over to the srl reader

* docstring, update model to not take bert_dim param

* I love to lint

* sneaky configuration test

* joel's comments
TalSchuster pushed a commit to TalSchuster/allennlp-MultiLang that referenced this pull request Feb 20, 2020