Conversation
On (1), the abstractions exist to help you, not the other way around. If they aren't useful for what you're doing, don't use them, that's fine. The fact that they're not useful here suggests that maybe we should update our abstractions somehow, but that's not an issue for this PR.
I don't mind not using the text field embedder (although maybe we should think about whether a different abstraction is needed here, since we're having to do this multiple times). I'm not entirely thrilled with the idea of an "srl-bert" dataset reader; in theory you'd just want to change the token_indexer in the existing srl dataset reader, but I haven't looked too deeply into why that doesn't work here.
Can I get this pretrained model? And any instructions on how to use it?
@omerarshad sorry, I haven't actually trained it yet, should be done within the next week or so!
So will it replace the current SRL model?
Yes?
There might be some clean up stuff to do from this PR but I'll wait until you think the "big picture" looks decent.
Big picture looks ok to me; I'll leave the rest to @joelgrus, if you both are ok with that.
lazy: bool = False) -> None:

    if token_indexers:
        raise ConfigurationError("The SrlBertReader has a fixed input "
Why have this as a constructor parameter at all, then? Just pass the right thing to the superclass yourself.
Also, docstring needs updating.
new_verbs = [1 if "-V" in tag else 0 for tag in new_tags]
# In order to override the indexing mechanism, we need to set the `text_id`
# attribute directly. This causes the indexing to use this id.
token_field = TextField([Token(t, text_id=self.bert_tokenizer.vocab[t]) for t in wordpieces],
Don't we have a bert token indexer? You're not using that because it groups wordpieces into tokens? What if we have a bert wordpiece indexer, that just uses this vocabulary and doesn't do any vocab counting, etc.? Then you wouldn't have to have this hack, and you could just manually set self._token_indexers = BertWordpieceIndexer in the constructor.
Actually, sorry, I think the right thing to do here, instead of having a different indexer, is to have a BertWordpieceTokenizer that does the tokenization as you have it and sets the text id for all wordpieces. Then you just use the standard single id indexer, and everything is fine.
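For concreteness, here is a minimal sketch of the kind of tokenizer being suggested. The class name, and the use of pytorch_pretrained_bert, are assumptions for illustration, not code from this PR:

```python
from typing import List

from allennlp.data.tokenizers import Token
from pytorch_pretrained_bert import BertTokenizer


class BertWordpieceTokenizer:
    """Hypothetical tokenizer: wordpiece-tokenize the text and set ``text_id``
    from BERT's vocabulary, so the standard single-id indexer can be used unchanged."""

    def __init__(self, bert_model_name: str) -> None:
        self.bert_tokenizer = BertTokenizer.from_pretrained(bert_model_name)

    def tokenize(self, text: str) -> List[Token]:
        # Lowercasing is assumed here because the uncased vocabularies require it.
        pieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(text.lower())
        pieces = ["[CLS]"] + pieces + ["[SEP]"]
        # Setting text_id directly makes indexing use this id rather than an
        # AllenNLP vocabulary lookup, the same trick as in the reader snippet above.
        return [Token(piece, text_id=self.bert_tokenizer.vocab[piece])
                for piece in pieces]
```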
This doesn't work because I need to get back the offsets of the wordpieces, which doesn't fit the Tokenizer.tokenize API. Obviously I could re-compute these, or override the type. What do you suggest?
The spacy tokenizer returns offsets as a field on each token. Can't you do the same?
I could do that, but 1) it doesn't achieve the objective of being able to swap out tokenizers, because that offset must be set for what I am doing, so the only reason to do it would be that we are imagining this is quite general, and 2) it would have different semantics from the offset field in spacy, which would be confusing. I can do it if you want, but it seems a bit "jumping through hoops" for something we don't know if people are going to use extensively (and if they did want to use it extensively, we should really be considering a fuller re-write which places less importance on TokenIndexers).
You have a lot more context here than I do; if what I suggested doesn't make sense, do what makes sense. I thought the offsets you needed were the same as what spacy gives (and squad models also require those offsets); I must have missed something.
return wordpieces, offsets

@staticmethod
def _convert_tags_to_wordpiece_tags(tags: List[str], offsets: List[int]):
Feels like this should be in a util somewhere.
I still don't like the idea of a "srl_bert_reader". If I'm understanding correctly, the reason for that is because the verb indicators and labels need to be "expanded" to correspond to the wordpieces, but there's no way for the SequenceLabelField to "know" about this expansion.
I am trying to think of a clean solution for this.
allennlp/models/srl_bert.py
model : ``Union[str, BertModel]``, required.
    A string describing the BERT model to load or an already constructed BertModel.
bert_dim : int, required.
    The dimension of the contextual representations from BERT.
can't you just get this as bert_model.config.hidden_size?
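For illustration, assuming the pytorch_pretrained_bert BertModel that the rest of the code uses, the dimension can be read straight off the loaded model:

```python
from pytorch_pretrained_bert import BertModel

bert_model = BertModel.from_pretrained("bert-base-uncased")
# The size of the contextual representations is already on the model's config,
# so a separate bert_dim constructor parameter is redundant.
bert_dim = bert_model.config.hidden_size  # 768 for bert-base-uncased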
@joelgrus, to solve that problem, can't you just have an optional
Oh, except typically the data is already tokenized... So, yeah, you'd want a flag that triggers all three: wordpiece tokenization, indicator expansion, and label expansion, for each token.
oh man, you're going to hate this idea:
Correct, I hate that idea.
I mean, there's a sense in which it solves your problem somewhat cleanly (and I think should work with any dataset reader / wordpiece token-indexing scheme).
That proposal is not a good idea because:
@joelgrus, ready for another pass/discussing your idea whenever you're ready
this looks good, I have a bunch of small comments, mostly around naming and documentation
def tearDown(self):
    self.monkeypatch.undo()
    self.monkeypatch.undo()
why are there 2 undos here?
actually, the monkeypatch documentation claims that manually calling undo is not usually necessary, why do we need to do that here?
I thought it was because 2 things are monkeypatched and they keep a stack of the patches. I also thought that it was unnecessary, but without it several tests failed because the monkeypatch was still there, and with this it doesn't 🤷‍♂️
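For readers following the thread, a rough sketch of the setup/teardown pattern being discussed. The patched target here is just a stand-in; the point is only that a manually constructed MonkeyPatch has to be undone explicitly in teardown:

```python
import json
from unittest import TestCase

from pytest import MonkeyPatch  # older pytest versions: from _pytest.monkeypatch import MonkeyPatch


class MonkeyPatchTeardownExample(TestCase):
    def setUp(self):
        super().setUp()
        # A MonkeyPatch constructed by hand (rather than the ``monkeypatch``
        # fixture) is not reverted automatically when the test finishes.
        self.monkeypatch = MonkeyPatch()
        self.monkeypatch.setattr(json, "loads", lambda s: {"patched": True})

    def tearDown(self):
        # Revert the patches explicitly so they don't leak into other tests.
        self.monkeypatch.undo()
        super().tearDown()
```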
@@ -14,6 +15,54 @@

logger = logging.getLogger(__name__)  # pylint: disable=invalid-name

START_TOKEN = "[CLS]"
since these are only used once, is there value in pulling them out into module-level variables?
START_TOKEN = "[CLS]"
SEP_TOKEN = "[SEP]"

def _convert_tags_to_wordpiece_tags(tags: List[str], offsets: List[int]):
I would add a docstring explaining why this is here and mentioning that it will only get used if you're using BERT
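For reference, a self-contained sketch of what a tag-to-wordpiece expansion like this might do. The offset convention used here (exclusive cumulative end offsets per original token) is an assumption for the sketch, not necessarily the PR's exact one:

```python
from typing import List


def convert_tags_to_wordpiece_tags(tags: List[str], end_offsets: List[int]) -> List[str]:
    """Expand one BIO tag per original token into one tag per wordpiece.

    Assumes ``end_offsets[i]`` is the exclusive cumulative wordpiece count after
    original token ``i``; an "O" is added at each end for [CLS] and [SEP].
    """
    new_tags: List[str] = []
    previous = 0
    for tag, end in zip(tags, end_offsets):
        span_length = end - previous
        if tag.startswith("B-"):
            # Only the first wordpiece keeps B-; continuation pieces become I-.
            new_tags.append(tag)
            new_tags.extend(["I-" + tag[2:]] * (span_length - 1))
        else:
            # "O" and "I-" tags are simply repeated for every wordpiece.
            new_tags.extend([tag] * span_length)
        previous = end
    return ["O"] + new_tags + ["O"]


# Three tokens that split into 2, 1 and 2 wordpieces (end offsets [2, 3, 5]):
print(convert_tags_to_wordpiece_tags(["B-ARG0", "O", "B-V"], [2, 3, 5]))
# ['O', 'B-ARG0', 'I-ARG0', 'O', 'B-V', 'I-V', 'O']
```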
bert_model_name : ``Optional[str]``, (default = None)
    The BERT model to be wrapped, which we use to wordpiece-tokenize the data correctly.
    If a bert model name is passed, we will also convert the BIO tags and verb indices
    to be faithful to the wordpieces generated from the tokenizer.
I feel like this comment should make it much more explicit that:
- if you specify a bert_model here, then we will assume you want to use BERT throughout, will use the bert tokenizer, and will expand your tags and verb indicators accordingly
- if you don't specify a bert_model here, then it's just business as usual
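A usage sketch of those two behaviors, using the bert_model_name parameter from the docstring above (whether the constructor accepts exactly this signature is an assumption):

```python
from allennlp.data.dataset_readers import SrlReader

# With a BERT model name: the reader wordpiece-tokenizes the input and expands
# the BIO tags and the verb indicator to line up with the wordpieces.
bert_reader = SrlReader(bert_model_name="bert-base-uncased")

# Without one: business as usual, the default token indexers are used.
plain_reader = SrlReader()
```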
self.bert_tokenizer = None
self.lowercase_input = False

def _tokenize_input(self, tokens: List[str]) -> Tuple[List[str], List[int]]:
again, add a docstring here indicating what this does and that it's only used in the bert case.
also, I'm not convinced this is the best name, maybe _convert_tokens_to_wordpieces?
word_pieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(token)
offsets.append(offsets[-1] + len(word_pieces))
word_piece_tokens.extend(word_pieces)
del offsets[0]
add a comment about this logic
it was dumb, removed
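For anyone following along, a commented sketch of one straightforward way to do that offset bookkeeping without the seed-and-delete trick (names and the end-offset convention are assumptions for illustration):

```python
from typing import Callable, List, Tuple


def convert_tokens_to_wordpieces(tokens: List[str],
                                 wordpiece_tokenize: Callable[[str], List[str]]
                                 ) -> Tuple[List[str], List[int]]:
    """Wordpiece-tokenize each token and record, for every original token, the
    exclusive cumulative count of wordpieces so far; those end offsets are what
    later lets the tags and the verb indicator be re-aligned to the wordpieces."""
    wordpieces: List[str] = []
    end_offsets: List[int] = []
    cumulative = 0
    for token in tokens:
        pieces = wordpiece_tokenize(token)
        cumulative += len(pieces)
        # end_offsets[i] points just past token i's last wordpiece.
        end_offsets.append(cumulative)
        wordpieces.extend(pieces)
    return wordpieces, end_offsets
```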
Any code example of how to use it?
@DeNeutoy, do you have a model saved anywhere for this? Or a plan to put it in
* initial working dataset reader for bert srl
* tests for everything, all good
* pylint, mypy
* clean up after monkeypatch
* docs, rename model
* import
* get names the right way around
* shift everything over to the srl reader
* docstring, update model to not take bert_dim param
* I love to lint
* sneaky configuration test
* joel's comments
A new model which uses BERT for SRL. I want this to replace our current SRL model, because:
This might be a little controversial for a couple of reasons:
1. It doesn't use the allennlp textfield embedder for BERT (in the same way that Joel's BertClassifier doesn't, see Bert for classification #2787). This is necessary, however, because I am modifying the input to BERT directly (switching the sentence indexing to use it for the predicate indicator).
2. It has a moderate amount of duplicated code from the original SRL model. I'm not super motivated to have this model inherit from the original one, because some things make that difficult and we've found previously that model composition works better than inheritance.
This still needs a little bit of a clean up, but if anyone has really strong opinions about this, comments are appreciated!