
add BERT token embedder #2067

Merged
merged 19 commits into from Nov 26, 2018

Conversation

@joelgrus
Collaborator

commented Nov 16, 2018

this is ready for review. in addition to the included unit tests, I trained two NER models using these embeddings (unfortunately, I realized this morning, I used the uncased BERT model, which seems like a bad idea for NER)

(1) only BERT embeddings: https://beaker-internal.allenai.org/ex/ex_rnk3mcplnpjz/tasks
(2) BERT embeddings + character embeddings: https://beaker-internal.allenai.org/ex/ex_nrq8d5vw5cb2/tasks

(apologies to non-AI2 people for the beaker-internal links)

as discussed offline, because of the positional encodings the BERT embedding has a max sequence length and will crash if you feed it longer sequences. this implementation simply truncates longer sequences and logs a warning. I left a TODO to come up with something better.
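
For reference, a minimal sketch of the truncation behavior described above (the 512-wordpiece limit and the helper name are assumptions for illustration, not the PR's actual code):

    import logging

    logger = logging.getLogger(__name__)

    # assumption: bert-base/large provide 512 positional embeddings
    MAX_WORDPIECES = 512

    def truncate_wordpieces(wordpiece_ids, max_len=MAX_WORDPIECES):
        # drop anything past the positional-encoding limit and warn about it
        if len(wordpiece_ids) > max_len:
            logger.warning("Truncating sequence of %d wordpieces to %d", len(wordpiece_ids), max_len)
            return wordpiece_ids[:max_len]
        return wordpiece_ids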

@SparkJiao

commented Nov 17, 2018

I really appreciate your work on BERT. I'm considering using BERT for my task, so could I use your implementation now?

@joelgrus

Collaborator Author

commented Nov 17, 2018

I wouldn't recommend using it for anything important; the code hasn't been thoroughly tested yet.

(you can try it and tell me how well it works though. 😀)

@handsomezebra

Contributor

commented Nov 17, 2018

Thanks for all the work!
I'd prefer that the BERT code be included in the allennlp library rather than installed via pip. One reason is that people may want to experiment with the code within the allennlp framework.
Also, having ELMo and BERT in the same place is fun.

@SparkJiao

commented Nov 17, 2018

Unfortunately, I think I have found a bug.
When sorting batches using the sorting key ["passage", "num_tokens"], it is reported that the key num_tokens doesn't exist. The problem is that the bert token indexer produces token arrays whose lengths differ from those of the single_id token indexer and the character tokenizer, so the key "num_tokens" never gets added to the "padding_lengths" dict.
The config file is dialog_qa.jsonnet; I will try modifying the sorting keys as a temporary workaround.

@SparkJiao

commented Nov 17, 2018

So are there problems with using GloVe, the single_id and token_characters indexers, and bert at the same time? Is it because bert produces sequences with different lengths from the others, so we can't concatenate the word embeddings? Sorry to bother you.

@thomwolf

Contributor

commented Nov 17, 2018

Hi @joelgrus, I've released our implementation on pip (see https://github.com/huggingface/pytorch-pretrained-BERT). Sorry for the delay! Tell me if I can change anything to make it easier for you to integrate in AllenNLP!

@joelgrus

Collaborator Author

commented Nov 17, 2018

the timing is perfect, thanks so much!

@joelgrus

Collaborator Author

commented Nov 17, 2018

@SparkJiao the indexer produces a bert-offsets field that contains the indices of the last wordpiece for each word. if you pass this to the bert token embedder, it will only return the embedding of the last wordpiece per original token, which gives it the right size. (whether this is the optimal way to accomplish this I'm less sure of)

you can do this from your config file by adding something like the following to your text_field_embedder:

            "allow_unmatched_keys": true,
            "embedder_to_indexer_map": {
                "tokens": ["tokens"],  // etc
                "bert": ["bert", "bert-offsets"]
            },

so that the model knows to pass the offsets into the bert token embedder.
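
To make the offsets mechanics concrete, here is a rough sketch of how an embedder can use them (names and shapes are illustrative, not necessarily the PR's exact code):

    import torch

    def select_last_wordpieces(bert_embeddings: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        # bert_embeddings: (batch_size, num_wordpieces, embedding_dim)
        # offsets:         (batch_size, num_tokens), index of each token's last wordpiece
        batch_size, _, embedding_dim = bert_embeddings.size()
        index = offsets.unsqueeze(-1).expand(batch_size, offsets.size(1), embedding_dim)
        # result: (batch_size, num_tokens, embedding_dim), one vector per original token
        return torch.gather(bert_embeddings, 1, index)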

@SparkJiao

commented Nov 19, 2018

@joelgrus Sorry for the late reply, and thanks very much for your help. I recently ran into some other problems that I can't solve, so I may wait for your official tutorial :(
Thanks again!

@susht3

commented Nov 19, 2018

> This is mostly implemented, with two caveats:
>
> (1) the end to end test for it is failing (and is itself not fully written), but for reasons that I'm pretty sure are unrelated to BERT or the new code (and that I'm pretty sure are related to me doing something really dumb somewhere that I haven't figured out)
>
> (2) huggingface said they're going to pip release their implementation, but afaik they haven't, so for now I copied over the relevant files, but I'm using them as if they were a library, so when the library gets released I should (in theory) just have to change a handful of import statements and everything should work.

Now that they have released their implementation on pip, could you please add bert?

@joelgrus joelgrus changed the title from "WIP: add BERT token embedder" to "add BERT token embedder" Nov 21, 2018
@joelgrus joelgrus requested a review from DeNeutoy Nov 21, 2018
Member

left a comment

Code largely looks good. The big question to me is how to use BERT embeddings if you're not fine tuning (or making sure that it works as expected if you are fine tuning).

@@ -182,7 +182,7 @@ expected-line-ending-format=
[BASIC]

# Good variable names which should always be accepted, separated by a comma
good-names=i,j,k,ex,Run,_
good-names=i,j,k,ex,Run,_,logger

@matt-gardner

matt-gardner Nov 21, 2018

Member

Hmm, I didn't realize you could do this. Nice find.

@@ -351,13 +351,14 @@ def _check_is_dict(self, new_history, value):
return value

@staticmethod
def from_file(params_file: str, params_overrides: str = "") -> 'Params':
def from_file(params_file: str, params_overrides: str = "", ext_vars: dict = {}) -> 'Params':
# pylint: disable=dangerous-default-value

@matt-gardner

matt-gardner Nov 21, 2018

Member

I think I agree with pylint here - why not just have this be None, and add a line to the logic below? Much less error-prone in future maintenance of this code.

@matt-gardner

matt-gardner Nov 21, 2018

Member

It'd also be nice to document what this does.


@overrides
def count_vocab_items(self, token: Token, counter: Dict[str, Dict[str, int]]):
# If we only use pretrained models, we don't need to do anything here.

@matt-gardner

matt-gardner Nov 21, 2018

Member

In the docstring above, it looked like you were recommending a different class for pretrained models. Are you talking about pretrained WordpieceTokenizers here?


def _add_encoding_to_vocabulary(self, vocabulary: Vocabulary) -> None:
# pylint: disable=protected-access
for word, idx in self.vocab.items():

@matt-gardner

matt-gardner Nov 21, 2018

Member

If this takes a while, you might consider putting a logging statement (or even a tqdm) in here.

logger = logging.getLogger(__name__)


class BertIndexer(TokenIndexer[int]):

@matt-gardner

matt-gardner Nov 21, 2018

Member

Maybe call this a WordpieceIndexer? This is more general than BERT. The class below is the BERT-specific one.

bert_model: ``BertModel``
The BERT model being wrapped.
"""
def __init__(self,

@matt-gardner

matt-gardner Nov 21, 2018

Member

All on one line? Also, what's the motivation for splitting this class up into two? So that people can either instantiate the BertModel themselves or use a string to reference it? Do we really need both of these?

@matt-gardner

matt-gardner Nov 21, 2018

Member

Oh, looking at the config fixture, looks like this will let you train your own bert model if you want? Ok, yeah, that's definitely sufficient motivation to split this up.

Parameters
----------
input_ids: ``torch.LongTensor``
The wordpiece ids for each input sentence.

@matt-gardner

matt-gardner Nov 21, 2018

Member

Sentence? I'm not sure that's correct. It'd probably clear up what you meant here if you gave an expected shape.

If an input consists of two sentences (as in the BERT paper),
tokens from the first sentence should have type 0 and tokens from
the second sentence should have type 1. If you don't provide this
(the default BertIndexer doesn't) then it's assumed to be all 0s.

@matt-gardner

matt-gardner Nov 21, 2018

Member

Is there a way to modify the indexer to provide this? It's fine to do it in another PR, but I'd at least open an issue to track adding that. Seems pretty important if you want to use this for SQuAD.

passage1 = "There were four major HDTV systems tested by SMPTE in the late 1970s, and in 1979 an SMPTE study group released A Study of High Definition Television Systems:"
question1 = "Who released A Study of High Definition Television Systems?"

passage2 = """Broca, being what today would be called a neurosurgeon, had taken an interest in the pathology of speech. He wanted to localize the difference between man and the other animals, which appeared to reside in speech. He discovered the speech center of the human brain, today called Broca's area after him. His interest was mainly in Biological anthropology, but a German philosopher specializing in psychology, Theodor Waitz, took up the theme of general and social anthropology in his six-volume work, entitled Die Anthropologie der Naturvölker, 1859–1864. The title was soon translated as "The Anthropology of Primitive Peoples". The last two volumes were published posthumously."""

@matt-gardner

matt-gardner Nov 21, 2018

Member

You're using triple quotes already; can you make this into multiple shorter lines instead of one super long one? Same with the passage above.

input_mask = (input_ids != 0).long()

all_encoder_layers, _ = self.bert_model(input_ids, input_mask, token_type_ids)
sequence_output = all_encoder_layers[-1]

@matt-gardner

matt-gardner Nov 21, 2018

Member

The top layer is a particularly bad layer to use for transfer (talk to Nelson about why, or wait for his final internship presentation). If we're fine-tuning, this is ok, but if we're not, we probably need some kind of scalar mix or something here, instead of just taking the last layer. This is probably why your performance is so poor for your NER experiment.
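
For reference, the kind of scalar mix being suggested is a learned, softmax-normalized weighted sum over all encoder layers (as in ELMo). A minimal sketch, not the AllenNLP module itself:

    from typing import List

    import torch

    class SimpleScalarMix(torch.nn.Module):
        # one learned weight per layer plus a global scale
        def __init__(self, num_layers: int) -> None:
            super().__init__()
            self.layer_weights = torch.nn.Parameter(torch.zeros(num_layers))
            self.gamma = torch.nn.Parameter(torch.ones(1))

        def forward(self, layers: List[torch.Tensor]) -> torch.Tensor:
            # layers: each (batch_size, sequence_length, hidden_dim)
            weights = torch.softmax(self.layer_weights, dim=0)
            return self.gamma * sum(w * layer for w, layer in zip(weights, layers))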

@DeNeutoy

DeNeutoy Nov 21, 2018

Collaborator

Table 2 of the original elmo paper suggests otherwise (unless this is specific to transformers)? Perhaps this accounts for ~1 F1 difference, not 20+.

@matt-gardner

matt-gardner Nov 21, 2018

Member

Layer 1 is pretty consistently better than high layers across tasks, in Nelson's experiments (with 2 and 4 layer ELMo). The top layer is specialized for language modeling, and only a few tasks are close enough to benefit from that specialization (turns out NER is one of those). For transformers, the story is a bit different, but middle layers are still better than the top layer.

Table 2 of the ELMo paper didn't try just using layer 1.

@DeNeutoy

DeNeutoy Nov 21, 2018

Collaborator

I wasn't saying there isn't a difference - just that even with the top layer, it should be there or thereabouts. So that isn't the problem/difference to look into here for the NER model that Joel trained.

@DeNeutoy

DeNeutoy Nov 21, 2018

Collaborator

I missed that slack convo. Nevermind.

@joelgrus

Collaborator Author

commented Nov 27, 2018

here's the config I used:

https://gist.github.com/joelgrus/7cdb8fb2d81483a8d9ca121d9c617514

@susht3

commented Nov 27, 2018

> here's the config I used:
>
> https://gist.github.com/joelgrus/7cdb8fb2d81483a8d9ca121d9c617514

thanks a lot, but i still have a shape problem: my input shape is (batch_size, number, len) and i want embeddings of shape (batch_size, number, len, embedding_dim), but i get an error that dimension 1 doesn't match, which suggests the number dimension isn't being padded

@joelgrus

Collaborator Author

commented Nov 27, 2018

what is "number" here? what does your dataset look like?

@susht3

commented Nov 27, 2018

what is "number" here? what does your dataset look like?

my dataset is quac dataset from here: https://github.com/allenai/allennlp/blob/master/allennlp/data/dataset_readers/reading_comprehension/quac.py,
i see that bert has test on squad dataset, but i get error on quac dataset

@joelgrus

Collaborator Author

commented Nov 27, 2018

I see, the vanilla Embedding has code to explicitly deal with this case:

https://github.com/allenai/allennlp/blob/master/allennlp/modules/token_embedders/embedding.py#L123

I will try to fix this today.
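
For reference, the trick used there is roughly the following (illustrative sketch, not the actual allennlp code): flatten the extra leading dimensions, embed, then restore them.

    import torch

    def embed_with_extra_dims(embed_fn, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch_size, number, length), or more generally (d1, ..., dn, length)
        original_shape = token_ids.size()
        flat_ids = token_ids.reshape(-1, original_shape[-1])   # (d1 * ... * dn, length)
        flat_embeddings = embed_fn(flat_ids)                    # (d1 * ... * dn, length, embedding_dim)
        # restore the leading dimensions: (d1, ..., dn, length, embedding_dim)
        return flat_embeddings.reshape(*original_shape, flat_embeddings.size(-1))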

@joelgrus

Collaborator Author

commented Nov 27, 2018

@susht3 this should be resolved by #2104

@matthew-z

commented Dec 1, 2018

@joelgrus I was wondering if I could avoid using offsets? I found that they cause the mask to be inconsistent with the bert tensor, so I think I should use the bert tokenizer in the dataset reader, but I am not sure how to do it. I only found a word splitter for bert-basic, but I guess that is only for Chinese data (character based).

@joelgrus

Collaborator Author

commented Dec 1, 2018

mask is not supposed to be consistent with the bert tensor, the intended shapes are (in the simple case)

input_ids: (batch_size, wordpiece_sequence_length)
offsets: (batch_size, tokens_sequence_length)
mask: (batch_size, tokens_sequence_length)

but anyway, if you look at the BertEmbedder, the offsets are optional:

https://github.com/allenai/allennlp/blob/master/allennlp/modules/token_embedders/bert_token_embedder.py#L60

so if you change the embedder_to_indexer_map in your config so that it doesn't pass the offsets into the token embedder (see: https://gist.github.com/joelgrus/7cdb8fb2d81483a8d9ca121d9c617514#file-ner-bert-jsonnet-L34; most likely you can just get rid of the embedder_to_indexer_map in that case), then the output of the embedder will be one bert vector per wordpiece, which sounds like what you want?

@matthew-z

commented Dec 2, 2018

@joelgrus Thank you for the response.

Yes, I tried removing bert-offsets from embedder_to_indexer_map, and then the bert-encoded input became (batch_size, wordpiece_sequence_length) instead of (batch_size, tokens_sequence_length), so this code no longer works because #tokens != #wordpieces:

mask = util.get_text_field_mask(inputs)
encoded_inputs = self.text_field_embedder(inputs)
logits = self.seq2vec(encoded_inputs, mask)

In other words, I want a mask of shape (batch_size, wordpiece_sequence_length).
As a simple workaround, I used the bert tokenizer as the word splitter so that token = wordpiece:

from typing import List

from overrides import overrides
from pytorch_pretrained_bert import BertTokenizer
from allennlp.data.tokenizers.token import Token
from allennlp.data.tokenizers.word_splitter import WordSplitter

@WordSplitter.register("wordPiece")
class WordPiece(WordSplitter):

    def __init__(self, pretrained_model: str, do_lowercase: bool) -> None:
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(pretrained_model, do_lower_case=do_lowercase)

    @overrides
    def split_words(self, sentence: str) -> List[Token]:
        return [Token(t) for t in self.tokenizer.tokenize(sentence) if t]

config looks like:

  "dataset_reader": {
    "type": "mydataset",
    "tokenizer": {
         "type":"word",  
         "word_splitter":{
              "type":"wordPiece", 
              "pretrained_model": "bert-base-uncased",
              "do_lowercase": true
         }
    },
    "token_indexers": {
      "bert": {
          "type": "bert-pretrained",
          "pretrained_model": "bert-base-uncased",
          "do_lowercase": true,
      },
    },
  },

Then mask will become (batch_size, wordpiece_sequence_length).

@joelgrus

Collaborator Author

commented Dec 2, 2018

the problem here is that util.get_text_field_mask sees the "mask" key in your inputs and uses that as the mask even though here it's not what you want.

I think a simpler solution here is just not to use util.get_text_field_mask, you can do something like

mask = inputs["bert"] != 0

which is what the token embedder does internally:

https://github.com/allenai/allennlp/blob/master/allennlp/modules/token_embedders/bert_token_embedder.py#L85

@matthew-z

commented Dec 2, 2018

This solution is indeed much simpler.

Thank you!

@bheinzerling

commented Dec 4, 2018

Did anybody manage to get a CoNLL'03 dev score close to the reported 96.4 F1 with BERT_base? Best I've managed to get is 94.6 using the last transformer layer for classification, as described in the paper (huggingface/transformers#64 (comment)).

@matt-gardner

Member

commented Dec 4, 2018

@bheinzerling, this is from a slack conversation a couple of weeks ago with @matt-peters:

I've run a bunch of combinations, trying to initially reproduce their results in table 7.

They left out some important details in the paper for the NER task, namely that they used document context for each word since it's available in the raw data. So I was never able to reproduce their results since I was using sentence context.

It also makes their results not directly comparable to previous work, since your standard glove + biLSTM would also presumably improve with document context...

FYI: with my implementation, I got dev F1 95.09 +/- 0.07 for 2x200 dim LSTM + CRF for second-to-last layer (they reported 95.6 in table 7 without CRF). This uses sentence context to compute the BERT activations.

(This was done outside of AllenNLP.)

The last layer is slightly worse than the second-to-last layer, so your number seems to agree with @matt-peters' results.

@matt-peters

Contributor

commented Dec 4, 2018

FWIW, adding document context improves F1 a little, enough to reproduce the results in Table 7 (+/- noise from random seeds). Note that these numbers are from extracting features from BERT, not from fine-tuning.

@qiuwei

commented Dec 5, 2018

@joelgrus I had a look at the config for NER you provided. It seems that you didn't prepend the special [CLS] token at the start of each sentence.

However, I found this quite crucial in my local experiments. Did I miss anything?
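
For context, BERT expects its wordpiece input to be wrapped as [CLS] ... [SEP]; a minimal sketch of what prepending/appending those special wordpieces looks like (illustrative helper, not the indexer's actual code):

    from typing import List

    def add_special_wordpieces(wordpieces: List[str]) -> List[str]:
        # e.g. ["my", "dog"] -> ["[CLS]", "my", "dog", "[SEP]"]
        return ["[CLS]"] + wordpieces + ["[SEP]"]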

@joelgrus

Collaborator Author

commented Dec 5, 2018

let me take a look

@qiuwei

commented Dec 5, 2018

@matt-peters Hi matt, could you elaborate a bit more on adding document context?
Did you use the whole document (I believe that would often exceed the max length allowed by bert?) or just add a few sentences around the target sentence?

When a larger context is used, did your model predict the NER labels for the context as well?

@joelgrus

Collaborator Author

commented Dec 5, 2018

@qiuwei it looks like you are right and it's a "bug" in the token indexer, I'll open an issue for it, thanks for finding this

@matt-peters

Contributor

commented Dec 7, 2018

@qiuwei - I took the easy / simple implementation approach and just chunked the document into non-overlapping segments if it exceeded the maximum length usable by BERT. The only wrinkle is ensuring not to chunk the document in the middle of an entity span. This way each token still has an annotation, and the NER model still predicts labels for every token.
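
A rough sketch of that chunking strategy (the token-level budget and BIO labels here are assumptions for illustration, not the exact implementation):

    from typing import List, Tuple

    def chunk_document(tokens: List[str],
                       labels: List[str],
                       max_tokens: int = 128) -> List[Tuple[List[str], List[str]]]:
        # split into non-overlapping chunks, backing off a proposed boundary so we
        # never cut in the middle of an entity span (an I- tag continues a span)
        chunks = []
        start = 0
        while start < len(tokens):
            end = min(start + max_tokens, len(tokens))
            while end < len(tokens) and end > start + 1 and labels[end].startswith("I-"):
                end -= 1
            chunks.append((tokens[start:end], labels[start:end]))
            start = end
        return chunks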

@bikestra

commented Mar 6, 2019

@matt-peters Were you able to reproduce the BERT paper's results once you introduced the document context? I was able to get dev F1 95.3 using sentence context, but this is still 1.1 points behind what the authors claim, and I didn't see much of a boost using document context.

This work was done outside of allennlp, but I thought independently reproducing their result with any tool would help all of us make progress; I'm having trouble finding anyone who has successfully reproduced their NER results. Sorry if this disturbs the allennlp contributors.

@matt-peters

Contributor

commented Mar 6, 2019

I haven't tried to reproduce the fine tuning result (96.4 dev F1, BERT base, Table 3).

@kamalkraj

commented Mar 19, 2019

https://github.com/kamalkraj/BERT-NER
Replicated results from BERT paper

@kugwzk kugwzk referenced this pull request May 21, 2019
@stefan-it stefan-it referenced this pull request Aug 4, 2019