
add BERT token embedder #2067

Merged
merged 19 commits on Nov 26, 2018

Conversation

@joelgrus (Contributor) commented Nov 16, 2018

this is ready for review. in addition to the included unit tests, I trained two NER models using these embeddings (unfortunately, I realized this morning, I used the uncased BERT model, which seems like a bad idea for NER)

(1) only BERT embeddings: https://beaker-internal.allenai.org/ex/ex_rnk3mcplnpjz/tasks
(2) BERT embeddings + character embeddings: https://beaker-internal.allenai.org/ex/ex_nrq8d5vw5cb2/tasks

(apologies to non-AI2 people for the beaker-internal links)

as discussed offline, because of the positional encodings the BERT embedding has a max sequence length and will crash if you feed it longer sequences. this implementation simply truncates longer sequences and logs a warning. I left a TODO to come up with something better.
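roughly, the truncation looks like this (an illustrative sketch, not the exact code in this PR; MAX_WORDPIECES and maybe_truncate are made-up names, and 512 is the limit for the released BERT models):

    import logging
    from typing import List

    logger = logging.getLogger(__name__)

    MAX_WORDPIECES = 512  # the released BERT models have 512 positional embeddings

    def maybe_truncate(wordpiece_ids: List[int]) -> List[int]:
        # truncate sequences that exceed BERT's maximum length and log a warning
        if len(wordpiece_ids) > MAX_WORDPIECES:
            logger.warning("Sequence of length %d exceeds the maximum of %d; truncating.",
                           len(wordpiece_ids), MAX_WORDPIECES)
            # TODO: something smarter, e.g. a sliding window over the sequence
            return wordpiece_ids[:MAX_WORDPIECES]
        return wordpiece_ids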

@SparkJiao commented Nov 17, 2018

I really appreciate your work on BERT. I'm considering using BERT for my task, so could I use your implementation now?

@joelgrus (Contributor, Author) commented Nov 17, 2018

I wouldn't recommend using it for anything important; the code hasn't been thoroughly tested yet.

(you can try it and tell me how well it works though. 😀)

@handsomezebra (Contributor) commented Nov 17, 2018

Thanks for all the work!
I'd prefer that the BERT code be included in the allennlp library rather than pip-installed. One reason is that people may want to experiment with the code within the allennlp framework.
Also, having ELMo and BERT in the same place is fun.

@SparkJiao commented Nov 17, 2018

Unfortunately, I think I've found a bug.
When sorting batches using sorting keys containing {"passage", "num_tokens"}, it reports that the key num_tokens doesn't exist. The cause is that the bert token indexer produces arrays of tokens with lengths different from those of the single_id token_indexer and character_tokenizer, so the key "num_tokens" never gets added to the "padding_lengths" dict.
The config file is dialog_qa.jsonnet, and I'll try modifying the sorting keys as a temporary workaround.

@SparkJiao commented Nov 17, 2018

So are there problems with using GloVe, single_id token characters, and BERT at the same time? Since BERT produces sequences with lengths different from the others, maybe we can't join the word embeddings? Sorry to bother you.

@thomwolf (Contributor) commented Nov 17, 2018

Hi @joelgrus, I've released our implementation on pip (see https://github.com/huggingface/pytorch-pretrained-BERT). Sorry for the delay! Tell me if I can change anything to make it easier for you to integrate in AllenNLP!

@joelgrus (Contributor, Author) commented Nov 17, 2018

the timing is perfect, thanks so much!

@joelgrus (Contributor, Author) commented Nov 17, 2018

@SparkJiao the indexer produces a bert-offsets field that contains the indices of the last wordpiece for each word. if you pass this to the bert token embedder, it will only return the embedding of the last wordpiece per original token, which gives it the right size. (whether this is the optimal way to accomplish this I'm less sure of)

you can do this from your config file by adding something like the following to your text_field_embedder:

            "allow_unmatched_keys": true,
            "embedder_to_indexer_map": {
                "tokens": ["tokens"],  // etc
                "bert": ["bert", "bert-offsets"]
            },

so that the model knows to pass the offsets into the bert token embedder.
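as a rough illustration of what the offsets accomplish (a made-up sketch, not the embedder's actual code):

    import torch

    # pretend BERT produced one vector per wordpiece
    batch_size, num_wordpieces, dim = 2, 7, 4
    wordpiece_embeddings = torch.randn(batch_size, num_wordpieces, dim)

    # bert-offsets: index of the last wordpiece of each original token
    offsets = torch.tensor([[0, 2, 3, 6],
                            [1, 2, 4, 5]])     # (batch_size, num_tokens)

    # pick out one vector per original token
    index = offsets.unsqueeze(-1).expand(-1, -1, dim)
    token_embeddings = wordpiece_embeddings.gather(1, index)   # (batch_size, num_tokens, dim)

so the output lines up with the original tokens rather than the wordpieces.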

@SparkJiao commented Nov 19, 2018

@joelgrus Sorry for the late reply, and thank you very much for your help. I've recently run into some other problems that I can't solve, so I may wait for your official tutorial :(
Thanks a lot!

@susht3 commented Nov 19, 2018

This is mostly implemented, with two caveats:

(1) the end to end test for it is failing (and is itself not fully written), but for reasons that I'm pretty sure are unrelated to BERT or the new code (and that I'm pretty sure are related to me doing something really dumb somewhere that I haven't figured out)

(2) huggingface said they're going to pip release their implementation, but afaik they haven't, so for now I copied over the relevant files, but I'm using them as if they were a library, so when the library gets released I should (in theory) just have to change a handful of import statements and everything should work.

Now that they have released their implementation on pip, could you please add BERT?

@joelgrus joelgrus changed the title WIP: add BERT token embedder add BERT token embedder Nov 21, 2018
@joelgrus joelgrus requested a review from DeNeutoy Nov 21, 2018
@matt-gardner (Contributor) left a comment

Code largely looks good. The big question to me is how to use BERT embeddings if you're not fine tuning (or making sure that it works as expected if you are fine tuning).

@@ -182,7 +182,7 @@ expected-line-ending-format=
[BASIC]

# Good variable names which should always be accepted, separated by a comma
good-names=i,j,k,ex,Run,_
good-names=i,j,k,ex,Run,_,logger
@matt-gardner (Contributor) Nov 21, 2018

Hmm, I didn't realize you could do this. Nice find.

@@ -351,13 +351,14 @@ def _check_is_dict(self, new_history, value):
return value

@staticmethod
def from_file(params_file: str, params_overrides: str = "") -> 'Params':
def from_file(params_file: str, params_overrides: str = "", ext_vars: dict = {}) -> 'Params':
# pylint: disable=dangerous-default-value
@matt-gardner (Contributor) Nov 21, 2018

I think I agree with pylint here - why not just have this be None, and add a line to the logic below? Much less error-prone in future maintenance of this code.

@matt-gardner (Contributor) Nov 21, 2018

It'd also be nice to document what this does.


@overrides
def count_vocab_items(self, token: Token, counter: Dict[str, Dict[str, int]]):
# If we only use pretrained models, we don't need to do anything here.
@matt-gardner (Contributor) Nov 21, 2018

In the docstring above, it looked like you were recommending a different class for pretrained models. Are you talking about pretrained WordpieceTokenizers here?


def _add_encoding_to_vocabulary(self, vocabulary: Vocabulary) -> None:
# pylint: disable=protected-access
for word, idx in self.vocab.items():
@matt-gardner (Contributor) Nov 21, 2018

If this takes a while, you might consider putting a logging statement (or even a tqdm) in here.

logger = logging.getLogger(__name__)


class BertIndexer(TokenIndexer[int]):
@matt-gardner (Contributor) Nov 21, 2018

Maybe call this a WordpieceIndexer? This is more general than BERT. The class below is the BERT-specific one.

bert_model: ``BertModel``
The BERT model being wrapped.
"""
def __init__(self,
@matt-gardner (Contributor) Nov 21, 2018

All on one line? Also, what's the motivation for splitting this class up into two? So that people can either instantiate the BertModel themselves or use a string to reference it? Do we really need both of these?

@matt-gardner (Contributor) Nov 21, 2018

Oh, looking at the config fixture, looks like this will let you train your own bert model if you want? Ok, yeah, that's definitely sufficient motivation to split this up.

Parameters
----------
input_ids: ``torch.LongTensor``
The wordpiece ids for each input sentence.
@matt-gardner (Contributor) Nov 21, 2018

Sentence? I'm not sure that's correct. It'd probably clear up what you meant here if you gave an expected shape.

If an input consists of two sentences (as in the BERT paper),
tokens from the first sentence should have type 0 and tokens from
the second sentence should have type 1. If you don't provide this
(the default BertIndexer doesn't) then it's assumed to be all 0s.
@matt-gardner (Contributor) Nov 21, 2018

Is there a way to modify the indexer to provide this? It's fine to do it in another PR, but I'd at least open an issue to track adding that. Seems pretty important if you want to use this for SQuAD.
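For reference, the token type ids for a two-sentence (question + passage) input would look roughly like this (illustrative only; not something the current indexer produces):

    import torch

    # [CLS] question wordpieces [SEP] passage wordpieces [SEP]
    num_question_wordpieces = 4
    num_passage_wordpieces = 7

    token_type_ids = torch.tensor(
        [0] * (1 + num_question_wordpieces + 1) +   # [CLS] + question + [SEP] -> type 0
        [1] * (num_passage_wordpieces + 1)          # passage + [SEP]          -> type 1
    ).unsqueeze(0)                                  # (1, total_wordpieces)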

passage1 = "There were four major HDTV systems tested by SMPTE in the late 1970s, and in 1979 an SMPTE study group released A Study of High Definition Television Systems:"
question1 = "Who released A Study of High Definition Television Systems?"

passage2 = """Broca, being what today would be called a neurosurgeon, had taken an interest in the pathology of speech. He wanted to localize the difference between man and the other animals, which appeared to reside in speech. He discovered the speech center of the human brain, today called Broca's area after him. His interest was mainly in Biological anthropology, but a German philosopher specializing in psychology, Theodor Waitz, took up the theme of general and social anthropology in his six-volume work, entitled Die Anthropologie der Naturvölker, 1859–1864. The title was soon translated as "The Anthropology of Primitive Peoples". The last two volumes were published posthumously."""
@matt-gardner (Contributor) Nov 21, 2018

You're using triple quotes already; can you make this into multiple shorter lines instead of one super long one? Same with the passage above.

input_mask = (input_ids != 0).long()

all_encoder_layers, _ = self.bert_model(input_ids, input_mask, token_type_ids)
sequence_output = all_encoder_layers[-1]
@matt-gardner (Contributor) Nov 21, 2018

The top layer is a particularly bad layer to use for transfer (talk to Nelson about why, or wait for his final internship presentation). If we're fine-tuning, this is ok, but if we're not, we probably need some kind of scalar mix or something here, instead of just taking the last layer. This is probably why your performance is so poor for your NER experiment.

@DeNeutoy (Contributor) Nov 21, 2018

Table 2 of the original elmo paper suggests otherwise (unless this is specific to transformers)? Perhaps this accounts for ~1 F1 difference, not 20+.

@matt-gardner (Contributor) Nov 21, 2018

Layer 1 is pretty consistently better than high layers across tasks, in Nelson's experiments (with 2 and 4 layer ELMo). The top layer is specialized for language modeling, and only a few tasks are close enough to benefit from that specialization (turns out NER is one of those). For transformers, the story is a bit different, but middle layers are still better than the top layer.

Table 2 of the ELMo paper didn't try just using layer 1.

@DeNeutoy (Contributor) Nov 21, 2018

I wasn't saying there isn't a difference - just that even with the top layer, it should be there or thereabouts. So that isn't the problem/difference to look into here for the NER model that Joel trained.

@DeNeutoy (Contributor) Nov 21, 2018

I missed that slack convo. Nevermind.
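For reference, the scalar mix suggested above could look roughly like this (a sketch using allennlp's ScalarMix; not part of this PR):

    import torch
    from allennlp.modules.scalar_mix import ScalarMix

    num_layers = 12   # bert-base
    scalar_mix = ScalarMix(mixture_size=num_layers, trainable=True)

    # all_encoder_layers: list of (batch_size, seq_len, dim) tensors from the BERT model
    all_encoder_layers = [torch.randn(2, 9, 768) for _ in range(num_layers)]
    mixed = scalar_mix(all_encoder_layers)   # (batch_size, seq_len, dim), a learned weighted sum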

@joelgrus (Contributor, Author) commented Nov 27, 2018

@susht3 this should be resolved by #2104

@matthew-z commented Dec 1, 2018

@joelgrus I was wondering if I could avoid using offsets? I found that using them makes the mask inconsistent with the BERT tensor, so I think I should use the BERT tokenizer in the data reader, but I'm not sure how to do that. I only found a word splitter for bert-basic, but I guess that's only for Chinese data (character based).

@joelgrus (Contributor, Author) commented Dec 1, 2018

mask is not supposed to be consistent with the bert tensor, the intended shapes are (in the simple case)

input_ids: (batch_size, wordpiece_sequence_length)
offsets: (batch_size, tokens_sequence_length)
mask: (batch_size, tokens_sequence_length)

but anyway, if you look at the BertEmbedder, the offsets are optional:

https://github.com/allenai/allennlp/blob/master/allennlp/modules/token_embedders/bert_token_embedder.py#L60

so if you change the embedder_to_indexer_map in your config to not pass the offsets into the token embedder (see: https://gist.github.com/joelgrus/7cdb8fb2d81483a8d9ca121d9c617514#file-ner-bert-jsonnet-L34) (most likely in that case you can just get rid of the embedder_to_indexer_map) then the output of the embedder will be one bert vector per wordpiece, which it sounds like is what you want?

@matthew-z commented Dec 2, 2018

@joelgrus Thank you for the response.

Yes, I tried removing bert-offsets from embedder_to_indexer_map, and then the BERT-encoded input became (batch_size, wordpiece_sequence_length) instead of (batch_size, tokens_sequence_length), so this code no longer works because #tokens != #wordpieces:

mask = util.get_text_field_mask(inputs)
encoded_inputs = self.text_field_embedder(inputs)
logits = self.seq2vec(encoded_inputs, mask)

In other words, I want a mask with the shape (batch_size, wordpiece_sequence_length).
I used a simple workaround: a word splitter that uses the BERT tokenizer, so that token = wordpiece:

@WordSplitter.register("wordPiece")
class WordPiece(WordSplitter):

    def __init__(self, pretrained_model, do_lowercase) -> None:
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(pretrained_model, do_lower_case=do_lowercase)

    @overrides
    def split_words(self, sentence: str) -> List[Token]:
        return [Token(t) for t in self.tokenizer.tokenize(sentence) if t]

config looks like:

  "dataset_reader": {
    "type": "mydataset",
    "tokenizer": {
         "type":"word",  
         "word_splitter":{
              "type":"wordPiece", 
              "pretrained_model": "bert-base-uncased",
              "do_lowercase": true
         }
    },
    "token_indexers": {
      "bert": {
          "type": "bert-pretrained",
          "pretrained_model": "bert-base-uncased",
          "do_lowercase": true,
      },
    },
  },

Then mask will become (batch_size, wordpiece_sequence_length).

@joelgrus (Contributor, Author) commented Dec 2, 2018

the problem here is that util.get_text_field_mask sees the "mask" key in your inputs and uses that as the mask even though here it's not what you want.

I think a simpler solution here is just not to use util.get_text_field_mask, you can do something like

mask = inputs["bert"] != 0

which is what the token embedder does internally:

https://github.com/allenai/allennlp/blob/master/allennlp/modules/token_embedders/bert_token_embedder.py#L85
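concretely, outside the embedder that could be a tiny helper like this (a sketch; the "bert" key is whatever name you gave the indexer in your config):

    from typing import Dict

    import torch

    def wordpiece_mask(inputs: Dict[str, torch.LongTensor]) -> torch.LongTensor:
        # nonzero wordpiece ids are real tokens, zeros are padding, so this mask has
        # shape (batch_size, wordpiece_sequence_length) and matches the per-wordpiece
        # BERT output, unlike util.get_text_field_mask
        return (inputs["bert"] != 0).long()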

@matthew-z commented Dec 2, 2018

This solution is indeed much simpler.

Thank you!

@bheinzerling commented Dec 4, 2018

Did anybody manage to get a CoNLL'03 dev score close to the reported 96.4 F1 with BERT_base? Best I've managed to get is 94.6 using the last transformer layer for classification, as described in the paper (huggingface/transformers#64 (comment)).

@matt-gardner (Contributor) commented Dec 4, 2018

@bheinzerling, this is from a slack conversation a couple of weeks ago with @matt-peters:

I've run a bunch of combinations, trying to initially reproduce their results in table 7.

They left out some important details in the paper for the NER task, namely that they used document context for each word since it's available in the raw data. So I was never able to reproduce their results since I was using sentence context.

It also makes their results not directly comparable to previous work, since your standard glove + biLSTM would also presumably improve with document context...

FYI: with my implementation, I got dev F1 95.09 +/- 0.07 for 2x200 dim LSTM + CRF for second-to-last layer (they reported 95.6 in table 7 without CRF). This uses sentence context to compute the BERT activations.

(This was done outside of AllenNLP.)

The last layer is slightly worse than the second-to-last layer, so your number seems to agree with @matt-peters' results.

@matt-peters (Contributor) commented Dec 4, 2018

FWIW, adding document context improves F1 a little to reproduce the results in Table 7 (+/- noise from random seeds). Note that these numbers are from extracting features from BERT, not fine-tuning.

@qiuwei (Contributor) commented Dec 5, 2018

@joelgrus I had a look at the config for NER you provided. It seems that you didn't prepend the special [CLS] token at the start of each sentence.

However, I found this quite crucial in my local experiments. Did I miss anything?

@joelgrus (Contributor, Author) commented Dec 5, 2018

let me take a look

@qiuwei (Contributor) commented Dec 5, 2018

@matt-peters Hi Matt, could you say a bit more about adding document context?
Did you use the whole document (I believe that would often exceed the max length allowed by BERT?) or just add a few sentences around the target sentence?

When a larger context is used, did your model predict the NER labels for the context as well?

@joelgrus (Contributor, Author) commented Dec 5, 2018

@qiuwei it looks like you are right and it's a "bug" in the token indexer, I'll open an issue for it, thanks for finding this

@matt-peters (Contributor) commented Dec 7, 2018

@qiuwei - I took the easy / simple implementation approach and just chunked the document into non-overlapping segments if it exceeded the maximum length usable by BERT. The only wrinkle is ensuring not to chunk the document in the middle of an entity span. This way each token still has an annotation, and the NER model still predicts labels for every token.
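A rough sketch of that chunking (an illustration, not matt-peters' code; it assumes BIO-style tags, so a chunk boundary just needs to avoid landing on an I- tag):

    from typing import List, Tuple

    def chunk_document(tokens: List[str],
                       tags: List[str],
                       max_len: int) -> List[Tuple[List[str], List[str]]]:
        # split a document into non-overlapping chunks of at most max_len tokens,
        # backing the split point off so it never falls inside an entity span
        chunks = []
        start = 0
        while start < len(tokens):
            end = min(start + max_len, len(tokens))
            while end < len(tokens) and end > start + 1 and tags[end].startswith("I-"):
                end -= 1
            chunks.append((tokens[start:end], tags[start:end]))
            start = end
        return chunks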

@bikestra commented Mar 6, 2019

@matt-peters Were you able to reproduce the BERT paper's results once you introduced the document context? I was able to get dev F1 95.3 using sentence context, but this is still 1.1 points behind what the authors claim, and I didn't see much of a boost from using document context.

This was work done outside of allennlp, but I thought independently reproducing their result with any tool would help all of us make progress; I'm having trouble finding anyone who has been able to successfully reproduce their NER results. Sorry if this disturbs the allennlp contributors.

@matt-peters (Contributor) commented Mar 6, 2019

I haven't tried to reproduce the fine tuning result (96.4 dev F1, BERT base, Table 3).

@kamalkraj commented Mar 19, 2019

https://github.com/kamalkraj/BERT-NER
Replicated results from BERT paper

@pasinit commented Oct 27, 2019

Did anyone try computing the average of the wordpieces instead of using the first wordpiece of each word? For example, if the token "longtoken" is split into "long" and "token", as I understand it one can currently just take the embedding of "long" for the whole token. How easy would it be to take the average of "long" and "token" instead?
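(For illustration, averaging wordpieces given per-token spans could look like this sketch with made-up names; it isn't something the embedder in this PR does:)

    import torch

    wordpiece_embeddings = torch.randn(1, 5, 768)   # (batch, num_wordpieces, dim)

    # hypothetical spans: token i covers wordpieces starts[i] .. ends[i] inclusive,
    # e.g. "longtoken" -> ["long", "token"]
    starts = [0, 1, 3]
    ends = [0, 2, 4]

    token_embeddings = torch.stack(
        [wordpiece_embeddings[:, s:e + 1, :].mean(dim=1) for s, e in zip(starts, ends)],
        dim=1)                                       # (batch, num_tokens, dim)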

@wangxinyu0922 commented Oct 26, 2020

So has anyone successfully reproduced the score reported in the paper? I want to reproduce it but I'm finding it very hard (only ~91.5 with document context + fine-tuning).

@dsindex commented Apr 3, 2021

@wangxinyu0922

With document context, I reached 92.35% (bert-base-cased).

dsindex/ntagger#4 (comment)
