[WIP] Add BertIterator (MultiDataSetIterator for BERT training) #7430

Merged: 9 commits into master from ab_bert_iterator, Apr 4, 2019

Conversation

@AlexDBlack (Contributor) commented Apr 2, 2019

Fixes: #7287

@treo left a comment

Overall a good start. I'm mostly nitpicking on documentation and some edge case checks.

Concerning the tasks, I'm not quite satisfied.

From my understanding of the paper, the big thing about BERT is that they pre-train using both the MLM / Cloze task and the next-sentence classification task at the same time. This can also be seen where they create training examples in https://github.com/google-research/bert/blob/master/create_pretraining_data.py#L219. They use the same masked input for both tasks there.
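For illustration, here is a minimal sketch of what such a multi-task minibatch could look like as a MultiDataSet: one set of masked token indices feeding both an MLM label array and a next-sentence label array. All names and shapes are assumptions for illustration, not code from this PR:

```java
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.MultiDataSet;
import org.nd4j.linalg.factory.Nd4j;

public class MultiTaskBertSketch {
    public static void main(String[] args) {
        int mb = 2;           // minibatch size
        int numTokens = 16;   // sequence length
        int vocabSize = 30522;

        // One set of masked token indices (plus padding mask) feeds BOTH tasks,
        // as in create_pretraining_data.py
        INDArray tokenIdxs = Nd4j.zeros(DataType.INT, mb, numTokens);
        INDArray inputMask = Nd4j.ones(DataType.INT, mb, numTokens);

        // Task 1: MLM labels - the original token ids at the masked positions,
        // with a label mask that is 1 only where a token was actually masked
        INDArray mlmLabels = Nd4j.zeros(DataType.FLOAT, mb, vocabSize, numTokens);
        INDArray mlmLabelMask = Nd4j.zeros(DataType.INT, mb, numTokens);

        // Task 2: next-sentence label - binary "is sentence B the real next sentence?"
        INDArray nspLabels = Nd4j.zeros(DataType.FLOAT, mb, 2);

        MultiDataSet mds = new MultiDataSet(
                new INDArray[]{tokenIdxs, inputMask},
                new INDArray[]{mlmLabels, nspLabels},
                null,
                new INDArray[]{mlmLabelMask, null});
    }
}
```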

The SEQ_CLASSIFICATION task also doesn't look like it is an instance of the next-sentence classification task.

To make this a proper BertIterator, in my opinion we need to add both the next-sentence classification task and support for multi-task training.

* **RANK2_IDX**: return int32 [minibatch, numTokens] array with entries being class numbers
* **RANK3_NCL**: return float32 [minibatch, numClasses, numTokens] array with 1-hot entries along dimension 1
* **RANK3_NLC**: return float32 [minibatch, numTokens, numClasses] array with 1-hot entries along dimension 2

@treo commented Apr 3, 2019

Maybe add a short explanation of when someone would use each of those options.
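For reference, a quick sketch of how the three layouts relate. The values here are hypothetical; only the shapes and dimension order come from the Javadoc above:

```java
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class LabelFormatShapes {
    public static void main(String[] args) {
        int mb = 2, numTokens = 4, numClasses = 3;

        // RANK2_IDX: one class number per token,
        // e.g. token 1 of example 0 has class 2
        INDArray rank2 = Nd4j.zeros(DataType.INT, mb, numTokens);
        rank2.putScalar(0, 1, 2);

        // RANK3_NCL: one-hot along the class dimension (dimension 1),
        // shape [minibatch, numClasses, numTokens]
        INDArray rank3ncl = Nd4j.zeros(DataType.FLOAT, mb, numClasses, numTokens);
        rank3ncl.putScalar(0, 2, 1, 1.0); // same label: example 0, class 2, token 1

        // RANK3_NLC: shape [minibatch, numTokens, numClasses] - the same data
        // as RANK3_NCL with dimensions 1 and 2 permuted
        INDArray rank3nlc = rank3ncl.permute(0, 2, 1);
    }
}
```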

@AlexDBlack (Contributor, Author) commented Apr 4, 2019

@treo Thanks for the review

> From my understanding of the paper, the big thing about BERT is that they pre-train using both the MLM / Cloze task and the next-sentence classification task at the same time.

Yes, that is my understanding also, and the omission was intentional at this point. I want to get something usable sooner rather than later, even if it's not optimal. My main goal for now was the sequence classification task as that's what we'll be using initially for transfer learning.

As for supporting the "next sentence classification" task, I want that. But it'll be more complex to implement: instead of iterating over sentences, we'll want to iterate over documents, and we'll need some automated way to break text into sentences.
Alternatively (or additionally), we introduce a "labelled sentence pair iterator" for this.
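One possible shape for such a provider, purely as an illustrative sketch (this interface does not exist in the codebase at the time of this PR; all names are hypothetical):

```java
import java.util.List;

/**
 * Hypothetical sketch of a labelled sentence-pair provider for the
 * next-sentence / sentence-matching tasks. Illustrative only.
 */
public interface LabelledSentencePairProvider {

    /** Whether another (first sentence, second sentence, label) triple is available */
    boolean hasNext();

    /** The next pair plus its label, e.g. "IsNext" / "NotNext" for the NSP task */
    SentencePair next();

    /** Restart iteration from the beginning of the corpus */
    void reset();

    /** All possible labels, e.g. ["IsNext", "NotNext"] */
    List<String> allLabels();

    class SentencePair {
        public final String first;
        public final String second;
        public final String label;

        public SentencePair(String first, String second, String label) {
            this.first = first;
            this.second = second;
            this.label = label;
        }
    }
}
```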

So, future additions we'll need here:

  • Sentence matching task (supervised)
  • Next sentence classification task
  • Per token classification - #7436

There are some non-trivial challenges with those, and I planned to address them separately. (The Javadoc should reflect that, however.)

> The SEQ_CLASSIFICATION task also doesn't look like it is an instance of the next-sentence classification task.

It's not. It's designed for transfer learning, single sequence in, single label out... I'll clarify that.
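As a rough usage sketch of that single-sequence path, based on the builder added in this PR: the tokenizer, sentence source, and vocab setup are elided as placeholders, and exact method names or signatures may differ from the merged code:

```java
import java.util.Map;

import org.deeplearning4j.iterator.BertIterator;
import org.deeplearning4j.iterator.LabeledSentenceProvider;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.dataset.api.MultiDataSet;

public class SeqClassificationSketch {
    public static void main(String[] args) {
        // Placeholders - a WordPiece tokenizer, a (sentence, label) source,
        // and a token -> index vocab map would be supplied in real use
        TokenizerFactory tokenizer = null;
        LabeledSentenceProvider sentences = null;
        Map<String, Integer> vocab = null;

        BertIterator iter = BertIterator.builder()
                .tokenizer(tokenizer)
                .lengthHandling(BertIterator.LengthHandling.FIXED_LENGTH, 128)
                .minibatchSize(32)
                .sentenceProvider(sentences)
                .featureArrays(BertIterator.FeatureArrays.INDICES_MASK)
                .vocabMap(vocab)
                .task(BertIterator.Task.SEQ_CLASSIFICATION)
                .build();

        MultiDataSet mds = iter.next();
        // features[0]: int32 [minibatch, numTokens] token indices
        // labels[0]:   [minibatch, numLabels] - exactly one label per sequence
    }
}
```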

@AlexDBlack AlexDBlack force-pushed the ab_bert_iterator branch from 1d06db3 to dc6d529 Apr 4, 2019

@treo commented Apr 4, 2019

Great. With the clarification that the intent wasn't to support all the unsupervised pre-training options yet, and the fixes in the Javadoc, the only open question is when you'd use RANK3_NLC. My guess is that some of the imported models may need it?

@AlexDBlack AlexDBlack merged commit 288e258 into master Apr 4, 2019


@AlexDBlack AlexDBlack deleted the ab_bert_iterator branch Apr 4, 2019
