[WIP] Add BertIterator (MultiDataSetIterator for BERT training) #7430
treo left a comment
Overall a good start. I'm mostly nitpicking on documentation and some edge case checks.
Concerning the tasks, I'm not quite satisfied.
From my understanding of the paper, the big thing about BERT is that they pre-train using both the MLM / Cloze task and the next-sentence classification task at the same time. This can also be seen where they create training examples in https://github.com/google-research/bert/blob/master/create_pretraining_data.py#L219. They use the same masked input for both tasks there.
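To illustrate the point (this is not DL4J code, just a minimal Python sketch of what `create_pretraining_data.py` does, with simplified masking that always substitutes `[MASK]` rather than applying the paper's 80/10/10 rule): a single pre-training example carries labels for *both* tasks, built from one masked token sequence.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def make_pretraining_example(sent_a, sent_b, is_next, mask_prob=0.15, rng=None):
    """Build one example with labels for BOTH tasks:
    masked-token (MLM/Cloze) labels and the next-sentence label."""
    rng = rng or random.Random(0)
    tokens = [CLS] + sent_a + [SEP] + sent_b + [SEP]
    masked = list(tokens)
    mlm_positions, mlm_labels = [], []
    for i, tok in enumerate(tokens):
        if tok in (CLS, SEP):          # never mask special tokens
            continue
        if rng.random() < mask_prob:
            mlm_positions.append(i)    # position of the masked token
            mlm_labels.append(tok)     # original token = MLM target
            masked[i] = MASK           # simplified: always [MASK]
    return {"tokens": masked,
            "mlm_positions": mlm_positions,
            "mlm_labels": mlm_labels,
            "is_next": is_next}        # next-sentence label, same input
```

The key property is that the same masked input feeds both heads, so an iterator producing MultiDataSets would emit one features array with two label arrays.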
The SEQ_CLASSIFICATION task also doesn't look like it is an instance of the next-sentence classification task.
To make this a proper BertIterator, in my opinion we need to add the next-sentence classification task as well as allow multi-task training.
@treo Thanks for the review
Yes, that is my understanding also, and the omission was intentional at this point. I want to get something usable sooner rather than later, even if it's not optimal. My main goal for now was the sequence classification task as that's what we'll be using initially for transfer learning.
As for supporting the "next sentence classification" task, I want that. But it'll be more complex to implement: instead of iterating over sentences, we'll want to iterate over documents, and we'll need some automated way to break text into sentences.
So, future additions we'll need here:
There are some non-trivial challenges with those, and I planned to address them separately. (Javadoc should reflect that, however.)
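To sketch why document-level iteration is the harder part (again just an illustrative Python sketch, not the DL4J implementation; the regex splitter is a stand-in, since real use would need a proper sentence segmenter): next-sentence pairs have to be generated per document, with roughly 50% true next sentences and 50% random sentences drawn from a different document.

```python
import random
import re

def split_sentences(document):
    """Naive sentence splitter (assumption: a real iterator would use a
    proper segmenter; this regex just splits after ., !, or ?)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def nsp_pairs(documents, rng=None):
    """Yield (sent_a, sent_b, is_next) triples: ~50% actual next sentence,
    ~50% a sentence taken from another document."""
    rng = rng or random.Random(42)
    for d, doc in enumerate(documents):
        sents = split_sentences(doc)
        for i in range(len(sents) - 1):
            if rng.random() < 0.5:
                yield sents[i], sents[i + 1], True
            else:
                # pick a sentence from a different document
                # (degenerates to the same doc if only one is given)
                other = documents[(d + 1) % len(documents)]
                yield sents[i], rng.choice(split_sentences(other)), False
```

This is what makes it structurally different from the current sentence-level iteration: the unit of input becomes the document, and sentence segmentation quality directly affects the training pairs.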
It's not. It's designed for transfer learning, single sequence in, single label out... I'll clarify that.