Pretraining SciBERT #32

Closed
xycforgithub opened this issue Apr 11, 2019 · 8 comments
Comments

@xycforgithub

xycforgithub commented Apr 11, 2019

Hi,
The repo does not seem to contain the code to pretrain the model on the Semantic Scholar corpus. Do you plan to release that code and the pretraining data? Thanks!

Yichong

@ibeltagy
Collaborator

We used the BERT code from Google: https://github.com/google-research/bert
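
For context, a minimal sketch of how the two pretraining scripts in that repo are typically invoked (that repo targets TensorFlow 1.x; all paths and hyperparameters below are placeholders, not the settings used for SciBERT):

```python
# Sketch only: invokes the two pretraining scripts from
# https://github.com/google-research/bert. Paths and hyperparameters are
# placeholders, not the SciBERT settings.
import subprocess

# Step 1: turn raw text (one sentence per line, blank line between documents)
# into masked-LM / next-sentence-prediction TFRecord examples.
subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=corpus.txt",
    "--output_file=pretrain.tfrecord",
    "--vocab_file=vocab.txt",
    "--do_lower_case=True",
    "--max_seq_length=512",
    "--max_predictions_per_seq=75",
    "--masked_lm_prob=0.15",
    "--dupe_factor=5",
], check=True)

# Step 2: run masked-LM + next-sentence-prediction training on those examples.
subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=pretrain.tfrecord",
    "--output_dir=pretraining_output",
    "--do_train=True",
    "--do_eval=True",
    "--bert_config_file=bert_config.json",
    "--train_batch_size=32",
    "--max_seq_length=512",
    "--max_predictions_per_seq=75",
    "--num_train_steps=100000",
    "--learning_rate=1e-4",
], check=True)
```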

@xycforgithub
Author

Thanks! And where did you get the data?

@ibeltagy
Collaborator

ibeltagy commented Apr 11, 2019

As mentioned in the paper, we used the Semantic Scholar corpus, which is not publicly available. The publicly available part is https://api.semanticscholar.org/corpus/, which has titles and abstracts but not full text.

@eric-haibin-lin

@ibeltagy thanks for the reply. What mlm_loss should I expect the model to converge to if I use the same dataset?

@ibeltagy
Collaborator

With a sequence length of 512 tokens, the losses converge to around the following values:

```
loss = 1.311045
masked_lm_accuracy = 0.7187241
masked_lm_loss = 1.2882848
next_sentence_accuracy = 0.9939219
next_sentence_loss = 0.0196654
```

@sibyl1956

sibyl1956 commented May 12, 2019

@ibeltagy Thanks for the reply. Does that mean your pretrained SciBERT model reaches a masked_lm_accuracy of around 0.718? In the original BERT pretraining, the model reaches around 0.98 masked_lm_accuracy and about 1.0 next_sentence_accuracy. Do you think a masked_lm_accuracy of around 0.718 is enough?
I am also pretraining my own model on a custom dataset, which adds around 1000 new tokens that are not in the BERT vocabulary. My model also reaches just over 0.7 masked_lm_accuracy and has improved very slowly since then. What masked_lm_accuracy or next_sentence_accuracy should I expect my pretrained model to achieve? Are there any tricks for fine-tuning the pretrained model on a custom corpus?
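
For reference, one common way to add new tokens before continuing pretraining, sketched here with the Hugging Face transformers library rather than the TensorFlow scripts discussed above (the token list below is a made-up placeholder):

```python
# Sketch: extend the vocabulary and resize the embedding matrix before
# continuing masked-LM training. Uses the Hugging Face "transformers" library,
# not the google-research/bert scripts; the new tokens are placeholders.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific tokens that are not in the stock vocabulary.
new_tokens = ["perovskite", "photocatalysis", "heterojunction"]
num_added = tokenizer.add_tokens(new_tokens)

# The new embedding rows are randomly initialized, so they only become useful
# after further masked-LM training on the domain corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```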

@kyleclo
Collaborator

kyleclo commented Jun 28, 2019

@sibyl1956 This is a good point, and I suspect it is due to the noisy PDF parse of the scientific corpus. Namely, we didn't do anything to remove the tables, equations, weird tokens, etc. that were output by PDFBox when converting the raw PDFs to a text stream. It's essentially impossible for the model to predict those masked tokens. We're currently curating an updated, larger, and cleaner version of the pretraining corpus and will investigate whether the noisy tokens are the cause. As it stands, the currently released SciBERT weights are still very good for downstream tasks (but we can definitely do better).
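
As an illustration of the kind of cleanup being referred to (not something applied to the released SciBERT weights), a simple heuristic that drops lines dominated by non-alphabetic characters, which tends to catch table rows and similar PDF-parse artifacts:

```python
# Illustrative heuristic only: drop lines from a PDF-parsed text stream whose
# characters are mostly non-alphabetic (table rows, numeric columns, page
# furniture). The 0.6 threshold is an arbitrary placeholder.
def looks_like_prose(line: str, min_alpha_ratio: float = 0.6) -> bool:
    stripped = line.strip()
    if not stripped:
        return False
    good = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return good / len(stripped) >= min_alpha_ratio

def filter_pdf_parse(lines):
    """Keep only lines that look like running text."""
    return [line for line in lines if looks_like_prose(line)]

if __name__ == "__main__":
    sample = [
        "We evaluate SciBERT on a suite of scientific NLP tasks.",
        "| 0.91 | 0.87 | 3.2e-4 | 12.5 |",  # table row from a PDF parse
    ]
    print(filter_pdf_parse(sample))  # keeps only the prose line
```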

@kyleclo
Collaborator

kyleclo commented Jun 28, 2019

I'm closing this issue for now since it looks like the original question chain was answered. Feel free to reopen or start a new issue.

@kyleclo kyleclo closed this as completed Jun 28, 2019