Pretraining SciBERT #32
We used the BERT code from Google: https://github.com/google-research/bert

Thanks! And where did you get the data?

As mentioned in the paper, we used the Semantic Scholar corpus, which is not publicly available. The publicly available part is https://api.semanticscholar.org/corpus/, which has titles and abstracts but not full text.

@ibeltagy Thanks for the reply. What mlm_loss should I expect the model to converge to if I used the same dataset?
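For reference, pretraining with that repo boils down to two scripts from its README: one to turn raw text into masked-LM/NSP TFRecords, and one to run the training loop. A minimal sketch follows; the file paths and hyperparameter values here are placeholders, not the exact settings used for SciBERT:

```shell
# Input format: one sentence per line, blank line between documents.
python create_pretraining_data.py \
  --input_file=corpus.txt \
  --output_file=pretrain.tfrecord \
  --vocab_file=vocab.txt \
  --do_lower_case=True \
  --max_seq_length=512 \
  --max_predictions_per_seq=75 \
  --masked_lm_prob=0.15 \
  --dupe_factor=10

# Train on the generated TFRecords; eval reports masked_lm_accuracy
# and next_sentence_accuracy among other metrics.
python run_pretraining.py \
  --input_file=pretrain.tfrecord \
  --output_dir=pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=512 \
  --max_predictions_per_seq=75 \
  --num_train_steps=100000 \
  --learning_rate=1e-4
```

Note that `max_predictions_per_seq` should be roughly `masked_lm_prob * max_seq_length`, and that training from scratch at this scale generally requires TPU-class hardware.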
With 512 tokens, the losses are around the following numbers:
@ibeltagy Thanks for the reply. Does that mean your pretrained SciBERT model reaches a masked_lm_accuracy of around 0.718? However, the original BERT model reaches around 0.98 masked_lm_accuracy and about 1.0 next_sentence accuracy. Do you think a masked_lm_accuracy of around 0.718 is enough?
@sibyl1956 This is a good point, and I suspect it is due to the noisy PDF parse of the scientific corpus. Namely, we didn't do anything to remove the tables, equations, weird tokens, etc. that PDFBox produced when converting the raw PDFs to a text stream. It's essentially impossible for the model to predict such masked tokens. We're currently in the process of curating an updated, larger, and cleaner version of the pretraining corpus, and will investigate whether the noisy tokens are the cause of this. As it stands, the current released SciBERT weights are still very good for downstream tasks (but we can definitely do better).
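For clarity on the metric being compared above: BERT's pretraining eval computes masked_lm_accuracy as the weight-averaged fraction of masked positions where the argmax prediction matches the true token (padding positions get weight 0). A minimal NumPy sketch of that computation, with a hypothetical function name and toy inputs of our own:

```python
import numpy as np

def masked_lm_accuracy(logits, label_ids, label_weights):
    """Fraction of masked positions where the argmax prediction is correct.

    logits:        [num_masked, vocab_size] scores per masked position
    label_ids:     [num_masked] true token ids at the masked positions
    label_weights: [num_masked] 1.0 for real masked tokens, 0.0 for padding
    """
    predictions = np.argmax(logits, axis=-1)
    correct = (predictions == label_ids).astype(np.float64)
    return float(np.sum(correct * label_weights) / np.sum(label_weights))

# Toy example: 4 masked positions over a 5-token vocabulary, one padded.
logits = np.array([
    [0.1, 0.9, 0.0, 0.0, 0.0],   # argmax -> token 1 (matches label)
    [0.0, 0.0, 0.8, 0.1, 0.1],   # argmax -> token 2 (matches label)
    [0.7, 0.1, 0.1, 0.1, 0.0],   # argmax -> token 0 (label is 3: wrong)
    [0.0, 0.0, 0.0, 0.0, 1.0],   # padding position, weight 0, ignored
])
label_ids = np.array([1, 2, 3, 4])
label_weights = np.array([1.0, 1.0, 1.0, 0.0])

print(masked_lm_accuracy(logits, label_ids, label_weights))  # 2/3 ≈ 0.667
```

Because noisy PDF-derived tokens are near-unpredictable, they drag this average down even when the model predicts clean text well.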
I'm closing this issue for now since it looks like the original question chain was answered. Feel free to reopen or start a new issue.
Hi,
The repo does not seem to contain the code to pretrain the model on the Semantic Scholar corpus. Do you plan to release that code and the pretraining data? Thanks!
Yichong