Pretraining SciBERT #32

Closed
xycforgithub opened this issue Apr 11, 2019 · 8 comments
Comments

@xycforgithub

xycforgithub commented Apr 11, 2019

Hi,
The repo does not seem to contain the code to pretrain the model on the Semantic Scholar corpus. Do you plan to release that code and the pretraining data? Thanks!

Yichong

@ibeltagy
Collaborator

We used the BERT code from Google: https://github.com/google-research/bert
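
For context, a minimal sketch of how the two pretraining scripts in that repo are typically invoked (that repo targets TensorFlow 1.x; all paths and hyperparameters below are placeholders, not the settings used for SciBERT):

```python
# Sketch only: invokes the two pretraining scripts from
# https://github.com/google-research/bert. Paths and hyperparameters are
# placeholders, not the SciBERT settings.
import subprocess

# Step 1: turn raw text (one sentence per line, blank line between documents)
# into masked-LM / next-sentence-prediction TFRecord examples.
subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=corpus.txt",
    "--output_file=pretrain.tfrecord",
    "--vocab_file=vocab.txt",
    "--do_lower_case=True",
    "--max_seq_length=512",
    "--max_predictions_per_seq=75",
    "--masked_lm_prob=0.15",
    "--dupe_factor=5",
], check=True)

# Step 2: run masked-LM + next-sentence-prediction training on those examples.
subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=pretrain.tfrecord",
    "--output_dir=pretraining_output",
    "--do_train=True",
    "--do_eval=True",
    "--bert_config_file=bert_config.json",
    "--train_batch_size=32",
    "--max_seq_length=512",
    "--max_predictions_per_seq=75",
    "--num_train_steps=100000",
    "--learning_rate=1e-4",
], check=True)
```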

@xycforgithub
Author

Thanks! And where did you get the data?

@ibeltagy
Collaborator

ibeltagy commented Apr 11, 2019

As mentioned in the paper, we used the Semantic Scholar corpus, which is not publicly available. The publicly available part is https://api.semanticscholar.org/corpus/, which has titles and abstracts but not full text.

@eric-haibin-lin

@ibeltagy thanks for the reply. What mlm_loss should I expect the model to converge to if I use the same dataset?

@ibeltagy
Collaborator

With a sequence length of 512 tokens, the losses converge to around the following values:

```
loss = 1.311045
masked_lm_accuracy = 0.7187241
masked_lm_loss = 1.2882848
next_sentence_accuracy = 0.9939219
next_sentence_loss = 0.0196654
```

@sibyl1956

sibyl1956 commented May 12, 2019

@ibeltagy Thanks for the reply. Does that mean your pretrained SciBERT model reaches a masked_lm_accuracy of around 0.718? In the original BERT pretraining, the model reaches around 0.98 masked_lm_accuracy and about 1.0 next_sentence_accuracy. Do you think a masked_lm_accuracy of around 0.718 is enough?
I am also pretraining my own model on a custom dataset, which adds around 1000 new tokens that are not in the BERT vocabulary. My model also reaches just over 0.7 masked_lm_accuracy and has improved very slowly since then. What masked_lm_accuracy or next_sentence_accuracy should I expect my pretrained model to achieve? Are there any tricks for fine-tuning the pretrained model on a custom corpus?
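
For reference, one common way to add new tokens before continuing pretraining, sketched here with the Hugging Face transformers library rather than the TensorFlow scripts discussed above (the token list below is a made-up placeholder):

```python
# Sketch: extend the vocabulary and resize the embedding matrix before
# continuing masked-LM training. Uses the Hugging Face "transformers" library,
# not the google-research/bert scripts; the new tokens are placeholders.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific tokens that are not in the stock vocabulary.
new_tokens = ["perovskite", "photocatalysis", "heterojunction"]
num_added = tokenizer.add_tokens(new_tokens)

# The new embedding rows are randomly initialized, so they only become useful
# after further masked-LM training on the domain corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```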

@kyleclo
Collaborator

kyleclo commented Jun 28, 2019

@sibyl1956 This is a good point, and I suspect it is due to the noisy PDF parse of the scientific corpus. Namely, we didn't do anything to remove the tables, equations, weird tokens, etc. that were output by PDFBox when converting the raw PDFs to a text stream. It's essentially impossible for the model to predict those masked tokens. We're currently curating an updated, larger, and cleaner version of the pretraining corpus and will investigate whether the noisy tokens are the cause. As it stands, the currently released SciBERT weights are still very good for downstream tasks (but we can definitely do better).
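
As an illustration of the kind of cleanup being referred to (not something applied to the released SciBERT weights), a simple heuristic that drops lines dominated by non-alphabetic characters, which tends to catch table rows and similar PDF-parse artifacts:

```python
# Illustrative heuristic only: drop lines from a PDF-parsed text stream whose
# characters are mostly non-alphabetic (table rows, numeric columns, page
# furniture). The 0.6 threshold is an arbitrary placeholder.
def looks_like_prose(line: str, min_alpha_ratio: float = 0.6) -> bool:
    stripped = line.strip()
    if not stripped:
        return False
    good = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return good / len(stripped) >= min_alpha_ratio

def filter_pdf_parse(lines):
    """Keep only lines that look like running text."""
    return [line for line in lines if looks_like_prose(line)]

if __name__ == "__main__":
    sample = [
        "We evaluate SciBERT on a suite of scientific NLP tasks.",
        "| 0.91 | 0.87 | 3.2e-4 | 12.5 |",  # table row from a PDF parse
    ]
    print(filter_pdf_parse(sample))  # keeps only the prose line
```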

@kyleclo
Collaborator

kyleclo commented Jun 28, 2019

I'm closing this issue for now since it looks like the original question chain was answered. Feel free to reopen or start a new issue.

@kyleclo kyleclo closed this as completed Jun 28, 2019