Making Book Corpus #43
Comments
The original paper (BERT) uses "the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)." What do you mean by "Movie Corpus"?
@mapingshuo Sorry, it's my fault. Haha, I just made that title in 5 seconds :) Thank you!! 👍
That's okay, I am looking for a valid Book Corpus too.
Both GPT and BERT were trained on BooksCorpus. Presumably there's a private copy that people are passing around. There are some web scrapers out there designed for recreating BooksCorpus, but this repetition of work seems unnecessary. If anyone finds a copy, do let me know!
Building the same corpus as the original paper. Please share your tips for downloading and preprocessing the files. It would be great to share the preprocessed data via Dropbox, Google Drive, etc.