Making Book Corpus #43

Open
codertimo opened this issue Oct 30, 2018 · 5 comments
Labels
help wanted Extra attention is needed

Comments

@codertimo
Owner

codertimo commented Oct 30, 2018

The goal is to build the same corpus as the original paper. Please share your tips for downloading and preprocessing the files. It would be great if someone could share preprocessed data via Dropbox, Google Drive, etc.

@codertimo codertimo changed the title from "Making Wikipedia Corpus" to "Making Movie Corpus" Oct 30, 2018
@codertimo
Owner Author

#32

@codertimo codertimo added the "help wanted" label Oct 30, 2018
@mapingshuo

The original paper (BERT) uses "the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)." What do you mean by "Movie Corpus"?

@codertimo codertimo changed the title from "Making Movie Corpus" to "Making Book Corpus" Oct 30, 2018
@codertimo
Owner Author

@mapingshuo Sorry, it's my fault. Haha, I just made that title in 5 seconds :) Thank you!! 👍

@mapingshuo

That's okay, I am looking for a valid Book Corpus too.

@Henry-E

Henry-E commented Jan 11, 2019

Both GPT and BERT were trained on BooksCorpus. Presumably there's a private copy people are passing about. There are some web scrapers out there designed for recreating BooksCorpus, but this repetition of work seems unnecessary. If anyone finds a copy, do let me know!

3 participants