Making Book Corpus #43

codertimo · 2018-10-30T05:28:19Z

Building the same corpus as the original paper. Please share any tips for downloading and preprocessing the files. It would be great if someone could share preprocessed data via Dropbox, Google Drive, etc.

codertimo · 2018-10-30T05:31:37Z

#32

mapingshuo · 2018-10-30T07:28:27Z

The original paper (BERT) uses "the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)." What do you mean by "Movie Corpus"?

codertimo · 2018-10-30T07:38:59Z

@mapingshuo Sorry, that's my fault. Haha, I just made that title in 5 seconds :) Thank you!! 👍

mapingshuo · 2018-10-30T08:10:00Z

That's okay, I am looking for a valid Book Corpus too.

Henry-E · 2019-01-11T12:36:38Z

Both GPT and BERT were trained on BooksCorpus. Presumably there's a private copy people are passing about. There are some web scrapers out there designed for recreating BooksCorpus, but this repetition of work seems unnecessary. If anyone finds a copy, do let me know!

codertimo changed the title ~~Making Wikipedia Corpus~~ Making Movie Corpus Oct 30, 2018

codertimo added the "help wanted" label Oct 30, 2018

codertimo changed the title ~~Making Movie Corpus~~ Making Book Corpus Oct 30, 2018
