
Create license-compliant version of the Pile #65

Open
8 of 27 tasks
albertvillanova opened this issue Nov 16, 2021 · 2 comments
Assignees: albertvillanova
Labels: data catalog (Gathering data from data sources), language modeling script (Need Language Modeling loading script)


albertvillanova commented Nov 16, 2021

As discussed with @StellaAthena, this would be useful as the English-language component of the dataset, possibly augmented with the Spotify transcripts dataset.

The creation of this dataset can be decomposed into smaller per-component subtasks, as described by @StellaAthena (see #65 (comment)):

@albertvillanova albertvillanova added the data catalog Gathering data from data sources label Nov 16, 2021

StellaAthena commented Nov 17, 2021

Under the assumption that we are targeting CC-BY-SA-licensed content (which is debatable: is there a plan yet for what the license of the whole dataset should be?), here's how the current Pile breaks down:

Components of the Pile and their licensing

  1. Pile-CC: Unclear, see below
  2. PubMed Central: This was downloaded in a license-compliant fashion
  3. Books3: Excluded
  4. OpenWebText2: Unclear, see below
  5. arXiv: Needs to be redownloaded and filtered by license (see the filtering sketch after this list)
  6. GitHub: To be replaced by a license-compliant code dataset compiled by Google
  7. FreeLaw: Good as-is; I have acquired permission to use this from the org that owns the data
  8. StackExchange: Good as-is
  9. US PTO: Good as-is
  10. PubMed: Good as-is
  11. Project Gutenberg: Good as-is
  12. OpenSubtitles: Excluded. Although their website claims to be license-compliant, this is an obvious lie: they even posted the script of Wonder Woman before the movie debuted, and there's no way in hell they had Disney's permission to do that.
  13. Wikipedia (en): Good as-is
  14. DM Mathematics: Good as-is
  15. Ubuntu IRC: Good as-is
  16. BookCorpus2: Excluded
  17. EuroParl: Good as-is
  18. HackerNews: Good as-is
  19. YouTube Subtitles: Excluded
  20. PhilPapers: I need to double-check, but this is either good as-is or needs to be redownloaded and filtered by license
  21. NIH ExPorter: Good as-is
  22. Enron Emails: Good as-is
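
As a rough illustration of the per-component filtering mentioned in item 5 (and possibly item 20), here is a minimal sketch. It assumes each record already carries a license URL as a metadata string (as arXiv's OAI-PMH feed provides); the allowlist contents are an assumption for illustration, not a settled policy:

```python
# Minimal sketch: keep only records whose license is on an allowlist.
# The allowlist below is illustrative, not a settled policy decision.
ALLOWED_LICENSES = {
    "http://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by-sa/4.0/",
    "http://creativecommons.org/publicdomain/zero/1.0/",
}

def filter_by_license(records):
    """Yield only records whose `license` field is in the allowlist."""
    for record in records:
        if record.get("license") in ALLOWED_LICENSES:
            yield record

# Hypothetical usage with placeholder records:
records = [
    {"id": "2101.00001", "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"id": "2101.00002", "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"},
]
kept = list(filter_by_license(records))  # keeps only the CC-BY record
```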

A note on licensing and scraping: Pile-CC and OpenWebText2 pose challenges for legal and ethical compliance. The widespread attitude among organizations seems to be that Common Crawl is "its own thing" as a dataset, and that ToS compliance only requires compliance with Common Crawl's ToS. I think this is ethically dubious, but the same policy would reasonably extend to OWT2. I strongly suspect the real reason for this attitude is that it is convenient rather than sensible.

Updating the Pile

Excluding these data sources removes approximately one quarter of the current text of the Pile and massively decreases the proportion of books and subtitle-like text found in it. Consequently, I believe it would be a good idea to identify and add more data to compensate, preferably a lot of it. I need to do some math to figure out how many tokens the deduplicated and cut-down Pile contains, but I would like at least 300 billion tokens according to the GPT-2 tokenizer, and preferably more like 400B, so that we can be reasonably confident future tokenizers won't make the data fall under 300B tokens.
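
For reference, here is a minimal sketch of how such a token count could be measured with the GPT-2 tokenizer from HuggingFace `transformers`; the directory layout and file glob are placeholder assumptions, and a real run over hundreds of GB would want sharding and parallelism:

```python
# Minimal sketch: count GPT-2 tokens across a directory of .txt files.
# The corpus path and per-file reading strategy are placeholders.
from pathlib import Path
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(corpus_dir: str) -> int:
    total = 0
    for path in Path(corpus_dir).glob("**/*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # encode() returns the list of BPE token ids for the document.
        total += len(tokenizer.encode(text))
    return total

print(count_tokens("pile_subset/"))  # target: >= 300e9 tokens corpus-wide
```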

I think this is also a good opportunity to rectify two issues with the original Pile:

  1. Finding non-western dialects of English
  2. Duplication

The Pile is quite biased towards American and UK English dialects. We sought out sources in Indian English, African American Vernacular English, and several African English dialects, but failed to find significant bodies of text. It would be excellent if we could identify sources of text in those dialects.

When the Pile came out, the prevailing opinion was that upsampling high-information subsets is a good way to improve LM performance. Subsequent research has shown this to be empirically false, so I highly recommend that we not only avoid upsampling but also apply the 13-gram deduplication technique that has become popular (see the sketch below).
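
For concreteness, here is a minimal sketch of 13-gram deduplication over whitespace tokens; the 0.8 overlap threshold is an assumption, and production pipelines typically hash the n-grams (e.g. with MinHash) rather than storing them verbatim:

```python
# Minimal sketch: drop a document if too many of its 13-grams were
# already seen in earlier documents. The threshold is an assumption.
N = 13
THRESHOLD = 0.8

def ngrams(tokens, n=N):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def deduplicate(documents):
    seen = set()
    kept = []
    for doc in documents:
        grams = ngrams(doc.split())
        if not grams:  # shorter than 13 tokens: keep by default
            kept.append(doc)
            continue
        overlap = sum(g in seen for g in grams) / len(grams)
        if overlap < THRESHOLD:
            kept.append(doc)
            seen.update(grams)
    return kept
```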

Potential Sources of Additional Data:

  1. 38 GB of SEC data here.
  2. Apparently the English-language portion of Project Gutenberg is supposed to be 4-6x the size of what we included in the Pile; see here.
  3. Scraping content that the various source websites have posted since the original release.
  4. Many of the Pile components come from governmental sources. Can we find English-language governmental sources in African or Asian dialects of English? Presumably the Kenyan government produces a lot of text, but I do not know where to find it.

@albertvillanova albertvillanova changed the title Create the license-compliant version of the Pile Create license-compliant version of the Pile Nov 22, 2021
@albertvillanova albertvillanova added the language modeling script Need Language Modeling loading script label Jan 27, 2022
@albertvillanova albertvillanova self-assigned this Jan 27, 2022