
Create license-compliant version of the Pile #65

Open
8 of 27 tasks
albertvillanova opened this issue Nov 16, 2021 · 2 comments
Assignees: albertvillanova
Labels: data catalog (Gathering data from data sources), language modeling script (Need Language Modeling loading script)


albertvillanova commented Nov 16, 2021

As discussed with @StellaAthena, this would be useful as the English-language component of the dataset, possibly augmented with the Spotify transcripts dataset.

The creation of this dataset can be decomposed into smaller per-component subtasks, as described by @StellaAthena (see #65 (comment)):

@albertvillanova albertvillanova added the data catalog Gathering data from data sources label Nov 16, 2021

StellaAthena commented Nov 17, 2021

Under the assumption that we are targeting CC-BY-SA-licensed content (which is debatable: is there a plan yet for what the license of the whole dataset should be?), here's how the current Pile breaks down:

Components of the Pile and their licensing

  1. Pile-CC: Unclear, see below
  2. PubMed Central: This was downloaded in a license-compliant fashion
  3. Books3: Excluded
  4. OpenWebText2: Unclear, see below
  5. arXiv: Needs to be redownloaded and filtered by license (see the filtering sketch after this list)
  6. GitHub: To be replaced by a license-compliant code dataset compiled by Google
  7. FreeLaw: Good as-is; I have acquired permission to use this from the org that owns the data
  8. StackExchange: Good as-is
  9. US PTO: Good as-is
  10. PubMed: Good as-is
  11. Project Gutenberg: Good as-is
  12. OpenSubtitles: Excluded. Although their website claims to be license-compliant, this is an obvious lie: they even posted the script of Wonder Woman before the movie debuted, and there's no way in hell they had Disney's permission to do that.
  13. Wikipedia (en): Good as-is
  14. DM Mathematics: Good as-is
  15. Ubuntu IRC: Good as-is
  16. BookCorpus2: Excluded
  17. EuroParl: Good as-is
  18. HackerNews: Good as-is
  19. YouTube Subtitles: Excluded
  20. PhilPapers: I need to double-check, but this is either good as-is or needs to be redownloaded and filtered by license
  21. NIH ExPorter: Good as-is
  22. Enron Emails: Good as-is
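
As a rough illustration of the per-component filtering mentioned in item 5 (and possibly item 20), here is a minimal sketch. It assumes each record already carries a license URL as a metadata string (as arXiv's OAI-PMH feed provides); the allowlist contents are an assumption for illustration, not a settled policy:

```python
# Minimal sketch: keep only records whose license is on an allowlist.
# The allowlist below is illustrative, not a settled policy decision.
ALLOWED_LICENSES = {
    "http://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by-sa/4.0/",
    "http://creativecommons.org/publicdomain/zero/1.0/",
}

def filter_by_license(records):
    """Yield only records whose `license` field is in the allowlist."""
    for record in records:
        if record.get("license") in ALLOWED_LICENSES:
            yield record

# Hypothetical usage with placeholder records:
records = [
    {"id": "2101.00001", "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"id": "2101.00002", "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"},
]
kept = list(filter_by_license(records))  # keeps only the CC-BY record
```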

A note on licensing and scraping: Pile-CC and OpenWebText2 pose challenges for legal and ethical compliance. The widespread attitude among organizations seems to be that Common Crawl is "its own thing" as a dataset, and that ToS compliance only requires compliance with Common Crawl's ToS. I think this is ethically dubious, but the same policy would reasonably extend to OWT2. I strongly suspect the real reason for this attitude is that it is convenient rather than sensible.

Updating the Pile

Excluding these data sources removes approximately one quarter of the current text of the Pile and massively decreases the proportion of books and subtitle-like text found in it. Consequently, I believe it would be a good idea to identify and add more data to compensate, preferably a lot of it. I need to do some math to figure out how many tokens the deduplicated and cut-down Pile contains, but I would like at least 300 billion tokens according to the GPT-2 tokenizer, and preferably more like 400B, so that we can be reasonably confident future tokenizers won't make the data fall under 300B tokens.
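
For reference, here is a minimal sketch of how such a token count could be measured with the GPT-2 tokenizer from HuggingFace `transformers`; the directory layout and file glob are placeholder assumptions, and a real run over hundreds of GB would want sharding and parallelism:

```python
# Minimal sketch: count GPT-2 tokens across a directory of .txt files.
# The corpus path and per-file reading strategy are placeholders.
from pathlib import Path
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(corpus_dir: str) -> int:
    total = 0
    for path in Path(corpus_dir).glob("**/*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # encode() returns the list of BPE token ids for the document.
        total += len(tokenizer.encode(text))
    return total

print(count_tokens("pile_subset/"))  # target: >= 300e9 tokens corpus-wide
```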

I think this is also a good opportunity to rectify two issues with the original Pile:

  1. Finding non-western dialects of English
  2. Duplication

The Pile is quite biased towards American and UK English dialects. We sought out sources in Indian English, African American Vernacular English, and several African English dialects, but failed to find significant bodies of text. It would be excellent if we could identify sources of text in those dialects.

When the Pile came out, the prevailing opinion was that upsampling high-information subsets is a good way to improve LM performance. Subsequent research has shown this to be empirically false, so I highly recommend that we not only avoid upsampling but also apply the 13-gram deduplication technique that has become popular (see the sketch below).
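
For concreteness, here is a minimal sketch of 13-gram deduplication over whitespace tokens; the 0.8 overlap threshold is an assumption, and production pipelines typically hash the n-grams (e.g. with MinHash) rather than storing them verbatim:

```python
# Minimal sketch: drop a document if too many of its 13-grams were
# already seen in earlier documents. The threshold is an assumption.
N = 13
THRESHOLD = 0.8

def ngrams(tokens, n=N):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def deduplicate(documents):
    seen = set()
    kept = []
    for doc in documents:
        grams = ngrams(doc.split())
        if not grams:  # shorter than 13 tokens: keep by default
            kept.append(doc)
            continue
        overlap = sum(g in seen for g in grams) / len(grams)
        if overlap < THRESHOLD:
            kept.append(doc)
            seen.update(grams)
    return kept
```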

Potential Sources of Additional Data:

  1. 38 GB of SEC data here.
  2. Apparently the English-language portion of Project Gutenberg is supposed to be 4-6x the size of what we included in the Pile; see here.
  3. Scraping content that the various source websites have posted since the original release.
  4. Many of the Pile components come from governmental sources. Can we find English-language governmental sources in African or Asian dialects of English? Presumably the Kenyan government produces a lot of text, but I do not know where to find it.

@albertvillanova albertvillanova changed the title Create the license-compliant version of the Pile Create license-compliant version of the Pile Nov 22, 2021
@albertvillanova albertvillanova added the language modeling script Need Language Modeling loading script label Jan 27, 2022
@albertvillanova albertvillanova self-assigned this Jan 27, 2022