Create license-compliant version of the Pile #65
Under the assumption that we are targeting CC-BY-SA-licensed content (which can be disputed; is there a plan yet for what the licensing of the whole dataset should be?), here's how the current Pile breaks down: Components of the Pile and their licensing
A note on licensing and scraping: Pile-CC and OpenWebText2 pose challenges for legal and ethical compliance. The widespread attitude among organizations seems to be that Common Crawl is "its own thing" as a dataset and that ToS compliance only requires compliance with Common Crawl's ToS. I think this is highly dubious ethically, but the same policy would reasonably extend to OWT2. In reality, I strongly suspect that the real reason for this attitude is that it is convenient rather than sensible.

Updating the Pile: Excluding these data sources removes approximately one quarter of the current text of the Pile and massively decreases the proportion of books and subtitle-like text found in it. Consequently, I believe it would be a good idea to identify and add more data to compensate, preferably a lot of it. I need to do some math to figure out how many tokens the deduplicated and cut-down Pile contains, but I would like at least 300 billion tokens according to the GPT-2 tokenizer, and preferably more like 400B, so that we can be reasonably confident future tokenizers won't make the data fall under 300B tokens. I think this is also a good opportunity to rectify two issues with the original Pile:
1. The Pile is quite biased towards American and UK English dialects. We sought out sources in Indian English, African American Vernacular English, and several African English dialects, but failed to find significant sources of text. It would be excellent if we could identify sources of text in those dialects.
2. When the Pile came out, the prevailing opinion was that upsampling high-information subsets is a good way to improve LM performance. Subsequent research has shown this to be empirically false, so I highly recommend that we not only avoid upsampling but also apply the 13-gram deduplication technique that has become popular.

Potential sources of additional data:
- 38 GB of SEC data here.
- Apparently English-language Project Gutenberg is supposed to be 4-6x the size of what we included in the Pile; see here.
- Scraping new content from various websites since the original release.
- Many of the Pile components come from governmental sources. Can we find English-language governmental sources in African or Asian dialects of English? Presumably the Kenyan government produces a lot of text, but I do not know where to find it.
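For concreteness, here is a minimal sketch of what the 13-gram deduplication mentioned above could look like. This is an exact-match toy version, not the pipeline the issue refers to: the `flag_near_duplicates` helper, its `threshold`, and the whitespace "tokenizer" are all illustrative assumptions, and a corpus at this scale would need an approximate method such as MinHash/LSH rather than holding every shingle in memory.

```python
def shingles(tokens, n=13):
    """Yield all n-gram shingles (as tuples) from a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def flag_near_duplicates(documents, n=13, threshold=0.5):
    """Return indices of documents whose 13-gram overlap with earlier
    documents exceeds `threshold`. Exact-match toy version; a real
    pipeline would use MinHash/LSH to approximate this at scale."""
    seen = set()
    duplicates = []
    for idx, doc in enumerate(documents):
        tokens = doc.split()  # stand-in for a real tokenizer
        grams = set(shingles(tokens, n))
        if not grams:  # document shorter than n tokens
            continue
        overlap = sum(1 for g in grams if g in seen)
        if overlap / len(grams) > threshold:
            duplicates.append(idx)
        seen.update(grams)
    return duplicates
```

Running this over a corpus where one document repeats an earlier one flags the repeat while leaving unrelated text alone; the threshold controls how aggressive the filtering is.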
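Checking the 300-400B token target above amounts to streaming the corpus through the GPT-2 tokenizer and summing lengths. A minimal sketch, with the tokenizer injected as a function so the counting logic is independent of any particular library (the `transformers` usage in the comment is an assumption, not part of this issue):

```python
def count_tokens(texts, tokenize):
    """Total token count across documents for a given tokenizer function.
    `texts` can be any iterable, so a large corpus can be streamed."""
    return sum(len(tokenize(t)) for t in texts)

# Hypothetical GPT-2 usage (assumes the `transformers` library is installed):
#   from transformers import GPT2TokenizerFast
#   tok = GPT2TokenizerFast.from_pretrained("gpt2")
#   total = count_tokens(corpus, lambda t: tok(t)["input_ids"])
```

At hundreds of billions of tokens this loop would of course need to be parallelized across shards, but the accounting is the same.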
I'm working on this: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_the_pile
As discussed with @StellaAthena, this would be useful to constitute the English-language component of the dataset, possibly augmented by the Spotify transcripts dataset.
The creation of this dataset might be decomposed into smaller subsets, as reported by @StellaAthena (see #65 (comment)).