C4 Documentation

This is a companion website for our paper, "Documenting the English Colossal Clean Crawled Corpus". We present some of the first documentation for the contents of the Colossal Clean Crawled Corpus (C4), a massive web-crawled dataset used for pretraining large language models like T5.

Accessing C4

Raw data download: If you want to run your own analysis on C4, we have made the raw data available to download here.
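As a starting point for your own analysis, here is a minimal sketch that counts documents per website in one downloaded shard. It assumes the shard is a gzipped JSON-lines file in which each record has a `url` field. The function name `count_domains` and the file name in the usage note are hypothetical, so adjust them to match the files you actually download.

```python
import gzip
import json
from collections import Counter
from urllib.parse import urlparse


def count_domains(path, limit=None):
    """Count documents per host in a gzipped JSON-lines file.

    Assumes one JSON object per line with a "url" field (an assumption
    about the shard format; records without a usable URL are skipped).
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Optionally stop early so a quick look does not scan the whole shard.
            if limit is not None and i >= limit:
                break
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            host = urlparse(record.get("url", "")).netloc.lower()
            if host:
                counts[host] += 1
    return counts
```

For example, `count_domains("c4-shard.json.gz", limit=100000).most_common(10)` would list the ten most frequent hosts among the first 100,000 lines of a (hypothetically named) shard.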

Search index: To let you search through the data, we have made a search engine over C4 available. Check it out!

Documenting issues

Our paper documents many interesting and problematic aspects of the C4 dataset, but it is in no way an exhaustive exploration of the data. We hope the discussions page on this GitHub repository will be a welcoming place to document and discuss further issues with the data. If you find something interesting that you want to share, please start a new discussion and let everyone know what you found.


