This is a companion website for our paper Documenting the English Colossal Clean Crawled Corpus. We present some of the first documentation for the contents of the Colossal Clean Crawled Corpus (C4), a massive web-crawled dataset used for pretraining large language models like T5.
Raw data download: If you want to download C4 to run your own analysis on it, we have made the raw C4 data available to download here.
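As a starting point for your own analysis, here is a minimal sketch of how one downloaded shard could be scanned. It rests on assumptions that this page does not state: that a shard is a gzip-compressed file with one JSON record per line, and that each record stores its source address under a `url` field. The function name `count_domains` and the file name in the usage note are hypothetical, so adjust them to match the files you actually download.

```python
import gzip
import json
from collections import Counter
from urllib.parse import urlparse


def count_domains(path, url_field="url"):
    """Count records per host in one gzip-compressed JSON-lines shard.

    Assumption: each non-blank line is a JSON object, and the source URL
    lives under `url_field`. Records without that field are counted
    under the empty string.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            host = urlparse(record.get(url_field, "")).netloc
            counts[host] += 1
    return counts
```

For example, `count_domains("shard.json.gz").most_common(10)` would list the ten most frequent hosts in that (hypothetical) file. Reading line by line keeps memory use flat, which matters for a corpus of this size.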
Search index: To search through the data, use our search engine at https://c4-search.apps.allenai.org/. Check it out!
Our paper documents many interesting and/or problematic issues with the C4 dataset, but it is in no way an exhaustive exploration of the data. We hope the discussions page on this GitHub repository will be a welcome place to document and discuss further issues with the data. If you find something interesting that you want to share, please start a new discussion and let everyone know what you found.