C4 Documentation

This is a companion website for our paper Documenting the English Colossal Clean Crawled Corpus. We present some of the first documentation for the contents of the Colossal Clean Crawled Corpus (C4), a massive web-crawled dataset used for pretraining large language models like T5.

Accessing C4

Raw data download: To run your own analysis on C4, you can download the raw data here.
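As a starting point for your own analysis, here is a minimal sketch that counts how many documents in one downloaded shard come from each website host. It assumes the shard is a gzip-compressed JSON-lines file in which each record has a `url` field next to the document text; that layout, the helper name `count_domains`, and the shard path in the usage note are assumptions for illustration, so check them against the files you actually download.

```python
import gzip
import json
from collections import Counter
from urllib.parse import urlparse


def count_domains(path):
    """Count documents per URL host in one gzipped JSON-lines shard.

    Assumes one JSON object per line with a "url" field (an assumption
    about the file layout, not something this README specifies).
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Records without a "url" field are counted under the empty host "".
            host = urlparse(record.get("url", "")).netloc
            counts[host] += 1
    return counts
```

For example, `count_domains("path/to/shard.json.gz").most_common(10)` would list the ten most frequent hosts in that shard, where the path is a placeholder for a file you have downloaded.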

Search index: To search through the data, use our search engine at https://c4-search.apps.allenai.org/.

Documenting issues

Our paper documents many interesting and/or problematic issues with the C4 dataset, but it is in no way an exhaustive exploration of the data. We hope the discussions page of this GitHub repository will be a welcoming place to document and discuss further issues with the data. If you find something interesting that you want to share, please start a new discussion and let everyone know what you found.
