Merge pull request #174 from allenai/v0-small
More mixer configs
rodneykinney committed May 25, 2023
2 parents 4737c53 + 22b0582 commit 0d487c2
Showing 8 changed files with 189 additions and 2 deletions.
9 changes: 9 additions & 0 deletions pretrain_data/common_crawl/NOTES.md
@@ -6,6 +6,8 @@ We ran the CCNet pipeline over 25 dumps from 2020-05 to 2023-06. Different vers

Sharded output of CCNet pipeline. Duplicate paragraphs removed (exact match, but only comparing against a ~2% sample of paragraphs in the corpus). Bucketed by language (fasttext), and English perplexity on wikipedia-trained 5-gram language model.

**v0-en** is the re-sharded English content of `v0`

### v1

Post-process of v0. Drop non-English documents. Deduplicate whole documents by URL. Coalesce shards.
@@ -14,10 +16,17 @@ Post-process of v0. Drop non-English documents. Deduplicate whole documents by U

**v1-small** is an 8.5% sample of `v1`, about 300B tokens.

**v1-small-head** is a sample of the `cc_en_head` (low-perplexity) subset of `v1`

**v1-small-head-middle** is a sample of the `cc_en_head` and `cc_en_middle` (low- and mid-perplexity) subset of `v1`

### v2

Post-process of v1. Remove duplicate paragraphs across the entire corpus

**v2-small** is a post-process of `v1-small` to remove duplicate paragraphs.


## CCNet Overview

We run a fork of CCNet at https://github.com/allenai/cc_net.git
2 changes: 1 addition & 1 deletion pretrain_data/mixer/config/v0-small.json
@@ -1,7 +1,7 @@
{
"streams": [
{
- "name": "v_small",
+ "name": "v0_small",
"documents": [
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2023-06/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-49/*/en_head.json.gz",
43 changes: 43 additions & 0 deletions pretrain_data/mixer/config/v0/en-head.json
@@ -0,0 +1,43 @@
{
"streams": [
{
"name": "cc_en_head",
"documents": [
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2023-06/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-49/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-40/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-33/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-27/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-21/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-05/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-43/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-39/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-31/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-25/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-21/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-17/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-10/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-04/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-50/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-45/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-40/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-34/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-29/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-24/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-16/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-10/*/en_head.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-05/*/en_head.json.gz"
],
"attributes": [],
"output": {
"path": "pretraining-data/sources/common-crawl/v0-en/documents/cc_en_head",
"max_size_in_bytes": 4294967296
}
}
],
"work_dir": {
"input": "/data1/work/input",
"output": "/data2/work/output"
},
"processes": 128
}
43 changes: 43 additions & 0 deletions pretrain_data/mixer/config/v0/en-middle.json
@@ -0,0 +1,43 @@
{
"streams": [
{
"name": "cc_en_middle",
"documents": [
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2023-06/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-49/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-40/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-33/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-27/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-21/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-05/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-43/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-39/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-31/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-25/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-21/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-17/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-10/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-04/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-50/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-45/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-40/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-34/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-29/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-24/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-16/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-10/*/en_middle.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-05/*/en_middle.json.gz"
],
"attributes": [],
"output": {
"path": "pretraining-data/sources/common-crawl/v0-en/documents/cc_en_middle",
"max_size_in_bytes": 4294967296
}
}
],
"work_dir": {
"input": "/data1/work/input",
"output": "/data2/work/output"
},
"processes": 128
}
43 changes: 43 additions & 0 deletions pretrain_data/mixer/config/v0/en-tail.json
@@ -0,0 +1,43 @@
{
"streams": [
{
"name": "cc_en_tail",
"documents": [
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2023-06/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-49/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-40/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-33/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-27/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-21/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2022-05/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-43/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-39/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-31/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-25/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-21/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-17/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-10/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2021-04/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-50/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-45/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-40/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-34/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-29/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-24/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-16/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-10/*/en_tail.json.gz",
"pretraining-data/sources/common-crawl/v0/documents/mined_split/2020-05/*/en_tail.json.gz"
],
"attributes": [],
"output": {
"path": "pretraining-data/sources/common-crawl/v0-en/documents/cc_en_tail",
"max_size_in_bytes": 4294967296
}
}
],
"work_dir": {
"input": "/data1/work/input",
"output": "/data2/work/output"
},
"processes": 128
}
25 changes: 25 additions & 0 deletions pretrain_data/mixer/config/v1-small/head-middle.json
@@ -0,0 +1,25 @@
{
"streams": [
{
"name": "v1_small-head-middle",
"documents": [
"pretraining-data/sources/common-crawl/v1/documents/cc_en_head/*",
"pretraining-data/sources/common-crawl/v1/documents/cc_en_middle/*"
],
"output": {
"path": "pretraining-data/sources/common-crawl/v1-small-head-middle/documents",
"max_size_in_bytes": 21474836480
},
"attributes": ["sample"],
"filter": {
"include": ["$.attributes[?(@.sample__random_number_v1__random[0][2] < 0.17)]"],
"exclude": []
}
}
],
"work_dir": {
"input": "/data1/work/input",
"output": "/data2/work/output"
},
"processes": 128
}
24 changes: 24 additions & 0 deletions pretrain_data/mixer/config/v1-small/head.json
@@ -0,0 +1,24 @@
{
"streams": [
{
"name": "v1_small-head",
"documents": [
"pretraining-data/sources/common-crawl/v1/documents/cc_en_head/*"
],
"output": {
"path": "pretraining-data/sources/common-crawl/v1-small-head/documents",
"max_size_in_bytes": 8589934592
},
"attributes": ["sample"],
"filter": {
"include": ["$.attributes[?(@.sample__random_number_v1__random[0][2] < 0.34)]"],
"exclude": []
}
}
],
"work_dir": {
"input": "/data1/work/input",
"output": "/data2/work/output"
},
"processes": 128
}
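
Note on the two `v1-small` configs above: both keep a document only when a precomputed random attribute falls below a threshold (0.17 for head-middle, 0.34 for head). The following is a minimal sketch of that idea, not the mixer's actual filter code. It assumes the attribute is a list of `(start, end, value)` spans and that `[0][2]` selects the value of the first span; the names `Span`, `include_document`, and `kept_fraction` are hypothetical.

```rust
/// A span-style attribute: (start, end, value). This layout is an assumption
/// inferred from the `[0][2]` index in the filter expression.
type Span = (usize, usize, f64);

/// Returns true when the first span's value is below `threshold`, mirroring
/// `@.sample__random_number_v1__random[0][2] < threshold`.
fn include_document(spans: &[Span], threshold: f64) -> bool {
    match spans.first() {
        Some(&(_, _, value)) => value < threshold,
        None => false, // a document without the attribute never matches
    }
}

/// Fraction of documents kept when the values form an evenly spaced grid
/// over [0, 1). Stands in for uniformly distributed random values.
fn kept_fraction(n: usize, threshold: f64) -> f64 {
    let kept = (0..n)
        .filter(|&i| include_document(&[(0, 0, i as f64 / n as f64)], threshold))
        .count();
    kept as f64 / n as f64
}
```

If the stored values are uniform on [0, 1), the kept fraction approaches the threshold, so a filter of `< 0.34` retains roughly a third of the matching documents.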
2 changes: 1 addition & 1 deletion pretrain_data/mixer/src/shard.rs
@@ -89,7 +89,7 @@ impl Shard {
};
shards.push(shard);
stream_shard_count += 1;
- shard_size = 0;
+ shard_size = *size;
shard_inputs = Vec::new();
}
shard_inputs.push(input.clone());
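
Note on the `shard.rs` change above: only a fragment of the function is visible in this hunk, so the following is a sketch of what the fix appears to address rather than the real implementation. It assumes inputs are grouped greedily by size and that the input which overflows the limit becomes the first member of the next shard. Under that assumption the running total has to restart at that input's size; restarting at zero would leave the first input of every new shard uncounted. The function name `plan_shards` and its signature are hypothetical.

```rust
/// Greedily groups input sizes into shards of at most `max_size` bytes.
/// A single input larger than `max_size` still gets a shard of its own.
fn plan_shards(sizes: &[u64], max_size: u64) -> Vec<Vec<u64>> {
    let mut shards = Vec::new();
    let mut current: Vec<u64> = Vec::new();
    let mut current_size: u64 = 0;
    for size in sizes {
        current_size += *size;
        if current_size > max_size && !current.is_empty() {
            // Close the shard built so far.
            shards.push(std::mem::take(&mut current));
            // The input that overflowed opens the next shard, so the running
            // total restarts at its size rather than at zero.
            current_size = *size;
        }
        current.push(*size);
    }
    if !current.is_empty() {
        shards.push(current);
    }
    shards
}
```

With four inputs of size 4 and a limit of 6, this yields four single-input shards. Resetting the total to zero instead would let the third input join the second shard, producing a shard of size 8. The `max_size_in_bytes` values in the configs of this commit are 4294967296 (4 GiB), 8589934592 (8 GiB), and 21474836480 (20 GiB).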