When the thomas/re_organize branch is merged, we need to rename the slurm scripts inside repos/data-preparation/sourcing/cc_pseudo_crawl/seeds_batch_1 as follows:
download_warc.slurm -> 01_download_warc.slurm
download_warc_trial_4.slurm -> 02_download_warc_trial_4.slurm
download_warc_trial_5.slurm -> 03_download_warc_trial_5.slurm
download_warc_too_big.slurm -> 04_download_warc_too_big.slurm
redownload_warc.slurm -> 05_redownload_warc.slurm
check_errors_in_dataset.slurm -> 06_check_errors_in_dataset.slurm
preprocess_warc.slurm -> 08_preprocess_warc.slurm
extract_text_and_html_metadata.slurm -> 09_extract_text_and_html_metadata.slurm
shard_by_seed_id.slurm -> 10_shard_by_seed_id.slurm
merge_seed_shards.slurm -> 11_merge_seed_shards.slurm
shard_and_compress.slurm -> 12_shard_and_compress.slurm
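The renames listed above could be applied with a small script once the branch is merged. What follows is only a sketch: the mapping is copied from the list, while the helper names `plan_renames` and `apply_renames` and the `dry_run` flag are hypothetical and not part of the repository. Step 07 is left out because it does not appear in the list.

```python
from pathlib import Path

# Old name -> new name, copied from the list above.
# Step 07 is not in that list, so it is not included here.
RENAMES = {
    "download_warc.slurm": "01_download_warc.slurm",
    "download_warc_trial_4.slurm": "02_download_warc_trial_4.slurm",
    "download_warc_trial_5.slurm": "03_download_warc_trial_5.slurm",
    "download_warc_too_big.slurm": "04_download_warc_too_big.slurm",
    "redownload_warc.slurm": "05_redownload_warc.slurm",
    "check_errors_in_dataset.slurm": "06_check_errors_in_dataset.slurm",
    "preprocess_warc.slurm": "08_preprocess_warc.slurm",
    "extract_text_and_html_metadata.slurm": "09_extract_text_and_html_metadata.slurm",
    "shard_by_seed_id.slurm": "10_shard_by_seed_id.slurm",
    "merge_seed_shards.slurm": "11_merge_seed_shards.slurm",
    "shard_and_compress.slurm": "12_shard_and_compress.slurm",
}


def plan_renames(directory, renames=RENAMES):
    """Return (source, destination) pairs for scripts present in `directory`.

    A pair is skipped when the source is missing or the destination already
    exists, so running the script twice does not overwrite anything.
    """
    directory = Path(directory)
    plan = []
    for old, new in renames.items():
        src = directory / old
        dst = directory / new
        if src.exists() and not dst.exists():
            plan.append((src, dst))
    return plan


def apply_renames(directory, renames=RENAMES, dry_run=True):
    """Print the planned renames and perform them unless `dry_run` is set."""
    plan = plan_renames(directory, renames)
    for src, dst in plan:
        print(f"{src.name} -> {dst.name}")
        if not dry_run:
            src.rename(dst)
    return plan
```

Calling `apply_renames("repos/data-preparation/sourcing/cc_pseudo_crawl/seeds_batch_1")` only prints the plan; passing `dry_run=False` performs the renames. Inside a git checkout, `git mv` may be preferable to a plain filesystem rename so that the move is staged directly.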
We then still need to determine which of these two files was ultimately used for step 7: divide_in_subshards.slurm or divide_in_subshards_1000.slurm.
EDIT: divide_in_subshards.slurm is step 7, and divide_in_subshards_1000.slurm is in reality done in step 10.