rename cc_pseudo_crawl seed_batch_1 slurm scripts #3

Closed
SaulLu opened this issue Apr 25, 2022 · 0 comments


SaulLu commented Apr 25, 2022

Once the thomas/re_organize branch is merged, we need to rename the slurm scripts inside repos/data-preparation/sourcing/cc_pseudo_crawl/seeds_batch_1 as follows:

download_warc.slurm  -> 01_download_warc.slurm    
download_warc_trial_4.slurm -> 02_download_warc_trial_4.slurm    
download_warc_trial_5.slurm -> 03_download_warc_trial_5.slurm
download_warc_too_big.slurm -> 04_download_warc_too_big.slurm
redownload_warc.slurm -> 05_redownload_warc.slurm
check_errors_in_dataset.slurm -> 06_check_errors_in_dataset.slurm  
preprocess_warc.slurm -> 08_preprocess_warc.slurm  
extract_text_and_html_metadata.slurm -> 09_extract_text_and_html_metadata.slurm  
shard_by_seed_id.slurm -> 10_shard_by_seed_id.slurm
merge_seed_shards.slurm -> 11_merge_seed_shards.slurm
shard_and_compress.slurm -> 12_shard_and_compress.slurm

Then we still need to find out which of the following files was actually used in the end for step 7: divide_in_subshards.slurm or divide_in_subshards_1000.slurm.

EDIT: divide_in_subshards.slurm is step 7, and divide_in_subshards_1000.slurm is in reality done in step 10.
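
For whoever picks this up, the renaming could be scripted roughly as below. This is only a sketch: the mapping is copied from the list above, the `07_divide_in_subshards.slurm` target is an assumption that follows from the EDIT (the issue never spells out that name), and `plan_renames` / `apply_renames` are hypothetical helper names, not something that exists in the repo. `divide_in_subshards_1000.slurm` is deliberately left alone since no target name has been agreed on.

```python
from pathlib import Path

# Old name -> new name, copied from the list in this issue.
RENAMES = {
    "download_warc.slurm": "01_download_warc.slurm",
    "download_warc_trial_4.slurm": "02_download_warc_trial_4.slurm",
    "download_warc_trial_5.slurm": "03_download_warc_trial_5.slurm",
    "download_warc_too_big.slurm": "04_download_warc_too_big.slurm",
    "redownload_warc.slurm": "05_redownload_warc.slurm",
    "check_errors_in_dataset.slurm": "06_check_errors_in_dataset.slurm",
    # Assumption: step 7 target name, inferred from the EDIT above.
    "divide_in_subshards.slurm": "07_divide_in_subshards.slurm",
    "preprocess_warc.slurm": "08_preprocess_warc.slurm",
    "extract_text_and_html_metadata.slurm": "09_extract_text_and_html_metadata.slurm",
    "shard_by_seed_id.slurm": "10_shard_by_seed_id.slurm",
    "merge_seed_shards.slurm": "11_merge_seed_shards.slurm",
    "shard_and_compress.slurm": "12_shard_and_compress.slurm",
}


def plan_renames(directory, renames=RENAMES):
    """Return (src, dst) pairs for sources that exist and whose target is free."""
    directory = Path(directory)
    plan = []
    for old, new in renames.items():
        src, dst = directory / old, directory / new
        if src.exists() and not dst.exists():
            plan.append((src, dst))
    return plan


def apply_renames(directory, dry_run=True):
    """Print the planned renames; only touch the files when dry_run is False."""
    plan = plan_renames(directory)
    for src, dst in plan:
        print(f"{src.name} -> {dst.name}")
        if not dry_run:
            src.rename(dst)
    return plan
```

Calling `apply_renames("repos/data-preparation/sourcing/cc_pseudo_crawl/seeds_batch_1")` only prints the plan; pass `dry_run=False` to actually rename. In a git checkout, `git mv` is probably the better choice so the renames show up cleanly in history.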
