
update merge_preprocessed_data to use distributed merge #82

Merged
thomasw21 merged 9 commits into bigscience-workshop:main from adammoody:mergescript
Sep 21, 2021

Conversation

@adammoody
Contributor

@adammoody adammoody commented Aug 26, 2021

This extends the merge_preprocessed_data.py script to optionally use a parallel merge. It changes that script to assume a distributed launch if given --merge distributed.
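The option described above might be wired up roughly like this. This is a minimal sketch, not the actual script's parser; only the --merge flag and its 'distributed' value come from the PR discussion, and the 'serial' default is an assumption:

```python
import argparse

# Hypothetical sketch of a --merge flag that selects between a serial
# and a distributed merge path (flag value 'distributed' is from the PR;
# the rest is illustrative).
def build_parser():
    parser = argparse.ArgumentParser(description='Merge preprocessed data files.')
    parser.add_argument('--merge', choices=['serial', 'distributed'], default='serial',
                        help="Use 'distributed' when launched under a distributed launcher.")
    return parser

args = build_parser().parse_args(['--merge', 'distributed'])
print(args.merge)  # distributed
```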

@adammoody adammoody changed the title WIP: update merge_preprocessed_data to use parallel merge update merge_preprocessed_data to use parallel merge Sep 1, 2021
@adammoody
Contributor Author

@thomasw21 , would you please also review this one when you get a chance? This smaller PR might be an easier review/merge. Thanks.

Member

@thomasw21 thomasw21 left a comment


This looks good! Some small comments.

  • Can you document this somewhere? That this option is available and how to use it? I'd imagine you'd have to use torch.distributed.run to run this in parallel
  • Can you write a test? I haven't really followed the new test framework, but I'm guessing we can at least use a single node with multiple ranks in CI.

Comment on lines +30 to +31
group.add_argument('--local_rank', type=int, default=None,
                   help='Local rank of calling process on its node (from torch.distributed.launch).')

Contributor Author

@adammoody adammoody Sep 15, 2021


This is in there because the torch.distributed.launch environment invokes the script and specifies this option. If the script does not process --local_rank, argparse exits with an unrecognized-argument error.

Search for "--local_rank" on this page:
https://pytorch.org/docs/stable/distributed.html

That same documentation says that the launcher will not specify --local_rank if $LOCAL_RANK is set in the environment. So we could drop this option if we instruct users to set $LOCAL_RANK in our directions on how to run.

I think it's a bit easier for the user if we continue to handle the option, but either way is fine by me.
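For illustration, handling the flag with the environment-variable fallback discussed above might look like this. A minimal sketch using stdlib argparse; the helper name parse_local_rank and the default of 0 are hypothetical:

```python
import argparse
import os

# Sketch: accept --local_rank from torch.distributed.launch, but fall
# back to the LOCAL_RANK environment variable when the launcher exports
# that instead of passing the flag.
def parse_local_rank(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=None,
                        help='Local rank of calling process on its node.')
    args = parser.parse_args(argv)
    if args.local_rank is None:
        # Newer launchers set LOCAL_RANK rather than passing --local_rank.
        args.local_rank = int(os.environ.get('LOCAL_RANK', 0))
    return args.local_rank
```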

Member


Yeah, let's keep this. My bad, I didn't know torch distributed would add arguments to parse.

@@ -1,8 +1,15 @@
import os
Member


Not needed.

@adammoody
Contributor Author

adammoody commented Sep 15, 2021

This looks good! Some small comments.

  • Can you document this somewhere? That this option is available and how to use it? I'd imagine you'd have to use torch.distributed.run to run this in parallel

  • Can you write a test? I haven't really followed the new test framework, but I'm guessing we can at least use a single node with multiple ranks in CI.

Yes, I'll do both of those.

@adammoody
Contributor Author

I added a test case for both serial and distributed merges.

@adammoody
Contributor Author

@thomasw21 , I took another pass on this one to address your suggestions.

Member

@thomasw21 thomasw21 left a comment


Awesome, thanks! I have one very small nit.

Member


Suggested change:
- dset = dset.select(range(linelimit))
+ dset = dset[:linelimit]

I think dataset supports slicing.

EDIT: it's actually not the same thing, and select returns a datasets.arrow_dataset.Dataset object, so yours is better.

Contributor Author


Yes, I tried the dset = dset[:linelimit] slice approach first and discovered that it does not return a dataset but that dset.select() did.

# initialize distributed environment if distributed merge requested
if args.merge == 'distributed':
    if args.torch_backend is None:
        args.torch_backend = 'gloo'
Member


Can you add a print saying we default to 'gloo'?

Contributor Author


Yes, thanks. That's in there now.
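The defaulting plus print might look roughly like this. A sketch only: the helper name resolve_backend and the exact message wording are hypothetical, and args here is a stand-in namespace rather than the real script's object:

```python
from types import SimpleNamespace

# Sketch of the behavior discussed above: when a distributed merge is
# requested without an explicit torch backend, announce the fallback
# and default to 'gloo'.
def resolve_backend(args):
    if args.merge == 'distributed' and args.torch_backend is None:
        print("torch_backend not set, defaulting to 'gloo'")
        args.torch_backend = 'gloo'
    return args.torch_backend
```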

@adammoody adammoody changed the title update merge_preprocessed_data to use parallel merge update merge_preprocessed_data to use distributed merge Sep 21, 2021
@thomasw21 thomasw21 merged commit 0c82064 into bigscience-workshop:main Sep 21, 2021
@thomasw21
Member

Great work! Thank you!

@adammoody
Contributor Author

Thanks again for your help, @thomasw21 .

ofirpress pushed a commit to ofirpress/Megatron-DeepSpeed that referenced this pull request Sep 23, 2021
…orkshop#82)

* update merge_preprocessed_data to use parallel merge

* indexed_dataset: add docstrings to merge and gather methods

* merge_preprocessed_data: tweak interface, add documentation

* merge: improvements after testing

* tests: serial and distributed merge

* avoid setting pythonpath within script

* merge script: fix typo in usage comments

* print default backend when not set in distributed merge
SaulLu added a commit to SaulLu/Megatron-DeepSpeed that referenced this pull request Sep 24, 2021
@huu4ontocord
Contributor

Hi @adammoody, excellent work on data preprocessing! Would you like to help with a script to do multi-node text preprocessing, including filtering, perplexity sampling, and clustering? I was thinking about using your work as a base. It will be very interesting, I think. Lmk

@adammoody
Contributor Author

Hi @ontocord. Sure, I'd be happy to see how I might help. Do you have pointers on what you're looking to do?

@huu4ontocord
Contributor

huu4ontocord commented Oct 26, 2021 via email
