
Open-sourced PipeDLRM #122

Open · wants to merge 2 commits into base: pipedlrm

Conversation

YanzhaoWu

The open-sourced version of PipeDLRM, consisting of five components: a profiler, an optimizer, a runtime implementation, modeling, and a visualizer. PipeDLRM is built on top of DLRM, with some components from PipeDream (https://github.com/msr-fiddle/pipedream).

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 14, 2020
@facebook-github-bot

Hi @YanzhaoWu!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@ConnollyLeon

Hi Yanzhao,

I am very curious about your work. Could you please add some more instructions on how to run it to your GitHub repo? It would help me and the others a lot.

Thanks!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!


@dmudiger
Contributor

dmudiger commented Sep 9, 2020

Can we remove the empty `__init__.py` files?

@dmudiger
Contributor

dmudiger commented Sep 9, 2020

> Hi Yanzhao,
>
> I am very curious about your work. Could you please add some more instructions on how to run it to your GitHub repo? It would help me and the others a lot.
>
> Thanks!

Thank you for your interest in this work; we are actively reviewing this PR to merge it in. In the meantime, please feel free to try it out. You can find detailed instructions here: https://github.com/facebookresearch/dlrm/pull/122/files#diff-22b1984e9055744bcb6b52260dfdfb71

@dmudiger
Contributor

Bringing the discussion from the email thread back here: perhaps we can look at including some of the PipeDream components as a linked submodule rather than copying them over?

@YanzhaoWu
Author

> Hi Yanzhao,
>
> I am very curious about your work. Could you please add some more instructions on how to run it to your GitHub repo? It would help me and the others a lot.
>
> Thanks!

Thank you very much for your interest in our project. You may also check the script (https://github.com/facebookresearch/dlrm/pull/122/files#diff-bc0c739ba93024f3443445a48fd0319b) for running PipeDLRM on the Kaggle DAC dataset with a 3-stage pipeline. Hope it will be helpful.

@YanzhaoWu
Author

> Can we remove the empty `__init__.py` files?

Sure. Currently, the empty `__init__.py` files mark the directories containing them as Python packages, which PipeDLRM imports. We may remove them as we reorganize the codebase.
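As a quick, self-contained illustration of why those marker files matter (the `pkg`/`module` names below are made up for the example), an empty `__init__.py` is what makes a directory importable as a regular package:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package layout on disk:
#   pkg/__init__.py   (empty marker file)
#   pkg/module.py     (defines ANSWER)
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "pkg")
os.makedirs(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "module.py"), "w") as f:
    f.write("ANSWER = 42\n")

# With the empty __init__.py present, "pkg" imports as a regular package.
sys.path.insert(0, root)
module = importlib.import_module("pkg.module")
print(module.ANSWER)  # -> 42
```

(Python 3.3+ namespace packages make the marker file optional in some layouts, but keeping explicit `__init__.py` files makes the imports predictable, which is why they appear throughout the PR.)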

@TimJZ

TimJZ commented Sep 19, 2020

Hi Yanzhao,
I'm having some trouble running the code and I'm wondering if you could provide some help. I'm mainly confused about the meaning of several variables in the shell script.

I currently have one node with 4 GPUs; what values of num_input_rank, nrank, and ngpus should I set?

From my understanding, nrank represents the number of GPUs on one machine, so I set it to 4. I've tried several values for num_input_rank, and so far all of them gave me errors such as:

```
  File "../communication.py", line 42, in __init__
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 434, in init_process_group
    timeout=timeout)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 505, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:898] Connect timeout [172.17.0.4]:26516
```

Could you please give me some recommendations on how to set these numbers correctly? Thank you so much!

@YanzhaoWu
Author

> Hi Yanzhao,
> I'm having some trouble running the code and I'm wondering if you could provide some help. I'm mainly confused about the meaning of several variables in the shell script.
>
> I currently have one node with 4 GPUs; what values of num_input_rank, nrank, and ngpus should I set?
>
> From my understanding, nrank represents the number of GPUs on one machine, so I set it to 4. I've tried several values for num_input_rank, and so far all of them gave me errors such as:
>
> ```
>   File "../communication.py", line 42, in __init__
>     dist.init_process_group(backend, rank=rank, world_size=world_size)
>   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 434, in init_process_group
>     timeout=timeout)
>   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 505, in _new_process_group_helper
>     timeout=timeout)
> RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:898] Connect timeout [172.17.0.4]:26516
> ```
>
> Could you please give me some recommendations on how to set these numbers correctly? Thank you so much!

Thank you very much for your interest in our project.
Sure. The num_input_rank is the number of replicas of stage 0, since stage 0 also hosts the input data loader. For your case, you can set it to 1. The nrank is the number of ranks (GPUs) used for running PipeDLRM. With num_input_rank=1, you can set nrank=3 (3 GPUs <-> 3 stages, no replication). These settings should work with this script.

However, you still need to modify the model configuration file (models/dlrm/gpus=3/$conf_file) accordingly.
For the above settings, you can try conf_file=mp_conf.json in this script.
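To make the rank/stage arithmetic above concrete, here is a hypothetical sketch (the function name and the mapping rule are illustrative only; the real assignment comes from the stage-to-rank map in the model configuration file such as mp_conf.json):

```python
def rank_to_stage(rank, num_input_rank):
    # Replicas of stage 0 (the input stage) occupy the first
    # num_input_rank ranks; each remaining rank hosts one later stage.
    if rank < num_input_rank:
        return 0
    return rank - num_input_rank + 1

# num_input_rank=1, nrank=3: 3 GPUs <-> 3 stages, no replication.
nrank, num_input_rank = 3, 1
stages = [rank_to_stage(r, num_input_rank) for r in range(nrank)]
print(stages)  # -> [0, 1, 2]

# num_input_rank=2, nrank=4: stage 0 replicated twice, then stages 1 and 2.
print([rank_to_stage(r, 2) for r in range(4)])  # -> [0, 0, 1, 2]
```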

@deepakn94

This looks cool!

Agree with @dmudiger that the PipeDream parts of the code can probably be removed from this codebase, especially if you haven't made any changes -- will make the diff easier to look at. If you have some changes to PipeDream that you think would be broadly useful, I am happy to upstream them to PipeDream if you send me a PR.

@sanjay-k-mukherjee

sanjay-k-mukherjee commented Oct 29, 2020

We are running PipeDLRM with num_input_rank=3 and nrank=4, using the default script "../../exp/pipeline/dlrm_dac_pytorch.sh".
I am presently observing the following issue:

```
  File "main_with_runtime.py", line 627, in <module>
    num_versions=num_versions, lr=args.learning_rate)
  File "../sgd.py", line 23, in __init__
    macrobatch=macrobatch,
  File "../optimizer.py", line 41, in __init__
    master_parameters, **optimizer_args)
  File "/opt/conda/lib/python3.6/site-packages/torch/optim/sgd.py", line 68, in __init__
    super(SGD, self).__init__(params, defaults)
  File "/opt/conda/lib/python3.6/site-packages/torch/optim/optimizer.py", line 47, in __init__
    raise ValueError("optimizer got an empty parameter list")
ValueError: optimizer got an empty parameter list
```

And with nrank=6, I observe the following failure:

```
Traceback (most recent call last):
  File "main_with_runtime.py", line 585, in <module>
    dp.make_random_loader_with_sampler(args, train_data, test_data, num_ranks_in_first_stage)
TypeError: 'NoneType' object is not iterable
```

@YanzhaoWu
Author

> We are running PipeDLRM with num_input_rank=3 and nrank=4, using the default script "../../exp/pipeline/dlrm_dac_pytorch.sh".
> I am presently observing the following issue:
>
> ```
>   File "main_with_runtime.py", line 627, in <module>
>     num_versions=num_versions, lr=args.learning_rate)
>   File "../sgd.py", line 23, in __init__
>     macrobatch=macrobatch,
>   File "../optimizer.py", line 41, in __init__
>     master_parameters, **optimizer_args)
>   File "/opt/conda/lib/python3.6/site-packages/torch/optim/sgd.py", line 68, in __init__
>     super(SGD, self).__init__(params, defaults)
>   File "/opt/conda/lib/python3.6/site-packages/torch/optim/optimizer.py", line 47, in __init__
>     raise ValueError("optimizer got an empty parameter list")
> ValueError: optimizer got an empty parameter list
> ```
>
> And with nrank=6, I observe the following failure:
>
> ```
> Traceback (most recent call last):
>   File "main_with_runtime.py", line 585, in <module>
>     dp.make_random_loader_with_sampler(args, train_data, test_data, num_ranks_in_first_stage)
> TypeError: 'NoneType' object is not iterable
> ```

Thank you very much for your interest in our project.
For the first problem, it seems that the PyTorch model was not correctly initialized, so the optimizer cannot obtain the trainable model parameters. You may need to compile PyTorch with the corresponding patches under the pytorch_patches folder.

For the second issue, it seems that train_data or test_data is None. The input ranks load the actual training data, while the other ranks generate random data to keep the number of iterations consistent across ranks. Check the configuration to ensure that num_batches is correct. We also suggest first trying num_input_rank=1.
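The loader-selection logic described above can be sketched as follows. This is a hypothetical illustration: the names (make_loader_for_rank, the "real"/"random" tags) are made up and do not reflect the actual PipeDLRM API, which lives in main_with_runtime.py and its data-loading helpers.

```python
def make_loader_for_rank(rank, num_input_rank, num_batches):
    """Illustrative stand-in for PipeDLRM's per-rank loader selection."""
    if rank < num_input_rank:
        # Input ranks: read the real dataset samples (stubbed here).
        return [("real", i) for i in range(num_batches)]
    # Non-input ranks: synthesize random placeholder batches of the same
    # length, so the iteration count stays consistent across all ranks.
    return [("random", i) for i in range(num_batches)]

# One input rank feeding a 3-rank pipeline, 4 batches per epoch.
loaders = [make_loader_for_rank(r, num_input_rank=1, num_batches=4)
           for r in range(3)]
print([loader[0][0] for loader in loaders])  # -> ['real', 'random', 'random']
```

The key invariant is that every rank iterates num_batches times; if the real loaders on the input ranks return None (e.g. because num_batches is misconfigured), the non-input ranks fail with exactly the `'NoneType' object is not iterable` error quoted above.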
