Enable comm_replay in PARAM by Integrating and Refactoring Comm Code #112

@TaekyungHeo TaekyungHeo commented May 10, 2024

Summary

  • Code Migration: Copied all comm_replay-related code from train/comms/pt to et_replay/lib/comm. The code was copied rather than symlinked to avoid dependency issues and to keep et_replay self-contained, so that it remains functional even if the source files change.
  • Code Cleanup: Removed obsolete files such as dlrm.py and comms.py to streamline the codebase.
  • Configuration Update: Modified import statements and updated pyproject.toml to align with the new directory structure, ensuring proper package management.
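
A minimal sketch of the kind of import rewrite described in the Configuration Update item, assuming the old and new directories map to the dotted package paths `train.comms.pt` and `et_replay.lib.comm`. Those dotted paths, the `rewrite_imports` helper, and the module name in the usage example are illustrative assumptions, not code taken from this PR:

```python
import re


def rewrite_imports(source: str, old_pkg: str, new_pkg: str) -> str:
    """Replace old_pkg with new_pkg in `import` / `from ... import` lines.

    Only whole package prefixes are rewritten: `old_pkg` must be followed
    by a dot, whitespace, or end of line, so a longer name that merely
    starts with the same characters is left untouched.
    """
    pattern = re.compile(
        r"^(\s*(?:from|import)\s+)" + re.escape(old_pkg) + r"(?=[.\s]|$)",
        re.MULTILINE,
    )
    # A function replacement avoids any backslash handling in new_pkg.
    return pattern.sub(lambda m: m.group(1) + new_pkg, source)


if __name__ == "__main__":
    # Hypothetical input; the module name is for illustration only.
    example = "from train.comms.pt import comms_utils\n"
    print(rewrite_imports(example, "train.comms.pt", "et_replay.lib.comm"))
```

Statements that do not begin with `import` or `from` are not touched, so string literals and comments that mention the old path would still need a manual pass.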

Test Plan

$ pip install .
Processing /Users/theo/param/et_replay
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: et_replay
  Building wheel for et_replay (pyproject.toml) ... done
  Created wheel for et_replay: filename=et_replay-1.0.0-py3-none-any.whl size=61490 sha256=d4e4433c55487d790e6bb1bb892eca268348a148f3365d3587fac90aa38692ee
  Stored in directory: /private/var/folders/z0/c9mq5j4s6n14n0_gs7nlt6mc0000gp/T/pip-ephem-wheel-cache-jxux47rn/wheels/3b/3f/aa/d3fc853f83c22c6f3eeb09763570c2cc8031a1a414cb3c18b6
Successfully built et_replay
Installing collected packages: et_replay
  Attempting uninstall: et_replay
    Found existing installation: et_replay 1.0.0
    Uninstalling et_replay-1.0.0:
      Successfully uninstalled et_replay-1.0.0
Successfully installed et_replay-1.0.0

$ comm_replay  
[BLOCKED as expected]
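
One hedged way to confirm that the install step above registered the command, assuming `comm_replay` is exposed as a `console_scripts` entry point through pyproject.toml (the PR text does not show that entry, so this is an assumption). The helper name `find_console_script` is hypothetical; `importlib.metadata` is the standard-library API:

```python
from importlib.metadata import entry_points


def find_console_script(name: str):
    """Return the console_scripts entry point called `name`, or None."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the result supports select().
        candidates = eps.select(group="console_scripts")
    else:
        # Python 3.8/3.9: the result is a dict keyed by group name.
        candidates = eps.get("console_scripts", [])
    for ep in candidates:
        if ep.name == name:
            return ep
    return None


if __name__ == "__main__":
    ep = find_console_script("comm_replay")
    if ep is None:
        print("comm_replay is not registered as a console script")
    else:
        # ep.value holds the "module:function" target the script calls.
        print("comm_replay ->", ep.value)
```

This only checks the installed package metadata; it does not launch the tool.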

@facebook-github-bot added the CLA Signed label May 10, 2024
@TaekyungHeo changed the title from "Copy comm_replay to et_replay" to "Copy and Enable comm_replay in et_replay" May 10, 2024
@TaekyungHeo changed the title from "Copy and Enable comm_replay in et_replay" to "Enable comm_replay in PARAM by Integrating and Refactoring Comm Code" May 10, 2024
@TaekyungHeo marked this pull request as ready for review May 10, 2024 01:24
@TaekyungHeo force-pushed the et-replay-refactor-comm-replay branch 4 times, most recently from 87da146 to 738f380 May 14, 2024 02:12
@TaekyungHeo force-pushed the et-replay-refactor-comm-replay branch from 738f380 to b79b86a May 14, 2024 16:49
@facebook-github-bot

@briancoutinho has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@TaekyungHeo

The Facebook internal linter is failing. What should I do? Do you have any suggestions?

@facebook-github-bot

@TaekyungHeo has updated the pull request. You must reimport the pull request before landing.

@TaekyungHeo force-pushed the et-replay-refactor-comm-replay branch from 122b198 to b4d9a7a May 15, 2024 23:32
@facebook-github-bot

@TaekyungHeo has updated the pull request. You must reimport the pull request before landing.

@TaekyungHeo force-pushed the et-replay-refactor-comm-replay branch from b4d9a7a to a8dd759 May 15, 2024 23:33
@facebook-github-bot

@TaekyungHeo has updated the pull request. You must reimport the pull request before landing.

Co-authored-by: Brian Coutinho <bcoutinho@meta.com>
@TaekyungHeo force-pushed the et-replay-refactor-comm-replay branch from a8dd759 to 07d14f8 May 15, 2024 23:36
@facebook-github-bot

@TaekyungHeo has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@briancoutinho has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

briancoutinho pushed a commit to briancoutinho/HolisticTraceAnalysis that referenced this pull request May 16, 2024
Summary:
- **Code Migration**: Copied all comm_replay-related code from train/comms/pt to et_replay/lib/comm. The code was copied rather than symlinked to avoid dependency issues and to keep et_replay self-contained, so that it remains functional even if the source files change.
- **Code Cleanup**: Removed obsolete files such as dlrm.py and comms.py to streamline the codebase.
- **Configuration Update**: Modified import statements and updated pyproject.toml to align with the new directory structure, ensuring proper package management.

X-link: facebookresearch/param#112

Differential Revision: D57354772

Pulled By: briancoutinho
facebook-github-bot pushed a commit to facebookresearch/HolisticTraceAnalysis that referenced this pull request May 17, 2024
…137)

Summary:
Pull Request resolved: #137

- **Code Migration**: Copied all comm_replay-related code from train/comms/pt to et_replay/lib/comm. The code was copied rather than symlinked to avoid dependency issues and to keep et_replay self-contained, so that it remains functional even if the source files change.
- **Code Cleanup**: Removed obsolete files such as dlrm.py and comms.py to streamline the codebase.
- **Configuration Update**: Modified import statements and updated pyproject.toml to align with the new directory structure, ensuring proper package management.

X-link: facebookresearch/param#112

Test Plan:
```
$ pip install .
Processing /Users/theo/param/et_replay
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: et_replay
  Building wheel for et_replay (pyproject.toml) ... done
  Created wheel for et_replay: filename=et_replay-1.0.0-py3-none-any.whl size=61490 sha256=d4e4433c55487d790e6bb1bb892eca268348a148f3365d3587fac90aa38692ee
  Stored in directory: /private/var/folders/z0/c9mq5j4s6n14n0_gs7nlt6mc0000gp/T/pip-ephem-wheel-cache-jxux47rn/wheels/3b/3f/aa/d3fc853f83c22c6f3eeb09763570c2cc8031a1a414cb3c18b6
Successfully built et_replay
Installing collected packages: et_replay
  Attempting uninstall: et_replay
    Found existing installation: et_replay 1.0.0
    Uninstalling et_replay-1.0.0:
      Successfully uninstalled et_replay-1.0.0
Successfully installed et_replay-1.0.0

$ comm_replay
[BLOCKED as expected]
```

Run on mast
buck2 run mode/opt -c hpc_comms.use_ncclx=2.18.3 param_bench/train/comms/pt:launcher -- --launcher mast --cluster=MastProdCluster --dp networkai_mast_job_identity --hw tc_any --nnode 8 --ppn 8 --z=0 --module commsTraceReplay --trace-path manifold://param/tree/shengbao/et/torchx-conda-xlformers_ncclexp_70b_fp8_fsdp_pp_ctran_ag-tgqvxwkz --trace-type et --reuse-tensors

https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-param-commsTraceReplay-64gpus-allreduce-5f66a4?job_attempt=0&version=0&tab=scheduling&env=PRODUCTION

Reviewed By: shengbao-zheng

Differential Revision: D57354772

Pulled By: briancoutinho

fbshipit-source-id: f4563f6f4823e8f8b097d68aa35da3461aa4c0a0
@facebook-github-bot

@briancoutinho merged this pull request in e99ef20.
