Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify ET replay and add Integration testing #91

Open
briancoutinho opened this issue Feb 9, 2024 · 2 comments
Open

Simplify ET replay and add Integration testing #91

briancoutinho opened this issue Feb 9, 2024 · 2 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@briancoutinho
Copy link
Contributor

Summary

Today there are two replay logic supported - one for only comms and one for compute+comm. This has several drawbacks

  • Code exists in two directories - compute and comms
  • Changes in one side are not testing compatibility with the other.
  • @TaekyungHeo has observed bugs and crashes with the replay when both compute and comms is enabled.

Crashes

@TaekyungHeo to add more info on how to reproduce issues

Code unification

Basic idea is to pull things out to a replay directory and unify the code
Details TBD

Integration testing

Ensure changes are unit tested to avoid impact to external users.

@briancoutinho briancoutinho added bug Something isn't working enhancement New feature or request labels Feb 9, 2024
@TaekyungHeo
Copy link
Contributor

I am sharing my version of PARAM, which you can find here: https://github.com/TaekyungHeo/param/commits/comms-bugfix/

The most critical issue in commsTraceReplay involves the execution trace parsing logic. A quick fix is available in my branch, although I am uncertain about its correctness.
The second critical issue concerns the construction of execution trace paths. It functions correctly within Meta, but fails externally. I plan to submit a PR to address this.
The third critical issue is in running comms replay and comp replay simultaneously. During my time at Meta, this was not supported, and I am currently unsure if it has been addressed. This issue primarily stems from inconsistent trace path construction logic between comms replay and comp replay.

Additionally, while I was at Meta, I validated the computation replay results. Unfortunately, I have not had the opportunity to validate communication replay or the combination of communication and computation replay.

@TaekyungHeo
Copy link
Contributor

I have created two pull requests (PRs) related to comms trace replay as shown below. Please review the PRs when you get a chance.

Additionally, I have identified another issue in parsing execution traces, but I am not sure whether it is a universal bug or not. You can find the bugfix here: TaekyungHeo@f8042dd

There is also another pull request for trace_link.py available at: #90

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants