Upload scripts to link execution trace and kineto trace #79

Closed
wants to merge 11 commits into from

Conversation

Mingyu-Liang
Contributor

@Mingyu-Liang Mingyu-Liang commented Jun 29, 2023

This implementation supports two match modes: exact match and approximate match. For exact match, the two traces must contain the same iteration, and operators are linked one-to-one based on execution order. For approximate match, we first transform the traces into graphs and then perform a best-effort match using the tree edit distance. Both modes produce an enhanced version of the execution trace (ET) that includes the execution duration of each operator.
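To make the exact-match idea concrete, here is a minimal sketch, not the code in this PR. It assumes ET nodes are dicts with `id` and `name`, kineto operators are dicts with `name`, `ts`, and `dur`, and that sorting by `id` and `ts` respectively recovers execution order. The function name `link_by_execution_order` is hypothetical.

```python
from typing import Any


def link_by_execution_order(
    et_nodes: list[dict[str, Any]], kineto_ops: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Pair ET nodes and kineto ops one-to-one by position in execution order.

    Assumption: both inputs cover the same single iteration, so they hold
    the same operators in the same order.
    """
    if len(et_nodes) != len(kineto_ops):
        raise ValueError("exact match needs the same number of operators")

    # Assumed ordering keys: ET node id and kineto start timestamp.
    ordered_nodes = sorted(et_nodes, key=lambda n: n["id"])
    ordered_ops = sorted(kineto_ops, key=lambda op: op["ts"])

    linked = []
    for node, op in zip(ordered_nodes, ordered_ops):
        if node["name"] != op["name"]:
            raise ValueError(f"order mismatch: {node['name']} vs {op['name']}")
        # Copy the node and attach the measured duration from kineto.
        linked.append({**node, "dur": op["dur"]})
    return linked
```

A name mismatch at any position is treated as a failure here, which is the point at which an approximate match would be the fallback.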

Example use:

python trace_link.py --et-file PATH_TO_ET --kineto-file PATH_TO_KINETO_TRACE --annotation OPERATOR_NAME_TO_SLICE_THE_KINETO_TRACE

Currently we assume the ET contains only one iteration, while the kineto trace may contain multiple iterations. In the latter case, the user needs to specify an operator that occurs once per iteration to help slice the kineto trace into iterations. By default, DataLoader is used.
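The slicing step can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: kineto events are taken to be dicts with `name` and `ts`, each occurrence of the annotation is treated as the start of an iteration, and `slice_iterations` is a hypothetical name.

```python
from typing import Any


def slice_iterations(
    events: list[dict[str, Any]], annotation: str
) -> list[list[dict[str, Any]]]:
    """Split kineto events into per-iteration slices.

    Assumption: `annotation` names an operator that occurs exactly once per
    iteration, and each occurrence marks where an iteration begins.
    """
    ordered = sorted(events, key=lambda e: e["ts"])
    starts = [e["ts"] for e in ordered if e["name"] == annotation]
    if not starts:
        raise ValueError(f"annotation {annotation!r} not found in kineto trace")

    # Each slice runs from one annotation up to (not including) the next;
    # the last slice is open-ended. Events before the first annotation are dropped.
    bounds = starts + [float("inf")]
    return [
        [e for e in ordered if lo <= e["ts"] < hi]
        for lo, hi in zip(bounds, bounds[1:])
    ]
```

One of the resulting slices would then be matched against the single iteration recorded in the ET.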

@facebook-github-bot
Contributor

Hi @Mingyu-Liang!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@TaekyungHeo
Contributor

@Mingyu-Liang , I appreciate your contribution! I have a question regarding the commit you've made. I noticed that your code requires user annotations ("iteration#{iteration_number}"). I'm wondering if there might be a way to implement the same functionality without the necessity for these user annotations? I've been unable to locate such annotations in my execution graph and Kineto trace.

@Mingyu-Liang
Contributor Author

@TaekyungHeo, thanks for the comment. The user annotation is used to locate a specific segment in the kineto trace (assuming ET collects only one iteration but kineto collects multiple, and we want to match ET to its corresponding part in kineto). Do you have any suggestions on how we might achieve this without the annotations?

@TaekyungHeo
Contributor

Thanks, @Mingyu-Liang. Actually, I do not have a good idea. Do you have any idea, @louisfeng?

@louisfeng
Contributor

Hey guys, I'll look at this in more detail. At a high level, we have two possibilities:

  1. ET and kineto traces are collected in different iterations, so some iteration-dependent labels will be different.
  2. ET and kineto traces are collected in the same iteration, so everything should match up.

We have annotations in prod, so we could use this label-matching scheme, but it's not general enough for broader users. I will post a draft diff in which each record function event in the trace has a unique ID that can be used to correlate between traces.
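If such a unique ID were available in both traces, the correlation could reduce to a dictionary join. The sketch below is only an illustration: the field name `record_id` and the function `link_by_record_id` are hypothetical, since the draft diff and its actual field names are not shown in this thread.

```python
from typing import Any


def link_by_record_id(
    et_nodes: list[dict[str, Any]], kineto_ops: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Attach kineto durations to ET nodes that share the same unique ID.

    Assumption: both traces expose the ID under a hypothetical key
    "record_id"; the real field name may differ.
    """
    dur_by_id = {
        op["record_id"]: op["dur"] for op in kineto_ops if "record_id" in op
    }

    linked = []
    for node in et_nodes:
        enriched = dict(node)  # never mutate the caller's trace
        rid = node.get("record_id")
        if rid in dur_by_id:
            enriched["dur"] = dur_by_id[rid]
        linked.append(enriched)
    return linked
```

Nodes without a counterpart are passed through unchanged, so no annotation or iteration slicing is needed in this scheme.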

@Mingyu-Liang
Contributor Author

Hi guys, I have pushed a new version that also supports approximate match. As @louisfeng mentioned, the previous exact match assumes the traces are collected in the same iteration, so everything should match up. The approximate match is meant to work when the traces are collected in different iterations; it performs a best-effort match based on the tree edit distance. This can serve as a temporary fix until more support from the trace side is available.

Still, users may need to specify an operator to help slice the multiple iterations in the kineto trace and the default is DataLoader.

@louisfeng
Contributor

Could you please also add some unit tests using example JSON trace (ET and Kineto) files?

@@ -0,0 +1,309 @@
import networkx as nx

This introduces a new dependency. I'll take a look at networkx.

@Mingyu-Liang & @louisfeng, we may need to add networkx, numpy, and scipy to requirements.txt.

@Mingyu-Liang
Contributor Author

Mingyu-Liang commented Jul 12, 2023

Where would be the right place to put the traces?

@louisfeng
Contributor

Please put the unit test in this dir: https://github.com/facebookresearch/param/tree/main/train/compute/python/test

You can put the test scripts in that directory and create a sub-dir data for the test trace files.

Thank you @Mingyu-Liang

Contributor

@briancoutinho briancoutinho left a comment

Thanks for adding this :)
Curious how much time it generally takes to run this on an ET and kineto trace? I tried something like this in the past using edit distance (but I did not use networkx, so that might make it much faster).

PS: Will look into leveraging this with https://github.com/facebookresearch/HolisticTraceAnalysis once landed.

@louisfeng
Contributor

louisfeng commented Jul 13, 2023

@Mingyu-Liang Thanks for adding the tests. We can merge this PR after some cleanup and continue to iterate and improve in future PRs.

@Mingyu-Liang
Contributor Author

Hi @briancoutinho, thanks for the great comments. For the example traces I uploaded, with around 1k nodes, it takes a few seconds to find a sub-optimal mapping between the two traces. Finding the exact graph edit distance is NP-hard and often slow, so here I used the approximating function (optimize_graph_edit_distance). You can find more details about how networkx handles it here: https://networkx.org/documentation/stable/reference/algorithms/similarity.html.
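For readers unfamiliar with the networkx similarity API, the following is a small sketch of how a best-effort node mapping can be obtained. It is an illustration, not the code in this PR: it uses `nx.optimize_edit_paths` (the companion generator that also yields the edit path, rather than only the distance), assumes each graph node carries a `name` attribute, and the helper name `approximate_node_mapping` is hypothetical.

```python
import networkx as nx


def approximate_node_mapping(et_graph: nx.DiGraph, kineto_graph: nx.DiGraph):
    """Return (mapping, cost) from the last edit path networkx yields.

    nx.optimize_edit_paths yields successively cheaper edit paths, so
    keeping the most recent one gives the best mapping found so far.
    """

    def same_name(a: dict, b: dict) -> bool:
        # Assumption: operators are compared by their "name" attribute only.
        return a.get("name") == b.get("name")

    best_mapping, best_cost = {}, None
    for node_path, _edge_path, cost in nx.optimize_edit_paths(
        et_graph, kineto_graph, node_match=same_name
    ):
        # node_path holds (u, v) pairs; None on either side marks an
        # insertion or deletion, so only substitutions become links.
        # Substituted pairs may still differ in name (at a cost of 1 each).
        best_mapping = {
            u: v for u, v in node_path if u is not None and v is not None
        }
        best_cost = cost
    return best_mapping, best_cost
```

On large graphs the loop can be stopped early and the most recent mapping kept, which is what makes the result sub-optimal but quick to obtain.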

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 15, 2023
@TaekyungHeo
Contributor

@Mingyu-Liang,

Firstly, I would like to express my gratitude for your valuable contribution to this project!

While testing your code with an example trace file and a kineto trace, I noticed that the memory consumption was quite substantial. Specifically, with input files approximately 5.5MB in size, the memory usage escalated beyond 100GB, which exceeded the available resources.

I was wondering if you might have any insights into why this might be the case? Are there any potential solutions or methods you might suggest for reducing the memory consumption of this tool?

@Mingyu-Liang
Contributor Author

Hi @TaekyungHeo, may I know which example trace you are using and whether you made any changes to the code? I tried both example traces on my side and did not observe such substantial memory usage.

@TaekyungHeo
Contributor

TaekyungHeo commented Jul 18, 2023

@Mingyu-Liang, I have been working with one of the Meta-internal traces. Due to confidentiality restrictions, I am able to share it with Wenyin (@wfu-fb) but, unfortunately, not with you. For your information, I haven't made any changes to the existing code. Would it be possible for me to run the tool with your example traces?

@Mingyu-Liang
Contributor Author

Mingyu-Liang commented Jul 18, 2023

@TaekyungHeo Of course, you can try running it with:

python tools/trace_link.py --et-file test/data/resnet_et.json --kineto-file test/data/resnet_kineto.json --annotation 'enumerate(DataLoader)#_MultiProcessingDataLoaderIter.__next__'

or

python tools/trace_link.py --et-file test/data/linear_et.json --kineto-file test/data/linear_kineto.json --annotation Optimizer.step#SGD.step

@facebook-github-bot
Contributor

@louisfeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@Mingyu-Liang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@louisfeng merged this pull request in 842a6ae.

Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged

5 participants