This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fixing duplication in operator profiling #15240

Merged
merged 9 commits into from Jun 21, 2019

@Zha0q1 Zha0q1 commented Jun 13, 2019

Description

fix: #10520
fix: #15243
For a detailed explanation of the cause of the issue, as well as the proposed fix, please refer to https://cwiki.apache.org/confluence/display/MXNET/Fixing+Duplication+in+Operator+Profiling?moved=true

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@Zha0q1 Zha0q1 changed the title Fixing duplication in operator profiling [WIP] Fixing duplication in operator profiling Jun 13, 2019
@Zha0q1 Zha0q1 changed the title [WIP] Fixing duplication in operator profiling Fixing duplication in operator profiling Jun 14, 2019
@vandanavk
Contributor

@mxnet-label-bot add [Profiler, pr-awaiting-review]

@marcoabreu marcoabreu added pr-awaiting-review PR is waiting for code review Profiler MXNet profiling issues labels Jun 14, 2019
Contributor

@sandeep-krishnamurthy sandeep-krishnamurthy left a comment

Can we add a few test cases?

@Zha0q1
Contributor Author

Zha0q1 commented Jun 14, 2019

Can we add a few test cases?

Yeah, after we merge #15132. This way, I can use the JSON return value to test.

and 'Count' in target_dict['Time']['operator']['sqrt'] \
and '_plus_scalar' in target_dict['Time']['operator'] \
and 'Count' in target_dict['Time']['operator']['_plus_scalar']
# thet are called once and twice respectively
Contributor

typo: they ?

Contributor Author

Fixed. Nice catch!
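
The check in the test fragment above can be sketched in isolation. This is a minimal illustration, not the PR's actual test: it assumes the aggregate stats arrive as JSON with the `Time` / `operator` / `<name>` / `Count` nesting seen in the fragment, and it uses a hand-written sample instead of real profiler output. `SAMPLE_JSON` and `operator_count` are hypothetical names introduced here.

```python
import json

# Hypothetical sample, hand-written for illustration; NOT real profiler output.
# Only the key nesting (Time -> operator -> name -> Count) comes from the fragment.
SAMPLE_JSON = """
{
  "Time": {
    "operator": {
      "sqrt": {"Count": 1},
      "_plus_scalar": {"Count": 2}
    }
  }
}
"""


def operator_count(stats, name):
    """Return the aggregate call count for operator `name`, or None if absent."""
    entry = stats.get("Time", {}).get("operator", {}).get(name)
    if entry is None or "Count" not in entry:
        return None
    return entry["Count"]


target_dict = json.loads(SAMPLE_JSON)

# If each call is recorded exactly once, an operator invoked once reports
# Count == 1 and an operator invoked twice reports Count == 2.
assert operator_count(target_dict, "sqrt") == 1
assert operator_count(target_dict, "_plus_scalar") == 2
```

Comparing exact counts, rather than only checking that the `Count` key exists, is what would expose a duplicated entry.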

@access2rohit
Contributor

LGTM! just fix the typo in comments

Contributor

@sandeep-krishnamurthy sandeep-krishnamurthy left a comment

One minor comment.

@@ -843,6 +854,8 @@ struct ProfileTask : public ProfileDuration {
VTUNE_ONLY_CODE(std::unique_ptr<vtune::VTuneTask> vtune_task_);
/*! \brief NVTX duration object */
NVTX_ONLY_CODE(std::unique_ptr<nvtx::NVTXDuration> nvtx_duration_);
/*! \brief not to add this stat to AggregateStats */
Contributor

nit: 'not to add...' -> Add profiler stat to AggregateStats.

Contributor Author

fixed

/*!
* \brief Whether to add stat to AggregateStats
*/
void enableAggregateStats(bool enabled) {
Contributor

This is true in all places except the VTune task case.
Can we make it a default parameter with a default value of true?
Then we set it to false only in the VTune task case, and no changes are needed anywhere else.

Contributor Author

@Zha0q1 Zha0q1 Jun 20, 2019

I can default enabled to true. enableAggregateStats() is a new function I added, and the only place it is called is line 1167, with enabled == false.

Contributor Author

enable_aggregate_ has already been defaulted to true
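
The pattern discussed in this thread can be illustrated with a small sketch. The PR itself is C++; what follows is a language-neutral model in Python, and every name in it (`StatSketch`, `enable_aggregate_stats`, `aggregate_counts`) is hypothetical, not the MXNet API. It assumes only what the thread states: the flag defaults to true, and the setter is called once with false.

```python
class StatSketch:
    """Hypothetical stand-in for a profiled stat; not the MXNet class."""

    def __init__(self, name):
        self.name = name
        # Defaults to true, so existing call sites need no changes.
        self.enable_aggregate = True

    def enable_aggregate_stats(self, enabled=True):
        """Set whether this stat is added to the aggregate table."""
        self.enable_aggregate = enabled


def aggregate_counts(stats):
    """Count stats per name, skipping those excluded from aggregation."""
    counts = {}
    for stat in stats:
        if not stat.enable_aggregate:
            continue
        counts[stat.name] = counts.get(stat.name, 0) + 1
    return counts


# Two records describe the same operator call; one opts out of aggregation,
# so the call is counted once instead of twice.
first = StatSketch("sqrt")
second = StatSketch("sqrt")
second.enable_aggregate_stats(False)
counts = aggregate_counts([first, second])
```

With the flag defaulted to true, only the single call site that opts out has to change, which is the point the reviewer raises.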

Contributor

@sandeep-krishnamurthy sandeep-krishnamurthy left a comment

Thanks. LGTM

@sandeep-krishnamurthy sandeep-krishnamurthy added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-review PR is waiting for code review labels Jun 20, 2019
@Zha0q1 Zha0q1 closed this Jun 20, 2019
@Zha0q1 Zha0q1 reopened this Jun 20, 2019
@Zha0q1 Zha0q1 closed this Jun 20, 2019
@Zha0q1 Zha0q1 reopened this Jun 20, 2019
@Zha0q1 Zha0q1 closed this Jun 21, 2019
@Zha0q1 Zha0q1 reopened this Jun 21, 2019
@Zha0q1 Zha0q1 closed this Jun 21, 2019
@Zha0q1 Zha0q1 reopened this Jun 21, 2019
@sandeep-krishnamurthy
Contributor

Hi @Zha0q1, I assume you are closing and reopening the PR to get CI to pass?
Are the tests passing locally on your machine? If the tests are flaky, is there a GitHub issue to track it?
I ask because each CI run is very costly.

@Zha0q1
Contributor Author

Zha0q1 commented Jun 21, 2019

Hi @Zha0q1, I assume you are closing and reopening the PR to get CI to pass?
Are the tests passing locally on your machine? If the tests are flaky, is there a GitHub issue to track it?
I ask because each CI run is very costly.

Sorry, I will use CI more frugally. Yes, they are passing locally.
Yesterday some of the tests were not assigned a worker thread and the time limit was exceeded, so I restarted the tests.

@sandeep-krishnamurthy
Contributor

Hi @Zha0q1, I assume you are closing and reopening the PR to get CI to pass?
Are the tests passing locally on your machine? If the tests are flaky, is there a GitHub issue to track it?
I ask because each CI run is very costly.

Sorry, I will use CI more frugally. Yes, they are passing locally.
Yesterday some of the tests were not assigned a worker thread and the time limit was exceeded, so I restarted the tests.

Thanks. No problem. If it is passing locally and failing in CI, then CI should help us. Please file a GitHub issue if you see flaky tests so we can fix them.

Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge Profiler MXNet profiling issues
Development

Successfully merging this pull request may close these issues.

Profiler: Operator Aggregate Stats Duplication
MXNet operator profile aggregate counter issue
5 participants