Skip to content

Commit

Permalink
make sure to stopTrace() on exception (pytorch#125131)
Browse files Browse the repository at this point in the history
If there's an exception during collection it can result in the profiler never being stopped properly. As a result all subsequent tests that use profiling will also fail - even if they pass in isolation.

I'm hoping this fixes the flakyness in pytorch#124253, pytorch#124220, pytorch#82720, pytorch#119346, pytorch#119364, pytorch#119490, pytorch#119526, pytorch#119537 (and the currently closed pytorch#82864).

Before:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_conv2d_bias_followed_by_batchnorm2d_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_matmul_dim_fp16_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_basic_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_close_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_complex_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_multiple_preexisting_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_open_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_refcounts - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_sparse_tensors - RuntimeError: Can't disable Kineto profiler when it's not running
==================================================================================================== 16 failed, 26 passed, 53 skipped in 25.51s ====================================================================================================
```
After:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969...
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969, please...
==================================================================================================== 7 failed, 35 passed, 53 skipped in 31.51s =====================================================================================================
```

Pull Request resolved: pytorch#125131
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
  • Loading branch information
aorenste authored and andoorve committed May 1, 2024
1 parent d97568c commit 9fdd4ae
Showing 1 changed file with 13 additions and 2 deletions.
15 changes: 13 additions & 2 deletions torch/csrc/profiler/collection.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1409,8 +1409,19 @@ RecordQueue::getRecords(
}

if (python_tracer_) {
for (const auto& i : python_tracer_->getEvents(
converter, python_enters, static_cast<c10::time_t>(end_time_ns))) {
std::vector<std::shared_ptr<torch::profiler::impl::Result>> ev;
try {
ev = python_tracer_->getEvents(
converter, python_enters, static_cast<c10::time_t>(end_time_ns));
} catch (std::exception& e) {
// Normally addKinetoEvents() below will stop the trace - but if an
// exception happens here then the events will never be stopped and future
// runs will be broken - so make sure to stopTrace() if we see an
// exception.
torch::profiler::impl::kineto::stopTrace();
throw;
}
for (const auto& i : ev) {
out.push_back(i);
}
python_tracer_.reset();
Expand Down

0 comments on commit 9fdd4ae

Please sign in to comment.