
Conversation

@gimki (Contributor) commented Mar 4, 2021

Issue #, if available:

#19

Description of changes:

First iteration:

  1. Introduce a new concept in ProfilerDisabler: we break the should_stop_profiler function into 2 functions, should_stop_profiler and should_stop_sampling.

  2. should_stop_sampling is expected to be called whenever we sample. It checks the limit every time we sample (every second). This check ensures the cpu overhead brought by sampling and aggregating samples stays under the limit.

  3. should_stop_profiler is expected to be called whenever we refresh config or submit a profile. This check ensures the cpu overhead brought by refreshing config and submitting the profile is under the limit.

  4. should_stop_profiler is calculated as "total cpu time spent on sampling + aggregating + refresh config + submit profile" divided by "profile.get_active_millis_since_start()" (see the sketch after this list). This formula is debatable, as we are comparing cpu time against wall-clock time. However, it seems to be the simplest solution with reasonable accuracy.
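
To illustrate point 4, here is a minimal sketch of how the overall check could compare cpu time against wall-clock time. It is not the agent's actual code: the OverallCpuUsageCheck class, the timer API, and the 10% default limit are assumptions for illustration only; profile.get_active_millis_since_start() is the method named above.

class OverallCpuUsageCheck:
    def __init__(self, timer, cpu_limit_percentage=10.0):
        # timer is assumed to expose the total cpu seconds recorded per operation
        self.timer = timer
        self.cpu_limit_percentage = cpu_limit_percentage

    def is_overall_cpu_usage_limit_reached(self, profile):
        # total cpu time spent on sampling + aggregating + refresh config + submit profile
        total_cpu_seconds = (
            self.timer.total_seconds("sampleAndAggregate")
            + self.timer.total_seconds("refreshConfig")
            + self.timer.total_seconds("submitProfile")
        )
        # wall-clock time the profile has been active, converted from millis to seconds
        wall_clock_seconds = profile.get_active_millis_since_start() / 1000.0
        if wall_clock_seconds <= 0:
            return False
        return 100.0 * total_cpu_seconds / wall_clock_seconds > self.cpu_limit_percentage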

Note: I am opening this PR for discussion. The code is not at its final stage, as it has not been tested properly yet.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Colman Yau added 2 commits March 4, 2021 15:47
As sampleAndAggregate happens every sampling_interval, our cpu usage checker
estimates the cpu usage by dividing the cpu time spent in sampleAndAggregate
by sampling_interval.

This change will not be merged until we also have a checker monitoring
the cpu usage for submitProfile and refreshConfig.
Introduce should_stop_sampling and should_stop_profiling.

should_stop_sampling is called whenever we sample, while should_stop_profiling
is only called after we refresh config or submit a profile.
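
A minimal sketch of the per-sample estimate described in the first commit message above, assuming a 1-second sampling interval and a hypothetical 10% limit; the function name and default values are illustrative, not the actual checker:

def is_sampling_cpu_usage_limit_reached(sample_and_aggregate_cpu_seconds,
                                        sampling_interval_seconds=1.0,
                                        cpu_limit_percentage=10.0):
    # estimate the cpu usage of one cycle: cpu time spent in sampleAndAggregate
    # divided by the sampling interval, expressed as a percentage
    usage_percentage = 100.0 * sample_and_aggregate_cpu_seconds / sampling_interval_seconds
    return usage_percentage > cpu_limit_percentage

# e.g. 0.05s of cpu spent sampling within a 1s interval is 5%, below a 10% limit
assert not is_sampling_cpu_usage_limit_reached(0.05)
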
"""
if self.profiler_disabler.should_stop_profiling():
if self.profiler_disabler.should_stop_sampling():
logger.info("Profiler will not start.")
Contributor:

Could you update the logger with details explaining that it will not start due to the sampling conditions? Same above at "if self.profiler_disabler.should_stop_profiling(profile=self.collector.profile)": add a similar log, to make a clear distinction between the two when debugging.

Contributor (Author):

I agree, but I don't think I would update the log here.

I think it would fit better if I updated the log message in ProfilerDisabler, such as:

"Profiler sampling cpu usage limit reached: {:.2f} % (limit: {:.2f} %), will stop CodeGuru Profiler." vs "Profiler overall cpu usage limit reached: {:.2f} % (limit: {:.2f} %), will stop CodeGuru Profiler."

For exactly this line, the only way the disabler returned False for should_stop_profiling is when the killswitch file is detected (the cpu and memory checks should not be carried out, as the agent just started and the timer would not do the check until we have several entries in the timer).

After all, if the disabler returns False here, it must have reported an error message saying either that the killswitch was detected or that the cpu/memory limit was breached (either related to sampling or overall). :D

return True
return RunProfilerStatus(success=True, should_check_overall=True, should_reset=True)
self._sample_and_aggregate()
return RunProfilerStatus(success=True, should_check_overall=refreshed_config, should_reset=False)
Contributor:

Unless I am mistaken, this will make us check the overall overhead twice per cycle: after we refresh config and after we report, instead of just after we report. Why not return should_check_overall=False here and drop the refreshed_config variable?

Contributor (Author):

Because I think we should stop the agent if the overall cpu_usage for calling configure_agent is high as well. The check is cheap anyway, so I went for a "better safe than sorry" approach :)

Contributor:

But unfortunately this check for "overall" overhead makes sense only when we have data points for all the times we called runProfiler: most of the time it only samples, which should have little overhead, and sometimes it calls agent config or reports, which we know take more cpu.

Just before we call the agent configuration we should have cleared the previous data points, which means that if you run the check right after, you will only consider the overhead of calling agent config, which we know can be high. Actually, with the current implementation of the check, it will always return False because of MINIMUM_MEASURES_IN_DURATION_METRICS, since at that time we have only 1 data point.
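
For illustration, a hypothetical sketch of the behaviour described here; MINIMUM_MEASURES_IN_DURATION_METRICS is named in the agent, but its value and the surrounding function are assumptions, not the real implementation:

MINIMUM_MEASURES_IN_DURATION_METRICS = 2  # assumed value, not the agent's real constant

def is_overall_cpu_usage_limit_reached(duration_measures_seconds, wall_clock_seconds, limit_percentage):
    # with fewer data points than the minimum, the check short-circuits to False;
    # right after agent configuration there is only 1 data point, so the check
    # can never trip at that moment, which is the concern raised above
    if len(duration_measures_seconds) < MINIMUM_MEASURES_IN_DURATION_METRICS:
        return False
    return 100.0 * sum(duration_measures_seconds) / wall_clock_seconds > limit_percentage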

@PapaPedro (Contributor) left a comment:

Just a typo I missed last time; otherwise it looks good to me.

# timer: (0.5*20/100) * 100= 10%
assert self.process_duration_check.is_overall_cpu_usage_limit_reached(self.profile)

def test_when_average_duragtion_is_below_limit_it_returns_false(self):
Contributor:

typo "duragtion"

@gimki gimki merged commit f49ff7e into main Mar 10, 2021
@gimki gimki linked an issue Mar 22, 2021 that may be closed by this pull request
@mirelap-amazon mirelap-amazon deleted the issue-19 branch March 29, 2021 12:39

Development

Successfully merging this pull request may close these issues.

Improvement in CPU usage checker

3 participants