
Conversation

@gimki (Contributor) commented Mar 4, 2021

Issue #, if available:

#19

Description of changes:

First iteration:

  1. Introduce a new concept in ProfilerDisabler: we break the should_stop_profiler function into 2 functions, should_stop_profiler and should_stop_sampling.

  2. should_stop_sampling is expected to be called whenever we sample. It checks the limit every time we sample (every second). This check ensures the cpu overhead brought by sampling and aggregating samples stays under the limit.

  3. should_stop_profiler is expected to be called whenever we refresh config or submit a profile. This check ensures the cpu overhead brought by refreshing config and submitting the profile is under the limit.

  4. should_stop_profiler is calculated as "total cpu time spent on sampling + aggregating + refresh config + submit profile" divided by "profile.get_active_millis_since_start()" (see the sketch after this list). This formula is debatable, as we are comparing cpu time against wall-clock time. However, it seems to be the simplest solution with reasonable accuracy.
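
To illustrate point 4, here is a minimal sketch of how the overall check could compare cpu time against wall-clock time. It is not the agent's actual code: the OverallCpuUsageCheck class, the timer API, and the 10% default limit are assumptions for illustration only; profile.get_active_millis_since_start() is the method named above.

class OverallCpuUsageCheck:
    def __init__(self, timer, cpu_limit_percentage=10.0):
        # timer is assumed to expose the total cpu seconds recorded per operation
        self.timer = timer
        self.cpu_limit_percentage = cpu_limit_percentage

    def is_overall_cpu_usage_limit_reached(self, profile):
        # total cpu time spent on sampling + aggregating + refresh config + submit profile
        total_cpu_seconds = (
            self.timer.total_seconds("sampleAndAggregate")
            + self.timer.total_seconds("refreshConfig")
            + self.timer.total_seconds("submitProfile")
        )
        # wall-clock time the profile has been active, converted from millis to seconds
        wall_clock_seconds = profile.get_active_millis_since_start() / 1000.0
        if wall_clock_seconds <= 0:
            return False
        return 100.0 * total_cpu_seconds / wall_clock_seconds > self.cpu_limit_percentage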

Note: I am opening this PR for discussion. The code is not at its final stage, as it has not been tested properly yet.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Colman Yau added 2 commits March 4, 2021 15:47
As sampleAndAggregate happens every sampling_interval, our cpu usage checker
estimates the cpu usage by dividing the cpu time spent in sampleAndAggregate
by sampling_interval.

This change will not be merged until we also have a checker monitoring
the cpu usage for submitProfile and refreshConfig.
Introduce should_stop_sampling and should_stop_profiling.

should_stop_sampling is called whenever we sample, while should_stop_profiling
is only called after we refresh config or submit a profile.
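
A minimal sketch of the per-sample estimate described in the first commit message above, assuming a 1-second sampling interval and a hypothetical 10% limit; the function name and default values are illustrative, not the actual checker:

def is_sampling_cpu_usage_limit_reached(sample_and_aggregate_cpu_seconds,
                                        sampling_interval_seconds=1.0,
                                        cpu_limit_percentage=10.0):
    # estimate the cpu usage of one cycle: cpu time spent in sampleAndAggregate
    # divided by the sampling interval, expressed as a percentage
    usage_percentage = 100.0 * sample_and_aggregate_cpu_seconds / sampling_interval_seconds
    return usage_percentage > cpu_limit_percentage

# e.g. 0.05s of cpu spent sampling within a 1s interval is 5%, below a 10% limit
assert not is_sampling_cpu_usage_limit_reached(0.05)
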
"""
if self.profiler_disabler.should_stop_profiling():
if self.profiler_disabler.should_stop_sampling():
logger.info("Profiler will not start.")
Contributor:

Could you update the logger with details explaining that it will not start due to the sampling conditions? Same above at "if self.profiler_disabler.should_stop_profiling(profile=self.collector.profile)": add a similar log, to make a clear distinction between the two when debugging.

Contributor (Author):

I agree, but I don't think I would update the log here.

I think it would fit better if I updated the log message in ProfilerDisabler, such as:

"Profiler sampling cpu usage limit reached: {:.2f} % (limit: {:.2f} %), will stop CodeGuru Profiler." vs "Profiler overall cpu usage limit reached: {:.2f} % (limit: {:.2f} %), will stop CodeGuru Profiler."

For exactly this line, the only way the disabler returned False for should_stop_profiling is when the killswitch file is detected (the cpu and memory checks should not be carried out, as the agent just started and the timer would not do the check until we have several entries in the timer).

After all, if the disabler returns False here, it must have reported an error message saying either that the killswitch was detected or that the cpu/memory limit was breached (either related to sampling or overall). :D

return True
return RunProfilerStatus(success=True, should_check_overall=True, should_reset=True)
self._sample_and_aggregate()
return RunProfilerStatus(success=True, should_check_overall=refreshed_config, should_reset=False)
Contributor:

Unless I am mistaken, this will make us check the overall overhead twice per cycle: after we refresh config and after we report, instead of just after we report. Why not return should_check_overall=False here and drop the refreshed_config variable?

Contributor (Author):

Because I think we should stop the agent if the overall cpu_usage for calling configure_agent is high as well. The check is cheap anyway, so I went for a "better safe than sorry" approach :)

Contributor:

But unfortunately this check for "overall" overhead makes sense only when we have data points for all the times we called runProfiler: most of the time it only samples, which should have little overhead, and sometimes it calls agent config or reports, which we know take more cpu.

Just before we call the agent configuration we should have cleared the previous data points, which means that if you run the check right after, you will only consider the overhead of calling agent config, which we know can be high. Actually, with the current implementation of the check, it will always return False because of MINIMUM_MEASURES_IN_DURATION_METRICS, since at that time we have only 1 data point.
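
For illustration, a hypothetical sketch of the behaviour described here; MINIMUM_MEASURES_IN_DURATION_METRICS is named in the agent, but its value and the surrounding function are assumptions, not the real implementation:

MINIMUM_MEASURES_IN_DURATION_METRICS = 2  # assumed value, not the agent's real constant

def is_overall_cpu_usage_limit_reached(duration_measures_seconds, wall_clock_seconds, limit_percentage):
    # with fewer data points than the minimum, the check short-circuits to False;
    # right after agent configuration there is only 1 data point, so the check
    # can never trip at that moment, which is the concern raised above
    if len(duration_measures_seconds) < MINIMUM_MEASURES_IN_DURATION_METRICS:
        return False
    return 100.0 * sum(duration_measures_seconds) / wall_clock_seconds > limit_percentage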

@PapaPedro (Contributor) left a comment:

Just a typo I missed last time; otherwise it looks good to me.

# timer: (0.5*20/100) * 100= 10%
assert self.process_duration_check.is_overall_cpu_usage_limit_reached(self.profile)

def test_when_average_duragtion_is_below_limit_it_returns_false(self):
Contributor:

typo "duragtion"

@gimki gimki merged commit f49ff7e into main Mar 10, 2021
@gimki gimki linked an issue Mar 22, 2021 that may be closed by this pull request
@mirelap-amazon mirelap-amazon deleted the issue-19 branch March 29, 2021 12:39

Development

Successfully merging this pull request may close these issues.

Improvement in CPU usage checker

3 participants