Attempt to improve implementation of ProfilerDisabler #20
Conversation
As sampleAndAggregate happens every sampling_interval, our cpu usage checker estimates the cpu usage by taking the cpu time spent in sampleAndAggregate divided by sampling_interval. This change will not be merged until we also have a checker monitoring the cpu usage for submitProfile and refreshConfig.
Introduce should_stop_sampling and should_stop_profiling. should_stop_sampling is called whenever we sample, while should_stop_profiling is only called after we refresh config or submit a profile.
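To make the per-sample idea concrete, here is a minimal sketch; only the ratio of cpu time spent in sampleAndAggregate to sampling_interval comes from the description above, while the class and attribute names are illustrative assumptions rather than the agent's actual code.

```python
import time


class SamplingCpuUsageCheck:
    """Hypothetical sketch: estimate sampling overhead as the cpu time spent in
    sample_and_aggregate divided by the sampling interval (names are illustrative)."""

    def __init__(self, sampling_interval_seconds, cpu_limit_percentage):
        self.sampling_interval_seconds = sampling_interval_seconds
        self.cpu_limit_percentage = cpu_limit_percentage
        self.last_sample_cpu_seconds = 0.0

    def timed_sample_and_aggregate(self, sample_and_aggregate):
        # Run the sampling callable and record the cpu time it consumed.
        start = time.process_time()
        sample_and_aggregate()
        self.last_sample_cpu_seconds = time.process_time() - start

    def should_stop_sampling(self):
        # cpu usage (%) = cpu time spent sampling / sampling interval * 100
        usage = (self.last_sample_cpu_seconds / self.sampling_interval_seconds) * 100
        return usage > self.cpu_limit_percentage
```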
| """ | ||
| if self.profiler_disabler.should_stop_profiling(): | ||
| if self.profiler_disabler.should_stop_sampling(): | ||
| logger.info("Profiler will not start.") |
Could you update the logger message with details saying that the profiler will not start due to the sampling conditions? Same above at "if self.profiler_disabler.should_stop_profiling(profile=self.collector.profile)": add a similar log, to make a clear distinction between the two when debugging.
I agree, but I don't think I would update the log here.
I think it would fit better if I updated the log message in ProfilerDisabler, for example:
"Profiler sampling cpu usage limit reached: {:.2f} % (limit: {:.2f} %), will stop CodeGuru Profiler." vs "Profiler overall cpu usage limit reached: {:.2f} % (limit: {:.2f} %), will stop CodeGuru Profiler."
For exactly this line, the only way the disabler can make the profiler not start is when the killswitch file is detected (the cpu and memory checks would not be carried out because the agent has just started, and the timer would not do the check until we have several entries in the timer).
After all, if the disabler stops the profiler here, it must log a message saying either that the killswitch was detected or that the cpu/memory limit was breached (either related to sampling or overall). :D
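For illustration, a minimal sketch of how those two distinct messages could be emitted from ProfilerDisabler; the helper names and call sites are hypothetical, not the agent's actual API.

```python
import logging

logger = logging.getLogger(__name__)


def log_sampling_cpu_limit_reached(usage, limit):
    # Emitted when the per-sample (sampling) cpu usage check trips.
    logger.info("Profiler sampling cpu usage limit reached: {:.2f} % (limit: {:.2f} %), "
                "will stop CodeGuru Profiler.".format(usage, limit))


def log_overall_cpu_limit_reached(usage, limit):
    # Emitted when the overall (config refresh / profile submission) cpu usage check trips.
    logger.info("Profiler overall cpu usage limit reached: {:.2f} % (limit: {:.2f} %), "
                "will stop CodeGuru Profiler.".format(usage, limit))
```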
-   return True
+   return RunProfilerStatus(success=True, should_check_overall=True, should_reset=True)
    self._sample_and_aggregate()
+   return RunProfilerStatus(success=True, should_check_overall=refreshed_config, should_reset=False)
Unless I am mistaken, this will make us check the overall overhead twice per cycle (after we refresh config and after we report) instead of just after we report. Why not return should_check_overall=False here and drop the refreshed_config variable?
Because I think we should stop the agent if the overall cpu_usage for calling configure_agent is high as well. The check is cheap anyway, so I went for a "better safe than sorry" approach :)
But unfortunately this check for "overall" overhead only makes sense when we have data points for all the times we called runProfiler: most of the time it only samples, which should have little overhead, and sometimes it calls agent config or reports, which we know takes more cpu.
Just before we call the agent configuration we should have cleared the previous data points, which means that if you run the check right after, you will only take into account the overhead of calling agent config, which we know can be high. Actually, with the current implementation of the check, it will always return False because of MINIMUM_MEASURES_IN_DURATION_METRICS, since at that time we only have 1 data point.
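A minimal sketch of the behavior described above, assuming a minimum-measures guard similar to MINIMUM_MEASURES_IN_DURATION_METRICS; the class, threshold value and formula are illustrative, not the agent's actual implementation.

```python
MINIMUM_MEASURES_IN_DURATION_METRICS = 5  # assumed value, for illustration only


class RunProfilerTimerMetric:
    """Collects one cpu-time measure per runProfiler call since the last reset."""

    def __init__(self):
        self.measures = []

    def record(self, cpu_seconds):
        self.measures.append(cpu_seconds)

    def reset(self):
        self.measures.clear()


def is_overall_cpu_usage_limit_reached(metric, active_seconds, limit_percentage):
    # Right after refreshing config the metric has just been reset, so there is
    # only one data point and this guard makes the check return False every time.
    if len(metric.measures) < MINIMUM_MEASURES_IN_DURATION_METRICS:
        return False
    usage = (sum(metric.measures) / active_seconds) * 100
    return usage > limit_percentage
```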
PapaPedro left a comment
Just a typo I missed last time; otherwise it looks good to me.
test/unit/test_profiler_disabler.py
Outdated
    # timer: (0.5*20/100) * 100= 10%
    assert self.process_duration_check.is_overall_cpu_usage_limit_reached(self.profile)

    def test_when_average_duragtion_is_below_limit_it_returns_false(self):
typo "duragtion"
Issue #, if available:
#19
Description of changes:
First iteration:
Introduce a new concept in ProfilerDisabler. We break the should_stop_profiler function into 2 functions: should_stop_profiler and should_stop_sampling.
should_stop_sampling is expected to be called whenever we sample. It checks the limit every time we sample (every second). This check ensures the cpu overhead brought by sampling and aggregating samples stays under the limit.
should_stop_profiler is expected to be called whenever we refresh config or submit a profile. This check ensures the cpu overhead brought by refreshing config and submitting the profile stays under the limit.
should_stop_profiler is calculated as "total cpu time spent on sampling + aggregating + refresh config + submit profile" divided by "profile.get_active_millis_since_start()". This formula is debatable, as we are comparing cpu time against wall clock time. However, it seems to be the simplest solution with reasonable accuracy.
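As a rough illustration of that ratio, here is a hedged sketch; only profile.get_active_millis_since_start() comes from the description above, and the function names, parameters and limit are assumptions for illustration.

```python
def overall_cpu_usage_percentage(total_agent_cpu_millis, profile):
    """Compare the cpu time spent sampling + aggregating + refreshing config +
    submitting the profile against the wall-clock active time of the profile."""
    active_millis = profile.get_active_millis_since_start()
    return (total_agent_cpu_millis / active_millis) * 100


def should_stop_profiler(total_agent_cpu_millis, profile, cpu_limit_percentage):
    # Stop the whole profiler when the overall overhead breaches the limit.
    return overall_cpu_usage_percentage(total_agent_cpu_millis, profile) > cpu_limit_percentage
```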
Note: I am opening this PR for discussion. The code is not at its final stage, as it has not been tested properly yet.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.