
Add support for Python profiling of Server requests #3502

Merged
merged 2 commits into distributed-system-analysis:main from server_profiling on Jul 31, 2023

Conversation


@webbnh webbnh commented Jul 24, 2023

This PR adds support for performing execution profiling of Pbench Server API requests.

The profiling is performed by the Python Standard Library cProfile package. Profiling is controlled via Gunicorn pre- and post-request hook functions, which are installed via a Gunicorn configuration file (which is actually Python code...pretty cool).

How to use it: if the PB_PROFILE_DUMP_FILE environment variable is defined when shell.py executes, shell.py adds the configuration file to the Gunicorn command line, and everything else is handled automagically.

Profiling data is collected only during a single API request, starting when that request is invoked and ending when the request is complete. If the environment variable is defined to a file path, then the profile data is dumped to that file when the request completes. The data from each request overwrites the data from the previous request.

I recommend defining PB_PROFILE_DUMP_FILE to /srv/pbench/public_html/server_profile.prof: this results in the profile data being placed where it can easily be fetched by pointing a browser at https://<server_address>/server_profile.prof. Once downloaded, the profile data can be nicely visualized using SnakeViz.
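
SnakeViz is typically installed with pip and pointed at the downloaded file (e.g., snakeviz server_profile.prof). For a quick look without SnakeViz, the dump can also be inspected directly with the standard library's pstats module; a minimal sketch, assuming the file name suggested above:

import pstats

# Load the dump written via PB_PROFILE_DUMP_FILE and print the 30 functions
# with the largest cumulative time (the same sort the in-log summary uses).
pstats.Stats("server_profile.prof").sort_stats("cumulative").print_stats(30)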

Note that if PB_PROFILE_DUMP_FILE is defined instead to an empty string, the top time-consuming functions are dumped to the log, instead. Unfortunately, the list appears twice in the log -- I don't know why -- but, since it's so easy to use the file dump and so much more effective to use SnakeViz's "icicle graph", I'm not overly concerned about it.

There is no turn-key method at present for setting PB_PROFILE_DUMP_FILE in the canned server (the easiest way seems to be editing the server/lib/systemd/pbench-server.service file before the container is built or editing /etc/systemd/system/pbench-server.service inside the container when it's running). But, this support was quite valuable in tracking down the issues with upload timeouts, so I figured I would just capture it in its current state, which is quite usable once it is enabled.

PBENCH-1228
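
For orientation, here is a minimal sketch of what such a Gunicorn configuration file might look like, assembled from the description above and the fragments quoted in the review below. It is an approximation, not the actual lib/pbench/profiling.conf.py; pre_request and post_request are the standard Gunicorn server hooks, and the enable/disable/dump calls are what a working sketch would need rather than a copy of the PR's code.

import cProfile
import os
import pstats
from io import StringIO

# Read once when the config file is loaded; an empty string selects the
# in-log summary path, a file path selects the dump-to-file path.
DUMP_FILE = os.environ.get("PB_PROFILE_DUMP_FILE")


def pre_request(worker, req):
    """Gunicorn hook: start a fresh profile just before the request is handled."""
    worker.profile = cProfile.Profile()
    worker.log.info(f"PROFILING {worker.pid}: {req.uri}")
    worker.profile.enable()


def post_request(worker, req, environ, resp):
    """Gunicorn hook: stop profiling; dump to a file or log a short summary."""
    worker.profile.disable()
    if DUMP_FILE:
        # Each request's data overwrites the previous request's dump.
        worker.profile.dump_stats(DUMP_FILE)
    else:
        s = StringIO()
        ps = pstats.Stats(worker.profile, stream=s).sort_stats("cumulative")
        ps.print_stats(30)  # Limit the number of lines output
        worker.log.info(f"PROFILING {worker.pid}: {s.getvalue()}")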

lib/pbench/profiling.conf.py
worker object (ain't Python grand!?), and we use the worker's log.)
"""
worker.profile = cProfile.Profile()
worker.log.info(f"PROFILING {worker.pid}: {req.uri}")
Member

I'm generally bothered by the "single API" profiling, since it relies on "stopping the ride" at just the right time. A more useful alternative might be to provide a way to specify a full or base URI to profile, e.g., /api/v1/upload/.* and only profile on a match.

Member Author

I'll grant that it's not a fully general solution (but it was sufficiently useful to my needs). Using an environment variable to control it means restarting the server to enable/disable the profiling, which, in my mind, relegates it to use in a controlled/test environment...where "stopping the ride" is really not a problem.

We can imagine lots of ways to improve the controls over it, but I would defer them until we have an actual need.
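
For concreteness, one possible shape of the URI-match idea suggested above, sketched against the hooks described in this PR. None of this is part of the change, and PB_PROFILE_URI is an invented variable name:

import cProfile
import os
import re

# Hypothetical control: a second environment variable supplies a regex
# (e.g., /api/v1/upload/.*), and profiling starts only on a matching URI.
URI_FILTER = re.compile(os.environ.get("PB_PROFILE_URI", ".*"))


def pre_request(worker, req):
    if not URI_FILTER.match(req.uri):
        worker.profile = None  # post_request would need to skip this case
        return
    worker.profile = cProfile.Profile()
    worker.log.info(f"PROFILING {worker.pid}: {req.uri}")
    worker.profile.enable()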

Comment on lines 69 to 72
s = StringIO()
ps = pstats.Stats(worker.profile, stream=s).sort_stats("cumulative")
ps.print_stats(30) # Limit the number of lines output
worker.log.info(f"[{worker.pid}] [INFO] {s.getvalue()}")
Member

Despite the limitations in format, this gives the advantage of putting profiling data for every call in an accessible place. The [INFO], by the way, seems like unnecessary bulk. Hopefully pstats.Stats identifies the URI?

Member Author

I concur that there is value in being able to put it in the journal-log, which is why I've retained that as a choice. I just don't see the value of having it both there and in the external file...so, as the code stands, it's one or the other. Also, the output in the log is only the top 30 functions in terms of the cumulative time attributed to them, including their subroutine calls; unfortunately, the "tail" is pretty long (between 980 and 1646 functions), so 30 functions doesn't capture much of it...but posting more than 30 would make a mess of the log...which makes the external file much more attractive.

I've revised the logging to remove much of the unnecessary bulk. It was there in the code that I was cribbing from, so I had kept it. (That output format seems to be consistent with other Gunicorn messages, but I concur that it doesn't serve this purpose well, so I've removed it.)

pstats.Stats does not identify the URI, but the log message in one of the previous lines does, so the context is available in the log. (I considered adding it to this message, but I don't think it actually accomplishes anything over having it in an adjacent message.)

@webbnh webbnh marked this pull request as ready for review July 29, 2023 04:16
@webbnh webbnh left a comment

This is now "ready for review".


@webbnh webbnh requested review from dbutenhof and ndokos July 29, 2023 04:17
@dbutenhof dbutenhof left a comment

The duplication of log output is a concern. (Are you really generating two dump files, also, and just not noticing?) But since this is something that needs to be explicitly enabled with a server restart, and since, despite the significant shortcomings of the current approach, it provides at worst a viable template for capturing valuable information going forward, I'm OK with just pushing it through and polishing the floors later when we get around to it... 😁

@webbnh webbnh merged commit 7682537 into distributed-system-analysis:main Jul 31, 2023
4 checks passed
@webbnh webbnh deleted the server_profiling branch July 31, 2023 22:07

webbnh commented Jul 31, 2023

The duplication of log output is a concern.

At this point, I would call it a "mystery" rather than a "concern" (and I suspect that it is actually a feature of the Python logger that we just weren't expecting to see show up like that).

The two entries in the log have different prefixes and different formatting (i.e., one of them has all the text (including newlines) in a single log entry while the other logs the same text with each line in a separate log message). This seems to me like they both come from the same call in the source code but go through different handlers in the logger stack whose output ends up redirected (unexpectedly, perhaps) to the same place.
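
A minimal, stand-alone illustration of that kind of handler duplication (this is not the actual Gunicorn/Pbench logging setup, and the logger name is just for flavor): a logger with its own handler whose records also propagate to a root handler writing to the same destination emits each call twice, with different formatting.

import logging
import sys

# Root handler writing to stderr (standing in for the journal).
logging.basicConfig(stream=sys.stderr, format="root: %(message)s")

# A child logger with its own handler, also writing to stderr.
worker_log = logging.getLogger("example.worker")
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter("worker: %(message)s"))
worker_log.addHandler(handler)

# One call, two log entries: once via the child's handler, once via
# propagation up to the root handler.
worker_log.warning("one call, two entries")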

Are you really generating two dump files, also, and just not noticing?

That is an excellent question...I don't know how we could tell.... 😆
