
Add new cylc command to show task timings. #2286

Merged: 9 commits merged into cylc:master on Jun 6, 2017

Conversation

@trwhitcomb (Collaborator) commented May 16, 2017

Add a new query to the run database and a top-level driver
script that will allow interrogation of task times to facilitate
generation of statistics about wallclock time required for tasks,
including information necessary to generate batch queue wait times,
task run time (which was already available through the GUI), and
total wallclock time for the task.

@hjoliver (Member) commented

@trwhitcomb - thanks for this, looks like it will be useful. Partially addresses #1683

Note the Python PEP8 style check failed on Travis CI:

./tests/pep8/00-bin-lib.t .. 1/3 --- -	2017-05-16 17:29:51.952526986 +0000
+++ 00-bin-lib.stdout	2017-05-16 17:29:51.799232784 +0000
@@ -0,0 +1,6 @@
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:735:25: E241 multiple spaces after ':'
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:736:25: E241 multiple spaces after ':'
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:737:23: E241 multiple spaces after ':'
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:738:25: E241 multiple spaces after ':'
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:740:80: E501 line too long (104 > 79 characters)
+/home/travis/build/cylc/cylc/bin/cylc-show-times:58:18: E221 multiple spaces before operator

If @cylc/core agree this is a good thing, some tests will be needed to ensure future changes don't break it - let us know if you want some advice on how to do this.

@hjoliver hjoliver added this to the soon milestone May 16, 2017
@trwhitcomb (Collaborator, Author) commented

@hjoliver Thanks. This came out of our desire to keep tabs on how long certain jobs were waiting in the queue. I have an additional script that reads this output, does some grouping, and then makes basic plots like the attached but haven't included that here as it introduces additional dependencies (such as Pandas and Matplotlib).
[Image: sample_timings_plot]

I'll have a look at the PEP8 errors (aaargh - those are PEP8 rules I don't like!) but will definitely need assistance for the tests.

@hjoliver (Member) commented

@trwhitcomb - the plots look nice, BoM would be interested in this for sure. I think we could consider including this sort of thing in Cylc so long as we have clear documentation of requirements and clear error messages for those who don't have Pandas and Matplotlib installed.

Re PEP8 - initially I did not particularly agree with all the style rules, but I've come to love them 😀 If nothing else, enforced consistency has proved to be a very good thing.

@arjclark (Contributor) commented

@trwhitcomb @hjoliver Sounds good to me and certainly the sort of thing various users are interested in. It would be nice if the command could natively support printing summary timings (i.e. queued time and execution time) via a CLI switch; this could be extended to displaying average timings as well.

@trwhitcomb (Collaborator, Author) commented

@arjclark actually the summary timing print was where this whole thing started. The original version of the script that we wrote performed those computations and the table that was printed out was based on the summary statistics rather than the individual listing you see here.

The reason that this PR doesn't include that functionality is that the computation used Pandas to cleanly handle a lot of the data alignment issues and summary statistics computations, and I didn't want to introduce an additional package dependency to the cylc system.

That said, if there's interest in this, then perhaps it could be handled similarly to the way graphviz is handled now, i.e. try to import it and, if that doesn't work, disable the features that depend on it. But that's not a call I wanted to make before getting input from the core development team!

Fix errors reported by the PEP8 test in Travis CI.
@arjclark (Contributor) commented

@trwhitcomb - I guess it's just alignment that's the issue though, right? As far as summary stats are concerned, you should be able to take your db results and calculate the stats in Python from those. An ancient bit of scripting I wrote for someone gets all the useful data out of the database into a Python structure and then does these lines:

# T_FMT is the timepoint format string used in the suite database
# (stored in the suite_params table).
from datetime import datetime

date_submit = datetime.strptime(submit_date_string, T_FMT)
date_start = datetime.strptime(start_date_string, T_FMT)
td = date_start - date_submit
queue_seconds = (td.microseconds + (td.seconds + td.days * 24 * 3600) * 10**6) / 10**6
# (equivalent to td.total_seconds() on Python >= 2.7)
# ... any formatting choices, e.g. converting to hh:mm:ss format, etc.

As far as alignment goes, you can just go with the "crude" solution of padding each result entry with spaces until its length equals the longest entry in its column, and then printing that (I did something to that effect in the rosie command line utils).

I think Pandas makes a lot of this easier, but it's not necessary.

@trwhitcomb (Collaborator, Author) commented

@arjclark Alignment in this case is less about text alignment (i.e. column width) and more about alignment of indexed data while performing arithmetic operations - namely that with a large dataset being able to do something like:

queue_time = timings['start_time'] - timings['submit_time']

while ensuring that the (name, cycle, submit_num) index stays consistent, so that matching entries are subtracted from each other even if one of them is (potentially) missing, is very convenient. You're right that it's not strictly necessary, but it does get around some nasty corner cases (especially where data isn't present).
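
An illustrative sketch (made-up data, not from this PR) of that alignment behaviour: with rows keyed on (name, cycle, submit_num), Pandas subtracts matching entries and yields NaN where one side is missing, rather than silently mispairing rows.

import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('model', '2017010100', 1), ('post', '2017010100', 1)],
    names=['name', 'cycle', 'submit_num'])
timings = pd.DataFrame(
    {'submit_time': [0.0, 5.0], 'start_time': [2.0, None]},
    index=index)
queue_time = timings['start_time'] - timings['submit_time']
print(queue_time)  # 2.0 for 'model', NaN for 'post'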

@hjoliver (Member) commented

@trwhitcomb @arjclark - IMO it is OK to add a new dependency on Pandas and Matplotlib for this kind of functionality. We should use the best tools for the job unless there's reason to believe that installation of these packages is very difficult.

@trwhitcomb (Collaborator, Author) commented

Sounds good - I'll work on incorporating those features in.

For handling the case where packages are not installed, is it better to not show the options that require it, or always show everything but raise an error if the user uses a switch that requires a missing package?

@hjoliver (Member) commented

@trwhitcomb - I vote for show all options and raise an error - otherwise users may not notice that the extra functionality is available.
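
A minimal sketch (function name hypothetical, not the PR's actual code) of the approach agreed here: always expose the option, import optional packages lazily, and raise a clear error only when a feature that needs them is used.

import sys

try:
    import pandas
except ImportError:
    pandas = None


def write_summary(buf):
    """Write the timing summary, or fail clearly if pandas is absent."""
    if pandas is None:
        sys.exit('ERROR: summary output requires the "pandas" package.')
    # ... build the summary with pandas ...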

Relegate the previous output format to an optional raw output
and instead default to a much more useful summary output that
shows computed queue times, run times, and total times based
on the run database.

Add a dependency on the Pandas library to compute summary statistics.
@trwhitcomb (Collaborator, Author) commented

Here's an example of the summary timings available - because the distributions can be skewed and decidedly non-Gaussian, I've opted to go for a few more parameters than the regular "mean/standard deviation" pairs.

================================================================================
                    Host: localhost     Batch System: background
================================================================================
                                   queue_time
                                   ----------
       count      mean      std  min  25%  50%  75%  max
name
model      3  1.333333  0.57735    1    1    1  1.5    2
post       3  1.333333  0.57735    1    1    1  1.5    2
prep       1  3.000000      NaN    3    3    3  3.0    3
stop       1  1.000000      NaN    1    1    1  1.0    1

                                    run_time
                                    --------
       count       mean       std  min   25%  50%  75%  max
name
model      3  10.333333  1.527525    9   9.5   10   11   12
post       3  10.333333  1.527525    9   9.5   10   11   12
prep       1   9.000000       NaN    9   9.0    9    9    9
stop       1  10.000000       NaN   10  10.0   10   10   10

                                   total_time
                                   ----------
       count       mean       std  min  25%  50%  75%  max
name
model      3  11.666667  1.154701   11   11   11   12   13
post       3  11.666667  1.154701   11   11   11   12   13
prep       1  12.000000       NaN   12   12   12   12   12
stop       1  11.000000       NaN   11   11   11   11   11
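
For reference, a rough sketch (with made-up numbers) of how this kind of per-task summary can be produced: pandas' groupby/describe yields exactly the count/mean/std/quartile columns shown above.

import pandas as pd

timings = pd.DataFrame({
    'name': ['model', 'model', 'model', 'prep'],
    'queue_time': [1, 1, 2, 3],
})
print(timings.groupby('name')['queue_time'].describe())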

@hjoliver (Member) commented

Looks good.

Update the timing summary to include generated HTML showing both
boxplots of the timing statistics (per task) as well as a tabular
summary of the statistics.  Refactor the summary generation to
remove duplication between the text summary and HTML summary.
@trwhitcomb (Collaborator, Author) commented

Updates are done - outputs now include:

  • raw output (suitable for custom downstream processing)
  • text-based summary output (requires Pandas)
  • HTML-based summary output (requires Pandas and matplotlib)

HTML output generates a page like this:
[Image: html_capture]

I also accept an output file argument to control where things go and, thanks to StackOverflow, have a clean handler for output to either a file or stdout.
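
The usual recipe for that kind of handler looks something like the sketch below (a common StackOverflow pattern; not necessarily the exact code in this PR).

import contextlib
import sys


@contextlib.contextmanager
def smart_open(filename=None):
    """Yield a writable handle for `filename`, or stdout if none is given."""
    if filename and filename != '-':
        handle = open(filename, 'w')
    else:
        handle = sys.stdout
    try:
        yield handle
    finally:
        if handle is not sys.stdout:
            handle.close()

# Usage:
# with smart_open(output_filename) as buf:
#     buf.write(report_text)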

I think at this point it's ready for a more thorough review - I know some tests should get written but I don't have a good handle on a) how to do that and b) what's best to test. Perhaps making sure that the database columns are what we need.

Additionally, does this need more documentation beyond what's automatically generated when the Cylc User Guide (CUG) grabs the docstring?

@hjoliver (Member) left a comment

Mostly looks good to me, tests as working, just a few comments to think about.

Noticed I get a repeated error if pandas is not installed:
Cannot import pandas - summary unavailable.Cannot import pandas - summary unavailable.

bin/cylc Outdated
@@ -237,6 +237,7 @@ utility_commands['cycle-point'] = [
utility_commands['scp-transfer'] = ['scp-transfer']
utility_commands['suite-state'] = ['suite-state']
utility_commands['ls-checkpoints'] = ['ls-checkpoints']
utility_commands['show-times'] = ['show-times']
Member

I'm not sure I like the name cylc show-times. What about cylc report-timings? Or maybe just cylc report with a special option for a timing report, leaving room for other kinds of report in the future without adding many more cylc sub-commands? (This needs input from @cylc/core, I think.)

Collaborator Author

FWIW my vote would be to rename to cylc report-timings and then modify into a report subcommand once we have something else ready to go.

Member

Yeah, that's fine with me.

# along with this program. If not, see <http://www.gnu.org/licenses/>.

"""cylc [util] show-times [OPTIONS] REG

Member

I'll suggest a few changes below to reduce help text verbosity a bit...


"""cylc [util] show-times [OPTIONS] REG

Retrieve timing information for the given cylc suite that can be
Member

Retrieve suite timing information for wait and run time performance analysis.


Retrieve timing information for the given cylc suite that can be
used for fine-grain diagnostics for wait time and run time performance
analysis. Two main categories of output are available: summary output
Member

"raw output, and text or HTML summary output, is available"

analysis. Two main categories of output are available: summary output
(in text or web format) and raw output. If an output filename is
supplied, output is redirected there otherwise both summary and raw
outputs are printed to standard output.
Member

A minor point, but given that all the output including SVG plots is text-based, an output filename option is not really necessary - just advise to redirect with > filename? (hmmm, or maybe we want an MS Windows port one day...)

)
parser.add_option(
"-r", "--raw",
help="Show raw timing output for downstream processing.",
Member

What do you mean by "for downstream processing"?

Collaborator Author

"downstream processing" meaning user-specific postprocessing that happens with the data. This is less necessary now that we're including some sort of summary capability within the command, but I can still see people wanting to do something extra and using the raw output as data input.

Member

Oh, understood! I was confused as we commonly use "downstream processing" to describe bits of in-suite workflow downstream of some important task.

Maybe something like "raw timing output suitable for automated text parsing"?

if output_options.count(True) > 1:
parser.error('Cannot combine output formats (choose one)')
if output_options.count(True) == 0:
# No output specified - choose summary by default
Member

if not any(output_options)


def write_category(self, buf, category, df_reshape, df_describe):
pass

Member

should these use @abstractmethod?

Collaborator Author

It can, but one of the reasons I implemented it this way is so I don't need to define every single routine in the inheriting classes. If they were @abstractmethod, I'd need to add subroutines with pass in the child classes even when they're not used (e.g. TextTimingSummary doesn't use a summary header or footer). If you'd like, though, I can modify it to be more explicit.
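
For context, a minimal sketch of the pattern being described (method bodies illustrative; only write_category and TextTimingSummary come from the diff and discussion): the base class supplies do-nothing defaults, so subclasses override only the hooks they actually use.

class TimingSummary(object):
    """Base class: no-op hooks that subclasses override only as needed."""

    def write_summary_header(self, buf):
        pass  # optional hook

    def write_summary_footer(self, buf):
        pass  # optional hook

    def write_category(self, buf, category, df_reshape, df_describe):
        raise NotImplementedError()  # genuinely required in subclasses


class TextTimingSummary(TimingSummary):
    """Text output needs no header or footer, so only this is overridden."""

    def write_category(self, buf, category, df_reshape, df_describe):
        buf.write('%s\n' % category)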

Member

No, that's fine - it's not a major issue in this case.

GROUP BY
e.name, e.cycle, e.submit_num
""" % {
'task_events': self.TABLE_TASK_EVENTS,
@hjoliver (Member) commented May 25, 2017

(this is the gnarliest db query I've seen to date ... not that I see many of these!)

Contributor

I think it may be easier to use:

  • task_states.submit_num for the latest submit number.
  • task_jobs.time_submit for submit time.
  • task_jobs.time_run for start time.
  • task_jobs.time_exit for the exit time.

(The task_jobs table is designed to store this sort of information for Rose Bush, so should be a bit more efficient.)
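
To make the suggestion concrete, a hedged sketch of the query shape it implies (table and column names taken from the comment above; the exact schema and database path are not verified here).

import sqlite3

conn = sqlite3.connect('log/db')  # suite run database; path illustrative
QUERY = (
    "SELECT j.name, j.cycle, j.submit_num,"
    " j.time_submit, j.time_run, j.time_exit"
    " FROM task_jobs AS j"
    " JOIN task_states AS s"
    " ON s.name = j.name AND s.cycle = j.cycle"
    " AND s.submit_num = j.submit_num"
)
for row in conn.execute(QUERY):
    print(row)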

border-bottom: 1px solid grey;
}
svg {
width: 65%;
Member

This makes the plots look too big, to me, although perhaps it depends on how much data is being displayed. Could be configurable?

@hjoliver (Member) commented

PEP8 checks failed, btw.

@hjoliver (Member) commented

@matthewrmshin - can you review too, or reassign?

@trwhitcomb (Collaborator, Author) commented

@hjoliver - spotted the PEP8 failures; this time I actually ran the checker before committing and it caught a bunch, so I don't know why those last few slipped through. I'll fix them when I commit the response to your review.

@matthewrmshin (Contributor) left a comment

Hi @trwhitcomb I have added some initial thoughts that you may want to consider, or perhaps open new issues for after this PR.

}
"""

buf.write('<html><head><style>%s</style></head><body>' % css)
Contributor

Just wondering if you'd consider moving the display presentation logic into Jinja2 templates?
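
Not part of this PR, but for illustration, roughly what that suggestion could look like: the markup moves into a Jinja2 template and the Python side just supplies data.

from jinja2 import Template

PAGE = Template("""\
<html><head><style>{{ css }}</style></head>
<body>
{% for category, table in categories %}
<h2>{{ category }}</h2>
{{ table }}
{% endfor %}
</body></html>
""")

html = PAGE.render(
    css='svg { width: 65%; }',
    categories=[('queue_time', '<table>...</table>')])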


Update several items based on review and revision comments.  Some
highlights:
    - command is now cylc report-timings
    - db query updated to mostly use the same table as Rose Bush,
      avoiding a pivot and triple join.  The new query performs
      much better than the old query and is easier to read.
    - documentation updated for less verbosity
    - PEP8 updates that weren't caught by local pep8
    - missing library error should only show up once
@trwhitcomb (Collaborator, Author) commented

Thanks, guys! New revision should address most of these things.

@hjoliver I can see what you mean about the image size, but it does resize with the window. We could parameterize it, but I'm open to suggestions for the easiest way to handle that (I'm trying not to overcomplicate things initially).

@matthewrmshin Using task_jobs is a great idea, and simplifies things quite a bit. I had originally only been working in the task_events table and missed that the information was also in task_jobs. The times between the two tables are slightly different (typically only by a second or two) but the query runs much faster. I like the idea of Jinja2 templates for the report output, but think that should come in a follow-up issue as there are some things I'm not quite clear on.

@trwhitcomb (Collaborator, Author) commented May 25, 2017

Also, it looks like my local pep8 install isn't picking up some of the things that Travis CI fails on, so now I'll wait and see if the checks pass and follow-up if it finds something my pep8 missed.

Timing information requires obtaining the most recent submission number of a succeeded task. When this is done with the task events table, it requires a subquery, but using the task states table (where only the most recent submit number is stored) allows a much simpler join operation to be used instead.
Catch one PEP8 violation caught by Travis CI but missed by my local
pep8 run.
@trwhitcomb (Collaborator, Author) commented

@hjoliver also, I'm having a mental block for the best way to handle configuring the plot width - suggest deferring that to Matt's suggestion of basing output on a Jinja2 template later.

WHERE
s.event='%(succeeded)s'
event = '%(succeeded)s'
Contributor

Sorry, I should also have mentioned that task_jobs.run_status == 0 is probably a better test for succeeded.

@matthewrmshin (Contributor) commented

@trwhitcomb The task_events table was our original attempt to store timing information. The times in the task_events table are the times when the suite processes the events (which can be delayed if the suite is busy), whereas the times in the task_jobs are typically the times recorded by the cylc message command when it is called by the jobs. Therefore, the times in the task_jobs table should be more accurate for accounting purpose. (Obviously, these times will also differ slightly from the times reported by your batch system...)

Rather than computing the latest submit number for a succeeded task, use the return status, which is also stored in the task_jobs table; this gets rid of joins altogether and uses only task_jobs, for better efficiency as well as reduced SQL complexity.
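
A sketch of what that simplification amounts to: with the job return status stored in task_jobs, succeeded jobs can be selected directly and no join is needed (column names as given in the review comments; illustrative, not copied from the diff).

SUCCEEDED_TIMINGS_QUERY = (
    "SELECT name, cycle, submit_num, time_submit, time_run, time_exit"
    " FROM task_jobs"
    " WHERE run_status = 0"
)
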
@trwhitcomb (Collaborator, Author) commented

@matthewrmshin Thanks! That simplifies things even more.

@hjoliver (Member) commented

@trwhitcomb - yes, I'm happy to punt further improvements, such as plot customization and Jinja2 templating, to future PRs.

@hjoliver hjoliver force-pushed the master branch 3 times, most recently from add7f5b to 7162266 on June 4, 2017 09:03
@hjoliver hjoliver merged commit a293467 into cylc:master Jun 6, 2017
@hjoliver hjoliver modified the milestones: next release, soon Jun 6, 2017
@trwhitcomb trwhitcomb deleted the show_timings branch June 7, 2017 22:54