
Add new cylc command to show task timings. #2286

Merged: 9 commits merged into cylc:master on Jun 6, 2017

Conversation

@trwhitcomb (Collaborator) commented May 16, 2017

Add a new query to the run database and a top-level driver
script that will allow interrogation of task times to facilitate
generation of statistics about wallclock time required for tasks,
including information necessary to generate batch queue wait times,
task run time (which was already available through the GUI), and
total wallclock time for the task.

@hjoliver (Member) commented

@trwhitcomb - thanks for this, looks like it will be useful. Partially addresses #1683

Note the Python PEP8 style check failed on Travis CI:

./tests/pep8/00-bin-lib.t .. 1/3 --- -	2017-05-16 17:29:51.952526986 +0000
+++ 00-bin-lib.stdout	2017-05-16 17:29:51.799232784 +0000
@@ -0,0 +1,6 @@
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:735:25: E241 multiple spaces after ':'
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:736:25: E241 multiple spaces after ':'
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:737:23: E241 multiple spaces after ':'
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:738:25: E241 multiple spaces after ':'
+/home/travis/build/cylc/cylc/lib/cylc/rundb.py:740:80: E501 line too long (104 > 79 characters)
+/home/travis/build/cylc/cylc/bin/cylc-show-times:58:18: E221 multiple spaces before operator

If @cylc/core agree this is a good thing, some tests will be needed to ensure future changes don't break it - let us know if you want some advice on how to do this.

@hjoliver hjoliver added this to the soon milestone May 16, 2017
@trwhitcomb (Collaborator, Author) commented

@hjoliver Thanks. This came out of our desire to keep tabs on how long certain jobs were waiting in the queue. I have an additional script that reads this output, does some grouping, and then makes basic plots like the attached but haven't included that here as it introduces additional dependencies (such as Pandas and Matplotlib).
[Image: sample_timings_plot]

I'll have a look at the PEP8 errors (aaargh - those are PEP8 rules I don't like!) but will definitely need assistance for the tests.

@hjoliver (Member) commented

@trwhitcomb - the plots look nice, BoM would be interested in this for sure. I think we could consider including this sort of thing in Cylc so long as we have clear documentation of requirements and clear error messages for those who don't have Pandas and Matplotlib installed.

Re PEP8 - initially I did not particularly agree with all the style rules, but I've come to love them 😀 If nothing else, enforced consistency has proved to be a very good thing.

@arjclark (Contributor) commented

@trwhitcomb @hjoliver Sounds good to me and certainly the sort of thing various users are interested in. It would be nice if the command could natively support printing summary timings (i.e. queued time and execution time) via a CLI switch; this could be extended to displaying average timings as well.

@trwhitcomb (Collaborator, Author) commented

@arjclark actually the summary timing print was where this whole thing started. The original version of the script that we wrote performed those computations and the table that was printed out was based on the summary statistics rather than the individual listing you see here.

The reason that this PR doesn't include that functionality is that the computation used Pandas to cleanly handle a lot of the data alignment issues and summary statistics computations, and I didn't want to introduce an additional package dependency to the cylc system.

That said, if there's interest in this, then perhaps it could be handled similarly to the way graphviz is handled now, i.e. try to import it and, if that doesn't work, disable the features that depend on it. But that's not a call I wanted to make before getting input from the core development team!

Fix errors reported by the PEP8 test in Travis CI.
@arjclark (Contributor) commented

@trwhitcomb - I guess it's just alignment that's the issue though, right? As far as summary stats are concerned, you should be able to take your db results and calculate the stats in Python from those. An ancient bit of scripting I wrote for someone gets all the useful data out of the database into a Python structure and then does these lines:

# T_FMT is the timepoint format string used in the suite database
# (stored in the suite_params table).
from datetime import datetime

date_submit = datetime.strptime(submit_date_string, T_FMT)
date_start = datetime.strptime(start_date_string, T_FMT)
td = date_start - date_submit
queue_seconds = (td.microseconds + (td.seconds + td.days * 24 * 3600) * 10**6) / 10**6
# (equivalent to td.total_seconds() on Python >= 2.7)
# ... any formatting choices, e.g. converting to hh:mm:ss format, etc.

As far as alignment goes, you can just go with the "crude" solution of padding each result entry with spaces until its length equals the longest entry in its column, and then printing that (I did something to that effect in the rosie command line utils).

I think Pandas makes a lot of this easier, but it's not necessary.

@trwhitcomb (Collaborator, Author) commented

@arjclark Alignment in this case is less about text alignment (i.e. column width) and more about alignment of indexed data while performing arithmetic operations - namely that with a large dataset being able to do something like:

queue_time = timings['start_time'] - timings['submit_time']

while ensuring that the (name, cycle, submit_num) index stays consistent, so that matching entries are subtracted from each other even if one of them is (potentially) missing, is very convenient. You're right that it's not strictly necessary, but it does get around some nasty corner cases (especially where data isn't present).
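
An illustrative sketch (made-up data, not from this PR) of that alignment behaviour: with rows keyed on (name, cycle, submit_num), Pandas subtracts matching entries and yields NaN where one side is missing, rather than silently mispairing rows.

import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('model', '2017010100', 1), ('post', '2017010100', 1)],
    names=['name', 'cycle', 'submit_num'])
timings = pd.DataFrame(
    {'submit_time': [0.0, 5.0], 'start_time': [2.0, None]},
    index=index)
queue_time = timings['start_time'] - timings['submit_time']
print(queue_time)  # 2.0 for 'model', NaN for 'post'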

@hjoliver (Member) commented

@trwhitcomb @arjclark - IMO it is OK to add a new dependency on Pandas and Matplotlib for this kind of functionality. We should use the best tools for the job unless there's reason to believe that installation of these packages is very difficult.

@trwhitcomb (Collaborator, Author) commented

Sounds good - I'll work on incorporating those features in.

For handling the case where packages are not installed, is it better to not show the options that require it, or always show everything but raise an error if the user uses a switch that requires a missing package?

@hjoliver (Member) commented

@trwhitcomb - I vote for show all options and raise an error - otherwise users may not notice that the extra functionality is available.
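
A minimal sketch (function name hypothetical, not the PR's actual code) of the approach agreed here: always expose the option, import optional packages lazily, and raise a clear error only when a feature that needs them is used.

import sys

try:
    import pandas
except ImportError:
    pandas = None


def write_summary(buf):
    """Write the timing summary, or fail clearly if pandas is absent."""
    if pandas is None:
        sys.exit('ERROR: summary output requires the "pandas" package.')
    # ... build the summary with pandas ...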

Relegate the previous output format to an optional raw output
and instead default to a much more useful summary output that
shows computed queue times, run times, and total times based
on the run database.

Add a dependency on the Pandas library to compute summary statistics.
@trwhitcomb (Collaborator, Author) commented

Here's an example of the summary timings available - because the distributions can be skewed and decidedly non-Gaussian, I've opted to go for a few more parameters than the regular "mean/standard deviation" pairs.

================================================================================
                    Host: localhost     Batch System: background
================================================================================
                                   queue_time
                                   ----------
       count      mean      std  min  25%  50%  75%  max
name
model      3  1.333333  0.57735    1    1    1  1.5    2
post       3  1.333333  0.57735    1    1    1  1.5    2
prep       1  3.000000      NaN    3    3    3  3.0    3
stop       1  1.000000      NaN    1    1    1  1.0    1

                                    run_time
                                    --------
       count       mean       std  min   25%  50%  75%  max
name
model      3  10.333333  1.527525    9   9.5   10   11   12
post       3  10.333333  1.527525    9   9.5   10   11   12
prep       1   9.000000       NaN    9   9.0    9    9    9
stop       1  10.000000       NaN   10  10.0   10   10   10

                                   total_time
                                   ----------
       count       mean       std  min  25%  50%  75%  max
name
model      3  11.666667  1.154701   11   11   11   12   13
post       3  11.666667  1.154701   11   11   11   12   13
prep       1  12.000000       NaN   12   12   12   12   12
stop       1  11.000000       NaN   11   11   11   11   11
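
For reference, a rough sketch (with made-up numbers) of how this kind of per-task summary can be produced: pandas' groupby/describe yields exactly the count/mean/std/quartile columns shown above.

import pandas as pd

timings = pd.DataFrame({
    'name': ['model', 'model', 'model', 'prep'],
    'queue_time': [1, 1, 2, 3],
})
print(timings.groupby('name')['queue_time'].describe())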

@hjoliver (Member) commented

Looks good.

Update the timing summary to include generated HTML showing both
boxplots of the timing statistics (per task) as well as a tabular
summary of the statistics.  Refactor the summary generation to
remove duplication between the text summary and HTML summary.
@trwhitcomb (Collaborator, Author) commented

Updates are done - outputs now include:

  • raw output (suitable for custom downstream processing)
  • text-based summary output (requires Pandas)
  • HTML-based summary output (requires Pandas and matplotlib)

HTML output generates a page like this:
[Image: html_capture]

I also accept an output file argument to control where things go and, thanks to StackOverflow, have a clean handler for output to either a file or stdout.
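
The usual recipe for that kind of handler looks something like the sketch below (a common StackOverflow pattern; not necessarily the exact code in this PR).

import contextlib
import sys


@contextlib.contextmanager
def smart_open(filename=None):
    """Yield a writable handle for `filename`, or stdout if none is given."""
    if filename and filename != '-':
        handle = open(filename, 'w')
    else:
        handle = sys.stdout
    try:
        yield handle
    finally:
        if handle is not sys.stdout:
            handle.close()

# Usage:
# with smart_open(output_filename) as buf:
#     buf.write(report_text)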

I think at this point it's ready for a more thorough review - I know some tests should get written but I don't have a good handle on a) how to do that and b) what's best to test. Perhaps making sure that the database columns are what we need.

Additionally, does this need more documentation beyond what's automatically generated when the Cylc User Guide (CUG) grabs the docstring?

@hjoliver (Member) left a comment

Mostly looks good to me, tests as working, just a few comments to think about.

Noticed I get a repeated error if pandas is not installed:
Cannot import pandas - summary unavailable.Cannot import pandas - summary unavailable.

bin/cylc Outdated
@@ -237,6 +237,7 @@ utility_commands['cycle-point'] = [
utility_commands['scp-transfer'] = ['scp-transfer']
utility_commands['suite-state'] = ['suite-state']
utility_commands['ls-checkpoints'] = ['ls-checkpoints']
utility_commands['show-times'] = ['show-times']
Member

I'm not sure I like the name cylc show-times. What about cylc report-timings? Or maybe just cylc report with a special option for a timing report, leaving room for other kinds of report in the future without adding many more cylc sub-commands? (This needs input from @cylc/core, I think.)

Collaborator Author

FWIW my vote would be to rename to cylc report-timings and then modify into a report subcommand once we have something else ready to go.

Member

Yeah, that's fine with me.

# along with this program. If not, see <http://www.gnu.org/licenses/>.

"""cylc [util] show-times [OPTIONS] REG

Member

I'll suggest a few changes below to reduce help text verbosity a bit...


"""cylc [util] show-times [OPTIONS] REG

Retrieve timing information for the given cylc suite that can be
Member

Retrieve suite timing information for wait and run time performance analysis.


Retrieve timing information for the given cylc suite that can be
used for fine-grain diagnostics for wait time and run time performance
analysis. Two main categories of output are available: summary output
Member

"raw output, and text or HTML summary output, is available"

analysis. Two main categories of output are available: summary output
(in text or web format) and raw output. If an output filename is
supplied, output is redirected there otherwise both summary and raw
outputs are printed to standard output.
Member

A minor point, but given that all the output including SVG plots is text-based, an output filename option is not really necessary - just advise to redirect with > filename? (hmmm, or maybe we want an MS Windows port one day...)

)
parser.add_option(
"-r", "--raw",
help="Show raw timing output for downstream processing.",
Member

What do you mean by "for downstream processing"?

Collaborator Author

"downstream processing" meaning user-specific postprocessing that happens with the data. This is less necessary now that we're including some sort of summary capability within the command, but I can still see people wanting to do something extra and using the raw output as data input.

Member

Oh, understood! I was confused as we commonly use "downstream processing" to describe bits of in-suite workflow downstream of some important task.

Maybe something like "raw timing output suitable for automated text parsing"?

if output_options.count(True) > 1:
parser.error('Cannot combine output formats (choose one)')
if output_options.count(True) == 0:
# No output specified - choose summary by default
Member

if not any(output_options)


def write_category(self, buf, category, df_reshape, df_describe):
pass

Member

should these use @abstractmethod?

Collaborator Author

It can, but one of the reasons I implemented it this way is so I don't need to define every single routine in the inheriting classes. If they were @abstractmethod, I'd need to add subroutines with pass in the child classes even when they're not used (e.g. TextTimingSummary doesn't use a summary header or footer). If you'd like, though, I can modify it to be more explicit.
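
For context, a minimal sketch of the pattern being described (method bodies illustrative; only write_category and TextTimingSummary come from the diff and discussion): the base class supplies do-nothing defaults, so subclasses override only the hooks they actually use.

class TimingSummary(object):
    """Base class: no-op hooks that subclasses override only as needed."""

    def write_summary_header(self, buf):
        pass  # optional hook

    def write_summary_footer(self, buf):
        pass  # optional hook

    def write_category(self, buf, category, df_reshape, df_describe):
        raise NotImplementedError()  # genuinely required in subclasses


class TextTimingSummary(TimingSummary):
    """Text output needs no header or footer, so only this is overridden."""

    def write_category(self, buf, category, df_reshape, df_describe):
        buf.write('%s\n' % category)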

Member

No, that's fine - it's not a major issue in this case.

GROUP BY
e.name, e.cycle, e.submit_num
""" % {
'task_events': self.TABLE_TASK_EVENTS,
@hjoliver (Member) commented May 25, 2017

(this is the gnarliest db query I've seen to date ... not that I see many of these!)

Contributor

I think it may be easier to use:

  • task_states.submit_num for the latest submit number.
  • task_jobs.time_submit for submit time.
  • task_jobs.time_run for start time.
  • task_jobs.time_exit for the exit time.

(The task_jobs table is designed to store this sort of information for Rose Bush, so should be a bit more efficient.)
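
To make the suggestion concrete, a hedged sketch of the query shape it implies (table and column names taken from the comment above; the exact schema and database path are not verified here).

import sqlite3

conn = sqlite3.connect('log/db')  # suite run database; path illustrative
QUERY = (
    "SELECT j.name, j.cycle, j.submit_num,"
    " j.time_submit, j.time_run, j.time_exit"
    " FROM task_jobs AS j"
    " JOIN task_states AS s"
    " ON s.name = j.name AND s.cycle = j.cycle"
    " AND s.submit_num = j.submit_num"
)
for row in conn.execute(QUERY):
    print(row)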

border-bottom: 1px solid grey;
}
svg {
width: 65%;
Member

This makes the plots look too big, to me, although perhaps it depends on how much data is being displayed. Could be configurable?

@hjoliver (Member) commented

PEP8 checks failed, btw.

@hjoliver (Member) commented

@matthewrmshin - can you review too, or reassign?

@trwhitcomb (Collaborator, Author) commented

@hjoliver - spotted the PEP8 failures; this time I actually ran the checker before committing and it caught a bunch, so I don't know why those last few slipped through. I'll fix them when I commit the response to your review.

@matthewrmshin (Contributor) left a comment

Hi @trwhitcomb I have added some initial thoughts that you may want to consider, or perhaps open new issues for after this PR.

}
"""

buf.write('<html><head><style>%s</style></head><body>' % css)
Contributor

Just wondering if you'd consider moving the display presentation logic into Jinja2 templates?
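
Not part of this PR, but for illustration, roughly what that suggestion could look like: the markup moves into a Jinja2 template and the Python side just supplies data.

from jinja2 import Template

PAGE = Template("""\
<html><head><style>{{ css }}</style></head>
<body>
{% for category, table in categories %}
<h2>{{ category }}</h2>
{{ table }}
{% endfor %}
</body></html>
""")

html = PAGE.render(
    css='svg { width: 65%; }',
    categories=[('queue_time', '<table>...</table>')])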


Update several items based on review and revision comments.  Some
highlights:
    - command is now cylc report-timings
    - db query updated to mostly use the same table as Rose Bush,
      avoiding a pivot and triple join.  The new query performs
      much better than the old query and is easier to read.
    - documentation updated for less verbosity
    - PEP8 updates that weren't caught by local pep8
    - missing library error should only show up once
@trwhitcomb (Collaborator, Author) commented

Thanks, guys! New revision should address most of these things.

@hjoliver I can see what you mean about the image size, but it does resize with the window. We could parameterize it, but I'm open to suggestions for the easiest way to handle that (I'm trying not to overcomplicate things initially).

@matthewrmshin Using task_jobs is a great idea, and simplifies things quite a bit. I had originally only been working in the task_events table and missed that the information was also in task_jobs. The times between the two tables are slightly different (typically only by a second or two) but the query runs much faster. I like the idea of Jinja2 templates for the report output, but think that should come in a follow-up issue as there are some things I'm not quite clear on.

@trwhitcomb (Collaborator, Author) commented May 25, 2017

Also, it looks like my local pep8 install isn't picking up some of the things that Travis CI fails on, so now I'll wait and see if the checks pass and follow-up if it finds something my pep8 missed.

Timing information requires obtaining the most recent submission number of a succeeded task. When this is done with the task events table, it requires a subquery, but using the task states table (where only the most recent submit number is stored) allows a much simpler join operation to be used instead.
Catch one PEP8 violation caught by Travis CI but missed by my local
pep8 run.
@trwhitcomb (Collaborator, Author) commented

@hjoliver also, I'm having a mental block for the best way to handle configuring the plot width - suggest deferring that to Matt's suggestion of basing output on a Jinja2 template later.

WHERE
s.event='%(succeeded)s'
event = '%(succeeded)s'
Contributor

Sorry, I should also have mentioned that task_jobs.run_status == 0 is probably a better test for succeeded.

@matthewrmshin (Contributor) commented

@trwhitcomb The task_events table was our original attempt to store timing information. The times in the task_events table are the times when the suite processes the events (which can be delayed if the suite is busy), whereas the times in the task_jobs are typically the times recorded by the cylc message command when it is called by the jobs. Therefore, the times in the task_jobs table should be more accurate for accounting purpose. (Obviously, these times will also differ slightly from the times reported by your batch system...)

Rather than computing the latest submit number for a succeeded task, use the return status, which is also stored in the task_jobs table; this gets rid of joins altogether and uses only task_jobs, for better efficiency as well as reduced SQL complexity.
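
A sketch of what that simplification amounts to: with the job return status stored in task_jobs, succeeded jobs can be selected directly and no join is needed (column names as given in the review comments; illustrative, not copied from the diff).

SUCCEEDED_TIMINGS_QUERY = (
    "SELECT name, cycle, submit_num, time_submit, time_run, time_exit"
    " FROM task_jobs"
    " WHERE run_status = 0"
)
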
@trwhitcomb (Collaborator, Author) commented

@matthewrmshin Thanks! That simplifies things even more.

@hjoliver (Member) commented

@trwhitcomb - yes, I'm happy to punt further improvements, such as plot customization and Jinja2 templating, to future PRs.

@hjoliver hjoliver force-pushed the master branch 3 times, most recently from add7f5b to 7162266 on June 4, 2017 09:03
@hjoliver hjoliver merged commit a293467 into cylc:master Jun 6, 2017
@hjoliver hjoliver modified the milestones: next release, soon Jun 6, 2017
@trwhitcomb trwhitcomb deleted the show_timings branch June 7, 2017 22:54