
wrapper performance optimizations #413

Merged
merged 19 commits into from
Oct 18, 2021

Conversation

@carns (Contributor) commented on Jul 8, 2021

This pull request adds multiple performance enhancements to the Darshan runtime wrappers. The optimizations target the Theta system at the ALCF, but will likely benefit other platforms as well.

These changes reduce function call overhead, reduce locking overhead, use atomics in place of locks in some locations, and add optional functionality to use the Intel-specific rdtscp timer instruction on platforms where it is more performant than standard timer calls.
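For reference, an rdtscp-based timestamp read might look roughly like the sketch below. This is an illustrative sketch, not Darshan's actual implementation; the function names and the assumption that the caller supplies the TSC frequency are mine.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical sketch of an rdtscp-based timer read (names are
 * illustrative, not Darshan's API). rdtscp reads the CPU time-stamp
 * counter directly, avoiding the vDSO/syscall path of clock_gettime(). */
static inline uint64_t sketch_rdtscp_ticks(void)
{
    unsigned int aux; /* receives the IA32_TSC_AUX value; unused here */
    return __rdtscp(&aux);
}

/* Converting ticks to seconds requires the nominal TSC frequency,
 * which would have to be determined per platform (assumed given here). */
static inline double sketch_ticks_to_sec(uint64_t ticks, double tsc_hz)
{
    return (double)ticks / tsc_hz;
}
```

This only makes sense on x86 parts with an invariant TSC, which is presumably why the PR makes it optional rather than the default.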

The following example shows the impact on a degenerate test case on Theta, with 50 million I/O operations of each type to tmpfs on a single process:

## NO DARSHAN static:
50000000 fprintf()s of size 1 each in 12.465365 seconds (4011113.939493 ops/s)
50000000 fscanf()s of size 1 each in 29.004269 seconds (1723884.156082 ops/s)
50000000 fread()s of size 1 each in 3.130983 seconds (15969424.993880 ops/s)

## darshan@main, enabled, static:
50000000 fprintf()s of size 1 each in 215.936336 seconds (231549.728577 ops/s)
50000000 fscanf()s of size 1 each in 263.471802 seconds (189773.628985 ops/s)
50000000 fread()s of size 1 each in 200.895207 seconds (248885.977447 ops/s)

## darshan@perf-opt, enabled, static:
50000000 fprintf()s of size 1 each in 27.554754 seconds (1814568.911264 ops/s)
50000000 fscanf()s of size 1 each in 52.301563 seconds (955994.373946 ops/s)
50000000 fread()s of size 1 each in 18.619044 seconds (2685422.507416 ops/s)

This PR should be rebased after the autotools changes are merged. It also needs additional testing, in particular to confirm that the rdtscp instruction produces counter values consistent with the default timer functions.

@carns force-pushed the carns/perf-opt branch 2 times, most recently from 7d950a4 to 18babb9 on July 9, 2021 at 14:52
@carns (Contributor, Author) commented on Jul 9, 2021

Rebased this morning to a) consolidate commits more sensibly and eliminate the reverted auxiliary timer library experiment, and b) sync up with current origin/main.

Will still need to rebase again after #381.

carns added 10 commits July 20, 2021 14:05
- will be used to guide some wrapper performance optimizations
- eliminate darshan-core.h and fold into darshan.h
- make darshan_core_wtime() and it's subroutine static inlines
- switch to clock_gettime() to retrieve time
- no malloc/free or system calls in critical regions
@carns (Contributor, Author) commented on Jul 20, 2021

Rebased to current origin/main (i.e., to reflect automake changes)

```diff
 /*********************************************************
  *       Wrappers for H5F functions of interest          *
  *********************************************************/

 #define H5F_PRE_RECORD() do { \
     HDF5_LOCK(); \
-    if(!darshan_core_disabled_instrumentation()) { \
+    if(!__darshan_disabled) { \
```
Contributor:
Note that this __darshan_disabled value is cached by the MAP_OR_FAIL macro, which runs before the I/O routine; we obviously aren't holding locks across that window. It's possible the following code paths get triggered while Darshan is in the process of shutting down, which probably requires some more care.

Some potential issues stand out:

  1. It's possible internal module data structures (e.g., the hdf5_file_runtime structure here) are still around, opening the opportunity for a wrapper to update them while Darshan is shutting down. This would happen if a Darshan I/O wrapper routine executed between the module's output() and cleanup() routines called by darshan-core (the former sets which module records to output and the latter cleans up module data structures).
  2. It's also possible that, if Darshan has called cleanup() on a module and an I/O wrapper then triggers for that module, it may re-allocate module data structures.

These are for sure corner cases, because each would require Darshan to have started shutdown while a wrapper is between MAP_OR_FAIL (where __darshan_disabled is set) and PRE_RECORD (where __darshan_disabled is checked).

darshan-core itself handles this gracefully; these issues are contained to the modules themselves. Issue 1 could lead to modification of records while Darshan is writing them, while Issue 2 likely just results in a memory leak.
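The window being discussed can be sketched as follows. The names and structure are illustrative only, not Darshan code, and the module lock around the record update is elided:

```c
#include <stdatomic.h>

/* The disabled flag is sampled once early (as MAP_OR_FAIL does), so the
 * later PRE_RECORD-style check can act on a stale value if shutdown
 * begins in the unlocked window in between. */
static atomic_int sketch_disabled;

/* Returns 1 if the wrapper would update module records, 0 otherwise. */
int sketch_wrapper(void)
{
    int disabled = atomic_load(&sketch_disabled); /* cached early */

    /* ... the real I/O call runs here, unlocked; another thread may
     * start Darshan shutdown during this window ... */

    if(!disabled) {
        /* PRE_RECORD analogue: with a stale 'disabled', this record
         * update can run after shutdown has started */
        return 1;
    }
    return 0;
}
```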

Contributor (Author):

Good catch. I don't think 1) could happen, because the hdf5_file_runtime structure is separately protected by HDF5_LOCK(). 2) looks possible; there isn't anything preventing a wrapper from re-registering the module and possibly getting into risky code paths while the core is shutting down.

I think we can modify hdf5_file_runtime_initialize() (and its equivalent in the other modules) to disallow re-initialization, at least until we have a use case for that to happen intentionally. That _initialize() call is likewise already protected by HDF5_LOCK(), so it is OK if it needs to check a protected flag.

There isn't anything to prevent the module from potentially calling the *_record_ref* functions at inappropriate times, but it looks like those are already appropriately serialized internally.

I'll adjust the PR to keep the modules from re-initializing, and also add some comments clarifying who's responsible for protecting which data structures.

Contributor (Author):

Actually, maybe a slightly cleaner variation of the above would be to modify darshan_core_register_module() so that it can report failure, and have all of the modules check for success before allocating a module runtime structure.

The safety check to prevent re-initialization after shutdown has started could then be put in darshan_core_register_module() so it's not duplicated in N modules.
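That variation might look roughly like the following sketch, with hypothetical names standing in for darshan_core_register_module() and the core's shutdown state:

```c
/* Hypothetical sketch of the proposed variation: registration reports
 * failure once shutdown has begun, so modules can skip allocating
 * their runtime structures. Names are illustrative, not Darshan's. */
static int sketch_core_active = 1; /* cleared when core shutdown starts */

int sketch_register_module(int module_id)
{
    (void)module_id;
    if(!sketch_core_active)
        return -1; /* caller must not allocate module state */
    /* ... normal registration path ... */
    return 0;
}
```

Each module would then check the return value before allocating its runtime structure, keeping the safety check in one place instead of duplicated across N modules.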

Contributor (Author):

After further discussion: the modules are already protected against re-registration, because the core code sets the __darshan_core variable to NULL (preventing further registrations) before invoking any module output routines.

There was, however, still a possibility of modules updating counters in between the module _output() and _cleanup() routines, which could possibly cause inconsistent results in the log. This is fixed now in 9c86c8d.

@shanedsnyder (Contributor) commented:
I found an existing bug related to the timer code in darshan_core_init(): we should be using the _absolute() variant of wtime before Darshan is initialized. That wasn't introduced by anything you did; I just noticed it while reviewing and confirmed it was leading to bogus values when enabling Darshan's internal timing (the DARSHAN_INTERNAL_TIMING env var).

That's fixed via 1ce9145.
