
WIP: modify PyDarshan plotting routines to accept records as input #944

Open · wants to merge 11 commits into base: main

Conversation

Yanlilyu (Author):

  • This branch modifies the existing plotting routines to take records as input rather than report objects. Three plotting files are changed: plot_opcounts.py, plot_access_histogram.py, and plot_common_access_table.py.

  • plot_opcounts() supports record input for the POSIX, MPIIO, and STDIO modules, but does not yet support the H5F, H5D, PNETCDF_FILE, and PNETCDF_VAR modules. Supporting those modules would require revising the code that interfaces with the C library. For now, attempting to generate plots for the H5F, H5D, PNETCDF_FILE, or PNETCDF_VAR modules raises an error. Code related to “agg_ioops” has been deleted.

  • plot_access_histogram() supports record input for POSIX and MPIIO. It does not yet support the HDF5 and PNETCDF modules; we will add those later. Code related to “mod_agg_iohist” has been deleted.

  • plot_common_access_table() accepts record input for the POSIX and MPIIO modules.

  • We added xfails for some tests using HDF5/PNETCDF and updated the xtick and label tests in test_plot_exp_common.py.

@tylerjereddy (Collaborator) left a comment:
I was able to reproduce many of the test failures seen in CI. Something like this may get you a little farther along, but even after roughly applying it I'm still seeing a bunch of unbound variables.

diff --git a/darshan-util/pydarshan/darshan/cli/summary.py b/darshan-util/pydarshan/darshan/cli/summary.py
index d70a6104..370f3247 100644
--- a/darshan-util/pydarshan/darshan/cli/summary.py
+++ b/darshan-util/pydarshan/darshan/cli/summary.py
@@ -561,6 +561,7 @@ class ReportData:
                 pass
 
             if mod in ["POSIX", "MPI-IO", "H5D", "PNETCDF_VAR"]:
+                records = self.report.records[mod].to_df()
                 access_hist_description = (
                     "Histogram of read and write access sizes. The specific values "
                     "of the most frequently occurring access sizes can be found in "
@@ -570,7 +571,7 @@ class ReportData:
                     section_title=sect_title,
                     fig_title="Access Sizes",
                     fig_func=plot_access_histogram,
-                    fig_args=dict(report=self.report, mod=mod),
+                    fig_args=dict(record=records, mod=mod),
                     fig_description=access_hist_description,
                     fig_width=350,
                     fig_grid_area="acc_hist"
@@ -587,7 +588,7 @@ class ReportData:
                     section_title=sect_title,
                     fig_title="Common Access Sizes",
                     fig_func=plot_common_access_table.plot_common_access_table,
-                    fig_args=dict(report=self.report, mod=mod),
+                    fig_args=dict(record=records, mod=mod),
                     fig_description=com_acc_tbl_description,
                     fig_width=350,
                     fig_grid_area="common_acc_tbl"
@@ -600,7 +601,7 @@ class ReportData:
                     section_title=sect_title,
                     fig_title="Operation Counts",
                     fig_func=plot_opcounts,
-                    fig_args=dict(report=self.report, mod=mod),
+                    fig_args=dict(record=records, mod=mod),
                     fig_description="Histogram of I/O operation frequency.",
                     fig_width=350,
                     fig_grid_area="op_counts"

"""
Plots a histogram of access sizes for specified module.

Args:
-    report (darshan.DarshanReport): report to generate plot from
+    record: record to generate plot from
Collaborator:

We may need to be a bit more specific here (and in the other similarly-modified plotting functions)--I believe we'd want to support not just a singular record, and not just many records, but rather a dictionary that contains both the counter and fcounter record data structures, no?

Author:

Yes, it's a dictionary with two separate DataFrames: 'counters' (for integer record data) and 'fcounters' (for floating-point record data).
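For concreteness, a minimal sketch of that structure (plain dicts of lists stand in here for the pandas DataFrames that PyDarshan's `to_df()` actually produces, and the counter names are only illustrative):

```python
# Illustrative sketch only: in PyDarshan, 'counters' and 'fcounters' hold
# pandas DataFrames produced by report.records[mod].to_df(); plain
# dict-of-lists is used here just to show the shape of the structure.
record = {
    "counters": {"POSIX_READS": [42], "POSIX_WRITES": [7]},       # integer data
    "fcounters": {"POSIX_F_READ_TIME": [0.5]},                    # float data
}

# A plotting routine would then index into it like:
reads = record["counters"]["POSIX_READS"][0]
```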

"""
Creates a table containing the most
common access sizes and their counts.

Parameters
----------
report: a ``darshan.DarshanReport``.
record: a dict.
Collaborator:

I think we can be a bit more specific than "just some dictionary" here; it should have counter and fcounter dataframes nested within, I suspect. To be honest, we may want to use our own custom class name for this, just for clarity (it could still be interchangeable with the original plain dict).

Author:

In cffi_backend.py, rec_dict is used for the accumulate_records function. Maybe we could use a name like acc_rec_dict?

Collaborator:

Yeah, maybe? I was just thinking we kind of have a custom object that subclasses dict or something like that, and then it is clear to folks what the data structure should be since we define it ourselves somewhere. Probably a minor point that you could easily change later on though, so probably not a huge priority compared to dealing with all the testing/code changes.

class acc_rec_dict(dict):
    # stores darshan counter and fcounter dataframes
    pass

Or maybe there's a better idea?

I also wonder if it might make sense to provide some way to automatically produce such an object from multiple logs, e.g. in a convenience function like produce_acc_rec_dict(list_of_logs). But that may also be something we could think about later on, if you and Shane haven't already.

Contributor:

I'm on board with coming up with a more specific class for a "record" type, even if it's just a dict subclass. I don't know if the DarshanRecordCollection thing we added is somehow useful for this or if that's meant to be hidden from users -- probably a better question for @jakobluettgau .

I think we're on the same page, but I'll clarify just to double check:

  • These routines should take a record-like object that has counters/fcounters data frames internally.
  • This record structure could correspond to a single file (i.e., a single file record from the set of file records a module collects) or it could correspond to arbitrary accumulations of record sets (i.e., all records for a given module in a single log, all records for a given module across multiple logs).

> I also wonder if it might make sense to provide some way to automatically produce such an object from multiple logs, e.g. in a convenience function like produce_acc_rec_dict(list_of_logs). But that may also be something we could think about later on, if you and Shane haven't already.

Maybe? At the very least, it might make sense to make the existing API (accumulate_records() in cffi backend) a little more flexible. It currently takes in a set of records and accumulates the entire set (with this set of records produced by the to_df() routine). But, to make this work across multiple logs we'd have to find a way to munge the record sets from each log into a single structure... Instead, it might make sense to provide an API that users can call repeatedly with new record sets, and then an additional routine to pull out the final result (derived metrics + summary record tuple).
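The call-repeatedly-then-finalize shape described above could look roughly like this (all names here are hypothetical; the real accumulation is done by accumulate_records() in the cffi backend, which returns derived metrics plus a summary record, not this toy class):

```python
# Hypothetical sketch of an incremental accumulation API: callers feed in
# record sets one at a time (e.g., one call per log), then pull the final
# summary out with a separate routine. Pure-Python illustration only.
class RecordAccumulator:
    def __init__(self):
        self.totals = {}

    def add(self, counters):
        # counters: mapping of counter name -> list of per-record values
        for name, vals in counters.items():
            self.totals[name] = self.totals.get(name, 0) + sum(vals)

    def summary(self):
        # the final "summary record": accumulated counter totals
        return dict(self.totals)

acc = RecordAccumulator()
acc.add({"POSIX_READS": [3, 4]})   # record set from log 1
acc.add({"POSIX_READS": [5]})      # record set from log 2
summary = acc.summary()
```

This avoids having to munge the record sets from multiple logs into a single structure up front, since each log's records are folded in as they arrive.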

Collaborator:

yeah, we're on the same page; for my other comment below I imagine you'll just tag on to the C accum infrastructure for now, but I felt compelled to whine :)

@@ -83,7 +83,7 @@ def get_access_count_df(mod_df: Any, mod: str) -> Any:
 Parameters
 ----------
 mod_df: "counters" dataframe for the input
-    module `mod` from a ``darshan.DarshanReport``.
+    module `mod` from a record.
Collaborator:

Not just a singular record, I don't think; probably a dictionary that has counter and fcounter dataframes within.

"""
Generates a bar chart summary for operation counts.

Parameters
----------

-    report (DarshanReport): darshan report object to plot
+    record: darshan record object to plot
Collaborator:

similar comment here about data structure specificity

@@ -143,12 +144,16 @@
     ],
 )
 def test_xticks_and_labels(log_path, func, expected_xticklabels, mod):
+    if mod in ['H5F', 'H5D', 'PNETCDF_FILE', 'PNETCDF_VAR']:
+        pytest.xfail(reason="module not supported")
Collaborator:

Hmm, one thing I've mentioned to @shanedsnyder @carns is that I'd not like to see more capabilities added on the C analysis side--can we add the accumulators for these at the Python/pandas level moving forward? While it would be substantially easier to add on the C-side short-term I suspect, it would also dig us deeper in that side of things vs. the ultimate direction of interest (using common Python ecosystem tools for new capabilities/analyses).

@tylerjereddy (Collaborator), Jun 27, 2023:

Also, do we not already have the expected values for some logs here, given that it was already working before? Wouldn't that make it easier to test newly-developed functionality? As far as I can tell, at least from the standpoint of the subset of plots we are working on now, we don't actually need any of the super-complicated C aggregator stuff anyway, so that seems like a decent starting point for the Python side agg stuff (or improving the old Python stuff, so that it is a suitable template for future expansion vs. what is there now).

@tylerjereddy (Collaborator):

I suppose one other suggestion I might have on the testing side is to compose expected "accumulated" results from the individual expected values, e.g. in darshan/tests/test_plot_exp_common.py. For example, read in STDIO from ior_hdf5_example.darshan and dxt.darshan, and require plot_opcounts to produce bar heights that are the sum of the two current expected values in test_bar_heights.
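That composition could be sketched roughly as below (the helper name and the height values are invented for illustration; the real per-log expected values live in the test_bar_heights parametrization):

```python
# Hypothetical sketch of the testing idea above: the expected bar heights for
# an accumulation of two logs are just the element-wise sums of the expected
# heights already recorded for each log individually.
def combined_expected(heights_a, heights_b):
    return [a + b for a, b in zip(heights_a, heights_b)]

# e.g., STDIO opcounts from two logs (values invented for illustration):
expected = combined_expected([3, 1, 0, 0], [2, 2, 1, 0])
```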

agg['WRITE_1G_PLUS']
]

agg=record['counters']
Contributor:

I would not call this variable agg, as it implies it is some sort of aggregated value. Maybe just counters instead?

This variable was originally called agg because these plotting routines were aggregating all records in a given report object, which makes sense. Now, we are just passing in a single "record" to this routine, which could correspond to a single file record from a Darshan log or to an accumulated set of records (i.e., a summary record). In any case, any sort of aggregation now takes place outside of the plotting routine so we should just view the input as a generic record whose values are plotted directly.

Author:

I will change it!

agg=record['counters']
if mod == 'POSIX':
    read_vals = [
        agg['POSIX_SIZE_READ_0_100'][0],
Contributor:

Notice how similar the counter names are here for POSIX and MPI-IO modules. They only differ in the prefix module name that is provided as input to this routine. I would get rid of this module-specific if-else logic and use string manipulation to get the desired module-specific counter names instead. Will be visually much cleaner since it's half as much code.
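The prefixing idea might look something like this (the bin suffixes are assumptions based on Darshan's counter naming convention, e.g. POSIX_SIZE_READ_0_100; the actual MPI-IO size counters may carry an extra token, so the template could need per-module adjustment):

```python
# Sketch: build module-specific size-counter names from one template instead
# of duplicating per-module if/else branches. Bin suffixes are assumed from
# Darshan's naming convention and may not match every module exactly.
SIZE_BINS = ["0_100", "100_1K", "1K_10K", "10K_100K", "100K_1M",
             "1M_4M", "4M_10M", "10M_100M", "100M_1G", "1G_PLUS"]

def size_counter_names(mod, op):
    # mod is e.g. "POSIX"; op is "READ" or "WRITE"
    return [f"{mod}_SIZE_{op}_{b}" for b in SIZE_BINS]

# The plotting routine could then do something like:
#   read_vals = [counters[name][0] for name in size_counter_names(mod, "READ")]
```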

Author:

yes, it will be cleaner.
