WIP: modify PyDarshan plotting routines to accept records as input #944

Open · wants to merge 11 commits into base: main
@@ -18,12 +18,12 @@ def autolabel(ax, rects):
rotation=45,
)

def plot_access_histogram(report, mod, ax=None):
def plot_access_histogram(record, mod, ax=None):
"""
Plots a histogram of access sizes for specified module.

Args:
report (darshan.DarshanReport): report to generate plot from
record: record to generate plot from

Collaborator:
We may need to be a bit more specific here (and in the other similarly-modified plotting functions)--I believe we'd want to support not just a singular record, and not just many records, but rather a dictionary that contains both the counter and fcounter record data structures, no?

Author:
Yes, it's a dictionary with two separate DataFrames: 'counters' (for integer record data) and 'fcounters' (for floating-point record data).
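
For concreteness, a minimal sketch of the record structure being discussed (the log path is a placeholder, the import path is assumed for illustration, and to_df(attach=None) is the same call the old code elsewhere in this diff used internally):

import darshan
# Import path assumed; plot_access_histogram is the routine modified in this diff.
from darshan.experimental.plots.plot_access_histogram import plot_access_histogram

# "example.darshan" is a placeholder log path.
report = darshan.DarshanReport("example.darshan", read_all=True)

# to_df(attach=None) returns a dict with 'counters' and 'fcounters'
# DataFrames, which is the record structure the modified routines accept.
record = report.records["POSIX"].to_df(attach=None)

plot_access_histogram(record, mod="POSIX")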

mod (str): mod-string for which to generate access_histogram

"""
@@ -33,44 +33,64 @@ def plot_access_histogram(report, mod, ax=None):
else:
fig = None

# TODO: change to report.summary
if 'mod_agg_iohist' in dir(report):
report.mod_agg_iohist(mod)
else:
print("Cannot create summary, mod_agg_iohist aggregator is not registered with the report class.")

# defaults
labels = ['0-100', '101-1K', '1K-10K', '10K-100K', '100K-1M', '1M-4M', '4M-10M', '10M-100M', '100M-1G', '1G+']

agg = report.summary['agg_iohist'][mod]
# TODO: can simplify the read/write vals below after
# support for python 3.6 is dropped
read_vals = [
agg['READ_0_100'],
agg['READ_100_1K'],
agg['READ_1K_10K'],
agg['READ_10K_100K'],
agg['READ_100K_1M'],
agg['READ_1M_4M'],
agg['READ_4M_10M'],
agg['READ_10M_100M'],
agg['READ_100M_1G'],
agg['READ_1G_PLUS']
]

write_vals = [
agg['WRITE_0_100'],
agg['WRITE_100_1K'],
agg['WRITE_1K_10K'],
agg['WRITE_10K_100K'],
agg['WRITE_100K_1M'],
agg['WRITE_1M_4M'],
agg['WRITE_4M_10M'],
agg['WRITE_10M_100M'],
agg['WRITE_100M_1G'],
agg['WRITE_1G_PLUS']
]

agg=record['counters']

Contributor:
I would not call this variable agg, as it implies it is some sort of aggregated value. Maybe just counters instead?

This variable was originally called agg because these plotting routines were aggregating all records in a given report object, which makes sense. Now, we are just passing in a single "record" to this routine, which could correspond to a single file record from a Darshan log or to an accumulated set of records (i.e., a summary record). In any case, any sort of aggregation now takes place outside of the plotting routine so we should just view the input as a generic record whose values are plotted directly.

Author:
I will change it!

if mod == 'POSIX':
read_vals = [
agg['POSIX_SIZE_READ_0_100'][0],

Contributor:
Notice how similar the counter names are here for POSIX and MPI-IO modules. They only differ in the prefix module name that is provided as input to this routine. I would get rid of this module-specific if-else logic and use string manipulation to get the desired module-specific counter names instead. Will be visually much cleaner since it's half as much code.
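
A rough sketch of that suggestion, using the counter names visible in this diff (the exact prefix/infix handling is an assumption, and only POSIX and MPI-IO are covered, mirroring the if/else it would replace):

prefix = "POSIX_SIZE" if mod == "POSIX" else "MPIIO_SIZE"
# MPI-IO counters carry an extra "_AGG" piece, e.g. MPIIO_SIZE_READ_AGG_0_100
infix = "" if mod == "POSIX" else "_AGG"
bins = ["0_100", "100_1K", "1K_10K", "10K_100K", "100K_1M",
        "1M_4M", "4M_10M", "10M_100M", "100M_1G", "1G_PLUS"]
read_vals = [agg[f"{prefix}_READ{infix}_{b}"][0] for b in bins]
write_vals = [agg[f"{prefix}_WRITE{infix}_{b}"][0] for b in bins]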

Author:
yes, it will be cleaner.

agg['POSIX_SIZE_READ_100_1K'][0],
agg['POSIX_SIZE_READ_1K_10K'][0],
agg['POSIX_SIZE_READ_10K_100K'][0],
agg['POSIX_SIZE_READ_100K_1M'][0],
agg['POSIX_SIZE_READ_1M_4M'][0],
agg['POSIX_SIZE_READ_4M_10M'][0],
agg['POSIX_SIZE_READ_10M_100M'][0],
agg['POSIX_SIZE_READ_100M_1G'][0],
agg['POSIX_SIZE_READ_1G_PLUS'][0]
]

write_vals = [
agg['POSIX_SIZE_WRITE_0_100'][0],
agg['POSIX_SIZE_WRITE_100_1K'][0],
agg['POSIX_SIZE_WRITE_1K_10K'][0],
agg['POSIX_SIZE_WRITE_10K_100K'][0],
agg['POSIX_SIZE_WRITE_100K_1M'][0],
agg['POSIX_SIZE_WRITE_1M_4M'][0],
agg['POSIX_SIZE_WRITE_4M_10M'][0],
agg['POSIX_SIZE_WRITE_10M_100M'][0],
agg['POSIX_SIZE_WRITE_100M_1G'][0],
agg['POSIX_SIZE_WRITE_1G_PLUS'][0]
]
elif mod == 'MPI-IO':
read_vals = [
agg['MPIIO_SIZE_READ_AGG_0_100'][0],
agg['MPIIO_SIZE_READ_AGG_100_1K'][0],
agg['MPIIO_SIZE_READ_AGG_1K_10K'][0],
agg['MPIIO_SIZE_READ_AGG_10K_100K'][0],
agg['MPIIO_SIZE_READ_AGG_100K_1M'][0],
agg['MPIIO_SIZE_READ_AGG_1M_4M'][0],
agg['MPIIO_SIZE_READ_AGG_4M_10M'][0],
agg['MPIIO_SIZE_READ_AGG_10M_100M'][0],
agg['MPIIO_SIZE_READ_AGG_100M_1G'][0],
agg['MPIIO_SIZE_READ_AGG_1G_PLUS'][0]
]

write_vals = [
agg['MPIIO_SIZE_WRITE_AGG_0_100'][0],
agg['MPIIO_SIZE_WRITE_AGG_100_1K'][0],
agg['MPIIO_SIZE_WRITE_AGG_1K_10K'][0],
agg['MPIIO_SIZE_WRITE_AGG_10K_100K'][0],
agg['MPIIO_SIZE_WRITE_AGG_100K_1M'][0],
agg['MPIIO_SIZE_WRITE_AGG_1M_4M'][0],
agg['MPIIO_SIZE_WRITE_AGG_4M_10M'][0],
agg['MPIIO_SIZE_WRITE_AGG_10M_100M'][0],
agg['MPIIO_SIZE_WRITE_AGG_100M_1G'][0],
agg['MPIIO_SIZE_WRITE_AGG_1G_PLUS'][0]
]
#TODO: add support for HDF5/PnetCDF modules
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars

@@ -83,7 +83,7 @@ def get_access_count_df(mod_df: Any, mod: str) -> Any:
Parameters
----------
mod_df: "counters" dataframe for the input
module `mod` from a ``darshan.DarshanReport``.
module `mod` from a record.

Collaborator:
Not just a singular record, I don't think; probably a dictionary that has counter and fcounter dataframes within.


mod: the module to obtain the common accesses
table for (i.e "POSIX", "MPI-IO", "H5D").
@@ -102,7 +102,6 @@ def get_access_count_df(mod_df: Any, mod: str) -> Any:
df = mod_df.filter(filter_keys)
df = collapse_access_cols(df=df, col_name=col_name)
df_list.append(df)

return pd.concat(df_list, axis=1)


@@ -122,14 +121,14 @@ def __init__(self, df: Any, **kwargs):
self.html = self.df.to_html(**kwargs)


def plot_common_access_table(report: darshan.DarshanReport, mod: str, n_rows: int = 4) -> DarshanReportTable:
def plot_common_access_table(record: dict, mod: str, n_rows: int = 4) -> DarshanReportTable:
"""
Creates a table containing the most
common access sizes and their counts.

Parameters
----------
report: a ``darshan.DarshanReport``.
record: a dict.

Collaborator:
I think we can be a bit more specific than "just some dictionary" here; it should have counter and fcounter dataframes nested within, I suspect. To be honest, we may want to use our own custom class name for this just for clarity (it could still be interchangeable with the original plain dict).

Author:
In cffi_backend.py, rec_dict is used for the accumulate_records function. maybe we could use a name like acc_rec_dict?

Collaborator:
Yeah, maybe? I was just thinking we could have a custom object that subclasses dict or something like that, so it's clear to folks what the data structure should be since we define it ourselves somewhere. Probably a minor point that you could easily change later on, though, so not a huge priority compared to dealing with all the testing/code changes.

class acc_rec_dict(dict):
    # stores darshan counter and fcounter dataframes
    pass

Or maybe there's a better idea?

I also wonder if it might make sense to provide a way to, e.g., automatically produce such an object from multiple logs via a convenience function like produce_acc_rec_dict(list_of_logs). But that may also be something we could think about later on, if you and Shane haven't already.
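
Purely as an interface sketch (the name produce_acc_rec_dict and the naive pd.concat are placeholders; a real implementation would presumably go through the accumulate_records() machinery instead):

import darshan
import pandas as pd

def produce_acc_rec_dict(list_of_logs, mod="POSIX"):
    # Hypothetical helper: build a counters/fcounters dict from several logs.
    counters, fcounters = [], []
    for path in list_of_logs:
        report = darshan.DarshanReport(path, read_all=True)
        rec = report.records[mod].to_df(attach=None)
        counters.append(rec["counters"])
        fcounters.append(rec["fcounters"])
    # Naive stand-in for real cross-log accumulation.
    return {"counters": pd.concat(counters, ignore_index=True),
            "fcounters": pd.concat(fcounters, ignore_index=True)}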

Contributor:
I'm on board with coming up with a more specific class for a "record" type, even if it's just a dict subclass. I don't know if the DarshanRecordCollection thing we added is somehow useful for this or if that's meant to be hidden from users -- probably a better question for @jakobluettgau.

I think we're on the same page, but I'll clarify just to double check:

  • These routines should take a record-like object that has counters/fcounters data frames internally.
  • This record structure could correspond to a single file (i.e., a single file record from the set of file records a module collects) or it could correspond to arbitrary accumulations of record sets (i.e., all records for a given module in a single log, all records for a given module across multiple logs).

> I also wonder if it might make sense to provide a way to, e.g., automatically produce such an object from multiple logs via a convenience function like produce_acc_rec_dict(list_of_logs). But that may also be something we could think about later on, if you and Shane haven't already.

Maybe? At the very least, it might make sense to make the existing API (accumulate_records() in cffi backend) a little more flexible. It currently takes in a set of records and accumulates the entire set (with this set of records produced by the to_df() routine). But, to make this work across multiple logs we'd have to find a way to munge the record sets from each log into a single structure... Instead, it might make sense to provide an API that users can call repeatedly with new record sets, and then an additional routine to pull out the final result (derived metrics + summary record tuple).
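
A very rough sketch of that repeated-call shape (all names here are hypothetical, not existing PyDarshan API):

class RecordAccumulator:
    # Hypothetical: accumulate record sets from one or more logs for a module.
    def __init__(self, mod):
        self.mod = mod
        self._record_sets = []

    def add(self, rec_dict):
        # rec_dict: {'counters': DataFrame, 'fcounters': DataFrame},
        # as produced by to_df() for a single log.
        self._record_sets.append(rec_dict)

    def result(self):
        # Would defer to the C accumulator infrastructure (accumulate_records)
        # and return the (derived metrics, summary record) tuple described above.
        raise NotImplementedError

The plotting routines would then consume the summary record pulled out of result() directly.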

Collaborator:
yeah, we're on the same page; for my other comment below I imagine you'll just tag on to the C accum infrastructure for now, but I felt compelled to whine :)


mod: the module to obtain the common access size
table for (i.e "POSIX", "MPI-IO", "H5D").
@@ -145,8 +144,7 @@ def plot_common_access_table(report: darshan.DarshanReport, mod: str, n_rows: in
the `df` or `html` attributes, respectively.

"""
mod_df = report.records[mod].to_df(attach=None)["counters"]

mod_df=record['counters']
if mod == "MPI-IO":
mod = "MPIIO"
