# DarshanUtils for Python

This notebook gives an overview of the features provided by the Python bindings for DarshanUtils.

By default, all records, metadata, available modules, and name records are loaded when opening a Darshan log:

In [None]:
import darshan

report = darshan.DarshanReport("example_logs/example.darshan", read_all=True)  # Default behavior
report.info()

In [None]:
report.modules

A few of the internal data structures, briefly explained:

In [None]:
# report.metadata         # dictionary with raw metadata from darshan log
# report.modules          # dictionary with raw module info from darshan log (need: technical, module idx)
# report.name_records     # dictionary for resolving name records: id -> path/name
# report.records          # per module "dataframes"/dictionaries holding loaded records
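
As a minimal sketch of how these structures relate, the snippet below uses plain Python containers as stand-ins for `report.name_records` and for the records of one module. The ids, paths, and the `resolve_name` helper are invented for illustration and are not part of pydarshan.

```python
# Stand-in for report.name_records: id -> path/name (values are invented).
name_records = {
    101: "/tmp/output.h5",
    102: "<STDOUT>",
}

# Stand-in for the records of one module; each record carries the id of its name record.
records = [{"id": 101, "rank": 0}, {"id": 102, "rank": 0}, {"id": 999, "rank": 1}]


def resolve_name(name_records, record_id):
    """Return the path/name for a record id, or a placeholder if the id is unknown."""
    return name_records.get(record_id, "<unknown>")


resolved = [resolve_name(name_records, rec["id"]) for rec in records]
print(resolved)
```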

The Darshan report holds a variety of namespaces for report-related data. At the moment, all of them are also referenced in `report.data`, but relying on this internal organization of the report object is discouraged, as it may change once the API has stabilized. Currently, `report.data` references the following information:

In [None]:
report.data.keys()

In [None]:
report.mod_read_all_records('POSIX')

In [None]:
report.mod_read_all_records('STDIO')

In [None]:
report.update_name_records()
report.info()

In [None]:
# visualization helper used by different examples in the remainder of this notebook
from IPython.display import display, HTML
# usage: display(obj)

### Record Formats and Selectively Loading Records

For memory-efficient analysis, it is possible to suppress the automatic loading of records. This is useful, for example, when an analysis considers only the records of a particular layer/module.

In [None]:
import darshan
report = darshan.DarshanReport("example_logs/example.darshan", read_all=False, lookup_name_records=True) # Loads no records!

In [None]:
# expected to fail, as no records were loaded
try:
    print(len(report.records['STDIO']), "records loaded for STDIO.")
except Exception:
    print("No STDIO records loaded for this report yet.")

Additional records can then be loaded selectively, for example on a per-module basis:
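
A load-on-demand guard might look like the following sketch. It runs against a tiny stub so that it is self-contained; `StubReport` and `ensure_loaded` are hypothetical names, and only `records` and `mod_read_all_records` mirror what this notebook uses on a real report.

```python
class StubReport:
    """Tiny stand-in for a report object; NOT the pydarshan implementation."""

    def __init__(self):
        self.records = {}

    def mod_read_all_records(self, mod):
        # Pretend that loading a module yields two records.
        self.records[mod] = [{"id": 1}, {"id": 2}]


def ensure_loaded(report, mod):
    """Load records for `mod` only if they are not in memory yet (hypothetical helper)."""
    if mod not in report.records:
        report.mod_read_all_records(mod)
    return report.records[mod]


report_stub = StubReport()
stdio = ensure_loaded(report_stub, "STDIO")
print(len(stdio), "records loaded for STDIO")
print("POSIX loaded:", "POSIX" in report_stub.records)
```

With a real report, the same helper would be called as `ensure_loaded(report, "STDIO")`, assuming `report.records` supports the `in` test used above.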

### Exporting Data

#### dtype: pandas

In [None]:
report.mod_read_all_records("STDIO")

In [None]:
display(report.records['STDIO'].to_df()['counters'])
display(report.records['STDIO'].to_df()['fcounters'])

In [None]:
# By default, exporting to pandas dataframes with .to_df() attaches id and rank information.
# For aggregations, this can be suppressed by passing attach=None, so that pandas plotting
# produces sensible ranges directly. Here, only the rank is attached.
report.records['STDIO'].to_df(attach=['rank'])['fcounters'].plot.box(vert=False)

In [None]:
report.records['STDIO'].to_df(attach=['rank'])['counters'].plot.box(vert=False)

#### dtype: dict

In [None]:
report.mod_read_all_records("STDIO")

In [None]:
report.records['STDIO'][0].to_dict()

#### dtype: numpy

In [None]:
report.mod_read_all_records("STDIO")
report.records['STDIO'][0].to_numpy()

In [None]:
print(type(report.records['STDIO'][0].to_numpy()[0]['counters']))
print(type(report.records['STDIO'][0].to_numpy()[0]['fcounters']))

#### The Darshan Log in Memory

Let's have a look at how calling `report.mod_read_all_records("STDIO")` changed the state of the log in memory.

In [None]:
# Compare to info line: "Loaded Records: {...}"
report.info()

When working with individual records, for example in a for loop, you would most likely care about the following instead:

In [None]:
print("Num records:", len(report.records['STDIO']))

# show first 10 records
for rec in report.records['STDIO'][0:10]:
    print(rec)
    # do something with the record
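
As a sketch of what "doing something with the record" could mean, the loop below sums one counter per rank. The records are hand-written stand-ins: their layout (`id`, `rank`, `counters`) and all values are assumptions made for this example, so check the output of `to_dict()` on your own log before relying on it.

```python
# Stand-in records; the layout (id, rank, counters) and the values are assumptions.
stdio_records = [
    {"id": 1, "rank": 0, "counters": {"STDIO_BYTES_WRITTEN": 120, "STDIO_BYTES_READ": 0}},
    {"id": 2, "rank": 0, "counters": {"STDIO_BYTES_WRITTEN": 30, "STDIO_BYTES_READ": 64}},
    {"id": 1, "rank": 1, "counters": {"STDIO_BYTES_WRITTEN": 50, "STDIO_BYTES_READ": 0}},
]

# Accumulate the bytes written by each rank across all of its records.
bytes_written_per_rank = {}
for rec in stdio_records:
    written = rec["counters"]["STDIO_BYTES_WRITTEN"]
    bytes_written_per_rank[rec["rank"]] = bytes_written_per_rank.get(rec["rank"], 0) + written

print(bytes_written_per_rank)
```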

### Aggregation and Filtering (Experimental)

Darshan log data is routinely aggregated for a quick overview. The report object offers a few methods to perform common aggregations.

Report aggregation and summarization remain **experimental** for now, mostly to allow the interfaces to stabilize. Experimental features can be switched on easily by invoking `darshan.enable_experimental()`:

In [None]:
import darshan
darshan.enable_experimental(verbose=True) # Enable verbosity, listing new functionality

In [None]:
# Example report, which counts records in log across modules 
report.name_records_summary()

### Chain operations like filtering and reductions
The filter and reduce operations return DarshanReports themselves, which makes it convenient to chain operations.

In [None]:
import pprint

import darshan
darshan.enable_experimental()

report = darshan.DarshanReport("example_logs/example.darshan", read_all=True)
report.name_records

In [None]:
# The original report for reference. Take note of the "Loaded Records" section
report.info()

In [None]:
# name_records may be filenames (or ids)
# Note how only records of the STDIO module remain
report.filter(name_records=['<STDIN>', '<STDOUT>', '<STDERR>']).info()

In [None]:
# name_records using an id
# Note how only one POSIX, one MPI-IO and one LUSTRE record remain
report.filter(name_records=[6301063301082038805]).info()

In [None]:
# reduce all after filtering
report.filter(pattern="*.hdf5").reduce().info()

In [None]:
# only preserve some
report.filter(name_records=[6301063301082038805]).reduce(mods=['POSIX', 'STDIO']).records
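
Chaining works because each operation hands back a new object of the same type. The toy class below illustrates that pattern only; it is not the pydarshan implementation. `ToyReport` is a hypothetical name, and its `reduce` merely drops modules, which is a simplification of what a real reduction would do.

```python
import fnmatch


class ToyReport:
    """Toy stand-in used to illustrate chaining; NOT the pydarshan implementation."""

    def __init__(self, records):
        # records: module name -> list of file names
        self.records = records

    def filter(self, pattern="*"):
        """Return a NEW ToyReport keeping only names that match a glob pattern."""
        kept = {
            mod: [name for name in names if fnmatch.fnmatch(name, pattern)]
            for mod, names in self.records.items()
        }
        # Drop modules that have no matching names left.
        return ToyReport({mod: names for mod, names in kept.items() if names})

    def reduce(self, mods=None):
        """Return a NEW ToyReport restricted to the given modules (all if None)."""
        return ToyReport(
            {mod: names for mod, names in self.records.items() if mods is None or mod in mods}
        )


toy = ToyReport({"POSIX": ["a.hdf5", "b.txt"], "STDIO": ["<STDOUT>"], "MPI-IO": ["a.hdf5"]})
chained = toy.filter(pattern="*.hdf5").reduce(mods=["POSIX"])
print(chained.records)
```

Because every step returns a fresh object, the original `toy` is left untouched and intermediate results can be inspected independently.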

In [None]:
# expected to fail
try:
    pprint.pprint(report.summary['agg_ioops'])
except Exception:
    print("IOOPS have not been aggregated for this report.")

In [None]:
report.read_all() 
report.summarize()

In [None]:
report.summary['agg_ioops']

Or, in a more fine-grained fashion:

In [None]:
report.mod_agg_iohist("MPI-IO")  # to create the histograms

In [None]:
report.agg_ioops()               # to create the combined operation type summary

### Report Algebra (Experimental)

Various operations are implemented to merge, combine, and manipulate log records. This is useful for analysis tasks, but can also be used to construct performance projections or extrapolations.

For convenience, some of Python's operators are overloaded where they intuitively match their mathematical counterparts. In particular, this enables combining objects of different types.
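
To illustrate the general idea of operator overloading, here is a toy container that supports `+` for merging and `*` for scaling. `ToyCounts` is a hypothetical class written for this notebook; it does not show how DarshanReport implements these operators.

```python
from collections import Counter


class ToyCounts:
    """Toy container of operation counts; NOT the pydarshan implementation."""

    def __init__(self, counts):
        self.counts = Counter(counts)

    def __add__(self, other):
        # Merging two logs: add counts key by key.
        if not isinstance(other, ToyCounts):
            return NotImplemented
        return ToyCounts(self.counts + other.counts)

    def __mul__(self, factor):
        # Scaling a log: multiply every count by a scalar.
        if not isinstance(factor, (int, float)):
            return NotImplemented
        return ToyCounts({op: n * factor for op, n in self.counts.items()})


a = ToyCounts({"read": 10, "write": 4})
b = ToyCounts({"read": 1, "open": 2})
merged = a + b   # dispatches to ToyCounts.__add__
scaled = a * 4   # dispatches to ToyCounts.__mul__
print(dict(merged.counts))
print(dict(scaled.counts))
```

Returning `NotImplemented` for unsupported operand types lets Python raise a regular `TypeError` instead of failing somewhere inside the method.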

In [None]:
import darshan
darshan.enable_experimental()

In [None]:
# merging records
from darshan.experimental.plots import plot_access_histogram
from darshan.experimental.plots import plot_opcounts

r1 = darshan.DarshanReport("example_logs/example.darshan", read_all=True, dtype='numpy')
r2 = darshan.DarshanReport("example_logs/example2.darshan", read_all=True, dtype='numpy')
rx = r1 + r2

for r in [r1, r2, rx]:
    plt = plot_opcounts(r)
    plt.show()

In [None]:
# multiply records with a scalar (think, four times the I/O load)
#r1 = darshan.DarshanReport("example.darshan", read_all=True)
#rx = r1 * 4
#plot_opcounts(rx)

In [None]:
# rebase via timedelta
#r1 = darshan.DarshanReport("example.darshan", read_all=True)
#dt = datetime.timedelta()
#rx = r1 + dt

## Plotting

In [None]:
import darshan
darshan.enable_experimental(verbose=False)

r3 = darshan.DarshanReport("example_logs/example.darshan", dtype='numpy')
r3.mod_read_all_records('POSIX')

from darshan.experimental.plots import plot_access_histogram
plot_access_histogram(r3, mod='POSIX')

In [None]:
import darshan
darshan.enable_experimental(verbose=False)

r3 = darshan.DarshanReport("example_logs/example.darshan", dtype='numpy')
r3.mod_read_all_records('MPI-IO')

from darshan.experimental.plots import plot_access_histogram
plot_access_histogram(r3, mod='MPI-IO')

In [None]:
import darshan
darshan.enable_experimental(verbose=False)

r3 = darshan.DarshanReport("example_logs/example.darshan", dtype='numpy')
r3.read_all()

from darshan.experimental.plots import plot_opcounts
plot_opcounts(r3, mod='POSIX')

### DXT Records

DXT records are also supported, and can be loaded individually on a per module basis as follows:


In [None]:
import darshan

report2 = darshan.DarshanReport("example_logs/dxt.darshan")
report2.info()

In [None]:
report2.records['DXT_POSIX'][0]._records[0].keys()

Sometimes it is easier to visualize or transform data to get an overview:

In [None]:
# load prepared transformations
# might require: pip install pillow
from darshan.experimental.transforms.dxt2png import segment, wallclock

report2.mod_read_all_dxt_records("DXT_POSIX", dtype="dict")  # need dict format for now
rec = report2.records['DXT_POSIX'][2]

In [None]:
segment(rec)

In [None]:
wallclock(rec)

In [None]:
from IPython.display import display, HTML

report2.mod_read_all_dxt_records("DXT_POSIX", dtype="pandas") 

print("Write Segments:")
display(report2.records['DXT_POSIX'][2]['write_segments'])
print("Read Segments:")
display(report2.records['DXT_POSIX'][2]['read_segments'])

Exercise left for the reader ;P 
Implement a custom aggregator/summary function and commit it as a contribution to pydarshan:

In [None]:
# Create file: <darshan-repo>/darshan-util/pydarshan/darshan/experimental/aggregators/dxt_summary.py
from darshan.report import *

def dxt_summary(self):
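    """
    Count records per name record, broken down by module.

    Args:
        self: report object providing `records` and `name_records`

    Return:
        dict: record id -> {'name': path/name, 'counts': {module: number of records}}
    """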

    counts = {}

    for mod, records in self.records.items():
        for rec in records:
            if rec['id'] not in counts:
                counts[rec['id']] = {'name': self.name_records[rec['id']], 'counts': {}}

            if mod not in counts[rec['id']]['counts']:
                counts[rec['id']]['counts'][mod] = 1
            else:
                counts[rec['id']]['counts'][mod] += 1

    return counts
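
The same counting logic can be tried without a log file. The sketch below is a compact variant built on `collections`, run against a stand-in object; `count_records_per_name`, the stub, and its ids and names are invented for illustration.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace


def count_records_per_name(report):
    """Count records per name record and module (compact variant of the cell above)."""
    counts = defaultdict(Counter)
    for mod, records in report.records.items():
        for rec in records:
            counts[rec["id"]][mod] += 1
    return {
        rec_id: {"name": report.name_records[rec_id], "counts": dict(per_mod)}
        for rec_id, per_mod in counts.items()
    }


# Stand-in report with invented ids and names.
stub = SimpleNamespace(
    name_records={1: "/tmp/a.dat", 2: "<STDOUT>"},
    records={"POSIX": [{"id": 1}, {"id": 1}], "STDIO": [{"id": 2}], "DXT_POSIX": [{"id": 1}]},
)
summary = count_records_per_name(stub)
print(summary)
```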


## Exporting Data for Use in Third-Party Analysis

Darshan logs may be used in contexts beyond our imagination. To make this as effortless as possible, a report can be exported to JSON.

In [None]:
import darshan
report = darshan.DarshanReport("example_logs/ior_hdf5_example.darshan", read_all=True)
report.to_json()
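
On the consuming side, the export can be read back with the standard library. The sketch below assumes that `to_json()` returns a JSON string; the `exported` payload is a hand-written stand-in, and its keys and values are invented rather than taken from a real log.

```python
import json

# Stand-in for the string returned by report.to_json(); the content is invented.
exported = '{"metadata": {"job": {"nprocs": 4}}, "records": {"STDIO": [{"id": 1, "rank": 0}]}}'

# Parse the JSON text into plain Python dictionaries and lists.
data = json.loads(exported)
print(sorted(data.keys()))
print(len(data["records"]["STDIO"]), "STDIO record(s)")
```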

## Error Handling?

Currently, two modes of error handling are being tried out, and both have their pros and cons.

Generally, errors should be exposed so that users can handle them. At the same time, simply skipping invalid load requests does little harm and greatly improves convenience.

A switch to enable or disable these guards could be added.

In [None]:
report = darshan.DarshanReport("example_logs/example.darshan")

In [None]:
report.mod_read_all_records("MOD_ABC") # Expect KeyError

In [None]:
report.mod_read_all_dxt_records("ABC") # Expect warning, but not exception
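
One possible shape for such a switch is sketched below. Everything in it is hypothetical: `load_module`, its `strict` flag, and the `AVAILABLE_MODULES` set are made up to illustrate the two modes and do not exist in pydarshan.

```python
import warnings

AVAILABLE_MODULES = {"POSIX", "STDIO"}  # invented for this sketch


def load_module(name, strict=True):
    """Hypothetical loader guard: raise in strict mode, warn and skip otherwise."""
    if name in AVAILABLE_MODULES:
        return f"loaded {name}"
    if strict:
        raise KeyError(name)
    warnings.warn(f"Skipping unknown module: {name}")
    return None


# Lenient mode: the invalid request is skipped and only a warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_module("MOD_ABC", strict=False)

print(result, len(caught))
```

Defaulting to `strict=True` keeps errors visible unless the caller explicitly opts into the more convenient behavior.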