# PyDarshan Data Layout and Understanding
This notebook walks through the PyDarshan data structures to
show how to access log data when building an analysis.

### Minimum Python Version
- 3.6 is the minimum version

In [None]:
import platform
print(platform.python_version())

#### Import darshan
- If this fails with a traceback, the most likely cause is that the corresponding libdarshan-util.so library could not be found
- Run export LD_LIBRARY_PATH=<path/to/darshan/lib> before starting the notebook

In [None]:
import darshan
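
If the import fails, it can help to confirm that one of the directories in LD_LIBRARY_PATH actually holds the library. The following is a minimal diagnostic sketch using only the standard library. The helper name `find_darshan_util` is hypothetical, and this is not necessarily how PyDarshan itself locates the library.

```python
import os

LIB_PREFIX = "libdarshan-util.so"  # library file name mentioned in the note above

def find_darshan_util(search_path):
    """Return the directories in an os.pathsep-separated path that hold the library."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        # A file name with a version suffix would also match the prefix.
        if any(name.startswith(LIB_PREFIX) for name in os.listdir(directory)):
            hits.append(directory)
    return hits

# Check the directories currently listed in LD_LIBRARY_PATH.
print(find_darshan_util(os.environ.get("LD_LIBRARY_PATH", "")))
```

An empty list suggests that LD_LIBRARY_PATH does not yet point at the Darshan lib directory.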

### Example log
- we load an example log to walk through the data structures

In [None]:
logfile="logs/shane_macsio_id29959_5-22-32552-7035573431850780836_1590156158.darshan"
report = darshan.DarshanReport(logfile, read_all=True)

### DarshanReport object
- This object contains many methods for loading and processing logs
- the default is to load all known log entries into the default data set
  - you can instead disable loading and load only the records from the modules you are interested in

In [None]:
dir(report)

### Log metadata summary
- The info() method prints an overview of the log to standard output
- useful for understanding what is in the log
- Loaded Records shows which modules are present and how many records are in each
- Name records indicates how many different "files" are in the log
  - some name records, such as <STDOUT>, are special and do not refer to an actual file

In [None]:
report.info()

In [None]:
import pprint

### Name Records
- this is a list of all the files the darshan log knows about
- Some "files" may not be actual files and are placeholders for their records, such as <STDOUT> or an HDF5 dataset
- The key is a hash which can be used to correlate the file across the different modules
  - for example, a file written with MPI-IO, where MPI-IO in turn uses POSIX, will have records
    in both the POSIX and MPI-IO modules.

In [None]:
pprint.pprint(report.name_records)
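
As a sketch of the correlation described above, the hash can be used to find files that have records in two modules. The data below is a hypothetical stand-in for `report.name_records` and for the per-module records, and `files_in_both` is an illustrative helper, not part of PyDarshan.

```python
# Hypothetical stand-ins: hash -> name, plus records from two modules.
name_records = {
    101: "/scratch/run/output.dat",
    202: "<STDOUT>",
    303: "/scratch/run/input.cfg",
}
posix_records = [{"id": 101, "rank": -1}, {"id": 202, "rank": 0}, {"id": 303, "rank": 0}]
mpiio_records = [{"id": 101, "rank": -1}]

def files_in_both(name_records, records_a, records_b):
    """Return the names of files whose hash appears in both record lists."""
    ids_a = {rec["id"] for rec in records_a}
    ids_b = {rec["id"] for rec in records_b}
    return sorted(name_records[rec_id] for rec_id in ids_a & ids_b)

shared = files_in_both(name_records, posix_records, mpiio_records)
print(shared)
```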

### Mounted File Systems
- the mounts variable is a list of mount point names and file system types
- the list is sorted from longest path to shortest path
  - to determine which file system a file is on, match the path of the name record
    against the mounts and take the longest matching mount
  - note that this can be incorrect if a file in one file system is symlinked into another

In [None]:
pprint.pprint(report.mounts)
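
The longest-match rule above can be sketched as follows. The mount list here is hypothetical sample data, assumed to be (mount point, file system type) pairs already sorted longest path first, and `find_mount` is an illustrative helper rather than a PyDarshan function.

```python
# Hypothetical (mount point, file system type) pairs, longest path first.
mounts = [
    ("/lustre/project", "lustre"),
    ("/home", "nfs"),
    ("/", "ext4"),
]

def find_mount(path, mounts):
    """Return the first (mount point, fs type) pair that contains path, or None."""
    for mount_point, fs_type in mounts:
        prefix = mount_point.rstrip("/") + "/"
        # Compare on a path-component boundary so /home does not match /homework.
        if path == mount_point or path.startswith(prefix):
            return mount_point, fs_type
    return None

print(find_mount("/lustre/project/run/out.dat", mounts))
print(find_mount("<STDOUT>", mounts))
```

Because the list is ordered longest first, the first match is also the longest one. Special name records such as <STDOUT> match no mount and yield None.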

### Records
- this is where the data of the log is held and where you start for analysis
- records is a dictionary with each key being a module that contains a DarshanRecordCollection object

In [None]:
pprint.pprint(report.records)

### DarshanRecordCollection
- this object is derived from collections.abc.MutableSequence
- you can access the records like a sequence: [0], [1:3], etc.
- it also has 3 methods that return the data in different formats (depending on your preference)
- to_df() -> provides counters as pandas data frames
- to_dict() -> provides counters as python dictionaries
- to_numpy() -> provides counters as numpy arrays (this is the default representation)
  - these 3 methods return deep copies of the data if transforming from the loaded representation

In [None]:
dir(report.records['POSIX'])

### Accessing DarshanRecordCollection objects
- You can check the len() for the number of records in the object
- Accessing the object like a sequence returns a dictionary-like reference that gives access to four
  pieces of data
  - id -> the hash, which corresponds to the hash in name_records
  - rank -> the rank the data was collected on, or -1 if the file was accessed by all ranks and the data was reduced to
    a single record
  - counters -> all integer counters
  - fcounters -> all floating point counters
- the representation of data within _counters_ and _fcounters_ defaults to numpy arrays but depends
  on what _dtype_ was set when the records were loaded

In [None]:
print("num records = ", len(report.records['POSIX']))
print(type(report.records['POSIX'][0]))
pprint.pprint(report.records['POSIX'][0])
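
To illustrate how _id_ and _rank_ fit together, the sketch below groups ranks by file hash and picks out the files that were reduced to a single shared record (rank -1). The records are hypothetical and carry only the two fields needed here.

```python
from collections import defaultdict

# Hypothetical records with only the id and rank fields filled in.
records = [
    {"id": 101, "rank": -1},
    {"id": 303, "rank": 0},
    {"id": 303, "rank": 1},
]

def group_ranks_by_id(records):
    """Map each file hash to the list of ranks that have a record for it."""
    ranks = defaultdict(list)
    for rec in records:
        ranks[rec["id"]].append(rec["rank"])
    return dict(ranks)

grouped = group_ranks_by_id(records)
# Files whose only record has rank -1 were reduced across all ranks.
shared_ids = [rec_id for rec_id, ranks in grouped.items() if ranks == [-1]]
print(grouped, shared_ids)
```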

### Numpy Format
- the default format
- use to_numpy() to get a deep copy of the data in this format
- returns a list of dictionaries, one dictionary for each record
- dictionary is the same format as the DarshanRecordCollection above
- _counters_ and _fcounters_ contain the numpy array
- this is the format assumed by some of the experimental aggregators that are part of the library

In [None]:
np = report.records['POSIX'].to_numpy()
pprint.pprint(np)

To access a specific counter, you can generate a mapping of counter names to indexes

In [None]:
counter2index = {name: i for i, name in enumerate(report.counters['POSIX']['counters'])}
fcounter2index = {name: i for i, name in enumerate(report.counters['POSIX']['fcounters'])}
i = counter2index['POSIX_READS']
print(np[0]['counters'][i])
i = fcounter2index['POSIX_F_READ_TIME']
print(np[0]['fcounters'][i])
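
The same name-to-index mapping can be used to total an additive counter over all records. This is a minimal sketch on hypothetical data, with plain lists standing in for the numpy arrays; `total_counter` is an illustrative helper.

```python
# Hypothetical counter names and records; plain lists stand in for numpy arrays.
counter_names = ["POSIX_OPENS", "POSIX_READS", "POSIX_WRITES"]
records = [
    {"id": 101, "rank": -1, "counters": [4, 10, 20]},
    {"id": 303, "rank": 0, "counters": [1, 5, 0]},
]

name_to_index = {name: i for i, name in enumerate(counter_names)}

def total_counter(records, name):
    """Sum one integer counter over every record."""
    idx = name_to_index[name]
    return sum(rec["counters"][idx] for rec in records)

print(total_counter(records, "POSIX_READS"))  # 15
```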

### Dictionary Format
- use to_dict() to get a deep copy of the data in this format
- returns a list of dictionaries, one dictionary for each record
- dictionary is the same format as the DarshanRecordCollection above
- _counters_ and _fcounters_ contain the dictionary
  - counters names are the keys and the value is the counter value

In [None]:
d = report.records['POSIX'].to_dict()
pprint.pprint(d)

To access a specific counter, use its name as the key in _counters_ (integer) or _fcounters_ (floating point)

In [None]:
print(d[0]['counters']['POSIX_READS'])
print(d[0]['fcounters']['POSIX_F_READ_TIME'])
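
With the dictionary format it is straightforward to combine counters with the name records, for example to order files by a floating point counter. The sketch below uses hypothetical records and names, and `rank_files` is an illustrative helper.

```python
# Hypothetical stand-ins for report.name_records and the to_dict() output.
name_records = {101: "/scratch/run/output.dat", 303: "/scratch/run/input.cfg"}
records = [
    {"id": 101, "rank": -1,
     "counters": {"POSIX_READS": 10}, "fcounters": {"POSIX_F_READ_TIME": 0.25}},
    {"id": 303, "rank": 0,
     "counters": {"POSIX_READS": 5}, "fcounters": {"POSIX_F_READ_TIME": 1.5}},
]

def rank_files(records, name_records, fcounter):
    """Return (file name, value) pairs sorted by the given fcounter, largest first."""
    rows = [(name_records.get(rec["id"], "<unknown>"), rec["fcounters"][fcounter])
            for rec in records]
    return sorted(rows, key=lambda row: row[1], reverse=True)

for name, value in rank_files(records, name_records, "POSIX_F_READ_TIME"):
    print(name, value)
```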

### Pandas DataFrame Format
- use to_df() to get a deep copy of the data in this format
- returns a dictionary with _counters_ and _fcounters_ members
- _counters_ and _fcounters_ contain dataframes with all records
  - counters names are the columns and the records are the rows
  - _id_ and _rank_ are columns in the data frame

In [None]:
df = report.records['POSIX'].to_df()
pprint.pprint(df)

To access a specific counter, use the counter name for the column and select the row either by its absolute index
or by the combination of rank and id.

In [None]:
pdf = df['counters']
# with index
print(pdf.loc[0]['POSIX_READS'])
pdf = df['fcounters']
# with rank,id
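# select scalars with .loc[row, column] so each keeps its column's dtype
rank = pdf.loc[0, 'rank']
rec_id = pdf.loc[0, 'id']  # avoid shadowing the built-in id()
match = pdf[(pdf['rank'] == rank) & (pdf['id'] == rec_id)]
print(match['POSIX_F_READ_TIME'].iloc[0])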

The _counters_ and _fcounters_ can be merged into a single data set because, apart from the shared _id_ and _rank_ keys, all the column names are unique.

In [None]:
import pandas
posix_df = pandas.merge(df['counters'], df['fcounters'], on=['id', 'rank'])
print(posix_df)
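
To see what the merge produces without a log at hand, here is a small sketch on hypothetical data frames shaped like the _counters_ and _fcounters_ frames. The `validate="one_to_one"` argument is an optional addition, not in the cell above, that makes pandas raise an error if an (id, rank) pair is duplicated on either side.

```python
import pandas as pd

# Hypothetical frames shaped like the to_df() output.
counters = pd.DataFrame({"id": [101, 303], "rank": [-1, 0], "POSIX_READS": [10, 5]})
fcounters = pd.DataFrame({"id": [101, 303], "rank": [-1, 0],
                          "POSIX_F_READ_TIME": [0.25, 1.5]})

# One output row per (id, rank) pair, with the columns of both frames.
merged = pd.merge(counters, fcounters, on=["id", "rank"], validate="one_to_one")
print(merged)
```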

## Basic Plotting using Pandas
- You can make quick plots with pandas in many cases
- in some cases, pandas plotting expects the data to be organized differently, which can make it harder to use

In [None]:
posix_ops=["POSIX_OPENS", "POSIX_READS", "POSIX_WRITES", "POSIX_SEEKS", "POSIX_STATS", "POSIX_MMAPS"]
posix_df.plot(kind='bar', x='id', y=posix_ops,
              title='POSIX I/O Operation Counts per File', ylabel='Operation Count')

## Counter Names
- The code below will print out all the counter names that are defined by each module
- counter names match the names defined in the C code as well as the darshan-parser output

In [None]:
for key in report.counters.keys():
    print("{1} Counters for {0} {1}".format(key, '*'*10))
    for counter in report.counters[key].keys():
        pprint.pprint(report.counters[key][counter])
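
When the full listing is long, searching the names by substring can be more convenient. The sketch below assumes the same nesting used in the loop above (module, then counter group, then a list of names) and runs on a small hypothetical sample; `find_counters` is an illustrative helper.

```python
# Hypothetical sample with the module -> group -> names nesting.
counters = {
    "POSIX": {"counters": ["POSIX_OPENS", "POSIX_READS"],
              "fcounters": ["POSIX_F_READ_TIME"]},
    "MPI-IO": {"counters": ["MPIIO_INDEP_READS"],
               "fcounters": ["MPIIO_F_READ_TIME"]},
}

def find_counters(counters, text):
    """Return (module, group, name) for every counter name containing text."""
    matches = []
    for module, groups in counters.items():
        for group, names in groups.items():
            matches.extend((module, group, name) for name in names if text in name)
    return matches

for match in find_counters(counters, "READ_TIME"):
    print(match)
```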