# Data framework: the basic paradigm

user implements one function `define_experiment`

then runs `../../tools/data_framework/run_experiment.py`

it runs potentially many experimental trials (over all defined configurations), captures output, builds a sqlite database, queries it, produces plots, and produces html pages to display plots...

the data framework also provides lots of tools to do querying, plot generation and analysis in jupyter notebooks (see `instructions_data.ipynb`).

none of this is specific to setbench! easy to apply to other code bases, as well. (data_framework is self contained--no dependencies on setbench.)

### The following tutorial fully explains the derivation of several non-trivial `define_experiment()` functions.

# Run the following code cell before any others

It does basic initialization for this notebook.

In [None]:
import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
print("Initialized.")

# The 'hello world' of `run_experiment.py`

defining a trivial experiment that compiles and runs a single command once and saves the output.

we call `run_in_jupyter` and pass it `define_experiment`. alternatively, we could save `define_experiment` in a python file and run the equivalent `run_experiment.py` command (described in the comments below)...

In [None]:
from _basic_functions import *
def define_experiment(exp_dict, args):
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')     ## working dir for compiling
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin') ## working dir for running
    set_cmd_compile  (exp_dict, 'make brown_ext_abtree_lf.debra')
    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./brown_ext_abtree_lf.debra -nwork 1 -nprefill 1 -insdel 5 5 -k 200000 -t 1000')

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-cr')
# if the define_experiment() function above were saved in a file myexp.py,
# then the run_in_jupyter line above is equivalent to running shell command:
#   ../../tools/data_framework/run_experiment.py myexp.py -cr
#
# NOTE: -c causes COMPILATION to occur, and -r causes experiments to be RUN

# Try the same thing from the command line!

- create a file called `myexp.py` in this directory.
- start it with `from _basic_functions import *`
- copy the `define_experiment` function above into `myexp.py`
- run `../../tools/data_framework/run_experiment.py myexp.py -cr` in the shell (starting from this directory)

if you get an error along the lines of:

`NameError: name 'set_dir_compile' is not defined`

then you probably forgot to start the file with `from _basic_functions import *`, which is needed in any file where you define a `define_experiment` function for use with `run_experiment.py`.

# (Re)running results without compiling

you can rerun experiments without compiling by omitting `-c`

In [None]:
def define_experiment(exp_dict, args):
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')     ## working dir for compiling
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin') ## working dir for running
    set_cmd_compile  (exp_dict, 'make brown_ext_abtree_lf.debra')
    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./brown_ext_abtree_lf.debra -nwork 1 -nprefill 1 -insdel 5 5 -k 200000 -t 1000')

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-r')
# equiv cmd: [...]/run_experiment.py myexp.py -r

# Data files (captured stdout/err)

every time the data_framework runs your "run command" (provided by `set_cmd_run`), the output is automatically saved in a `data file`.

this is the output of that one run we executed.

In [None]:
print(shell_to_str('cat data/data000001.txt'))

# Running with varying `run param`eters

of course running one command isn't very interesting... you could do that yourself.

instead, we want to run the command many times, with different arguments. to this end, we allow the user to specify `run param`s.

the idea is as follows:
- call `add_run_param` to make the data framework aware of parameters that you want your experiments to be run with.
- your program will be run once for each set of values in the CROSS PRODUCT of all parameters.
- (i.e., we will run your program with every combination of parameters)
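
to make the cross product concrete, here is a minimal sketch in plain python (it is *not* the data framework's own code). the dict `run_params` and the second parameter `TOTAL_THREADS` are just illustrative stand-ins for what a couple of `add_run_param` calls might declare.

```python
import itertools

# hypothetical run params, in the spirit of two add_run_param calls
run_params = {
    'DS_TYPENAME': ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'],
    'TOTAL_THREADS': [1, 2, 4, 8],
}

names = list(run_params.keys())

# one dict per combination in the cross product of all parameter values
configs = [dict(zip(names, values)) for values in itertools.product(*run_params.values())]

print(len(configs))  # 3 values * 4 values = 12 runs
for config in configs[:3]:
    print(config)
```

each additional run param multiplies the number of runs, so the total grows quickly as you add parameters.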

### Replacement strings / tokens

you can use any of the run params you define to dynamically replace `{_tokens_like_this}` in the run command. for example, we include `{DS_TYPENAME}` in our run command, and it will be replaced by the current value of the `DS_TYPENAME` run param. (that's right, we can run different commands based on the current value of `DS_TYPENAME`.)
    
you can also get the paths to key directories by using:
- `{__dir_compile}`
- `{__dir_run}`
- `{__dir_data}`

the following replacement token is also defined for you:
- `{__step}`            the number of runs done so far, padded to six digits with leading zeros
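
as a rough mental model of token replacement, the sketch below uses python's `str.format` to fill in a command template. the helper `fill_tokens` and the shortened command template are hypothetical, and the framework's real substitution logic may differ.

```python
def fill_tokens(template, params, step):
    """Replace {NAME} tokens using params, plus a six-digit zero-padded {__step}."""
    values = dict(params)
    values['__step'] = '{:06d}'.format(step)
    return template.format(**values)

# hypothetical, shortened command template
cmd_template = './{DS_TYPENAME}.debra -nwork 1 -t 1000 (step {__step})'

cmd = fill_tokens(cmd_template, {'DS_TYPENAME': 'brown_ext_abtree_lf'}, step=1)
print(cmd)
```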

*note:* we now need to compile ALL of the binaries we want to *run*. so, we just change our make command to compile everything...



In [None]:
def define_experiment(exp_dict, args):
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6') ## -j specifies how many threads to compile with

    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork 1 -nprefill 1 -insdel 5 5 -k 200000 -t 1000')

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-cr')

# Extracting data fields from captured stdout/err

NOW we're going to EXTRACT data automatically from the generated data file(s). To do this, we must include the argument `-d` which stands for `database creation`.

note 3 data files were produced this time: one for each value of `DS_TYPENAME`. let's put those data files to use by specifying that we want to *extract* some text from each data file.

in particular, let's extract a line of the form "`DS_TYPENAME=...`" and a line of the form "`total_throughput=...`" from each data file. (you can find such lines in the data file above if you like.)

extracted data is stored in a sqlite database `data/output_database.sqlite` in a table called `data`. (each field name passed to `add_data_field` becomes a **column** in `data`.)

to specify a column to be extracted, we call `add_data_field()`. we do this for `total_throughput`, but note that we do *not* have to do this for `DS_TYPENAME`, as it was already added as a `run param`.

whenever you add a data field, you should choose a column type `coltype` from:
- `'TEXT'`
- `'INTEGER'`
- `'REAL'`

the `default` if you do not specify is `'TEXT'`. note, however, that allowing the default `'TEXT'` option for a `numeric` field can cause problems when it is time to produce **graphs/plots**!
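
here is a small standalone sketch of one such problem, using python's built-in `sqlite3` module and two throwaway in-memory tables (`as_text` and `as_int` are made-up names, and the values are placeholders). numbers stored in a `TEXT` column sort as strings, so `'1000'` comes before `'25'`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE as_text (total_throughput TEXT)')
conn.execute('CREATE TABLE as_int  (total_throughput INTEGER)')

# insert the same placeholder values into both tables
for v in ['900', '1000', '25']:
    conn.execute('INSERT INTO as_text VALUES (?)', (v,))
    conn.execute('INSERT INTO as_int  VALUES (?)', (v,))

text_order = [r[0] for r in conn.execute(
    'SELECT total_throughput FROM as_text ORDER BY total_throughput')]
int_order = [r[0] for r in conn.execute(
    'SELECT total_throughput FROM as_int ORDER BY total_throughput')]
conn.close()

print(text_order)  # string ordering
print(int_order)   # numeric ordering
```

the same kind of mix-up can affect axis ordering and aggregation when plotting, which is why picking `'INTEGER'` or `'REAL'` for numeric fields is worth the extra keystrokes.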

In [None]:
def define_experiment(exp_dict, args):
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork 1 -nprefill 1 -insdel 5 5 -k 200000 -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER')

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rd')

# Querying the database

Note that we can simply **access** the last database we created, *WITHOUT rerunning* any experiments, by omitting all command line args in our `run_in_jupyter` call.

Also note that you can accomplish the same thing from the **command line** by running `../../tools/data_framework/run_experiment.py myexp.py` with no further arguments. However, since you can't pass your `define_experiment` function as a command line argument, you have to save it in a `.py` file and pass the name of that file (here `myexp.py`) as the first argument to `run_experiment.py`.

To query the database, we can use function `select_to_dataframe(sql_string)` with a suitable `sql_string`. There are many other powerful functions included for querying and plotting data, but those are covered in `microbench_experiments/example/instructions_data.ipynb`. In **this** notebook we are focusing on the design of the `define_experiment` function.
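
To get a feel for what a suitable `sql_string` looks like, the sketch below runs one against a tiny in-memory stand-in for the `data` table, using Python's built-in `sqlite3` module rather than the framework. The throughput values are placeholders, not measurements.

```python
import sqlite3

# in-memory stand-in for the `data` table (placeholder values)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (DS_TYPENAME TEXT, total_throughput INTEGER)')
conn.executemany('INSERT INTO data VALUES (?, ?)', [
    ('brown_ext_ist_lf', 300),
    ('brown_ext_abtree_lf', 200),
    ('bronson_pext_bst_occ', 100),
])

# select two columns, sorted by throughput (highest first)
sql_string = 'select DS_TYPENAME, total_throughput from data order by total_throughput desc'
rows = conn.execute(sql_string).fetchall()
conn.close()

for row in rows:
    print(row)
```

Presumably the same string could be handed to `select_to_dataframe` once a real database exists, but that is an assumption to check against `instructions_data.ipynb`.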

## Extra columns

Note that the resulting query shows numerous extra columns such as `__hostname`, `__step` and `__cmd_run`, that we did *not* add ourselves. These are added *automatically* by the data framework.

In [None]:
import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='')
df = select_to_dataframe('select * from data')
df

# run_in_jupyter call above has equivalent command:
# [...]/run_experiment.py myexp.py


# Suppressing logging output in `run_in_jupyter`

If you want to call `run_in_jupyter` as above *without* seeing the `logging data` that was copied to stdout, you can disable the log output by calling `disable_tee_stdout()`. Note that logs will still be collected, but the output will **only** go to the log file `output_log.txt`.

In [None]:
import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
disable_tee_stdout()
run_in_jupyter(define_experiment, cmdline_args='')
df = select_to_dataframe('select * from data')
enable_tee_stdout() ## remember to enable, or you won't get output where you DO expect it...
df


# Running multiple trials

if you want to perform repeated trials of each experimental configuration, add a run_param called "`__trials`", and specify a list of trial numbers (as below).

(the run_param doesn't *need* to be called `__trials` exactly, but if it is called `__trials` exactly,
then extra sanity checks will be performed to verify, for example, that each data point in a graphical plot
represents the average of precisely as many experimental runs as there are entries in the `__trials` list.)
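
the kind of check described above can be pictured with the following plain-python sketch. it is only an illustration of the idea (counting runs per data point and comparing against the number of trials), not the framework's actual sanity check, and the `rows` list is a stand-in for rows of the `data` table.

```python
from collections import Counter

trials = [1, 2, 3]
ds_names = ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ']

# stand-in for rows of the `data` table: one row per (trial, data structure) run
rows = [{'__trials': t, 'DS_TYPENAME': ds} for t in trials for ds in ds_names]

# count how many runs back each data point (here: each DS_TYPENAME)
runs_per_point = Counter(row['DS_TYPENAME'] for row in rows)

# any data point not backed by exactly len(trials) runs is suspicious
mismatched = {ds: n for ds, n in runs_per_point.items() if n != len(trials)}

print(dict(runs_per_point))
print('mismatched:', mismatched)
```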


In [None]:
def define_experiment(exp_dict, args):
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2, 3])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork 1 -nprefill 1 -insdel 5 5 -k 200000 -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER')

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rd')

## Querying the data (to see the multiple trials)

In [None]:
select_to_dataframe('select * from data')

# Extractors: mining data from arbitrary text

by default, when you call `add_data_field(exp_dict, 'XYZ')`, a field `'XYZ'` will be fetched from each data file using extractor `grep_line()`, which greps (searches) for a line of the form `'XYZ={arbitrary string}\n'`
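
to illustrate what "grepping for a line of that form" amounts to, here is a pure-python sketch. `grep_line_sketch` is a hypothetical stand-in, not the framework's `grep_line()`; it only borrows the `(exp_dict, file_name, field_name)` argument order used by extractors, and it reads a tiny temporary file instead of a real data file.

```python
import os
import tempfile

def grep_line_sketch(exp_dict, file_name, field_name):
    """Return the text after 'field_name=' on the first matching line, or None."""
    prefix = field_name + '='
    with open(file_name) as f:
        for line in f:
            if line.startswith(prefix):
                return line[len(prefix):].rstrip('\n')
    return None

# write a tiny stand-in data file with two 'NAME=value' lines
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('DS_TYPENAME=brown_ext_abtree_lf\ntotal_throughput=12345\n')
    path = f.name

value = grep_line_sketch(None, path, 'total_throughput')
missing = grep_line_sketch(None, path, 'no_such_field')
os.remove(path)

print(value)
print(missing)
```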

*if a field you want to extract is not stored that way in the output data*, then you can specify a custom `extractor` function, as we do in our example with `get_maxres()` below, to extract the max resident size from the 6th space-separated column of the output of the linux "time" command.

also note: each field added with `add_data_field` becomes a replacement token (e.g., `{DS_TYPENAME}`) that can be referenced in any plot titles, axis titles, field lists, etc. (which we will see more on below).

the following special fields are also defined for you (and added to the `data` table):
- `{__step}`            the number of runs done so far, padded to six digits with leading zeros
- `{__cmd_run}`         your cmd_run string with any tokens replaced appropriately for this run
- `{__file_data}`       the output filename for the current run's data
- `{__path_data}`       the relative path to the output file for the current run's data
- `{__hostname}`        the result of running the hostname command on the machine
- `{__id}`              a unique row ID

note: in the following, `defaults` are `validator=is_nonempty` and `extractor=grep_line`.

## Text output we are *trying* to extract max resident size from

A line of the form:

`960.43user 50.70system 0:06.14elapsed 16449%CPU (0avgtext+0avgdata 3034764maxresident)k`

From this, we would like to extract `3034764`, then convert from KB to MB...

## Extractor that accomplishes this

`input`: an `extractor` function takes, as its arguments: the same `exp_dict` argument as `define_experiment()`, a `file_name` to load data from, and a `field_name` to extract.

`processing`: it should fetch the appropriate contents for that field, from the given `file_name` and return them.

`output`: return type can be a `string`, `int` or `float`.

(in cases like this, where we're writing a custom `extractor` to fetch a specific field, the `field_name` argument ends up being irrelevant.)

you are free to read the contents of the file, and process the data you see however you like, to come up with the desired return value.

in our case, we will use the `shell_to_str()` utility function provided by the data framework to run a sequence of `bash` shell commands to extract the desired string from the file, then cast it to a `float` and convert it from kilobytes to megabytes.

## (you could just as easily do this with pure python code. the choice is yours.)

In [None]:
def get_maxres(exp_dict, file_name, field_name):
    ## parse the maximum resident size (in KB) from the output of `time`, and return it in MB
    maxres_kb_str = shell_to_str('grep "maxres" {} | cut -d" " -f6 | cut -d"m" -f1'.format(file_name))
    return float(maxres_kb_str) / 1000
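
for comparison, a pure-python take on the same extractor might look like the sketch below. `get_maxres_py` is a hypothetical name, the regular expression is just one way to find the number, and the temporary file stands in for a real data file containing the `time` output shown earlier.

```python
import os
import re
import tempfile

def get_maxres_py(exp_dict, file_name, field_name):
    """Find '<digits>maxresident' in the file and return the value in MB (KB / 1000)."""
    with open(file_name) as f:
        for line in f:
            match = re.search(r'(\d+)maxresident', line)
            if match:
                return float(match.group(1)) / 1000
    return None  # no such line found

# stand-in data file containing a line in the format printed by `time`
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('960.43user 50.70system 0:06.14elapsed 16449%CPU '
            '(0avgtext+0avgdata 3034764maxresident)k\n')
    path = f.name

maxres_mb = get_maxres_py(None, path, 'maxresident_mb')
os.remove(path)
print(maxres_mb)
```

unlike the `cut`-based version, this does not depend on the number sitting in the 6th space-separated column.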

## **Using** this extractor in `define_experiment`

we actually use this extractor by adding a data field and specifying it:

`add_data_field   (exp_dict, 'maxresident_mb', extractor=get_maxres)`

In [None]:
def get_maxres(exp_dict, file_name, field_name):
    ## parse the maximum resident size (in KB) from the output of `time`, and return it in MB
    maxres_kb_str = shell_to_str('grep "maxres" {} | cut -d" " -f6 | cut -d"m" -f1'.format(file_name))
    return float(maxres_kb_str) / 1000

def define_experiment(exp_dict, args):
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2, 3])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork 1 -nprefill 1 -insdel 5 5 -k 200000 -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER')
    add_data_field   (exp_dict, 'maxresident_mb', coltype='REAL', extractor=get_maxres)

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rd')

## Viewing the resulting data

note the `maxresident_mb` column -- highlighted for emphasis using Pandas DataFrame `style.applymap()`.

In [None]:
df = select_to_dataframe('select * from data')

df.style.applymap(lambda s: 'background-color: #b63f3f', subset=pd.IndexSlice[:, ['maxresident_mb']])

# Validators: *checking* extracted data

suppose you want to run some basic *sanity checks* on fields you pull from data files.

a `validator` function is a great way of having the data framework perform a basic check on values as they are extracted from data files.

pre-existing `validator` functions:
- `is_positive`
- `is_nonempty`
- `is_equal(to_value)`

for example, suppose we want to verify that `total_throughput` and `maxresident_mb` are both **positive** numbers. to do this, we specify `validator=is_positive` for each, below.

note: you can write your own `validator` by mimicking the ones in `../../tools/data_framework/_basic_functions.py`. (see `is_positive` and `is_equal`.)
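
as a rough sketch of what custom validators could look like: the code below ASSUMES a validator is called as `validator(exp_dict, value)` and returns `True` or `False`, and that parameterized validators such as `is_equal(to_value)` are functions that return such a validator. check `_basic_functions.py` for the real signature before relying on this. the names `is_at_least` and `is_one_of` are made up for illustration.

```python
# ASSUMPTION: validators are called as validator(exp_dict, value) -> bool.
# verify this against _basic_functions.py before using it with the framework.

def is_at_least(min_value):
    """Build a validator that accepts numeric values >= min_value."""
    def fn(exp_dict, value):
        return float(value) >= min_value
    return fn

def is_one_of(allowed_values):
    """Build a validator that accepts only values from a fixed collection."""
    allowed = set(allowed_values)
    def fn(exp_dict, value):
        return value in allowed
    return fn

# quick standalone checks (exp_dict is unused here, so None is passed)
check_throughput = is_at_least(1000)
check_ds = is_one_of(['brown_ext_ist_lf', 'brown_ext_abtree_lf'])

print(check_throughput(None, '2500'))
print(check_throughput(None, 10))
print(check_ds(None, 'bronson_pext_bst_occ'))
```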

In [None]:
def get_maxres(exp_dict, file_name, field_name):
    ## parse the maximum resident size (in KB) from the output of `time`, and return it in MB
    maxres_kb_str = shell_to_str('grep "maxres" {} | cut -d" " -f6 | cut -d"m" -f1'.format(file_name))
    return float(maxres_kb_str) / 1000

def define_experiment(exp_dict, args):
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2, 3])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork 1 -nprefill 1 -insdel 5 5 -k 200000 -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)
    add_data_field   (exp_dict, 'maxresident_mb', coltype='REAL', extractor=get_maxres, validator=is_positive)

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rd')

# What happens when a field *fails* validation?

we trigger a validation failure by specifying an obviously incorrect validator `is_equal('hello')`

In [None]:
def get_maxres(exp_dict, file_name, field_name):
    ## parse the maximum resident size (in KB) from the output of `time`, and return it in MB
    maxres_kb_str = shell_to_str('grep "maxres" {} | cut -d" " -f6 | cut -d"m" -f1'.format(file_name))
    return float(maxres_kb_str) / 1000

def define_experiment(exp_dict, args):
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2, 3])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork 1 -nprefill 1 -insdel 5 5 -k 200000 -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_equal('hello'))
    add_data_field   (exp_dict, 'maxresident_mb', coltype='REAL', extractor=get_maxres, validator=is_positive)

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rd', error_exit_code=0)

# Plotting results (for data with <ins>3 dimensions</ins>)

One of the main reasons I created the data framework was to make it stupid-easy to produce lots of graphs/plots.

The main tool for doing this is the `add_plot_set` function.

`add_plot_set()` can be used to cause a SET of plots to be rendered as images in the data directory.

the precise SET of plots is defined by the fields listed in the `varying_cols_list` keyword argument.
 (the data framework iterates over all distinct combinations of values of those fields,
 and renders one plot for each combination.)
 in the example below, we do *not* pass any `varying_cols_list` argument, so only a single plot is produced.

(we will see where `varying_cols_list` is useful, and how it is used, in some of the later examples...)

Note: a plot's title and filename can only use replacement `{tokens}` that correspond
      to fields THAT ARE INCLUDED in `varying_cols_list[]`.
      (this is because only those tokens are well defined and unique PER PLOT)

### Note: any plots you define are *not actually rendered* unless you add command line argument `-p`

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools') ## tools library for plotting
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel 5 5 -k 200000 -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)

    add_plot_set(
            exp_dict
          , name='throughput.png'
          , title='Throughput vs data structure'
          , series='DS_TYPENAME'
          , x_axis='TOTAL_THREADS'
          , y_axis='total_throughput'
          , plot_type='bars', plot_cmd_args = '--legend-include'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rdp')

## Let's view the data and plot produced by the previous cell

(You have to run the previous cell before running the next one.)

In [None]:
from IPython.display import Image
display(Image('data/throughput.png'))
display(select_to_dataframe('select * from data'))

# Plotting data with a custom function

If you want full control over how your data is plotted, you can specify your own function as the `plot_type` argument.

Your custom function will be called with keyword arguments:
- `filename`        -- the output filename for the plot image
- `column_filters`  -- the *current* values of all fields in `varying_cols_list` (if any)
- `data`            -- a Pandas DataFrame containing the (filtered) data for this plot
- `series_name`     -- name of the column containing `series` in `data` (`''` if no series)
- `x_name`          -- name of the column containing `x-values` in `data`
- `y_name`          -- name of the column containing `y-values` in `data`
- `exp_dict`        -- same as `exp_dict` passed to `define_experiment`

To *better understand* what data is passed to a custom function, let's create a custom function that just prints its arguments.

In [None]:
def my_plot_func(filename, column_filters, data, series_name, x_name, y_name, exp_dict=None):
    print('## filename: {}'.format(filename))
    print('## filters: {}'.format(column_filters))
    print('## data:')
    print(data)

def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools')
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel 5 5 -k 200000 -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)

    add_plot_set(
            exp_dict
          , name='throughput.png'
          , title='Throughput vs data structure'
          , series='DS_TYPENAME'
          , x_axis='TOTAL_THREADS'
          , y_axis='total_throughput'
          , plot_type=my_plot_func
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *

disable_tee_stdout() ## disable regular log printing so we ONLY see OUR prints below
run_in_jupyter(define_experiment, cmdline_args='-dp')
enable_tee_stdout()

# For example, we can plot this data *manually* using `Pandas`

Since we have `TWO trials` per combination of `DS_TYPENAME` and `TOTAL_THREADS`, we need to aggregate our data somehow before plotting. We can use the `pandas` `pivot_table()` function to compute the `mean` of the trials for each data point.

Once we have a pivot table, we can call `pandas` `plot()` to render it, then use `savefig()` to save it to the provided `filename`.

Of course, you can write your own such functions, and make them arbitrarily complex/customized...

In [None]:
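import pandas
import matplotlib.pyplot as plt ## import pyplot explicitly (plain `import matplotlib` does not guarantee it is loaded)

def my_plot_func(filename, column_filters, data, series_name, x_name, y_name, exp_dict=None):
    ## average the trials for each (x, series) pair, then draw one line per series
    table = pandas.pivot_table(data, index=x_name, columns=series_name, values=y_name, aggfunc='mean')
    table.plot(kind='line')
    plt.savefig(filename)
    print('## SAVED FIGURE {}'.format(filename))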

def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools')
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel 5 5 -k 200000 -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)

    add_plot_set(
            exp_dict
          , name='throughput.png'
          , title='Throughput vs data structure'
          , series='DS_TYPENAME'
          , x_axis='TOTAL_THREADS'
          , y_axis='total_throughput'
          , plot_type=my_plot_func
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
disable_tee_stdout()
run_in_jupyter(define_experiment, cmdline_args='-dp')
enable_tee_stdout()

## Viewing the generated figure

In [None]:
from IPython.display import Image
display(Image('data/throughput.png'))

# Producing *many* plots (for data with <ins>5 dimensions</ins>)

the real power of `add_plot_set` only starts to show once you want to produce *many* plots at once.

so, let's add a couple of dimensions to our data:
- key range (`MAXKEY` in the data file)
- update rate (`INS_DEL_FRAC` in the data file)

and use them to produce **multiple plots** (one for each combination of values of these dimensions). we do this by specifying `varying_cols_list` in `add_plot_set`.

we can also customize the plot file`name`s and `title`s with these parameters.
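
to see how one plot per combination falls out of `varying_cols_list`, here is a plain-python sketch (not the framework's implementation). it builds stand-in rows, finds the distinct combinations of the varying columns, and fills a filename template for each one. the parameter values and the template are illustrative.

```python
import itertools

varying_cols_list = ['MAXKEY', 'INS_DEL_FRAC']
name_template = 'throughput-{INS_DEL_FRAC}-{MAXKEY}k.png'

# stand-in rows: one per (MAXKEY, INS_DEL_FRAC, TOTAL_THREADS) configuration
rows = [
    {'MAXKEY': k, 'INS_DEL_FRAC': u, 'TOTAL_THREADS': t}
    for k, u, t in itertools.product([20000, 200000], ['0.0 0.0', '5.0 5.0'], [1, 2, 4, 8])
]

# distinct combinations of values of the varying columns
combos = sorted({tuple(row[col] for col in varying_cols_list) for row in rows})

plots = []
for combo in combos:
    filters = dict(zip(varying_cols_list, combo))
    # rows that belong in this particular plot
    plot_rows = [r for r in rows if all(r[c] == v for c, v in filters.items())]
    plots.append((name_template.format(**filters), len(plot_rows)))

for name, num_rows in plots:
    print(name, num_rows)
```

this also shows why only tokens from `varying_cols_list` make sense in a plot's name or title: they are the only values that are constant within a single plot.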

# Showing these plots in a table in an HTML page

we also generate an HTML page to show off these plots in a table by invoking `add_page_set`.

HTML page construction only occurs if you specify command line argument `-w` (which stands for `website creation`) to `run_experiment.py`. so, we add this to `run_in_jupyter`.

note: you can also customize the `index.html` starting page (which is blank by default) by providing your own `HTML body` string to the function `set_content_index_html(exp_dict, content_html_string)`.

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools') ## path to tools library
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make bin_dir={__dir_run} -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ["0.0 0.0", "5.0 5.0"])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k {MAXKEY} -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)

    add_plot_set(
            exp_dict
          , name='throughput-{INS_DEL_FRAC}-{MAXKEY}k.png'
          , title='{INS_DEL_FRAC} {MAXKEY}k: throughput'
          , varying_cols_list=['MAXKEY', 'INS_DEL_FRAC']
          , series='DS_TYPENAME'
          , x_axis='TOTAL_THREADS'
          , y_axis='total_throughput'
          , plot_type='bars'
    )

    ## render one legend for all plots (since the legend is the same for all).
    ## if legend varies from plot to plot, you might enable legends for all plots,
    ## or write a custom plotting command that determines what to do, given your data
    add_plot_set(exp_dict, name='throughput-legend.png', series='DS_TYPENAME', x_axis='TOTAL_THREADS', y_axis='total_throughput', plot_type='bars', plot_cmd_args='--legend-only --legend-columns 3')

    ## we place the above legend at the bottom of *each* table by providing "legend_file"
    add_page_set(
            exp_dict
          , image_files='throughput-{INS_DEL_FRAC}-{MAXKEY}k.png'
          , name='throughput'
          , column_field='INS_DEL_FRAC'
          , row_field='MAXKEY'
          , legend_file='throughput-legend.png'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rdpw')

## Let's view the plots produced by the previous cell

note you can click on the plots to "drill down" into the data.

In [None]:
show_html('data/throughput.html')

# How about 4 dimensions?

We just saw how to plot 3- and 5-dimensional data...

Let's remove the `MAXKEY` column / data dimension to reduce the dimensionality of the data to 4.

With only one column in the `varying_cols_list` and NO `row_field` specified in `add_page_set`, there will only be one row of plots. (So a strip of plots instead of a grid.)

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools') ## path to tools library
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make bin_dir={__dir_run} -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ["0.0 0.0", "5.0 5.0"])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k 200000 -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)

    add_plot_set(
            exp_dict
          , name='throughput-{INS_DEL_FRAC}.png'
          , title='{INS_DEL_FRAC}: throughput'
          , varying_cols_list=['INS_DEL_FRAC']
          , series='DS_TYPENAME'
          , x_axis='TOTAL_THREADS'
          , y_axis='total_throughput'
          , plot_type='bars'
    )

    ## render one legend for all plots (since the legend is the same for all).
    ## if legend varies from plot to plot, you might enable legends for all plots,
    ## or write a custom plotting command that determines what to do, given your data
    add_plot_set(exp_dict, name='throughput-legend.png', series='DS_TYPENAME', x_axis='TOTAL_THREADS', y_axis='total_throughput', plot_type='bars', plot_cmd_args='--legend-only --legend-columns 3')

    ## we place the above legend at the bottom of *each* table by providing "legend_file"
    add_page_set(
            exp_dict
          , image_files='throughput-{INS_DEL_FRAC}.png'
          , name='throughput'
          , column_field='INS_DEL_FRAC'
          , legend_file='throughput-legend.png'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rdpw')

## Let's view the plots produced by the previous cell

In [None]:
show_html('data/throughput.html')

# Plots and HTML for data with <ins>6 dimensions</ins>

note that we could have added more than 2 dimensions of data (resulting in data with 6+ dimensions), listing potentially many fields in `varying_cols_list`, and this simply would have resulted in *more plots*.

note that if we had **one** more dimension of data (6 dimensions in total), it could be listed in the keyword argument `table_field`, and **multiple** HTML tables would be rendered in a single HTML page (one for each value of this column).
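
to see why extra fields in `varying_cols_list` simply mean *more plots*, here is a rough sketch that enumerates one file name per combination of values. it uses only the python standard library and is NOT the framework's implementation; `expand_pattern` is a hypothetical helper, and the values mirror the ones used in this section.

```python
from itertools import product

def expand_pattern(pattern, varying):
    """Return one formatted name per combination of values in `varying`.

    `varying` maps a field name to its list of values.
    (hypothetical helper, not part of run_experiment.py)
    """
    fields = list(varying)
    names = []
    for combo in product(*(varying[f] for f in fields)):
        names.append(pattern.format(**dict(zip(fields, combo))))
    return names

plots = expand_pattern(
    'throughput-{malloc}-{INS_DEL_FRAC}-{MAXKEY}.png',
    {
        'malloc': ['jemalloc', 'mimalloc'],
        'MAXKEY': [20000, 200000],
        'INS_DEL_FRAC': ['0.0 0.0', '5.0 5.0'],
    },
)
print(len(plots))  # 2 * 2 * 2 = 8 plots
for name in plots:
    print(name)
```

each additional varying field multiplies the number of plots by the number of values that field takes.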

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools') ## path to tools library
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make bin_dir={__dir_run} -j6')

    add_run_param    (exp_dict, '__trials', [1])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ['0.0 0.0', '5.0 5.0'])
    ## unlike the above four fields,
    ## the run command does NOT produce a line of the form 'malloc=[...]'.
    ## so, run_experiment.py will APPEND a line of this form to the datafile!
    add_run_param    (exp_dict, 'malloc', ['jemalloc', 'mimalloc'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/lib{malloc}.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k {MAXKEY} -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)
    add_data_field   (exp_dict, 'malloc', validator=is_run_param('malloc'))

    add_plot_set(
            exp_dict
          , name='throughput-{malloc}-{INS_DEL_FRAC}-{MAXKEY}.png'
          , title='{malloc} {INS_DEL_FRAC} {MAXKEY}'
          , varying_cols_list=['malloc', 'MAXKEY', 'INS_DEL_FRAC']
          , series='DS_TYPENAME'
          , x_axis='TOTAL_THREADS'
          , y_axis='total_throughput'
          , plot_type='bars'
    )

    ## render one legend for all plots (since the legend is the same for all).
    ## if legend varies from plot to plot, you might enable legends for all plots,
    ## or write a custom plotting command that determines what to do, given your data
    add_plot_set(exp_dict, name='throughput-legend.png', series='DS_TYPENAME', x_axis='TOTAL_THREADS', y_axis='total_throughput', plot_type='bars', plot_cmd_args='--legend-only --legend-columns 3')

    ## note: choice of column / row / table field determines how the HTML page looks -- up to you!
    add_page_set(
            exp_dict
          , image_files='throughput-{malloc}-{INS_DEL_FRAC}-{MAXKEY}.png'
          , name='throughput'
          , column_field='INS_DEL_FRAC'
          , row_field='MAXKEY'
          , table_field='malloc'
          , legend_file='throughput-legend.png'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rdpw')

## Let's view the data, plots and HTML we produced

In [None]:
show_html('data/throughput.html')
display(select_to_dataframe('select * from data'))

# Plots and HTML for data with <ins>7+ dimensions</ins>

if we had MORE than one extra dimension of data (7+ dimensions in total), we could list additional fields in the keyword argument `page_field_list`, which would cause additional HTML pages to be rendered (one for each combination of values for fields in `page_field_list`), and linked together by an `index.html`. (note that the `name` keyword argument of `add_page_set` must also be modified to reference these fields, in order for multiple HTML files to be created: you must specify what sort of naming convention you'd like the framework to use.)
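
as a mental model of how the four layout arguments split up the data, the sketch below groups every combination of values into pages, then tables, then (row, column) cells. it is only an illustration built on the standard library, under the assumption that each argument partitions the data independently; `layout` is a hypothetical function, not something exported by the framework.

```python
from collections import defaultdict
from itertools import product

params = {
    'numactl': ['', 'numactl --interleave=all'],
    'malloc': ['jemalloc', 'mimalloc'],
    'MAXKEY': [20000, 200000],
    'INS_DEL_FRAC': ['0.0 0.0', '5.0 5.0'],
}

def layout(params, column_field, row_field, table_field, page_fields):
    """Group all value combinations as pages -> tables -> (row, column) cells."""
    fields = list(params)
    pages = defaultdict(lambda: defaultdict(list))
    for combo in product(*(params[f] for f in fields)):
        cfg = dict(zip(fields, combo))
        page_key = tuple(cfg[f] for f in page_fields)   # one page per key
        table_key = cfg[table_field]                     # one table per value
        pages[page_key][table_key].append((cfg[row_field], cfg[column_field]))
    return pages

pages = layout(params, 'numactl', 'malloc', 'MAXKEY', ['INS_DEL_FRAC'])
for page_key, tables in pages.items():
    print(page_key, {table: len(cells) for table, cells in tables.items()})
```

with these values there are two pages (one per `INS_DEL_FRAC`), each holding two tables (one per `MAXKEY`), each a 2x2 grid of plots.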

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools') ## path to tools library
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make bin_dir={__dir_run} -j6')

    add_run_param    (exp_dict, '__trials', [1])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [2, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ['0.0 0.0', '5.0 5.0'])
    ## unlike the above four fields,
    ## the run command does NOT produce a line of the form 'malloc=[...]'.
    ## so, run_experiment.py will APPEND a line of this form to the datafile!
    add_run_param    (exp_dict, 'malloc', ['jemalloc', 'mimalloc'])
    ## ditto for numactl
    add_run_param    (exp_dict, 'numactl', ['', 'numactl --interleave=all'])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/lib{malloc}.so {numactl} time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k {MAXKEY} -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)

    add_plot_set(
            exp_dict
          , name='throughput-{malloc}-{numactl}-{INS_DEL_FRAC}-{MAXKEY}.png'
          , title='{INS_DEL_FRAC} {MAXKEY}'
          , varying_cols_list=['malloc', 'numactl', 'MAXKEY', 'INS_DEL_FRAC']
          , series='DS_TYPENAME'
          , x_axis='TOTAL_THREADS'
          , y_axis='total_throughput'
          , plot_type='bars'
    )

    ## render one legend for all plots (since the legend is the same for all).
    ## if legend varies from plot to plot, you might enable legends for all plots,
    ## or write a custom plotting command that determines what to do, given your data
    add_plot_set(exp_dict, name='throughput-legend.png', series='DS_TYPENAME', x_axis='TOTAL_THREADS', y_axis='total_throughput', plot_type='bars', plot_cmd_args='--legend-only --legend-columns 3')

    ## we place the above legend at the bottom of *each* table by providing "legend_file"
    add_page_set(
            exp_dict
          , image_files='throughput-{malloc}-{numactl}-{INS_DEL_FRAC}-{MAXKEY}.png'
          , name='throughput'
          , column_field='numactl'
          , row_field='malloc'
          , table_field='MAXKEY'
          , page_field_list=['INS_DEL_FRAC']
          , legend_file='throughput-legend.png'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rdpw')

## Let's view the data, plots and HTML we produced

In [None]:
show_html('data/index.html')
display(select_to_dataframe('select * from data'))

# It's easy to plot *many* value fields vs your `run_params`

Let's go back to our 5-dimensional data example to demonstrate how to easily produce plots from *many different value fields* (not just `total_throughput`).

### First let's run a quick shell command to check what kinds of fields exist in our data

(This command uses `grep` with a simple `regex` to look for lines of the form "XYZ=*number*")
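
The same pattern can be applied in Python with the `re` module. The sketch below is a standalone illustration: `numeric_fields` is a hypothetical helper, and the sample text is made up rather than copied from a real data file.

```python
import re

# same pattern as the grep command: NAME=number, with no spaces in NAME
FIELD_LINE = re.compile(r'^([^ =]+)=([0-9.]+)$')

def numeric_fields(text):
    """Return {name: value_string} for every line of the form NAME=number."""
    found = {}
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found

# made-up sample resembling a data file
sample = "\n".join([
    "total_throughput=1234567",
    "tree_stats_avgKeyDepth=12.5",
    "DS_TYPENAME=brown_ext_abtree_lf",
    "some free-form log line",
])
print(numeric_fields(sample))
```

Lines whose value is not purely digits and dots (such as `DS_TYPENAME=...`) are skipped, which is why only numeric fields show up in the list.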

In [None]:
shell_to_list('grep -E "^[^ =]+=[0-9.]+$" data/data000001.txt', sep='\n')

## Let's focus on the following fields from that list:

- `tree_stats_numNodes`
- `tree_stats_height`
- `tree_stats_avgKeyDepth`
- `global_epoch_counter`
- `PAPI_L2_TCM`
- `PAPI_L3_TCM`
- `PAPI_TOT_CYC`
- `PAPI_TOT_INS`
- `total_throughput`

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools')
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ["0.0 0.0", "5.0 5.0"])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k {MAXKEY} -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)
    add_data_field   (exp_dict, 'tree_stats_numNodes', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_height', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_avgKeyDepth', coltype='REAL')
    add_data_field   (exp_dict, 'global_epoch_counter', coltype='INTEGER')
    add_data_field   (exp_dict, 'PAPI_L2_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_L3_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_CYC', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_INS', coltype='REAL')

    ## render one legend for all plots (since the legend is the same for all).
    ## if legend varies from plot to plot, you might enable legends for all plots,
    ## or write a custom plotting command that determines what to do, given your data
    add_plot_set(exp_dict, name='legend.png', series='DS_TYPENAME', x_axis='TOTAL_THREADS', y_axis='total_throughput', plot_type='bars', plot_cmd_args='--legend-only --legend-columns 3')

    ## render a plot_set for EVERY numeric data field extracted above
    for field in get_numeric_data_fields(exp_dict):
        add_plot_set(
              exp_dict
            , name=field+'-{INS_DEL_FRAC}-{MAXKEY}k.png'
            , title='{INS_DEL_FRAC} {MAXKEY}k: '+field
            , varying_cols_list=['MAXKEY', 'INS_DEL_FRAC']
            , series='DS_TYPENAME'
            , x_axis='TOTAL_THREADS'
            , y_axis=field
            , plot_type='bars'
        )

        ## and also add a page_set for each data field.
        ## we place the above legend at the bottom of *each* table by providing "legend_file"
        add_page_set(
              exp_dict
            , image_files=field+'-{INS_DEL_FRAC}-{MAXKEY}k.png'
            , name=field
            , column_field='INS_DEL_FRAC'
            , row_field='MAXKEY'
            , legend_file='legend.png'
        )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rdpw')

## Viewing the results

In [None]:
import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
show_html('data/index.html')

# Rendering *many data fields* on a *single* HTML page

in the previous example, we built one page for each data field extracted. however, you might want, for example, to build a single page with many data fields, each appearing as a *row* of plots.

if you take a moment to think about *how* you would accomplish this using `add_page_set`, it's not obvious that you even *can*... you can specify *one field* as the `row_field`, but in this case we want to show *many different fields, one per row*.
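
one possible answer, which the next cell uses, is to pass a *list* of field names as `row_field` and refer to the current one through a `{row_field}` placeholder in `image_files`. the sketch below mimics that substitution with plain `str.format`; how the framework performs it internally is an assumption here, and `plot_names` / `page_rows` are hypothetical helpers.

```python
value_fields = ['total_throughput', 'tree_stats_height']  # two of the fields used in this tutorial
page_pattern = '{row_field}-{INS_DEL_FRAC}-{MAXKEY}k.png'

def plot_names(field, fracs, maxkeys):
    """Names produced by a plot set named field + '-{INS_DEL_FRAC}-{MAXKEY}k.png'."""
    pattern = field + '-{INS_DEL_FRAC}-{MAXKEY}k.png'
    return [pattern.format(INS_DEL_FRAC=f, MAXKEY=k) for k in maxkeys for f in fracs]

def page_rows(fields, fracs, maxkey):
    """One row of image names per value field, for a single table (one MAXKEY)."""
    return {
        field: [
            page_pattern.format(row_field=field, INS_DEL_FRAC=f, MAXKEY=maxkey)
            for f in fracs
        ]
        for field in fields
    }

fracs = ['0.0 0.0', '5.0 5.0']
rows = page_rows(value_fields, fracs, 20000)
for field, images in rows.items():
    print(field, images)
```

the point is that the names generated for the page line up exactly with the names generated by the per-field plot sets, so each row picks up the plots for one field.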

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools')
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ["0.0 0.0", "5.0 5.0"])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k {MAXKEY} -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)
    add_data_field   (exp_dict, 'tree_stats_numNodes', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_height', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_avgKeyDepth', coltype='REAL')
    add_data_field   (exp_dict, 'global_epoch_counter', coltype='INTEGER')
    add_data_field   (exp_dict, 'PAPI_L2_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_L3_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_CYC', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_INS', coltype='REAL')

    ## render one legend for all plots
    add_plot_set(exp_dict, name='legend.png', series='DS_TYPENAME', x_axis='TOTAL_THREADS', y_axis='total_throughput', plot_type='bars', plot_cmd_args='--legend-only --legend-columns 3')

    ## render plots
    value_fields = get_numeric_data_fields(exp_dict)
    for field in value_fields:
        add_plot_set(
              exp_dict
            , name=field+'-{INS_DEL_FRAC}-{MAXKEY}k.png'
            , title='{INS_DEL_FRAC} {MAXKEY}k: '+field
            , varying_cols_list=['MAXKEY', 'INS_DEL_FRAC']
            , series='DS_TYPENAME'
            , x_axis='TOTAL_THREADS'
            , y_axis=field
            , plot_type='bars'
        )

    ## and also add a page_set to show all plots
    add_page_set(
          exp_dict
        , image_files='{row_field}-{INS_DEL_FRAC}-{MAXKEY}k.png'
        , name='comparison'
        , column_field='INS_DEL_FRAC'
        , row_field=value_fields
        , table_field='MAXKEY'
        , legend_file='legend.png'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-dpw')

## Viewing the results

In [None]:
import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
show_html('data/index.html')

# Separating `tables` into different `pages`

if you prefer, you can eliminate the `table_field` argument to `add_page_set` and instead use `page_field_list`. this produces a slightly different effect.

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools')
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ["0.0 0.0", "5.0 5.0"])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k {MAXKEY} -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)
    add_data_field   (exp_dict, 'tree_stats_numNodes', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_height', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_avgKeyDepth', coltype='REAL')
    add_data_field   (exp_dict, 'global_epoch_counter', coltype='INTEGER')
    add_data_field   (exp_dict, 'PAPI_L2_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_L3_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_CYC', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_INS', coltype='REAL')

    ## render one legend for all plots
    add_plot_set(exp_dict, name='legend.png', series='DS_TYPENAME', x_axis='TOTAL_THREADS', y_axis='total_throughput', plot_type='bars', plot_cmd_args='--legend-only --legend-columns 3')

    ## render plots
    value_fields = get_numeric_data_fields(exp_dict)
    for field in value_fields:
        add_plot_set(
              exp_dict
            , name=field+'-{INS_DEL_FRAC}-{MAXKEY}k.png'
            , title='{INS_DEL_FRAC} {MAXKEY}k: '+field
            , varying_cols_list=['MAXKEY', 'INS_DEL_FRAC']
            , series='DS_TYPENAME'
            , x_axis='TOTAL_THREADS'
            , y_axis=field
            , plot_type='bars'
        )

    ## and also add a page_set to show all plots
    add_page_set(
          exp_dict
        , image_files='{row_field}-{INS_DEL_FRAC}-{MAXKEY}k.png'
        , name='comparison'
        , column_field='INS_DEL_FRAC'
        , row_field=value_fields
        , page_field_list=['MAXKEY']
        , legend_file='legend.png'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-dpw')

## Viewing the results

In [None]:
import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
show_html('data/index.html')

# Defining a `--testing` mode
## Briefly running each configuration *BEFORE* doing a full run

i often find it useful to have a `testing` mode (enabled with argument `--testing`), that runs for less time, but still explores all (important) configurations of run parameters, to make sure nothing simple will fail when i run for many hours. (fail-fast is good!)

to this end, a variable called `args.testing` is accessible in `define_experiment`, and if it's `True`, then the user has passed `--testing` as a command line arg.

the correct response to this is to limit the set of configurations somehow, perhaps by reducing the number of thread counts, and/or reducing the length of time to execute in each trial, and/or limiting runs to a single trial, and/or eliminating data structure prefilling (or anything else that you find appropriate).
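
to get a feel for how much such a reduction saves, the sketch below counts the runs in a full configuration space and in a reduced one. it is a standalone illustration: `build_params` and `count_runs` are hypothetical helpers, `SimpleNamespace` merely stands in for the `args` object, and the values mirror the ones used in this section.

```python
from math import prod
from types import SimpleNamespace

def build_params(args):
    """Return (run parameters, milliseconds per trial), reduced when args.testing is set."""
    params = {
        '__trials': [1, 2],
        'TOTAL_THREADS': [1, 2, 4, 8],
        'DS_TYPENAME': ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'],
        'MAXKEY': [20000, 200000],
        'INS_DEL_FRAC': ['0.0 0.0', '5.0 5.0'],
    }
    millis_to_run = 1000
    if args.testing:
        params['__trials'] = [1]
        params['TOTAL_THREADS'] = [1, 8]
        millis_to_run = 100
    return params, millis_to_run

def count_runs(params):
    """Number of runs = product of the number of values of every parameter."""
    return prod(len(values) for values in params.values())

for testing in (False, True):
    params, millis = build_params(SimpleNamespace(testing=testing))
    runs = count_runs(params)
    # measured time only; prefilling, startup and teardown come on top of this
    print('testing' if testing else 'full', runs, 'runs,', runs * millis / 1000.0, 'seconds measured')
```

the full space has 2 * 4 * 3 * 2 * 2 = 96 runs, while the reduced one has 1 * 2 * 3 * 2 * 2 = 24, each a tenth as long.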

for example, let's add a simple `--testing` mode to the previous code cell.

note the `if args.testing:` block, as well as the `--testing` argument passed to `run_in_jupyter` in `cmdline_args`. (we also bring back the `-r` argument, which the previous two cells omitted, since we want to actually run our testing mode.)

observe that this new `--testing` mode takes around 20 seconds to run, compared to several minutes without specifying `--testing`. (this time difference becomes much more drastic if you would normally run more trials, thread counts, or for longer than 1 second. :)) 

i make it a habit to run in `--testing` mode and take a quick peek at the results before running my full experiments.

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools')
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ["0.0 0.0", "5.0 5.0"])

    millis_to_run = 1000

    ## define a reduced set of configurations for testing mode
    if args.testing:
        add_run_param    (exp_dict, '__trials', [1])
        add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 8])
        millis_to_run = 100

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k {MAXKEY} -t ' + str(millis_to_run))

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)
    add_data_field   (exp_dict, 'tree_stats_numNodes', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_height', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_avgKeyDepth', coltype='REAL')
    add_data_field   (exp_dict, 'global_epoch_counter', coltype='INTEGER')
    add_data_field   (exp_dict, 'PAPI_L2_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_L3_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_CYC', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_INS', coltype='REAL')

    ## render one legend for all plots
    add_plot_set(exp_dict, name='legend.png', series='DS_TYPENAME', x_axis='TOTAL_THREADS', y_axis='total_throughput', plot_type='bars', plot_cmd_args='--legend-only --legend-columns 3')

    ## render plots
    value_fields = get_numeric_data_fields(exp_dict)
    for field in value_fields:
        add_plot_set(
              exp_dict
            , name=field+'-{INS_DEL_FRAC}-{MAXKEY}k.png'
            , title='{INS_DEL_FRAC} {MAXKEY}k: '+field
            , varying_cols_list=['MAXKEY', 'INS_DEL_FRAC']
            , series='DS_TYPENAME'
            , x_axis='TOTAL_THREADS'
            , y_axis=field
            , plot_type='bars'
        )

    ## and also add a page_set to show all plots
    add_page_set(
          exp_dict
        , image_files='{row_field}-{INS_DEL_FRAC}-{MAXKEY}k.png'
        , name='comparison'
        , column_field='INS_DEL_FRAC'
        , row_field=value_fields
        , page_field_list=['MAXKEY']
        , legend_file='legend.png'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='--testing -rdpw')

## Viewing the `--testing` mode results

In [None]:
import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
show_html('data/index.html')

# Custom output filename patterns

in the experiments above, we have always used the default filename for output files: `dataXXXXXX.txt`.

if you want a different file naming scheme, it's easy to specify a pattern for this using `set_file_data(exp_dict, pattern)`.

let's see an example of this, where we include the current values of several `run_param`s in the output file pattern.

(you can also set the output directory with `set_dir_data(exp_dict, path)`.)

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools')
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ["0.0 0.0", "5.0 5.0"])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k {MAXKEY} -t 1000')

    set_file_data    (exp_dict, 'my_data_n{TOTAL_THREADS}_k{MAXKEY}_insdel{INS_DEL_FRAC}_{DS_TYPENAME}.txt')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='--testing -rdpw')

# Automatic best-effort sanity checks

the data framework does its best to identify some basic mistakes that are common when running repeated experiments over a large configuration space. we describe some of them here, and show how they work.

for example, observe that the following `define_experiment` function attempts to plot `TOTAL_THREADS` on the x-axis, `total_throughput` on the y-axis, with `DS_TYPENAME` as the series, but completely ignores `MAXKEY` in the `add_plot_set` call.

this is a mistake, as this would result in `averaging` unrelated data points with two *different* values of `MAXKEY`.
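
one way such a check *could* work is sketched below: look for run parameters that take more than one value in the data but are not used as the series, the x-axis, or a varying column. this is an illustration only, not the framework's actual check; `unaccounted_columns` is a hypothetical helper, and `__trials` is deliberately left out of `run_params` on the assumption that averaging over trials is intended.

```python
def unaccounted_columns(rows, run_params, series, x_axis, varying_cols_list=()):
    """Return run_params that vary across `rows` but are not used by the plot."""
    used = {series, x_axis, *varying_cols_list}
    problems = []
    for col in run_params:
        if col in used:
            continue
        if len({row[col] for row in rows}) > 1:
            problems.append(col)  # distinct values would be averaged together
    return problems

# made-up rows covering the same configuration space as the experiment below
rows = [
    {'DS_TYPENAME': ds, 'TOTAL_THREADS': n, 'MAXKEY': k, '__trials': t}
    for ds in ('brown_ext_ist_lf', 'brown_ext_abtree_lf')
    for n in (1, 8)
    for k in (20000, 200000)
    for t in (1, 2)
]
run_params = ['DS_TYPENAME', 'TOTAL_THREADS', 'MAXKEY']

print(unaccounted_columns(rows, run_params, 'DS_TYPENAME', 'TOTAL_THREADS'))
print(unaccounted_columns(rows, run_params, 'DS_TYPENAME', 'TOTAL_THREADS', ['MAXKEY']))
```

the first call reports `MAXKEY` as the problem column; the second, with `MAXKEY` listed as a varying column, reports nothing.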

run the following code cell to see the detailed error message that results in this situation. it tries to make the cause as easy as possible to diagnose: in this case it essentially identifies and highlights the problematic column (`MAXKEY`) *for you*, and suggests a fix (adding it to the `varying_cols_list` argument when calling `add_plot_set`).

of course, just because something plots successfully doesn't mean you haven't made a mistake... but we do our best to catch a variety of simple mistakes. (or at least assert and fail-fast when *some* sensible assumptions are violated.)

In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools')
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel 0.5 0.5 -k {MAXKEY} -t 100')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)

    add_plot_set(
          exp_dict
        , name='throughput.png'
        , title='throughput'
        , series='DS_TYPENAME'
        , x_axis='TOTAL_THREADS'
        , y_axis='total_throughput'
        , plot_type='bars'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rdp', error_exit_code=0)

# Automatic archival features
## (data zip, git commit hash fetch, git diff file archive)

activated with command line arg: `-z` (which stands for `zip creation`)

the data framework offers a powerful convenience for archiving your experiments: it can automatically ZIP *as little data as is needed* to guarantee you won't lose the ability to return to this exact code/data state (file/directory structure).

how does it do this?

well, assuming you are working in a git repository, and are committing changes as you go, the repository's current `commit hash` presumably gives you a way to get *pretty close* to your current file/directory structure.

but of course, it will be missing any changes you've made since your last commit! this includes all of the data you've just generated, as well as any tentative code changes you've made (perhaps experimental changes you're currently testing).

happily, we can *extract* the list of files you've changed *since your last commit* directly with a `git` command: `git status -s | awk '{if ($1 != "D") print $2}' | grep -v "/$"`

so, we do this, and then once we have this list of files, we selectively add *them* to a ZIP file along with the data directory we just produced, as well as the file `output_log.txt`.
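
for readers less familiar with `awk`, the filtering done by that pipeline can be sketched in python as follows. the function takes the text of `git status -s` as a string so that it can be tried without a repository; `changed_files` is a hypothetical helper, the sample status text is made up, and this is not the code `run_experiment.py` actually runs.

```python
def changed_files(status_text):
    """Mimic: git status -s | awk '{if ($1 != "D") print $2}' | grep -v "/$"

    Keeps the second whitespace-separated field of each line, skipping
    deleted files (status "D") and directories (paths ending in "/").
    """
    files = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue              # no path on this line (the shell pipeline would print an empty line)
        status, path = parts[0], parts[1]
        if status == 'D':
            continue              # deleted: nothing left to archive
        if path.endswith('/'):
            continue              # untracked directory entry
        files.append(path)
    return files

# made-up example of `git status -s` output
status_text = "\n".join([
    " M tools/data_framework/run_experiment.py",
    " D old_notes.txt",
    "?? microbench_experiments/new_dir/",
    "?? myexp.py",
])
print(changed_files(status_text))
```

modified and untracked files are kept, while the deleted file and the directory entry are dropped.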

crucially, any files that are ignored by `git` (because they are covered by a pattern in your `.gitignore` file) will *NOT* be added to the ZIP file. this makes it easy to automatically exclude files that you wouldn't want in your repo anyway. (normally the `data` folder produced by your experiments would probably fall into that category, but we add it manually. if you want to add more files manually, see the function `do_finish` in `run_experiment.py`.)

this whole process should make it easier to achieve a *much* smaller file size for an archive that you *can* reconstruct to reproduce experiments. this smaller file size *should* make it feasible to archive *every* set of experiments you run by default, along with enough information to understand exactly what was run, later. (and, you should only occasionally have to clean up your archives.) 

this can help you eliminate one of the questions we all *hate* asking: `what on earth did we run to get these results?`

to help you reconstruct your current file/directory state later, we dump all relevant information about the `current commit`, including the `commit hash`, to `output_log.txt` before we add it to the ZIP. you can find this information about the commit by looking for `'git_status:'` or `'commit_hash='` in `output_log.txt`.

for example, the `define_experiment` cell at the end of this section causes text along the following lines to be archived as part of `output_log.txt`:

    ## ## Fetching git status and any uncommitted changes for archival purposes
    ## 
    ## commit_hash=05ec0e2184bd8c7a30e22457483cbeeadd0c2461
    ## git_status:
    ## On branch data_framework
    ## Your branch is up to date with 'origin/data_framework'.
    ## 
    ## Changes not staged for commit:
    ##   (use "git add <file>..." to update what will be committed)
    ##   (use "git checkout -- <file>..." to discard changes in working directory)
    ##   (commit or discard the untracked or modified content in submodules)
    ## 
    ## 	modified:   .vscode/settings.json
    ## 	modified:   microbench_experiments/tutorial/tutorial.ipynb
    ## 	modified:   microbench_experiments/tutorial/tutorial_extra.ipynb
    ## 	modified:   tools (new commits, modified content)
    ## 
    ## no changes added to commit (use "git add" and/or "git commit -a")
    ## 
    ## diff_files=['.vscode/settings.json', 'microbench_experiments/tutorial/tutorial.ipynb', 'microbench_experiments/tutorial/tutorial_extra.ipynb', 'tools']
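
if you later want to pull these values back out of an archived log, something along the lines of the sketch below should do. it assumes the log contains lines shaped like the excerpt above (a 40-character hex `commit_hash=` and a python-style list after `diff_files=`); `parse_archive_info` is a hypothetical helper, and the sample text is an abbreviated version of that excerpt.

```python
import ast
import re

def parse_archive_info(log_text):
    """Extract (commit hash, list of changed files) from log text like the excerpt above."""
    commit = re.search(r'commit_hash=([0-9a-f]{40})', log_text)
    diff = re.search(r'diff_files=(\[.*\])', log_text)
    return (
        commit.group(1) if commit else None,
        ast.literal_eval(diff.group(1)) if diff else [],  # parse the list literal safely
    )

# abbreviated sample, shaped like the excerpt above
log_text = "\n".join([
    "## commit_hash=05ec0e2184bd8c7a30e22457483cbeeadd0c2461",
    "## git_status:",
    "## diff_files=['.vscode/settings.json', 'tools']",
])
commit_hash, diff_files = parse_archive_info(log_text)
print(commit_hash)
print(diff_files)
```

with the hash in hand, checking out that commit and then unpacking the ZIP over it should get you back to the archived state.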

## on my system, the following code produces an archive smaller than `3MB`, which offers complete reproducibility (and even includes 37 generated plots), despite the entire contents of setbench reaching `140MB`!


In [None]:
def define_experiment(exp_dict, args):
    set_dir_tools    (exp_dict, os.getcwd() + '/../../tools')
    set_dir_compile  (exp_dict, os.getcwd() + '/../../microbench')
    set_dir_run      (exp_dict, os.getcwd() + '/../../microbench/bin')
    set_cmd_compile  (exp_dict, 'make -j6')

    add_run_param    (exp_dict, '__trials', [1, 2])
    add_run_param    (exp_dict, 'TOTAL_THREADS', [1, 2, 4, 8])
    add_run_param    (exp_dict, 'DS_TYPENAME', ['brown_ext_ist_lf', 'brown_ext_abtree_lf', 'bronson_pext_bst_occ'])
    add_run_param    (exp_dict, 'MAXKEY', [20000, 200000])
    add_run_param    (exp_dict, 'INS_DEL_FRAC', ["0.0 0.0", "5.0 5.0"])

    set_cmd_run      (exp_dict, 'LD_PRELOAD=../../lib/libjemalloc.so numactl --interleave=all time ./{DS_TYPENAME}.debra -nwork {TOTAL_THREADS} -nprefill {TOTAL_THREADS} -insdel {INS_DEL_FRAC} -k {MAXKEY} -t 1000')

    add_data_field   (exp_dict, 'total_throughput', coltype='INTEGER', validator=is_positive)
    add_data_field   (exp_dict, 'tree_stats_numNodes', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_height', coltype='INTEGER')
    add_data_field   (exp_dict, 'tree_stats_avgKeyDepth', coltype='REAL')
    add_data_field   (exp_dict, 'global_epoch_counter', coltype='INTEGER')
    add_data_field   (exp_dict, 'PAPI_L2_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_L3_TCM', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_CYC', coltype='REAL')
    add_data_field   (exp_dict, 'PAPI_TOT_INS', coltype='REAL')

    ## render one legend for all plots
    add_plot_set(exp_dict, name='legend.png', series='DS_TYPENAME', x_axis='TOTAL_THREADS', y_axis='total_throughput', plot_type='bars', plot_cmd_args='--legend-only --legend-columns 3')

    ## render plots
    value_fields = get_numeric_data_fields(exp_dict)
    for field in value_fields:
        add_plot_set(
              exp_dict
            , name=field+'-{INS_DEL_FRAC}-{MAXKEY}k.png'
            , title='{INS_DEL_FRAC} {MAXKEY}k: '+field
            , varying_cols_list=['MAXKEY', 'INS_DEL_FRAC']
            , series='DS_TYPENAME'
            , x_axis='TOTAL_THREADS'
            , y_axis=field
            , plot_type='bars'
        )

    ## and also add a page_set to show all plots
    add_page_set(
          exp_dict
        , image_files='{row_field}-{INS_DEL_FRAC}-{MAXKEY}k.png'
        , name='comparison'
        , column_field='INS_DEL_FRAC'
        , row_field=value_fields
        , table_field='MAXKEY'
        , legend_file='legend.png'
    )

import sys ; sys.path.append('../../tools/data_framework') ; from run_experiment import *
run_in_jupyter(define_experiment, cmdline_args='-rdpwz')