# Benchmark Analysis of Two Analysis-Ready HDF5 Files with Alcator C-Mod Data

This notebook analyzes performance when reading shot data from two analysis-ready HDF5 files. The term _analysis-ready_ implies that the files were generated with specific features that are expected to improve data access performance. The files were:

| File name | MD5 checksum | File size |
|---|---| --- |
| `shot500_sig70.hdf5` | `953cf91887098a0d6aa6e85acc650c62` | 3.7 GB |
| `shot500_sig70_reorder.hdf5` | `90d25f6fddbdf5516f5ed712fbae12a6` | 3.7 GB |

## About the Files

Both files hold the same 500 shots with their 70 signals from the C-Mod MDSplus store at MIT. The files were created with the following properties:

* Cloud-optimized using paged aggregation file space management with a file page size of 8 MB.
* Internal file metadata in both files takes up 9 file pages.
* Both files have the same HDF5 group hierarchy: `/shots/<SHOT_ID>/signals/<SIGNAL_NAME>`.
* The MDSplus `dim_of` data are stored as HDF5 dimension scale datasets, attached to the appropriate dimensions of signal HDF5 datasets.
* Only unique dimension scales were stored by comparing the MD5 checksum of dimension scale values. This means there could be multiple signals that share the same dimension scale.
* The dimension scale datasets are in the `/shots/<SHOT_ID>` groups, so each shot's signals come with their own set of dimension scales.
* HDF5 datasets (dimension scales and signals) with a total size greater than 8 kB are compressed with zlib at level 4.
* The two files differ in how the data are actually laid out. In the `shot500_sig70.hdf5` file, the data are written in the order of the HDF5 group hierarchy: all data for one shot, followed by all data for the next shot. In the `shot500_sig70_reorder.hdf5` file, the data are written per signal for all the shots while still retaining the same group hierarchy. The different data layouts were chosen to gauge the impact of data adjacency in the file pages on performance. The first file's data layout could be described as "one shot, all signals", while the second file's data layout is "one signal, all shots".
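
The two layouts can be pictured as two orderings of the same (shot, signal) pairs. The sketch below is only an illustration: the shot IDs and signal names are made up, and this is not the code that wrote the files.

```python
from itertools import product

# Hypothetical shot IDs and signal names, for illustration only.
shots = [1001, 1002, 1003]
signals = ["ip", "bt", "nl"]

# "One shot, all signals": data are written shot by shot.
shot_layout = [(shot, sig) for shot, sig in product(shots, signals)]

# "One signal, all shots": data are written signal by signal.
signal_layout = [(shot, sig) for sig, shot in product(signals, shots)]


def hdf5_path(shot, sig):
    """Both layouts keep the same group hierarchy for every dataset."""
    return f"/shots/{shot}/signals/{sig}"


print([hdf5_path(*pair) for pair in shot_layout[:3]])
print([hdf5_path(*pair) for pair in signal_layout[:3]])
```

Both lists contain the same datasets; only the order in which neighbouring datasets land in the file pages differs.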

## Benchmarks

The benchmark cases were created by combining the following parameters:

* Two HDF5 files.
* File locations: local file system or S3.
* File data read with 1, 2, 4, 8, or 16 Dask workers.
* HDF5 library page cache size of zero (off, for local files only), 64 MB (holds up to 8 file pages), or 264 MB (holds up to 33 file pages).
* Two different ways of reading all data in the files: (1) per shot, all signals, or (2) per signal from all shots. Dimension scales were read with their signal data.
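
As a sanity check, the parameter combinations above can be enumerated. This is a minimal sketch, not the benchmark driver itself; the labels follow the ones used later in this notebook, and it assumes each Dask worker reports exactly one row per benchmark run.

```python
from itertools import product

files = ["shot layout", "signal layout"]
workers = [1, 2, 4, 8, 16]
access_orders = ["shots", "signals"]
# Page cache sizes: "off" was used for local files only.
cache_sizes = {"Local": ["off", "64MB", "264MB"], "S3": ["64MB", "264MB"]}

cases = {
    where: list(product(files, sizes, access_orders, workers))
    for where, sizes in cache_sizes.items()
}

for where, combos in cases.items():
    # One report per Dask worker, so a run with n workers contributes n rows.
    rows = sum(n for *_, n in combos)
    print(where, len(combos), "benchmark runs,", rows, "worker reports")
```

The resulting counts (60 and 40 benchmark runs, 372 and 248 worker reports) agree with the row counts that `info()` and `describe()` report further down in this notebook.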

All benchmarks were run on an AWS EC2 `m5.4xlarge` instance with 16 vCPUs (8 physical cores) and 64 GB of memory. The order of operations in every benchmark run was:

1. Open the file and gather information about all signal HDF5 datasets and their dimension scales. This uses the libhdf5 method that guarantees finding all objects in a file exactly once.
1. Organize the discovered signal information based on the access type: per shot all signals, or per signal all shots.
1. Create a data access plan by randomizing the order of shots or signals to be read from the file.
1. Dask workers are created as separate processes.
1. The data access plan is equally divided across the Dask workers.
1. Each Dask worker opens the file with the benchmark-specific file page cache size and reads all the data from specific signals and their dimension scales.
1. Each Dask worker reports collected benchmark data when finished with its job.
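
The randomization and division of the data access plan can be sketched as follows. How the benchmark actually split the plan is not shown in this notebook, so the round-robin split and the `make_access_plan` helper are assumptions made for illustration.

```python
import random


def make_access_plan(objects, num_workers, seed=None):
    """Shuffle the objects and split them as evenly as possible."""
    plan = list(objects)
    random.Random(seed).shuffle(plan)
    # Round-robin slicing keeps the chunk sizes within one item of each other.
    return [plan[i::num_workers] for i in range(num_workers)]


# Hypothetical shot IDs 0..499 divided across 16 workers.
chunks = make_access_plan(range(500), num_workers=16, seed=42)
print([len(chunk) for chunk in chunks])
```

Each chunk would then be handed to one Dask worker, which opens the file itself and reads only the shots or signals in its chunk.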

## TL;DR Conclusions

What is learned from the benchmark data:

* The fastest time is 6.7 seconds for local files and 16.3 seconds for S3 files. Both times are with 16 workers and the shot access order. The S3 time is for the shot layout file; for local files, the shot layout and signal layout files are practically tied (6.72 vs 6.68 seconds).
* Parallelizing data reading improves overall performance, but the gain increases linearly with the number of workers only for the S3 files. The maximum speed-up with 16 workers is ~10 times for local files and ~16 times for S3 files. Adding more workers for files in S3 may achieve further linear performance improvement.
* Laying out the file's data by shot and reading the data by shot (shot/shot) yields the best (fastest) results. The second tier of results is for the signal file layout read by signal (signal/signal). The mixed cases, where the file data layout and the data access order differ, mostly gave the worst results.
* Why the signal/signal results are not always comparable to the shot/shot ones is not yet well understood, but it may mean the signal-layout file is still not optimized enough for the signal data access order.
* Caching file pages does not improve performance for local files but is absolutely required for S3 files. The larger the cache, the better.
* Performance when reading from files in S3 can be made comparable to local files, but this requires parallelizing data access and using all the caching capabilities of libhdf5.
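
One way to put the two speed-up figures side by side is to divide each by the number of workers. The term "parallel efficiency" is not used elsewhere in this notebook; the helper below is a small illustrative sketch using the approximate speed-ups quoted in the list above.

```python
def parallel_efficiency(speedup, num_workers):
    """Fraction of the ideal linear speed-up that was actually achieved."""
    return speedup / num_workers


# Approximate maximum speed-ups quoted above, both with 16 workers.
for where, speedup in {"Local": 10, "S3": 16}.items():
    print(f"{where}: {parallel_efficiency(speedup, 16):.0%} of linear scaling")
```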

---

## Benchmark Data Analysis

In [1]:
import pandas as pd
import hvplot.pandas  # noqa: F401
import holoviews as hv

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

hv.extension("bokeh")

Read benchmark data for the HDF5 files in a local file system:

In [2]:
lc_data = pd.read_csv("./ec2-local.csv")
lc_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   open-file-time    372 non-null    float64
 1   pb-size           372 non-null    int64  
 2   read-data-time    372 non-null    float64
 3   wrkr-num-objs     372 non-null    int64  
 4   mean-obj-time     372 non-null    float64
 5   num-dsets         372 non-null    int64  
 6   mean-dset-time    372 non-null    float64
 7   worker#           372 non-null    int64  
 8   num-workers       372 non-null    int64  
 9   file              372 non-null    object 
 10  obj-type          372 non-null    object 
 11  tot-num-obj       372 non-null    int64  
 12  gather-time       372 non-null    float64
 13  tot-runtime       372 non-null    float64
 14  pb-meta-accesses  248 non-null    float64
 15  pb-meta-hitrate   248 non-null    float64
 16  pb-meta-evicts    248 non-null    float64
 1

Read benchmark data for the HDF5 files in S3:

In [3]:
s3_data = pd.read_csv("./ec2-s3.csv")
s3_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   open-file-time    248 non-null    float64
 1   pb-size           248 non-null    int64  
 2   pb-meta-accesses  248 non-null    int64  
 3   pb-meta-hitrate   248 non-null    float64
 4   pb-meta-evicts    248 non-null    int64  
 5   pb-raw-accesses   248 non-null    int64  
 6   pb-raw-hitrate    248 non-null    float64
 7   pb-raw-evicts     248 non-null    int64  
 8   read-data-time    248 non-null    float64
 9   wrkr-num-objs     248 non-null    int64  
 10  mean-obj-time     248 non-null    float64
 11  num-dsets         248 non-null    int64  
 12  mean-dset-time    248 non-null    float64
 13  worker#           248 non-null    int64  
 14  num-workers       248 non-null    int64  
 15  file              248 non-null    object 
 16  obj-type          248 non-null    object 
 1

Show the HDF5 files used:

In [4]:
lc_data["file"].drop_duplicates()

0            ../data/shot500_sig70.hdf5
1    ../data/shot500_sig70_reorder.hdf5
Name: file, dtype: object

In [5]:
s3_data["file"].drop_duplicates()

0    s3://psfchdf5/mw_files/v4_files/shot500_sig70....
1    s3://psfchdf5/mw_files/v4_files/shot500_sig70_...
Name: file, dtype: object

Page cache sizes used in the benchmarks:

In [6]:
lc_data["pb-size"].unique()

array([        0,  64000000, 264000000])

In [7]:
s3_data["pb-size"].unique()

array([ 64000000, 264000000])

Replace the file names and page cache sizes with simpler values so it will be easier to work with the data:

In [8]:
lc_data.replace(
    {
        "file": {
            "../data/shot500_sig70.hdf5": "shot layout",
            "../data/shot500_sig70_reorder.hdf5": "signal layout",
        },
        "pb-size": {0: "off", 64000000: "64MB", 264000000: "264MB"},
    },
    inplace=True,
)

s3_data.replace(
    {
        "file": {
            "s3://psfchdf5/mw_files/v4_files/shot500_sig70.hdf5": "shot layout",
            "s3://psfchdf5/mw_files/v4_files/shot500_sig70_reorder.hdf5": "signal layout",
        },
        "pb-size": {0: "off", 64000000: "64MB", 264000000: "264MB"},
    },
    inplace=True,
)

Replaced values are:

In [9]:
lc_data["file"].drop_duplicates()

0      shot layout
1    signal layout
Name: file, dtype: object

In [10]:
s3_data["file"].drop_duplicates()

0      shot layout
1    signal layout
Name: file, dtype: object

In [11]:
lc_data["pb-size"].drop_duplicates()

0        off
124     64MB
248    264MB
Name: pb-size, dtype: object

In [12]:
s3_data["pb-size"].drop_duplicates()

0       64MB
124    264MB
Name: pb-size, dtype: object

Columns `obj-type` and `tot-num-obj` describe data access type and the total number of _objects_ accessed during one benchmark run:

* `obj-type = shots`: the data access order was by shot, reading all signals and their dimension scales of a single shot, then proceeding to another, for a total of 500 shots.
* `obj-type = signals`: the data access order was by signal and its dimension scales for all the shots, then proceeding to another signal, for a total of 62 signals from all the shots.


In [13]:
lc_data[["obj-type", "tot-num-obj"]].drop_duplicates()

|   | obj-type | tot-num-obj |
|---|---|---|
| 0 | signals | 62 |
| 2 | shots | 500 |


In [14]:
s3_data[["obj-type", "tot-num-obj"]].drop_duplicates()

|   | obj-type | tot-num-obj |
|---|---|---|
| 0 | signals | 62 |
| 2 | shots | 500 |


### Total Runtime

Total benchmark runtime is the elapsed time of the entire benchmark as measured by the main process. The measured time encompasses:
1. Dividing the data access plan across Dask workers and their initialization.
1. Waiting for all Dask workers to complete their jobs.
1. Collecting Dask worker benchmark data.

These data are in the `tot-runtime` column.

Below are two DataFrames with total runtimes for the local and S3 file benchmarks. They include several relevant columns from the original DataFrames plus a new column `norm-tot-runtime`. The new column holds computed performance ratios to the _baseline_ benchmark. The baseline benchmark is one of the available benchmarks selected because it represents the most common set of libhdf5 and compute settings. The baseline benchmarks for the two file location cases are:

* Local files: 1 Dask worker, no file page cache, shot file layout, reading by shots
* S3 files: 1 Dask worker, 264 MB file page cache, shot file layout, reading by shots

In [15]:
lc_runtime = lc_data[
    ["pb-size", "file", "obj-type", "num-workers", "tot-runtime"]
].drop_duplicates(ignore_index=True)
lc_runtime["where"] = "Local"
lc_runtime["norm-tot-runtime"] = (
    lc_runtime.loc[2, "tot-runtime"] / lc_runtime["tot-runtime"]
)

s3_runtime = s3_data[
    ["pb-size", "file", "obj-type", "num-workers", "tot-runtime"]
].drop_duplicates(ignore_index=True)
s3_runtime["where"] = "S3"
s3_runtime["norm-tot-runtime"] = (
    s3_runtime.loc[22, "tot-runtime"] / s3_runtime["tot-runtime"]
)
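
The baseline rows in the cell above are picked by their index labels (2 and 22), which would select a different benchmark if the row order ever changed. A hedged alternative is to select the baseline by its parameter values. The sketch below uses a tiny made-up DataFrame with the same column names; `add_norm_runtime` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd


def add_norm_runtime(df, pb_size, file="shot layout", obj_type="shots", workers=1):
    """Return a copy of df with baseline runtime / runtime as a new column."""
    mask = (
        (df["pb-size"] == pb_size)
        & (df["file"] == file)
        & (df["obj-type"] == obj_type)
        & (df["num-workers"] == workers)
    )
    baseline = df.loc[mask, "tot-runtime"]
    if len(baseline) != 1:
        raise ValueError(f"expected exactly one baseline row, found {len(baseline)}")
    out = df.copy()
    out["norm-tot-runtime"] = baseline.iloc[0] / out["tot-runtime"]
    return out


# Made-up runtimes in seconds, for illustration only.
toy = pd.DataFrame(
    {
        "pb-size": ["off", "off", "off"],
        "file": ["shot layout"] * 3,
        "obj-type": ["shots"] * 3,
        "num-workers": [1, 2, 4],
        "tot-runtime": [80.0, 40.0, 25.0],
    }
)
toy_norm = add_norm_runtime(toy, pb_size="off")
print(toy_norm[["num-workers", "norm-tot-runtime"]])
```

With the real data, the calls would be along the lines of `add_norm_runtime(lc_runtime, pb_size="off")` and `add_norm_runtime(s3_runtime, pb_size="264MB")`, matching the two baselines described above.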

Plots of normalized runtimes for the local and S3 file cases:

In [16]:
plot_kwargs = {
    "x": "num-workers",
    "y": "norm-tot-runtime",
    "by": ["pb-size", "obj-type", "file"],
}
(
    lc_runtime.hvplot.line(**plot_kwargs)
    * lc_runtime.hvplot.scatter(**plot_kwargs)
    * hv.HLine(1).opts(line_width=0.5, color="pink")
    * hv.Slope(slope=1, y_intercept=0).opts(color="pink", line_width=0.5)
).options(
    # legend_position="bottom_right",
    title="Local files",
    xlabel="Number of Dask workers",
    ylabel="Performance ratio (>1 better)",
    xlim=(0, 17),
    ylim=(0, None),
    height=400,
    width=800,
)

Benchmark performance for the local files is split into three distinct groups. The best performance is for the "off, shots, shot layout" and "off, shots, signal layout" benchmarks. The second group consists of the "64MB, shots, shot layout" and "264MB, shots, shot layout" benchmarks. The rest of the benchmarks form the slowest group. For the first two groups, the impact of more Dask workers is almost linear up to 8 workers and then diminishes. The third group shows much less improvement from additional workers.

In [17]:
plot_kwargs = {
    "x": "num-workers",
    "y": "norm-tot-runtime",
    "by": ["pb-size", "obj-type", "file"],
}
(
    s3_runtime.hvplot.line(**plot_kwargs)
    * s3_runtime.hvplot.scatter(**plot_kwargs)
    * hv.HLine(1).opts(line_width=0.5, color="pink")
    * hv.Slope(slope=1, y_intercept=0).opts(color="pink", line_width=0.5)
).options(
    # legend_position="top_right",
    title="S3 files",
    xlabel="Number of Dask workers",
    ylabel="Performance ratio (>1 better)",
    xlim=(0, 17),
    height=400,
    width=900,
)

The benchmark performance when the files are in S3 falls into two groups: one group improves linearly with the number of Dask workers, while the other group improves much more slowly for the same number of Dask workers. The best-performing benchmarks are "264MB, shots, shot layout" and "64MB, shots, shot layout".

Combine the local and S3 runtime DataFrames into one for easier analysis:

In [18]:
all_runtime = pd.concat([lc_runtime, s3_runtime], axis=0, ignore_index=False)
all_runtime.drop("norm-tot-runtime", axis=1, inplace=True)

The combined plot of the total runtime for all benchmarks, grouped by four parameters (file location, page cache size, data access order, and file layout):

In [19]:
plot_kwargs = {
    "x": "num-workers",
    "y": "tot-runtime",
    "by": ["where", "pb-size", "obj-type", "file"],
}
(
    all_runtime.hvplot.line(**plot_kwargs) * all_runtime.hvplot.scatter(**plot_kwargs)
).options(
    # show_legend=False,
    # legend_position="top_right",
    title="Runtimes for Local and S3 Files",
    xlabel="Number of Dask workers",
    ylabel="Runtime / [seconds]",
    xlim=(0, 17),
    logy=True,
    show_grid=True,
    height=650,
    width=800,
)

The benchmark runtimes are shown in the above plot to illustrate the impact of accessing data in a cloud object store compared to the same data locally. The benefit of optimizing file settings and parallelizing data access with Dask workers is apparent, with an improvement of two orders of magnitude. The runtimes generally fall into three groups regardless of the number of Dask workers.

---

The top 15 fastest benchmarks (the `tot-runtime` column) are:

In [20]:
# all_runtime.groupby(["pb-size", "file", "obj-type"]).apply(
#     lambda x: x.sort_values("tot-runtime"), include_groups=False
# )

all_runtime.sort_values(by="tot-runtime").reset_index(drop=True).head(15)

|   | pb-size | file | obj-type | num-workers | tot-runtime | where |
|---|---|---|---|---|---|---|
| 0 | off | signal layout | shots | 16 | 6.676796 | Local |
| 1 | off | shot layout | shots | 16 | 6.724144 | Local |
| 2 | 264MB | shot layout | shots | 16 | 7.602094 | Local |
| 3 | 64MB | shot layout | shots | 16 | 7.682992 | Local |
| 4 | off | signal layout | shots | 8 | 9.032351 | Local |
| 5 | off | shot layout | shots | 8 | 9.069877 | Local |
| 6 | 264MB | shot layout | shots | 8 | 10.171084 | Local |
| 7 | 64MB | shot layout | shots | 8 | 10.528344 | Local |
| 8 | 264MB | shot layout | shots | 16 | 16.268789 | S3 |
| 9 | off | shot layout | shots | 4 | 16.910447 | Local |


The best times are predominantly for the local file case (`where = Local`), as expected. The best S3 file benchmark is placed 9th (`index = 8`). All the local file benchmarks ahead of it use either 16 or 8 Dask workers (`num-workers = 16` or `num-workers = 8`). Eight of the listed benchmarks are with 16 workers regardless of the file location.

The data access order of the displayed benchmarks is all `shots` (`obj-type = shots`) but one, the last, which is for `obj-type = signals`. This is an interesting result indicating that for some reason the signal layout file was not optimized well for the signal data access order.

The fastest S3 file benchmark takes ~2.4 times as long as the fastest local file benchmark, but is still quite acceptable in absolute terms (16.3 vs 6.7 seconds). The local file benchmark with the same parameters as the best S3 one placed 3rd (`index = 2`) and takes about half the time (7.6 vs 16.3 seconds).
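
The comparison can be recomputed directly from the runtimes in the table above, expressed as the S3 runtime divided by the local runtime. The variable names are chosen here for illustration only.

```python
# Total runtimes in seconds, copied from the table of fastest benchmarks.
fastest_local = 6.676796   # off, signal layout, shots, 16 workers
matching_local = 7.602094  # 264MB, shot layout, shots, 16 workers
fastest_s3 = 16.268789     # 264MB, shot layout, shots, 16 workers

print(f"S3 vs fastest local:  {fastest_s3 / fastest_local:.2f}x the runtime")
print(f"S3 vs matching local: {fastest_s3 / matching_local:.2f}x the runtime")
```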

The first and second placed benchmarks are so close that if these benchmarks were repeated the order could have been different. We base this assumption on the predominance of the shot file layout and shot data access order benchmarks in the above top list.

The descriptive statistics of the benchmarks for the two file locations (`where = Local` and `where = S3`):

In [21]:
all_runtime[all_runtime["where"] == "Local"]["tot-runtime"].describe()

count     60.000000
mean      53.114493
std       48.342431
min        6.676796
25%       21.192890
50%       35.586112
75%       68.046896
max      213.538156
Name: tot-runtime, dtype: float64

In [22]:
all_runtime[all_runtime["where"] == "S3"]["tot-runtime"].describe()

count      40.000000
mean      581.272330
std       717.017434
min        16.268789
25%       131.097705
50%       270.932446
75%       675.492870
max      2615.098425
Name: tot-runtime, dtype: float64

### Gather Dataset Information Performance

This task was to gather information about one file's content: all signal HDF5 datasets and their dimension scales. It was repeated for every benchmark run. The task used the native HDF5 library method that guarantees visiting every HDF5 object in the file exactly once. The task's timing data are in the `gather-time` column.

In [23]:
lc_gather_time = lc_data[["pb-size", "file", "gather-time"]].drop_duplicates()
s3_gather_time = s3_data[["pb-size", "file", "gather-time"]].drop_duplicates()

In [24]:
(
    lc_gather_time.hvplot.box(
        y="gather-time",
        by=["pb-size", "file"],
    )
    * s3_gather_time.hvplot.box(
        y="gather-time",
        by=["pb-size", "file"],
    )
).options(
    ylim=(8, 11),
    title="Local (blue) and S3 (red)",
    height=400,
    show_legend=False,
    ylabel="Time to discover all signals / [seconds]",
    show_grid=True,
)

Main findings from the above plot:

* Performance for the `signal layout` file in S3 is slower, especially with the 64 MB page buffer, where the timings were ~125 seconds (not shown in the plot).
* The `shot layout` file performance is slightly slower when the page buffer is not used (`off`), but overall all cases are similar.
* The page buffer significantly improves performance for files in S3 if it is large enough to keep the relevant file pages cached.

### Open File Performance

This task measured the time just to open the file and was recorded for every Dask worker in all benchmark runs. The data are in the `open-file-time` column.

In [25]:
(
    lc_data.hvplot.box(
        y="open-file-time",
        by=["pb-size"],
    )
    * s3_data.hvplot.box(
        y="open-file-time",
        by=["pb-size"],
    )
).options(
    logy=True,
    title="Local files (blue) and S3 files (red)",
    height=400,
    show_legend=False,
    ylabel="Open file time / [seconds]",
    show_grid=True,
)

Takeaways from the plot:

* Opening files in S3 was approximately two orders of magnitude slower than locally.
* Opening files in either location took under one second in all cases, which is quick enough given the file's location.
* There are no noticeable variations between cases with the same file location, either local or S3.

### Page Cache Performance

Libhdf5 keeps statistics about its file page cache, which can be used to assess whether the chosen cache size is appropriate for the actual data access operations. The statistics kept for this analysis are: total number of cache accesses, cache hit rate, and number of file page evictions from the cache.
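
For reference, a hit rate relates the number of cache hits to the number of cache accesses. The sketch below assumes the `*-hitrate` columns are percentages computed as hits divided by accesses; this notebook does not show how they were derived, and `hit_rate_percent` is a hypothetical helper.

```python
import math


def hit_rate_percent(hits, accesses):
    """Percentage of page cache accesses served from the cache."""
    if accesses == 0:
        return math.nan  # cache never used, e.g. when the page cache is off
    return 100.0 * hits / accesses


# Made-up counters, for illustration only.
print(hit_rate_percent(hits=930, accesses=1000))
```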

In [26]:
lc_pb = lc_data.loc[
    lc_data["pb-size"] != "off",  # pb-size 0 was relabeled "off" earlier
    [
        "pb-size",
        "file",
        "num-workers",
        "obj-type",
        "tot-runtime",
        "pb-meta-accesses",
        "pb-meta-hitrate",
        "pb-meta-evicts",
        "pb-raw-accesses",
        "pb-raw-hitrate",
        "pb-raw-evicts",
    ],
]

s3_pb = s3_data[
    [
        "pb-size",
        "file",
        "num-workers",
        "obj-type",
        "tot-runtime",
        "pb-meta-accesses",
        "pb-meta-hitrate",
        "pb-meta-evicts",
        "pb-raw-accesses",
        "pb-raw-hitrate",
        "pb-raw-evicts",
    ]
]

In [27]:
by_cols = ["file", "obj-type", "pb-size"]
(
    lc_pb.hvplot.box(
        y="pb-meta-hitrate",
        by=by_cols,
    )
    * s3_pb.hvplot.box(
        y="pb-meta-hitrate",
        by=by_cols,
    )
).options(
    show_grid=True,
    show_legend=False,
    title="Local files (blue) and S3 files (red)",
    ylabel="meta cache hit rate / [%]",
    ylim=(None, 100),
    height=400,
) + (
    lc_pb.hvplot.box(
        y="pb-raw-hitrate",
        by=by_cols,
    )
    * s3_pb.hvplot.box(
        y="pb-raw-hitrate",
        by=by_cols,
    )
).options(
    show_grid=True,
    show_legend=False,
    title="Local files (blue) and S3 files (red)",
    ylabel="raw cache hit rate / [%]",
    ylim=(None, 100),
    height=400,
)


The overall statistics of cache hit rates for internal metadata (_meta_) and raw data (_raw_) file pages show almost identical values across all benchmark runtime parameters. The meta cache hit rates are very high (93-100%) for all cases, although the "signal layout, shots, 64MB" case looks almost like an outlier with the smallest value. The raw cache hit rates, on the other hand, seem to fall into four groups. In order from highest to lowest hit rate, they are: (1) "shot layout, shots, 64MB" and "shot layout, shots, 264MB"; (2) "signal layout, signals, 64MB" and "signal layout, signals, 264MB"; (3) "signal layout, shots, 64MB" and "signal layout, shots, 264MB"; (4) "shot layout, signals, 64MB" and "shot layout, signals, 264MB". The first two groups seem to indicate a positive impact on raw cache hit rates when the shot- or signal-centric file layout matches the data access order.
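
A numeric ranking can complement the visual grouping from the box plots. The sketch below ranks groups by their median raw hit rate; the hit-rate values are made up for illustration and are not taken from the benchmark data.

```python
import pandas as pd

# Made-up hit rates (percent), for illustration only; one row per Dask worker.
toy_pb = pd.DataFrame(
    {
        "file": ["shot layout"] * 4 + ["signal layout"] * 4,
        "obj-type": ["shots", "shots", "signals", "signals"] * 2,
        "pb-size": ["64MB"] * 8,
        "pb-raw-hitrate": [90.0, 92.0, 20.0, 22.0, 40.0, 42.0, 70.0, 72.0],
    }
)

# Median hit rate per benchmark group, highest first.
ranking = (
    toy_pb.groupby(["file", "obj-type", "pb-size"])["pb-raw-hitrate"]
    .median()
    .sort_values(ascending=False)
)
print(ranking)
```

The same `groupby` applied to `lc_pb` or `s3_pb` would give the ranking for the actual benchmarks.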

---

lc_pb.hvplot.scatter(
    x="num-workers", y="pb-meta-hitrate", by=["file", "obj-type", "pb-size"], width=750
)

lc_pb.hvplot.scatter(
    x="num-workers",
    y="pb-raw-hitrate",
    by=["file", "obj-type", "pb-size"],
    grid=True,
    width=750,
)

# `pb_data` is not defined in this notebook; `lc_pb` (local files) is assumed here.
pb_raw_rate_desc = (
    lc_pb.groupby(["file", "obj-type", "pb-size", "num-workers"])["pb-raw-hitrate"]
    .describe()
    .reset_index()
)
# Distances from the mean to the extremes, used as asymmetric error bars.
pb_raw_rate_desc["top"] = pb_raw_rate_desc["max"] - pb_raw_rate_desc["mean"]
pb_raw_rate_desc["bot"] = pb_raw_rate_desc["mean"] - pb_raw_rate_desc["min"]

plot_kwargs = {"x": "num-workers", "y": "mean", "by": ["file", "obj-type", "pb-size"]}
(
    pb_raw_rate_desc.hvplot.scatter(**plot_kwargs)
    * pb_raw_rate_desc.hvplot.line(**plot_kwargs)
    * pb_raw_rate_desc.hvplot.errorbars(yerr1="bot", yerr2="top", **plot_kwargs)
)