<a id='home'></a>

# Growth of File Sizes for Versioned-HDF5 Files 

For these tests, we generated `.h5` data files using the `generate_data_deterministic.py` script from the [VersionedHDF5 repository](https://github.com/Quansight/versioned-hdf5), with the standard options ([see details here](#standard)).

We performed the following tests:
1. [Test Large Fraction Changes Sparse](#test1)
2. [Test Mostly Appends Sparse](#test2)
3. [Test Small Fraction Changes Sparse](#test3)
4. [Test Mostly Appends Dense](#test4)

**These tests were last run on**

In [None]:
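from datetime import datetime, timezone
# datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp instead
print(datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"), "UTC")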

## Setup

The path to the generated test files is

In [None]:
path = "/home/melissa/projects/versioned-hdf5/analysis" # change this as necessary

In [None]:
%matplotlib inline
import h5py
import json
import numpy as np
import performance_tests
import matplotlib.pyplot as plt

The information from the generated test files is stored in either
- `testcase.tests`, a dictionary containing all the info related to a testcase that was run recently;
- a `.json` file named after the test name and options, containing a summary of the results. This file can be read with
    ```python
    with open(f"{testname}.json", "r") as json_in:
        test = json.load(json_in)
    ```
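
The plotting cells in this notebook pick out each curve by slicing the flat list of records by position, which only works when the records are ordered by compression, then chunk size, then number of transactions. As a minimal sketch of an alternative, the records can be grouped by their own fields. The helper `group_by_curve` and the sample records are made up for illustration; only the key names (`num_transactions`, `chunk_size`, `compression`, `size`) are taken from the cells in this notebook.

```python
from collections import defaultdict

def group_by_curve(records, value_key="size"):
    """Group flat test records into one series per (compression, chunk_size)."""
    curves = defaultdict(list)
    for rec in records:
        key = (rec["compression"], rec["chunk_size"])
        curves[key].append((rec["num_transactions"], rec[value_key]))
    # Sort each series by number of transactions so lines are drawn left to right.
    return {key: sorted(points) for key, points in curves.items()}

# Made-up records, for illustration only.
sample = [
    {"num_transactions": 50, "chunk_size": 4096, "compression": None, "size": 300},
    {"num_transactions": 25, "chunk_size": 4096, "compression": None, "size": 200},
    {"num_transactions": 25, "chunk_size": 4096, "compression": "gzip", "size": 120},
]
curves = group_by_curve(sample)
```

Each value in `curves` is a list of `(num_transactions, value)` pairs that can be unzipped and passed to `plt.plot`, regardless of the order in which the records were saved.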

<a id='test1'></a>

# Test 1: Large fraction changes (sparse)

In [None]:
testname = "test_large_fraction_changes_sparse"

We have tested the following numbers of versions (or transactions):

```python
num_transactions_1 = [50, 100, 500, 1000, 5000, 10000]
```

If you want to generate the files now, modify the following constants for the desired tests. **Please keep in mind that file sizes can become very large for large numbers of transactions (above 5000 transactions).**

In [None]:
num_transactions_1 = [25, 50]

For the chunk size parameter, we have tested chunk sizes of $2^8, 2^{10}, 2^{12}$ and $2^{14}$.

In [None]:
exponents_1 = [12, 14]

Choose the desired compression algorithms.

In [None]:
compression_1 = [None, "gzip", "lzf"]

In [None]:
testcase = performance_tests.test_large_fraction_changes_sparse(path=path,
                                                                  num_transactions=num_transactions_1, 
                                                                  exponents=exponents_1, 
                                                                  compression=compression_1)
testcase_1 = testcase.create_files()
testcase.save(testcase_1)
t_sizes_1 = [test['theoretical_sizes'] for test in testcase.tests[-len(num_transactions_1):]]

Otherwise, we can read an existing `.json` file with

In [None]:
with open(f"{testname}.json", "r") as json_in:
    testcase_1 = json.load(json_in)

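# Keep first-seen order: set() would scramble it, and the plots below slice the records by position.
num_transactions_1 = list(dict.fromkeys(test['num_transactions'] for test in testcase_1))
exponents_1 = list(dict.fromkeys(test['chunk_size'] for test in testcase_1))
compression_1 = list(dict.fromkeys(test['compression'] for test in testcase_1))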
t_sizes_1 = None

## Number of versions vs. file size

We'll start by analyzing how the `.h5` file sizes grow as the number of versions grows. 

Note that the array size also grows with the number of versions, since each transaction adds, deletes and changes values in the original arrays. To compute a (naive) theoretical lower bound on the file size, we compute how much space each version should take. Keep in mind that the files contain redundant data: some of the data is not changed when a new version is staged, but it is still stored. In this example, we start with three arrays of 5000 elements each (two integer arrays and one float array), and we end up with the following array sizes:

```
Maximum array size for file with 50 transactions: 5500
Maximum array size for file with 100 transactions: 6000
Maximum array size for file with 500 transactions: 10000
Maximum array size for file with 1000 transactions: 15000
Maximum array size for file with 5000 transactions: 55000
Maximum array size for file with 10000 transactions: 105000
```
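
These sizes are consistent with each transaction adding a net 10 rows to the initial 5000. A small sketch that reproduces them, assuming the net growth per transaction is `num_rows_per_append + num_inserts - num_deletes`. This is an assumption inferred from the standard parameters and the sizes listed above, not taken from the generator script, and `max_array_size` is a hypothetical helper.

```python
def max_array_size(num_transactions, num_rows_initial=5000,
                   num_rows_per_append=10, num_inserts=10, num_deletes=10):
    """Estimate the array length after the given number of transactions."""
    # Assumed net growth: appended and inserted rows add, deleted rows subtract.
    net_rows = num_rows_per_append + num_inserts - num_deletes
    return num_rows_initial + num_transactions * net_rows

for n in [50, 100, 500, 1000, 5000, 10000]:
    print(f"Maximum array size for file with {n} transactions: {max_array_size(n)}")
```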

Let's show the size information in a plot:

In [None]:
filesizes_1 = np.array([test['size'] for test in testcase_1])
sizelabels_1 = np.array([test['size_label'] for test in testcase_1])

n = len(num_transactions_1)
fig_large_fraction_changes = plt.figure(figsize=(14,10))

if t_sizes_1:
    plt.plot(num_transactions_1, t_sizes_1, 'o--', ms=5, color='magenta', label="Theoretical file size")

nexp = len(exponents_1)
ncomp = len(compression_1)
for i in range(ncomp):
    for j in range(nexp):
        plt.plot(num_transactions_1, 
                 filesizes_1[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                 '*--', ms=12, 
                 label=f"Chunk size 2**{exponents_1[j]}, {compression_1[i]}")

plt.xlabel("Transactions")
plt.title("test_large_fraction_changes_sparse")
plt.legend()
plt.yticks(filesizes_1, sizelabels_1)
plt.show()

Changing the view to a logarithmic scale, we have the following:

In [None]:
fig_large_fraction_changes_log = plt.figure(figsize=(14,10))

if t_sizes_1:
    plt.loglog(num_transactions_1, t_sizes_1, 'o--', ms=5, color='magenta', label="Theoretical file size")

for i in range(ncomp):
    for j in range(nexp):
        plt.loglog(num_transactions_1, 
                   filesizes_1[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                   '*--', ms=12, 
                   label=f"Chunk size 2**{exponents_1[j]}, {compression_1[i]}")

plt.xlabel("Transactions")
plt.title("test_large_fraction_changes_sparse")
plt.legend()
plt.yticks(filesizes_1, sizelabels_1)
plt.show()
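
On a log-log plot, a power law $\text{size} \propto N^k$ appears as a straight line with slope $k$, so the slope between two measurements gives a rough growth exponent. A minimal sketch follows; the helper `growth_exponent` and the sample numbers are hypothetical and not part of `performance_tests`.

```python
import math

def growth_exponent(n1, size1, n2, size2):
    """Slope of the line through two points on a log-log plot."""
    return math.log(size2 / size1) / math.log(n2 / n1)

# Illustrative numbers only: doubling the transactions doubles the file size.
k = growth_exponent(50, 1_000_000, 100, 2_000_000)
print(f"Estimated growth exponent: {k:.2f}")
```

An exponent close to 1 indicates roughly linear growth in the number of transactions, while a value well below 1 means the file grows more slowly than the number of versions.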

## Creation times

If we look at the creation times for these files, we have something like this:

In [None]:
t_write_1 = np.array([test['t_write'] for test in testcase_1])

fig_large_fraction_changes_times = plt.figure()
for i in range(ncomp):
    for j in range(nexp):
        plt.plot(num_transactions_1, 
                 t_write_1[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                 'o--', ms=8, 
                 label=f"Chunk size 2**{exponents_1[j]}, {compression_1[i]}")
plt.xlabel("Transactions")
plt.title("test_large_fraction_changes_sparse - creation times in seconds")
plt.legend()
plt.xticks(num_transactions_1)
plt.show()

We can see that the files with the smallest sizes, corresponding to chunk sizes of $2^8$ and $2^{10}$, are also the ones with the longest creation times. **This is consistent with the effects of using smaller chunk sizes in HDF5 files.**

This behaviour suggests that for `test_large_fraction_changes_sparse`, larger chunk sizes generate larger files, but the file size grows modestly as the number of transactions grows. So, **if we are dealing with a large number of transactions, larger chunk sizes generate files that are of reasonable size while having faster creation times** (and probably faster IO speeds as well).

[Back to top](#home)

<a id='test2'></a>

# Test 2: Mostly appends (sparse)

In [None]:
testname = "test_mostly_appends_sparse"

For this case, we have tested the following number of transactions:

```python
num_transactions_2 = [50, 100, 200]
```

Once again, we tested chunk sizes of $2^8$, $2^{10}$, $2^{12}$ and $2^{14}$.

Change `num_transactions_2`, `exponents_2` and `compression_2` as desired (the previous warning applies: beware of very large file sizes and creation times for large numbers of versions):

In [None]:
num_transactions_2 = [25, 50]
exponents_2 = [12, 14]
compression_2 = [None, "gzip", "lzf"]

In [None]:
testcase = performance_tests.test_mostly_appends_sparse(path=path,
                                                        num_transactions=num_transactions_2, 
                                                        exponents=exponents_2, 
                                                        compression=compression_2)
testcase_2 = testcase.create_files()
testcase.save(testcase_2)
t_sizes_2 = [test['theoretical_sizes'] for test in testcase.tests[-len(num_transactions_2):]]

Otherwise, we can read an existing `.json` file with

In [None]:
with open(f"{testname}.json", "r") as json_in:
    testcase_2 = json.load(json_in)

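# Keep first-seen order: set() would scramble it, and the plots below slice the records by position.
num_transactions_2 = list(dict.fromkeys(test['num_transactions'] for test in testcase_2))
exponents_2 = list(dict.fromkeys(test['chunk_size'] for test in testcase_2))
compression_2 = list(dict.fromkeys(test['compression'] for test in testcase_2))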
t_sizes_2 = None

Let's show the size information in a graph:

In [None]:
filesizes_2 = np.array([test['size'] for test in testcase_2])
sizelabels_2 = np.array([test['size_label'] for test in testcase_2])

fig_mostly_appends = plt.figure(figsize=(14,10))

if t_sizes_2:
    plt.plot(num_transactions_2, t_sizes_2, 'o--', ms=5, color='magenta', label="Theoretical file size")

n = len(num_transactions_2)
nexp = len(exponents_2)
ncomp = len(compression_2)
for i in range(ncomp):
    for j in range(nexp):
        plt.plot(num_transactions_2, 
                 filesizes_2[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                 '*--', ms=12, 
                 label=f"Chunk size 2**{exponents_2[j]}, {compression_2[i]}")

    
plt.xlabel("Transactions")
plt.title("test_mostly_appends_sparse")
plt.legend()
plt.yticks(filesizes_2[-n:], sizelabels_2[-n:])
plt.show()

Changing the view to a logarithmic scale, we have the following:

In [None]:
fig_mostly_appends_log = plt.figure(figsize=(14,10))

if t_sizes_2:
    plt.loglog(num_transactions_2, t_sizes_2, 'o--', ms=5, color='magenta', label="Theoretical file size")

for i in range(ncomp):
    for j in range(nexp):
        plt.loglog(num_transactions_2, 
                   filesizes_2[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                   '*--', ms=12, 
                   label=f"Chunk size 2**{exponents_2[j]}, {compression_2[i]}")

plt.xlabel("Transactions")
plt.title("test_mostly_appends_sparse")
plt.legend()
plt.yticks(filesizes_2[-n:], sizelabels_2[-n:])
plt.show()

If we look at the creation times for these files, we have something like this:

In [None]:
t_write_2 = np.array([test['t_write'] for test in testcase_2])

fig_mostly_appends_times = plt.figure()
for i in range(ncomp):
    for j in range(nexp):
        plt.plot(num_transactions_2, 
                 t_write_2[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                 'o--', ms=8, 
                 label=f"Chunk size 2**{exponents_2[j]}, {compression_2[i]}")

plt.xlabel("Transactions")
plt.title("test_mostly_appends_sparse - creation times in seconds")
plt.legend()
plt.xticks(num_transactions_2)
plt.show()

The result is similar to the first test: smaller chunk sizes correspond to smaller file sizes, but larger creation times.

[Back to top](#home)

<a id='test3'></a>

# Test 3: Small fraction changes (sparse)

In [None]:
testname = "test_small_fraction_changes_sparse"

We have tested the following numbers of versions (or transactions):

```python
num_transactions_3 = [50, 100, 500, 1000, 5000]
```

Change `num_transactions_3`, `exponents_3` and `compression_3` as desired:

In [None]:
num_transactions_3 = [25, 50]
exponents_3 = [12, 14]
compression_3 = [None, "gzip", "lzf"]

In [None]:
testcase = performance_tests.test_small_fraction_changes_sparse(path=path,
                                                                num_transactions=num_transactions_3, 
                                                                exponents=exponents_3, 
                                                                compression=compression_3)

testcase_3 = testcase.create_files()
testcase.save(testcase_3)

t_sizes_3 = [test['theoretical_sizes'] for test in testcase.tests[-len(num_transactions_3):]]

To open an existing `.json` file, use

In [None]:
with open(f"{testname}.json", "r") as json_in:
    testcase_3 = json.load(json_in)

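# Keep first-seen order: set() would scramble it, and the plots below slice the records by position.
num_transactions_3 = list(dict.fromkeys(test['num_transactions'] for test in testcase_3))
exponents_3 = list(dict.fromkeys(test['chunk_size'] for test in testcase_3))
compression_3 = list(dict.fromkeys(test['compression'] for test in testcase_3))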
t_sizes_3 = None

Let's show the size information in a graph:

In [None]:
filesizes_3 = np.array([test['size'] for test in testcase_3])
sizelabels_3 = np.array([test['size_label'] for test in testcase_3])

fig_small_fraction_changes = plt.figure(figsize=(14,10))

if t_sizes_3:
    plt.plot(num_transactions_3, t_sizes_3, 'o--', ms=5, color='magenta', label="Theoretical file size")

n = len(num_transactions_3)
nexp = len(exponents_3)
ncomp = len(compression_3)
for i in range(ncomp):
    for j in range(nexp):
        plt.plot(num_transactions_3, 
                 filesizes_3[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                 '*--', ms=12, 
                 label=f"Chunk size 2**{exponents_3[j]}, {compression_3[i]}")

plt.xlabel("Transactions")
plt.title("test_small_fraction_changes_sparse")
plt.legend()
plt.yticks(filesizes_3[-n:], sizelabels_3[-n:])
plt.show()

Changing the view to a logarithmic scale, we have the following:

In [None]:
fig_small_fraction_changes_log = plt.figure(figsize=(14,10))

if t_sizes_3:
    plt.loglog(num_transactions_3, t_sizes_3, 'o--', ms=5, color='magenta', label="Theoretical file size")

for i in range(ncomp):
    for j in range(nexp):
        plt.loglog(num_transactions_3, 
                   filesizes_3[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                   '*--', ms=12, 
                   label=f"Chunk size 2**{exponents_3[j]}, {compression_3[i]}")

plt.xlabel("Transactions")
plt.title("test_small_fraction_changes_sparse")
plt.legend()
plt.yticks(filesizes_3[-n:], sizelabels_3[-n:])
plt.show()

If we look at the creation times for these files, we have something like this:

In [None]:
t_write_3 = np.array([test['t_write'] for test in testcase_3])

fig_small_fraction_changes_times = plt.figure()
for i in range(ncomp):
    for j in range(nexp):
        plt.plot(num_transactions_3, 
                 t_write_3[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                 '*--', ms=12, 
                 label=f"Chunk size 2**{exponents_3[j]}, {compression_3[i]}")

plt.xlabel("Transactions")
plt.title("test_small_fraction_changes_sparse - creation times in seconds")
plt.legend()
plt.xticks(num_transactions_3)
plt.show()

We can see that the files with the smallest sizes, corresponding to chunk sizes of $2^8$ and $2^{10}$, are also the ones with the longest creation times. This is consistent with the effects of using smaller chunk sizes in HDF5 files.

This behaviour is similar to what we got in the `test_large_fraction_changes_sparse` case: for `test_small_fraction_changes_sparse`, larger chunk sizes generate larger files, but the file size grows modestly as the number of transactions grows. So, **if we are dealing with a large number of transactions, larger chunk sizes generate files that are of reasonable size while having faster creation times** (and probably faster IO speeds as well).

[Back to top](#home)

<a id='test4'></a>

# Test 4: Mostly appends (dense)

In [None]:
testname = "test_mostly_appends_dense"

For this case, we have tested the following number of transactions:

```python
num_transactions_4 = [50, 100, 200]
```

Change `num_transactions_4`, `exponents_4` and `compression_4` as desired:

In [None]:
num_transactions_4 = [25, 50]
exponents_4 = [12, 14]
compression_4 = [None, "gzip", "lzf"]

In [None]:
testcase = performance_tests.test_mostly_appends_dense(path=path,
                                                       num_transactions=num_transactions_4, 
                                                       exponents=exponents_4, 
                                                       compression=compression_4)

testcase_4 = testcase.create_files()
testcase.save(testcase_4)

t_sizes_4 = [test['theoretical_sizes'] for test in testcase.tests[-len(num_transactions_4):]]

To open an existing `.json` file, use

In [None]:
with open(f"{testname}.json", "r") as json_in:
    testcase_4 = json.load(json_in)

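# Keep first-seen order: set() would scramble it, and the plots below slice the records by position.
num_transactions_4 = list(dict.fromkeys(test['num_transactions'] for test in testcase_4))
exponents_4 = list(dict.fromkeys(test['chunk_size'] for test in testcase_4))
compression_4 = list(dict.fromkeys(test['compression'] for test in testcase_4))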
t_sizes_4 = None

Let's show the size information in a graph:

In [None]:
filesizes_4 = np.array([test['size'] for test in testcase_4])
sizelabels_4 = np.array([test['size_label'] for test in testcase_4])

fig_mostly_appends_dense = plt.figure(figsize=(14,10))

if t_sizes_4:
    plt.plot(num_transactions_4, t_sizes_4, 'o--', ms=5, color='magenta', label="Theoretical file size")

n = len(num_transactions_4)
nexp = len(exponents_4)
ncomp = len(compression_4)
for i in range(ncomp):
    for j in range(nexp):
        plt.plot(num_transactions_4, 
                 filesizes_4[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                 '*--', ms=12, 
                 label=f"Chunk size 2**{exponents_4[j]}, {compression_4[i]}")

plt.xlabel("Transactions")
plt.title("test_mostly_appends_dense")
plt.legend()
plt.yticks(filesizes_4[-n:], sizelabels_4[-n:])
plt.show()

Changing the view to a logarithmic scale, we have the following:

In [None]:
fig_mostly_appends_dense_log = plt.figure(figsize=(14,10))

if t_sizes_4:
    plt.loglog(num_transactions_4, t_sizes_4, 'o--', ms=5, color='magenta', label="Theoretical file size")

for i in range(ncomp):
    for j in range(nexp):
        plt.loglog(num_transactions_4, 
                   filesizes_4[j*n+i*nexp*n:(j+1)*n+i*nexp*n], 
                   '*--', ms=12, 
                   label=f"Chunk size 2**{exponents_4[j]}, {compression_4[i]}")


plt.xlabel("Transactions")
plt.title("test_mostly_appends_dense")
plt.legend()
plt.yticks(filesizes_4[-n:], sizelabels_4[-n:])
plt.show()

If we look at the creation times for these files, we have something like this:

In [None]:
t_write_4 = np.array([test['t_write'] for test in testcase_4])

fig_mostly_appends_dense_times = plt.figure()

for i in range(ncomp):
    for j in range(nexp):
        plt.plot(num_transactions_4, 
                 t_write_4[j*n+i*nexp*n:(j+1)*n+i*nexp*n],
                 '*--', ms=12, 
                 label=f"Chunk size 2**{exponents_4[j]}, {compression_4[i]}")

plt.xlabel("Transactions")
plt.title("test_mostly_appends_dense - creation times in seconds")
plt.legend()
plt.xticks(num_transactions_4)
plt.show()

The behaviour is similar to what we observed in other tests.

[Back to top](#home)

## Understanding each file

Each versioned HDF5 file contains 3 datasets per version:
- `key0`, an array of `int64`
- `key1`, an array of `int64`
- `val`, an array of `float64`

plus metadata about groups, datasets and versions.

Each row takes 24 bytes (three 8-byte values), so each file has roughly

```
nversions * 24 * arraysize + metadata
```
bytes of information.
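
As a minimal sketch, the estimate without the metadata term can be computed as follows. The function name `naive_file_size` is hypothetical, and the example plugs in the 50-transaction case of Test 1 (maximum array size 5500) as an upper value for `arraysize`.

```python
BYTES_PER_ROW = 2 * 8 + 8  # two int64 keys plus one float64 value

def naive_file_size(nversions, arraysize, metadata=0):
    """Size in bytes following nversions * 24 * arraysize + metadata."""
    return nversions * BYTES_PER_ROW * arraysize + metadata

size = naive_file_size(50, 5500)
print(f"{size / 1e6:.1f} MB")  # prints 6.6 MB
```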

<a id='standard'></a>
## Standard parameters

- `test_large_fraction_changes_sparse`: 
    - `num_rows_initial = 5000`
    - `num_rows_per_append = 10`
    - `num_inserts = 10`
    - `num_deletes = 10`
    - `num_changes = 1000`
- `test_small_fraction_changes_sparse`:
    - `num_rows_initial = 5000`
    - `num_rows_per_append = 10`
    - `num_inserts = 10`
    - `num_deletes = 10`
    - `num_changes = 10`
- `test_mostly_appends_sparse`:
    - `num_rows_initial = 1000`
    - `num_rows_per_append = 1000`
    - `num_inserts = 10`
    - `num_deletes = 10`
    - `num_changes = 10`  
- `test_mostly_appends_dense`:
    - `num_rows_initial_0 = 30`
    - `num_rows_initial_1 = 30`
    - `num_rows_per_append_0 = 1`
    - `num_inserts_0 = 1`
    - `num_inserts_1 = 10`
    - `num_deletes_0 = 1`
    - `num_deletes_1 = 1`
    - `num_changes = 10`