# Table of Contents

* [1 Archives and DataSets](#Archives-and-DataSets)
* [2 Archive Format](#Archive-Format)
  * [2.1 Containers and Duplicates](#Containers-and-Duplicates)
  * [2.2 External Data](#External-Data)
  * [2.3 Examples](#Examples)
    * [2.3.1 Non-scoped (flat) Format](#Non-scoped-(flat)-Format)
    * [2.3.2 Scoped Format](#Scoped-Format)
    * [2.3.3 External Data](#External-Data)
* [3 DataSet Format](#DataSet-Format)
  * [3.1 Caveats](#Caveats)
* [4 Examples](#Examples)
  * [4.1 External Data](#External-Data)
* [5 The DS Format](#The-DS-Format)

# Archives and DataSets

There are two main classes provided by the `persist` module: `persist.Archive` and `persist.DataSet`.  Archives track the linkage between objects so that an object referred to multiple times is stored only once in the archive.  Archives provide a way to serialize the data with the `str()` operator, but are not well suited for saving data to disk.

DataSets use archives, but provide additional functionality for saving the data to disk for true persistence.  This document describes the format of this data.

# Archive Format

The `persist.Archive` object maintains a collection of python objects that are inserted with `persist.Archive.insert()`.  The main purpose of an archive is to provide an executable string which can be evaluated to regenerate the objects that have been inserted.

## Containers and Duplicates

The main complexity with archives is with containers: all objects referenced in a container (like a list or dictionary) need to be inserted into the archive, and inserted only once.
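
The "inserted only once" requirement can be illustrated with a small stand-alone sketch.  This is not the `persist` implementation: `to_source` is a hypothetical helper that handles only integers and nested lists (and no cyclic references), and it simply records each list by `id()` so that a shared list is defined once and referenced by name afterwards.

```python
def to_source(**objects):
    """Emit Python source that rebuilds `objects` (ints and nested lists only).

    Hypothetical helper for illustration; cyclic references are not handled.
    """
    names = {}   # id(list) -> name of the variable that holds it
    lines = []   # emitted source lines, in dependency order
    temps = []   # generated intermediate names, deleted at the end

    def ref(obj, name=None):
        """Return an expression for obj, emitting a definition on first sight."""
        if not isinstance(obj, list):
            return repr(obj)
        if id(obj) not in names:
            if name is None:
                name = "_g{}".format(len(temps))
                temps.append(name)
            # Children are emitted first so they are defined before use.
            items = ", ".join(ref(item) for item in obj)
            lines.append("{} = [{}]".format(name, items))
            names[id(obj)] = name
        return names[id(obj)]

    for name in sorted(objects):
        expr = ref(objects[name], name)
        if expr != name:  # a scalar, or an alias of an already emitted list
            lines.append("{} = {}".format(name, expr))
    if temps:
        lines.append("del " + ", ".join(temps))
    return "\n".join(lines)


y = [1, 2, 3]
b = [y, y]            # the same list object appears twice
source = to_source(a=1, b=b)
print(source)

env = {}
exec(source, env)
print(env["b"][0] is env["b"][1])   # sharing survives the round trip
```

The intermediate list gets a generated name, is defined once, and is deleted at the end, which mirrors the `_g#` variables that appear in the archive output later in this document.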

## External Data

A second complexity is that we allow large arrays (whose size exceeds `array_threshold`) to be stored externally.  To deal with these, the archive expects the string representation to be evaluated in an environment that has these arrays defined so that they can be inserted appropriately as needed into containers.  The behavior is controlled by the following attributes of `persist.Archive`:

* `array_threshold` : Arrays longer than this will be stored as external data.  Shorter arrays will be stored in the archive.
* `data_name` : When evaluated, the environment should contain a dictionary with this name (default `'_arrays'`) which contains all of the externally stored arrays (or placeholders).
* `datafile` : The name of the file or directory in which external arrays will be stored.
* `data_format` : The format in which external arrays will be stored.

A downside of the current implementation is that all of the arrays need to be loaded in order for the archive to be evaluated.  This complicates delayed loading of large arrays unless a placeholder can be used that only loads the data when used.  Both HDF5 and NumPy support this (NumPy via the [mmap_mode](https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html#numpy.load) argument of `numpy.load`) but it is not implemented yet.  As a workaround, separate archives should be used for external arrays.  This is the approach that `DataSet` takes.
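
To make the idea of a placeholder concrete, here is a minimal sketch of what delayed loading could look like.  It is not part of `persist`: the `LazyArray` class is hypothetical, and a JSON file stands in for the `.npy` or HDF5 storage so that the example needs only the standard library.

```python
import json
import os
import shutil
import tempfile


class LazyArray(object):
    """Hypothetical placeholder that reads its file on first use only."""

    def __init__(self, path):
        self.path = path
        self.loaded = False
        self._data = None

    @property
    def data(self):
        if not self.loaded:
            with open(self.path) as f:
                self._data = json.load(f)
            self.loaded = True
        return self._data


# Store a "large array" externally (JSON stands in for .npy or HDF5).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "array_0.json")
with open(path, "w") as f:
    json.dump(list(range(10)), f)

# Evaluate an archive-style string with a placeholder in `_arrays`.
env = {"_arrays": {"array_0": LazyArray(path)}}
exec("x = _arrays['array_0']", env)

loaded_before = env["x"].loaded   # evaluation alone does not touch the disk
first_three = env["x"].data[:3]   # the first access triggers the load
loaded_after = env["x"].loaded
print(loaded_before, first_three, loaded_after)

shutil.rmtree(tmpdir)             # remove the temporary files
```

With such a placeholder the archive string can be evaluated cheaply, and the cost of reading a large array is paid only by code that actually uses it.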

## Examples

Here we demonstrate a simple archive containing all of the data, first in the flat format and then in the scoped format.

### Non-scoped (flat) Format

We start with the `scoped=False` format.  This produces a flat archive that is easier to read:

In [1]:
import os.path
import tempfile
import shutil
import numpy as np
from persist.archive import Archive
tmpdir = tempfile.mkdtemp()  # Make temporary directory for dataset

a = 1
x = np.arange(2)
y = [1, 2, 3]     # Implicitly referenced in the archive (through b)
b = [x, y, y]     # Nested references to x and y

archive = Archive(scoped=False)
archive.insert(a=a, x=x, b=b)

# Get the string representation
%time s = str(archive)
print(s)

CPU times: user 1.57 ms, sys: 68 µs, total: 1.64 ms
Wall time: 1.65 ms
import numpy as _numpy
a = 1
_g3 = [a, 2, 3]
x = _numpy.fromstring('\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00', dtype='<i8').reshape((2,))
b = [x, _g3, _g3]
del _numpy
del _g3
try: del __builtins__
except NameError: pass


Note that intermediate objects not explicitly inserted are stored with variables like `_g#` and that these are deleted, so that evaluating the string in a dictionary gives a clean result:

In [2]:
# Now execute the representation to get the data
d = {}
exec(s, d)
print(d)
d['b'][1] is d['b'][2]

{'a': 1, 'b': [array([0, 1]), [1, 2, 3], [1, 2, 3]], 'x': array([0, 1])}


True
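
The role of the trailing `del` statements can be checked without `persist` at all.  The sketch below uses a hand-written string in the same style as the output above (plain lists only, no NumPy) and compares it with a string that lacks the clean-up lines.

```python
source = """\
a = 1
_g3 = [a, 2, 3]
b = [_g3, _g3]
del _g3
try: del __builtins__
except NameError: pass
"""

# exec() adds a '__builtins__' entry to the dictionary it is given ...
plain = {}
exec("a = 1", plain)
print("__builtins__" in plain)

# ... which the final lines of the archive string remove again.
d = {}
exec(source, d)
print(sorted(d))
print(d["b"][0] is d["b"][1])   # the shared list is still a single object
```

Only the explicitly inserted names remain in `d`, and both entries of `b` refer to the same list.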

The potential problem with the flat format is that obtaining this simple representation requires a graph reduction that replaces intermediate nodes, which both simplifies the representation and ensures that local variables do not have name clashes.  Replacing variables in representations can have performance implications if the objects are large.  The fastest approach is a string replacement, but this can make mistakes if the substring appears in data.  The option `robust_replace` invokes the python AST parser instead, which avoids such mistakes but is slower.
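
The difference between the two replacement strategies can be sketched with the standard library.  This is only an illustration of the trade-off, not the code behind `robust_replace`: the function names are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def naive_replace(source, old, new):
    """Fast, but also rewrites matches inside string data."""
    return source.replace(old, new)


class _Rename(ast.NodeTransformer):
    """Rename identifiers only; string constants are left untouched."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node


def ast_replace(source, old, new):
    """Slower, but only touches real variable names (Python 3.9+)."""
    tree = _Rename(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


source = "label = '_g3 is a name'\nb = [_g3, _g3]"
naive = naive_replace(source, "_g3", "y")
robust = ast_replace(source, "_g3", "y")
print(naive)
print(robust)
```

The naive version corrupts the string stored in `label`, while the AST-based version renames only the two variable references.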

### Scoped Format

To alleviate these issues, the `scoped=True` format is provided.  This is visually much more complicated as each object is constructed in a function.  The advantage is that this provides a local scope in which objects are defined.  As a result, any local variables defined in the representation of the object can be used as they are without worrying that they will conflict with other names in the file.  No reduction is performed and no replacements are made, making the method faster and more robust, but less attractive if the files need to be inspected by humans:

In [3]:
archive = Archive(scoped=True)
archive.insert(a=a, x=x, b=b)

# Get the string representation
%time s = str(archive)
print(s)

CPU times: user 713 µs, sys: 230 µs, total: 943 µs
Wall time: 778 µs
_g5 = 3
_g4 = 2
a = 1

def _g3(_l_0=a,_l_1=_g4,_l_2=_g5):
    return [_l_0, _l_1, _l_2]
_g3 = _g3()

def x():
    import numpy as numpy
    return numpy.fromstring('\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00', dtype='<i8').reshape((2,))
x = x()

def b(_l_0=x,_l_1=_g3,_l_2=_g3):
    return [_l_0, _l_1, _l_2]
b = b()
del _g5, _g4, _g3
try: del __builtins__
except NameError: pass
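
The pattern in the scoped output can be reproduced by hand to see why name clashes are not an issue.  In this sketch (plain lists, no `persist`), both functions use a local variable called `tmp`, which is an invented name for illustration; neither leaks into the resulting namespace, and default arguments pass already constructed objects into each function.

```python
scoped_source = """\
def x():
    tmp = [0, 1]          # local name, invisible outside this function
    return tmp
x = x()

def b(_l_0=x, _l_1=x):    # defaults capture the already built objects
    tmp = [_l_0, _l_1]    # reuses the name 'tmp' without any clash
    return tmp
b = b()
"""

env = {}
exec(scoped_source, env)
print("tmp" in env)                 # the locals never reach the namespace
print(env["b"][0] is env["x"])      # references are preserved
```

Each function name is rebound to its own return value, so after evaluation `x` and `b` are the constructed objects rather than functions.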


### External Data

Here we store an array that exceeds the array threshold:

In [4]:
archive = Archive(scoped=False, datafile='tmpdata', array_threshold=3)
x = np.arange(10)
archive.insert(a=a, x=x, b=b)
s = str(archive)
print(s)

import numpy as _numpy
a = 1
_g4 = [a, 2, 3]
b = [_numpy.fromstring('\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00', dtype='<i8').reshape((2,)), _g4, _g4]
x = _arrays['array_0']
del _numpy
del _g4
try: del __builtins__
except NameError: pass


The data for `x` is stored in a separate file:

In [5]:
!ls $archive.datafile

array_0.npy


To evaluate this archive, we must first load the arrays and provide these in the execution environment:

In [6]:
from persist.archive import load_arrays
d = {archive.data_name: load_arrays(archive.datafile)}
exec(s, d)
d

{'_arrays': {'array_0': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])},
 'a': 1,
 'b': [array([0, 1]), [1, 2, 3], [1, 2, 3]],
 'x': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}

The convenience method `Archive.eval` does this for us:

In [7]:
a = Archive(datafile='tmpdata')
a.eval(s)

{'a': 1,
 'b': [array([0, 1]), [1, 2, 3], [1, 2, 3]],
 'x': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}
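
As a rough picture of what such a convenience has to do, the sketch below evaluates an archive-style string with the external arrays supplied and then removes the helper dictionary from the result.  The function `eval_archive` is hypothetical and is not the source of `Archive.eval`; plain lists stand in for the stored arrays.

```python
def eval_archive(source, arrays, data_name="_arrays"):
    """Hypothetical stand-in: evaluate `source` with `arrays` available."""
    env = {data_name: arrays}
    exec(source, env)
    env.pop(data_name, None)        # drop the helper dict from the result
    env.pop("__builtins__", None)   # in case the source did not delete it
    return env


source = "a = 1\nx = _arrays['array_0']\nb = [x, x]"
result = eval_archive(source, {"array_0": list(range(10))})
print(sorted(result))
```

The returned dictionary then contains only the archived names, as in the output above.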

In [8]:
!rm -rf tmpdata

# DataSet Format

This notebook outlines the format of a DataSet on disk.

For this discussion, we shall assume that the dataset is stored in a directory called `dataset`.  This DataSet directory contains the following files:

* `_this_dir_is_a_DataSet`: This is an empty file signifying that the directory is a DataSet.

* `__init__.py`: Each DataSet is an importable python module so that the data can be used on a machine without the `persist` package.  This file contains all the data for, and defines, the following variable:
  * `_info_dict`: This is a dictionary/namespace with string keys (which must be valid python identifiers) and associated data (which should in general be small).  These are intended to be interpreted as meta-data.
  
  In addition to `_info_dict`, the module may contain variables corresponding to the keys in `_info_dict` which are generally larger chunks of data and are not stored here (see below).  If the DataSet is directly imported, then `__init__.py` will attempt to load all of these arrays into memory.  (This behaviour may change in the future.)
  
  For the remainder of this discussion, we shall assume that `_info_dict` contains the key `'x'`.
  
* `x.py`: This is the python file responsible for loading the data associated with the key `'x'` in `_info_dict`.  If the size of the array is less than the `array_threshold` specified in the `DataSet` object, then the data for the array is stored in this file; otherwise this file is responsible for loading the data from an associated file.

* `data_x.*`: If the size of the array stored in `x` is larger than the `array_threshold`, then the data associated with `x` is stored in this file/directory which is either an HDF5 file or a numpy file.  This will be loaded by the file `x.py`.
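
The layout described in this list can be inspected with a few lines of standard-library code.  The helper `describe_dataset` below is hypothetical (it is not provided by `persist`) and relies only on the file naming described above; the directory it inspects is a mock built from empty files.

```python
import os
import shutil
import tempfile

MARKER = "_this_dir_is_a_DataSet"


def describe_dataset(path):
    """Hypothetical helper: map each attribute name to where its data lives."""
    entries = set(os.listdir(path))
    if MARKER not in entries:
        raise ValueError("{} is not a DataSet directory".format(path))
    layout = {}
    for entry in sorted(entries):
        name, ext = os.path.splitext(entry)
        if ext == ".py" and name != "__init__":
            # External data, if any, is a file or directory named data_<name>.*
            external = sorted(e for e in entries
                              if os.path.splitext(e)[0] == "data_" + name)
            layout[name] = external[0] if external else "inline in " + entry
    return layout


# Build a mock DataSet directory with empty files, following the list above.
root = tempfile.mkdtemp()
for fname in [MARKER, "__init__.py", "a.py", "x.py", "data_x.npy"]:
    with open(os.path.join(root, fname), "w"):
        pass

layout = describe_dataset(root)
print(layout)
shutil.rmtree(root)   # remove the mock directory
```

Here `a` is reported as stored inline in `a.py`, while `x` is reported as backed by the external file `data_x.npy`.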

## Caveats

Each attribute in a DataSet, such as `dataset.x` above, is an independent archive.  Thus, if the same data is stored under different names in a DataSet, then this data will be duplicated.  Each attribute, however, can be a full archive, so data that is referenced multiple times within a single attribute is stored only once.
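
The same behaviour can be seen by analogy with the standard `pickle` module, which also preserves shared references within one serialization but not across separate ones.  This is only an analogy for the caveat above and does not use `persist`.

```python
import pickle

shared = list(range(5))
x = [shared, shared]   # the same list referenced twice inside one object
y = x                  # the same object stored under a second name

# Serialize each "attribute" independently, as separate archives would.
x2 = pickle.loads(pickle.dumps(x))
y2 = pickle.loads(pickle.dumps(y))

print(x2[0] is x2[1])  # sharing inside one attribute is preserved
print(x2 is y2)        # sharing between attributes is lost
```

After the round trip, `x2` and `y2` are equal but independent copies, so the underlying data exists twice.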

# Examples

In [9]:
import os.path
import tempfile
import shutil
import numpy as np
from persist.archive import DataSet
tmpdir = tempfile.mkdtemp()  # Make temporary directory for dataset
print("Storing dataset in {}".format(tmpdir))

a = np.arange(10)
x = np.arange(100)

ds = DataSet(os.path.join(tmpdir, 'dataset'), 'w', array_threshold=20, data_format='npy')

ds.a = a
ds.x = x
ds['a'] = "A small array"
ds['x'] = "A large array"

!ls -a1 $tmpdir/dataset
print(os.listdir(os.path.join(tmpdir, 'dataset')))
shutil.rmtree(tmpdir)        # Remove files

Storing dataset in /var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmpdAebTQ
.
..
__init__.py
_this_dir_is_a_DataSet
a.py
data_a
data_x
x.py
['__init__.py', '_this_dir_is_a_DataSet', 'a.py', 'data_a', 'data_x', 'x.py']




In [10]:
tmpdir = tempfile.mkdtemp()  # Make temporary directory for dataset
print("Storing dataset in {}".format(tmpdir))

a = np.arange(10)
x = [np.arange(100)]
y = x

ds = DataSet(os.path.join(tmpdir, 'dataset'), 'w', array_threshold=20, data_format='npy')

ds.a = a
ds.x = x
ds.y = y
ds['a'] = "A small array"
ds['x'] = "A large array"

!ls -a1 $tmpdir/dataset
print(os.listdir(os.path.join(tmpdir, 'dataset')))

ds = DataSet(os.path.join(tmpdir, 'dataset'))
shutil.rmtree(tmpdir)        # Remove files

Storing dataset in /var/folders/m7/dnr91tjs4gn58_t3k8zp_g000000gp/T/tmpcimdvb
.
..
__init__.py
_this_dir_is_a_DataSet
a.py
data_a
data_x
data_y
x.py
y.py
['__init__.py', '_this_dir_is_a_DataSet', 'a.py', 'data_a', 'data_x', 'data_y', 'x.py', 'y.py']
