# Usage of HybridDict

`HybridDict` is a dictionary-based class that stores matrices or tensors either in memory (as numpy arrays) or on disk (as HDF5 datasets). Because numpy and h5py index arrays in a similar way, it offers a unified interface to both kinds of array, hence the description: a **dictionary that hybridizes in-memory and on-disk arrays**.

In [1]:
import numpy as np
from pyscf.dh.util import HybridDict

## Basic Usage

The following example shows how `HybridDict` stores a numpy array and allocates a zero-initialized (5, 5) array of doubles on disk through h5py (HDF5):

In [2]:
tensors = HybridDict()
tensors["numpy_array"] = np.eye(5)
tensors.create("hdf5_array", shape=(5, 5), incore=False)
tensors

{'numpy_array': array([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]]),
 'hdf5_array': <HDF5 dataset "hdf5_array": shape (5, 5), type "<f8">}

The following code modifies the HDF5 data. To retrieve the whole array from disk, use `[:]` or `[()]`.

In [3]:
tensors["hdf5_array"][1:3] = np.random.randn(2, 5)
tensors["hdf5_array"][:]

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.11059277, -0.61265821,  1.02197176, -0.19033687,  0.28984891],
       [ 1.65725868, -0.05450106,  0.22574605, -0.13893546,  0.64403076],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

## Detailed Usage

### Specify HDF5 data file

For HDF5 arrays in `HybridDict`, by default,

- the file location is PySCF's temporary directory (defined by `pyscf.lib.param.TMPDIR`) if PySCF is detected, or `/tmp` if PySCF is not installed;
- the file name is given by Python's tempfile utility; after the `HybridDict` instance is destroyed, the temporary HDF5 file is deleted as well.
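The lifecycle of such a temporary file can be illustrated with the standard library alone. This is a minimal sketch, not `HybridDict`'s implementation: the source does not say which `tempfile` function is used, so `NamedTemporaryFile` is an assumption made here for illustration, as are the variable names.

```python
import os
import tempfile

# Assumption: NamedTemporaryFile is chosen only for illustration;
# the actual tempfile call inside HybridDict is not documented here.
scratch = tempfile.NamedTemporaryFile(dir=tempfile.gettempdir(), suffix=".h5")
scratch_path = scratch.name  # name generated by the tempfile utility

# The file is present on disk while the handle is alive.
exists_while_open = os.path.exists(scratch_path)

# Closing the handle deletes the file.
scratch.close()
exists_after_close = os.path.exists(scratch_path)

print(scratch_path, exists_while_open, exists_after_close)
```

The point of the sketch is only that a tempfile-managed file disappears together with the object that owns it, which is the behaviour described for the default HDF5 file.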

To override the default settings, pass the following options when creating an instance of `HybridDict`:

- `pathdir` changes file location (directory);
- `chkfile_name` changes file name.

Note that the file must exist before instantiation; `HybridDict` does not create a new file for HDF5 data.

In [4]:
# lines beginning with `!` are executed in bash, not python
! touch temporary_hdf5_data.h5
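Outside a notebook, the same empty file can be created from Python instead of `touch`. A minimal sketch using only the standard library; `ensure_file_exists` is a hypothetical helper introduced here, not part of `HybridDict`, and the demonstration writes into a throwaway directory:

```python
import tempfile
from pathlib import Path


def ensure_file_exists(path):
    """Create an empty file at `path` if it is missing, like `touch`."""
    path = Path(path)
    path.touch(exist_ok=True)  # leaves an existing file's content untouched
    return path


# Demonstrate in a throwaway directory so nothing is left in the working directory.
demo_dir = Path(tempfile.mkdtemp())
h5_file = ensure_file_exists(demo_dir / "temporary_hdf5_data.h5")
print(h5_file.exists(), h5_file.stat().st_size)
```

For this section, calling `ensure_file_exists("temporary_hdf5_data.h5")` would play the role of the `touch` command.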

In [5]:
tensors = HybridDict(chkfile_name="temporary_hdf5_data.h5", pathdir=".")

In [6]:
! ls -al temporary_hdf5_data.h5

-rw-rw-r-- 1 a a 96 Sep  6 22:00 temporary_hdf5_data.h5


The file size on disk increases once data is written to it.

In [7]:
tensors.create("hdf5_array", data=np.random.randn(20, 20), incore=False)

<HDF5 dataset "hdf5_array": shape (20, 20), type "<f8">

In [8]:
! ls -al temporary_hdf5_data.h5

-rw-rw-r-- 1 a a 5248 Sep  6 22:00 temporary_hdf5_data.h5


Since we created `temporary_hdf5_data.h5` ourselves in a conventional way (with `touch`), this file may not be deleted when `tensors` (an instance of `HybridDict`) reaches the end of its lifecycle.

### Creating an array

There are several options for creating an array.

To control whether an array is created in memory or on disk, use the `incore=` option of the member function `HybridDict.create`: `True` for memory (numpy), `False` for disk (HDF5). Usage of this option is shown in [Basic Usage](#basic-usage).

There are three ways to create a valid data entry:

1. Use `HybridDict` as an ordinary dictionary:

    ```python
    tensors["array"] = some_array_in_memory
    ```
    
    ```{Note}
    Note that the `HybridDict` class should only be used as an array data manager.
    It is not recommended to save values other than numpy arrays or HDF5 data in this class.
    More specifically, types such as list, tuple, dictionary, and simple types are not recommended
    to be stored in this class.
    
    Saving values that are not serializable is strongly discouraged.
    Doing so can affect the member function `HybridDict.dump`.
    ```

2. If the array is in memory, use the member function `HybridDict.create` with the `data=` option; this is virtually the same as the first way, but can also store the data on disk:

    ```python
    # same as the first way
    tensors.create("array", data=some_array_in_memory)
    # store array to disk
    tensors.create("array", data=some_array_in_memory, incore=False)
    # additionally cast type
    tensors.create("array", data=some_array_in_memory, dtype=complex)
    ```

3. If the array is not available in memory, create a zero-initialized array on disk by calling the member function `HybridDict.create` with the `shape=` and `incore=False` options:

    ```python
    # create (1000, 1000) array on disk
    tensors.create("array", shape=(1000, 1000), incore=False)
    ```
    
    Then write the data in batches that fit in memory:
    
    ```python
    # suppose that an array of size (10, 1000) fits in memory
    for i in range(100):
        tensors["array"][10*i:10*(i+1)] = some_batched_array_in_memory
    ```
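The batched write in the third way can be checked end to end with a small, self-contained sketch. Here a plain numpy array stands in for the on-disk dataset (an assumption made so the example runs without HDF5; slice assignment has the same form for both), and `make_batch` is a hypothetical producer of one batch of rows:

```python
import numpy as np

n_rows, n_cols, batch = 1000, 1000, 10

# Stand-in for the zero-initialized on-disk array; slice assignment
# on an HDF5 dataset is written the same way.
target = np.zeros((n_rows, n_cols))


def make_batch(start, stop):
    """Hypothetical batch producer: each row is filled with its own row index."""
    rows = np.arange(start, stop, dtype=float)
    return np.repeat(rows[:, None], n_cols, axis=1)


for start in range(0, n_rows, batch):
    # min() keeps the last batch in bounds when n_rows is not a multiple of batch
    stop = min(start + batch, n_rows)
    target[start:stop] = make_batch(start, stop)
```

Only one `(batch, n_cols)` block is held in memory by the producer at a time, which is the reason for allocating by `shape=` first and filling afterwards.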
    
Note that the options `data=` and `shape=` are incompatible: there is no need to declare a shape for new data when the data itself is already available. These options should therefore not be used simultaneously.

### Dump and load

The `HybridDict` class provides dump and load utilities. Since this class handles disk-based and memory-based data differently, the data are dumped to two different files: the HDF5 file is copied as is, while numpy arrays and other entries are stored with the pickle package. For example,

In [9]:
# create a dictionary that holds both an in-memory and an on-disk array
tensors = HybridDict()
tensors.create("numpy_array_dumped", data=np.random.randn(5, 5))
tensors.create("hdf5_array_dumped", data=np.random.randn(5, 5), incore=False)
# dump to `dumped.h5` and `dumped.dat`
tensors.dump(h5_path="dumped.h5", dat_path="dumped.dat")
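The two-file layout described above can be imitated with the standard library and numpy. This is a sketch of the idea only, not the code of `HybridDict.dump`: the helpers `dump_two_files` and `load_in_memory` are hypothetical, and a file of placeholder bytes stands in for real HDF5 content.

```python
import os
import pickle
import shutil
import tempfile

import numpy as np


def dump_two_files(in_memory, disk_file, h5_path, dat_path):
    """Copy the on-disk file as is and pickle the in-memory entries."""
    shutil.copy(disk_file, h5_path)
    with open(dat_path, "wb") as f:
        pickle.dump(in_memory, f)


def load_in_memory(dat_path):
    """Read the pickled in-memory entries back."""
    with open(dat_path, "rb") as f:
        return pickle.load(f)


workdir = tempfile.mkdtemp()
disk_file = os.path.join(workdir, "scratch.h5")
with open(disk_file, "wb") as f:
    f.write(b"stand-in for HDF5 data")  # placeholder, not a valid HDF5 file

in_memory = {"numpy_array_dumped": np.arange(25.0).reshape(5, 5)}
h5_path = os.path.join(workdir, "dumped.h5")
dat_path = os.path.join(workdir, "dumped.dat")

dump_two_files(in_memory, disk_file, h5_path, dat_path)
restored = load_in_memory(dat_path)

# The copied file is byte-identical and the arrays survive the round trip.
with open(disk_file, "rb") as a, open(h5_path, "rb") as b:
    copy_is_identical = a.read() == b.read()
arrays_match = np.array_equal(restored["numpy_array_dumped"],
                              in_memory["numpy_array_dumped"])

# A value that pickle cannot serialize makes the pickle step fail.
try:
    pickle.dumps({"callback": lambda x: x})
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickled = False

shutil.rmtree(workdir)  # remove the scratch directory
```

The last part shows why non-serializable values are discouraged: a step that relies on pickle cannot store them.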

Then we can load these files to construct a new `HybridDict` instance. Additional parameters are passed to the constructor of `HybridDict`. For example,

In [10]:
! touch disk_of_tensors_loaded.h5

In [11]:
tensors_loaded = HybridDict.pick(
    # load data from previous instance
    h5_path="dumped.h5", dat_path="dumped.dat",
    # specify HDF5 file location for current instance
    pathdir=".", chkfile_name="disk_of_tensors_loaded.h5")

tensors_loaded.create("hdf5_array_appended", data=np.random.randn(10, 10), incore=False)
tensors_loaded

{'hdf5_array_dumped': <HDF5 dataset "hdf5_array_dumped": shape (5, 5), type "<f8">,
 'numpy_array_dumped': array([[ 0.60168714,  1.38361886,  0.8212771 , -1.20283382, -0.06599009],
        [-0.34737881, -0.36756216,  1.90771333,  1.48354733,  0.10901779],
        [ 0.87932125,  1.33943419, -0.51066958,  0.20387545,  0.61703993],
        [ 0.55893424,  0.84483485,  0.3450849 ,  0.33837149,  1.23697113],
        [ 0.14952038, -0.43165432, -0.59371613,  0.76986518, -0.7204273 ]]),
 'hdf5_array_appended': <HDF5 dataset "hdf5_array_appended": shape (10, 10), type "<f8">}

```{Note}

Note that the loaded HDF5 file will not be modified.

```

In [12]:
! ls -al dumped.h5 disk_of_tensors_loaded.h5

-rw-rw-r-- 1 a a 5096 Sep  6 22:00 disk_of_tensors_loaded.h5
-rw------- 1 a a 2248 Sep  6 22:00 dumped.h5


In [13]:
# clean scratch files in this document
! rm *.h5
! rm *.dat