# Results of chunking and compression mechanisms
            
            
Using compression in HDF5 requires chunking. Chunking stores the dataset as fixed-shape blocks (chunks), each of which is written contiguously on disk. For example, for an array of dimensions (36, 257, 167, 84), chunking with (18, 257, 167, 84) gives two chunks, because only the first axis is split (36 / 18 = 2). Chunking is useful when you anticipate reading subsets of the data. Compared to storing the whole array contiguously row by row, reading a subset that lines up with the chunk shape chosen at write time is more efficient, since only the chunks that overlap the requested subset have to be read.
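
As a rough illustration of the chunk arithmetic, the following sketch counts the chunks along each axis for the example array shape and two chunk shapes, one of them taken from the benchmark tables. The helper name `chunk_grid` is hypothetical and not part of h5py. A partially filled chunk at the edge of the array still counts as a chunk, which is why the division rounds up.

```python
import math

def chunk_grid(data_shape, chunk_shape):
    """Number of chunks along each axis (hypothetical helper, not an h5py function)."""
    return tuple(math.ceil(n / c) for n, c in zip(data_shape, chunk_shape))

data_shape = (36, 257, 167, 84)

for chunk_shape in [(18, 257, 167, 84), (18, 36, 10, 21)]:
    grid = chunk_grid(data_shape, chunk_shape)
    print(chunk_shape, "->", grid, "=", math.prod(grid), "chunks")
# (18, 257, 167, 84) -> (2, 1, 1, 1) = 2 chunks
# (18, 36, 10, 21) -> (2, 8, 17, 4) = 1088 chunks
```

Smaller chunks mean more chunks to index and compress individually, which is one reason the chunk shape affects both file size and timing.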

Several parameters control how the compression is done:

* chunks: the shape of each chunk
* compression: the compression algorithm; one of gzip, szip, lzf, or a Blosc compressor (zstd, blosclz, lz4, lz4hc, zlib)
* compression_opts: the options for the chosen compression algorithm (for gzip, the compression level)
* shuffle: whether to rearrange the bytes of the data before compression, which can improve the compression ratio
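
To make the role of these four parameters concrete, here is a minimal sketch that assembles them into keyword arguments. The helper `dataset_kwargs` is hypothetical, and its checks cover only gzip and lzf, under the assumption that gzip takes an integer level from 0 to 9 and lzf takes no options. Other algorithms are passed through unchecked.

```python
def dataset_kwargs(chunks, compression, compression_opts=None, shuffle=False):
    """Assemble keyword arguments for a chunked, compressed dataset (hypothetical helper)."""
    if compression == "gzip":
        # Assumption: gzip accepts an integer level from 0 to 9
        if compression_opts is not None and compression_opts not in range(10):
            raise ValueError("gzip expects a compression level between 0 and 9")
    elif compression == "lzf" and compression_opts is not None:
        raise ValueError("lzf does not take compression options")

    kwargs = {"chunks": tuple(chunks), "compression": compression, "shuffle": bool(shuffle)}
    if compression_opts is not None:
        kwargs["compression_opts"] = compression_opts
    return kwargs

kwargs = dataset_kwargs((18, 36, 10, 21), "gzip", compression_opts=3, shuffle=True)
print(kwargs)
# With h5py installed, these would typically be passed on as:
#   f.create_dataset("Y", data=Y, **kwargs)
```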

Compression and write speed:

The results in this section depend on the hardware. We start by considering only the Y channel; the analysis is similar for the R channel.

Let's consider the following chunk shape with gzip at compression level 3 and with shuffle turned off:

| Chunk Size | Compression | Compression Options | Shuffle | Channel | File Size (MB) | Write Time (sec) | Read Time (sec) |
| -- | -- | -- | -- | -- | -- | -- | -- |
| (18, 36, 10, 21) | gzip | 3 | 0 | Y | 419.16 | 45.48 | 6.72 |


After compressing the Y channel only, the file size is 419.16 MB; it takes 45.48 seconds to write the data and 6.72 seconds to read it back. Next, we turn shuffling on:

| Chunk Size | Compression | Compression Options | Shuffle | Channel | File Size (MB) | Write Time (sec) | Read Time (sec) |
| -- | -- | -- | -- | -- | -- | -- | -- |
| (18, 36, 10, 21) | gzip | 3 | 1 | Y | 380.97 | 34.66 | 4.93 |

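With shuffling on, the file size drops from 419.16 MB to 380.97 MB (about 9%), and both the write and read times become shorter. The shuffle filter takes advantage of the fact that numerical data in nearby voxels tend to have close values. It reorders the bytes of the stored values so that bytes of the same significance are grouped together. The high-order bytes of nearby voxels are often identical (frequently zero), so they end up in long runs, which allows for better compression.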
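
The effect of byte shuffling can be reproduced on a small scale with the standard library. The sketch below is an illustration only: it uses made-up float32 values and `zlib` as a stand-in for the gzip filter, and the helper `byte_shuffle` is a hypothetical reimplementation of the idea, not the HDF5 filter itself.

```python
import struct
import zlib

def byte_shuffle(raw, itemsize):
    """Group byte 0 of every element, then byte 1, and so on (hypothetical helper)."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))

# Smoothly varying float32 values standing in for neighbouring voxels (made-up data)
values = [1000.0 + 0.25 * i for i in range(4096)]
raw = struct.pack("<%df" % len(values), *values)

plain_size = len(zlib.compress(raw, 3))
shuffled_size = len(zlib.compress(byte_shuffle(raw, 4), 3))

print("level 3, no shuffle:", plain_size, "bytes")
print("level 3, shuffle:   ", shuffled_size, "bytes")
```

For smooth data like this, the shuffled byte stream compresses to fewer bytes. How large the gain is on real recordings depends on the data.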

Further testing shows that shuffling consistently gives significantly better results for other chunk shapes and algorithms. We therefore keep shuffling on for the remaining tests.

Testing szip with different settings shows a consistently lower compression ratio and longer write times. Among the Blosc compressors, zstd and lz4 performed best on the available data. gzip and zstd give the smallest file sizes. However, gzip has significantly slower read and write times than the Blosc compressors and lzf.
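
The trade-off between compressed size and speed can be explored with a small timing loop. The following is a minimal sketch, assuming made-up float32 data and using only `zlib` from the standard library as a stand-in for gzip, since szip, lzf and the Blosc compressors are not available there. The timings it prints depend on the machine and are not comparable to the benchmark numbers.

```python
import struct
import time
import zlib

# Made-up, compressible test data: a float32 ramp
values = [1000.0 + 0.25 * i for i in range(4096)]
raw = struct.pack("<%df" % len(values), *values)

results = {}
for level in (1, 3, 6):  # the gzip levels listed in the benchmark setup
    start = time.perf_counter()
    packed = zlib.compress(raw, level)
    write_time = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(packed)
    read_time = time.perf_counter() - start

    assert restored == raw  # compression must be lossless
    results[level] = {"size": len(packed), "write": write_time, "read": read_time}

for level, r in results.items():
    print(f"level {level}: {r['size'] / len(raw):.2f} of original size, "
          f"compress {r['write'] * 1e3:.3f} ms, decompress {r['read'] * 1e3:.3f} ms")
```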


The following shows the results for different chunking sizes.

<img src="../media/Graph.png" width="90%"/>


In [None]:
# Setting up data for Benchmarking:

import os
from IPython.display import display
import scipy.io
import numpy as np
import hdf5plugin
import h5py


path_to_files = 'Folder/AxelLab/data'
# info file name in path_to_files folder
fname0 = 'fly2_run1_info.mat'
# .mat file containing Calcium Imaging data in path_to_files folder
fname1 = '2019_04_18_Nsyb_NLS6s_Su_walk_G_fly2_run1_8401reg.mat'


# Open info file
fpath0 = os.path.join(path_to_files, fname0)
f_info = scipy.io.loadmat(fpath0, struct_as_record=False, squeeze_me=True)
info = f_info['info']
# Open .mat file containing Calcium Imaging data
fpath1 = os.path.join(path_to_files, fname1)
file = h5py.File(fpath1, 'r')
options = file['options']
landmarkThreshold = file['landmarkThreshold']
templates = file['templates']

Y = file['Y'] 
R = file['R']
# Note: Changing axis order copies Y and R into memory
Y = np.moveaxis(Y, 1, 2) 
R = np.moveaxis(R, 1, 2)
data_shape = Y.shape

# Convert back to float32
Y = np.array(Y, dtype=np.float32)
R = np.array(R, dtype=np.float32)

In [None]:
# Setting up compression scenarios:

# The Cartesian product of the following parameters
# will be passed for compression

# blosc compressors: 'zstd', 'blosclz','lz4','lz4hc','zlib'
# blosc compression option range: integers 1-9 
compression_opts_list = { 'lzf' : [ None],            # lzf does not take compression options
                          'gzip': [ 1, 3, 6 ] }       # gzip compression option range: integers 1-9
shuffle_list = [1] # subset of [ hdf5plugin.Blosc.NOSHUFFLE, hdf5plugin.Blosc.SHUFFLE, hdf5plugin.Blosc.BITSHUFFLE ]
                   # or [ 0, 1, 2]

chunks_list =  [(4, 20, 10, 10),
                (4, 29, 43, 7),
                (36, 36, 10, 6),
                (4, 36, 11, 21)]

channels = { 'Channel_Y':  Y } # subset of ( Y , R ), format { 'Channel Name': Channel_Data }

In [None]:
# Define the function to run benchmark tests

from datetime import datetime
from dateutil.tz import tzlocal
from pynwb import NWBFile, NWBHDF5IO, ProcessingModule
from pynwb.ophys import TwoPhotonSeries, OpticalChannel, ImageSegmentation, Fluorescence, DfOverF, MotionCorrection
from pynwb.device import Device
from pynwb.base import TimeSeries
from hdmf.backends.hdf5 import H5DataIO
from IPython.display import clear_output, Markdown, update_display
import pandas as pd
import itertools
import time
from tqdm import tqdm
import functools
import warnings
warnings.simplefilter('ignore')
import ipyvolume.pylab as p3
from ipyvolume import ipv
from nwbwidgets.utils.cmaps import linear_transfer_function
from axel_lab_to_nwb import SparseIterator
from axel_lab_to_nwb import plot_grid_seq


def run_benchmark_tests( compression_param_product, channels, info, uncompressed_results = None, show_results = True ):

    header_str = ("Chunk Size","Compression","Compression Options",
                  "Shuffle","Write Time (sec)","Read Time (sec)",
                  "File Size (MB)")
    if uncompressed_results is not None:
        header_str +=  ("Compressed / Uncompressed Write Time Ratio",
                        "Compressed / Uncompressed Read Time Ratio",
                        "Compressed / Uncompressed Size Ratio")
    header_str += ("Written / Total Chunks Ratio", )
    display_style = {h: "{:.2f}" for h in header_str[4:]}

    data_frame = pd.DataFrame(columns = header_str)
    results_file = 'results.tsv'

    initial_display = True; chunks_quantized = 0
    compression_param_product_withpbar = tqdm(list(compression_param_product))
    for compression_params in compression_param_product_withpbar:

        # progressbar update
        compression_param_product_withpbar.set_description("Processing %s" % str(compression_params) )
        compression_param_product_withpbar.refresh()

        # unpacking compression parameters
        compression_opts, compression, chunk_shape, shuffle = compression_params
        
        run_nwb_result = run_nwb(compression_opts, compression, chunk_shape, shuffle, channels, info)
        time_write, time_read, file_size, chunk_ratio, chunks_written, chunk_maxvalues = run_nwb_result

        # prepare benchmark results
        output_list =  [str(chunk_shape),
                        compression,
                        str(compression_opts),
                        str(shuffle),
                        time_write,
                        time_read,
                        file_size]

        # add comp/uncompressed ratios
        if uncompressed_results is not None:
            output_ratio = np.array([time_write, time_read, file_size])/uncompressed_results
            output_list += list(output_ratio)

        # add ratio of written chunks
        output_list += [chunk_ratio]

        # add results to data frame
        data_frame.loc[len(data_frame)] = output_list

        if show_results:

            if initial_display:
                clear_output()
                p3.clear()
                p3.figure(width = 800, controls = False)
                ipv.style.use('seaborn-whitegrid')
                ipv.style.box_off()
                p3.display(p3.gcc(),display_id="data selection")
                
                chunks_index_shape = np.ceil( np.divide( data_shape, chunk_shape ) ).astype(int)
                chunks_index_boolean = np.zeros( chunks_index_shape, dtype = bool )
                chunks_index_boolean[tuple(chunks_written.T)] = True
                chunks_boolean_max = np.max( chunks_index_boolean, axis = 0)
                chunks_quantized_bool = np.kron( chunks_boolean_max, np.ones( chunk_shape[1:], dtype = bool ) )

                checkered_grid = functools.reduce( lambda x,y : np.logical_xor(x%2,y%2),
                                                  np.ogrid[0:chunks_index_shape[1],
                                                  0:chunks_index_shape[2],
                                                  0:chunks_index_shape[3] ] )
                checkered_grid_quantized = np.kron( checkered_grid, np.ones( chunk_shape[1:] ) )
                chunks_quantized_bool_a = checkered_grid_quantized*chunks_quantized_bool
                chunks_quantized_bool_b = np.logical_not(checkered_grid_quantized)*chunks_quantized_bool

                if 'Channel_Y' in channels:
                    p3.volshow(np.max(channels['Channel_Y'], axis = 0), controls = False,
                               tf=linear_transfer_function([0.6, 0.6, 0.6], max_opacity=0.1))
                if 'Channel_R' in channels:
                    p3.volshow(np.max(channels['Channel_R'], axis = 0), controls = False,
                               tf=linear_transfer_function([0.6, 0.6, 0.6], max_opacity=0.1))
                p3.volshow( chunks_quantized_bool_a, controls = False,
                            tf=linear_transfer_function([0.45,0.45, 1], max_opacity=0.75),
                            specular_exponent=5, lighting = True) 
                p3.volshow( chunks_quantized_bool_b, controls = False,
                            tf=linear_transfer_function([0.85,0.75, 0.6], max_opacity=0.75),
                            specular_exponent=5, lighting = True) 
                update_display(p3.current,display_id="data selection")
                    
                display(Markdown(''), display_id="figure")
                display(Markdown(''), display_id="data frame")
                initial_display = False
                
            update_display(data_frame.style.format(display_style).set_properties(width='110px'),
                           display_id="data frame")            

            # store results
            with open(results_file, 'a') as output_file:
                output_line = '\t'.join( str(el) for el in output_list)+'\n' 
                output_file.write( output_line )
            # plot results 
            if uncompressed_results is None:
                columns = header_str[4:7]
            else:
                columns = header_str[7:10]
            marker_list = ["v","d","o","X"]
            output_data_frame = data_frame.copy(deep=True) # for plotting
            sns_plot = plot_grid_seq(output_data_frame,columns = columns,
                        legend_columns=["Compression","Compression Options"],
                        markers = marker_list, add_gridline = [1,1] )
            
            if sns_plot is not None:
                update_display(sns_plot.fig, display_id="figure")
            
    # return the collected results; the caller reads the timing and size columns from it
    return data_frame

# Collect benchmark data for a given compression scenario
# time of writing and reading data, and file size
def run_nwb(compression_opts, compression, chunks, shuffle, channels, info):

    # unpack data
    if 'Channel_Y' in channels:
        Y = channels['Channel_Y']
    if 'Channel_R' in channels:
        R = channels['Channel_R']
    
    #Create new NWB file
    nwb = NWBFile(session_description='my CaIm recording', 
                  identifier='EXAMPLE_ID',
                  session_start_time=datetime.now(tzlocal()),
                  experimenter='Evan Schaffer',
                  lab='Axel lab',
                  institution='Columbia University',
                  experiment_description='EXPERIMENT_DESCRIPTION',
                  session_id='IDX')

    #Create and add device
    device = Device('Device')
    nwb.add_device(device)

    # Create an Imaging Plane for Yellow
    optical_channel_Y = OpticalChannel(name='OpticalChannel_Y',
                                       description='2P Optical Channel',
                                       emission_lambda=510.)
    imaging_plane_Y = nwb.create_imaging_plane(name='ImagingPlane_Y',
                                               optical_channel=optical_channel_Y,
                                               description='Imaging plane',
                                               device=device,
                                               excitation_lambda=488., 
                                               imaging_rate=info.daq.scanRate,
                                               indicator='NLS-GCaMP6s',
                                               location='whole central brain')
    # Create an Imaging Plane for Red
    optical_channel_R = OpticalChannel(name='OpticalChannel_R',
                                       description='2P Optical Channel',
                                       emission_lambda=633.)
    imaging_plane_R = nwb.create_imaging_plane(name='ImagingPlane_R',
                                               optical_channel=optical_channel_R,
                                               description='Imaging plane',
                                               device=device,
                                               excitation_lambda=488., 
                                               imaging_rate=info.daq.scanRate,
                                               indicator='redStinger',
                                               location='whole central brain')

    # output file name
    fname_nwb = 'file_compressed.nwb'
    output_path_to_files = 'Folder/AxelLab/data'
    fpath_nwb = os.path.join(output_path_to_files, fname_nwb)
    if os.path.isfile(fpath_nwb):
        os.remove(fpath_nwb)

    # compression keywords to pass to h5py
    shuffle = bool(shuffle)
    if compression in ['zstd','blosclz','lz4','lz4hc','zlib']:
        compression_kw = hdf5plugin.Blosc(cname=compression, clevel=compression_opts, shuffle=shuffle)
    else:
        compression_kw = { 'compression' : compression, 'compression_opts' : compression_opts,
                           'shuffle' : shuffle }
       
    chunk_ratio = {}
    chunk_index_array = {}
    chunk_maxvalues = {}    
    if 'Channel_Y' in channels:
        if chunks is not None:
            Y_chunk_iterator = SparseIterator(data=Y,
                                chunk_shape=chunks)
            chunk_ratio['Channel_Y'] = Y_chunk_iterator.chunk_ratio
            chunk_index_array['Channel_Y'] = Y_chunk_iterator.chunk_index_array
            chunk_maxvalues['Channel_Y'] = Y_chunk_iterator.chunk_maxvalues            
            Y_dataio = H5DataIO(Y_chunk_iterator, chunks=chunks, fillvalue=np.nan,
                                maxshape = (None,*Y.shape[1:]), **compression_kw)
        else:
            Y_dataio = H5DataIO(Y, **compression_kw)
            chunk_ratio['Channel_Y'] = 1
            chunk_index_array['Channel_Y'] = "all"
            chunk_maxvalues['Channel_Y'] = "all"
        raw_image_series_Y = TwoPhotonSeries(name='TwoPhotonSeries_Y',
                     imaging_plane=imaging_plane_Y,
                     rate=info.daq.scanRate,
                     dimension=Y_dataio.shape,
                     unit="unit",
                     data=Y_dataio)
        nwb.add_acquisition(raw_image_series_Y)

    if 'Channel_R' in channels:
        if chunks is not None:
            R_chunk_iterator = SparseIterator(data=R,
                                chunk_shape=chunks)
            chunk_ratio['Channel_R'] = R_chunk_iterator.chunk_ratio
            chunk_index_array['Channel_R'] = R_chunk_iterator.chunk_index_array
            chunk_maxvalues['Channel_R'] = R_chunk_iterator.chunk_maxvalues                        
            R_dataio = H5DataIO(R_chunk_iterator, chunks=chunks, fillvalue=np.nan,
                                maxshape = (None,*R.shape[1:]), **compression_kw)
        else:
            R_dataio = H5DataIO(R, **compression_kw)
            chunk_ratio['Channel_R'] = 1
            chunk_index_array['Channel_R'] = "all"
            chunk_maxvalues['Channel_R'] = "all"            
        raw_image_series_R = TwoPhotonSeries(name='TwoPhotonSeries_R',
                     imaging_plane=imaging_plane_R,
                     rate=info.daq.scanRate,
                     dimension=R_dataio.shape,
                     unit="unit",
                     data=R_dataio)
        nwb.add_acquisition(raw_image_series_R)

    # start compression write clock
    # (time.clock() was removed in Python 3.8; perf_counter() measures elapsed wall-clock time)
    time_write_start = time.perf_counter()

    # Save to NWB file
    with NWBHDF5IO(fpath_nwb, mode='w') as io:
        io.write(nwb)

    time_write = time.perf_counter() - time_write_start

    # clear file buffer
    if os.name == 'posix':
        try:
            with open(fpath_nwb) as fdforfadvise:
                os.posix_fadvise(fdforfadvise.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
                # normal unix file buffer
                os.posix_fadvise(fdforfadvise.fileno(), 0, 0, os.POSIX_FADV_NORMAL)
        except (AttributeError, OSError):
            # posix_fadvise is not available on every POSIX platform
            pass

    # Load NWB file
    time_read_start = time.perf_counter()

    with NWBHDF5IO(fpath_nwb, mode='r') as io:
        nwb = io.read()
        # the keys must match those of the `channels` dict ('Channel_Y', 'Channel_R')
        if 'Channel_Y' in channels:
            Y_series = nwb.acquisition['TwoPhotonSeries_Y']
            # read data into memory
            Y_series_data = Y_series.data[()]
            del Y_series_data
        if 'Channel_R' in channels:
            R_series = nwb.acquisition['TwoPhotonSeries_R']
            # read data into memory
            R_series_data = R_series.data[()]
            del R_series_data

    time_read = time.perf_counter() - time_read_start

    nwbfile_size = os.stat(fpath_nwb).st_size/1024/1024# in MB        

    # find ratio of written chunks
    chunk_ratio_total = sum(chunk_ratio.values())/len(chunk_ratio)
    # "all" marks an unchunked run; test by type to avoid comparing arrays with a string
    if any(isinstance(v, str) for v in chunk_index_array.values()):
        chunks_written = "all"
        chunk_maxvalues = "all"
    else:
        chunks_written = np.vstack(tuple(chunk_index_array.values()))
        # remove duplicates
        chunks_written = np.unique( chunks_written, axis=0 )
        chunk_maxvalues = functools.reduce( np.maximum , tuple(chunk_maxvalues.values()) )

    return [time_write, time_read, nwbfile_size, chunk_ratio_total, chunks_written, chunk_maxvalues]


In [None]:
# Iterator for nested loop of compression variables
compression_param_product = itertools.chain.from_iterable(
    itertools.product(compression_opts_list[compression_alg],
                      [compression_alg],
                      chunks_list,
                      shuffle_list)
                      for compression_alg in compression_opts_list)

# adding uncompressed runs
num_uncompr_runs = 3 # 1st run is warm up, 2 for finding the average time without compression
no_compression_param_product = itertools.repeat( (None, None, None, None), num_uncompr_runs)
data_frame = run_benchmark_tests( no_compression_param_product, channels, info, show_results = False )
uncompressed_results = data_frame[["Write Time (sec)","Read Time (sec)", "File Size (MB)"]].loc[1:].mean()

df = run_benchmark_tests( compression_param_product, channels, info, uncompressed_results = uncompressed_results )
