In [1]:
import numpy as np
import tables as tb
write_path = 'test_tables.h5'

# When to flush

Our data consist of many events. For each event, we have 5 different signals, each of which we append to its own pytable. The number of rows we append to each pytable varies from signal to signal and from event to event.


This is a simplified representation of how we have implemented our table writers and an experimental investigation into when to flush.

#### Helper functions

In [2]:
class Signal(tb.IsDescription):
    event    = tb.  Int32Col()
    time     = tb.Float32Col()
    energy   = tb.Float32Col()
    
filt = tb.Filters()

def create_n_pytables(num_tables, h5out, group):
    """
    Create num_tables pytables in group.
    The tables are accessible via group.ti, where i is in range(num_tables).
    """
    tables = []
    for i in range(num_tables):
        path = 't{}'     .format(i)
        name = 'Table  {}'.format(i)
        tables.append(h5out.create_table(group, path, Signal, name, filt))
        tables[-1].cols.event.create_index()
    return tables
    
def toy_signal():
    """
    Make a toy signal (time, energy),
    where time and energy are 1d np.ndarrays of equal but
    random length, at least minl and less than maxl.
    """
    minl = 10; maxl = 100
    signal_length = np.random.randint(minl, high=maxl)
    t = np.arange(signal_length, dtype=np.float32)
    e = np.random.random(signal_length)
    return t, e

def write_signal_for_one_event(table, event, toy_signal, flush_0=False):
    for t, e in zip(*toy_signal):
        table.row["event"]  = event
        table.row["time"]   = t
        table.row["energy"] = e
        table.row.append()
        
    if flush_0: table.flush() # Should we flush here? 
                              # Sometimes? Always?
                              # Pytables documentation seems to  
                              # recommend flushing here.
                              # But we have never run into problems 
                              # without this flush, and flushing,
                              # at least with our implementation, 
                              # slows things down a lot.

#### Main

In [3]:
def write_some_pytables(write_path, 
                        num_tables =  5,   # Number of tables to write
                        num_events =100,   # Number of events
                        flush_0 = False,   # Flush each table for each event 
                        flush_1 = False):  # Flush file before closing the file
    
    with tb.open_file(write_path, 'w') as h5out:
        g1 = h5out.create_group(h5out.root, 'g1')         # Make group
        tables = create_n_pytables(num_tables, h5out, g1) # Make num_tables in group

        for event in range(num_events): # For each event,
            for table in tables:        # Write a toy signal to its table.
                write_signal_for_one_event(table, 
                                           event, 
                                           toy_signal(), 
                                           flush_0=flush_0)
                
        if flush_1: h5out.flush() 
        # Should we flush the entire file here? When we don't do this
        # we frequently end up with blank pytables.
        #
        # It's strange that flushing here changes anything, since the file
        # closes immediately after this line is executed, and the pytables
        # documentation says a file is flushed automatically as it closes.

## Experimental Flushing
For the rest of this notebook we create and write events to pytables, experimenting with when and where to flush.    

As a sort of benchmark, in most of these cells I set `num_events` to 100k. This is excessively large: in NEXT the number of detector data events we write to one file is usually less than 200, and the number of Monte Carlo events we write to one file typically does not exceed 100k. By using such a large number of events in this mini-study, we can hopefully have some confidence that if we don't see data losses here, we will not lose any events in our actual data processing.

First, notice that for up to 4 signals per event (4 pytables) we do not lose any data, even if we do no flushing. (I've checked this with `num_tables` also set to 1, 2, and 3.)

In [4]:
num_events = 100000; num_tables = 4
%time write_some_pytables(write_path, num_tables=num_tables, num_events=num_events, flush_0=False, flush_1=False)

with tb.open_file(write_path, 'r') as f: 
    for table in f.root.g1: # Ensure each pytable has num_events events
        print('fraction of events successfully written:', 
              len(set(table[:]['event'])) / num_events, 'in', table.name) # python 3 division

CPU times: user 39.8 s, sys: 536 ms, total: 40.3 s
Wall time: 40.8 s
fraction of events successfully written: 1.0 in t0
fraction of events successfully written: 1.0 in t1
fraction of events successfully written: 1.0 in t2
fraction of events successfully written: 1.0 in t3
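
The check used in these cells counts the distinct event numbers found in a table and divides by the number of events we tried to write. A minimal sketch of the same idea using `np.unique` follows. The function name `fraction_of_events_written` and the hand-made `events` array are illustrative assumptions, not part of our writers; with a real table, the array would be the table's `event` column.

```python
import numpy as np


def fraction_of_events_written(event_column, num_events):
    """Fraction of the expected events that appear at least once.

    event_column: 1d array of event numbers, one entry per row written.
    num_events:   number of events we attempted to write.
    """
    return np.unique(event_column).size / num_events


# Hypothetical stand-in for table[:]['event']:
# events 0 and 2 have rows, event 1 is missing entirely.
events = np.array([0, 0, 0, 2, 2], dtype=np.int32)
fraction = fraction_of_events_written(events, num_events=3)
print('fraction of events written:', fraction)
```

A value below 1.0 means at least one event has no rows at all. It does not tell us whether an event that is present has all of its rows.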


If we flush after every event, writing is almost 40 times slower per event (about 15.7 ms per event in the cell below, versus about 0.41 ms per event above), so we don't want to do that. Notice that I've decreased `num_events` to 1k so that the cell is not so time consuming.

In [5]:
num_events = 1000; num_tables = 4
%time write_some_pytables(write_path, num_tables=num_tables, num_events=num_events, flush_0=True, flush_1=False)

with tb.open_file(write_path, 'r') as f: 
    for table in f.root.g1:
        print('fraction of events successfully written:', 
              len(set(table[:]['event'])) / num_events, 'in', table.name)

CPU times: user 14.9 s, sys: 663 ms, total: 15.6 s
Wall time: 15.7 s
fraction of events successfully written: 1.0 in t0
fraction of events successfully written: 1.0 in t1
fraction of events successfully written: 1.0 in t2
fraction of events successfully written: 1.0 in t3
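
Because cells [4] and [5] use different values of `num_events`, the two wall times have to be normalised per event before they can be compared. The arithmetic is spelled out below, using only the wall times printed above. The helper name `seconds_per_event` is just for illustration.

```python
def seconds_per_event(wall_time_s, num_events):
    """Average wall time spent per event."""
    return wall_time_s / num_events


no_flush   = seconds_per_event(40.8, 100000)  # cell [4]: flush_0=False
with_flush = seconds_per_event(15.7,   1000)  # cell [5]: flush_0=True
slowdown   = with_flush / no_flush

print('no flush        : {:.2f} ms per event'.format(no_flush   * 1e3))
print('flush each event: {:.2f} ms per event'.format(with_flush * 1e3))
print('slowdown factor : {:.1f}'.format(slowdown))
```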


But if we increase the number of tables to 5, even with only 1 event, one of the pytables is lost if we do not flush.
After some experimenting, it seems to consistently be whichever table had this line run first: `tables[-1].cols.event.create_index()`

In [8]:
num_events = 1; num_tables = 5
%time write_some_pytables(write_path, num_tables=num_tables, num_events=num_events, flush_0=False, flush_1=False)
with tb.open_file(write_path, 'r') as f: 
    for table in f.root.g1:
        print('fraction of events successfully written:', 
              len(set(table[:]['event'])) / num_events)

CPU times: user 41.8 ms, sys: 24.7 ms, total: 66.5 ms
Wall time: 64.7 ms
fraction of events successfully written: 0.0
fraction of events successfully written: 1.0
fraction of events successfully written: 1.0
fraction of events successfully written: 1.0
fraction of events successfully written: 1.0


Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/Users/alej/miniconda/envs/IC3.6/lib/python3.6/site-packages/tables/node.py", line 321, in __del__
    self._f_close()
  File "/Users/alej/miniconda/envs/IC3.6/lib/python3.6/site-packages/tables/table.py", line 2957, in _f_close
    self.flush()
  File "/Users/alej/miniconda/envs/IC3.6/lib/python3.6/site-packages/tables/table.py", line 2891, in flush
    self.row._flush_buffered_rows()
  File "tables/tableextension.pyx", line 1333, in tables.tableextension.Row._flush_buffered_rows (tables/tableextension.c:16357)
  File "tables/tableextension.pyx", line 749, in tables.tableextension.Row.table.__get__ (tables/tableextension.c:9587)
  File "/Users/alej/miniconda/envs/IC3.6/lib/python3.6/site-packages/tables/file.py", line 2101, in _check_open
    raise ClosedFileError("the file object is closed")
tables.exceptions.ClosedFileError: the file object is closed
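
A lost table like the one above is still present in the file, just with zero rows, so it can be spotted from row counts alone without reading any data. The sketch below is one way to do that. The helper names `row_counts` and `empty_tables` and the small demo file are assumptions made for illustration, and the demo does not reproduce the failure itself.

```python
import os
import tempfile

import tables as tb


def row_counts(path, group_path='/g1'):
    """Return {table name: number of rows} for every table in group_path."""
    with tb.open_file(path, 'r') as h5in:
        return {table.name: int(table.nrows)
                for table in h5in.iter_nodes(group_path, classname='Table')}


def empty_tables(path, group_path='/g1'):
    """Names of the tables in group_path that contain no rows."""
    counts = row_counts(path, group_path)
    return sorted(name for name, nrows in counts.items() if nrows == 0)


# Hypothetical demo file: t0 gets three rows, t1 is left empty on purpose.
demo_path = os.path.join(tempfile.mkdtemp(), 'row_count_sketch.h5')
with tb.open_file(demo_path, 'w') as h5out:
    group  = h5out.create_group(h5out.root, 'g1')
    filled = h5out.create_table(group, 't0', {'event': tb.Int32Col()})
    h5out.create_table(group, 't1', {'event': tb.Int32Col()})
    filled.append([(0,), (0,), (1,)])
    filled.flush()

counts = row_counts(demo_path)
empty  = empty_tables(demo_path)
print(counts)
print(empty)
```

An empty list from `empty_tables` only shows that every table received some rows. Checking that every event was written still needs the per-event test used in the cells above.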


We can solve the problem in the cell above if we **flush the entire file once just before closing it.**    
We don't lose any tables or events even when we increase `num_tables` to 10.

In [6]:
num_events = 100000; num_tables = 10
%time write_some_pytables(write_path, num_tables=num_tables, num_events=num_events, flush_0=False, flush_1=True)
with tb.open_file(write_path, 'r') as f: 
    for table in f.root.g1:
        assert len(set(table[:]['event'])) ==  num_events # Just assert that all events were written

CPU times: user 1min 42s, sys: 1.33 s, total: 1min 44s
Wall time: 1min 45s


Also, flushing once just before closing the write file does not slow things down significantly: the two wall times below differ by about 1%.

In [7]:
num_events = 100000; num_tables = 4
print('File flushed just before closing it.')
%time write_some_pytables(write_path, num_tables=num_tables, num_events=num_events, flush_0=False, flush_1=True)
with tb.open_file(write_path, 'r') as f: 
    for table in f.root.g1: 
        assert len(set(table[:]['event'])) ==  num_events
print('--------')     
print('No flush')
%time write_some_pytables(write_path, num_tables=num_tables, num_events=num_events, flush_0=False, flush_1=False)
with tb.open_file(write_path, 'r') as f: 
    for table in f.root.g1:
        assert len(set(table[:]['event'])) ==  num_events

File flushed just before closing it.
CPU times: user 40.1 s, sys: 572 ms, total: 40.6 s
Wall time: 41.3 s
--------
No flush
CPU times: user 39.8 s, sys: 571 ms, total: 40.4 s
Wall time: 40.9 s


Primarily, this mini-study suggests that with our implementation, **flushing the entire file once just before closing it should be sufficient to prevent any loss in data, without suffering a meaningful loss in speed**.
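
The pattern this points to is sketched below in a self-contained form. It is a simplified illustration, not our actual writer: `ToySignal`, `write_tables_with_final_flush` and the temporary demo file are hypothetical names, and the sketch leaves out the `create_index()` call made in `create_n_pytables`, so it shows where the flush goes rather than reproducing the lost-table behavior.

```python
import os
import tempfile

import numpy as np
import tables as tb


class ToySignal(tb.IsDescription):
    # Hypothetical two-column stand-in for the Signal description above.
    event  = tb.  Int32Col()
    energy = tb.Float32Col()


def write_tables_with_final_flush(path, num_tables=5, num_events=3,
                                  rows_per_event=4):
    """Write rows to several tables and flush the file once before it closes."""
    with tb.open_file(path, 'w') as h5out:
        group  = h5out.create_group(h5out.root, 'g1')
        tables = [h5out.create_table(group, 't{}'.format(i), ToySignal)
                  for i in range(num_tables)]

        for event in range(num_events):
            for table in tables:
                row = table.row
                for energy in np.random.random(rows_per_event):
                    row['event']  = event
                    row['energy'] = energy
                    row.append()       # buffered, no per-event flush

        # One explicit flush, still inside the `with` block, so the
        # buffered rows are written while the file is certainly open.
        h5out.flush()


demo_path = os.path.join(tempfile.mkdtemp(), 'flush_sketch.h5')
write_tables_with_final_flush(demo_path)

with tb.open_file(demo_path, 'r') as h5in:
    rows_written = {table.name: int(table.nrows) for table in h5in.root.g1}
print(rows_written)
```

Keeping the flush inside the `with` block is the important part. Where exactly it sits relative to the last write should not matter, as long as it runs before the file is closed.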

Some additional notes are:    

- The danger of losing data seems to increase with the number of tables we write simultaneously, not with the total number of events we write. Even when I increased `num_events` to 1,000,000 (not shown here) we did not lose any events with 1 pytable and no flushes. I did not try this with more than 1 pytable because it is time consuming.


- Pytables seems to exhibit some unexpected behavior when we write more than 4 tables: it raises a strange, ignored exception when we do not flush manually before closing the file. I say flush 'manually' because I think the file should flush automatically before closing. The Pytables documentation says for `h5out.close()`: "Flush all the alive leaves in object tree and close the file" and for `h5out.flush()`: "Flush all the alive leaves in the object tree." So, when we have 5+ pytables and we don't flush manually, I think the strange exception is a flush that is attempted as the file closes and fails (the traceback ends in `ClosedFileError`, raised from a `flush()` call inside a node's `__del__`). A consequence of this behavior is that afterwards the file seems not to have been closed properly: I cannot reopen the file in 'w' mode until I restart the notebook (`Restart & Clear Output`). See below.

Notice that **the last cell run up to this point, cell [8], is the cell that lost a table**, attempting to write 5 tables with no flushes. The file can still be opened in 'r', 'r+' and 'a' modes, but I cannot reopen `write_path` in 'w' mode without restarting the notebook.

In [9]:
with tb.open_file(write_path,  'r') as f: print(f)

test_tables.h5 (File) ''
Last modif.: 'Sat Sep  2 16:12:47 2017'
Object Tree: 
/ (RootGroup) ''
/g1 (Group) ''
/g1/t0 (Table(0,)) 'Table  0'
/g1/t1 (Table(32,)) 'Table  1'
/g1/t2 (Table(36,)) 'Table  2'
/g1/t3 (Table(34,)) 'Table  3'
/g1/t4 (Table(65,)) 'Table  4'



In [10]:
with tb.open_file(write_path, 'r+') as f: print(f)

test_tables.h5 (File) ''
Last modif.: 'Sat Sep  2 16:12:47 2017'
Object Tree: 
/ (RootGroup) ''
/g1 (Group) ''
/g1/t0 (Table(0,)) 'Table  0'
/g1/t1 (Table(32,)) 'Table  1'
/g1/t2 (Table(36,)) 'Table  2'
/g1/t3 (Table(34,)) 'Table  3'
/g1/t4 (Table(65,)) 'Table  4'



In [11]:
with tb.open_file(write_path,  'a') as f: print(f)

test_tables.h5 (File) ''
Last modif.: 'Sat Sep  2 16:12:47 2017'
Object Tree: 
/ (RootGroup) ''
/g1 (Group) ''
/g1/t0 (Table(0,)) 'Table  0'
/g1/t1 (Table(32,)) 'Table  1'
/g1/t2 (Table(36,)) 'Table  2'
/g1/t3 (Table(34,)) 'Table  3'
/g1/t4 (Table(65,)) 'Table  4'



In [12]:
with tb.open_file(write_path,  'w') as f: print(f)

HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 522, in H5Fcreate
    unable to create file
  File "H5Fint.c", line 1024, in H5F_open
    unable to truncate a file which is already open

End of HDF5 error back trace

Unable to open/create file 'test_tables.h5'