# Example Usage of Minitree Containing Array Branches

This is an example of how to include arrays (e.g. peak- or event-level data) in your minitree. It was ran on midway with the goal of looking at some properties of single electrons in Xenon1T.

In [1]:
import numpy as np
import pandas as pd
import sys, os
import hax

dataset = '160706_1631'
file_header = '/home/jh3226/analysis/single_electron/'
minitree_header = os.path.join(file_header, 'datasets_reduced/')

### Building the TreeMaker

As usual, you make your TreeMaker class and override the extract_data function, which operates on each event and returns a dictionary. The keys of the dictionary become the branch names, and its values become the branch values. The only difference is that you can have dictionary keys which contain arrays, as long as there is another dictonary key included that indicates the length of these arrays for each event. Here I am saving several fields from every peak in the TPC that is not a 'lone_hit' as arrays, and the dictionary key 'nb_peaks' gives the length of these arrays.

Note also that this method doesn't currently accept strings, so I have to convert the 'type' attribute to coded ints.

In [2]:
class SEProperties(hax.minitrees.TreeMaker):
    """
    This TreeMaker will take the event class and turn it into a row
    in a table (e.g. TNtuple or pandas DataFrame).  We define only
    one function, which takes a pax event in.  It returns a dictionary
    of new variables and their values.
    """

    extra_branches = ['*']  # Activate all of ROOT file
    __version__ = '0.0.1'

    def extract_data(self, event):
        # This runs on each event
        # These are the peak properties that I'm interested in looking at.
        # Look here for more info: http://xenon1t.github.io/pax/format.html#peak
        peak_field_namelist = [
                                'area', 'area_fraction_top', 'width',
                                'hit_time_mean', 'hit_time_std',
                                'n_contributing_channels', 'n_saturated_channels',
                                'left', 'type', 'x', 'y'
                              ]

        peaks = event.peaks
        peaks = [peak for peak in peaks if ((peak.type != 'lone_hit') and (peak.detector=='tpc'))]
        nb_peaks = len(peaks)
        result = {}
        result['nb_peaks'] = nb_peaks
        result['time'] = event.start_time
        for peak_field in peak_field_namelist:
            if hasattr(peaks[0], peak_field):
                if isinstance(getattr(peaks[0], peak_field), str):
                    result[peak_field] = np.empty(nb_peaks, dtype = (str, 10))
                else:
                    result[peak_field] = np.empty(nb_peaks, dtype = type(getattr(peaks[0], peak_field)) )
                for (i, peak) in enumerate(peaks):
                    result[peak_field][i] = getattr(peak, peak_field)
            elif peak_field in ('x', 'y'):
                result[peak_field] = np.empty(nb_peaks, dtype = float)
                result[peak_field].fill(float('nan'))
                for (i, peak) in enumerate(peaks):
                    for rp in peak.reconstructed_positions:
                        if rp.algorithm == 'PosRecTopPatternFit':
                            result[peak_field][i] = getattr(rp, peak_field)
                            break
            elif peak_field=='width':
                result[peak_field] = np.empty(nb_peaks, dtype = float)
                for (i, peak) in enumerate(peaks):
                    result[peak_field][i] = list(peak.range_area_decile)[5]
            else:
                raise ValueError("Field %s doesn't exist" % peak_field)

        # converting the type field to ints, since this method doesn't accept strings
        type_ints = {'s1': 1, 's2': 2, 'unknown': 3}
        for (i, peak) in enumerate(peaks):
            if peak.type in list(type_ints.keys()):
                result['type'] = type_ints[peak.type]
            else:
                raise ValueError("No int set for type \'%s\'" % peak.type)
        return result

hax.init(main_data_paths=['/project/lgrandi/xenon1t/processed/pax_v5.0.0/'], experiment='XENON1T',
         minitree_paths = [minitree_header])

### Reducing the data (and pickling)
Now I use the `minitrees.load()` function as usual, with some new options. To save as a pickle file (in addition to root), use the save_pickle option. If you don't want to save as a root file as well, use save_root.

In [3]:
data = hax.minitrees.load(dataset, treemakers=[SEProperties], save_pickle=True, force_reload=True)
# data = hax.minitrees.load(dataset, treemakers=[SEProperties], save_pickle=True, save_root=False, force_reload=True)

data.head(1)

100%|██████████| 4946/4946 [00:03<00:00, 1298.93it/s]
100%|██████████| 4946/4946 [05:16<00:00, 15.61it/s]


Unnamed: 0,event_duration,event_number,event_time,run_number,area,area_fraction_top,hit_time_mean,hit_time_std,left,n_contributing_channels,n_saturated_channels,nb_peaks,time,type,width,x,y
0,806090,0,1467815517001724790,1289,"[3.09444856644, 2.97188568115, 0.94188117981, ...","[0.390328347683, 0.350992381573, 1.0, 1.0, 0.7...","[2803.63916016, 5524.22998047, 9580.87792969, ...","[175.543884277, 30.3824005127, 1.57216238976, ...","[248, 549, 957, 991, 1254, 1963, 2342, 2544, 2...","[5, 3, 2, 2, 15, 3, 3, 3, 4, 2, 17, 3, 2, 16, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",241,1467815517001724790,2,"[202.157314396, 62.1318035413, 8.13094520425, ...","[-23.8151626587, -18.3289470673, -27.306390762...","[-2.11967420578, 35.2863426208, 16.3339595795,..."


### Reloading from pickle
The pickle can be read in the usual ways, and contains a dictionary like `{'metadata': <dictionary>, '<treemaker name>': <pandas dataframe> }`.

In [4]:
pickle_data = pd.read_pickle(os.path.join(minitree_header, '160706_1631_SEProperties.pkl'))
print(pickle_data['metadata'])
pickle_data['SEProperties'].head(1)

{'documentation': '\n    This TreeMaker will take the event class and turn it into a row\n    in a table (e.g. TNtuple or pandas DataFrame).  We define only\n    one function, which takes a pax event in.  It returns a dictionary\n    of new variables and their values.\n    ', 'timestamp': '2016-08-23 16:11:46.212678', 'version': '0.0.1', 'created_by': 'jh3226@midway-login1', 'pax_version': '5.5.0'}


Unnamed: 0,area,area_fraction_top,hit_time_mean,hit_time_std,left,n_contributing_channels,n_saturated_channels,nb_peaks,time,type,width,x,y
0,"[3.09444856644, 2.97188568115, 0.94188117981, ...","[0.390328347683, 0.350992381573, 1.0, 1.0, 0.7...","[2803.63916016, 5524.22998047, 9580.87792969, ...","[175.543884277, 30.3824005127, 1.57216238976, ...","[248, 549, 957, 991, 1254, 1963, 2342, 2544, 2...","[5, 3, 2, 2, 15, 3, 3, 3, 4, 2, 17, 3, 2, 16, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",241,1467815517001724790,2,"[202.157314396, 62.1318035413, 8.13094520425, ...","[-23.8151626587, -18.3289470673, -27.306390762...","[-2.11967420578, 35.2863426208, 16.3339595795,..."


### Reloading from root
When loading from the root file, for some reason using `np.asarray()` to read the arrays from the tree gives garbage, but `array.array()` works. Also note that `root_numpy.root2array()` will not work since it doesn't support arrays of this type.

In [5]:
import ROOT as root
import array

root_file = root.TFile(os.path.join(minitree_header, '160706_1631_SEProperties.root'))
tree=root_file.Get("SEProperties")
nEvents = tree.GetEntries()
print("%d events" % nEvents)

tree.GetEntry(0)
print('dataframe: %s' % list(data['area'][0][:5]))
print('np.asarray(): %s' % list(np.asarray(tree.area)[:5]))
print('array.array(): %s' % list(array.array('d', tree.area)[:5]))

4946 events
dataframe: [3.0944485664367676, 2.9718856811523438, 0.94188117980957031, 2.3494102954864502, 18.752710342407227]
np.asarray(): [0, 0, 0, 64, 110]
array.array(): [3.0944485664367676, 2.9718856811523438, 0.9418811798095703, 2.34941029548645, 18.752710342407227]
