
CTA ML Data Format

Bryan Kim edited this page May 3, 2019 · 17 revisions

DL1 Data Handler's writer receives data via ctapipe event source (currently only SimTelEventSource is used), calibrates and processes it, then writes it to a custom DL1 format for use in ML/image analysis studies. The layout of that data format is described here. The output file format is based on the PyTables table-based structure for HDF5 data. This information is accurate as of dl1-data-handler v0.7.4.

File Structure

All files produced using DL1 Data Handler's writer follow a common internal structure comprising Tables and Folders (NOTE: at the moment, no folders besides the file root folder are used). Each output HDF5 file corresponds to 1 or more input data files spanning 1 or more runs with the same settings, as described here. Each output file contains 3 Tables as children of the file root: Telescope_Type_Information, Array_Information, and Events. If the option to save non-triggered Monte Carlo events is set (save_mc_events), a 4th table, MC_Events, is also added. Their contents are described below.

  • '/' [Folder]
    • Telescope_Type_Information [Table]
    • Array_Information [Table]
    • Events [Table]
    • {Telescope Type Name} [Table]
    • ...

NOTE: There is one table as a child of the file root for each telescope type present in the data (each row in Telescope_Type_Information). See the note below regarding naming convention.

Naming Conventions

In cases where fields in the DL1 Data Handler format correspond to fields in ctapipe containers, the column names were made to match the ctapipe container field names.

One notable exception is the convention for naming telescope types. The current (as of ctapipe v0.6.2) naming convention for telescope types in ctapipe is a string identifier "[optics_type]:[camera_type]". However, as expressions containing colons or hyphens are not valid Python identifiers, using these telescope names would make it difficult to use Python attribute syntax to select columns in the Tables. Accordingly, the Table fields use a modified version of the ctapipe identifiers, replacing ":" with "_" and removing all hyphens.
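The renaming rule above can be sketched in a few lines. This is a minimal illustration, not the writer's actual code; the function name to_table_name and the hyphenated input string are hypothetical.

```python
def to_table_name(ctapipe_tel_type: str) -> str:
    """Convert an '[optics_type]:[camera_type]' string into the form used here.

    Applies the two rules from the text: ':' becomes '_' and hyphens are removed.
    """
    return ctapipe_tel_type.replace(":", "_").replace("-", "")


# 'LST:LSTCam' is the ctapipe-style form of the LST_LSTCam name used in the tables.
print(to_table_name("LST:LSTCam"))  # LST_LSTCam

# Hypothetical hyphenated identifier, only to show the hyphen rule.
print(to_table_name("A-B:C-D"))  # AB_CD
```

The result is a valid Python identifier, which is what makes attribute-style column access possible.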

Data

Below is a full description of all of the data fields stored in output files (both as attributes and as Table entries).

File/Super-run Properties:

Groups of input files containing events/runs with common properties (same production, primary particle type, zenith angle, azimuth angle) and processed using the same software versions (dl1-data-handler, ctapipe, sim_telarray, CORSIKA) can be dumped together in a single .h5 output file. A group of runs/events that satisfies these criteria is called a 'super-run', and these shared properties (which are the same for all events in a file) are stored as user attributes of the file root of each .h5 file. The majority of these fields are read directly from the ctapipe mcheader container. See the ctapipe documentation here for details. The fields below are the custom fields (in addition to the MCHeaderContainer) added by DL1 Data Handler.

| Data Field | Attribute Name | Datatype | Units | Example |
| List of Source Files (Runs) | runlist | Python list (string) | | [gamma_20deg_0deg_srun103-219___cta-prod3_desert-2150m-Paranal-HB9_cone10.h5, ...] |
| DL1 Data Handler Version | dl1_data_handler_version | string | | 0.7.2 |
| ctapipe Version | ctapipe_version | string | | 0.6.2post4 |

NOTE: All fields are validated as each file is read (to check that all runs being processed share the same super-run properties), with the exception of num_showers, which is summed, and shower_prog_start and detector_prog_start, for which only the first value is recorded.
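The merging rule in the note above can be sketched as follows. This is an illustration under assumptions, not the writer's implementation: the function merge_run_headers, the use of plain dicts for headers, and all example values are hypothetical. Only the three field names called out in the note are taken from the text.

```python
SUMMED = {"num_showers"}
FIRST_ONLY = {"shower_prog_start", "detector_prog_start"}


def merge_run_headers(headers):
    """Combine per-run header dicts into one set of super-run attributes."""
    merged = dict(headers[0])
    for header in headers[1:]:
        for key, value in header.items():
            if key in SUMMED:
                merged[key] += value  # accumulate across runs
            elif key in FIRST_ONLY:
                continue  # keep the value from the first run
            elif merged.get(key) != value:
                raise ValueError(
                    f"run headers disagree on {key!r}: "
                    f"{merged.get(key)!r} != {value!r}"
                )
    return merged


# Hypothetical header values for two runs of the same super-run.
runs = [
    {"corsika_version": 6990, "num_showers": 100, "shower_prog_start": 10},
    {"corsika_version": 6990, "num_showers": 250, "shower_prog_start": 20},
]
merged = merge_run_headers(runs)
print(merged["num_showers"])        # 350
print(merged["shower_prog_start"])  # 10
```

Raising on the first mismatch keeps runs with different settings from being silently mixed into one file.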

Array_Information Table

The Array_Information table contains all information pertaining to the layout of the (sub)array used to generate the data in the file. It lists the ID, type, and position of each telescope in the array.

| Data Field | Column Name | Data Type | Units | Shape | Example |
| Telescope ID | id | uint16 | | | 150 |
| Telescope Type | type | string | | | LST_LSTCam |
| Telescope x position | x | float32 | m | | 205.5 |
| Telescope y position | y | float32 | m | | 158.8999938 |
| Telescope z position | z | float32 | m | | 5.0 |

Telescope_Type_Information Table

The Telescope_Type_Information table contains all information about the telescope types in the data and their characteristics, including their number of pixels and the relative positions of each pixel in their cameras, expressed as an array of x and y coordinates. This information can be used when reconstructing the image from the 1-D vectors in the image tables.

| Data Field | Column Name | Data Type | Units | Shape | Example |
| Telescope Type | type | string | | | LST_LSTCam |
| Camera Type | camera | string | | | LSTCam |
| Optics Type | optics | string | | | LST |
| Number of Pixels | num_pixels | uint32 | | | 1855 |
| Pixel Positions | pixel_positions | float32 | | (2,11328)* | [-0.1022,...] |

* NOTE: The shape of the pixel_positions column is determined by the camera with the largest number of pixels (SCTCam), as the entire column must have a common shape for all rows (telescope types). Telescope types with fewer than the SCTCam's 11328 pixels have the remainder of their pixel_positions array zero-padded.
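Because of the zero-padding described above, a reader has to trim pixel_positions to the camera's real pixel count before using it. The sketch below assumes the padding sits at the end of each coordinate row, as "remainder" suggests. The function name trim_pixel_positions and the tiny 5-pixel example are hypothetical.

```python
def trim_pixel_positions(pixel_positions, num_pixels):
    """Drop trailing zero padding from a (2, max_pixels) pixel_positions entry.

    pixel_positions holds two sequences (x coordinates, then y coordinates);
    num_pixels is the value stored in the same row of the table.
    """
    xs, ys = pixel_positions
    return xs[:num_pixels], ys[:num_pixels]


# Hypothetical camera with 3 real pixels, stored in a column padded to 5.
padded = [
    [0.1, 0.2, 0.3, 0.0, 0.0],   # x
    [-0.1, 0.0, 0.1, 0.0, 0.0],  # y
]
xs, ys = trim_pixel_positions(padded, 3)
print(xs)  # [0.1, 0.2, 0.3]
print(ys)  # [-0.1, 0.0, 0.1]
```

Slicing by num_pixels rather than filtering out zeros keeps genuine pixels that happen to lie at coordinate 0.0.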

Events Table

The Events table contains a record for each event in the dataset and various characteristics of the event (many from the MC simulation data).

When processing data in the 'tel_type' storage mode, the Events table has one '[]_indices' column for each telescope type. Each column stores a 1-D array of integers whose length equals the number of telescopes of that type in the array. Within each '[]_indices' column, the array positions correspond to the telescopes of that type, ordered by increasing telescope ID. For example, if there are 4 LSTs in the array with tel_ids 5, 6, 7, and 8, then the 'LST_LSTCam_indices' column is a 1-D array of length 4. The first position corresponds to the LST with tel_id 5, the second to the LST with tel_id 6, and so on.

The value in each array position indicates whether the corresponding telescope triggered (i.e. whether there is a saved image) and, if so, where that image is located in the data file. A value of 0 means the telescope did not trigger and no image was saved (0 is also the index of the blank/zero-filled image in each image table). A non-zero value means the telescope did trigger, and the value is the row index of its image in the corresponding image table (in this case, the 'LST_LSTCam' table). For example, if for event 398 the LST_LSTCam_indices column reads [0 0 212 0], then the LSTs with tel_ids 5, 6, and 8 did not trigger and have no images, while the LST with tel_id 7 did trigger and its image is found in row 212 of the 'LST_LSTCam' table.

The LST_LSTCam_multiplicity field is an integer giving the number of triggered LSTs in the event (i.e. the number of non-zero elements in the LST_LSTCam_indices array). It is provided as a separate field for convenience when doing selections.
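The decoding of an '[]_indices' entry in 'tel_type' mode can be sketched as below, using the LST example from the text. The helper name triggered_images is hypothetical, and the telescope IDs are passed in directly rather than read from a file.

```python
def triggered_images(tel_ids, indices):
    """Map each triggered telescope ID to its row in the image table.

    tel_ids: IDs of all telescopes of one type in the array.
    indices: the '[]_indices' entry for one event, ordered by increasing ID.
    """
    ordered_ids = sorted(tel_ids)
    if len(ordered_ids) != len(indices):
        raise ValueError("indices must have one entry per telescope of this type")
    # An index of 0 means "did not trigger" (row 0 is the blank image).
    return {tel_id: row for tel_id, row in zip(ordered_ids, indices) if row != 0}


rows = triggered_images([5, 6, 7, 8], [0, 0, 212, 0])
multiplicity = len(rows)  # same quantity as the '[]_multiplicity' column
print(rows)          # {7: 212}
print(multiplicity)  # 1
```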

When processing data in the 'tel_id' storage mode, the format is similar, except that instead of one '[]_indices' column per telescope type, there is one '[]_indices' column for each telescope, labeled by telescope ID. Because each telescope ID also has its own image table, the entries in each '[]_indices' column are scalars (not arrays) indicating the index in the corresponding image table where the image can be found. For example, there would be a '50_indices' column corresponding to the telescope with a tel_id of 50. Each row with a value of 0 in that column would be an event where telescope 50 did not trigger. Each row with a positive index in that column would indicate the row in the '50' table where the saved image could be found.

| Data Field | Column Name | Data Type | Units | Shape | Example |
| Run/Observation ID | obs_id | uint32 | | | 103 |
| Event ID | event_id | uint32 | | | 1510 |
| Primary Particle Type | shower_primary_id | uint8 | | | 0 |
| Zenith Angle | alt | float32 | rad | | 1.157506108 |
| Azimuth Angle | az | float32 | rad | | 0.050275001 |
| Shower core x-position | core_x | float32 | m | | -511.958618 |
| Shower core y-position | core_y | float32 | m | | -169.597915 |
| Height of first interaction | h_first_int | float32 | m | | 30055.20507 |
| Location of shower maximum | x_max | float32 | m | | 363.23867 |
| Simulated (MC) Primary Particle Energy | mc_energy | float32 | TeV | | 0.312576383 |
| Indices of rows in image tables | []_indices | uint32 | | See explanation above | [0,0,3,4,0...] |
| Number of triggered telescopes of type in event | []_multiplicity | uint32 | | See explanation above | 2 |

MC_Events Table

If the save_mc_events flag is set in the DL1DataWriter constructor, the writer will save the properties of every Monte Carlo simulated shower in the input sim_telarray file, even those which did not trigger the array (NOTE: this flag only works when the input is a sim_telarray file). This data is not read from the SimTelEventSource in ctapipe, but instead read directly from its underlying eventio SimTelFile. The structure of this data mirrors that of the Events table, but without the columns related to image indices and multiplicity:

| Data Field | Column Name | Data Type | Units | Shape | Example |
| Run/Observation ID | obs_id | uint32 | | | 103 |
| Event ID | event_id | uint32 | | | 1510 |
| Primary Particle Type | shower_primary_id | uint8 | | | 0 |
| Zenith Angle | alt | float32 | rad | | 1.157506108 |
| Azimuth Angle | az | float32 | rad | | 0.050275001 |
| Shower core x-position | core_x | float32 | m | | -511.958618 |
| Shower core y-position | core_y | float32 | m | | -169.597915 |
| Height of first interaction | h_first_int | float32 | m | | 30055.20507 |
| Location of shower maximum | x_max | float32 | m | | 363.23867 |
| Simulated (MC) Primary Particle Energy | mc_energy | float32 | TeV | | 0.312576383 |

NOTE: Information about all MC showers is recorded, including those which are also saved in the Events table because they triggered the array. Duplicates should be checked by obs_id and event_id.
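One way to separate the showers that never triggered from those duplicated in the Events table is sketched below. It assumes that the two ID columns listed in the tables, obs_id and event_id, together identify a shower. The function name untriggered_showers and the rows-as-dicts representation are hypothetical.

```python
def untriggered_showers(mc_events, events):
    """Return the MC_Events rows that have no counterpart in Events."""
    triggered = {(row["obs_id"], row["event_id"]) for row in events}
    return [
        row for row in mc_events
        if (row["obs_id"], row["event_id"]) not in triggered
    ]


# Hypothetical rows, reduced to the two ID columns.
mc_events = [
    {"obs_id": 103, "event_id": 1509},
    {"obs_id": 103, "event_id": 1510},
    {"obs_id": 104, "event_id": 1510},
]
events = [{"obs_id": 103, "event_id": 1510}]

missed = untriggered_showers(mc_events, events)
print(missed)
# [{'obs_id': 103, 'event_id': 1509}, {'obs_id': 104, 'event_id': 1510}]
```

Matching on the pair matters here because the same event_id can appear in more than one run.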

NOTE: Storing MC event data can lead to significant increases in processing runtime and moderate increases in output file size.

Image Tables

The image tables are used to store all of the image data corresponding to one telescope (when processed in the 'tel_id' storage mode) or one telescope type (when processed in the 'tel_type' storage mode). They contain the actual data/images (both of integrated charge and peak arrival time) for each telescope for each event. They also contain an index column which maps them back to an event in the Events table. The first entry in each image table (index 0) is a blank image provided as a placeholder and does not correspond to a real event.

| Data Field | Column Name | Data Type | Units | Shape | Example |
| Event Index | event_index | int32 | index | | 5 |
| Image (integrated charge) | charge | float32 | calibrated counts | See information on Telescope_Type_Information table | [0.718...,...] |
| Image (peak arrival channel) | peakpos | float32 | channel num. | See information on Telescope_Type_Information table | [34.,...] |
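Going in the reverse direction, from an image table back to events, can be sketched as follows. It assumes event_index holds the row number of the event in the Events table. The value stored for the placeholder row is not specified above, so the -1 used here is hypothetical, as is the function name images_by_event.

```python
from collections import defaultdict


def images_by_event(event_index_column):
    """Group image-table row numbers by the Events-table row they point to."""
    grouped = defaultdict(list)
    for image_row, event_row in enumerate(event_index_column):
        if image_row == 0:
            continue  # row 0 is the blank placeholder, not a real image
        grouped[event_row].append(image_row)
    return dict(grouped)


# Hypothetical event_index column: placeholder, two images of event 0, one of event 1.
grouped = images_by_event([-1, 0, 0, 1])
print(grouped)  # {0: [1, 2], 1: [3]}
```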

FAQ:

Q: Why isn't event property _____ included in the data format? Can I add _____ ?

A: When this data format was agreed on, it was decided to first include a list of the most obviously useful parameters for training event reconstruction models (the shower core position/arrival direction, the MC energy, the primary particle id). However, the list is certainly not exhaustive. If you have any suggestions for additional data to include in the output files, write up an issue and we would be happy to discuss including it. If you want to modify the included properties, feel free to fork the code and modify it (and submit a PR if you think the change would be useful to others).

Q: Why are the images stored in separate tables rather than together with the other fields in the Events table?

A: This was done primarily because PyTables Tables are designed to store rows/entries of a regular size whose fields are specified in advance (via the column shape definitions and data types). When stored in memory, each row is stored as a contiguous block of fixed size. While this works well for event parameters that are present for every event (the MC energy, particle id, etc.), the amount of image data per event can vary widely because a different number of telescopes triggers for each event. If we added columns for telescope images to the Events table, many of these columns would be empty/unneeded for each row, yet they would still take up the same amount of space in memory. To avoid this large memory overhead, it was decided to move the images to separate tables whose lengths can vary and which do not require any storage overhead for empty/missing images. The 'indices' columns are then used as a map into those tables to retrieve the image data. This also has the benefit that all images from a given telescope type (for the 'tel_type' storage mode) or a telescope (for the 'tel_id' storage mode) can be found in a single place and easily retrieved with a single read.
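The size argument can be made concrete with a rough calculation of uncompressed sizes. The only number taken from this page is the 1855-pixel LST camera. The event count, telescope count, and number of saved images are hypothetical, and the function names are illustrative.

```python
FLOAT32_BYTES = 4
INDEX_BYTES = 4  # int32 event_index and uint32 '[]_indices' entries


def inline_bytes(num_events, num_tels, num_pixels):
    """Images as fixed-size columns of the Events table.

    Every event reserves space for every telescope, triggered or not.
    """
    per_image = 2 * num_pixels * FLOAT32_BYTES  # charge + peakpos
    return num_events * num_tels * per_image


def separate_bytes(num_events, num_tels, num_pixels, num_images):
    """Images in a separate table, plus the indices column in Events."""
    per_row = 2 * num_pixels * FLOAT32_BYTES + INDEX_BYTES  # + event_index
    image_table = (num_images + 1) * per_row  # +1 for the blank row 0
    indices = num_events * num_tels * INDEX_BYTES
    return image_table + indices


# Hypothetical dataset: 1000 events, 4 LSTs, 2000 saved images in total.
inline = inline_bytes(1000, 4, 1855)
separate = separate_bytes(1000, 4, 1855, 2000)
print(inline, separate)
```

With these assumed numbers the separate-table layout needs about half the space, and the gap widens as the fraction of triggered telescopes per event falls.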

Q: Other questions

A: Please feel free to add issues or message/email the maintainers of this repository directly if you have other questions.

Bryan Kim, UCLA

Daniel Nieto, Complutense University of Madrid

Ari Brill, Columbia University