
Rework data input and processing #29

Merged 37 commits on Jul 4, 2018
Conversation

@bryankim96 (Collaborator) commented Jun 2, 2018

This pull request resolves #14, resolves #15, resolves #16, and resolves #18. It is being opened prematurely to facilitate discussion of the proposed changes. Before merging, they need to be discussed, modified, debugged, and tested, and the rewrite of train.py requested in #17 must be completed.

In this proposed change:

All data input is now handled by an abstract class, CTALearnDataset. The underlying implementation is delegated to subclasses that inherit from CTALearnDataset, such as HDF5Dataset, which is implemented here. The CTALearnDataset interface is built on using run_id, event_id, and tel_ids as unique identifiers for events and images; how these are used internally to locate and return data is left up to the data format and the inheriting class (i.e. HDF5Dataset).
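
A minimal sketch of this interface (method names and bodies here are illustrative assumptions, not the exact code in the PR):

from abc import ABC, abstractmethod

class CTALearnDataset(ABC):
    # Abstract interface to a collection of events and images.
    # Storage details are left to subclasses such as HDF5Dataset.

    @abstractmethod
    def get_image(self, run_id, event_id, tel_id):
        # Return the image recorded by one telescope in one event.
        pass

    @abstractmethod
    def get_event(self, run_id, event_id, tel_ids):
        # Return the data for the requested telescopes in one event.
        pass

class HDF5Dataset(CTALearnDataset):
    def get_image(self, run_id, event_id, tel_id):
        # Locate the matching row in the underlying HDF5 (PyTables) file.
        ...

    def get_event(self, run_id, event_id, tel_ids):
        ...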

All data processing is now handled by a class DataProcessor.

The top-level data loading functions, analogous to load_single_tel_event_HDF5() and load_array_event_HDF5(), are implemented as separate non-class functions, load_single_tel_event() and load_array_event(), located in data_processing.py, as I was unsure which module to place them in.

Other changes include implementing support for multiple telescope types in an array event, major changes to metadata loading, and refactoring and modifications of other functions.

NOTE: The test_metadata script rework from #27 is still incomplete; despite changes to test_metadata.py and print_dataset_metadata.py, it must be redone to align with the above changes.

Key questions regarding these changes include:

  • How should the overall concept of a dataset (files containing events and images) and different implementations of a dataset be handled?
  • How should cuts be handled? Should the CTALearnDataset class be used as a representation of a collection of events which cannot be modified (which is kind of how it is now)? Or should the cuts be integrated into the class itself, so that you can use a mutator method to alter some class attribute (like self.cuts) which controls which events the Dataset lets you access?
  • Should the DataProcessor class take a CTALearnDataset instance as an instance attribute (i.e. should the DataProcessor be built directly onto a Dataset)? The advantage is that it is no longer necessary for the CTALearnDataset and DataProcessor to be explicit parameters for the data loading function. The disadvantage is that it becomes impossible to use a DataProcessor independently of a CTALearnDataset.
  • How should the various settings and attributes be handled, and how should they be initialized? E.g., should the various settings of a DataProcessor be laid out in the constructor as separate parameters with default values, rather than received from a single dict or config object? As a compromise, should a secondary method also be added to initialize all settings from a single dict?
  • How should the settings and attributes be stored for both the CTALearnDataset and DataProcessor classes? Should they be grouped into dicts or made separate attributes?

  • metadata dict is now an OrderedDict.
  • telescope_types list and telescope_ids dict are replaced with a single OrderedDict.
  • num_events_by_particle_id and num_images_by_particle_id dicts added.
  • particle_ids list added.
  • image_charge_min and image_charge_max renamed to "..._mins" and "..._maxes" respectively for clarity.
  • particle_id_by_file renamed to "particle_ids...".
  • All other data loading functions refactored to respect these changes.
  • Calls load_metadata_HDF5 to read metadata, then prints to the command line or writes to file.
  • Prints additional info about the class breakdown of events/images by telescope.
@bryankim96 changed the title from "Rework data input" to "Rework data input and processing" on Jun 2, 2018
@aribrill (Collaborator) commented Jun 2, 2018

Hey Bryan, here are my comments on the PR. First are specific comments on the code as implemented, then I'll put my thoughts on the important issues that you raise.

data_input.py:

  1. The naming is potentially confusing here. "Dataset" is already used by TensorFlow, as is data "input", in related but distinct ways to what we're doing. A possible alternative could be a "DataLoader" class and data "loading" functions. Or, at the very least, we need to add documentation/comments to make this crystal clear.
  2. get_image() is at least somewhat specific to the HDF5 data format. Other formats may load matrices directly. Maybe this function doesn't belong in the abstract class.
  3. The CTALearnDataset class should be initialized with settings. Then the various functions don't need specified arguments. Specifically, get_generators() would not have to take arguments other than the mode(s).
  4. Make event type (single tel / multiple tel) a setting. Then there's no need for separate functions for single tel and event example loading, just a single load_example function. This will help simplify train.py. The load_example function could simply call the single_tel or event example loading functions internally.
  5. Does get_auxiliary_data() really belong at the highest abstract level? We already have a use case when it isn't used at all, single tel.
  6. There's an inconsistency of naming between tel_id and telescope_id (pick one).
  7. I'm concerned that run_id/event_id/tel_id isn't general enough for the abstract class. It reflects the HDF5 format. This problem applies to the suggestion of filename, index given in Refactor data loading #15 too. For example, the next data format we will implement is DL1 directly from ctapipe (without intermediation of HDF5 - for prediction within data analysis), for which an explicit run_id or event_id may not be defined. For something like real-time generation of fake data using a GAN-trained NN, there wouldn't be any stored data at all. A simple solution could be to make the identifier argument a list which would be parsed by the subclass. Allowing it to be None could also be useful in some cases.
  8. Consider marking functions not meant to be used externally with a leading underscore. (Points 3, 4, 7, and 8 are sketched in code after this list.)
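
To make those points concrete, a rough sketch (class, method, and argument names are assumptions for illustration, not an agreed design):

class DataLoader:  # possible rename of CTALearnDataset, per point 1
    def __init__(self, file_list, example_type='single_tel', cuts=None, **settings):
        # Point 3: all settings are fixed at construction time,
        # so the methods below need few or no arguments.
        self.file_list = file_list
        self.example_type = example_type
        self.cuts = cuts

    def load_example(self, identifier):
        # Point 4: dispatch on the configured example type instead of
        # separate single tel and array event loading functions.
        # Point 7: 'identifier' is a flexible list parsed by the subclass.
        if self.example_type == 'single_tel':
            return self._load_single_tel_example(identifier)
        return self._load_array_example(identifier)

    def get_generators(self, mode='train'):
        # Point 3: no arguments needed other than the mode(s).
        ...

    def _load_single_tel_example(self, identifier):
        # Point 8: the leading underscore marks this as internal.
        ...

    def _load_array_example(self, identifier):
        ...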

@aribrill
Copy link
Collaborator

aribrill commented Jun 2, 2018

data_processing.py:

  1. The DataProcessor class should have a function process_data() (which internally would call process_image()) to process all the data for an entire event. This function would take in [data, label], where data contains the images and auxiliary info, and would output in the same format. This will allow all data processing to be done with one function call, with no external logic needed. Basically, the logic currently in load_single_tel_event and load_array_event would go here. One advantage of doing things this way is that it would be possible to implement array-level data augmentation, for example shifting the positions of all telescopes by a constant.
  2. Consider marking crop_image, normalize_image, and process_image with a leading underscore to show they're internal functions.
  3. There should be a function DataProcessor.augment_data(), but since we don't have the functionality for it yet, maybe put a TODO comment in the code indicating that we intend to add it. (A sketch of all three points follows this list.)
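
A sketch of what that shape could look like (the helper names and the dict layout of data are assumptions; the [data, label] format follows point 1):

class DataProcessor:
    def process_data(self, data, label):
        # Point 1: process an entire event in one call; 'data' holds the
        # images plus auxiliary info, and the output keeps the same format.
        data['images'] = [self._process_image(image) for image in data['images']]
        # Array-level augmentation (e.g. shifting all telescope positions
        # by a constant) could also be applied here.
        return data, label

    def _process_image(self, image):
        # Point 2: leading underscores mark internal helpers.
        return self._normalize_image(self._crop_image(image))

    def _crop_image(self, image):
        ...  # stub

    def _normalize_image(self, image):
        ...  # stub

    def augment_data(self, data, label):
        # Point 3: TODO - data augmentation is planned but not implemented.
        raise NotImplementedError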

@bryankim96 (Collaborator, Author) commented Jun 2, 2018

Thanks for the fast response!

  1. This occurred to me as well and I also think it's important to make the distinction clear. I think this relates to how we want to conceptualize the class itself. Basically, should it be a persistent object which represents/describes a collection of events/data (in which case some variation on dataset seems appropriate) or should it be viewed mostly as a factory which can be used to read data (in which case something like data_loader seems appropriate)? I think both ways seem workable, but the choice might somewhat affect how we organize things.

  2. Is it actually specific to the HDF5 format in particular? My understanding is that all of the approaches we've implemented in CTALearn use the telescope traces as images/2D matrices and would want a method to return images.

  3. This makes sense, although it also implies a choice about how flexibly the class can be used. If the arguments are moved into the instance attributes, then it becomes less transparent and clear when trying to call it multiple times on the same dataset. For example, you could conceivably have a single dataset and want to get_event_generators for one cut condition and then again with another.

  4. Agreed, I'll do this right away.

  5. Ditto, although maybe there could be some type of extra info (I can't think of any right now) which we might want to provide during single tel classification?

  6. Ditto.

  7. This is a really good point and a very important decision for us to consider. It seemed to me to be the most general unique way to identify data, but you are right that it wouldn't apply in all cases. My understanding is that it is inherent to the use of simulated data, as the events are labeled at simulation time by run_id and event_id. However, as you note, this wouldn't necessarily apply to real data. I did some looking into ctapipe, and it seems like an event_id is attached to the R0, R1, and DL0 event containers, but not at the DL1 level. There is also a "meta" attribute whose usage is unclear to me, and other fields including "obs_id", which might correspond to an observation run (perhaps similar to our usage of run_id?). How exactly these values are collected and used to identify events is something we may want to study further. One thing that is not clear to me is why the DL1 container (post-calibration images) doesn't have any of the identifier attributes that the earlier ones have; it seems like it would still be useful to be able to identify and link DL1 data to the corresponding lower-level containers. For now, your solution sounds like an excellent one (allowing flexible parsing of the identifier), although this does have the unfortunate effect of making the abstract class interface less clear.

  8. I was considering making some methods private using the traditional double underscore, but I wasn't sure which ones might want to be used independently of the full data processing chain. For example, it seems like it could be possible that at some point someone may want to be able to instantiate a DataProcessor and call DataProcessor.crop separately without having to work through .process_event or something like that. Whether we want to allow this would depend on what sort of use cases we might imagine/want to support.

@aribrill (Collaborator) commented Jun 2, 2018

How should the overall concept of a dataset (files containing events and images) and different implementations of a dataset be handled?

See notes above. I think the API of a CTALearn dataset requires two functions: load_data() for the tf.Estimator input_fn and get_generators() for the tf.Dataset.
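
A sketch of how those two entry points could plug into TensorFlow (only load_data() and get_generators() come from the comment above; everything else here is assumed):

import tensorflow as tf

def input_fn():
    # load_data() would construct something along these lines for tf.estimator:
    dataset = tf.data.Dataset.from_generator(
        lambda: mydataset.get_generators(mode='train'),  # yields (features, label)
        output_types=(tf.float32, tf.int64))
    return dataset.batch(64)

# 'estimator' here is a previously constructed tf.estimator.Estimator
estimator.train(input_fn=input_fn)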

How should cuts be handled? Should the CTALearnDataset class be used as a representation of a collection of events which cannot be modified (which is kind of how it is now)? Or should the cuts be integrated into the class itself, so that you can use a mutator method to alter some class attribute (like self.cuts) which controls which events the Dataset lets you access?

I think the way we have now is reasonable. The cuts are defined at the time of class creation as a constraint on what events are accessible, so as far as anyone outside the Dataset is concerned events failing cuts don't exist. This seems simplest and right now we don't yet have a use case for needing mutable cuts.

Should the DataProcessor class take a CTALearnDataset instance as an instance attribute (i.e. should the DataProcessor be built directly onto a Dataset)? The advantage is that it is no longer necessary for the CTALearnDataset and DataProcessor to be explicit parameters for the data loading function. The disadvantage is that it becomes impossible to use a DataProcessor independently of a CTALearnDataset.

I think this is a critical question. I'm interested to hear what everyone thinks on this. Actually, my opinion is the reverse of this. I think CTALearnDataset should be the fundamental basis of the data loading code, with the data loading function as a class function. DataProcessor would be a secondary module that plugs in and could be used independently. To me, data loading is fundamental, while data processing is inherently optional and modular; I could see use cases in which data processing isn't used at all, our data processing code is used in a different context, or an external library is used for data processing instead of or in conjunction with our code. The usage could be as follows:

HDF5_dataset_settings = dict(...) # includes data loading and image mapping settings
data_processing_settings = dict(...)
data_processing_fn = DataProcessor(**data_processing_settings).process_data
mydataset = HDF5Dataset(file_list, data_processing_fn=data_processing_fn, **HDF5_dataset_settings)
data, label = mydataset.load_example()
train_generator = mydataset.get_generators(train=True) # or mydataset.get_generators(mode="train")

How should the various settings and attributes be handled, and how should they be initialized? E.g., should the various settings of a DataProcessor be laid out in the constructor as separate parameters with default values, rather than received from a single dict or config object? As a compromise, should a secondary method also be added to initialize all settings from a single dict?
How should the settings and attributes be stored for both the CTALearnDataset and DataProcessor classes? Should they be grouped into dicts or made separate attributes?

I think it's reasonable to store the settings and attributes internally as separate parameters. If there's a best practice for this we should do that, but I'm not sure what it is. Maybe a good way to do it is to provide the keyword arguments from the dictionary via dictionary unpacking.
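
For example (the setting names here are made up), the caller keeps a single dict and unpacks it, while the constructor still declares each setting as a separate keyword parameter with a default:

dataset_settings = {'min_num_tels': 1, 'sort_images_by': 'trigger'}
mydataset = HDF5Dataset(file_list, **dataset_settings)
# equivalent to HDF5Dataset(file_list, min_num_tels=1, sort_images_by='trigger')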

@aribrill (Collaborator) commented Jun 2, 2018

One final note: we have some alternatives for terminology to use. For the data from a single gamma-ray event, we are currently calling it "event", "example", and "data". For a non-single-telescope event, we are currently calling it either "array-level", "event", or "multiple-tel". For each, we should pick a single term. Speak up if you have a strong opinion.

@aribrill added the v0.2.0 label on Jun 2, 2018
@aribrill (Collaborator) commented Jun 2, 2018

  1. This occurred to me as well and I also think it's important to make the distinction clear. I think this relates to how we want to conceptualize the class itself. Basically, should it be a persistent object which represents/describes a collection of events/data (in which case some variation on dataset seems appropriate) or should it be viewed mostly as a factory which can be used to read data (in which case something like data_loader seems appropriate)? I think both ways seem workable, but the choice might somewhat affect how we organize things.

I was envisioning something closer to the factory approach, but I don't have any issues with the persistent object approach (clearly it works well for the TensorFlow developers). I don't know if Daniel or Qi have thoughts on this.

  2. Is it actually specific to the HDF5 format in particular? My understanding is that all of the approaches we've implemented in CTALearn use the telescope traces as images/2D matrices and would want a method to return images.

On second thought, isn't there always a need for this function because of the plot_bounding_boxes.py script? So it needs to be there regardless of the underlying data format.

  3. This makes sense, although it also implies a choice about how flexibly the class can be used. If the arguments are moved into the instance attributes, then it becomes less transparent and clear when trying to call it multiple times on the same dataset. For example, you could conceivably have a single dataset and want to get_event_generators for one cut condition and then again with another.

This is a good point. I'd say that since we haven't had a workflow like this yet this tradeoff is OK to make since making a new dataset with new settings is clear and not much more difficult. If this is a common situation in the future though we may have to revisit this.

  8. I was considering making some methods private using the traditional double underscore, but I wasn't sure which ones might want to be used independently of the full data processing chain. For example, it seems like it could be possible that at some point someone may want to be able to instantiate a DataProcessor and call DataProcessor.crop separately without having to work through .process_event or something like that. Whether we want to allow this would depend on what sort of use cases we might imagine/want to support.

Why not use a single leading underscore? That's standard notation to indicate that the method is internal, but can still be accessed outside the class if necessary.

@bryankim96 (Collaborator, Author) commented

Thanks again for the detailed feedback! I'll work on and push some commits in the next few days along the lines you described, addressing the issues you mentioned. Then we can take stock again and discuss what remains to be done or fixed, and maybe Qi and Daniel will be able to weigh in on some of the longer-term design questions as well.

qi-feng and others added 5 commits June 11, 2018 22:49
  • HDF5DataLoader.
  • All cuts and data loading-related settings now moved to the constructor and instance attributes. Cuts and other types of data selection are now "permanent" and tied to each instance of DataLoader.
  • Data processing now handled by providing a DataProcessor to a DataLoader.
  • Single tel and array level get_example functions now replaced with a single get_example, whose example type is set by the DataLoader instance.
  • Arguments to get_example and get_image are no longer specified by the abstract methods (they should be determined by each sub-class).
  • Some name changes ("example", "event", "image") for consistency.
  • Some methods marked as private.
  • DataLoader.get_example().
  • Skeleton for DataProcessor._augment_data() added.
  • Process example functions for single tel and array data combined into DataProcessor.process_example(), which now infers the example type by type checking.
@bryankim96 (Collaborator, Author) commented

The data input class originally represented a complete collection of events and images (the entire dataset), which could be queried repeatedly with different cut conditions and options to get lists of events. By pushing all of the settings out of the instance methods and into the constructor, as you recommended (and completely denying access to non-selected/non-passing events), I think it now makes more sense to think of the class as a factory that is separate from the underlying dataset and controls access to it. Considering this, I switched the name back to your original suggestion of DataLoader.

For now I put off the question of what unique identifier will be used to identify each event/image until we could discuss it in more depth. AFAIK abstract methods don't require that you specify all of the arguments, so get_image and get_example can be implemented by each sub-class with whatever arguments are required. This is definitely sub-optimal because the user needs to be aware of each sub-class instead of relying on a uniform interface, but the same is true if we make the identifier a list and delegate the parsing to the sub-class (the user needs to know or be told via error messages whether the provided identifiers are in a valid format), so I figured it would be ok as a temporary measure.
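
For reference, Python's abc module only checks that an abstract method is overridden, not that the signatures match, so this pattern works:

from abc import ABC, abstractmethod

class DataLoader(ABC):
    @abstractmethod
    def get_example(self, *args):
        # Each sub-class defines its own identifier arguments.
        pass

class HDF5DataLoader(DataLoader):
    def get_example(self, run_number, event_number):
        # A valid override despite the different signature.
        ...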

All of the settings are now separate keyword arguments and can be provided by dictionary unpacking by the user like you suggested.

Regarding terminology, for now I tried to be consistent in using the following (although I may have missed something):

  • "Example": a collection of features/data combined with a label (can be either single tel or array). I didn't want to use the term "event" for this because it also has a second meaning (i.e. "shower").
  • "Event": an event in the sense of the simulation, meaning a single shower and all of its corresponding info
  • "Image": a single image of a shower from one telescope

For DataProcessor.process_example(), for now I had it infer whether it received a single tel example or an array example by a type check, but this does not seem like a very clear way to handle it. The example type could be provided explicitly by a keyword argument, but it seems like that is not desired. My understanding is that we want a DataProcessor to be independent of DataLoader but also have process_example() work correctly on both types of examples without additional arguments.
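
The type check could look something like this hypothetical sketch (the actual criterion in the PR may differ):

def process_example(self, data, label):
    # Infer the example type from the structure of the data: an array
    # example carries a list of images plus auxiliary info, while a
    # single tel example is a single image.
    if isinstance(data, list):
        return self._process_array_example(data, label)
    return self._process_single_tel_example(data, label)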

In line with what you suggested, the top-level example loading functionality was moved to DataLoader.get_example(). DataLoader can now be provided a DataProcessor (or None) when instantiated and it is used to process the data (or not) in .get_example().

@aribrill (Collaborator) commented

The data loading/processing code looks good to me. Assuming it's all been debugged, this should be good to close #14, #15, #16, and #18. I have only a couple of small comments.

  1. Since the data loading class is now called DataLoader, maybe the module should be named data_loading.py instead of data_input.py.
  2. In data_input.py starting on line 438, example_type is either one of self.selected_telescopes or "event". Unless I'm misunderstanding something, for consistency with the rest of the code I think it should be either "single_tel" or "array" (also it should be self.example_type, I believe).
  3. In data_input.py line 461 "reproducable" -> "reproducible"
  4. Right now there is no clean way to get a training set generator without a validation set too, as suggested in Refactor data loading #15. However, since I haven't heard any demand for this feature, maybe not having it is fine.

bryankim96 and others added 20 commits June 22, 2018 03:42
  • Updated data_loading.py and data_processing.py to use image_mapper from image_mapping.py.
  • DataLoader and DataProcessor now both take an image_mapper. If DataLoader is instantiated with a DataProcessor instance, its image_mapper is shared by both; if not, DataLoader uses a default image_mapper instance.
  • Fixed functionality for applying cuts in single_tel mode. Now correctly checks example_type, uses the single selected telescope type in self.selected_tel_types, and locates non-blank images from the selected telescopes.
  • Removed image.py, now replaced by functionality in image_mapping.py.
  • Changed the --predict command line option to a mode setting.
  • Reworked config parsing to use configobj instead of configparser and added a corresponding config_spec.ini.
  • Reworked run_model.py to use DataLoader, DataProcessor, and image_mapper.
  • Reorganized ImageMapper methods for clarity and to better match other modules.
  • Refactored references to ImageMapper in DataLoader and DataProcessor.
  • to work correctly.
  • Fixed minor issues with configobj parsing and validation.
  • Fixed an issue with incorrect behavior when num_epochs == 0.
  • single tel type. Data loading now only supports a single telescope type at one time.
  • Added missing metadata fields and logic to combine metadata from DataProcessor.
  • Names of run_id and event_id fixed to run_number and event_number to match PyTables files.
  • Use telescope positions option added to constructor.
  • Generator and map_fn output names, types, and is_label added as hardcoded instance variables.
  • Other minor bugfixes.
  • examples, as provided by DataLoader.
  • Added instance variables for image shapes and number of auxiliary parameters.
  • Added get_metadata() method to get post-processing metadata parameters.
@qi-feng (Collaborator) commented Jul 3, 2018

A question regarding the rotation of the image mapper: see here: https://github.com/bryankim96/ctalearn/blob/rework-data-input/tests/test_image_mapper.ipynb
The images of MSTS and VTS are rotated by 90 deg clockwise. I can add a transpose to the code, but I would like to understand why this is happening. Is it related to numpy arrays being row-major or column-major?
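
(For reference: a transpose of a 2D numpy array reflects it across the main diagonal, which is equivalent to a 90-degree rotation combined with a flip; undoing a pure 90-degree clockwise rotation takes one counterclockwise rotation. A sketch, assuming the mapped image is a 2D numpy array named img_mapped:)

import numpy as np

img_fixed = np.rot90(img_mapped)  # rotate 90 degrees counterclockwise
# img_mapped.T would instead mirror the image across its diagonal.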

  • instantiation.
  • Fixed get_image to correctly use image_mapper.map_image.
  • Moved conversion of images, triggers, and aux_info to numpy arrays out of DataProcessor.process_example and into get_example.
  • Fixed bug with single_tel_examples_to_indices.
  • Changed datatype of triggers from float32 to int8.
  • Removed unnecessary expand_dims to correct image dimensions.
  • Removed unnecessary expand_dims to fix image shape.
  • Removed some comments.
  • passed in for use with normalization.
  • Fixed bug with thresholds.
  • can receive the image_charge_mins from DataLoader.
@bryankim96 (Collaborator, Author) commented

I think I remember noticing something like this with MSTS a while ago. I don't think there's any particular reason for it other than how the mapping table function was written (although it might be nice to transpose, as you suggest, for consistency). My understanding is that the orientation shouldn't affect anything, though, as long as all of the telescopes within a class are consistent.

@bryankim96 merged commit e288cb8 into master on Jul 4, 2018
@bryankim96 (Collaborator, Author) commented

I'm going to close this PR for now because all of the major issues it aimed to address are now resolved (although there are likely still some bugs and issues remaining). We can continue bugfixing and working on other changes (for example, to the image mapping) in other PRs.
