Rework data input and processing #29
metadata dict is now an OrderedDict. telescope_types list and telescope_ids dict are replaced with a single OrderedDict. num_events_by_particle_id and num_images_by_particle_id dicts added. particle_ids list added. image_charge_min and image_charge_max renamed to "..._mins" and "..._maxes" respectively for clarity. particle_id_by_file renamed to "particle_ids...". All other data loading functions refactored to respect changes.
Calls load_metadata_HDF5 to read metadata, then prints to command line or writes to file. Prints additional info about class breakdown of events/images by telescope.
Hey Bryan, here are my comments on the PR. First are specific comments on the code as implemented, then I'll put my thoughts on the important issues that you raise.

data_input.py:
data_processing.py:
Thanks for the fast response!
See notes above. I think the API of a CTALearn dataset requires two functions: load_data() for the tf.Estimator input_fn and get_generators() for the tf.Dataset.
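The two-function API described above could be sketched roughly as follows. This is a hypothetical illustration only: the function bodies and the example data are stand-ins, not the actual CTALearn code, and only the two function names (`load_data()` and `get_generators()`) come from the discussion.

```python
# Hypothetical sketch of the proposed two-function dataset API.
# Bodies and data are stand-ins for illustration, not the real implementation.

def load_data():
    """Return (features, labels) suitable for use in a tf.estimator input_fn."""
    features = {"image": [[0.0, 1.0], [2.0, 3.0]]}
    labels = [0, 1]
    return features, labels

def get_generators():
    """Return a generator function for constructing a tf.data.Dataset
    (e.g. via tf.data.Dataset.from_generator)."""
    def example_generator():
        for image, label in [([0.0, 1.0], 0), ([2.0, 3.0], 1)]:
            yield image, label
    return example_generator
```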
I think the way we have now is reasonable. The cuts are defined at the time of class creation as a constraint on what events are accessible, so as far as anyone outside the Dataset is concerned, events failing cuts don't exist. This seems simplest, and right now we don't yet have a use case for needing mutable cuts.
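A minimal sketch of the "permanent cuts" idea described above, with assumed names and a toy charge cut: events failing the cut are filtered out in the constructor, so no later call can reach them.

```python
# Sketch of cuts fixed at construction time (names and cut are illustrative).

class DataLoader:
    def __init__(self, events, min_charge=0.0):
        # Cuts are applied once, here; non-passing events are inaccessible
        # through any method of this instance.
        self._events = [e for e in events if e["charge"] >= min_charge]

    def __len__(self):
        return len(self._events)

    def get_event(self, index):
        return self._events[index]

loader = DataLoader(
    [{"charge": 5.0}, {"charge": 0.5}, {"charge": 10.0}],
    min_charge=1.0,
)
```

To change the cuts, one would construct a new DataLoader rather than mutate an existing one, which matches the "permanent" semantics discussed.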
I think this is a critical question. I'm interested to hear what everyone thinks on this. Actually, my opinion is the reverse of this. I think CTALearnDataset should be the fundamental basis of the data loading code, with the data loading function as a class function. DataProcessor would be a secondary module that plugs in and could be used independently. To me, data loading is fundamental, while data processing is inherently optional and modular; I could see use cases in which data processing isn't used at all, our data processing code is used in a different context, or an external library is used for data processing instead of or in conjunction with our code. The usage could be as follows:
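A hypothetical usage sketch of the design described above: the data loading class is fundamental, and DataProcessor is an optional plug-in. Class names follow the discussion, but the method bodies and the stand-in transform are illustrative assumptions.

```python
# Illustrative sketch: DataProcessor plugs into the loader but is optional.

class DataProcessor:
    def process_example(self, example):
        # Stand-in transform (a real processor might normalize, crop, etc.)
        return [x * 2 for x in example]

class DataLoader:
    def __init__(self, data_processor=None):
        self.data_processor = data_processor  # optional plug-in module

    def get_example(self, index):
        example = [float(index), float(index) + 1.0]  # stand-in raw data
        if self.data_processor is not None:
            example = self.data_processor.process_example(example)
        return example

# Data loading alone is a valid use case:
raw_loader = DataLoader()
# Or with processing plugged in:
processed_loader = DataLoader(data_processor=DataProcessor())
```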
I think it's reasonable to store the settings and attributes internally as separate parameters. If there's a best practice for this we should do that, but I'm not sure what it is. Maybe a good way to do it is to provide the keyword arguments from the dictionary via dictionary unpacking.
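The dictionary-unpacking approach mentioned above would look like this (parameter names are invented for illustration):

```python
# Settings as separate keyword arguments, supplied via dictionary unpacking.

class DataLoader:
    def __init__(self, mode="train", apply_cuts=True, shuffle=False):
        self.mode = mode
        self.apply_cuts = apply_cuts
        self.shuffle = shuffle

# A settings dict (e.g. parsed from a config file) maps onto the
# constructor's keyword arguments with the ** operator:
settings = {"mode": "test", "apply_cuts": False}
loader = DataLoader(**settings)  # unspecified settings keep their defaults
```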
One final note: we have some alternatives for terminology to use. For the data from a single gamma-ray event, we are currently calling it "event", "example", and "data". For a non-single-telescope event, we are currently calling it either "array-level", "event", or "multiple-tel". For each, we should pick a single term. Speak up if you have a strong opinion.
I was envisioning something closer to the factory approach, but I don't have any issues with the persistent object approach (clearly it works well for the TensorFlow developers). I don't know if Daniel or Qi have thoughts on this.
On second thought, isn't there always a need for this function because of the plot_bounding_boxes.py script? So it needs to be there regardless of the underlying data format.
This is a good point. I'd say that since we haven't had a workflow like this yet this tradeoff is OK to make since making a new dataset with new settings is clear and not much more difficult. If this is a common situation in the future though we may have to revisit this.
Why not use a single leading underscore? That's standard notation to indicate that the method is internal, but can still be accessed outside the class if necessary.
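To illustrate the convention (method names here are invented): a single leading underscore marks a method as internal by convention only, so it stays reachable from outside when genuinely needed, unlike the name mangling triggered by a double leading underscore.

```python
# Single-underscore convention: internal by convention, still accessible.

class DataLoader:
    def _load_tel_data(self):
        # Internal helper; the underscore signals "not part of the public API".
        return "tel data"

    def get_example(self):
        return self._load_tel_data()

loader = DataLoader()
loader.get_example()      # normal public access
loader._load_tel_data()   # still callable from outside if truly necessary
```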
Thanks again for the detailed feedback! I'll work on and push some commits in the next few days along the lines you described and addressing the issues you mentioned. Then we can take stock again and discuss what remains to be done/fixed + maybe Qi and Daniel will be able to weigh in on some of the longer-term design questions as well.
HDF5DataLoader. All cuts and data loading-related settings now moved to constructor and instance attributes. Cuts and other types of data selection are now "permanent" and tied to each instance of DataLoader. Data processing now handled by providing a DataProcessor to a DataLoader. Single tel and array level get_example functions now replaced with a single get_example, the example type of which is set by the DataLoader instance. Arguments to get_example and get_image are no longer specified by the abstract methods (should be determined by each sub-class). Some name changes ("example", "event", "image") for consistency. Some methods marked as private.
DataLoader.get_example(). Skeleton for DataProcessor._augment_data() added. Process example functions for single tel and array data combined into DataProcessor.process_example(), which now infers the example type by type checking.
…learn into rework-data-input
The data input class originally represented a complete collection of events and images (the entire dataset) which could be queried repeatedly with different cut conditions and options to get lists of events. By pushing all of the settings out of the instance methods and into the constructor, as you recommended (and completely denying access to non-selected/non-passing events), I think it now makes more sense to think of the class as a factory, separate from the underlying dataset, which controls access to it. Considering this, I switched the name back to your original suggestion of DataLoader.

For now I put off the question of what unique identifier will be used to identify each event/image until we can discuss it in more depth. AFAIK abstract methods don't require that you specify all of the arguments, so get_image and get_example can be implemented by each sub-class with whatever arguments are required. This is definitely sub-optimal because the user needs to be aware of each sub-class instead of relying on a uniform interface, but the same is true if we make the identifier a list and delegate the parsing to the sub-class (the user needs to know, or be told via error messages, whether the provided identifiers are in a valid format), so I figured it would be OK as a temporary measure.

All of the settings are now separate keyword arguments and can be provided by dictionary unpacking by the user, as you suggested. Regarding terminology, for now I tried to be consistent in using the following (although I may have missed something):
For DataProcessor.process_example(), for now I had it infer whether it received a single tel example or an array example by a type check, but this does not seem like a very clear way to handle it. The example type could be provided explicitly by a keyword argument, but it seems like that is not desired. My understanding is that we want a DataProcessor to be independent of DataLoader but also have process_example() work correctly on both types of examples without additional arguments. In line with what you suggested, the top-level example loading functionality was moved to DataLoader.get_example(). DataLoader can now be provided a DataProcessor (or None) when instantiated and it is used to process the data (or not) in .get_example().
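A hedged sketch of the type-check inference described above. The concrete types are assumptions for illustration: here a single-tel example is a lone image value, while an array-level example is a list of per-telescope images.

```python
# Sketch: infer example type by type-checking (as discussed, not necessarily
# the clearest design; an explicit keyword argument is the alternative).

class DataProcessor:
    def process_example(self, example):
        if isinstance(example, list):
            # Array-level example: process each telescope's image.
            return [self._process_image(image) for image in example]
        # Single-tel example: a single image.
        return self._process_image(example)

    def _process_image(self, image):
        return image + 1  # stand-in for normalization, mapping, etc.

proc = DataProcessor()
```

The drawback noted in the text is visible here: the dispatch silently depends on the caller's data shapes, which is why an explicit example_type argument was also considered.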
The data loading/processing code looks good to me. Assuming it's all been debugged, this should be good to close #14, #15, #16, and #18. I have only a couple of small comments.
Updated data_loading.py and data_processing.py to use image_mapper from image_mapping.py. DataLoader and DataProcessor now both take an image_mapper. If DataLoader is instantiated with a DataProcessor instance, its image_mapper is shared by both. If not, then DataLoader uses a default image_mapper instance. Fixed functionality for applying cuts when in single_tel mode. Now correctly checks for example_type, uses the single selected telescope type in self.selected_tel_types, and locates non-blank images from the selected telescopes. Removed image.py, now replaced by functionality in image_mapping.py
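The image_mapper sharing logic described in this commit could be sketched as follows (constructor details are assumptions; only the sharing behavior comes from the text): if a DataProcessor is supplied, the DataLoader reuses its image_mapper so both apply the same mapping; otherwise the DataLoader creates a default instance.

```python
# Sketch of image_mapper sharing between DataLoader and DataProcessor.

class ImageMapper:
    pass  # stand-in for the real image_mapping.ImageMapper

class DataProcessor:
    def __init__(self, image_mapper=None):
        self.image_mapper = image_mapper or ImageMapper()

class DataLoader:
    def __init__(self, data_processor=None):
        if data_processor is not None:
            # Share the processor's mapper so both stages map images identically.
            self.image_mapper = data_processor.image_mapper
        else:
            self.image_mapper = ImageMapper()  # default instance

proc = DataProcessor()
shared_loader = DataLoader(data_processor=proc)
standalone_loader = DataLoader()
```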
…learn into rework-data-input
Changed --predict command line option to a mode setting. Reworked config parsing to use configobj instead of configparser and added corresponding config_spec.ini. Reworked run_model.py to use DataLoader, DataProcessor, and image_mapper
…tation that needs checking
Reorganized ImageMapper methods for clarity and to better match other modules. Refactored references to ImageMapper in DataLoader and DataProcessor.
…learn into rework-data-input
to work correctly. Fixed minor issues with configobj parsing and validation. Fixed issue with incorrect behavior when num_epochs == 0.
single tel type. Data loading now only supports a single telescope type at one time. Added missing metadata fields and logic to combine metadata from DataProcessor. Names of run_id and event_id fixed to run_number and event_number to match PyTables files. Use telescope positions option added to constructor. Generator and map_fn output names, types, and is_label added as hardcoded instance variables. Other minor bugfixes.
examples, as provided by DataLoader. Added instance variables for image shapes and number of auxiliary parameters. Added get_metadata() method to get post-processing metadata parameters.
A question regarding the rotation of the image mapper:
instantiation. Fixed get_image to correctly use image_mapper.map_image. Moved conversion of images, triggers, and aux_info to numpy arrays out of DataProcessor.process_example and into get_example. Fixed bug with single_tel_examples_to_indices. Changed datatype of triggers from float32 to int8. Removed unnecessary expand_dims to correct image dimensions.
Removed unnecessary expand_dims to fix image shape. Removed some comments.
passed in for use with normalization. Fixed bug with thresholds.
can receive the image_charge_mins from DataLoader.
I think I remember noticing something like this with MSTS a while ago. I don't think there's any particular reason for it other than how the mapping table function was written (although it might be nice to transpose as you suggest for consistency). My understanding is that the orientation shouldn't affect anything, though, as long as all of the telescopes within a class are consistent.
I'm going to close this PR for now because all of the major issues it aimed to address are now resolved (although there are likely still some bugs and issues to fix). We can continue bugfixing and working on other changes (for example, to the image mapping) in other PRs.
This request resolves #14, resolves #15, resolves #16, and resolves #18. This pull request is being opened prematurely to facilitate discussion on proposed changes. Before merging, the requested changes need to be discussed, modified, debugged, and tested, and the rewrite of train.py requested in #17 must be completed.
In this proposed change:
All data input is now handled by an abstract class CTALearnDataset. The underlying implementation is delegated to sub-classes which inherit from CTALearnDataset, such as HDF5Dataset, which is implemented. The CTALearnDataset interface is built on using run_id, event_id, and tel_ids as unique identifiers for events and images. How these are used internally to locate and return data is left up to the implemented data format and the inheriting class (i.e. HDF5Dataset).
All data processing is now handled by a class DataProcessor.
The top-level data loading functions, analogous to load_single_tel_event_HDF5() and load_array_event_HDF5() are implemented as separate non-class functions load_array_event() and load_single_tel_event() and located in data_processing.py as I was unsure of which module to place them in.
Other changes include implementing support for multiple telescope types in an array event, major changes to metadata loading, and refactoring and modifications of other functions.
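The abstract-class design described in the list above could be sketched with Python's abc module. This is a minimal illustration assuming the identifiers named in the text (run_id, event_id, tel_id); the real HDF5Dataset would look images up in PyTables/HDF5 files rather than returning a stand-in string.

```python
# Sketch of the CTALearnDataset abstract interface and an HDF5 sub-class.
from abc import ABC, abstractmethod

class CTALearnDataset(ABC):
    @abstractmethod
    def get_image(self, run_id, event_id, tel_id):
        """Return the image for one telescope in one event."""

class HDF5Dataset(CTALearnDataset):
    def get_image(self, run_id, event_id, tel_id):
        # Stand-in: a real implementation resolves the identifiers to a
        # row in the HDF5 file and returns the image array.
        return "image for run {}, event {}, tel {}".format(
            run_id, event_id, tel_id)

dataset = HDF5Dataset()
```

Because CTALearnDataset has an unimplemented abstract method, instantiating it directly raises TypeError, so all concrete behavior lives in format-specific sub-classes like HDF5Dataset.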
NOTE: The test_metadata script rework from #27 is still incomplete in spite of the changes to test_metadata.py and print_dataset_metadata.py, and must be re-done to align with the above changes.
Key questions regarding these changes include: