Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize header metadata. #10

Merged
merged 24 commits into from
Jun 28, 2021
Merged

Standardize header metadata. #10

merged 24 commits into from
Jun 28, 2021

Conversation

DinoBektesevic
Copy link
Member

This is a first attempt at taking the various varied FITS headers and standardizing a select number of header keywords that we can use in our header metadata DB table.

The header keywords we standardize on here will set our table schema and our query tool interface but there are some outstanding issues with standardizing the WCSs.
If we can not easily separate WCS data into a standardized set of columns querying that table will be much harder but it should still be possible to save a subset of keywords (center pixel values and coordinates) and the whole WCS as a pickled blob.

This would not be optimal, as optimal as storing all of these values separately, I suspect. Ergo, this draft PR to give some insight into what the difficulties are.

trail/upload/models.py Outdated Show resolved Hide resolved
trail/upload/models.py Outdated Show resolved Hide resolved
@mrawls
Copy link
Member

mrawls commented Apr 21, 2021

Metadata still missing: band or filter, exposure duration (grab independently instead of doing end - start if feasible), some kind of processed/reduced flag (probably either "yes" or "unknown")

@mrawls mrawls linked an issue Apr 30, 2021 that may be closed by this pull request
@DinoBektesevic
Copy link
Member Author

Metadata still missing: band or filter, exposure duration (grab independently instead of doing end - start if feasible), some kind of processed/reduced flag (probably either "yes" or "unknown")

Added filter and exposure time for astro_metadata_translator recognized instruments.

Added test dataset to get an overview of where functionality is right now. Coverage right now is about 50/50 of the dataset you compiled. Images in the dataset have been cropped for size but the WCS values have been shifted to maintain positional accuracy (approximately).

I started gnawing at getting more instruments supported today but nothing that I would like to commit quite yet. I think our discussion was on-point and that whether we want or no we will have to create a HeaderStandardizer and a ImageStandardizer classes that do the atomic work on a single header and image and then see if FitsProcessor can be made to support both single and multi-extension fits files. If not a bit of a redesign of the FitsProcessor into a SingleExtensionFitsProcessor and a MultiExtensionFitsProcessor might be needed so I don't want to promote this into a full PR quite yet.

@DinoBektesevic DinoBektesevic force-pushed the astrometadata branch 2 times, most recently from 9d73c83 to 7ee1de1 Compare May 3, 2021 07:15
@DinoBektesevic
Copy link
Member Author

Ok, this is now getting much better. I think someone may take a look at what is in this code and not feel offended.

Added header standardizer classes, essentially maps between what we want in our DB and what we can find in particular instrument's header

Processors are essentially recipes of how particular file should be processed. For example, a compressed archive needs to be uncompressed, "unarchived" then each file needs to be processed; but is this particular file Rubin calexp - then the image is only in 1st HDU - or is it a DECam NOAO image - then all HDUs have headers and data - is it an SDSS....

To do:

  • rework the DB schema to be able to accept focal planes.
    • Different pipelines produce different types of FITS files. Some store all CCDs from a focal plane in the same FITS, some split them up. The fake DB model right now accepts everything as a single CCD. We either unravel multiext fits into individual CCDs on a generic basis or we implement a better schema.
    • DB schema could also use a lot of work in terms of recording uploading times, ips, etc...
  • Fix the astro_metadata_translator issues with Header Fix for Astropy Header not behaving dict-like. lsst/astro_metadata_translator#53
  • Fix science validity of the parsed headers.
    • As-is we are not parsing MOA-II fits file WCS data correctly and it's unclear how to. Perhaps this is a good test case for Astrometry.NET integration. Other headers insert data into the DB, but timezones and other details could be an issue.
    • Fix timezone-naive Time.iso output.
      Format for the time-zone aware datetime objects seem to be a bit different and Astropy does not seem to return timezone encoded strings.
    • Standardize database schema values, as is instruments, telescopes, filters etc all kind of return something different. For certain things we can guarantee consistency (f.e. what is in telescope, what in instrument etc. for others, i.e. filter or science-program, we can do our best to be as close to something standard but really can not be consistent). Perhaps make these fields something like "identifier" and then adopt the instruments conventional naming scheme (f.e. run-calcol-filter-field for SDSS or run-field-filter-chip for MOA-II etc.)
    • This is literally an easy free-for-all problem, it just takes emailing people in cases like MOA-II or, for bigger instruments, reading documentation. Las Cumbres standardizer even has a comment to paper with the lookup table.
  • Figure out what I want to do with the dataclasses
    • they are very helpful when debugging etc and it is nice when you don't have to traverse highly nested dicts, but their function would ideally be handled by the model classes. Unfortunately the schema is not very reflective of processing steps and will complain for mandatory values for example.
    • no tests for now, until we decide we like or don't like them
  • Add a register method to processors and standardizers?
    • gets rid of the import funkery that has to happen right now.
  • Add a (manual, or automatic) priority levels to processors/standardizers
    • or remember procs./stand. that volunteered and try to reprocess on fail....
  • better naming than process_upload.upload_processor.process etc...
    • help appreciated, I am not imaginative
  • Consider getting rid of from* and get* methods on these classes and make instantiation do that bit of work too (makes it prettier?)
  • Perhaps a mixin class to standardize all these similar method names and reduce amount of code duplication.
  • flakify
  • flesh out the tests
    • as is we essentially just run the entire pipeline end-to-end so we excercise the code, but we should really make it more atomic
  • fix how gallery works before this is merged.

trail/upload/models.py Outdated Show resolved Hide resolved
Copy link
Member

@mrawls mrawls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a marathon of a review! Impressive work, let me know if any of my comments are unclear. I trust you to merge when you think you've addressed them.

trail/upload/models.py Outdated Show resolved Hide resolved
trail/upload/models.py Outdated Show resolved Hide resolved
trail/upload/models.py Outdated Show resolved Hide resolved
trail/upload/models.py Outdated Show resolved Hide resolved
trail/upload/process_uploads/processors/decam_processor.py Outdated Show resolved Hide resolved
trail/upload/process_uploads/standardized_dataclasses.py Outdated Show resolved Hide resolved
Attempt to use astro_metadata_translator and Astropy WCS to
standardize the FITS metadata we want to push to our DB.
Separate image center pixel, image corner pixel and
radius (between the two) into separate values and store those.

TODO: store the actual WCS as a blob data and move on to
astro metadata translator.
Fix the Django metadata model.
Prettify code. Allow for future expansion of processing, potentially,
compressed, fits archives.

Create small and large thumbnails for gallery.

Add some extra practical functionality to TemporaryUploadedFile.

archives
Fixes for PR review comments.
Add a minimal test dataset.
Add tests.
Fix for naive timezone DB error.
Fix for table migration error.
Remove unneccessary (I think) migration directories from apps that
don't do any database interaction.
Add Processors - classes that understand the layout of differet
    FITS files and how they should be processed.
Add Standardizers- classes that understand the different types of
    HDUs and are extract our database model keys from them.
Add dataclasses - wider adoption unclear but they do make life
    significantly easier when it comes to comparing and validating
    various collections of standardized keys.
Fix the rest of the upload app to interface the new functionality.
Add tests.
Add docstring.
Fix the ugly test data yaml file.
Replace the DECam test data with newer one that will more accurately
represent DECam data.
Fix a slight dataclass issue once we changed the format of the
test dataset yaml file.
Move SingleExtensionFits and MultiExtensionFits to processors.
Move primary header stuff to FitsProcessor as common pattern.
Fix minor problems with priority values.
Move FitsProcessor to UploadProcessor level.
Get rid of the long and short output format.

Simplify dataclasses.
Fix tests, fix comment.
Storing multi-ext FITS files is complicated because it's
not trivial to unravel them (leaving potentially many images
without trails) and it's not easy to store only the relevant
data. As was discussed, we store the whole file, and live
with a little bit of non-user-friendliness.

Fix the poor naming scheme of UploadProcessor attributes,
what was `upload` is `fileWrapper`, what is `upload` now
is the UploadInfo model.

Add UploadInfo model, which can be used to track what IP
uploaded which data and when.

Reset migration history.

Refactor the view and intiial processing code, punting
much of the functionality into the UploadProcessor.

Replace metadata_translator_name column by standardizer
and processor name columns. Add special naming convention
for astro_metadata_translator classes. Default the rest
to their class names.

Fix tests.
Merge standardized dataclasses functionality with database models.

Add thumbnails class in preparation for gallery upgrade to
querying the database for images.
Make processed thumbnail images grayscale instead of viridis.

Minor refactoring of processors and standardizers. Got rid of
unused methods, transitioned from returning dictionaries to
returning model classes.

Fix tests.
@DinoBektesevic
Copy link
Member Author

Ok, fixed the last comments and added tests for Models.

The remainder of issues here were punted other issues in which we will tackle them. Mainly all is well except: we don't have good error handling, data update/replace/remove logic and MOA-II WCS is a placeholder pending error handling, or getting Astrometry.net solver working.

Looking for any last comments?

@mrawls
Copy link
Member

mrawls commented Jun 28, 2021

Understood. Let's get it merged so we can find fun new ways to break things 😄

@DinoBektesevic DinoBektesevic merged commit 444ec9c into main Jun 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Header metadata and DB schema definitions
2 participants