Standardize header metadata. #10

DinoBektesevic · 2021-04-07T00:33:36Z

This is a first attempt at taking the various FITS headers and standardizing a select set of header keywords that we can use in our header metadata DB table.

The header keywords we standardize on here will set our table schema and our query tool interface, but there are some outstanding issues with standardizing the WCSs.
If we cannot easily separate WCS data into a standardized set of columns, querying that table will be much harder, but it should still be possible to save a subset of keywords (center pixel values and coordinates) and the whole WCS as a pickled blob.

I suspect this would not be as optimal as storing all of these values separately. Hence this draft PR, to give some insight into what the difficulties are.
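A minimal sketch of that fallback, assuming the header is available as a plain dict and that the WCS is described by a CD matrix. The function name `summarize_wcs` and the column names are hypothetical and not taken from this PR. The center coordinates use a first-order approximation around the reference pixel; real code would more likely go through Astropy's WCS transformations.

```python
import math
import pickle


def summarize_wcs(header):
    """Split a FITS-like header (a plain dict here) into a few queryable
    columns plus a pickled copy of every WCS-related keyword."""
    # Center of the image in FITS pixel coordinates (the first pixel is 1).
    center_x = (header["NAXIS1"] + 1) / 2
    center_y = (header["NAXIS2"] + 1) / 2

    # Offset of the image center from the WCS reference pixel.
    dx = center_x - header["CRPIX1"]
    dy = center_y - header["CRPIX2"]

    # First-order approximation: apply the CD matrix, then correct RA for
    # declination. Only reasonable for small offsets away from the poles.
    xi = header["CD1_1"] * dx + header["CD1_2"] * dy
    eta = header["CD2_1"] * dx + header["CD2_2"] * dy
    cos_dec = math.cos(math.radians(header["CRVAL2"]))
    center_ra = (header["CRVAL1"] + xi / cos_dec) % 360
    center_dec = header["CRVAL2"] + eta

    # Everything WCS-related goes into the blob, queryable or not.
    prefixes = ("NAXIS", "CTYPE", "CRPIX", "CRVAL", "CD")
    wcs_keywords = {k: v for k, v in header.items() if k.startswith(prefixes)}

    return {
        "center_x": center_x,
        "center_y": center_y,
        "center_ra": center_ra,
        "center_dec": center_dec,
        "wcs_blob": pickle.dumps(wcs_keywords),
    }
```

The four scalar columns can be indexed and queried directly, while `pickle.loads(row["wcs_blob"])` recovers the full keyword set for anything the columns do not cover.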

trail/upload/models.py

trail/upload/process_uploads/processors.py

mrawls · 2021-04-21T18:58:51Z

Metadata still missing: band or filter, exposure duration (grab independently instead of doing end - start if feasible), some kind of processed/reduced flag (probably either "yes" or "unknown")

DinoBektesevic · 2021-05-03T04:31:20Z

> Metadata still missing: band or filter, exposure duration (grab independently instead of doing end - start if feasible), some kind of processed/reduced flag (probably either "yes" or "unknown")

Added filter and exposure time for instruments recognized by astro_metadata_translator.

Added a test dataset to get an overview of where functionality is right now. Coverage right now is about half of the dataset you compiled. Images in the dataset have been cropped for size, but the WCS values have been shifted to (approximately) maintain positional accuracy.

I started gnawing at getting more instruments supported today, but nothing that I would like to commit quite yet. I think our discussion was on point and that, whether we want to or not, we will have to create HeaderStandardizer and ImageStandardizer classes that do the atomic work on a single header and image, and then see if FitsProcessor can be made to support both single- and multi-extension FITS files. If not, a bit of a redesign of the FitsProcessor into a SingleExtensionFitsProcessor and a MultiExtensionFitsProcessor might be needed, so I don't want to promote this into a full PR quite yet.

DinoBektesevic · 2021-05-07T23:25:32Z

Ok, this is now getting much better. I think someone may take a look at what is in this code and not feel offended.

Added header standardizer classes, which are essentially maps between what we want in our DB and what we can find in a particular instrument's header.

Processors are essentially recipes for how a particular file should be processed. For example, a compressed archive needs to be uncompressed and "unarchived", and then each file needs to be processed. But is this particular file a Rubin calexp, in which case the image is only in the 1st HDU, or is it a DECam NOAO image, in which case all HDUs have headers and data, or is it an SDSS....
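To make the split between the two concrete, here is a rough sketch under simplifying assumptions: a file is modeled as a list of header dicts (primary HDU first), the instrument name `CAM-A` is made up, and none of the class names below are the ones used in this PR. "1st HDU" is read here as the first extension after the primary HDU, which is also an assumption.

```python
class HeaderStandardizer:
    """Maps instrument-specific header keywords onto shared DB column names."""

    # DB column name -> header keyword; subclasses override this.
    keyword_map = {}

    def __init__(self, header):
        self.header = header

    def standardize(self):
        return {column: self.header[key] for column, key in self.keyword_map.items()}


class CamAStandardizer(HeaderStandardizer):
    # Hypothetical instrument; the keywords are common FITS conventions.
    keyword_map = {
        "telescope": "TELESCOP",
        "instrument": "INSTRUME",
        "filter": "FILTER",
        "exposure": "EXPTIME",
    }


class SingleImageProcessor:
    """Recipe for files whose only image sits in the first extension."""

    standardizer = CamAStandardizer

    def process(self, hdus):
        # Merge shared primary-header metadata with the image header.
        merged = {**hdus[0], **hdus[1]}
        return [self.standardizer(merged).standardize()]


class MultiImageProcessor:
    """Recipe for files where every extension has its own header and data."""

    standardizer = CamAStandardizer

    def process(self, hdus):
        primary, *extensions = hdus
        return [
            self.standardizer({**primary, **extension}).standardize()
            for extension in extensions
        ]
```

The standardizer only knows about keywords, while the processor only knows about file layout, so supporting a new instrument with a familiar layout would mean adding a keyword map and nothing else.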

To do:

- Rework the DB schema to be able to accept focal planes.
  - Different pipelines produce different types of FITS files. Some store all CCDs from a focal plane in the same FITS file, some split them up. The fake DB model right now accepts everything as a single CCD. We either unravel multi-extension FITS files into individual CCDs on a generic basis or we implement a better schema.
  - The DB schema could also use a lot of work in terms of recording upload times, IPs, etc.
- Fix the astro_metadata_translator issues with Header (Fix for Astropy Header not behaving dict-like, lsst/astro_metadata_translator#53).
- Fix science validity of the parsed headers.
  - As-is, we are not parsing MOA-II FITS file WCS data correctly and it's unclear how to. Perhaps this is a good test case for Astrometry.net integration. Other headers insert data into the DB, but timezones and other details could be an issue.
  - Fix timezone-naive Time.iso output. The format for timezone-aware datetime objects seems to be a bit different, and Astropy does not seem to return timezone-encoded strings.
  - Standardize database schema values. As-is, instruments, telescopes, filters, etc. all return something different. For certain things we can guarantee consistency (e.g. what is in telescope, what is in instrument, etc.); for others, i.e. filter or science program, we can do our best to be close to something standard but really cannot be consistent. Perhaps make these fields something like "identifier" and then adopt the instrument's conventional naming scheme (e.g. run-camcol-filter-field for SDSS or run-field-filter-chip for MOA-II, etc.).
  - This is literally an easy free-for-all problem; it just takes emailing people in cases like MOA-II or, for bigger instruments, reading documentation. The Las Cumbres standardizer even has a comment pointing to the paper with the lookup table.
- Figure out what I want to do with the dataclasses.
  - They are very helpful when debugging, and it is nice not to have to traverse highly nested dicts, but their function would ideally be handled by the model classes. Unfortunately the schema is not very reflective of the processing steps and will, for example, complain about mandatory values.
  - No tests for now, until we decide whether we like them or not.
- Add a register method to processors and standardizers?
  - Gets rid of the import funkery that has to happen right now.
- Add (manual or automatic) priority levels to processors/standardizers,
  - or remember the processors/standardizers that volunteered and try to reprocess on fail....
- Better naming than process_upload.upload_processor.process etc.
  - Help appreciated, I am not imaginative.
- Consider getting rid of the from* and get* methods on these classes and make instantiation do that bit of work too (makes it prettier?).
- Perhaps a mixin class to standardize all these similar method names and reduce the amount of code duplication.
- flakify
- Flesh out the tests.
  - As-is, we essentially just run the entire pipeline end-to-end so we exercise the code, but we should really make it more atomic.
- Fix how gallery works before this is merged.
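The registration and priority items could be combined roughly as sketched below. This is one possible design and not what the PR implements: subclasses register themselves through `__init_subclass__`, every standardizer that volunteers for a header is tried in priority order, and the next one is tried when a higher-priority one fails. All class names and the `CAM-A` instrument are hypothetical.

```python
class Standardizer:
    """Base class; every subclass registers itself when it is defined."""

    registry = []  # all known standardizer classes
    priority = 0   # higher values are tried first

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Standardizer.registry.append(cls)

    @classmethod
    def can_standardize(cls, header):
        raise NotImplementedError

    def standardize(self, header):
        raise NotImplementedError


class GenericStandardizer(Standardizer):
    priority = 0

    @classmethod
    def can_standardize(cls, header):
        return "EXPTIME" in header

    def standardize(self, header):
        return {"exposure": header["EXPTIME"]}


class CamAStandardizer(Standardizer):
    priority = 10

    @classmethod
    def can_standardize(cls, header):
        return header.get("INSTRUME") == "CAM-A"

    def standardize(self, header):
        # Raises KeyError when an expected keyword is missing.
        return {"exposure": header["EXPTIME"], "filter": header["FILTER"]}


def standardize(header):
    """Try every volunteering standardizer, highest priority first."""
    volunteers = [s for s in Standardizer.registry if s.can_standardize(header)]
    volunteers.sort(key=lambda s: s.priority, reverse=True)
    failures = []
    for candidate in volunteers:
        try:
            return candidate.__name__, candidate().standardize(header)
        except KeyError as err:
            failures.append(f"{candidate.__name__}: missing {err}")
    raise ValueError(f"no standardizer succeeded: {failures}")
```

With this layout a new standardizer only has to be defined somewhere that gets imported; no central list needs editing, and returning the class name alongside the result makes it easy to record which standardizer produced a row.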

trail/upload/process_uploads/processors/decam_processor.py

trail/upload/models.py

mrawls

That was a marathon of a review! Impressive work, let me know if any of my comments are unclear. I trust you to merge when you think you've addressed them.

trail/upload/models.py

trail/upload/process_uploads/processors/decam_processor.py

trail/upload/process_uploads/header_standardizer.py

trail/upload/process_uploads/processors/decam_processor.py

trail/upload/process_uploads/processors/multi_extension_fits.py

trail/upload/process_uploads/processors/single_extension_fits.py

trail/upload/process_uploads/standardized_dataclasses.py

Attempt to use astro_metadata_translator and Astropy WCS to standardize the FITS metadata we want to push to our DB.

Separate image center pixel, image corner pixel and radius (between the two) into separate values and store those. TODO: store the actual WCS as blob data and move on to astro_metadata_translator.
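For illustration, here is how a center, a corner and the radius between them could support a simple overlap query. This sketch assumes the radius is the angular distance on the sky between the two points, in degrees; the commit message does not say which units are used, and the function and column names are hypothetical.

```python
import math


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees between two sky positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which behaves well for small separations.
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def footprint_columns(center, corner):
    """Build the stored values from (ra, dec) tuples of center and corner."""
    return {
        "center_ra": center[0],
        "center_dec": center[1],
        "corner_ra": corner[0],
        "corner_dec": corner[1],
        "radius": angular_separation(*center, *corner),
    }


def may_contain(row, ra, dec):
    """Cheap pre-filter: is the point inside the image's bounding circle?"""
    return angular_separation(row["center_ra"], row["center_dec"], ra, dec) <= row["radius"]
```

A bounding circle overestimates the image footprint, so a check like `may_contain` can only rule images out; anything that passes would still need the full WCS for an exact answer.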

Fix the Django metadata model.

Prettify code. Allow for future expansion to processing of (potentially compressed) FITS archives. Create small and large thumbnails for gallery. Add some extra practical functionality to TemporaryUploadedFile.

Fixes for PR review comments. Add a minimal test dataset. Add tests. Fix for naive timezone DB error. Fix for table migration error. Remove unnecessary (I think) migration directories from apps that don't do any database interaction.
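As a sketch of the kind of conversion that avoids handing a timezone-naive value to the database: this assumes the header time is already in UTC and arrives as an ISO-like string without an offset. The helper name `to_aware_utc` is hypothetical and this is not necessarily how the fix in this commit works.

```python
from datetime import datetime, timezone


def to_aware_utc(iso_string):
    """Parse an ISO-like timestamp and return a timezone-aware UTC datetime."""
    parsed = datetime.fromisoformat(iso_string)
    if parsed.tzinfo is not None:
        # Already aware: just express it in UTC.
        return parsed.astimezone(timezone.utc)
    # Naive: attach UTC without shifting the clock time (assumes UTC input).
    return parsed.replace(tzinfo=timezone.utc)
```

For example, `to_aware_utc("2021-05-03 04:31:20.000")` yields a datetime whose `tzinfo` is UTC, so the stored value carries an explicit offset.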

Add Processors - classes that understand the layout of different FITS files and how they should be processed. Add Standardizers - classes that understand the different types of HDUs and extract our database model keys from them. Add dataclasses - wider adoption unclear, but they do make life significantly easier when it comes to comparing and validating various collections of standardized keys. Fix the rest of the upload app to interface with the new functionality. Add tests. Add docstring.

Fix the ugly test data yaml file. Replace the DECam test data with a newer set that more accurately represents DECam data. Fix a slight dataclass issue that appeared once we changed the format of the test dataset yaml file.

Move SingleExtensionFits and MultiExtensionFits to processors. Move primary header stuff to FitsProcessor as common pattern. Fix minor problems with priority values. Move FitsProcessor to UploadProcessor level.

Get rid of the long and short output format. Simplify dataclasses. Fix tests, fix comment.

Storing multi-ext FITS files is complicated because it's not trivial to unravel them (leaving potentially many images without trails) and it's not easy to store only the relevant data. As was discussed, we store the whole file, and live with a little bit of non-user-friendliness. Fix the poor naming scheme of UploadProcessor attributes: what was `upload` is now `fileWrapper`, and what is `upload` now is the UploadInfo model. Add UploadInfo model, which can be used to track what IP uploaded which data and when. Reset migration history. Refactor the view and initial processing code, punting much of the functionality into the UploadProcessor. Replace metadata_translator_name column by standardizer and processor name columns. Add special naming convention for astro_metadata_translator classes. Default the rest to their class names. Fix tests.

Merge standardized dataclasses functionality with database models. Add thumbnails class in preparation for gallery upgrade to querying the database for images. Make processed thumbnail images grayscale instead of viridis. Minor refactoring of processors and standardizers. Got rid of unused methods, transitioned from returning dictionaries to returning model classes. Fix tests.

DinoBektesevic · 2021-06-26T02:27:43Z

Ok, fixed the last comments and added tests for Models.

The remaining issues here were punted to other issues, in which we will tackle them. Mainly all is well, except that we don't have good error handling or data update/replace/remove logic, and the MOA-II WCS is a placeholder pending error handling or getting the Astrometry.net solver working.

Looking for any last comments?

mrawls · 2021-06-28T19:20:31Z

Understood. Let's get it merged so we can find fun new ways to break things 😄

DinoBektesevic mentioned this pull request Apr 7, 2021

Header metadata and DB schema definitions #7

Closed

DinoBektesevic force-pushed the astrometadata branch 2 times, most recently from 0486e10 to a4e72c8 on April 9, 2021 21:49

mrawls reviewed Apr 21, 2021

trail/upload/models.py (outdated)

mrawls reviewed Apr 21, 2021

trail/upload/models.py (outdated)

mrawls reviewed Apr 21, 2021

trail/upload/process_uploads/processors.py (outdated)

mrawls linked an issue Apr 30, 2021 that may be closed by this pull request

Header metadata and DB schema definitions #7

Closed

DinoBektesevic force-pushed the astrometadata branch 2 times, most recently from 9d73c83 to 7ee1de1 on May 3, 2021 07:15

mrawls reviewed May 25, 2021

trail/upload/process_uploads/processors/decam_processor.py

mrawls reviewed May 25, 2021

trail/upload/models.py (outdated)

DinoBektesevic force-pushed the astrometadata branch from 196da71 to 3a02a54 on June 18, 2021 03:25

DinoBektesevic marked this pull request as ready for review June 18, 2021 03:34

DinoBektesevic force-pushed the astrometadata branch from 3a02a54 to 92781e0 on June 18, 2021 23:42

mrawls reviewed Jun 19, 2021

DinoBektesevic added 12 commits June 23, 2021 14:36

Standardize header metadata.

74b21d9

Save our own reference pixel values instead.

7317ffd

Ingest more metadata from header.

4810e9d

OOP processing scripts.

e4f3a92

Split code into files and add some documentation (and notes).

20f3cdc

Remove accidentally committed file.

f10e596

Review fixes, add tests, fix bugs.

d5d6cf1

Rename modules and cleanup.

e4fd360

Add full focal plane thumbnails for DECam Community Pipelines products.

a6e2608

Add docstrings, rename methods to be easier to understand.

33db423

Add prioritization to standardizers and processors.

f5705ed

DinoBektesevic added 10 commits June 23, 2021 14:36

Add tests for correct matching with priorities.

480f2c3

Reorganization and minor fixes.

0269dd3

Move .standardizeHeader to FitsProcessor.

f9ad072

Method storeHeader is not an abstract method.

b32b3a3

Flakeify

db49a61

Fix order of CCD identifiers in DECAMFocalPlane processor.

fe1e66b

Merge standardized dataclasses functionality into models.

5b3977f

Rebase on top of main

70fb3d6

DinoBektesevic force-pushed the astrometadata branch from 92781e0 to 70fb3d6 on June 23, 2021 21:41

DinoBektesevic mentioned this pull request Jun 24, 2021

Add type annotation. #17

Open

Add tests for models and dataclass.

c2a2445

DinoBektesevic mentioned this pull request Jun 26, 2021

Use the Astrometry.net as the WCS solver when WCS information is not present in the header. #18

Closed

Fix last of the review comments.

bf7c78a

DinoBektesevic merged commit 444ec9c into main Jun 28, 2021
