Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Selector class for keeping track of quality cuts #1207

Merged
merged 18 commits into from
Mar 30, 2020

Conversation

kosack
Copy link
Contributor

@kosack kosack commented Dec 16, 2019

This replaces the older utils.CutFlow with a new Component class that can be configured from a config file. It is a class that can apply a series of quality criteria to an object, and keep track of totals (e.g. for cuts that happen per-event, before data are written to an output table) - it doesn't reject events, rather it just keeps track of which criteria pass. The totals can also be written to an output file, as an astropy.Table.

It's used like so:

criteria = [
    ('criteria_1',  'lambda x: (x>2.0) and (x < 100)') ,
    ('another_check', 'lambda x: x < 3'), 
]
select = QualityQuery(quality_criteria=criteria) # can also be configured from a config file

# assuming the input is always the same type, we just call select(input) for each item:
for ii in range(100):
    result = query(ii) # result is an array of bools for each criterion
    if all(result == True):
         # do stuff

print(select.to_table()) # shows the total results as a table, with both individual and cumulative product sums.

Table output (from the unit test)

In [11]: query.to_table(functions=True)
Out[11]:
<Table length=4>
      criteria       counts cumulative_counts        func
       str20         int64        int64             str17
-------------------- ------ ----------------- -----------------
               TOTAL      4                 4    lambda x: True
         high_enough      3                 3   lambda x: x > 3
a_value_not_too_high      3                 2 lambda x: x < 100
            smallish      2                 1  lambda x: x < 10

You can also for example store the results in an output table with HDF5TableWriter, as a column of flags, for each event.
Note for example, you could pass it a Container or something if you wanted and have criteria like:

[
    ('good_min_width': 'lambda hillas: hillas.width > 0.2'),
    ('good_min_length': 'lambda hillas: hillas.length >1.0)'
]

An example can be found in #1163 , where it's used to select good images, configured as follows in a JSON file:.

  "ImageQualityQuery": {
        "quality_criteria": [
            ["enough_pixels", "lambda im: np.count_nonzero(im) > 2"],
            ["enough_charge", "lambda im: im.sum() > 100"]
        ]
    }

Caveats:

  • The name 'Selector' may be confusing since there already exists a GainSelector that is a different thing altogether (suggestions for a better name, or keep this one? I had originally called it 'Checker') → Renamed to QualityQuery
  • Currently the critieria are stored as a dict, but there is a strong assumption on order (for the cumulative product), so perhaps this is not a good idea if the data will be written to e.g. a YAML file in the future, where dicts don't preserve order (in pyyaml at least). Perhaps a list of tuples would be better? or a list of dicts?

@codecov
Copy link

codecov bot commented Dec 16, 2019

Codecov Report

Merging #1207 into master will increase coverage by 0.39%.
The diff coverage is 96.90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1207      +/-   ##
==========================================
+ Coverage   86.24%   86.64%   +0.39%     
==========================================
  Files         185      187       +2     
  Lines       11600    11759     +159     
==========================================
+ Hits        10004    10188     +184     
+ Misses       1596     1571      -25     
Impacted Files Coverage Δ
ctapipe/core/tests/test_qualityquery.py 95.65% <95.65%> (ø)
ctapipe/core/qualityquery.py 97.72% <97.72%> (ø)
ctapipe/core/__init__.py 100.00% <100.00%> (ø)
ctapipe/utils/CutFlow.py 95.08% <100.00%> (+0.16%) ⬆️
ctapipe/tools/display_dl1.py 86.27% <0.00%> (-3.83%) ⬇️
ctapipe/core/traits.py 94.52% <0.00%> (-3.14%) ⬇️
ctapipe/visualization/mpl_array.py 93.80% <0.00%> (-2.28%) ⬇️
ctapipe/visualization/mpl_camera.py 80.00% <0.00%> (-0.45%) ⬇️
ctapipe/image/muon/muon_diagnostic_plots.py 11.18% <0.00%> (-0.08%) ⬇️
ctapipe/io/simteleventsource.py 97.72% <0.00%> (-0.06%) ⬇️
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3171e67...52706dc. Read the comment docs.

ctapipe/core/selector.py Outdated Show resolved Hide resolved
@kosack
Copy link
Contributor Author

kosack commented Mar 12, 2020

to mitigate security issue of using eval here, see #1163 (comment)

@maxnoe
Copy link
Member

maxnoe commented Mar 27, 2020

@kosack the cutflow tests fail now

ctapipe/core/selector.py Outdated Show resolved Hide resolved
ensure only np.X and u.X functions and builtins can be used in the eval() of each selection function (Statements like import are already excluded by eval()).
since dicts lose their order when converted to/from JSON, YAML, etc, we have to use a list instead, since order matters
@kosack
Copy link
Contributor Author

kosack commented Mar 27, 2020

I've now made the security changes requested, and also switched to using a list of tuples of (criteria_description, selection_function) for configuration.

One further simplification could be to remove the criteria_description, so you just have the function string, and no more detail. Any preference whether or not we keep the descriptive name or not? The name was just there to make it easy for somebody to understand why the cut was there, but for most cuts I guess the function string would be simple enough.

maxnoe
maxnoe previously approved these changes Mar 27, 2020
Copy link
Member

@maxnoe maxnoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the description, I would leave it in

@watsonjj
Copy link
Contributor

The name 'Selector' may be confusing since there already exists a GainSelector that is a different thing altogether (suggestions for a better name, or keep this one? I had originally called it 'Checker')

This reminds me of pandas.DataFrame.query. So how about the class being called Query? It does not seem like this class is doing any selecting itself...it is instead just querying the data.

@kosack
Copy link
Contributor Author

kosack commented Mar 27, 2020

This reminds me of pandas.DataFrame.query. So how about the class being called Query? It does not seem like this class is doing any selecting itself...it is instead just querying the data.

That's not a bad idea - yes, it does not drop anything, that's still up to the user, it just returns an array of which criteria passed and keeps a running total. Originally I think I called it DataChecker, but didn't really like that term (but maybe it was ok). Query may be too generic a term since there will be other queries, potentially like for databases. Perhaps QualityQuery (so you'd have things like ImageQualityQuery in the config file)? The purpose is to evaluate a set of criteria on the same object (event, images, etc).

It's only useful for event-wise analysis, of course, anything operating on all events at once like a DataFrame would just use frame.query() as you suggest. It's also useful for storing some quality info in the case the images are not recorded (so you can still check why for example there are no hillas parameters for some events, even if you don't have the original image).

In a standard analysis there would be one of these for:

  • checking if images in an event should be parameterized (just to avoid errors, or to reduce DL1 data size, but perhaps it's not a good idea to make such cuts and store all events?)
  • checking if telescopes in an event should be used in reconstruction
  • checking if an event should be reconstructed at all

If we switch to "chunked" analysis, where we write out DL1 tables and reconstruct them whole, then this is better done another way. So it's just a stop-gap until then and could eventually be dropped.

@kosack kosack merged commit c7832e3 into cta-observatory:master Mar 30, 2020
@kosack kosack deleted the feature/selector branch October 2, 2020 09:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants