Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor of name for container of waveforms #608

Closed
watsonjj opened this issue Dec 13, 2017 · 18 comments
Closed

Refactor of name for container of waveforms #608

watsonjj opened this issue Dec 13, 2017 · 18 comments

Comments

@watsonjj
Copy link
Contributor

I would like to propose a refactor of the name for the container of waveforms. Currently we have:

event.r0.tel[1].adc_samples
event.r1.tel[1].pe_samples
event.dl0.tel[1].pe_samples
event.dl1.tel[1].image

It is potentially misleading to call them "adc_samples" and "pe_samples" as these assume units that might not be the case. Therefore I would propose to rename them to "samples". This would also make the access of waveforms more consistent across data levels.

This would be a huge refactor that would presumably affect every ctapipe user's code. Therefore it should probably be delayed until we intend a huge refactor of the code. This could be the same time we remove the "tel" containers for the R0 -> DL1 containers (I presume this is an intended change once we move to single telescopes per file).

@calispac
Copy link
Contributor

In this case should :

  1. pe_samples.shape == (n_channels, n_pixels, n_samples) ? i.e. some kind of waveform where the max is the reconstructed p.e. charge
  2. pe_samples.shape == (n_channels, n_pixels) ?

In case of 2. I think it is more intuitive to have them named pe_image. Or not ?

@watsonjj
Copy link
Contributor Author

event.r0.tel[1].adc_samples --> event.r0.tel[1].samples
event.r1.tel[1].pe_samples --> event.r1.tel[1].samples
event.dl0.tel[1].pe_samples --> event.dl0.tel[1].samples

All of these are waveforms, so have shape (n_channels, n_pixels, n_samples)

The units depend on the low-level R1 calibration each camera applies on them. I'm not sure if there is a requirement to have the waveforms in p.e. at the dl0 level (they will at least be something proportional to p.e. i presume, such that the online analysis can use them)

@calispac
Copy link
Contributor

calispac commented Dec 13, 2017

Ok I did not get your point first.
It seems more user friendly to have the "main data" being named similarly across data levels. As the data level indicates some transformation of samples.

@kosack
Copy link
Contributor

kosack commented Jan 15, 2018

I'm happy with the suggested names, though there will need to be some small change to support the fact that in the final DL1 we have only samples for a subset and images for the rest (so we need both DL0.image and DL0.samples.

In fact, I'm working on standardizing the definitions for all of CTA. In my current doc I have "image" = a single image for a full camera, and "traces" = the time-evolution of a subset of pixels, but "samples" is perhaps better (any preference or better name?).

@kosack
Copy link
Contributor

kosack commented Jan 15, 2018

In case of 2. I think it is more intuitive to have them named pe_image. Or not ?

Yes, I think if there is no sample dimension, it should be "image" (for example, we may have the intensity image, peak_time_image, etc...)

@watsonjj
Copy link
Contributor Author

watsonjj commented Jan 15, 2018

I'm happy with the suggested names, though there will need to be some small change to support the fact that in the final DL1 we have only samples for a subset and images for the rest (so we need both DL0.image and DL0.samples.

I'm a little confused. I didn't think there is a image at DL0, we are still in samples at that data level, no charge extraction has been performed. whereas at DL1 we no longer have samples and only have image, as charge extraction has been performed.

Correct me if I'm wrong, but it should be changed to:

event.r0.tel[1].samples
event.r1.tel[1].samples
event.dl0.tel[1].samples
event.dl1.tel[1].image

@watsonjj
Copy link
Contributor Author

watsonjj commented Jan 15, 2018

Did you mean to only specify DL0, and were talking about that some pixels (the important ones) will have all the samples, but other pixels will be reduced to single samples as a form of data reduction to save storage space?

If so then yeah your suggestion may be appropriate...but it might be useful to have a proper discussion of how to handle this.

Some immediate question I have:

  • Will the "important" pixels always have the same number of samples?
  • Is there a way when we load the data that is could still be stored in a (npixels, nsamples) array?
  • Is there some special array object that we could define to properly handle the reduced pixels properly? (therefore requiring minimum changes to other code)

@kosack
Copy link
Contributor

kosack commented Jan 15, 2018

Did you mean to only specify DL0, and were talking about that some pixels (the important ones) >will have all the samples, but other pixels will be reduced to single samples as a form of data >reduction to save storage space?

Yes, that's only DL0 - for R0, it's much simpler, just as we have now

Some immediate question I have:

Will the "important" pixels always have the same number of samples?

I think we can assume yes

Is there a way when we load the data that is could still be stored in a (npixels, nsamples) array?
Is there some special array object that we could define to properly handle the reduced pixels >properly? (therefore requiring minimum changes to other code)

I think that is basically a sparse array (which numpy supports), or we could just use a masked array (wasting a bit of memory). We can assume that even if the data are stored in a more sparse way, they can be "unpacked" into this sort of in-memory structure, for convenience. So I think to answer the question, we can just assume it looks like it is now, but some elements may be not filled in, or all zeros.

The way I've defined it so far in the doc for DL0 is a bit different, but can be converted to this way of thinking. It is stored as a K_pix (only for pixels with waveforms) by N_samp array, and there is a third array that maps one to the other: it is basically a mask of which pixels have waveforms. So we could also store it that way, which s more compact. Then you'd have

  • samples[K_pix, N_samp]
  • image[N_pix]
  • sample_mask[N_pix] (Maps K_pix to N_pix)

@kosack
Copy link
Contributor

kosack commented Jan 15, 2018

I should note in that previous comment, K_pix is the number of "important" pixels (so it's a variable quantity per event).

@kosack
Copy link
Contributor

kosack commented Jan 16, 2018

Another option would be dl0.tel[x].waveform for waveform data? That maybe is less jargon than "samples" (though of course the waveform is sampled at regular intervals, techincally the integrated pixel is also a sample, just over a larger time-base).

There doesn't seem to be a good name for a sequence of images in time (other than "movie", but that doesn't sound very scientific). In any case, since we will only have a subset of the image with waveforms.

@watsonjj
Copy link
Contributor Author

I have no preference over samples/traces/waveform. The only input I have is that with ASTRI, its a little bit of a misnomer to describe their data as a waveform or a trace. Whereas it is more accurate to describe it as a sample (because as you stated "the integrated pixel is also a sample, just over a larger time-base")

@kosack
Copy link
Contributor

kosack commented Jan 16, 2018

Ok, I've updated the DL0 interface doc to use "waveform" everywhere where there are sampled data, and "image" where it is integrated. I like that the best so far. In the case of ASTRI, there is only the image, and no waveform data (it's not really a single sample anyhow, since they use the pulse shape and time-over-threshold to predict the integral)

@kosack
Copy link
Contributor

kosack commented Jan 16, 2018

So far (not finalized) the DL0 fields are called:
image

For the pipeline, this is a bit too low-level, since we can unpack the pix_status bitfield (contains the gain channel used and the waveform mask), and create a single Time field, so it would be more simply:

 dl0.tel[x].trig_time # telescope trigger time (base time of waveform)
 dl0.tel[x].trig_type # what kind of trigger (cherenkov, FF flasher, muon, etc)
 dl0.tel[x].image  # integrated value for all pixels
 dl0.tel[x].waveform  # generally a sparse or masked array, since not all pix have values
 dl0.tel[x].channel_info # 0=pixel off, 1=low-gain used, 2=high-gain used, 3=saturated low-gain

@dneise
Copy link
Member

dneise commented Jan 16, 2018

Could you put a link to that document please?

@kosack
Copy link
Contributor

kosack commented Jan 16, 2018

For R1, it's nearly identical, except there is no image field, and waveform is not sparse (there are entries for all pixels).

@kosack
Copy link
Contributor

kosack commented Jan 16, 2018

Could you put a link to that document please?

No, that document is not yet complete or publicly available - I'm working on it now with a few other experts, and there will be a meeting with the project office to validate it in a month or so. The camera groups will need to review and accept it, in particular, before it is official, so it will take some time. I'm just putting this info here, since I think it will be closer to this format than what we have now, and it helps me if we can prototype it as I write the requirements and specs. If we discover something missing, it helps a lot. However, we can't add much more information to DL0.TEL.EVT info without saturating the off-site link.

@maxnoe
Copy link
Member

maxnoe commented Jan 16, 2018

@kosack I know it's personal taste, but i'm not a huge fan of unneeded abbreviations or truncations. Why not trigger_type, trigger_time, pixel_status.

Just out of curiosity: why quarter nano seconds?

@kosack
Copy link
Contributor

kosack commented Jan 30, 2018

@kosack I know it's personal taste, but i'm not a huge fan of unneeded abbreviations or truncations. Why not trigger_type, trigger_time, pixel_status.

Agreed. In fact, this is even explicitly stated in the coding guidelines document (which I wrote, and apparently ignored :-) I'll remove the abbreviations.

Just out of curiosity: why quarter nano seconds?

Mainly so it fits in a 32-bit integer, and also since we don't need better precision than that and 250ps is close to the precision of the underlying time distribution system, so smaller would be useless, and larger would throw away some info.

kosack added a commit to kosack/ctapipe that referenced this issue Feb 1, 2018
kosack added a commit that referenced this issue Feb 5, 2018
* rename containers and fields according to #608

- rename
- change run_id -> obs_id
- adc_sums -> image
- adc_samples and pe_samples -> waveform everywhere
@kosack kosack closed this as completed Feb 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants