
Usefulness of data_range #55

Closed
FynnBe opened this issue Nov 30, 2020 · 13 comments

@FynnBe
Member

FynnBe commented Nov 30, 2020

I found myself wondering what the actual usefulness of the data_range config field is, after commenting in https://github.com/bioimage-io/configuration/pull/54/files#r532405610

  • dtype_range: My current understanding is that it is intended to capture the theoretical data limits, e.g. (-inf, inf) for raw data in float32. But simply writing the capacity of the given data_type does not seem all that useful.
  • displaying range: Is it intended for display purposes? If, let's say, I have normalized float32 data as input, I would thus specify [0, 1]?
  • mini-batch: the (minimum, maximum) of a mini-batch are effectively random due to the undetermined length of the mini-batch.
  • global: the global range might not actually be required, so computing it would be unnecessary; at the same time it would not be sufficient for e.g. percentile normalization, so we need a different 'statistics input' anyway, which we have solved through arguments to preprocessing transformations.

What am I missing? Or should we get rid of it in 0.3.1?
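To illustrate the percentile point above: a global (min, max) data_range alone cannot drive percentile normalization, which is why the statistics live as arguments to the preprocessing step instead. A minimal sketch (function name and defaults are my own, not part of the spec):

```python
import numpy as np

def percentile_normalize(x, lower=1.0, upper=99.0):
    # The percentile bounds come from preprocessing arguments,
    # not from a global data_range field.
    lo, hi = np.percentile(x, [lower, upper])
    return ((x - lo) / max(hi - lo, 1e-12)).astype(np.float32)
```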

@FynnBe FynnBe added this to To Do in RDF Releases via automation Nov 30, 2020
@FynnBe FynnBe moved this from To Do to In Progress 0.3.1 in RDF Releases Nov 30, 2020
@esgomezm
Contributor

It's true that it might not be that useful. This is how I understood/see it:

It's the range of values that the input image should have when entering the model (after preprocessing). Even if it is of type float32 and the network computations can be done with values in the range [-inf, inf], the correct/expected value range might be [0, 1] or [-1, 1].

mini-batch: the (minimum, maximum) of a mini-batch are effectively random due to the undetermined length of the mini-batch.

I wouldn't refer to the batch information by calling it data_range. It sounds quite confusing to me.

global: the global range might not actually be required, so computing it would be unnecessary; at the same time it would not be sufficient for e.g. percentile normalization, so we need a different 'statistics input' anyway, which we have solved through arguments to preprocessing transformations.

It might be better to specify it when required (in the preprocessing or any other kind of transformation), IMO. The main reason is that when I read the input specification, I expect to have technical information about a "single" input patch/batch/image of the model that makes the inference possible. However, when talking about global parameters of the data I get confused because I'm not sure whether it refers to a whole set of images (from one experiment), or to a single image, or to the batch. If this parameter goes together with the preprocessing that uses it, I think it will be easier to understand its meaning.

@constantinpape
Collaborator

It's the range of values that the input image should have when entering the model (after preprocessing). Even if it is of type float32 and the network computations can be done with values in the range [-inf, inf], the correct/expected value range might be [0, 1] or [-1, 1].

I think that's pretty much what we intended for data_range. Should we just put this in the description? :)

I also agree that having any of the more global options here is rather confusing.

@FynnBe
Member Author

FynnBe commented Nov 30, 2020

It's the range of values that the input image should have when entering the model (after preprocessing)

Interesting. I see inputs/outputs as the API description. IMHO the inputs should describe the data before preprocessing, as we know how the preprocessing steps change it, and any consumer software would have to provide the data (as it is before preprocessing) to the runner.
Analogously for outputs: my impression is that outputs describe the data after postprocessing.

Otherwise inputs/outputs would not truly describe the inputs/outputs of the whole bioimage.io model.

I wouldn't refer to the batch information by calling it data_range. It sounds quite confusing to me.

Sorry for being imprecise; that is not what I meant. With the data_range of a mini-batch, I meant the minimum/maximum of a given mini-batch, irrespective of its length. However, with b > 1, these values become meaningless for an independent sample in the mini-batch.

I also agree that having any of the more global options here is rather confusing.

👍

@esgomezm
Contributor

Interesting. I see inputs/outputs as the API description. IMHO the inputs should describe the data before preprocessing, as we know how the preprocessing steps change it, and any consumer software would have to provide the data (as it is before preprocessing) to the runner.
Analogously for outputs: my impression is that outputs describe the data after postprocessing.

Otherwise inputs/outputs would not truly describe the inputs/outputs of the whole bioimage.io model.

Good point, but then what about 'halo' and 'offset'? Those values refer to the raw output of the model rather than to the post-processed output, no?

@FynnBe
Member Author

FynnBe commented Nov 30, 2020

Good point, but then what about 'halo' and 'offset'? Those values refer to the raw output of the model rather than to the post-processed output, no?

I'd say they refer to the postprocessed output. As a consumer software I don't care if the halo was cropped due to a valid convolution in the actual neural network or cropped away in a postprocessing step. Either way I get an output that has a certain shape relative to the reference input (as you wrote in your example: output_shape = input_shape * scale - 2 * halo). offset merely tells the consumer software about any shift between input and output; for scale = 1 this would typically be offset = halo, to have the output in the center of the input.
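The shape relation quoted above can be written out per axis; a small sketch (the function name is mine, the formula is from the example in the thread):

```python
def implied_output_shape(input_shape, scale, halo):
    """Shape of the output relative to the reference input, per axis:
    output = input * scale - 2 * halo (as in the example above)."""
    return [int(i * s) - 2 * h for i, s, h in zip(input_shape, scale, halo)]
```

For a 256x256 input with scale 1 and a halo of 16, this yields a 224x224 output centered in the input.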

@esgomezm
Contributor

Regarding the data type and, in general, the information about the output, I think there's some compromise:

  • The output is supposed to be a tensor, then, and not necessarily the output after the post-processing. Doing so allows the consumer to have control over how to implement the post-processing and, especially, how to manage the processing of the input image, as you control all the technicalities that relate the input and the output.

  • In the case that we refer to the output of the entire bioimage.io model, then I wouldn't call it a tensor, and it could really be whatever output of the postprocessing (i.e. labels, tracks, a CSV file with the morphology of the segmented cells, or whatever), especially if each consumer may have some custom transformations. While the postprocessing steps are limited for now, this would open the door to specifying the output of a workflow rather than that of the model.

For the halo and offset, I think issue #22 is not really closed, then. Shall we keep discussing it there?

@constantinpape
Collaborator

This came up in the last bioimage.io call and we didn't resolve it yet. To summarise, we have two different interpretations of data_type and data_range (or the other fields in the tensor description):

  1. They describe the input to preprocessing. The consumer software has to ensure that the input is converted to the data_range and data_type (e.g. converting the input to uint8 for data_type: uint8 and data_range: [0, 255]).
  2. They describe the input to the model. In that case, the preprocessing function needs to ensure that the data is converted to the correct type and range.
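Under interpretation 1, the consumer software would do a conversion like the following before handing the tensor to preprocessing. A hedged sketch; the function name and the min/max rescaling strategy are assumptions, not part of the spec:

```python
import numpy as np

def convert_to_spec(arr, data_type="uint8", data_range=(0, 255)):
    """Rescale an arbitrary input into the declared data_range and cast
    to the declared data_type (interpretation 1: the consumer converts)."""
    lo, hi = data_range
    a = np.asarray(arr, dtype=np.float64)
    span = a.max() - a.min()
    a = (a - a.min()) / span if span > 0 else np.zeros_like(a)
    return (a * (hi - lo) + lo).astype(data_type)
```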

@FynnBe
Member Author

FynnBe commented Dec 11, 2020

My problem with option 2:
The consumer software needs to "backtrack" what to input to the preprocessing, which is problematic for many transformations.

@FynnBe
Member Author

FynnBe commented Dec 11, 2020

  • In the case that we refer to the output of the entire bioimage.io model, then I wouldn't call it tensor and it could be really whatever output of the postprocessing (i.e. labels, tracks, csv file with the morfology of the segmented cells or whatever), especially if each consumer may have some custom transformations. While now the postprocessing steps are limited, this would open the door to specify more the output of workflow rather than the one of the model.

A note on file inputs/outputs: I would prefer if we have in-memory inputs/outputs only. Writing tabular data to a CSV, an image to a specific file format, etc. should be left to the consumer software. The examples you mention here can all be represented by a tensor.

@oeway
Contributor

oeway commented Dec 11, 2020

I am more thinking about option 3: they describe the input to the preprocessing, and the software needs to take care of the input before preprocessing.

There is also ambiguity for data_type: is it for the preprocessing or for the model? I guess the vast majority of models use float32. And it would also be reasonable to assume that the inputs/outputs of all the preprocessing steps are float32.

If that's the case, then I would treat data_type and data_range as describing the input to preprocessing.

Why do we even need data_range when we have data_type? Because, for example, if data_type is set to float32, the default data range would be the full float32 range (roughly -3.4 × 10^38 to 3.4 × 10^38), which is obviously not something we should use. In most cases we will need to define the data_range explicitly, e.g. [0, 1].
When feeding images from the consumer software, in whatever format, it needs to convert them into the corresponding data_range: if it's 8-bit, divide by 255; if it's 16-bit, divide by 65535.
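The divide-by-255 / divide-by-65535 rule above generalizes to dividing by the maximum of the integer dtype. A small sketch of that conversion (function name is mine):

```python
import numpy as np

def to_unit_range(img):
    """Map integer images into a [0, 1] float32 data_range by dividing by
    the dtype's maximum (255 for 8-bit, 65535 for 16-bit), as suggested above."""
    if np.issubdtype(img.dtype, np.integer):
        return img.astype(np.float32) / np.iinfo(img.dtype).max
    return img.astype(np.float32)
```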

@FynnBe
Member Author

FynnBe commented Dec 11, 2020

[option 1:] They describe the input to preprocessing. The consumer software has to ensure that the input is converted to the data_range and data_type (e.g. converting the input to uint8 for data_type: uint8 and data_range: [0, 255]).

I am more thinking about option 3: they describe the input to the preprocessing, and the software needs to take care of the input before preprocessing.

What is the difference between options 1 and 3?

@oeway
Contributor

oeway commented Dec 11, 2020

Sorry, I misread; options 1 and 3 are the same.

@constantinpape
Collaborator

Fixed in #59.

RDF Releases automation moved this from In Progress 0.3.1 to Done 0.3.0 Dec 19, 2020