
Usefulness of data_range #55

Closed
FynnBe opened this issue Nov 30, 2020 · 13 comments

@FynnBe
Member

FynnBe commented Nov 30, 2020

I found myself wondering what the actual usefulness of the data_range config field is, after commenting in https://github.com/bioimage-io/configuration/pull/54/files#r532405610

  • dtype_range: My current understanding is that it is intended to capture the theoretical data limits, e.g. (-inf, inf) for raw data in float32. But simply writing the capacity of the given data_type does not seem all that useful.
  • displaying range: Is it intended for display purposes? If, let's say, I have normalized float32 data as input, I would thus specify [0, 1]?
  • mini-batch: the (minimum, maximum) of a mini-batch are effectively random due to the undetermined length of the mini-batch.
  • global: the global range might not actually be required, so computing it would be unnecessary; at the same time it would not be sufficient for e.g. percentile normalization, so we need a different 'statistics input' anyway, which we have solved through arguments to preprocessing transformations.

What am I missing? Or should we get rid of it in 0.3.1?
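To illustrate the percentile point above: a global (min, max) data_range alone cannot drive percentile normalization, which is why the statistics live as arguments to the preprocessing step instead. A minimal sketch (function name and defaults are my own, not part of the spec):

```python
import numpy as np

def percentile_normalize(x, lower=1.0, upper=99.0):
    # The percentile bounds come from preprocessing arguments,
    # not from a global data_range field.
    lo, hi = np.percentile(x, [lower, upper])
    return ((x - lo) / max(hi - lo, 1e-12)).astype(np.float32)
```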

@FynnBe FynnBe added this to To Do in RDF Releases via automation Nov 30, 2020
@FynnBe FynnBe moved this from To Do to In Progress 0.3.1 in RDF Releases Nov 30, 2020
@esgomezm
Contributor

It's true that it might not be that useful. This is how I understood/see it:

It's the range of values that the input image should have when entering the model (after preprocessing). Even if it is of type float32 and the network computations can be done with values in the range [-inf, inf], the correct/expected value range might be [0, 1] or [-1, 1].

mini-batch: the (minimum, maximum) of a mini-batch are effectively random due to the undetermined length of the mini-batch.

I wouldn't refer to the batch information by calling it data_range. It sounds quite confusing to me.

global: the global range might not actually be required, so computing it would be unnecessary; at the same time it would not be sufficient for e.g. percentile normalization, so we need a different 'statistics input' anyway, which we have solved through arguments to preprocessing transformations.

It might be better to specify it when required (in the preprocessing or any other kind of transformation), IMO. The main reason is that when I read the input specification, I expect to have technical information about a "single" input patch/batch/image of the model that makes the inference possible. However, when talking about global parameters of the data I get confused because I'm not sure whether it refers to a whole set of images (from one experiment), or to a single image, or to the batch. If this parameter goes together with the preprocessing that uses it, I think it will be easier to understand its meaning.

@constantinpape
Collaborator

It's the range of values that the input image should have when entering the model (after preprocessing). Even if it is of type float32 and the network computations can be done with values in the range [-inf, inf], the correct/expected value range might be [0, 1] or [-1, 1].

I think that's pretty much what we intended for data_range. Should we just put this in the description? :)

I also agree that having any of the more global options here is rather confusing.

@FynnBe
Member Author

FynnBe commented Nov 30, 2020

It's the range of values that the input image should have when entering the model (after preprocessing)

Interesting. I see inputs/outputs as the API description. IMHO the inputs should describe the data before preprocessing, as we know how the preprocessing steps change it, and any consumer software would have to provide the data (as it is before preprocessing) to the runner.
Analogously for outputs: my impression is that outputs describe the data after postprocessing.

Otherwise inputs/outputs would not truly describe the inputs/outputs of the whole bioimage.io model.

I wouldn't refer to the batch information by calling it data_range. It sounds quite confusing to me.

Sorry for being imprecise; that is not what I meant. With the data_range of a mini-batch, I meant the minimum/maximum of a given mini-batch, irrespective of its length. However, with b > 1, these values become meaningless for an independent sample in the mini-batch.

I also agree that having any of the more global options here is rather confusing.

👍

@esgomezm
Contributor

Interesting. I see inputs/outputs as the API description. IMHO the inputs should describe the data before preprocessing, as we know how the preprocessing steps change it, and any consumer software would have to provide the data (as it is before preprocessing) to the runner.
Analogously for outputs: my impression is that outputs describe the data after postprocessing.

Otherwise inputs/outputs would not truly describe the inputs/outputs of the whole bioimage.io model.

Good point, but then what about 'halo' and 'offset'? Those values refer to the raw output of the model rather than to the post-processed output, no?

@FynnBe
Member Author

FynnBe commented Nov 30, 2020

Good point, but then what about 'halo' and 'offset'? Those values refer to the raw output of the model rather than to the post-processed output, no?

I'd say they refer to the postprocessed output. As a consumer software I don't care if the halo was cropped due to a valid convolution in the actual neural network or cropped away in a postprocessing step. Either way I get an output that has a certain shape relative to the reference input (as you wrote in your example: output_shape = input_shape * scale - 2 * halo). offset merely tells the consumer software about any shift between input and output; for scale = 1 this would typically be offset = halo, to have the output in the center of the input.
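The shape relation quoted above can be written out per axis; a small sketch (the function name is mine, the formula is from the example in the thread):

```python
def implied_output_shape(input_shape, scale, halo):
    """Shape of the output relative to the reference input, per axis:
    output = input * scale - 2 * halo (as in the example above)."""
    return [int(i * s) - 2 * h for i, s, h in zip(input_shape, scale, halo)]
```

For a 256x256 input with scale 1 and a halo of 16, this yields a 224x224 output centered in the input.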

@esgomezm
Contributor

Regarding the data type and, in general, the information about the output, I think there's some compromise:

  • The output is supposed to be a tensor, then, and not necessarily the output after the post-processing. Doing so allows the consumer to have control over how to implement the post-processing and, especially, how to manage the processing of the input image, as you control all the technicalities that relate the input and the output.

  • In the case that we refer to the output of the entire bioimage.io model, then I wouldn't call it a tensor, and it could really be whatever output of the postprocessing (i.e. labels, tracks, a CSV file with the morphology of the segmented cells, or whatever), especially if each consumer may have some custom transformations. While the postprocessing steps are limited for now, this would open the door to specifying the output of a workflow rather than that of the model.

For the halo and offset, I think issue #22 is not really closed, then. Shall we keep discussing it there?

@constantinpape
Collaborator

This came up in the last bioimage.io call and we didn't resolve it yet. To summarise, we have two different interpretations of data_type and data_range (or the other fields in the tensor description):

  1. They describe the input to preprocessing. The consumer software has to ensure that the input is converted to the data_range and data_type (e.g. converting the input to uint8 for data_type: uint8 and data_range: [0, 255]).
  2. They describe the input to the model. In that case, the preprocessing function needs to ensure that the data is converted to the correct type and range.
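Under interpretation 1, the consumer software would do a conversion like the following before handing the tensor to preprocessing. A hedged sketch; the function name and the min/max rescaling strategy are assumptions, not part of the spec:

```python
import numpy as np

def convert_to_spec(arr, data_type="uint8", data_range=(0, 255)):
    """Rescale an arbitrary input into the declared data_range and cast
    to the declared data_type (interpretation 1: the consumer converts)."""
    lo, hi = data_range
    a = np.asarray(arr, dtype=np.float64)
    span = a.max() - a.min()
    a = (a - a.min()) / span if span > 0 else np.zeros_like(a)
    return (a * (hi - lo) + lo).astype(data_type)
```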

@FynnBe
Member Author

FynnBe commented Dec 11, 2020

My problem with option 2:
The consumer software needs to "backtrack" what to input to the preprocessing, which is problematic for many transformations.

@FynnBe
Member Author

FynnBe commented Dec 11, 2020

  • In the case that we refer to the output of the entire bioimage.io model, then I wouldn't call it tensor and it could be really whatever output of the postprocessing (i.e. labels, tracks, csv file with the morfology of the segmented cells or whatever), especially if each consumer may have some custom transformations. While now the postprocessing steps are limited, this would open the door to specify more the output of workflow rather than the one of the model.

A note on file inputs/outputs: I would prefer if we have in-memory inputs/outputs only. Writing tabular data to a CSV, an image to a specific file format, etc. should be left to the consumer software. The examples you mention here can all be represented by a tensor.

@oeway
Contributor

oeway commented Dec 11, 2020

I am more thinking about option 3: they describe the input to the preprocessing, and the software needs to take care of the input before preprocessing.

There is also ambiguity for data_type: is it for the preprocessing or for the model? I guess the vast majority of models use float32. And it would also be reasonable to assume that the inputs/outputs of all the preprocessing steps are float32.

If that's the case, then I would treat data_type and data_range as describing the input to preprocessing.

Why do we even need data_range when we have data_type? Because, for example, if data_type is set to float32, the default data range would be the full float32 range (roughly -3.4 × 10^38 to 3.4 × 10^38), which is obviously not something we should use. In most cases we will need to define the data_range explicitly, e.g. [0, 1].
When feeding images from the consumer software, in whatever format, it needs to convert them into the corresponding data_range: if it's 8-bit, divide by 255; if it's 16-bit, divide by 65535.
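The divide-by-255 / divide-by-65535 rule above generalizes to dividing by the maximum of the integer dtype. A small sketch of that conversion (function name is mine):

```python
import numpy as np

def to_unit_range(img):
    """Map integer images into a [0, 1] float32 data_range by dividing by
    the dtype's maximum (255 for 8-bit, 65535 for 16-bit), as suggested above."""
    if np.issubdtype(img.dtype, np.integer):
        return img.astype(np.float32) / np.iinfo(img.dtype).max
    return img.astype(np.float32)
```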

@FynnBe
Member Author

FynnBe commented Dec 11, 2020

[option 1:] They describe the input to preprocessing. The consumer software has to ensure that the input is converted to the data_range and data_type (e.g. converting the input to uint8 for data_type: uint8 and data_range: [0, 255]).

I am more thinking about option 3: they describe the input to the preprocessing, and the software needs to take care of the input before preprocessing.

What is the difference between options 1 and 3?

@oeway
Contributor

oeway commented Dec 11, 2020

Sorry, I misread; options 1 and 3 are the same.

@constantinpape
Collaborator

Fixed in #59.

RDF Releases automation moved this from In Progress 0.3.1 to Done 0.3.0 Dec 19, 2020