BigTIFF Support/Directory Parsing #132

ilan-gold · 2020-03-12T17:47:02Z

Hello, I have a use case of large BigTIFF files (hundreds of image pages) and I am having trouble parsing them as things stand. I believe the issue centers around how 64 bit integers are handled for this type https://github.com/geotiffjs/geotiff.js/blob/master/src/dataslice.js#L97. Currently, the operation for combining the two 32 bit integers is

let left;
let right;
if (this._littleEndian) {
  left = this.readInt32(offset);
  right = this.readUint32(offset + 4);

  return (left << 32) | right;
}
left = this.readUint32(offset - this._sliceOffset);
right = this.readInt32(offset - this._sliceOffset + 4);
return (right << 32) | left;

but on the Mozilla page, one of two is recommended:

const left =  dataview.getUint32(byteOffset, littleEndian);
const right = dataview.getUint32(byteOffset+4, littleEndian);

// combine the two 32-bit values
const combined = littleEndian? left + 2**32*right : 2**32*left + right;
if (!Number.isSafeInteger(combined))
    console.warn(combined, 'exceeds MAX_SAFE_INTEGER. Precision may be lost');
left + 2**32*right

or

const left = BigInt(dataview.getUint32(byteOffset|0, !!littleEndian)>>>0);
const right = BigInt(dataview.getUint32((byteOffset|0) + 4|0, !!littleEndian)>>>0);

// combine the two 32-bit values and return
return littleEndian ? (right<<bigThirtyTwo)|left : (left<<bigThirtyTwo)|right;

The first approach is not guaranteed to be precise past 2 ** 53, though, hence the warning. I have done some testing and not had problems with this approach but I could see things going wrong (although 2 ** 53 is a lot, so not sure we'd actually ever run into that if all this method is for is byte offsets; 2 ** 53 bytes is nine million gigabytes, I think).

The behavior I was seeing originally was similar to #63 and #68 where I would get negative numbers. I see in #68 that you added a BigInt support but I am not sure that is necessary (I also tried it using the above approach and did not have success, and tried your approach but it nearly crashed my browser).

Additionally, the current state of parseFileDirectories seems problematic for me because it attempts to parse all the file directories, which is unnecessary in our use case. Could we add a flag to stop that? Our use case only needs the first image pane to get some metadata and can infer what is in the rest. If you think there is a good reason for parsing everything every time, I'd love to discuss - I'm new to the tiff ecosystem so could certainly be missing a crucial use-case.

Thanks! This package is quite the workhorse, great stuff!!!! :) :)

The text was updated successfully, but these errors were encountered:

ilan-gold · 2020-03-12T17:52:00Z

P.S I am happy to make PR's for either or both of these :)

constantinius · 2020-03-12T19:24:23Z

Ha! Aweseome!

To explain: when I wrote both 64bit support and the automatic parsing of all file directories I always knew that the approach was not ideal, but there simply was no way to test the feasibility. So I'm grateful that the issue has finally come up and that people are actually using it on really big TIFFs.

Regarding the 64bit integer parsing: I did not see the Mozilla page you mentioned, I tried to figure it out myself and obviously failed. I would love to see a PR, if you are willing to do so. Which route to take is up to you. I'm not sure what the performance of BigInt is, but I guess it is negligible as it should only be used in header data and not actual raster data.

Regarding the parsing of all file directories: this may be a little more tricky. The issue is the internal TIFF structure (i.e: essentially a linked list of IFDs) and my laziness when I started to work on the lib. I don't think that it is too hard to do, it is basically just a deferred parsing of the IFDs up to the point what page was requested, and continue later on from there. One drawback of that is that some operations that seem lightweight (like getting the number of pages) can take an unexpected amount of work.

If you want, you can prepare a PR for that aswell, but I would advise to keep it separate from the other one.

Thanks for raising these issues and for the interest in geotiff.js!

ilan-gold · 2020-03-12T19:32:31Z

Yes, no problem! That sounds like a good plan regarding the separate PR's.

I will probably not use BigInt as my testing did reveal some performance problems, at least given the current parsing set up (perhaps implementing better parsing would allow for us to use it). Mozilla mentions the performance though: "BigInts will always be much slower than 32-bit integers in JavaScript due to the nature of their variable size." I think we should be safe with just using numbers but I'll try to look more deeply into it.

As for the IFD's, I had not thought of that. I will look into this problem as well, but I have a less clear idea of the solution space.

Looking forward to resolving this!

constantinius · 2020-03-12T20:38:53Z

You can have a look at it. If you don't want the hassle just tell me here and I'll try to fix it myself.

Thanks for your contribution!

ilan-gold · 2020-03-15T00:01:28Z

I have put some thought into grabbing the nth image and I have not come up with anything particularly special. My idea then would be to allow users to pass in a byteOffset parameter to getImage so that if they want, they can extract that information in any data pipelines they are building or infer it using other knowledge. For example here is a log of offsets and differences for an image pyramid - when the difference halves, that is when we have changed from one set of images at one resolution to another. Furthermore these are uncompressed so this information is highly deterministic without knowing much about the nature of the data:

offset  30417379632 geotiff.bundle.min.js:1:72876
difference:  4194632 geotiff.bundle.min.js:1:73738
offset  30421574264 geotiff.bundle.min.js:1:72876
difference:  4194632 geotiff.bundle.min.js:1:73738
offset  30425768896 geotiff.bundle.min.js:1:72876
difference:  2097448 geotiff.bundle.min.js:1:73738
offset  30427866344 geotiff.bundle.min.js:1:72876
difference:  2097448

For example 2097448 = ((1024 ** 2) * 2) + 296 which is the tile size (1024) squared times 2 (each pixel is two bytes) for the single-tile image plus 296 bytes of metadata/pad (I assume - I don't know too much about this and the number appears to vary at different pyramid levels, but will look into it more), at the lowest resolution. 4194632 bytes represents the next lowest resolution, where there are only two tiles per image.

Does this sound like a direction that makes sense? Perhaps a class inheritance would be better, where you override a function? I could look into this as well but I don't know much about the codebase so would be concerned about overriding something that is important.

constantinius · 2020-03-16T10:11:50Z

Unfortunately, that assumption would not work for all cases. The IFDs would not necessarily be in ascending order. Currently, there is no other way than to traverse the linked list to reliably get to the image metadata/data.

ilan-gold · 2020-03-16T12:44:03Z

Could you clarify a bit as to why this means a byteOffset parameter or the like for getImage would not work? If it were correct, would it not allow me to parse the IFD for that image, thus extracting the fileDirectoryAndGeoKey for that image? Perhaps I should just try it out and see what happens. Thank you for discussing, I appreciate your taking the time to help me hash this out.

ilan-gold · 2020-03-16T18:32:42Z

I poked around a bit allowing for a byteOffset parameter and seemed to work well. I can find sample data/write unit tests and then make a PR so you can review and approve/deny. If you don't want this extra parameter, would it be possible to separate out a parseFileDirectory function that takes in a offset as a parameter from the current parseFileDirectories so that one might be able to create images using that? In this manner, there would be no "breaking API change" from adding a new parameter, just adding a new function, parseFileDirectory , to the GeoTIFF class. Thanks!

constantinius · 2020-03-17T09:27:46Z

I'll add a PR today that will address the directory parsing. I'll do it in a fashion, that one can parse IFDs from arbitrary offsets in the file (at own risk ;-) ).

Does this suit your requirements?

ilan-gold · 2020-03-17T15:27:34Z

Yes, that would be perfect!

ilan-gold · 2020-03-25T18:38:49Z

Hi @constantinius, thanks for getting the PR started. How is it going? I would be willing to help out on it if you are busy with other things. Let me know!

constantinius · 2020-04-08T21:14:35Z

@ilan-gold Is this issue now fixed?

ilan-gold · 2020-04-08T22:24:37Z

#136 looks like the solution, yes!

ilan-gold mentioned this issue Mar 13, 2020

64bit int uint #134

Merged

constantinius closed this as completed in #134 Mar 18, 2020

constantinius reopened this Mar 18, 2020

constantinius mentioned this issue Mar 19, 2020

Improvements to IFD parsing: Not all IFDs are parsed alltogether unless necessary #136

Merged

ilan-gold mentioned this issue Mar 26, 2020

Add TIFF Metadata for Offsets hubmapconsortium/portal-containers#19

Merged

ilan-gold mentioned this issue Apr 15, 2020

Fixed DataView 64 bit parsing #142

Merged

constantinius closed this as completed Apr 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BigTIFF Support/Directory Parsing #132

BigTIFF Support/Directory Parsing #132

ilan-gold commented Mar 12, 2020

ilan-gold commented Mar 12, 2020

constantinius commented Mar 12, 2020

ilan-gold commented Mar 12, 2020

constantinius commented Mar 12, 2020

ilan-gold commented Mar 15, 2020 •

edited

constantinius commented Mar 16, 2020

ilan-gold commented Mar 16, 2020

ilan-gold commented Mar 16, 2020 •

edited

constantinius commented Mar 17, 2020

ilan-gold commented Mar 17, 2020

ilan-gold commented Mar 25, 2020

constantinius commented Apr 8, 2020

ilan-gold commented Apr 8, 2020

BigTIFF Support/Directory Parsing #132

BigTIFF Support/Directory Parsing #132

Comments

ilan-gold commented Mar 12, 2020

ilan-gold commented Mar 12, 2020

constantinius commented Mar 12, 2020

ilan-gold commented Mar 12, 2020

constantinius commented Mar 12, 2020

ilan-gold commented Mar 15, 2020 • edited

constantinius commented Mar 16, 2020

ilan-gold commented Mar 16, 2020

ilan-gold commented Mar 16, 2020 • edited

constantinius commented Mar 17, 2020

ilan-gold commented Mar 17, 2020

ilan-gold commented Mar 25, 2020

constantinius commented Apr 8, 2020

ilan-gold commented Apr 8, 2020

ilan-gold commented Mar 15, 2020 •

edited

ilan-gold commented Mar 16, 2020 •

edited