Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigTIFF Support/Directory Parsing #132

Closed
ilan-gold opened this issue Mar 12, 2020 · 13 comments · Fixed by #134
Closed

BigTIFF Support/Directory Parsing #132

ilan-gold opened this issue Mar 12, 2020 · 13 comments · Fixed by #134

Comments

@ilan-gold
Copy link
Contributor

Hello, I have a use case of large BigTIFF files (hundreds of image pages) and I am having trouble parsing them as things stand. I believe the issue centers around how 64 bit integers are handled for this type https://github.com/geotiffjs/geotiff.js/blob/master/src/dataslice.js#L97. Currently, the operation for combining the two 32 bit integers is

let left;
let right;
if (this._littleEndian) {
  left = this.readInt32(offset);
  right = this.readUint32(offset + 4);

  return (left << 32) | right;
}
left = this.readUint32(offset - this._sliceOffset);
right = this.readInt32(offset - this._sliceOffset + 4);
return (right << 32) | left;

but on the Mozilla page, one of two is recommended:

const left =  dataview.getUint32(byteOffset, littleEndian);
const right = dataview.getUint32(byteOffset+4, littleEndian);

// combine the two 32-bit values
const combined = littleEndian? left + 2**32*right : 2**32*left + right;
if (!Number.isSafeInteger(combined))
    console.warn(combined, 'exceeds MAX_SAFE_INTEGER. Precision may be lost');
left + 2**32*right

or

const left = BigInt(dataview.getUint32(byteOffset|0, !!littleEndian)>>>0);
const right = BigInt(dataview.getUint32((byteOffset|0) + 4|0, !!littleEndian)>>>0);

// combine the two 32-bit values and return
return littleEndian ? (right<<bigThirtyTwo)|left : (left<<bigThirtyTwo)|right;

The first approach is not guaranteed to be precise past 2 ** 53, though, hence the warning. I have done some testing and not had problems with this approach but I could see things going wrong (although 2 ** 53 is a lot, so not sure we'd actually ever run into that if all this method is for is byte offsets; 2 ** 53 bytes is nine million gigabytes, I think).

The behavior I was seeing originally was similar to #63 and #68 where I would get negative numbers. I see in #68 that you added a BigInt support but I am not sure that is necessary (I also tried it using the above approach and did not have success, and tried your approach but it nearly crashed my browser).

Additionally, the current state of parseFileDirectories seems problematic for me because it attempts to parse all the file directories, which is unnecessary in our use case. Could we add a flag to stop that? Our use case only needs the first image pane to get some metadata and can infer what is in the rest. If you think there is a good reason for parsing everything every time, I'd love to discuss - I'm new to the tiff ecosystem so could certainly be missing a crucial use-case.

Thanks! This package is quite the workhorse, great stuff!!!! :) :)

@ilan-gold
Copy link
Contributor Author

P.S I am happy to make PR's for either or both of these :)

@constantinius
Copy link
Member

Ha! Aweseome!

To explain: when I wrote both 64bit support and the automatic parsing of all file directories I always knew that the approach was not ideal, but there simply was no way to test the feasibility. So I'm grateful that the issue has finally come up and that people are actually using it on really big TIFFs.

Regarding the 64bit integer parsing: I did not see the Mozilla page you mentioned, I tried to figure it out myself and obviously failed. I would love to see a PR, if you are willing to do so. Which route to take is up to you. I'm not sure what the performance of BigInt is, but I guess it is negligible as it should only be used in header data and not actual raster data.

Regarding the parsing of all file directories: this may be a little more tricky. The issue is the internal TIFF structure (i.e: essentially a linked list of IFDs) and my laziness when I started to work on the lib. I don't think that it is too hard to do, it is basically just a deferred parsing of the IFDs up to the point what page was requested, and continue later on from there. One drawback of that is that some operations that seem lightweight (like getting the number of pages) can take an unexpected amount of work.

If you want, you can prepare a PR for that aswell, but I would advise to keep it separate from the other one.

Thanks for raising these issues and for the interest in geotiff.js!

@ilan-gold
Copy link
Contributor Author

Yes, no problem! That sounds like a good plan regarding the separate PR's.

I will probably not use BigInt as my testing did reveal some performance problems, at least given the current parsing set up (perhaps implementing better parsing would allow for us to use it). Mozilla mentions the performance though: "BigInts will always be much slower than 32-bit integers in JavaScript due to the nature of their variable size." I think we should be safe with just using numbers but I'll try to look more deeply into it.

As for the IFD's, I had not thought of that. I will look into this problem as well, but I have a less clear idea of the solution space.

Looking forward to resolving this!

@constantinius
Copy link
Member

You can have a look at it. If you don't want the hassle just tell me here and I'll try to fix it myself.

Thanks for your contribution!

@ilan-gold
Copy link
Contributor Author

ilan-gold commented Mar 15, 2020

I have put some thought into grabbing the nth image and I have not come up with anything particularly special. My idea then would be to allow users to pass in a byteOffset parameter to getImage so that if they want, they can extract that information in any data pipelines they are building or infer it using other knowledge. For example here is a log of offsets and differences for an image pyramid - when the difference halves, that is when we have changed from one set of images at one resolution to another. Furthermore these are uncompressed so this information is highly deterministic without knowing much about the nature of the data:

offset  30417379632 geotiff.bundle.min.js:1:72876
difference:  4194632 geotiff.bundle.min.js:1:73738
offset  30421574264 geotiff.bundle.min.js:1:72876
difference:  4194632 geotiff.bundle.min.js:1:73738
offset  30425768896 geotiff.bundle.min.js:1:72876
difference:  2097448 geotiff.bundle.min.js:1:73738
offset  30427866344 geotiff.bundle.min.js:1:72876
difference:  2097448

For example 2097448 = ((1024 ** 2) * 2) + 296 which is the tile size (1024) squared times 2 (each pixel is two bytes) for the single-tile image plus 296 bytes of metadata/pad (I assume - I don't know too much about this and the number appears to vary at different pyramid levels, but will look into it more), at the lowest resolution. 4194632 bytes represents the next lowest resolution, where there are only two tiles per image.

Does this sound like a direction that makes sense? Perhaps a class inheritance would be better, where you override a function? I could look into this as well but I don't know much about the codebase so would be concerned about overriding something that is important.

@constantinius
Copy link
Member

Unfortunately, that assumption would not work for all cases. The IFDs would not necessarily be in ascending order. Currently, there is no other way than to traverse the linked list to reliably get to the image metadata/data.

@ilan-gold
Copy link
Contributor Author

Could you clarify a bit as to why this means a byteOffset parameter or the like for getImage would not work? If it were correct, would it not allow me to parse the IFD for that image, thus extracting the fileDirectoryAndGeoKey for that image? Perhaps I should just try it out and see what happens. Thank you for discussing, I appreciate your taking the time to help me hash this out.

@ilan-gold
Copy link
Contributor Author

ilan-gold commented Mar 16, 2020

I poked around a bit allowing for a byteOffset parameter and seemed to work well. I can find sample data/write unit tests and then make a PR so you can review and approve/deny. If you don't want this extra parameter, would it be possible to separate out a parseFileDirectory function that takes in a offset as a parameter from the current parseFileDirectories so that one might be able to create images using that? In this manner, there would be no "breaking API change" from adding a new parameter, just adding a new function, parseFileDirectory , to the GeoTIFF class. Thanks!

@constantinius
Copy link
Member

I'll add a PR today that will address the directory parsing. I'll do it in a fashion, that one can parse IFDs from arbitrary offsets in the file (at own risk ;-) ).

Does this suit your requirements?

@ilan-gold
Copy link
Contributor Author

Yes, that would be perfect!

@ilan-gold
Copy link
Contributor Author

Hi @constantinius, thanks for getting the PR started. How is it going? I would be willing to help out on it if you are busy with other things. Let me know!

@constantinius
Copy link
Member

@ilan-gold Is this issue now fixed?

@ilan-gold
Copy link
Contributor Author

#136 looks like the solution, yes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants