
The BloscCompressor used InputStream.available() to determine if byte… #13

Merged: 2 commits into blosc_AWS_S3_improvement_Chris_Slater on Jan 25, 2021

Conversation

electricsam
Contributor

The BloscCompressor used InputStream.available() to determine whether bytes were available. InputStream.available() is not guaranteed to be accurate: some implementations fall back to the default implementation provided by InputStream itself, which always returns 0, and an EOFException was thrown in that case. One example of this is software.amazon.awssdk.services.s3.checksums.ChecksumValidatingInputStream. The BloscCompressor was updated to simply attempt to read from the stream and check whether bytes were returned.
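The fix boils down to letting read()'s return value, rather than available(), decide whether data is left. A minimal sketch of the general idea (simplified, not the exact jzarr change):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamUtil {

    // Reads every remaining byte from the stream. read() returning -1 is the
    // only reliable end-of-stream signal; available() == 0 merely means
    // "no bytes can currently be read without blocking".
    static byte[] readAllBytes(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = is.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }
}
```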

…s were available. InputStream.available() is not guaranteed to be accurate. Some implementations rely on the implementation in InputStream, which always returns 0. An EOFException was thrown in this case. The BlockCompressor was updated to just attempt to read from the stream and see if bytes were returned.
…ng. Some InputStream implementations may not always return the requested number of bytes from read(). This is the case when reading from an S3 bucket, where the stream is chunked.
@SabineEmbacher
Collaborator

Dear Chris Slater,

That is a good hint and valuable information. The only thing I do not like is the place where
this problem is solved. The same input stream will also be fed into other compressors
(there will be more compressors in the future) and can cause the same or similar problems there.
That's why I would prefer a general solution for this AWS-specific problem.

The streams are created in the respective implementation of the
com.bc.zarr.storage.Store interface.
See com.bc.zarr.storage.Store.getInputStream(String key) [line 43].

For example, a simple ByteArrayInputStream in com.bc.zarr.storage.InMemoryStore [line 44], or a java.nio.file.FileSystem-specific stream in com.bc.zarr.storage.FileSystemStore [line 62].

I like the following 3 ideas to solve the problem:

  1. Implement a custom AwsStore that encapsulates this problem by first
    reading all data for a single compressed zarr chunk from the AwsStream
    and then having the AwsStore return a new ByteArrayInputStream() that
    can then be passed to any compressor.
  2. Extend the FileSystemStore so that an InputStreamAdapterFactory
    (e.g. AwsInputStreamAdapterFactory) [see Adapter Pattern] can be given
    to the constructor. This InputStreamAdapter could then encapsulate the
    respective problems caused by the stream implementation of the respective
    data source (AWS, ...).
  3. Give the FileSystemStore constructor a StreamCreationStrategy [see Strategy Pattern].
    An AwsStreamCreationStrategy would then replace a StandardStreamCreationStrategy.
    The AWS strategy could then read all data for a single compressed zarr chunk from
    the AWS stream in a first step and return a ByteArrayInputStream at the end to
    solve the problem (see the sketch below).

Since I personally prefer object composition to inheritance, I would prefer idea 3.
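A rough sketch of what idea 3 could look like (the class names here are only illustrative, not an existing jzarr API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical strategy interface the FileSystemStore constructor could accept.
interface StreamCreationStrategy {
    InputStream createInputStream(Path path) throws IOException;
}

// Default behaviour: hand the raw NIO stream straight to the compressor.
class StandardStreamCreationStrategy implements StreamCreationStrategy {
    @Override
    public InputStream createInputStream(Path path) throws IOException {
        return Files.newInputStream(path);
    }
}

// AWS behaviour: buffer the whole compressed zarr chunk first, so the compressor
// only ever sees an in-memory stream with well-behaved read() and available().
class AwsStreamCreationStrategy implements StreamCreationStrategy {
    @Override
    public InputStream createInputStream(Path path) throws IOException {
        return new ByteArrayInputStream(Files.readAllBytes(path));
    }
}
```

The FileSystemStore would then call the configured strategy in getInputStream(String key) instead of creating the stream itself.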

For info:
In general, the framework is built in such a way that the zarr chunks
(not the Blosc chunks) are read in a thread-safe way.

What do you think?

If you want, you can do that yourself, and I'm available to support you; or I can take your pull
request into another branch (not master) and refactor it into a general solution later.

What would you prefer?

With kind regards
Sabine

@electricsam
Contributor Author

Hi Sabine,

I ran into another issue when using an S3-based store, where calls to read() in the compressor did not return the requested number of bytes. This was caused by the Apache HttpClient InputStream that the AWS InputStreams wrap, which returns chunked data. It only happened when the size of the object being retrieved was above some threshold. The documentation for InputStream states that this is valid behavior.

When this happened, the output of the compressor was filled with zeros, because cbufferSizes() returned zero values on repeated passes through the loop in the uncompress method. The header that was read multiple times in that loop ended up being filled with payload data.
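For reference, a loop along these lines guarantees that a fixed-size buffer such as the Blosc header is filled completely before it is parsed, no matter how the underlying stream chunks its data (a simplified sketch, not the exact jzarr code):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class IOUtil {

    // Fills the target array completely, looping because read() may legally
    // return fewer bytes than requested (e.g. when data arrives in HTTP chunks).
    static void readFully(InputStream is, byte[] target) throws IOException {
        int offset = 0;
        while (offset < target.length) {
            int n = is.read(target, offset, target.length - offset);
            if (n == -1) {
                throw new EOFException("stream ended after " + offset + " of " + target.length + " bytes");
            }
            offset += n;
        }
    }
}
```

Reading the header (and afterwards the compressed payload) through such a helper would mean the header is parsed exactly once from complete data, so cbufferSizes() never sees a half-filled buffer.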

I can take a look at implementing option 3 and see if both problems can be solved with this approach. Even with that, I don't think reading a header in a loop is needed in BloscCompressor.uncompress().

Cheers,

Chris

@SabineEmbacher
Collaborator

SabineEmbacher commented Jan 22, 2021

Dear Chris,

You can try the new S3_AWS branch.
Maybe this solves your problems.

Please also take a look at the class
https://github.com/bcdev/jzarr/blob/S3_AWS/docs/examples/java/S3Array_nio.java
and check the method "readFromS3Bucket(Path bucketPath)", where you can see an example strategy implementation which may solve the problems.

Let me know if this works as expected, and I will merge the branch into master.

Cheers,
Sabine

@SabineEmbacher
Collaborator

Dear Chris,

If you like, you can also use the email address from my profile for communication that is not tied to a single specific issue.
Profile: https://github.com/SabineEmbacher

Or, maybe even better so that others can read our conversations, use the discussion platform associated with this project:
https://github.com/bcdev/jzarr/discussions

Have a nice weekend!
Cheers,
Sabine

@SabineEmbacher SabineEmbacher changed the base branch from master to blosc_AWS_S3_improvement_Chris_Slater January 25, 2021 16:27
@SabineEmbacher SabineEmbacher merged commit 67e899a into bcdev:blosc_AWS_S3_improvement_Chris_Slater Jan 25, 2021
@electricsam electricsam deleted the eof-exception-blosc branch May 28, 2021 12:26