
The BloscCompressor used InputStream.available() to determine if byte… #13

Merged: 2 commits into blosc_AWS_S3_improvement_Chris_Slater on Jan 25, 2021

Conversation

electricsam
Contributor

The BloscCompressor used InputStream.available() to determine whether bytes were available. InputStream.available() is not guaranteed to be accurate: some implementations fall back to the default implementation provided by InputStream itself, which always returns 0, and an EOFException was thrown in that case. One example of this is software.amazon.awssdk.services.s3.checksums.ChecksumValidatingInputStream. The BloscCompressor was updated to simply attempt to read from the stream and check whether bytes were returned.
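The fix boils down to letting read()'s return value, rather than available(), decide whether data is left. A minimal sketch of the general idea (simplified, not the exact jzarr change):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamUtil {

    // Reads every remaining byte from the stream. read() returning -1 is the
    // only reliable end-of-stream signal; available() == 0 merely means
    // "no bytes can currently be read without blocking".
    static byte[] readAllBytes(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = is.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }
}
```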

…s were available. InputStream.available() is not guaranteed to be accurate. Some implementations rely on the implementation in InputStream, which always returns 0. An EOFException was thrown in this case. The BlockCompressor was updated to just attempt to read from the stream and see if bytes were returned.
…ng. Some InputStream implementations may not always return the requested number of bytes from read(). This is the case when reading from an S3 bucket, where the stream is chunked.
@SabineEmbacher
Collaborator

Dear Chris Slater,

That is a good hint and valuable information. The only thing I do not like is the place where
this problem is solved. The same input stream will also be fed into other compressors
(there will be more compressors in the future) and can cause the same or similar problems there.
That's why I would prefer a general solution for this AWS-specific problem.

The streams are created in the respective implementation of the
com.bc.zarr.storage.Store interface.
See com.bc.zarr.storage.Store.getInputStream(String key) [line 43].

For example, a simple ByteArrayInputStream in com.bc.zarr.storage.InMemoryStore [line 44], or a java.nio.file.FileSystem-specific stream in com.bc.zarr.storage.FileSystemStore [line 62].

I like the following 3 ideas to solve the problem:

  1. Implement a custom AwsStore that encapsulates this problem by first
    reading all data for a single compressed zarr chunk from the AwsStream
    and then having the AwsStore return a new ByteArrayInputStream() that
    can then be passed to any compressor.
  2. Extend the FileSystemStore so that an InputStreamAdapterFactory
    (e.g. AwsInputStreamAdapterFactory) [see Adapter Pattern] can be given
    to the constructor. This InputStreamAdapter could then encapsulate the
    respective problems caused by the stream implementation of the respective
    data source (AWS, ...).
  3. Give the FileSystemStore constructor a StreamCreationStrategy [see Strategy Pattern].
    An AwsStreamCreationStrategy would then replace a StandardStreamCreationStrategy.
    The AWS strategy could then read all data for a single compressed zarr chunk from
    the AWS stream in a first step and return a ByteArrayInputStream at the end to
    solve the problem (see the sketch below).

Since I personally prefer object composition to inheritance, I would prefer idea 3.
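A rough sketch of what idea 3 could look like (the class names here are only illustrative, not an existing jzarr API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical strategy interface the FileSystemStore constructor could accept.
interface StreamCreationStrategy {
    InputStream createInputStream(Path path) throws IOException;
}

// Default behaviour: hand the raw NIO stream straight to the compressor.
class StandardStreamCreationStrategy implements StreamCreationStrategy {
    @Override
    public InputStream createInputStream(Path path) throws IOException {
        return Files.newInputStream(path);
    }
}

// AWS behaviour: buffer the whole compressed zarr chunk first, so the compressor
// only ever sees an in-memory stream with well-behaved read() and available().
class AwsStreamCreationStrategy implements StreamCreationStrategy {
    @Override
    public InputStream createInputStream(Path path) throws IOException {
        return new ByteArrayInputStream(Files.readAllBytes(path));
    }
}
```

The FileSystemStore would then call the configured strategy in getInputStream(String key) instead of creating the stream itself.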

For info:
In general, the framework is built in such a way that the zarr chunks
(not the Blosc chunks) are read in a thread-safe way.

What do you think?

If you want, you can do that yourself, and I'm available to support you; or I can take your pull
request into another branch (not master) and refactor it into a general solution later.

What would you prefer?

With kind regards
Sabine

@electricsam
Contributor Author

Hi Sabine,

I ran into another issue when using an S3-based store, where calls to read() in the compressor did not return the requested number of bytes. This was caused by the Apache HttpClient InputStream that the AWS InputStreams wrap, which returns chunked data. It only happened when the size of the object being retrieved was above some threshold. The documentation for InputStream states that this is valid behavior.

When this happened, the output of the compressor was filled with zeros, because cbufferSizes() returned zero values on repeated passes through the loop in the uncompress method. The header that was read multiple times in that loop ended up being filled with payload data.
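For reference, a loop along these lines guarantees that a fixed-size buffer such as the Blosc header is filled completely before it is parsed, no matter how the underlying stream chunks its data (a simplified sketch, not the exact jzarr code):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class IOUtil {

    // Fills the target array completely, looping because read() may legally
    // return fewer bytes than requested (e.g. when data arrives in HTTP chunks).
    static void readFully(InputStream is, byte[] target) throws IOException {
        int offset = 0;
        while (offset < target.length) {
            int n = is.read(target, offset, target.length - offset);
            if (n == -1) {
                throw new EOFException("stream ended after " + offset + " of " + target.length + " bytes");
            }
            offset += n;
        }
    }
}
```

Reading the header (and afterwards the compressed payload) through such a helper would mean the header is parsed exactly once from complete data, so cbufferSizes() never sees a half-filled buffer.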

I can take a look at implementing option 3 and see if both problems can be solved with this approach. Even with that, I don't think reading a header in a loop is needed in BloscCompressor.uncompress().

Cheers,

Chris

@SabineEmbacher
Collaborator

SabineEmbacher commented Jan 22, 2021

Dear Chris,

You can try the new S3_AWS branch.
Maybe this solves your problems.

Please also take a look at the class
https://github.com/bcdev/jzarr/blob/S3_AWS/docs/examples/java/S3Array_nio.java
and check the method "readFromS3Bucket(Path bucketPath)", where you can see an example strategy implementation which may solve the problems.

Let me know if this works as expected, and I will merge the branch into master.

Cheers,
Sabine

@SabineEmbacher
Collaborator

Dear Chris,

If you like, you can also use the email address from my profile for communication that is not tied to a single specific issue.
Profile: https://github.com/SabineEmbacher

Or, maybe even better so that others can read our conversations, use the discussion platform associated with this project:
https://github.com/bcdev/jzarr/discussions

Have a nice weekend!
Cheers,
Sabine

@SabineEmbacher SabineEmbacher changed the base branch from master to blosc_AWS_S3_improvement_Chris_Slater January 25, 2021 16:27
@SabineEmbacher SabineEmbacher merged commit 67e899a into bcdev:blosc_AWS_S3_improvement_Chris_Slater Jan 25, 2021
@electricsam electricsam deleted the eof-exception-blosc branch May 28, 2021 12:26