Skip to content

OAK-10182: use streams to avoid buffer positioning issues leading to corrupted files#893

Merged
fabriziofortino merged 3 commits intoapache:trunkfrom
fabriziofortino:OAK-10182
Apr 11, 2023
Merged

OAK-10182: use streams to avoid buffer positioning issues leading to corrupted files#893
fabriziofortino merged 3 commits intoapache:trunkfrom
fabriziofortino:OAK-10182

Conversation

@fabriziofortino
Copy link
Copy Markdown
Contributor

#886 introduced a regression when checksum validation is enabled. The buffer mark is wrongly positioned when the destination file gets written leading to corrupted files. This also explains why the performance results with and without checksum validation were similar.

This PR uses streams and byte arrays instead of channels and ByteBuffers to perform write file operations and checksum. When the latter is enabled, the performance degradation is around 15%.

Copy link
Copy Markdown
Contributor

@nfsantos nfsantos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +177 to +178
try (InputStream inputStream = sourceUrl.getInputStream();
FileOutputStream outputStream = new FileOutputStream(destinationPath.toFile())) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if using buffered output/input streams here would help in speeding up the download? Probably not. What was the rationale for not using buffered streams?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not tested it. But I have compared this solution with the previous one based on channels and there is no performance degradation when checksum is not enabled.

Comment on lines 182 to +183
if (md != null) {
md.update(buffer);
md.update(buffer, 0, bytesRead);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checksum could in principle be handed over to another thread, but we have to be careful about managing the buffer. One idea is to have 2 buffers and swap them between the download thread and the checksum thread, so that when one buffer is being used for checksum calculations the other one is used for downloading. Just an idea for future improvement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but this would complicate the logic. An even better option could be to use multiple blocking queues where we put the buffer. We can then have separate downstream consumer threads (eg: one for writing the file, and another to compute the checksum). In this way, we can actually decouple the reads from the writes further boosting performance. Complexity will obviously increase. A reactive library might help here.

@fabriziofortino fabriziofortino merged commit f8aeae7 into apache:trunk Apr 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants