Conversation

pitrou (Member) commented on Oct 16, 2018:

Implement CompressedInputStream and CompressedOutputStream C++ classes. Tested with gzip, brotli and zstd codecs.

I initially intended to expose the functionality in Python, but NativeFile expects a RandomAccessFile in read mode (rather than a mere InputStream).
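
For context, here is a minimal usage sketch of the new classes, assuming the CompressedInputStream::Make factory shown in the header diff below plus Arrow's Codec::Create and ReadableFile::Open APIs; the header path and exact signatures are assumptions and may differ between Arrow versions.

#include <memory>
#include <string>

#include "arrow/buffer.h"
#include "arrow/io/compressed.h"   // header added by this PR (path assumed)
#include "arrow/io/file.h"
#include "arrow/status.h"
#include "arrow/util/compression.h"

// Wrap a gzip-compressed file in a CompressedInputStream and read
// decompressed bytes from it. Error handling kept minimal for brevity.
arrow::Status ReadGzippedFile(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> file;
  RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &file));

  std::unique_ptr<arrow::util::Codec> codec;
  RETURN_NOT_OK(arrow::util::Codec::Create(arrow::Compression::GZIP, &codec));

  std::shared_ptr<arrow::io::CompressedInputStream> stream;
  RETURN_NOT_OK(arrow::io::CompressedInputStream::Make(codec.get(), file, &stream));

  std::shared_ptr<arrow::Buffer> decompressed;
  RETURN_NOT_OK(stream->Read(64 * 1024, &decompressed));  // up to 64 KB of decompressed data
  return stream->Close();
}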

wesm (Member) commented on Oct 18, 2018:

Will review this today

wesm (Member) left a comment:
Some minor comments, but this looks great!

#include <memory>
#include <mutex>
#include <string>
#include <utility>

class CompressedOutputStream::Impl {
public:
Impl(MemoryPool* pool, Codec* codec, std::shared_ptr<OutputStream> raw)

wesm (Member):
Could do const std::shared_ptr<T>& to possibly avoid extra copy
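
A minimal illustration of the suggested change (the class and member names are only stand-ins, not the PR's actual code):

#include <memory>

class OutputStream;  // stand-in for arrow::io::OutputStream
class Codec;         // stand-in for arrow::util::Codec
class MemoryPool;    // stand-in for arrow::MemoryPool

class Impl {
 public:
  // Passing the shared_ptr by const reference avoids an extra copy (and the
  // associated atomic refcount bump) at the call boundary; the single copy
  // happens in the member initializer, where ownership is actually taken.
  Impl(MemoryPool* pool, Codec* codec, const std::shared_ptr<OutputStream>& raw)
      : pool_(pool), codec_(codec), raw_(raw) {}

 private:
  MemoryPool* pool_;
  Codec* codec_;
  std::shared_ptr<OutputStream> raw_;
};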

int64_t bytes_read, bytes_written;
int64_t input_len = nbytes;
int64_t output_len = compressed_->size() - compressed_pos_;
uint8_t* output = compressed_->mutable_data() + compressed_pos_;

wesm (Member):
I wonder if there might be a use for an intermediate abstraction "CompressionBuffer" that encapsulates some of this book-keeping that shows up in many places (unit tests and implementations). This could be passed into the stream compressor functions instead of raw pointers, allowing the compressor to request that the buffer be enlarged, etc.

pitrou (Member Author):
I don't know what it would look like, though.
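
One possible shape for such an abstraction, purely as a sketch (nothing named CompressionBuffer exists in this PR):

#include <algorithm>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical: owns the output region plus the write cursor that the call
// sites currently track by hand, and lets the codec ask for more room.
class CompressionBuffer {
 public:
  explicit CompressionBuffer(int64_t initial_size)
      : data_(new uint8_t[initial_size]), size_(initial_size), pos_(0) {}

  // Region still available for the (de)compressor to write into.
  uint8_t* writable_data() { return data_.get() + pos_; }
  int64_t remaining() const { return size_ - pos_; }

  // Record how many bytes were produced into the writable region.
  void Advance(int64_t bytes_written) { pos_ += bytes_written; }

  // Grow the buffer when the codec reports it needs more output space.
  void Enlarge(int64_t new_size) {
    std::unique_ptr<uint8_t[]> bigger(new uint8_t[new_size]);
    std::copy(data_.get(), data_.get() + pos_, bigger.get());
    data_ = std::move(bigger);
    size_ = new_size;
  }

 private:
  std::unique_ptr<uint8_t[]> data_;
  int64_t size_;
  int64_t pos_;
};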

if (is_open_) {
is_open_ = false;
RETURN_NOT_OK(FinalizeCompression());
return raw_->Close();

wesm (Member):
Do we definitely want to close the passed output stream? I'm trying to think if there are scenarios where we would not want to.

pitrou (Member Author):
I think by default, yes. We could add a constructor argument if we need to keep the underlying file alive in some cases.
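
If such an option were added, it might look roughly like the following; the close_raw_ flag is invented here for illustration, and the PR as written always closes the wrapped stream.

// Hypothetical variant: a constructor-supplied flag (defaulting to true)
// controls whether Close() propagates to the underlying raw stream.
Status Close() {
  if (is_open_) {
    is_open_ = false;
    RETURN_NOT_OK(FinalizeCompression());
    if (close_raw_) {
      return raw_->Close();
    }
  }
  return Status::OK();
}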

bool need_more_output;
int64_t bytes_read, bytes_written;
int64_t input_len = compressed_->size() - compressed_pos_;
const uint8_t* input = compressed_->data() + compressed_pos_;

wesm (Member):
Same question here re: output buffer bookkeeping


private:
// Write 64 KB compressed data at a time
static const int64_t CHUNK_SIZE = 64 * 1024;

wesm (Member):
kChunkSize
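
That is, the Google C++ naming convention for constants, e.g.:

// Same constant, renamed per Google C++ style (leading 'k', CamelCase).
static const int64_t kChunkSize = 64 * 1024;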

std::shared_ptr<CompressedOutputStream>* out);
static Status Make(MemoryPool* pool, util::Codec* codec,
std::shared_ptr<OutputStream> raw,
std::shared_ptr<CompressedOutputStream>* out);

wesm (Member):
Maybe use const std::shared_ptr<T>& here and elsewhere in the public APIs for consistency

/// \brief Create a compressed output stream wrapping the given output stream.
static Status Make(util::Codec* codec, std::shared_ptr<OutputStream> raw,
std::shared_ptr<CompressedOutputStream>* out);
static Status Make(MemoryPool* pool, util::Codec* codec,

wesm (Member):
Can codec be const? Also, can it be const Codec&?

pitrou (Member Author):
Currently, Codec::MakeCompressor and Codec::MakeDecompressor are non-const methods. IMHO it doesn't mean much to have a const Codec.

wesm (Member):
It's more about argument-passing consistency. We generally use const T& for immutable arguments, T* for mutable arguments, and const T* in the rarer case where the input may be null.
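
A small illustration of that convention (the types and names here are made up):

struct Options {};   // illustrative types only
struct Output {};
struct Metadata {};

// const T&  : required immutable input
// T*        : mutable output (or in/out) argument
// const T*  : immutable input that may legitimately be null
void Process(const Options& options, Output* out, const Metadata* metadata = nullptr);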

std::shared_ptr<CompressedInputStream>* out);
static Status Make(MemoryPool* pool, util::Codec* codec,
std::shared_ptr<InputStream> raw,
std::shared_ptr<CompressedInputStream>* out);

wesm (Member):
Same here

ASSERT_RAISES(IOError, stream->Read(1024, &out_buf));
}

// NOTE: Snappy doesn't support streaming decompression

wesm (Member):
We could (should?) define a framed format of our own devising

pitrou (Member Author):
But is it useful? The main point here is to interact with existing files.

wesm (Member):
Probably not. I will say YAGNI for now ;)

pitrou force-pushed the ARROW-1019-compressed-streams branch from a0b333b to 64461f3 on October 18, 2018 at 17:41.
wesm (Member) commented on Oct 18, 2018:

+1. Let me have a look at the appveyor build

wesm (Member) commented on Oct 18, 2018:

I'll wait a little while to make sure appveyor looks good then merge

wesm closed this in eab7d5f on Oct 18, 2018.
pitrou deleted the ARROW-1019-compressed-streams branch on October 18, 2018 at 19:53.
wesm pushed a commit that referenced this pull request on Oct 20, 2018:

ARROW-3380: Support reading gzipped CSV files

Also works for other bundled compression types.

Based on PR #2777.

Author: Antoine Pitrou <antoine@python.org>

Closes #2786 from pitrou/ARROW-3380-gzipped-csv-read and squashes the following commits:

9a2244f <Antoine Pitrou> ARROW-3380: Support reading gzipped CSV files