
ARROW-16006: [C++][Docs] Provide row conversion example for dynamic schemas #12775

Merged
merged 7 commits, Jul 6, 2022

Conversation

wjones127
Member

@wjones127 wjones127 commented Apr 1, 2022

This PR is meant to improve the ease of converting between Arrow tabular structure and row-based data. It adds an example of implementing and using the row converters to convert between vectors of rapidjson::Document and Arrow tables, highlighting some common patterns and important utilities provided by Arrow C++.


@emkornfield emkornfield self-requested a review April 21, 2022 17:36
@wjones127 wjones127 changed the title ARROW-16006: [C++] Row conversion helpers and example [WIP] ARROW-16006: [C++] Row conversion helpers and example Apr 28, 2022
@wjones127
Member Author

Not sure if this is related to the appveyor segfault, but some batch sizes produce a segfault:

# segfault for me
rapidjson-row-converter 1000 500

# runs fine
rapidjson-row-converter 1000 800

@wjones127 wjones127 marked this pull request as ready for review May 5, 2022 23:36
@wjones127 wjones127 force-pushed the ARROW-16006-row-conversion branch 2 times, most recently from 487323c to b627507 Compare June 3, 2022 14:41
@wjones127
Member Author

Just rebased. CI failure is unrelated.

@emkornfield @pitrou Would one of you be willing to review soon?

@pitrou
Member

pitrou commented Jun 9, 2022

@wjones127 I don't know if that would be painful or not, but if not, would you like to submit the RecordBatchBuilder API changes in a separate PR that we can merge quickly?


/// \brief Base class for Arrow to row converter.
///
/// To use, specialize the conversion for a particular row type.
Member

Hmm... so that means that for a given RowType I cannot implement two different converters (for different actual formats) because symbols would clash. That sounds a bit limiting.

Member

Ah, I hadn't seen that you had to subclass in addition to specializing the class. My bad.

@pitrou
Member

pitrou commented Jun 9, 2022

I don't know what the exact requirements for this are, but if performance is not important then do note there are already conversion utilities in stl.h that go back and forth between arrow::Table and std::vector<std::tuple<...>>.

@pitrou
Member

pitrou commented Jun 9, 2022

Generally I have mixed feelings about exposing and maintaining this API. On the one hand it can ease writing conversions a bit; on the other hand it doesn't really abstract the task a lot anyway (mostly it seems to translate between iterator <-> record batch sequence <-> table).

Perhaps what we need is simpler ways of writing the translations mentioned above?

cc @bkietz for opinions

@wjones127
Member Author

I don't know what the exact requirements for this are, but if performance is not important then do note there are already conversion utilities in stl.h that go back and forth between arrow::Table and std::vector<std::tuple<...>>.

One of the intended use cases is designing performant connectors for NoSQL databases, which typically provide results as a vector or iterator of row / document objects. For example, DynamoDB's C++ SDK returns Aws::Map<Aws::String, Aws::DynamoDB::Model::AttributeValue>. Cassandra has a similar class. More generally, there are many existing systems that work with rows that an adopter of Arrow may need to connect to.

Looking at the utilities in stl.h, one thing I haven't explored is whether arrow::stl::ConversionTraits<T> and arrow::CTypeTraits<T> overlap.

Generally I have mixed feelings about exposing and maintaining this API. On the one hand it can ease writing conversions a bit, on the other hand it doesn't really abstract the task a lot anyway (mostly it seems to translate between iterator <-> record batch sequence <-> table).

I agree that the new API doesn't do much of the task; the translation it does between iterator, RB, and tables is mostly for convenience. The main thing it does is establish a pattern of converting one batch at a time, potentially operating on a smaller batch size than the table is currently chunked in, for cases where the inner conversion loop performs better with smaller batches.

It's hard to make an API general enough to handle most of the conversion; that's why most of this PR is an example demonstrating how to use Arrow's VisitXInline methods, type traits, and conversions to C types to implement row conversions. Those features make writing the row conversions a lot easier, but they aren't well-documented for end users. If we think the new API doesn't add much, I could just fold its logic into the example and this could just be a docs PR 🤷 .

@wjones127
Member Author

@wjones127 I don't know if that would be painful or not, but if not, would you like to submit the RecordBatchBuilder API changes in a separate PR that we can merge quickly?

Yes, I should do that. I was being lazy not making a separate PR.

@pitrou
Member

pitrou commented Jun 9, 2022

cc @lidavidm for opinions on the NoSQL connector use case.


For schemas not known at compile time, implement an :class:`arrow::ToRowConverter`
or an :class:`arrow::FromRowConverter`.
The following example shows how to implement conversion between `rapidjson::Document`
Member

Suggested change
The following example shows how to implement conversion between `rapidjson::Document`
The following example shows how to implement conversion between ``rapidjson::Document``


arrow::Status Visit(const arrow::ListArray& array) {
// First create rows from values
std::shared_ptr<arrow::Array> values = array.values();
Member

While minor, I wonder why BaseListArray::values doesn't return const Array& or at least const std::shared_ptr<Array>&…not that it should be a big deal here.

cpp/examples/arrow/rapidjson_row_converter.cc
/// \param batch Record batch to convert
/// \return A vector of rows containing all data in batch
virtual Result<ContainerType<RowType>> ConvertToVector(
std::shared_ptr<RecordBatch> batch) = 0;
Member

Why not take const RecordBatch& or const std::shared_ptr<RecordBatch>&?

RowBatchBuilder builder{batch->num_rows()};

for (int i = 0; i < batch->num_columns(); ++i) {
builder.SetField(std::move(batch->schema()->field(i)));
Member

nit: I don't think this std::move does anything here

for (const rapidjson::Document& doc : records) {
rapidjson::StringBuffer sb;
rapidjson::Writer<rapidjson::StringBuffer> writer(sb);
// At batch_size >= 495, we segfault here:
Member

why?

Member Author

Ah that's old. IIRC there was a bug in my handling of the rapidjson allocators.

// Convert records into a table
arrow::Result<std::shared_ptr<arrow::Table>> table_result =
to_arrow_converter.ConvertToTable(records, schema);
std::shared_ptr<arrow::Table> table = std::move(table_result).ValueOrDie();
Member

in this case why not just table = *to_arrow_converter.ConvertToTable();?

Member

or, we could factor out a Status Main() that gets called from int main

Member Author

I like the idea of a Status Main(), since that enables us to show use of result/status macros in the example code blocks 👍

// Default implementation
arrow::Status Visit(const arrow::DataType& type) {
return arrow::Status::NotImplemented(
"Can not convert to json document for array of type ", type.ToString());
Member

Cannot convert to Arrow array?

Member Author

Yup, thanks!

@@ -17,6 +17,8 @@

add_arrow_example(row_wise_conversion_example)

add_arrow_example(rapidjson_row_converter)
Member

wrap in if(ARROW_JSON)?

Member

I'm also surprised you don't have to link to rapidjson explicitly…

Member Author

I think rapidjson is header-only. I didn't see it linked anywhere in CMakeLists for ARROW_JSON.

Member

Oh, good to know.

Member

That said, if(ARROW_JSON) would still be good so that this doesn't fail building if RapidJSON wasn't required for building Arrow itself.

Comment on lines 53 to 55
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Table> table,
Table::FromRecordBatches({batch}));
return ConvertToVector(table, batch_size);
Member

IMO, it's pointless to wrap in a table only to unwrap it right away

auto builder_result = arrow::RecordBatchBuilder::Make(arrow_schema, memory_pool);

if (garrow::check(error, builder_result, "[record-batch-builder][new]")) {
std::unique_ptr<arrow::RecordBatchBuilder> arrow_builder = std::move(builder_result).ValueOrDie();
Contributor

nit: formatting, as is this change related?

Member Author

It's not related. I've moved this change to #13356 and will rebase after that's merged.

// * conversion for Arrow types that have corresponding C types (bool, integer,
// float).

rapidjson::Value kNullJsonSingleton = rapidjson::Value();
Contributor

nit: make this const static?

const ArrayType& array) {
for (int64_t i = 0; i < array.length(); ++i) {
if (!array.IsNull(i)) {
rapidjson::Value str_key(field_->name().c_str(), rows_[i].GetAllocator());
Contributor

nit: c_str() looks wrong here (what about unicode field names?)

Member Author

rapidjson doesn't accept std::string, it only accepts const char*. The character encoding is part of the value type definition: rapidjson::Value = rapidjson::GenericValue<UTF8<> >. So I think unicode should be correctly handled here.

Contributor

It looks like there is a StringRef that can take a length.

Member Author

Actually, it turns out if I #define RAPIDJSON_HAS_STDSTRING 1, rapidjson will just accept std::string 🤦 . Removed all the .c_str() 😄

arrow::Status Visit(const arrow::StringArray& array) {
for (int64_t i = 0; i < array.length(); ++i) {
if (!array.IsNull(i)) {
rapidjson::Value str_key(field_->name().c_str(), rows_[i].GetAllocator());
Contributor

same comment about c_str.

for (int64_t i = 0; i < array.length(); ++i) {
if (!array.IsNull(i)) {
rapidjson::Value str_key(field_->name().c_str(), rows_[i].GetAllocator());
// StringArray.Value returns a string view
Contributor

it seems it would be more succinct to spell out the type below than to have this comment.

arrow::Status Visit(const arrow::StructArray& array) {
const arrow::StructType* type = array.struct_type();

RowBatchBuilder child_builder(rows_.size());
Contributor

should this be based on array.size()?

Contributor

or is this saying only extract n_rows from the struct array?

Member Author

The expectation is that the class is constructed with the correct number of rows, so rows_.size() == array.length(). But I think it's worth adding an assert here to make that explicit.

@lidavidm
Member

Oops, I didn't actually reply to the use case part.

One of the intended use cases is designing performant connectors for NoSQL databases, which typically provide results as a vector or iterator of row / document objects. For example, DynamoDB's C++ SDK returns Aws::Map<Aws::String, Aws::DynamoDB::Model::AttributeValue>. Cassandra has a similar class. More generally, there are many existing systems that work with rows that an adopter of Arrow may need to connect to.

It would be useful to make this use case easier, but I'm still not sure if it needs to be a new API, vs a code sample or suggestions on how to approach it. row_conversion.h here is very minimal and could frankly just be copy-pasted into an application easily enough without having to make it a public Arrow API. That said I don't necessarily object to having it if it's going to be a common enough pattern and/or if that makes it easy to point to.

for (int64_t i = 0; i < array.length(); ++i) {
rapidjson::Document::AllocatorType& allocator = rows_[i].GetAllocator();
auto array_len = array.value_length(i);

Contributor

shouldn't there be a null check someplace in here?

Member Author

Ah, yes there should be. Thanks!

Refactor to use record batch builder
Fix most compiler errors
Get a compiling version
Allow overriding container type
A bit more progress
Draft list handling
Get parsing working
Get example running all the way through
Draft narrative documentation
Enhance docs
Remove unused line
Fix lint issues
Fix ORC references
Move at end
Fix integer types
Use sizetype from rj
Use static cast
Format
Setup for benchmarks
Start debugging
Always retrieve the correct allocator
Must copy value to new allocator
Better handling of unique ptr
Format
Use better template and pure virtual functions
@wjones127
Member Author

I have refactored this PR to only create an example, rebased so that the RecordBatchBuilder changes are separated, and rewrote most of the prose in the example.

template <typename ArrayType, typename DataClass = typename ArrayType::TypeClass>
arrow::enable_if_primitive_ctype<DataClass, arrow::Status> Visit(
const ArrayType& array) {
assert((int64_t)rows_.size() == array.length());
Member

nit: prefer C++-style cast? (here and below)

@lidavidm
Member

Do we also want to put the example in the Cookbook somehow? I feel like it's a bit too long/involved to go there directly, but maybe we can cross-reference it.

@wjones127
Member Author

Do we also want to put the example in the Cookbook somehow? I feel like it's a bit too long/involved to go there directly, but maybe we can cross-reference it.

Yeah, I can create a link to it somewhere under the "Creating Arrow Objects" section I think.

@wjones127 wjones127 changed the title ARROW-16006: [C++] Row conversion helpers and example ARROW-16006: [C++][Docs] Provide row conversion example for dynamic schemas Jun 23, 2022
Member

@pitrou pitrou left a comment

This LGTM though I'm unsure how many people would actually follow such a long and non-trivial example.

@wjones127
Member Author

This LGTM though I'm unsure how many people would actually follow such a long and non-trivial example.

It's a common use case and I tried to add narrative docs that broke up the example to highlight useful utilities. But it is unfortunate we weren't able to find any general abstractions to make this operation short and trivial.

I think this is ready to be merged, unless @emkornfield has any objections?

@emkornfield
Contributor

Sorry, I never got through a full review, so if other reviewers are happy with it I'm fine checking it in. Is there a summary of why we don't think there is a path forward for a generalized method?

@wjones127
Member Author

Is there a summary for why we don't think there is a path forward for a generalized method?

Here it came down to the fact that the abstraction we did come up with didn't do very much; most of the value was in the example. As designed, we asked the user to implement the conversion of one batch of rows to one RecordBatch, and the API took care of the rest. But in our example, most of the logic was within that single batch-conversion implementation. It's possible that with another pass we could come up with a worthwhile API; perhaps if the user-implemented hooks were narrower in scope.

@lidavidm lidavidm merged commit 73588c3 into apache:master Jul 6, 2022
@wjones127 wjones127 deleted the ARROW-16006-row-conversion branch July 6, 2022 14:28