New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16006: [C++][Docs] Provide row conversion example for dynamic schemas #12775
Conversation
252e3eb
to
ca87369
Compare
e625503
to
b0b140b
Compare
Not sure if this is related to the appveyor segfault, but some batch sizes produce a segfault:
|
15cea97
to
f6f1eaa
Compare
487323c
to
b627507
Compare
Just rebased. CI failure is unrelated. @emkornfield @pitrou Would one of you be willing to review soon? |
@wjones127 I don't know if that would be painful or not, but if not, would you like to submit the RecordBatchBuilder API changes in a separate PR that we can merge quickly? |
cpp/src/arrow/row_conversion.h
Outdated
|
||
/// \brief Base class for Arrow to row converter. | ||
/// | ||
/// To use, specialize the conversion for a particular row type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... so that means that for a given RowType I cannot implement two different converters (for different actual formats) because symbols would clash. That sounds a bit limiting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I hadn't seen that you had to subclass in addition to specializing the class. My bad.
I don't know what the exact requirements for this are, but if performance is not important then do note there are already conversion utilities in |
Generally I have mixed feelings about exposing and maintaing this API. On the one hand it can ease writing conversions a bit, on the other hand it doesn't really abstract the task a lot anyway (mostly it seems to translate between iterator <-> record batch sequence <-> table). Perhaps what we need is simpler ways of writing the translations mentioned above? cc @bkietz for opinions |
One of the intended use case is designing performant connectors for NoSQL databases, who typically provide results as a vector or iterator of row / document objects. For example, DynamoDB's C++ SDK returns Looking that the utilities in stl.h, one thing I haven't explored is whether
I agree that the new API doesn't do that much on the task; the translation it does between iterator, RB, and tables is mostly for convenience. The main thing it does is establish a pattern of converting one batch at a time, and potentially operating on a smaller batch size than a table is currently grouped in for cases where the inner conversion loops performs better with smaller batches. It's hard to make an API that general enough that handles most of the conversion; that's why most of this PR is an example demonstrating how to use Arrow's |
Yes, I should do that. I was being lazy not making a separate PR. |
cc @lidavidm for opinions on the NoSQL connector use case. |
|
||
For schemas not known at compile time, implement an :class:`arrow::ToRowConverter` | ||
or an :class:`arrow::FromRowConverter`. | ||
The following example shows how to implement conversion between `rapidjson::Document` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following example shows how to implement conversion between `rapidjson::Document` | |
The following example shows how to implement conversion between ``rapidjson::Document`` |
|
||
arrow::Status Visit(const arrow::ListArray& array) { | ||
// First create rows from values | ||
std::shared_ptr<arrow::Array> values = array.values(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While minor, I wonder why BaseListArray::values
doesn't return const Array&
or at least const std::shared_ptr<Array>&
…not that it should be a big deal here.
cpp/src/arrow/row_conversion.h
Outdated
/// \param batch Record batch to convert | ||
/// \return A vector of rows containing all data in batch | ||
virtual Result<ContainerType<RowType>> ConvertToVector( | ||
std::shared_ptr<RecordBatch> batch) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not take const RecordBatch&
or const std::shared_ptr<RecordBatch>&
?
RowBatchBuilder builder{batch->num_rows()}; | ||
|
||
for (int i = 0; i < batch->num_columns(); ++i) { | ||
builder.SetField(std::move(batch->schema()->field(i))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't think this std::move
does anything here
for (const rapidjson::Document& doc : records) { | ||
rapidjson::StringBuffer sb; | ||
rapidjson::Writer<rapidjson::StringBuffer> writer(sb); | ||
// At batch_size >= 495, we segfault here: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that's old. IIRC there was a bug in my handling of the rapidjson allocators.
// Convert records into a table | ||
arrow::Result<std::shared_ptr<arrow::Table>> table_result = | ||
to_arrow_converter.ConvertToTable(records, schema); | ||
std::shared_ptr<arrow::Table> table = std::move(table_result).ValueOrDie(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case why not just table = *to_arrow_converter.ConvertToTable();
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or, we could factor out a Status Main()
that gets called from int main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of a Status Main()
, since that enables us to show use of result/status macros in the example code blocks 👍
// Default implementation | ||
arrow::Status Visit(const arrow::DataType& type) { | ||
return arrow::Status::NotImplemented( | ||
"Can not convert to json document for array of type ", type.ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot convert to Arrow array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, thanks!
@@ -17,6 +17,8 @@ | |||
|
|||
add_arrow_example(row_wise_conversion_example) | |||
|
|||
add_arrow_example(rapidjson_row_converter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrap in if(ARROW_JSON)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also surprised you don't have to link to rapidjson explicitly…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think rapidjson is header-only. I didn't see it linked anywhere in CMakeLists for ARROW_JSON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, if(ARROW_JSON)
would still be good so that this doesn't fail building if RapidJSON wasn't required for building Arrow itself.
cpp/src/arrow/row_conversion.h
Outdated
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<Table> table, | ||
Table::FromRecordBatches({batch})); | ||
return ConvertToVector(table, batch_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, it's pointless to wrap in a table only to unwrap it right away
c_glib/arrow-glib/table-builder.cpp
Outdated
auto builder_result = arrow::RecordBatchBuilder::Make(arrow_schema, memory_pool); | ||
|
||
if (garrow::check(error, builder_result, "[record-batch-builder][new]")) { | ||
std::unique_ptr<arrow::RecordBatchBuilder> arrow_builder = std::move(builder_result).ValueOrDie(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: formatting, as is this change related?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not related. I've moved this change to #13356 and will rebase after that's merged.
// * conversion for Arrow types that have corresponding C types (bool, integer, | ||
// float). | ||
|
||
rapidjson::Value kNullJsonSingleton = rapidjson::Value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: make this const static?
const ArrayType& array) { | ||
for (int64_t i = 0; i < array.length(); ++i) { | ||
if (!array.IsNull(i)) { | ||
rapidjson::Value str_key(field_->name().c_str(), rows_[i].GetAllocator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: c_str,looks wrong here (what about unicode field names)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rapidjson doesn't accept std::string
, it only accepts const char*
. The character encoding is part of the value type definition: rapidjson::Value = rapidjson::GenericValue<UTF8<> >
. So I think unicode should be correctly handled here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there is a StringRef that can take a length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it turns out if I #define RAPIDJSON_HAS_STDSTRING 1
, rapidjson will just accept std::string
🤦 . Removed all the .c_str()
😄
arrow::Status Visit(const arrow::StringArray& array) { | ||
for (int64_t i = 0; i < array.length(); ++i) { | ||
if (!array.IsNull(i)) { | ||
rapidjson::Value str_key(field_->name().c_str(), rows_[i].GetAllocator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about c_str.
for (int64_t i = 0; i < array.length(); ++i) { | ||
if (!array.IsNull(i)) { | ||
rapidjson::Value str_key(field_->name().c_str(), rows_[i].GetAllocator()); | ||
// StringArray.Value returns a string view |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems it would be more succinct to spell out the type below then to have this comment.
arrow::Status Visit(const arrow::StructArray& array) { | ||
const arrow::StructType* type = array.struct_type(); | ||
|
||
RowBatchBuilder child_builder(rows_.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be based on array.size()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or is this saying only extract n_rows from the struct array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expectation is that the class is constructed with the correct number of rows, so rows_.size() == array.length()
. But I think it's worth adding an assert
here to make that explicit.
Oops, I didn't actually reply to the use case part.
It would be useful to make this use case easier, but I'm still not sure if it needs to be a new API, vs a code sample or suggestions on how to approach it. |
for (int64_t i = 0; i < array.length(); ++i) { | ||
rapidjson::Document::AllocatorType& allocator = rows_[i].GetAllocator(); | ||
auto array_len = array.value_length(i); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't there be a null check someplace in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes there should be. Thanks!
Refactor to use record batch builder Fix most compiler errors Get a compiling version Allow overriding container type A bit more progress Draft list handling Get parsing working Get example running all the way through Draft narrative documentation Enhance docs Remove unused line Fix lint issues Fix ORC references Move at end Fix integer types Use sizetype from rj Use static cast Format Setup for benchmarks Start debugging Always retrieve the correct allocator Must copy value to new allocator Better handling of unique ptr Format Use better template and pure virtual functions
50f9efd
to
c32ef47
Compare
I have refactored this PR to only create an example, rebased so that the RecordBatchBuilder changes are separated, and rewrote most of the prose in the example. |
template <typename ArrayType, typename DataClass = typename ArrayType::TypeClass> | ||
arrow::enable_if_primitive_ctype<DataClass, arrow::Status> Visit( | ||
const ArrayType& array) { | ||
assert((int64_t)rows_.size() == array.length()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: prefer C++-style cast? (here and below)
Do we also want to put the example in the Cookbook somehow? I feel like it's a bit too long/involved to go there directly, but maybe we can cross-reference it. |
Yeah, I can create a link to it somewhere under the "Creating Arrow Objects" section I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM though I'm unsure how many people would actually follow such a long and non-trivial example.
It's a common use case and I tried to add narrative docs that broke up the example to highlight useful utilities. But it is unfortunate we weren't able to find any general abstractions to make this operation short and trivial. I think this is ready to be merged, unless @emkornfield has any objections? |
sorry never got through a full review so if other reviewers are happy with it I'm fine checking it in. Is there a summary for why we don't think there is a path forward for a generalized method? |
Here it came down to the abstraction we did come up with didn't do very much; most of the value was in the example. In the way it was designed, we asked the user to implement the conversion of one batch of rows to one RecordBatch, and the API took care of the rest. But in our example, most of the logic was within that single batch conversion example implementation. It's possible with another pass we could come up a worthwhile API; perhaps if the user-implemented hooks were narrower in scope. |
This PR is meant to improve the ease of converting between Arrow tabular structure and row-based data. It adds an example of implementing and using the row converters to convert between vectors of
rapidjson::Document
and Arrow tables, highlighting some common patterns and important utilities provided by Arrow C++.