[C++] Add Utility function to simplify converting any row-based structure into an arrow::RecordBatchReader or an arrow::Table #34056

Closed
gringasalpastor opened this issue Feb 6, 2023 · 0 comments · Fixed by #34057

gringasalpastor commented Feb 6, 2023

Enhancement Description:

Arrow is column-based, but clients often need to import external data that is stored row by row. To simplify this, I propose a RowsToBatches utility function that takes any valid C++ range (a type T for which std::begin/std::end are defined) and returns an arrow::RecordBatchReader (convertible to an arrow::Table). This is particularly useful when the data types for each column are not known at compile time, as with a std::variant.

The interface could look like the following (simplified for clarity):

Result<std::shared_ptr<RecordBatchReader>> RowsToBatches(const std::shared_ptr<Schema>& schema, std::reference_wrapper<Range> rows, DataPointConvertor&& data_point_convertor);

See linked pull request for full details. The client would only need to provide their Schema and a callable type that converts their structure’s types into the associated arrow types.

If the client's type is not a C++ range, the client can either add iterators to it or write a wrapper/adaptor that provides iterators for the type.
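The wrapper/adaptor option can be sketched without touching Arrow at all. The block below is only an illustration: `LegacyRowStore`, `LegacyRowRange`, `Size()` and `RowAt()` are hypothetical names invented for this example and are not part of Arrow or of the linked pull request. It shows a container that exposes rows only by index, plus a thin adaptor that supplies `begin()`/`end()` so that `std::begin`/`std::end` work on it.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical row store: rows are reachable only through Size()/RowAt(),
// so it is not a C++ range on its own.
class LegacyRowStore {
 public:
  explicit LegacyRowStore(std::vector<std::vector<int>> rows)
      : rows_(std::move(rows)) {}
  std::size_t Size() const { return rows_.size(); }
  const std::vector<int>& RowAt(std::size_t i) const {
    assert(i < rows_.size());
    return rows_[i];
  }

 private:
  std::vector<std::vector<int>> rows_;
};

// Hypothetical adaptor: a non-owning view that adds forward iterators.
class LegacyRowRange {
 public:
  class Iterator {
   public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = std::vector<int>;
    using difference_type = std::ptrdiff_t;
    using pointer = const value_type*;
    using reference = const value_type&;

    Iterator() = default;
    Iterator(const LegacyRowStore* store, std::size_t index)
        : store_(store), index_(index) {}

    reference operator*() const { return store_->RowAt(index_); }
    pointer operator->() const { return &store_->RowAt(index_); }
    Iterator& operator++() { ++index_; return *this; }
    Iterator operator++(int) { Iterator tmp = *this; ++index_; return tmp; }
    bool operator==(const Iterator& other) const {
      return store_ == other.store_ && index_ == other.index_;
    }
    bool operator!=(const Iterator& other) const { return !(*this == other); }

   private:
    const LegacyRowStore* store_ = nullptr;
    std::size_t index_ = 0;
  };

  // The store must outlive the adaptor, since only a pointer is kept.
  explicit LegacyRowRange(const LegacyRowStore& store) : store_(&store) {}
  Iterator begin() const { return Iterator(store_, 0); }
  Iterator end() const { return Iterator(store_, store_->Size()); }

 private:
  const LegacyRowStore* store_;
};
```

An instance of `LegacyRowRange` can then be used in a range-based for loop, which is the property the proposed function relies on. Whether forward iterators are sufficient for the final API is an assumption here; the linked pull request is the authority on the exact requirements.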

Example Usage:

auto IntConvertor = [](ArrayBuilder& array_builder, int value) {
  return static_cast<Int64Builder&>(array_builder).Append(value);
};
std::vector<std::vector<int>> data = {{1, 2, 4}, {5, 6, 7}};
auto batches = RowsToBatches(kTestSchema, std::ref(data), IntConvertor);

Example Supported Types:

  • std::vector<std::vector<std::variant<int, bsl::string>>>
  • std::vector<MyRowStruct>
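For the `std::variant` case, the core idea is that each cell's type is only known at run time, so the convertor has to dispatch on the active alternative. The sketch below illustrates that dispatch with `std::visit` in plain C++17. It deliberately does not use Arrow: `Cell`, `Column`, `AppendCell` and `RowsToColumns` are hypothetical stand-ins (a `std::vector` plays the role an array builder would play), and `std::string` is used in place of `bsl::string` to keep the example standard-library only.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// One row-based data point whose type is chosen at run time.
using Cell = std::variant<int, std::string>;

// Stand-in for a typed column builder. NOT an Arrow type.
using Column =
    std::variant<std::vector<std::int64_t>, std::vector<std::string>>;

// Appends one cell to one column, dispatching on both variants.
// Throws if the cell's type does not fit the column's type.
void AppendCell(Column& column, const Cell& cell) {
  std::visit(
      [](auto& values, const auto& value) {
        using ColumnValue =
            typename std::decay_t<decltype(values)>::value_type;
        using CellValue = std::decay_t<decltype(value)>;
        constexpr bool kColumnIsString = std::is_same_v<ColumnValue, std::string>;
        constexpr bool kCellIsString = std::is_same_v<CellValue, std::string>;
        if constexpr (kColumnIsString == kCellIsString) {
          values.push_back(value);  // int -> int64_t, or string -> string
        } else {
          throw std::invalid_argument("cell type does not match column type");
        }
      },
      column, cell);
}

// Transposes rows into columns. `empty_columns` acts as a minimal schema:
// one empty, correctly typed column per field.
std::vector<Column> RowsToColumns(
    const std::vector<Column>& empty_columns,
    const std::vector<std::vector<Cell>>& rows) {
  std::vector<Column> columns = empty_columns;
  for (const auto& row : rows) {
    if (row.size() != columns.size()) {
      throw std::invalid_argument("row width does not match schema");
    }
    for (std::size_t i = 0; i < row.size(); ++i) {
      AppendCell(columns[i], row[i]);
    }
  }
  return columns;
}
```

With the proposed API, the same per-alternative dispatch would presumably live inside the callable passed as the data point convertor, appending to the appropriate Arrow builder instead of a `std::vector`; see the linked pull request for how its `std::variant` test does this.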

Component(s)

C++

wjones127 added a commit that referenced this issue Feb 21, 2023
…ased structure into an `arrow::RecordBatchReader` or an `arrow::Table` (#34057)

*Are these changes tested?*

The following tests are provided:
- basic usage
- const ranges
- custom struct accessor
- usage with `std::variant`

* Closes: #34056

Lead-authored-by: Mike Hancock <mhancock34@bloomberg.net>
Co-authored-by: Michael Hancock <javaiscoolmike@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
wjones127 added this to the 12.0.0 milestone Feb 21, 2023
fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023
… row-based structure into an `arrow::RecordBatchReader` or an `arrow::Table` (apache#34057)
