
ARROW-17610: [C++] Support additional source types in SourceNode #14041

Closed (wanted to merge 134 commits)

Conversation


@rtpsw rtpsw commented Sep 5, 2022

github-actions bot commented Sep 5, 2022

rtpsw and others added 4 commits September 5, 2022 10:19
…to java/ (apache#13911)

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
rtpsw commented Sep 6, 2022

@westonpace, I see you were involved in the Enumerated code and I'd like to ask about it.

Why `int`, and not `int64_t`, for the index type? When I tried changing it to `int64_t`, I observed failures in the dataset tests like:

```
[ RUN      ] TestEndToEnd.EndToEndSingleDataset
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/dataset/dataset_test.cc:468: Failure
Failed
'_error_or_value141.status()' failed with Invalid: Referenced field __fragment_index was int64 but should have been int32
[  FAILED  ] TestEndToEnd.EndToEndSingleDataset (6 ms)
```

which could probably be fixed in the tester.

Also, why int and not int32_t?

wilhelmagren and others added 4 commits September 6, 2022 17:26
Change "There are the [...]" to "These are the [...]" to fix incorrect wording in the previous README.

Authored-by: Wilhelm Ågren <36638274+willeagren@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…pache#13975)

# Overview

This pull request:

1. Removes hard-coded dependencies on "master" as the default branch name in the crossbow infrastructure and CI template files.

# Implementation

1. Removed comment/text references to "master" branch, including URLs to other repositories.
2. Modified `core.py` to add a new `default_branch` property and a new method `is_default_branch`, for checking whether on the default branch, to the `Target` class.
3. Modified CI template files to use the new `is_default_branch` function to check whether on the default branch.

# Testing

1. Using [lafiona/crossbow](https://github.com/lafiona/crossbow) as a queue repository for qualification.
2. Ran modified template jobs. All failures appeared to be unrelated to the changes.
3. The branch names for all relevant qualification jobs are prefixed with `build-34-*`. 
4. Example of a passing job: [https://github.com/lafiona/crossbow/actions/runs/2920227769](https://github.com/lafiona/crossbow/actions/runs/2920227769)
5. Example of a failing job: [https://github.com/lafiona/crossbow/runs/7998190113](https://github.com/lafiona/crossbow/runs/7998190113) - in this example, the *"Push Docker Image"* workflow step is not included, since we are not on the default branch. The failure appears to be related to issues fetching R package resources and not related to the default branch checking logic. There were a variety of other kinds of failures, but none of them appear related to the default branch checking logic.

# Future Directions

1. Remove "master" from `default_branch` name property of `Target` class.
2. Remove all remaining uses of "master" terminology in crossbow.
3. [ARROW-17512](https://issues.apache.org/jira/browse/ARROW-17512): Address minor issues with crossbow documentation.

# Notes

1. Thank you to @lafiona for her help with this pull request!
2. Due to unexpected technical issues, we opened this pull request as a follow up to apache#13750. Please see apache#13750 for more discussion regarding qualification efforts.

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Fiona La <fionala7@gmail.com>
Signed-off-by: Alessandro Molina <amol@turbogears.org>
This is the last change I propose to improve our S3 error message.

For certain errors, the AWS SDK unfortunately does a poor job of propagating the error and just reports UNKNOWN (see https://github.com/aws/aws-sdk-cpp/blob/1614bce979a201ada1e3436358edb7bd1834b5d6/aws-cpp-sdk-core/source/client/AWSClient.cpp#L77). In these cases, the HTTP status code can be an important clue to what is going wrong (it is also reported by boto3).

This has the downside of cluttering the error message a bit more, but in general this information is very valuable for diagnosing the problem. Given the API call and the HTTP status code, there is usually good documentation on the internet that helps pinpoint what went wrong.

Before:

> When getting information for key 'test.csv' in bucket 'pcmoritz-test-bucket-arrow-errors': AWS Error UNKNOWN during HeadObject call: No response body.

After:

> When getting information for key 'test.csv' in bucket 'pcmoritz-test-bucket-arrow-errors': AWS Error UNKNOWN **(HTTP status 400)** during HeadObject call: No response body.

Lead-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
…ding transcoding function option to CSV scanner (apache#13820)

This is an alternative version of apache#13709, to compare which approach is best.

Instead of extending the C++ ReadOptions struct with an `encoding` field, this implementation adds a Python version of the ReadOptions object to both `CsvFileFormat` and `CsvFragmentScanOptions`. It is needed in both places to prevent inconsistencies like this:
```
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro = csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
```

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
@westonpace westonpace self-requested a review September 7, 2022 01:55
zeroshade and others added 8 commits September 7, 2022 11:18
…rdReader interface (apache#14057)

Lead-authored-by: Matt Topol <zotthewizard@gmail.com>
Co-authored-by: Matthew Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
This PR removes the RTools35 CI job as it's currently failing and we're about to drop support following the vote to move to C++17.

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
…#13973)

Support ORC file format in java Dataset API

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…7) (apache#14067)

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…_handler (apache#14061)

Authored-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…nt64 to match C++ code (apache#14032)

To fix an exception while writing large parquet files:
```
Traceback (most recent call last):
  File "pyarrow/_dataset_parquet.pyx", line 165, in pyarrow._dataset_parquet.ParquetFileFormat._finish_write
  File "pyarrow/dataset.pyx", line 2695, in pyarrow._dataset.WrittenFile.init_
OverflowError: value too large to convert to int
Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
```
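As a rough illustration (plain Python, not pyarrow code), the failure mode is the classic signed 32-bit overflow: a file size above 2**31 - 1 bytes cannot be represented as an `int32`, while an `int64` holds it easily:

```python
import struct

size = 3 * 1024**3  # a 3 GiB file size, beyond the int32 maximum of 2**31 - 1

# Packing the size as a signed 32-bit integer fails...
try:
    struct.pack('<i', size)
    overflowed = False
except struct.error:
    overflowed = True

# ...while a signed 64-bit integer holds it comfortably.
packed = struct.pack('<q', size)
print(overflowed, len(packed))  # True 8
```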

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…y of fields (apache#14048)

```python
table = pa.table({'a': [None, 1], 'b': [None, True]})
new_schema = pa.schema([pa.field("a", "int64", nullable=True), pa.field("b", "bool", nullable=False)])
casted = table.cast(new_schema)

```

Now leads to
```
RuntimeError: Casting field 'b' with null values to non-nullable
```
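The semantics of the new check can be sketched in plain Python (a hypothetical `check_cast_nullability` helper for illustration, not the pyarrow API):

```python
def check_cast_nullability(columns, target_fields):
    """Reject casting a column containing nulls to a non-nullable field.

    columns: dict mapping field name -> list of values (None means null)
    target_fields: list of (name, nullable) pairs from the target schema
    """
    for name, nullable in target_fields:
        if not nullable and any(v is None for v in columns[name]):
            raise RuntimeError(
                f"Casting field '{name}' with null values to non-nullable")

# Mirrors the example above: 'b' contains a null but is non-nullable
# in the target schema, so the cast is rejected.
columns = {'a': [None, 1], 'b': [None, True]}
try:
    check_cast_nullability(columns, [('a', True), ('b', False)])
except RuntimeError as e:
    print(e)  # Casting field 'b' with null values to non-nullable
```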

Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@westonpace westonpace left a comment


These seem like helpful utilities. I have a few questions but generally think this would be a good thing to get in.

cpp/src/arrow/compute/exec/options.h
Comment on lines 321 to 323
```cpp
if (io_executor == NULLPTR) {
  return Status::Invalid(kKindName, " requires IO-executor which is not null");
}
```
Member:

Couldn't we default to some kind of default IO executor here?

Contributor Author:

IIRC, I ran into a runtime problem when creating a generator using a null executor. I'm not sure whether the alternative of using the default executor would work well for generators intended for IO, rather than CPU, work. I'm also not aware of a default IO executor, but if there is one then I'd try using it here.

Member:

`default_io_context().executor()` should work. Or, more directly, `io::internal::GetIOThreadPool()`.

Comment on lines 330 to 331
```cpp
// TODO: Should Enumerated<>.index be changed to int64_t? Currently, this change
// causes dataset unit-test failures
```
Member:

I think there are probably many places that would fail today if we had more than 2 billion batches. That sort of scale is usually larger than a single node, I think.

Contributor Author:

I agree and thought about it too. However, note that Enumerated isn't specific to batches; it can enumerate anything coming out of an iterator, which might emit more than 2 billion items. I'm fine leaving it `int`, though I'd prefer `int32_t`.

Comment on lines 386 to 390
```cpp
auto exec_batch_it = MakeMapIterator(to_exec_batch, std::move(batch_it));
auto enum_it = MakeEnumeratedIterator(std::move(exec_batch_it));
ARROW_ASSIGN_OR_RAISE(auto enum_gen,
                      MakeBackgroundGenerator(std::move(enum_it), io_executor));
return MakeUnenumeratedGenerator(std::move(enum_gen));
```
Member:

Why are we enumerating and then unenumerating all these generators?

Contributor Author:

The problem is that a background generator, which is useful for overlapped IO, does not guarantee in-order delivery. My solution is to enumerate at the iterator, which does guarantee order, then apply the background generator, and finally sort by the enumeration. Though there might be a better way to do overlapped IO; I'd be interested in alternatives.
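The enumerate-then-restore-order idea can be sketched in plain Python rather than Arrow's generator machinery (the `ordered_parallel_map` helper here is illustrative, not an Arrow API): tag each item with its source index while order is still known, hand the work to a pool whose completion order is unpredictable, then sort the results back by tag.

```python
import concurrent.futures

def ordered_parallel_map(items, work, max_workers=4):
    # 1. "Enumerate": tag each item with its index while order is still known.
    tagged = list(enumerate(items))
    # 2. Run the work on a thread pool; completion order is not guaranteed,
    #    much like a background generator doing overlapped IO.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(lambda i=i, x=x: (i, work(x)))
                   for i, x in tagged]
        results = [f.result()
                   for f in concurrent.futures.as_completed(futures)]
    # 3. "Unenumerate": restore the original order by sorting on the tag.
    results.sort(key=lambda pair: pair[0])
    return [value for _, value in results]

print(ordered_parallel_map(range(8), lambda x: x * x))
# [0, 1, 4, 9, 16, 25, 36, 49]
```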

@rtpsw rtpsw Sep 10, 2022

It turns out my solution doesn't solve the problem: as before, I still see infrequent cases of out-of-order delivery when the IO context has multiple threads. Even after examining a fair amount of Arrow code related to generators, and trying a couple of things, I still have no good idea how to fix this. On the other hand, it looks like I'm not the only one: there is an existing multi-threaded unit test involving SourceNode that checks that result batches match while ignoring their order. This suggests the out-of-order-batches problem may originate in SourceNode. I won't try to fix it in this PR and will instead ignore order in the unit tests.

lidavidm and others added 8 commits September 8, 2022 10:01
Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Otherwise a minimal build fails with errors like

```
CMake Error at cmake_modules/BuildUtils.cmake:272 (target_link_libraries):
  Target "arrow_testing_objlib" links to:

    rapidjson::rapidjson

  but the target was not found.  Possible reasons include:

    * There is a typo in the target name.
    * A find_package call is missing for an IMPORTED target.
    * An ALIAS target is missing.

Call Stack (most recent call first):
  src/arrow/CMakeLists.txt:653 (add_arrow_lib)
```

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
This is the initial PR to set up the util functions and structure for the `ToProto` functionality for relations. The objective here is to create an Acero relation by interpreting what is included in a Substrait relation. In this PR, `ToProto` for the `read` relation is added.

Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
See https://issues.apache.org/jira/browse/ARROW-17412

Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com>
Co-authored-by: rtpsw <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
…oreSchema (apache#14087)

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
```
/arrow/cpp/src/arrow/compute/exec/bloom_filter.h:252:25: error: type attributes ignored after type is already defined [-Werror=attributes]
  252 | enum class ARROW_EXPORT BloomFilterBuildStrategy {
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~
```

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
…he case where buffer_ is null (apache#14085)

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
pitrou and others added 3 commits September 21, 2022 22:07
- Provide compatibility for 32-bit platforms
- Avoid memory leak in tests
- Make checks less strict

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
…ouldIncludeMetadata provided (apache#14196)

The current implementation includes [catalog, table, column, type](https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java#L248) metadata, but the schema metadata field is missing. In PostgreSQL terms, a catalog is a database and a schema is a namespace inside a database, so the catalog name alone is insufficient for addressing a table without the schema.

The proposed change is to add `metadata.put(Constants.SQL_SCHEMA_KEY, rsmd.getSchemaName(i));`.

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
rtpsw commented Sep 21, 2022

@westonpace, any idea about the CI compilation errors? It's complaining (also) about code I haven't changed, which is suspicious.

js8544 and others added 10 commits September 22, 2022 10:40
Starting from LLVM 13, LLVM IR has been shifting towards a unified opaque pointer type, i.e. pointers without pointee types. It has provided workarounds until LLVM 15. The temporary workarounds need to be replaced in order to support LLVM 15 and onwards. We need to supply the pointee type to the CreateGEP and CreateLoad methods.

For more background info, see https://llvm.org/docs/OpaquePointers.html and https://lists.llvm.org/pipermail/llvm-dev/2015-February/081822.html

Related issues:

https://issues.apache.org/jira/browse/ARROW-14363

https://issues.apache.org/jira/browse/ARROW-17728

https://issues.apache.org/jira/browse/ARROW-17775

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: jinshang <jinshang@tencent.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou commented Sep 22, 2022

@rtpsw Looking at the number of lines changed here, it seems this PR went astray from git master at some point. Perhaps close this and open a new clean PR?

@rtpsw rtpsw closed this Sep 22, 2022
@rtpsw rtpsw deleted the ARROW-17610 branch September 22, 2022 12:32
rtpsw commented Sep 22, 2022

Replaced by #14207
