
ARROW-17610: [C++] Support additional source types in SourceNode #14041

Closed (wanted to merge 134 commits)

Conversation


@rtpsw rtpsw commented Sep 5, 2022

github-actions bot commented Sep 5, 2022

rtpsw and others added 4 commits September 5, 2022 10:19
…to java/ (apache#13911)

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
rtpsw commented Sep 6, 2022

@westonpace, I see you were involved in the Enumerated code and I'd like to ask about it.

Why `int`, and not `int64_t`, for the index type? When I tried changing it to `int64_t`, I observed failures in the dataset tests like:

```
[ RUN      ] TestEndToEnd.EndToEndSingleDataset
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/dataset/dataset_test.cc:468: Failure
Failed
'_error_or_value141.status()' failed with Invalid: Referenced field __fragment_index was int64 but should have been int32
[  FAILED  ] TestEndToEnd.EndToEndSingleDataset (6 ms)
```

which could probably be fixed in the tester.

Also, why int and not int32_t?

wilhelmagren and others added 4 commits September 6, 2022 17:26
Change "There are the [...]" to "These are the [...]" to fix incorrect wording in the previous README.

Authored-by: Wilhelm Ågren <36638274+willeagren@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…pache#13975)

# Overview

This pull request:

1. Removes hard-coded dependencies on "master" as the default branch name in the crossbow infrastructure and CI template files.

# Implementation

1. Removed comment/text references to "master" branch, including URLs to other repositories.
2. Modified `core.py` to add a new `default_branch` property and a new method `is_default_branch`, for checking whether on the default branch, to the `Target` class.
3. Modified CI template files to use the new `is_default_branch` function to check whether on the default branch.

# Testing

1. Using [lafiona/crossbow](https://github.com/lafiona/crossbow) as a queue repository for qualification.
2. Ran modified template jobs. All failures appeared to be unrelated to the changes.
3. The branch names for all relevant qualification jobs are prefixed with `build-34-*`. 
4. Example of a passing job: [https://github.com/lafiona/crossbow/actions/runs/2920227769](https://github.com/lafiona/crossbow/actions/runs/2920227769)
5. Example of a failing job: [https://github.com/lafiona/crossbow/runs/7998190113](https://github.com/lafiona/crossbow/runs/7998190113) - in this example, the *"Push Docker Image"* workflow step is not included, since we are not on the default branch. The failure appears to be related to issues fetching R package resources and not related to the default branch checking logic. There were a variety of other kinds of failures, but none of them appear related to the default branch checking logic.

# Future Directions

1. Remove "master" from `default_branch` name property of `Target` class.
2. Remove all remaining uses of "master" terminology in crossbow.
3. [ARROW-17512](https://issues.apache.org/jira/browse/ARROW-17512): Address minor issues with crossbow documentation.

# Notes

1. Thank you to @lafiona for her help with this pull request!
2. Due to unexpected technical issues, we opened this pull request as a follow up to apache#13750. Please see apache#13750 for more discussion regarding qualification efforts.

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Fiona La <fionala7@gmail.com>
Signed-off-by: Alessandro Molina <amol@turbogears.org>
This is the last change I propose to improve our S3 error message.

For certain errors, the AWS SDK unfortunately does a poor job of propagating the error and just reports UNKNOWN (see https://github.com/aws/aws-sdk-cpp/blob/1614bce979a201ada1e3436358edb7bd1834b5d6/aws-cpp-sdk-core/source/client/AWSClient.cpp#L77). In these cases, the HTTP status code can be an important clue to what is going wrong (it is also reported by boto3).

This has the downside of cluttering the error message a bit more, but in general this information is very valuable for diagnosing the problem. Given the API call and the HTTP status code, there is usually good documentation on the internet that helps pinpoint what went wrong.

Before:

> When getting information for key 'test.csv' in bucket 'pcmoritz-test-bucket-arrow-errors': AWS Error UNKNOWN during HeadObject call: No response body.

After:

> When getting information for key 'test.csv' in bucket 'pcmoritz-test-bucket-arrow-errors': AWS Error UNKNOWN **(HTTP status 400)** during HeadObject call: No response body.

Lead-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
…ding transcoding function option to CSV scanner (apache#13820)

This is an alternative version of apache#13709, to compare which approach is best.

Instead of extending the C++ ReadOptions struct with an `encoding` field, this implementation adds a Python version of the ReadOptions object to both `CsvFileFormat` and `CsvFragmentScanOptions`. It is needed in both places to prevent inconsistencies like this:
```
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro = csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
```

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
@westonpace westonpace self-requested a review September 7, 2022 01:55
zeroshade and others added 8 commits September 7, 2022 11:18
…rdReader interface (apache#14057)

Lead-authored-by: Matt Topol <zotthewizard@gmail.com>
Co-authored-by: Matthew Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
This PR removes the RTools35 CI job as it's currently failing and we're about to drop support following the vote to move to C++17.

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
…#13973)

Support ORC file format in java Dataset API

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…7) (apache#14067)

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…_handler (apache#14061)

Authored-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…nt64 to match C++ code (apache#14032)

To fix an exception while writing large parquet files:
```
Traceback (most recent call last):
  File "pyarrow/_dataset_parquet.pyx", line 165, in pyarrow._dataset_parquet.ParquetFileFormat._finish_write
  File "pyarrow/dataset.pyx", line 2695, in pyarrow._dataset.WrittenFile.init_
OverflowError: value too large to convert to int
Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
```
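As a rough illustration (plain Python, not pyarrow code), the failure mode is the classic signed 32-bit overflow: a file size above 2**31 - 1 bytes cannot be represented as an `int32`, while an `int64` holds it easily:

```python
import struct

size = 3 * 1024**3  # a 3 GiB file size, beyond the int32 maximum of 2**31 - 1

# Packing the size as a signed 32-bit integer fails...
try:
    struct.pack('<i', size)
    overflowed = False
except struct.error:
    overflowed = True

# ...while a signed 64-bit integer holds it comfortably.
packed = struct.pack('<q', size)
print(overflowed, len(packed))  # True 8
```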

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…y of fields (apache#14048)

```python
table = pa.table({'a': [None, 1], 'b': [None, True]})
new_schema = pa.schema([pa.field("a", "int64", nullable=True), pa.field("b", "bool", nullable=False)])
casted = table.cast(new_schema)

```

Now leads to
```
RuntimeError: Casting field 'b' with null values to non-nullable
```
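The semantics of the new check can be sketched in plain Python (a hypothetical `check_cast_nullability` helper for illustration, not the pyarrow API):

```python
def check_cast_nullability(columns, target_fields):
    """Reject casting a column containing nulls to a non-nullable field.

    columns: dict mapping field name -> list of values (None means null)
    target_fields: list of (name, nullable) pairs from the target schema
    """
    for name, nullable in target_fields:
        if not nullable and any(v is None for v in columns[name]):
            raise RuntimeError(
                f"Casting field '{name}' with null values to non-nullable")

# Mirrors the example above: 'b' contains a null but is non-nullable
# in the target schema, so the cast is rejected.
columns = {'a': [None, 1], 'b': [None, True]}
try:
    check_cast_nullability(columns, [('a', True), ('b', False)])
except RuntimeError as e:
    print(e)  # Casting field 'b' with null values to non-nullable
```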

Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@westonpace westonpace left a comment


These seem like helpful utilities. I have a few questions but generally think this would be a good thing to get in.

cpp/src/arrow/compute/exec/options.h
Comment on lines 321 to 323
```cpp
if (io_executor == NULLPTR) {
  return Status::Invalid(kKindName, " requires IO-executor which is not null");
}
```
Member:

Couldn't we default to some kind of default IO executor here?

Contributor Author:

IIRC, I ran into a runtime problem when creating a generator using a null executor. I'm not sure whether the alternative of using the default executor would work well for generators intended for IO, rather than CPU, work. I'm also not aware of a default IO executor, but if there is one then I'd try using it here.

Member:

`default_io_context().executor()` should work. Or, more directly, `io::internal::GetIOThreadPool()`.

Comment on lines 330 to 331
```cpp
// TODO: Should Enumerated<>.index be changed to int64_t? Currently, this change
// causes dataset unit-test failures
```
Member:

I think there are probably many places that would fail today if we had more than 2 billion batches. That sort of scale is usually larger than a single node, I think.

Contributor Author:

I agree and thought about it too. However, note that Enumerated isn't specific to batches; it can enumerate anything coming out of an iterator, which might emit more than 2 billion items. I'm fine leaving it `int`, though I'd prefer `int32_t`.

Comment on lines 386 to 390
```cpp
auto exec_batch_it = MakeMapIterator(to_exec_batch, std::move(batch_it));
auto enum_it = MakeEnumeratedIterator(std::move(exec_batch_it));
ARROW_ASSIGN_OR_RAISE(auto enum_gen,
                      MakeBackgroundGenerator(std::move(enum_it), io_executor));
return MakeUnenumeratedGenerator(std::move(enum_gen));
```
Member:

Why are we enumerating and then unenumerating all these generators?

Contributor Author:

The problem is that a background generator, which is useful for overlapped IO, does not guarantee in-order delivery. My solution is to enumerate at the iterator, which does guarantee order, then apply the background generator, and finally sort by the enumeration. Though there might be a better way to do overlapped IO; I'd be interested in alternatives.
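The enumerate-then-restore-order idea can be sketched in plain Python rather than Arrow's generator machinery (the `ordered_parallel_map` helper here is illustrative, not an Arrow API): tag each item with its source index while order is still known, hand the work to a pool whose completion order is unpredictable, then sort the results back by tag.

```python
import concurrent.futures

def ordered_parallel_map(items, work, max_workers=4):
    # 1. "Enumerate": tag each item with its index while order is still known.
    tagged = list(enumerate(items))
    # 2. Run the work on a thread pool; completion order is not guaranteed,
    #    much like a background generator doing overlapped IO.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(lambda i=i, x=x: (i, work(x)))
                   for i, x in tagged]
        results = [f.result()
                   for f in concurrent.futures.as_completed(futures)]
    # 3. "Unenumerate": restore the original order by sorting on the tag.
    results.sort(key=lambda pair: pair[0])
    return [value for _, value in results]

print(ordered_parallel_map(range(8), lambda x: x * x))
# [0, 1, 4, 9, 16, 25, 36, 49]
```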

@rtpsw rtpsw Sep 10, 2022

It turns out my solution doesn't solve the problem: as before, I still see infrequent cases of out-of-order delivery when the IO context has multiple threads. Even after examining a fair amount of Arrow code related to generators, and trying a couple of things, I still have no good idea how to fix this. On the other hand, it looks like I'm not the only one: there is an existing multi-threaded unit test involving SourceNode that checks that result batches match while ignoring their order. This suggests the out-of-order-batches problem may originate in SourceNode. I won't try to fix it in this PR and will instead ignore order in the unit tests.

lidavidm and others added 8 commits September 8, 2022 10:01
Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Otherwise a minimal build fails with errors like

```
CMake Error at cmake_modules/BuildUtils.cmake:272 (target_link_libraries):
  Target "arrow_testing_objlib" links to:

    rapidjson::rapidjson

  but the target was not found.  Possible reasons include:

    * There is a typo in the target name.
    * A find_package call is missing for an IMPORTED target.
    * An ALIAS target is missing.

Call Stack (most recent call first):
  src/arrow/CMakeLists.txt:653 (add_arrow_lib)
```

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
This is the initial PR to set up the util functions and structure for the `ToProto` functionality for relations. The objective here is to create an Acero relation by interpreting what is included in a Substrait relation. In this PR, `ToProto` for the `read` relation is added.

Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
See https://issues.apache.org/jira/browse/ARROW-17412

Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com>
Co-authored-by: rtpsw <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
…oreSchema (apache#14087)

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
```
/arrow/cpp/src/arrow/compute/exec/bloom_filter.h:252:25: error: type attributes ignored after type is already defined [-Werror=attributes]
  252 | enum class ARROW_EXPORT BloomFilterBuildStrategy {
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~
```

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
…he case where buffer_ is null (apache#14085)

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
pitrou and others added 3 commits September 21, 2022 22:07
- Provide compatibility for 32-bit platforms
- Avoid memory leak in tests
- Make checks less strict

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
…ouldIncludeMetadata provided (apache#14196)

The current implementation includes [catalog, table, column, type](https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java#L248) metadata, but the schema metadata field is missing. In PostgreSQL terms, a catalog is a database and a schema is a namespace inside a database, so the catalog name alone is insufficient for addressing a table without the schema.

The proposed change is to add `metadata.put(Constants.SQL_SCHEMA_KEY, rsmd.getSchemaName(i));`.

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
rtpsw commented Sep 21, 2022

@westonpace, any idea about the CI compilation errors? It's complaining (also) about code I haven't changed, which is suspicious.

js8544 and others added 10 commits September 22, 2022 10:40
Starting from LLVM 13, LLVM IR has been shifting towards a unified opaque pointer type, i.e. pointers without pointee types. It has provided workarounds until LLVM 15. The temporary workarounds need to be replaced in order to support LLVM 15 and onwards. We need to supply the pointee type to the CreateGEP and CreateLoad methods.

For more background info, see https://llvm.org/docs/OpaquePointers.html and https://lists.llvm.org/pipermail/llvm-dev/2015-February/081822.html

Related issues:

https://issues.apache.org/jira/browse/ARROW-14363

https://issues.apache.org/jira/browse/ARROW-17728

https://issues.apache.org/jira/browse/ARROW-17775

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: jinshang <jinshang@tencent.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou commented Sep 22, 2022

@rtpsw Looking at the number of lines changed here, it seems this PR went astray from git master at some point. Perhaps close this and open a new clean PR?

@rtpsw rtpsw closed this Sep 22, 2022
@rtpsw rtpsw deleted the ARROW-17610 branch September 22, 2022 12:32
rtpsw commented Sep 22, 2022

Replaced by #14207
