ARROW-17610: [C++] Support additional source types in SourceNode #14041
Conversation
…to java/ (apache#13911) Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@westonpace, I see you were involved in the Why … which could probably be fixed in the tester. Also, why
Change "There are the [...]" to "These are the [...]". Incorrect wording in previous README. Authored-by: Wilhelm Ågren <36638274+willeagren@users.noreply.github.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…pache#13975)

# Overview

This pull request removes hard-coded dependencies on "master" as the default branch name in the crossbow infrastructure and CI template files.

# Implementation

1. Removed comment/text references to the "master" branch, including URLs to other repositories.
2. Modified `core.py` to add a new `default_branch` property and a new method `is_default_branch`, for checking whether we are on the default branch, to the `Target` class.
3. Modified CI template files to use the new `is_default_branch` function to check whether we are on the default branch.

# Testing

1. Used [lafiona/crossbow](https://github.com/lafiona/crossbow) as a queue repository for qualification.
2. Ran modified template jobs. All failures appeared to be unrelated to the changes.
3. The branch names for all relevant qualification jobs are prefixed with `build-34-*`.
4. Example of a passing job: https://github.com/lafiona/crossbow/actions/runs/2920227769
5. Example of a failing job: https://github.com/lafiona/crossbow/runs/7998190113. In this example, the *"Push Docker Image"* workflow step is not included, since we are not on the default branch. The failure appears to be related to issues fetching R package resources, not to the default-branch checking logic. There were a variety of other kinds of failures, but none of them appear related to the default-branch checking logic.

# Future Directions

1. Remove "master" from the `default_branch` property of the `Target` class.
2. Remove all remaining uses of "master" terminology in crossbow.
3. [ARROW-17512](https://issues.apache.org/jira/browse/ARROW-17512): Address minor issues with crossbow documentation.

# Notes

1. Thank you to @lafiona for her help with this pull request!
2. Due to unexpected technical issues, we opened this pull request as a follow-up to apache#13750. Please see apache#13750 for more discussion regarding qualification efforts.

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com> Co-authored-by: Fiona La <fionala7@gmail.com> Signed-off-by: Alessandro Molina <amol@turbogears.org>
This is the last change I propose to improve our S3 error message. For certain errors, the AWS SDK unfortunately does a poor job of propagating the error and just reports UNKNOWN (see https://github.com/aws/aws-sdk-cpp/blob/1614bce979a201ada1e3436358edb7bd1834b5d6/aws-cpp-sdk-core/source/client/AWSClient.cpp#L77). In these cases the HTTP status code can be an important clue to what is going wrong (and is also reported by boto3). This has the downside of cluttering the error message a bit more, but this information is very valuable for diagnosing the problem: given that we now report both the API call and the HTTP status code, there is generally good documentation on the internet to help diagnose the issue.

Before:

> When getting information for key 'test.csv' in bucket 'pcmoritz-test-bucket-arrow-errors': AWS Error UNKNOWN during HeadObject call: No response body.

After:

> When getting information for key 'test.csv' in bucket 'pcmoritz-test-bucket-arrow-errors': AWS Error UNKNOWN **(HTTP status 400)** during HeadObject call: No response body.

Lead-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
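As a rough illustration of the message format described above (a Python sketch, not Arrow's actual C++ implementation; the function name `format_s3_error` is hypothetical):

```python
# Hypothetical sketch of enriching an S3 error message with the HTTP status
# code when the AWS SDK only reports UNKNOWN as the error type.
def format_s3_error(context, error_type, http_status, api_call, body):
    # Include the HTTP status alongside the (often unhelpful) SDK error type.
    return (f"{context}: AWS Error {error_type} (HTTP status {http_status}) "
            f"during {api_call} call: {body}")

msg = format_s3_error(
    "When getting information for key 'test.csv' in bucket 'b'",
    "UNKNOWN", 400, "HeadObject", "No response body")
```

Even when the SDK error type is UNKNOWN, the HTTP status (here 400) narrows the diagnosis considerably.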
…ding transcoding function option to CSV scanner (apache#13820)

This is an alternative version of apache#13709, to compare which approach is best. Instead of extending the C++ ReadOptions struct with an `encoding` field, this implementation adds a Python version of the ReadOptions object to both `CsvFileFormat` and `CsvFragmentScanOptions`. It is needed in both places to prevent these kinds of inconsistencies:

```python
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro = csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
```

Authored-by: Joost Hoozemans <joosthooz@msn.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…rdReader interface (apache#14057) Lead-authored-by: Matt Topol <zotthewizard@gmail.com> Co-authored-by: Matthew Topol <zotthewizard@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
This PR removes the RTools35 CI job as it's currently failing and we're about to drop support following the vote to move to C++17. Authored-by: Dewey Dunnington <dewey@voltrondata.com> Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
…#13973) Support ORC file format in the Java Dataset API Authored-by: igor.suhorukov <igor.suhorukov@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…7) (apache#14067) Authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…_handler (apache#14061) Authored-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…nt64 to match C++ code (apache#14032)

To fix an exception while writing large Parquet files:

```
Traceback (most recent call last):
  File "pyarrow/_dataset_parquet.pyx", line 165, in pyarrow._dataset_parquet.ParquetFileFormat._finish_write
  File "pyarrow/dataset.pyx", line 2695, in pyarrow._dataset.WrittenFile.__init__
OverflowError: value too large to convert to int
Exception ignored in: 'pyarrow._dataset._filesystemdataset_write_visitor'
```

Authored-by: Joost Hoozemans <joosthooz@msn.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…y of fields (apache#14048)

```python
table = pa.table({'a': [None, 1], 'b': [None, True]})
new_schema = pa.schema([pa.field("a", "int64", nullable=True),
                        pa.field("b", "bool", nullable=False)])
casted = table.cast(new_schema)
```

now raises:

```
RuntimeError: Casting field 'b' with null values to non-nullable
```

Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
These seem like helpful utilities. I have a few questions but generally think this would be a good thing to get in.
```cpp
if (io_executor == NULLPTR) {
  return Status::Invalid(kKindName, " requires IO-executor which is not null");
}
```
Couldn't we default to some kind of default IO executor here?
IIRC, I ran into a runtime problem when creating a generator using a null executor. I'm not sure whether the alternative of using the default executor would work well for generators intended for IO, rather than CPU, work. I'm also not aware of a default IO executor, but if there is one then I'd try using it here.
`default_io_context().executor()` should work. Or, more directly, `io::internal::GetIOThreadPool()`.
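The fallback pattern being suggested can be sketched in Python, using `concurrent.futures` as a stand-in for Arrow's C++ thread pools (`DEFAULT_IO_POOL` here plays the role of `io::internal::GetIOThreadPool()`; the function names are illustrative, not Arrow's API):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a process-wide default IO thread pool.
DEFAULT_IO_POOL = ThreadPoolExecutor(max_workers=8)

def make_background_reader(read_fn, io_executor=None):
    # Instead of rejecting a missing executor, fall back to the shared default.
    executor = io_executor if io_executor is not None else DEFAULT_IO_POOL
    return executor.submit(read_fn)

# Caller passes no executor; work still runs on the default IO pool.
fut = make_background_reader(lambda: "batch-0")
```

With this shape, callers that don't care about the executor get sensible behavior, while callers doing heavy IO can still supply their own pool.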
```cpp
// TODO: Should Enumerated<>.index be changed to int64_t? Currently, this change
// causes dataset unit-test failures
```
I think there are probably many places that would fail today if we had more than ~2 billion batches. That sort of scale is usually beyond a single node, I think.
I agree and thought about it too. However, note that `Enumerated` isn't specific to batches; it can enumerate anything coming out of an iterator, which might emit more than 2B items. I'm fine leaving it `int`, though I'd prefer `int32_t`.
```cpp
auto exec_batch_it = MakeMapIterator(to_exec_batch, std::move(batch_it));
auto enum_it = MakeEnumeratedIterator(std::move(exec_batch_it));
ARROW_ASSIGN_OR_RAISE(auto enum_gen,
                      MakeBackgroundGenerator(std::move(enum_it), io_executor));
return MakeUnenumeratedGenerator(std::move(enum_gen));
```
Why are we enumerating and then unenumerating all these generators?
The problem is that a background generator, which is useful for overlapped IO, does not guarantee in-order delivery. My solution is to enumerate at the iterator, which does guarantee order, then apply the background generator, and finally sort by the enumeration. Though there might be a better way to do overlapped IO; I'd be interested in alternatives.
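The enumerate-then-reorder idea described above can be sketched in plain Python: tag each item with its source index before handing it to a multi-threaded executor, then restore source order on the consumer side. This is a self-contained illustration of the technique, not Arrow's generator API (`ordered_parallel_map` is a hypothetical name):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def ordered_parallel_map(items, fn, max_workers=4):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Enumerate at the (ordered) source: each future remembers its index.
        futures = {pool.submit(fn, item): i for i, item in enumerate(items)}
        # Completion order is arbitrary, like a background generator's output.
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    # Sort by the enumeration index to recover source order.
    return [results[i] for i in range(len(items))]

out = ordered_parallel_map([3, 1, 2], lambda x: x * 10)
```

Note this sketch buffers all results before yielding; a streaming version would emit each result as soon as the next expected index is available.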
It turns out my solution doesn't solve the problem: as before, I still see infrequent cases of out-of-order delivery when the IO context has multiple threads. Even after examining a fair amount of Arrow code related to generators, and trying a couple of things, I still have no good idea how to fix this. OTOH, it looks like I'm not the only one, because there is an existing multi-threaded unit test involving `SourceNode` that checks result batches match while ignoring their order. This suggests the out-of-order-batches problem may originate in `SourceNode`. I won't try to fix this problem in this PR and will instead resort to ignoring order in the unit tests.
Note that there is an archived Apache Arrow Dev discussion of the batch ordering issue.
Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Otherwise a minimal build fails with errors like:

```
CMake Error at cmake_modules/BuildUtils.cmake:272 (target_link_libraries):
  Target "arrow_testing_objlib" links to:

    rapidjson::rapidjson

  but the target was not found.  Possible reasons include:

    * There is a typo in the target name.
    * A find_package call is missing for an IMPORTED target.
    * An ALIAS target is missing.

Call Stack (most recent call first):
  src/arrow/CMakeLists.txt:653 (add_arrow_lib)
```

Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
This is the initial PR to set up the utility functions and structure to add `ToProto` functionality for relations. The objective is to create a Substrait relation by interpreting what is included in an Acero relation. In this PR, `ToProto` for the `read` relation is added. Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
See https://issues.apache.org/jira/browse/ARROW-17412 Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com> Co-authored-by: rtpsw <rtpsw@hotmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
…oreSchema (apache#14087) Authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
```
/arrow/cpp/src/arrow/compute/exec/bloom_filter.h:252:25: error: type attributes ignored after type is already defined [-Werror=attributes]
  252 | enum class ARROW_EXPORT BloomFilterBuildStrategy {
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~
```

Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…he case where buffer_ is null (apache#14085) Authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
- Provide compatibility for 32-bit platforms
- Avoid memory leak in tests
- Make checks less strict

Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…ouldIncludeMetadata provided (apache#14196)

The current implementation includes [catalog, table, column, type](https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java#L248) metadata, but the schema metadata field is missing. In PostgreSQL terms, a catalog is a database and a schema is a namespace inside a database, so the catalog name alone is insufficient for addressing a table without the schema. The proposed change is:

```java
metadata.put(Constants.SQL_SCHEMA_KEY, rsmd.getSchemaName(i));
```

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
@westonpace, any idea about the CI compilation errors? It's complaining (also) about code I haven't changed, which is suspicious.
Starting from LLVM 13, LLVM IR has been shifting towards a unified opaque pointer type, i.e. pointers without pointee types, with workarounds provided until LLVM 15. The temporary workarounds need to be replaced in order to support LLVM 15 and onwards: we need to supply the pointee type to the CreateGEP and CreateLoad methods. For more background info, see https://llvm.org/docs/OpaquePointers.html and https://lists.llvm.org/pipermail/llvm-dev/2015-February/081822.html

Related issues:
- https://issues.apache.org/jira/browse/ARROW-14363
- https://issues.apache.org/jira/browse/ARROW-17728
- https://issues.apache.org/jira/browse/ARROW-17775

Lead-authored-by: Jin Shang <shangjin1997@gmail.com> Co-authored-by: jinshang <jinshang@tencent.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
@rtpsw Looking at the number of lines changed here, it seems this PR diverged from git master at some point. Perhaps close this and open a new clean PR?
Replaced by #14207
See https://issues.apache.org/jira/browse/ARROW-17610