Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to arrow 20.0.0 #3007

Closed

Conversation

avantgardnerio
Copy link
Contributor

Which issue does this PR close?

Closes #3006.

Rationale for this change

We'll have to do it sooner or later, but mostly I want the new authentication capabilities in the FlightSqlServer code.

What changes are included in this PR?

An attempt at upgrading to the latest arrow.

Are there any user-facing changes?

It doesn't (presently) compile, so I'm making a draft PR to seek input on CI failures.

@github-actions github-actions bot added core Core datafusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql labels Aug 1, 2022
@avantgardnerio
Copy link
Contributor Author

I fixed a lot of the obvious errors, but I wasn't sure what to do about

error[E0308]: mismatched types
   --> datafusion/physical-expr/src/expressions/binary.rs:135:40
    |
135 |         .map(|left| left.map(|left| op(left, right)))
    |                                     -- ^^^^ expected `i128`, found struct `arrow::util::decimal::Decimal128`
    |                                     |
    |                                     arguments to this function are incorrect
    |
note: callable defined here
   --> datafusion/physical-expr/src/expressions/binary.rs:131:8
    |
131 |     F: Fn(i128, i128) -> bool,
    |        ^^^^^^^^^^^^^^^^^^^^^^

and

error[E0308]: mismatched types
    --> datafusion/physical-expr/src/expressions/in_list.rs:225:68
     |
225  |                     .map(|vop| match vop.map(|v| !$VALUES.contains(&v)) {
     |                                                           -------- ^^ expected `i128`, found struct `arrow::util::decimal::Decimal128`
     |                                                           |
     |                                                           arguments to this function are incorrect
...
318  |     collection_contains_check!(array, values, negated, contains_null)
     |     ----------------------------------------------------------------- in this macro invocation
     |
     = note: expected reference `&i128`
                found reference `&arrow::util::decimal::Decimal128`
note: associated function defined here
    --> /home/bgardner/snap/rustup/common/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/mod.rs:2172:12
     |
2172 |     pub fn contains(&self, x: &T) -> bool
     |            ^^^^^^^^
     = note: this error originates in the macro `collection_contains_check` (in Nightly builds, run with -Z macro-backtrace for more info)

CC @andygrove and @alamb - if you guys have any advice, it would be appreciated.

@avantgardnerio
Copy link
Contributor Author

If I understand correctly, the other XXXArray types have iterators that return the primitive representations (i.e. u128), but Decimal now returns a struct with From and to() traits?

https://github.com/apache/arrow-rs/blob/master/arrow/src/util/decimal.rs#L132

@avantgardnerio
Copy link
Contributor Author

If I revert the change here: https://github.com/apache/arrow-rs/pull/2140/files to:

def_decimal_array!(
    Decimal128Array,
    "Decimal128Array",
    Decimal128,
    Decimal128Iter
);

vs

def_decimal_array!(Decimal128Array, "Decimal128Array");

It compiles and happily runs. I'm not sure if that's the correct fix though.

@avantgardnerio
Copy link
Contributor Author

CC @HaoYang670

@avantgardnerio
Copy link
Contributor Author

@avantgardnerio
Copy link
Contributor Author

From i128 to Decimal128

But that ripples up to ScalarValues and throughout the codebase quickly.

@viirya
Copy link
Member

viirya commented Aug 1, 2022

expected i128, found struct arrow::util::decimal::Decimal128

Have you tried calling as_i128()?

@stuartcarnie
Copy link
Contributor

👋🏻 hi @avantgardnerio I was able to successfully upgrade arrow-rs via this draft PR for which all DataFusion tests pass: #3010. Perhaps there are commits in this PR that are useful for your work here?

@alamb
Copy link
Contributor

alamb commented Aug 2, 2022

Thank you @avantgardnerio and @stuartcarnie -- this is so great. I have normally been doing this work during / as part of the release process (e.g. apache/arrow-rs#2172)

THANK YOU so much for getting a start on it earlier.

@alamb alamb changed the title Upgrade arrow Upgrade to arrow 20.0.0 Aug 2, 2022
@avantgardnerio
Copy link
Contributor Author

Perhaps there are commits in this PR that are useful for your work here?

100%. Thank you so much!

@avantgardnerio
Copy link
Contributor Author

THANK YOU so much for getting a start on it earlier.

Well, I wanted one of the arrow-rs features :). But really, thanks to @andygrove and @stuartcarnie for figuring out how to make it work. I'm still learning. I'll look over the work the others did and familiarize myself with the solution so I can maybe not get stuck next time.

@avantgardnerio
Copy link
Contributor Author

Have you tried calling as_i128()?

I did, but however I tried to do it rippled up through the codebase and affected lots of places. @andygrove was able to figure it out. Thank you both for your help!

@avantgardnerio
Copy link
Contributor Author

@andygrove and @alamb looks like the checks are failing because lack of protoc? Any idea why this would have just started?

@carols10cents
Copy link
Contributor

Yes, prost is no longer bundling protoc. It looks like this repo will need changes similar to these? apache/arrow-rs#2280

@alamb
Copy link
Contributor

alamb commented Aug 3, 2022

@avantgardnerio I can help mess around with (aka help with) this PR as part of preparing the arrow release candidate, which I will likely do this Friday

@avantgardnerio
Copy link
Contributor Author

likely do this Friday

I'd love to get this one in first, but I'm sure there's a lot of people with a lot of interests and they all need to be considered: apache/arrow-rs#2309

I'm happy to help in any way I can with the release.

@avantgardnerio avantgardnerio force-pushed the bg_upgrade_arrow branch 3 times, most recently from 520c5fa to b015c83 Compare August 6, 2022 01:14
@avantgardnerio
Copy link
Contributor Author

@tustvold I think I'm stuck on these failures:

---- physical_plan::file_format::parquet::tests::parquet_exec_with_error stdout ----
thread 'physical_plan::file_format::parquet::tests::parquet_exec_with_error' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidPath { path: "." }', datafusion/core/src/physical_plan/file_format/parquet.rs:1243:14

---- test_util::tests::test_data_dir stdout ----
thread 'test_util::tests::test_data_dir' panicked at 'assertion failed: `(left == right)`
  left: `"/Users/runner/work/arrow-datafusion/arrow-datafusion/datafusion"`,
 right: `"/Users/runner/work/arrow-datafusion/arrow-datafusion/datafusion/core/.."`', datafusion/core/src/test_util.rs:345:9

It appears they are because this branch is referencing ../objectstore instead of object_store = "0.3.0" on master. I'm a little reticent to switch is back as this seems more appropriate and there's build errors either way. When you get a moment, I'd appreciate input, thanks!

@alamb
Copy link
Contributor

alamb commented Aug 6, 2022

It appears they are because this branch is referencing ../objectstore instead of object_store = "0.3.0" on master. I'm a little reticent to switch is back as this seems more appropriate and there's build errors either way. When you get a moment, I'd appreciate input, thanks!

I'll try and give it a look on monday. Thanks @avantgardnerio

cranelift-module = { version = "0.86.1", optional = true }
object_store = { version = "0.3", optional = true }
object_store = { git = "https://github.com/apache/arrow-rs.git", rev = "30c94dbf1c422f81f8520b9956e96ab7b53c3f47", features = [], optional = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe we intend to release object_store as part of arrow 20, and so I would back out this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an alternative version with object_store not upgraded, but I struggled with this too: #3083

@@ -75,10 +75,10 @@ lazy_static = { version = "^1.4.0" }
log = "^0.4"
num-traits = { version = "0.2", optional = true }
num_cpus = "1.13.0"
object_store = "0.3.0"
object_store = { git = "https://github.com/apache/arrow-rs.git", rev = "30c94dbf1c422f81f8520b9956e96ab7b53c3f47", features = [], optional = false }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@avantgardnerio
Copy link
Contributor Author

Thanks @alamb and @tustvold . I'll be out most of today, I'll try to devote time as I have it this weekend. I'm sure we can get it sorted by Monday.

fix decimal (#4)

Fix human error

Patch crates io to fix build (#5)

* fix decimal

* patch crate versions

Patch objectstore

Test in CI

Undo override?

Fix more errors

Fix last error?

Formatting

Clippy

Fixes

Fix refs

Able to get session context, but JDBC driver hung

Upgrade to arrow 20

Upgrade to RC2

Formatting

Fix some imports

Install protoc

Try platform agnostic path

Debug in CI :(

Debug in CI :(

Debug in CI :(

Not worth it, just separate builds

Variables

Fixes

Fix windows?

Fix windows?

Hackily fix windows

Down to 1 failure

Fix protoc

All? tests pass

Formatting
@codecov-commenter
Copy link

Codecov Report

Merging #3007 (a96911d) into master (85d5363) will decrease coverage by 0.01%.
The diff coverage is 92.30%.

❗ Current head a96911d differs from pull request most recent head 5270354. Consider uploading reports for the commit 5270354 to get more accurate results

@@            Coverage Diff             @@
##           master    #3007      +/-   ##
==========================================
- Coverage   85.93%   85.91%   -0.02%     
==========================================
  Files         289      289              
  Lines       52101    52116      +15     
==========================================
+ Hits        44774    44778       +4     
- Misses       7327     7338      +11     
Impacted Files Coverage Δ
...usion/core/src/avro_to_arrow/arrow_array_reader.rs 0.00% <0.00%> (ø)
datafusion/core/src/catalog/information_schema.rs 94.42% <0.00%> (ø)
...ore/src/physical_plan/file_format/chunked_store.rs 62.26% <0.00%> (-5.09%) ⬇️
datafusion/core/src/physical_plan/hash_utils.rs 40.43% <0.00%> (ø)
datafusion/core/src/physical_plan/repartition.rs 93.26% <ø> (ø)
...tafusion/core/src/physical_plan/sort_merge_join.rs 90.36% <0.00%> (ø)
datafusion/core/tests/path_partition.rs 84.92% <0.00%> (-0.87%) ⬇️
datafusion/core/tests/sql/aggregates.rs 99.36% <ø> (ø)
datafusion/core/tests/sql/decimal.rs 100.00% <ø> (ø)
datafusion/core/tests/sql/joins.rs 99.33% <ø> (ø)
... and 40 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@avantgardnerio
Copy link
Contributor Author

I got all the tests passing except windows. I installed a VM and tried to fix that too, but I'm getting lots of interesting errors with paths like file:///?\C:\some\path/myfile.parquet and I have to admit I'm a bit lost on those. I can probably debug them, but I might need to set up a real Windows environment instead of a slow VM.

@tustvold
Copy link
Contributor

tustvold commented Aug 8, 2022

I got all the tests passing except windows. I installed a VM and tried to fix that too, but I'm getting lots of interesting errors with paths like file:///?\C:\some\path/myfile.parquet and I have to admit I'm a bit lost on those. I can probably debug them, but I might need to set up a real Windows environment instead of a slow VM.

The path handling changed upstream, in particular apache/arrow-rs#2269, I could definitely see this having downstream implications. There is an outstanding PR apache/arrow-rs#2371 that may help, but I would strongly advise against upgrading object_store as part of this PR. Once we have a RC for the next object_store version, I'll work on updating DataFusion to it including all the various windows path nonsense

@avantgardnerio
Copy link
Contributor Author

Closing in favor of #3083

@avantgardnerio avantgardnerio deleted the bg_upgrade_arrow branch September 5, 2022 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The upcoming release of Arrow (20?) breaks datafusion
7 participants