Skip to content

Update arrow 40#6412

Merged
tustvold merged 5 commits intoapache:mainfrom
tustvold:update-arrow-40
May 23, 2023
Merged

Update arrow 40#6412
tustvold merged 5 commits intoapache:mainfrom
tustvold:update-arrow-40

Conversation

@tustvold
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions Bot added core Core DataFusion crate optimizer Optimizer rules physical-expr Changes to the physical-expr crates labels May 22, 2023
use datafusion_expr::ColumnarValue;

/// provide DataFusion default cast options
pub const DEFAULT_DATAFUSION_CAST_OPTIONS: CastOptions = CastOptions { safe: false };
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is necessary because FormatOptions isn't const constructible, it is somewhat an accident that CastOptions has historically been - any non-trivial additional options would have broken this

@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label May 22, 2023
SELECT '2000-01-01T00:00:00'::TIMESTAMP::TIMESTAMPTZ
----
2000-01-01T08:00:00+08:00
1999-12-31T16:00:00+08:00
Copy link
Copy Markdown
Contributor Author

@tustvold tustvold May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the result of apache/arrow-rs#4201

Although I'm not entirely sure it is correct... 🤔

Edit: I'm fairly certain this isn't correct, but the logic upstream is correct so I'm confused. Perhaps DF is doing some timezone arithmetic itself somewhere?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I am fairly certain this is a bug in the scalar value logic

❯ create table foo (ts text) as values ('2000-01-01T00:00:00');

❯ select * from foo;
+---------------------+
| ts                  |
+---------------------+
| 2000-01-01T00:00:00 |
+---------------------+

# This is correct
❯ select arrow_cast(ts, 'Timestamp(Nanosecond, Some( "+08:00" ))') from foo;
+---------------------------+
| foo.ts                    |
+---------------------------+
| 2000-01-01T00:00:00+08:00 |
+---------------------------+

# This is incorrect
❯ select arrow_cast('2000-01-01T00:00:00', 'Timestamp(Nanosecond, Some( "+08:00" ))') from foo;
+-----------------------------+
| Utf8("2000-01-01T00:00:00") |
+-----------------------------+
| 1999-12-31T16:00:00+08:00   |
+-----------------------------+
1 row in set. Query took 0.002 seconds.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think a bunch of that logic is replicated in scalar.rs (as it wasn't available upstream in arrow-rs)

I think @berkaysynnada may have some thoughts / be thinking about this.

($DATA_TYPE:ident, $ENUM:expr, $ENUM2:expr, $ARRAY_TYPE:ident, $EXPR:expr, $SIZE:expr) => {{
}

macro_rules! build_timestamp_array_from_option {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split this out into its own macro as overloading based on number of arguments is kind of hard to parse

// Need to call cast to cast to final data type with timezone/extra param
cast(&array, &DataType::$DATA_TYPE($ENUM, $ENUM2))
.expect("cannot do temporal cast")
Arc::new($ARRAY_TYPE::from_value(*value, $SIZE).with_timezone_opt($TZ))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the major change, we use with_timezone_opt instead of cast, as cast will now actually perform timezone conversion

let display_filter = self.filter.as_ref().map_or_else(
|| "".to_string(),
|f| format!(", filter={:?}", f.expression()),
|f| format!(", filter={}", f.expression()),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is so that we don't output all of CastOptions and by extension FormatOptions. It is also more correct to not be printing a debug representation as part of Display

select arrow_cast(timestamp '2000-01-01T00:00:00', 'Timestamp(Nanosecond, Some( "+08:00" ))');
----
2000-01-01T08:00:00+08:00
2000-01-01T00:00:00+08:00
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 It's correct now

@jackwener
Copy link
Copy Markdown
Member

Thanks @tustvold !❤️

@tustvold tustvold merged commit 98397af into apache:main May 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate optimizer Optimizer rules physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants