Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11655: [Rust][DataFusion] Postgres String Functions: left, lpad, right, rpad #9565

Closed
wants to merge 2 commits into from

Conversation

seddonm1
Copy link
Contributor

@seddonm1 seddonm1 commented Feb 24, 2021

@alamb Another one. Please pay close attention to the type coercion.

It does two things:

  • fixes the behavior of the type coercion.
  • adds the simple functions left, lpad, right, rpad following the Postgres style.

This PR is a child of #9243

@seddonm1 seddonm1 changed the title ARROW-11655: [left, lpad, right, rpad ARROW-11655: [Rust][DataFusion] Postgres String Functions: left, lpad, right, rpad Feb 24, 2021
@github-actions
Copy link

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work @seddonm1 -- I find this code very straightforward to follow and I think it is superbly tested. Very well done.

I would like to clarify the intention / need for the changes to can_coercce before we merge but otherwise this is 🥇 ✅

Int8 => matches!(type_from, Int8),
Int16 => matches!(type_from, Int8 | Int16 | UInt8),
Int32 => matches!(type_from, Int8 | Int16 | Int32 | UInt8 | UInt16),
Int8 => matches!(type_from, Int8 | Utf8 | LargeUtf8),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code seems inconsistent with the comments of can_coerce_from:

/// Return true if a value of type `type_from` can be coerced
/// (losslessly converted) into a value of `type_to`

What would happen if we tried to coerce "foo" into an Int8? I think the correct answer is that that we should get a runtime error. I worry that we would end up silently converting to Null.

Can you explain why these coercion rules are needed? And if we decide they are, can we please update the comments of this function to reflect the newfound understanding?

@jorgecarleitao I think did some previous work in this area, so he may have some relevant perspective. cc @andygrove in case you are interested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb I think you have hit on a bigger issue.

Postgres will do this type coercion silently: SELECT LEFT('abcde', '1'); will return a. And SELECT LEFT('abcde', 'a'); will return invalid input syntax for type integer: "a".

I think the default return of a failed CAST in DataFusion is currently NULL which is not good and not an ANSI expected behavior. I will volunteer to fix it if we can reach a consensus.

This is one of my major issues with Spark. Some other engines explicitly call out this behavior with SAFE CAST.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cast in DataFusion relies on the cast in Arrow. I think Arrow should have an option (I think there is a JIRA issue for that already) that would return an error on a failing cast instead of nulls, which is I think what we then can/should use for DataFusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan @alamb if you agree I can try to add it as I still have the function work in my head.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that adding an option to the arrow cast kernel to error (rather than silently return Null would be very valuable). I can see usecases for returning Null which is why I don't think we could just change the cast kernel's behavior. But an option would be lovely ❤️

Sorry for the delay in responding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem @alamb. How about I revert the coercion logic from this PR and re-add the explicit cast in the tests. I think there is a major piece of work to fully address CAST/coercion.

Copy link
Contributor

@alamb alamb Mar 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about I revert the coercion logic from this PR and re-add the explicit cast in the tests.

Sounds like a great plan

I think there is a major piece of work to fully address CAST/coercion.

I agree 💯

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to add a new function maybe_cast -> Result<Arc<dyn Array>>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao @alamb

I would like to get the rest of these SQL functions implemented then pick up the type-coercion work. I was also thinking a try_cast implementation but it will be a lot of work.

I think we also need to produce a matrix like this: https://docs.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql?view=sql-server-ver15#implicit-conversions to work out the scope of the changes as I think we have a very imbalanced set of implementations at the moment.

register_aggregate_csv(&mut ctx)?;
let sql = "SELECT sin(c1) FROM aggregate_test_100";
let plan = ctx.create_logical_plan(&sql);
assert!(plan.is_err());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave this test in (even if we decide that sin(utf8) should not error) and update it to show the expected results as a way of documenting the expected behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok ill re-add.

rust/datafusion/tests/sql.rs Outdated Show resolved Hide resolved
None
} else {
x.map(|x: &str| {
let n: i64 = n_array.value(i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thanks a lot for this, @seddonm1 . Impressive.

I do not have time to go through all of this, but I will try to comment over what I find.

Because we do not check that the length of the two arrays are equal, I think that this will read out of bounds whenever n_array.len() < string_array.len().

I would recommend tthat we do not use .value or is_null and instead only use the corresponding iterators and zip. The reason being that even if the iterators have a different length, the resulting code is still sound as the zipped iterator stops at the shortest iterator.

More generally, I think we should set a specific guideline and not allow new code to use these unsafe-yet-safe APIs, or better, just mark them as deprecated so that people use the corresponding safe ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jorgecarleitao thanks for this.

My understanding is that one of the core properties of a RecordBatch is that all columns must have the same length: https://github.com/apache/arrow/blob/master/rust/arrow/src/record_batch.rs#L52 implemented here: https://github.com/apache/arrow/blob/master/rust/arrow/src/record_batch.rs#L134

From what I can see, if we did adopt a zip then we would implicitly be treating the shorter argument as a None which wont break the out of bounds check but might produce some very strange function results.

I do agree with you that many of the core Rust Arrow implementations are throwing away the benefits of the Rust compiler so we should try to sensibly refactor for safety.

@seddonm1
Copy link
Contributor Author

seddonm1 commented Mar 1, 2021

@alamb This is the PR with the type-coercion changes removed and the explicit CAST(NULL AS INT) re-added to ensure tests will pass.

@seddonm1
Copy link
Contributor Author

seddonm1 commented Mar 2, 2021

@alamb @jorgecarleitao
I have reworked these functions to use zip instead of enumerate .is_null(i). I think it is good to go. I will do the same for the next batch of functions once this is merged.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this again and looks good to me. I didn't fully review all the functions with zip implementations but I skimmed them and spot checked a few and they looked good.

Thanks again @seddonm1 -- epic work

test_expression!("rpad('hi', 5, 'xy')", "hixyx");
test_expression!("rpad('hi', 5, NULL)", "NULL");
test_expression!("rpad('hi', 5)", "hi ");
test_expression!("rpad('hi', CAST(NULL AS INT), 'xy')", "NULL");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is interesting that the casts that are required are for casting Nulls CAST(NULL AS INT) --- I'll file a ticket to potentially allow that to work (I think it is fine to coerce a null constant into any type)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Mar 3, 2021

Thanks again @seddonm1 -- I think this PR is ready to go and has been hanging out for a long time -- we can improve things in subsequent PRs. I am going to merge it in. FYI @jorgecarleitao

@alamb alamb closed this in 090ac1c Mar 3, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…, right, rpad

@alamb Another one. Please pay close attention to the type coercion.

It does two things:
- fixes the behavior of the **type coercion**.
- adds the simple functions `left`, `lpad`, `right`, `rpad` following the Postgres style.

This PR is a child of apache#9243

Closes apache#9565 from seddonm1/left_ltrim_right_rpad

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…, right, rpad

@alamb Another one. Please pay close attention to the type coercion.

It does two things:
- fixes the behavior of the **type coercion**.
- adds the simple functions `left`, `lpad`, `right`, `rpad` following the Postgres style.

This PR is a child of apache#9243

Closes apache#9565 from seddonm1/left_ltrim_right_rpad

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
@asfimport asfimport mentioned this pull request Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants