-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11822: [Rust][Datafusion] Support case sensitive for function #9600
Conversation
ARROW-11778: [Rust] Cast from LargeUtf8 to Numerical and temporal types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wqc200 -- this is great functionality to have.
I would personally like to have datafusion follow postgres when it comes to case sensitivity (rather than having it be an option).
@andygrove / @seddonm1 / @Dandandan what do you think about making DF follow postgres's lead (and effectively convert everything to lowercase, except identifiers explicitly quoted.
For example Database
--> database
but "Database"
--> Database
?
Here is a blog post that I found that talks about posgres sensitivity
https://blog.xojo.com/2016/09/28/about-postgresql-case-sensitivity/
This would be great to have indeed (and start doing some more testing on, I think ATM we are both case sensitive and case insensitive. I agree it would be best to start having it default to using PostgreSQL case insensitivity without it being an option, to avoid making the code more complex than needed. |
Thank you very much!I'm going to do this in two steps, step one: support case-sensitive functionality, step two: make it case-insensitive by default, and fix the test cases |
@alamb |
I think it is good that we default to Postgres dialect and case-sensitivity but I would like to make it possible for users to choose other dialects and case-sensitivity if there isn't a huge burden in doing so. I can see DataFusion being incredibly useful in projects where it is necessary to mimic another database system. |
Makes sense @wqc200 -- I will try and review this PR carefully tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the contribution @wqc200 -- this looks like a great start.
I would like to request that you add a test that changes the default value of case_sensitive
and demonstrate that errors still occur.
rust/datafusion/src/sql/planner.rs
Outdated
@@ -343,7 +346,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
} | |||
|
|||
/// Generate a logic plan from an SQL select | |||
fn select_to_plan(&self, select: &Select) -> Result<LogicalPlan> { | |||
pub fn select_to_plan(&self, select: &Select) -> Result<LogicalPlan> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be made pub
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this has nothing to do with the current PR, I'm using this function in my project, so I added pub
@@ -160,20 +160,26 @@ pub enum Expr { | |||
}, | |||
/// Represents the call of a built-in scalar function with a set of arguments. | |||
ScalarFunction { | |||
/// The input name of the function | |||
input_name: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I looked through the diff and I couldn't figure out what input_name
is actually used for. Is this new field needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mainly used to display the name of the function entered by the user.
md5 (Utf8("a")) instead of mD5(Utf8("a")) if there is no input_name.
For example:
mysql> select mD5('a');
+----------------------------------+
| mD5(Utf8("a")) |
+----------------------------------+
| 0cc175b9c0f1b6a831c399e269772661 |
+----------------------------------+
1 row in set (0.01 sec)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW here is what postgres does for the same query (uses the lowercase, canonical function name)
alamb=# select mD5('a');
md5
----------------------------------
0cc175b9c0f1b6a831c399e269772661
(1 row)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, i fix it, use the lowercase in postgress. I add a enum with two feild, one is LikeMySQL, another is LikePostgreSQL.
select mD5('a');
md5
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @wqc200 -- I think this PR's code is now ready to go except for are some basic tests that show the difference in behavior when setting CaseStyle
and cast_sensitive
(not changing the defaults).
The purpose of such tests are to ensure no one breaks this feature accidentally when making changes elsewhere and in the future.
I am willing to write these tests this PR can be merged given how much time you have spent on this already -- let me know if you would like help to do so
Ok |
merge official branch
merge offical branch
This PR has been more than 1 month, when to merge? |
@wqc200 -- from my perspective all this PR is waiting for is tests:
I do have writing such tests on my personal work list but I have not yet had a chance to write them |
merge from official branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While trying to write tests for this PR I discovered some inconsistencies:
-
Aggregate functions already (prior to this PR) are checked using case insensitive comparison (e.g.
select MAX(x)
andselect max(x)
both work - the code is here -
scalar functions are checked using case sensitive comparison
This PR has strange behavior that the case_sensitive
setting only applies to scalar functions (and aggregates still always use case insensitive comparisons).
Thus I plan to break this PR into two new PRs -- one for the CaseStyle
setting and one that proposes to make function name lookup always case insensitive (and thus consistent between aggregates and scalar functions)
I spent some time working on case sensitive functions this morning and have a proposed PR: #9827. I think the exercise of writing tests to demonstrate the behavior was a good one. After #9827 is sorted out, I plan to give similar treatment to the part of this PR that defies how identifiers are shown in the query output |
merge from official branch
…r functions and aggregates Broken out from #9600 by @wqc200. Note this does not contain the part of #9600 that controls the output display of functions. # Rationale Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75) However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison: ``` alamb=# select sqrt(x) from foo; sqrt ------ (0 rows) alamb=# select SQRT(x) from foo; sqrt ------ (0 rows) ``` # Changes Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard) Adds tests that demonstrate the behavior # Notes This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names `"my_sqrt"` and `"MY_SQRT"` previously they could both be called individually. After this PR `my_sqrt` will be called unless the user specifically put `"SQRT"` (in quotes) in their query. Closes #9827 from alamb/case_insensitive_functions Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
…r functions and aggregates Broken out from apache#9600 by @wqc200. Note this does not contain the part of apache#9600 that controls the output display of functions. # Rationale Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75) However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison: ``` alamb=# select sqrt(x) from foo; sqrt ------ (0 rows) alamb=# select SQRT(x) from foo; sqrt ------ (0 rows) ``` # Changes Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard) Adds tests that demonstrate the behavior # Notes This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names `"my_sqrt"` and `"MY_SQRT"` previously they could both be called individually. After this PR `my_sqrt` will be called unless the user specifically put `"SQRT"` (in quotes) in their query. Closes apache#9827 from alamb/case_insensitive_functions Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
merge from official
The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository Please see the mailing-list thread for more details We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs. |
You're making me go back and forth. I've been working on this PR for months, but you just won't merge it.Now you have built a new warehouse, and I need to do this PR again. May I ask what you really want to do? |
Hi @wqc200 I am sorry this has been a frustrating experience for you. The reason I have not merged this PR is because it does not have tests for the new features that were added Also, we implemented the case insensitive comparison this PR describes in #9827 I suggest:
|
#10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories. Thank you for understanding and helping to make arrow-rs and datafusion better |
…r functions and aggregates Broken out from apache#9600 by @wqc200. Note this does not contain the part of apache#9600 that controls the output display of functions. # Rationale Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75) However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison: ``` alamb=# select sqrt(x) from foo; sqrt ------ (0 rows) alamb=# select SQRT(x) from foo; sqrt ------ (0 rows) ``` # Changes Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard) Adds tests that demonstrate the behavior # Notes This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names `"my_sqrt"` and `"MY_SQRT"` previously they could both be called individually. After this PR `my_sqrt` will be called unless the user specifically put `"SQRT"` (in quotes) in their query. Closes apache#9827 from alamb/case_insensitive_functions Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
…r functions and aggregates Broken out from apache#9600 by @wqc200. Note this does not contain the part of apache#9600 that controls the output display of functions. # Rationale Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75) However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison: ``` alamb=# select sqrt(x) from foo; sqrt ------ (0 rows) alamb=# select SQRT(x) from foo; sqrt ------ (0 rows) ``` # Changes Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard) Adds tests that demonstrate the behavior # Notes This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names `"my_sqrt"` and `"MY_SQRT"` previously they could both be called individually. After this PR `my_sqrt` will be called unless the user specifically put `"SQRT"` (in quotes) in their query. Closes apache#9827 from alamb/case_insensitive_functions Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
…r functions and aggregates Broken out from apache#9600 by @wqc200. Note this does not contain the part of apache#9600 that controls the output display of functions. # Rationale Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75) However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison: ``` alamb=# select sqrt(x) from foo; sqrt ------ (0 rows) alamb=# select SQRT(x) from foo; sqrt ------ (0 rows) ``` # Changes Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard) Adds tests that demonstrate the behavior # Notes This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names `"my_sqrt"` and `"MY_SQRT"` previously they could both be called individually. After this PR `my_sqrt` will be called unless the user specifically put `"SQRT"` (in quotes) in their query. Closes apache#9827 from alamb/case_insensitive_functions Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
SELECT database() and SELECT DATABASE () are the same and can be queried normally