Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11822: [Rust][Datafusion] Support case sensitive for function #9600

Closed
wants to merge 44 commits into from
Closed

Conversation

wqc200
Copy link
Contributor

@wqc200 wqc200 commented Feb 28, 2021

SELECT database() and SELECT DATABASE () are the same and can be queried normally

@github-actions
Copy link

@wqc200 wqc200 changed the title ARROW-11822: [Rust] Support case sensitive for function ARROW-11822: [Rust][datafusion] Support case sensitive for function Feb 28, 2021
@wqc200 wqc200 changed the title ARROW-11822: [Rust][datafusion] Support case sensitive for function ARROW-11822: [Rust][Datafusion] Support case sensitive for function Feb 28, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wqc200 -- this is great functionality to have.

I would personally like to have datafusion follow postgres when it comes to case sensitivity (rather than having it be an option).

@andygrove / @seddonm1 / @Dandandan what do you think about making DF follow postgres's lead (and effectively convert everything to lowercase, except identifiers explicitly quoted.

For example Database --> database but "Database" --> Database?

Here is a blog post that I found that talks about posgres sensitivity
https://blog.xojo.com/2016/09/28/about-postgresql-case-sensitivity/

@Dandandan
Copy link
Contributor

This would be great to have indeed (and start doing some more testing on, I think ATM we are both case sensitive and case insensitive.

I agree it would be best to start having it default to using PostgreSQL case insensitivity without it being an option, to avoid making the code more complex than needed.

@wqc200
Copy link
Contributor Author

wqc200 commented Feb 28, 2021

This would be great to have indeed (and start doing some more testing on, I think ATM we are both case sensitive and case insensitive.

I agree it would be best to start having it default to using PostgreSQL case insensitivity without it being an option, to avoid making the code more complex than needed.

Thank you very much!I'm going to do this in two steps, step one: support case-sensitive functionality, step two: make it case-insensitive by default, and fix the test cases

@wqc200
Copy link
Contributor Author

wqc200 commented Feb 28, 2021

Thanks @wqc200 -- this is great functionality to have.

I would personally like to have datafusion follow postgres when it comes to case sensitivity (rather than having it be an option).

@andygrove / @seddonm1 / @Dandandan what do you think about making DF follow postgres's lead (and effectively convert everything to lowercase, except identifiers explicitly quoted.

For example Database --> database but "Database" --> Database?

Here is a blog post that I found that talks about posgres sensitivity
https://blog.xojo.com/2016/09/28/about-postgresql-case-sensitivity/

@alamb
I'm now using DataFusion which is in the MySQL protocol. If I didn't make the case sensitive option, I would have a problem here.But we can make the default consistent with PostGRE

@andygrove
Copy link
Member

I think it is good that we default to Postgres dialect and case-sensitivity but I would like to make it possible for users to choose other dialects and case-sensitivity if there isn't a huge burden in doing so.

I can see DataFusion being incredibly useful in projects where it is necessary to mimic another database system.

@alamb
Copy link
Contributor

alamb commented Feb 28, 2021

Thank you very much!I'm going to do this in two steps, step one: support case-sensitive functionality, step two: make it case-insensitive by default, and fix the test cases

Makes sense @wqc200 -- I will try and review this PR carefully tomorrow.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the contribution @wqc200 -- this looks like a great start.

I would like to request that you add a test that changes the default value of case_sensitive and demonstrate that errors still occur.

rust/datafusion/src/sql/planner.rs Outdated Show resolved Hide resolved
rust/datafusion/src/execution/context.rs Show resolved Hide resolved
rust/arrow-flight/src/arrow.flight.protocol.rs Outdated Show resolved Hide resolved
rust/datafusion/src/execution/context.rs Outdated Show resolved Hide resolved
@@ -343,7 +346,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}

/// Generate a logic plan from an SQL select
fn select_to_plan(&self, select: &Select) -> Result<LogicalPlan> {
pub fn select_to_plan(&self, select: &Select) -> Result<LogicalPlan> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be made pub?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this has nothing to do with the current PR, I'm using this function in my project, so I added pub

@@ -160,20 +160,26 @@ pub enum Expr {
},
/// Represents the call of a built-in scalar function with a set of arguments.
ScalarFunction {
/// The input name of the function
input_name: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I looked through the diff and I couldn't figure out what input_name is actually used for. Is this new field needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mainly used to display the name of the function entered by the user.
md5 (Utf8("a")) instead of mD5(Utf8("a")) if there is no input_name.

For example:

mysql> select mD5('a');
+----------------------------------+
| mD5(Utf8("a")) |
+----------------------------------+
| 0cc175b9c0f1b6a831c399e269772661 |
+----------------------------------+
1 row in set (0.01 sec)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW here is what postgres does for the same query (uses the lowercase, canonical function name)

alamb=# select mD5('a');
               md5                
----------------------------------
 0cc175b9c0f1b6a831c399e269772661
(1 row)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, i fix it, use the lowercase in postgress. I add a enum with two feild, one is LikeMySQL, another is LikePostgreSQL.

select mD5('a');
md5

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @wqc200 -- I think this PR's code is now ready to go except for are some basic tests that show the difference in behavior when setting CaseStyle and cast_sensitive (not changing the defaults).

The purpose of such tests are to ensure no one breaks this feature accidentally when making changes elsewhere and in the future.

I am willing to write these tests this PR can be merged given how much time you have spent on this already -- let me know if you would like help to do so

@wqc200
Copy link
Contributor Author

wqc200 commented Mar 20, 2021

Thank you very much @wqc200 -- I think this PR's code is now ready to go except for are some basic tests that show the difference in behavior when setting CaseStyle and cast_sensitive (not changing the defaults).

The purpose of such tests are to ensure no one breaks this feature accidentally when making changes elsewhere and in the future.

I am willing to write these tests this PR can be merged given how much time you have spent on this already -- let me know if you would like help to do so

Ok

@wqc200
Copy link
Contributor Author

wqc200 commented Mar 24, 2021

@alamb
@Dandandan

This PR has been more than 1 month, when to merge?

@alamb
Copy link
Contributor

alamb commented Mar 24, 2021

@wqc200 -- from my perspective all this PR is waiting for is tests:

some basic tests that show the difference in behavior when setting CaseStyle and cast_sensitive (not changing the defaults).

I do have writing such tests on my personal work list but I have not yet had a chance to write them

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While trying to write tests for this PR I discovered some inconsistencies:

  1. Aggregate functions already (prior to this PR) are checked using case insensitive comparison (e.g. select MAX(x) and select max(x) both work - the code is here

  2. scalar functions are checked using case sensitive comparison

This PR has strange behavior that the case_sensitive setting only applies to scalar functions (and aggregates still always use case insensitive comparisons).

Thus I plan to break this PR into two new PRs -- one for the CaseStyle setting and one that proposes to make function name lookup always case insensitive (and thus consistent between aggregates and scalar functions)

@alamb
Copy link
Contributor

alamb commented Mar 28, 2021

I spent some time working on case sensitive functions this morning and have a proposed PR: #9827. I think the exercise of writing tests to demonstrate the behavior was a good one.

After #9827 is sorted out, I plan to give similar treatment to the part of this PR that defies how identifiers are shown in the query output

alamb added a commit that referenced this pull request Apr 5, 2021
…r functions and aggregates

Broken out from #9600 by @wqc200. Note this does not contain the part of #9600 that controls the output display of functions.

# Rationale

Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75)

However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison:

```
alamb=# select sqrt(x) from foo;
 sqrt
------
(0 rows)

alamb=# select SQRT(x) from foo;
 sqrt
------
(0 rows)
```

# Changes

Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard)

Adds tests that demonstrate the behavior

# Notes

This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names
`"my_sqrt"` and `"MY_SQRT"` previously they could both be called
individually. After this PR `my_sqrt` will be called unless the user
specifically put `"SQRT"` (in quotes) in their query.

Closes #9827 from alamb/case_insensitive_functions

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
pachadotdev pushed a commit to pachadotdev/arrow that referenced this pull request Apr 6, 2021
…r functions and aggregates

Broken out from apache#9600 by @wqc200. Note this does not contain the part of apache#9600 that controls the output display of functions.

# Rationale

Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75)

However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison:

```
alamb=# select sqrt(x) from foo;
 sqrt
------
(0 rows)

alamb=# select SQRT(x) from foo;
 sqrt
------
(0 rows)
```

# Changes

Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard)

Adds tests that demonstrate the behavior

# Notes

This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names
`"my_sqrt"` and `"MY_SQRT"` previously they could both be called
individually. After this PR `my_sqrt` will be called unless the user
specifically put `"SQRT"` (in quotes) in their query.

Closes apache#9827 from alamb/case_insensitive_functions

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb
Copy link
Contributor

alamb commented Apr 9, 2021

@wqc200 do you still want / need the functionality in this PR to preserve the original case of function calls? (aka show DATABASE in the output rather thandatabase)? The code for the case insensitive identifier comparison part was added in #9827.

If so, I will find some time to get a PR in shape

@alamb
Copy link
Contributor

alamb commented Apr 19, 2021

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository

Please see the mailing-list thread for more details

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

@wqc200
Copy link
Contributor Author

wqc200 commented Apr 27, 2021

@andygrove @alamb

You're making me go back and forth. I've been working on this PR for months, but you just won't merge it.Now you have built a new warehouse, and I need to do this PR again. May I ask what you really want to do?

@alamb
Copy link
Contributor

alamb commented Apr 27, 2021

Hi @wqc200 I am sorry this has been a frustrating experience for you. The reason I have not merged this PR is because it does not have tests for the new features that were added

Also, we implemented the case insensitive comparison this PR describes in #9827

I suggest:

  1. Review the current version of datafusion (which includes the case comparison ARROW-11822: [Rust][Datafusion] Support case sensitive comparisons for functions and aggregates #9827) and see if it works for you
  2. Pull out the change that allows the user to control the output case into a separate PR (against the new repo) with tests showing how it works

@alamb
Copy link
Contributor

alamb commented May 3, 2021

#10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process

Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories.

Thank you for understanding and helping to make arrow-rs and datafusion better

@alamb alamb closed this May 3, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…r functions and aggregates

Broken out from apache#9600 by @wqc200. Note this does not contain the part of apache#9600 that controls the output display of functions.

# Rationale

Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75)

However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison:

```
alamb=# select sqrt(x) from foo;
 sqrt
------
(0 rows)

alamb=# select SQRT(x) from foo;
 sqrt
------
(0 rows)
```

# Changes

Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard)

Adds tests that demonstrate the behavior

# Notes

This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names
`"my_sqrt"` and `"MY_SQRT"` previously they could both be called
individually. After this PR `my_sqrt` will be called unless the user
specifically put `"SQRT"` (in quotes) in their query.

Closes apache#9827 from alamb/case_insensitive_functions

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
…r functions and aggregates

Broken out from apache#9600 by @wqc200. Note this does not contain the part of apache#9600 that controls the output display of functions.

# Rationale

Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75)

However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison:

```
alamb=# select sqrt(x) from foo;
 sqrt
------
(0 rows)

alamb=# select SQRT(x) from foo;
 sqrt
------
(0 rows)
```

# Changes

Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard)

Adds tests that demonstrate the behavior

# Notes

This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names
`"my_sqrt"` and `"MY_SQRT"` previously they could both be called
individually. After this PR `my_sqrt` will be called unless the user
specifically put `"SQRT"` (in quotes) in their query.

Closes apache#9827 from alamb/case_insensitive_functions

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…r functions and aggregates

Broken out from apache#9600 by @wqc200. Note this does not contain the part of apache#9600 that controls the output display of functions.

# Rationale

Aggregate functions are checked using case insensitive comparison (e.g. `select MAX(x)` and `select max(x)` both work - the code is [here](https://github.com/apache/arrow/blob/356c300c5ee1e2b23a83652514af11e3a731d596/rust/datafusion/src/physical_plan/aggregates.rs#L75)

However, scalar functions, user defined aggregates, and user defined functions, are checked using case sensitive comparisons (e.g. `select sqrt(x)` works while `select SQRT` does not. Postgres always uses case insensitive comparison:

```
alamb=# select sqrt(x) from foo;
 sqrt
------
(0 rows)

alamb=# select SQRT(x) from foo;
 sqrt
------
(0 rows)
```

# Changes

Always use case insensitive comparisons for unquoted identifier comparison, both for consistency within DataFusion as well as consistency with Postgres (and the SQL standard)

Adds tests that demonstrate the behavior

# Notes

This PR changes how user defined functions are resolved in SQL queries. If a user registers two functions with names
`"my_sqrt"` and `"MY_SQRT"` previously they could both be called
individually. After this PR `my_sqrt` will be called unless the user
specifically put `"SQRT"` (in quotes) in their query.

Closes apache#9827 from alamb/case_insensitive_functions

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants