Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add an sql-compatible is_in variant #36420

Closed
westonpace opened this issue Jun 30, 2023 · 10 comments · Fixed by #36739
Closed

[C++] Add an sql-compatible is_in variant #36420

westonpace opened this issue Jun 30, 2023 · 10 comments · Fixed by #36739

Comments

@westonpace
Copy link
Member

Describe the enhancement requested

This could be a new compute function or it could be new options for the existing is_in function. Currently, the way is_in treats null does not match SQL semantics which makes it difficult to use when interoperability is expected.

The current behavior is summed up nicely in the SetLookupOptions::skip_nulls description:

/// Whether nulls in `value_set` count for lookup.
///
/// If true, any null in `value_set` is ignored and nulls in the input
/// produce null (IndexIn) or false (IsIn) values in the output.
/// If false, any null in `value_set` is successfully matched in
/// the input.

Or, to put it another way. The behavior we currently get (regardless of skip_nulls) is:

Input Value Lookup Set Skip Nulls Result
5 [1, NULL] any value false
5 [5, NULL] any value true
NULL [5] any value false
NULL [5, NULL] true false
NULL [5, NULL] false true

This makes perfect sense for the "null as sentinel" understanding of null. However, SQL typically takes a "null is unknown" approach. In which case we would want the following behavior:

Input Value Lookup Set Skip Nulls Result
5 [1, NULL] any value NULL
5 [5, NULL] any value true
NULL [5] any value NULL
NULL [5, NULL] any value NULL

Component(s)

C++

@R-JunmingChen
Copy link
Contributor

take

@R-JunmingChen
Copy link
Contributor

This could be a new compute function or it could be new options for the existing is_in function.

Let me investigate, I will leave a message about my choice before I implement it

@westonpace
Copy link
Member Author

Thanks

@R-JunmingChen
Copy link
Contributor

I think adding a new Function named in could be a good way. Let me implement it

@R-JunmingChen
Copy link
Contributor

R-JunmingChen commented Jul 18, 2023

Hi, @westonpace, I have implemented an In Function. However, the test codes for is_in in scalar_set_lookup_test.cc are very large. Do I need to add a coresponding amount of tests or I just need to test the different parts between In and is_in? Besides, do I also need to add a benchmark for In in scalar_set_lookup_benchmark.cc?

@jorisvandenbossche
Copy link
Member

@westonpace essentially, the behaviour you propose is to have a variant of is_in that is actually the equivalent of (arr == val1) or (arr == val2) or ..., right?
(i.e. how you can explain the "is_in" function, being an easier and more performant version of checking equality for multiple values, although at the moment it's not exactly equivalent when it comes to null handling)

@westonpace
Copy link
Member Author

@westonpace essentially, the behaviour you propose is to have a variant of is_in that is actually the equivalent of (arr == val1) or (arr == val2) or ..., right?

Yes. That is correct. I encountered this when working on #34834 because Substrait translates SQL's WHERE x IN [1, 5] to a SingularOrList and I want to translate that SingularOrList to is_in to get better performance.

@ianmcook
Copy link
Member

ianmcook commented Aug 2, 2023

I have some doubt about this one case, in which the lookup set contains a null but the input value is non-null:

Input Value Lookup Set Skip Nulls Result
5 [1, NULL] any value NULL

I think it is not so simple to say that SQL engines return a null result in this case.

For example, DuckDB returns a different result in this case depending on whether you are using IN or list_contains:

duckdb> SELECT 5 IN (VALUES (1), (NULL)) AS result;
┌────────┐
│ result │
╞════════╡
│  NULL  │
└────────┘

duckdb> SELECT list_contains([1,  NULL], 5) AS result;
┌────────┐
│ result │
╞════════╡
│  false │
└────────┘

I wonder whether we really need to add two new options, one to handle the semantics of nulls in the value set and one to handle the semantics of nulls in the lookup set.

@ianmcook
Copy link
Member

ianmcook commented Aug 2, 2023

P.S. to be clear: the behavior of DuckDB's list_contains cannot be reproduced using the existing is_in options. The case that demonstrates that is the last one in Weston's table:

Input Value Lookup Set Skip Nulls Result
NULL [5, NULL] any value NULL
duckdb> SELECT list_contains([5, NULL], NULL) AS result;
┌────────┐
│ result │
╞════════╡
│  NULL  │
└────────┘

This is why I think maybe we need two new options.

@ianmcook
Copy link
Member

ianmcook commented Aug 2, 2023

Regarding my two comments above ☝️
I suggested a solution here: #36739 (comment)

@bkietz bkietz added this to the 14.0.0 milestone Sep 22, 2023
bkietz added a commit that referenced this issue Sep 22, 2023
### Rationale for this change
As #36420 says, we want add an sql-compatible `is_in` variant, which has a different logic handling Null. After a dicussion with @ ianmcook and @ bkietz, we decide to support an enum option `null_matching_behavior` for SetLookup, which actually adds two semantics of null handling for `is_in` and doesn't add an new behavior for `index_in`.

The enum option `null_matching_behavior` will replace `skip_nulls` in the future.

### What changes are included in this PR?
Add an enum parameter `null_matching_behavior` for SetLookupOptions.

### Are these changes tested?
Two kinds of tests are implemented
- Replace default parameter with `null_matching_behavior` instead of `skip_nulls` for `is_in` and `index_in` tests 
- Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in`

Besides, since the `skip_nulls` is not deprecated now, I still preserve the old tests with `skip_nulls`. When the `skip_nulls` is totally deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` for these old tests.

### Are there any user-facing changes?
No. Currently we support backward compatibility. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely.

* Closes: #36420

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
etseidl pushed a commit to etseidl/arrow that referenced this issue Sep 28, 2023
…e#36739)

### Rationale for this change
As apache#36420 says, we want add an sql-compatible `is_in` variant, which has a different logic handling Null. After a dicussion with @ ianmcook and @ bkietz, we decide to support an enum option `null_matching_behavior` for SetLookup, which actually adds two semantics of null handling for `is_in` and doesn't add an new behavior for `index_in`.

The enum option `null_matching_behavior` will replace `skip_nulls` in the future.

### What changes are included in this PR?
Add an enum parameter `null_matching_behavior` for SetLookupOptions.

### Are these changes tested?
Two kinds of tests are implemented
- Replace default parameter with `null_matching_behavior` instead of `skip_nulls` for `is_in` and `index_in` tests 
- Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in`

Besides, since the `skip_nulls` is not deprecated now, I still preserve the old tests with `skip_nulls`. When the `skip_nulls` is totally deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` for these old tests.

### Are there any user-facing changes?
No. Currently we support backward compatibility. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely.

* Closes: apache#36420

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…e#36739)

### Rationale for this change
As apache#36420 says, we want add an sql-compatible `is_in` variant, which has a different logic handling Null. After a dicussion with @ ianmcook and @ bkietz, we decide to support an enum option `null_matching_behavior` for SetLookup, which actually adds two semantics of null handling for `is_in` and doesn't add an new behavior for `index_in`.

The enum option `null_matching_behavior` will replace `skip_nulls` in the future.

### What changes are included in this PR?
Add an enum parameter `null_matching_behavior` for SetLookupOptions.

### Are these changes tested?
Two kinds of tests are implemented
- Replace default parameter with `null_matching_behavior` instead of `skip_nulls` for `is_in` and `index_in` tests 
- Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in`

Besides, since the `skip_nulls` is not deprecated now, I still preserve the old tests with `skip_nulls`. When the `skip_nulls` is totally deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` for these old tests.

### Are there any user-facing changes?
No. Currently we support backward compatibility. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely.

* Closes: apache#36420

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…e#36739)

### Rationale for this change
As apache#36420 says, we want add an sql-compatible `is_in` variant, which has a different logic handling Null. After a dicussion with @ ianmcook and @ bkietz, we decide to support an enum option `null_matching_behavior` for SetLookup, which actually adds two semantics of null handling for `is_in` and doesn't add an new behavior for `index_in`.

The enum option `null_matching_behavior` will replace `skip_nulls` in the future.

### What changes are included in this PR?
Add an enum parameter `null_matching_behavior` for SetLookupOptions.

### Are these changes tested?
Two kinds of tests are implemented
- Replace default parameter with `null_matching_behavior` instead of `skip_nulls` for `is_in` and `index_in` tests 
- Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in`

Besides, since the `skip_nulls` is not deprecated now, I still preserve the old tests with `skip_nulls`. When the `skip_nulls` is totally deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` for these old tests.

### Are there any user-facing changes?
No. Currently we support backward compatibility. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely.

* Closes: apache#36420

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…e#36739)

### Rationale for this change
As apache#36420 says, we want add an sql-compatible `is_in` variant, which has a different logic handling Null. After a dicussion with @ ianmcook and @ bkietz, we decide to support an enum option `null_matching_behavior` for SetLookup, which actually adds two semantics of null handling for `is_in` and doesn't add an new behavior for `index_in`.

The enum option `null_matching_behavior` will replace `skip_nulls` in the future.

### What changes are included in this PR?
Add an enum parameter `null_matching_behavior` for SetLookupOptions.

### Are these changes tested?
Two kinds of tests are implemented
- Replace default parameter with `null_matching_behavior` instead of `skip_nulls` for `is_in` and `index_in` tests 
- Add tests for `NullMatchingBehavior::EMIT_NULL` and `NullMatchingBehavior::INCONCLUSIVE` for `is_in`

Besides, since the `skip_nulls` is not deprecated now, I still preserve the old tests with `skip_nulls`. When the `skip_nulls` is totally deprecated, we can replace the test parameter `skip_nulls=false` with `null_matching_behavior=MATCH` and `skip_nulls=true` with `null_matching_behavior=SKIP` for these old tests.

### Are there any user-facing changes?
No. Currently we support backward compatibility. In the future, we plan to replace `skip_nulls` with `null_matching_behavior` completely.

* Closes: apache#36420

Lead-authored-by: Junming Chen <junming.chen.r@outlook.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants