Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling Nulls in Database Join Conditions #5241

Closed
wdanilo opened this issue Feb 5, 2023 · 4 comments
Closed

Handling Nulls in Database Join Conditions #5241

wdanilo opened this issue Feb 5, 2023 · 4 comments
Assignees
Labels
-libs Libraries: New libraries to be implemented l-join p-high Should be completed in the next sprint x-new-feature Type: new feature request
Milestone

Comments

@wdanilo
Copy link
Member

wdanilo commented Feb 5, 2023

This task is automatically imported from the old Task Issue Board and it was originally created by Radosław Waśko.
Original issue is here.


We planned to implement operations in the Table library so that Nothing == Nothing evaluates to True (unlike SQL preferred approach which would yield NULL/Undefined for NULL-NULL comparisons).

We also want the Database.Table to be consistent with that. We can achieve that by replacing the a = b operator with one of:

  • a IS NOT DISTINCT FROM b (works in Postgres) or a IS b (only works in SQLite),
  • a <=> b (only works in MySQL)
  • COALESCE(a = b, a IS NULL AND b IS NULL) - should work everwhere.

This however has some performance implications - all approaches I tried in Postgres were making it unable to exploit indexes set on the column on which the join was performed - which can be very bad for performance.
It seems that using the preferred operator (<=> for MySQL, IS for SQLite) did allow the database to use the existing index. But for Postgres I could find no way to make it use the index and handle nulls the way we want.

We need to discuss how we want to approach this null-handling - as there is a potential performance price for NULL=NULL, at least in some databases (Postgres). We also need to make sure that In-Memory table operations (both as Join_Condition.Equals and as Column Operation ==) are consistent with what we do for the database (or if we are not consistent, we need to ensure it is very clearly described to the user so that there are no surprises).

Comments:

References: the db fiddle I used for most of the testing
https://www.db-fiddle.com/f/sehxaqWyQroFmghUE4A4Pm/1
(Radosław Waśko - Dec 27, 2022)


@wdanilo wdanilo added this to the Beta Release milestone Feb 6, 2023
@jdunkerley jdunkerley self-assigned this Feb 14, 2023
@radeusgd
Copy link
Member

Probably related to this is that we may want stuff like Filter_Condition.Not_Equal 100 to be consistent between in-memory and Database.

Currently however, if a column contains a Nothing, in in-memory Not_Equal 100 will evaluate to True on that Nothing in memory, but to Undefined/NULL in Database.

So

t = table_builder [["X", [100, 50, Nothing]]]
r = t.filter "X" (Filter_Condition.Not_Equal 100)

will have different results in both backends (r.at "X" . to_vector will be [50, Nothing] in in-memory but just [50] in the Database backends).

However, since != does not take part in joins, we can probably hack it around by changing [X] != ? into [X] IS NULL OR [X] != ? for ? != Nothing.

But first I'd like to understand if we want this. For now I'm adding tests requiring that but marking them as pending and linking to this issue.

I encountered this a bit randomly when adapting some tests.

@radeusgd
Copy link
Member

radeusgd commented Aug 9, 2023

Note that allowing NULL = NULL can result in some unexpected results.

Let's say we have two tables:

t1:

id col1
1 A
2 B
Nothing C
Nothing D

t2:

id col2
3 X
2 Y
Nothing Z
Nothing W

Then t1.join t2 Join_Kind.Inner on="id", will result in:

id col1 col2
2 B Y
Nothing C Z
Nothing D Z
Nothing C W
Nothing D W

Because the NULL rows will all match with each other. This may sometimes be desirable, although it seems not necessarily good behaviour, as it can create a lot of matches when there are lots of NULLs in the key column.

Also note that this is the current behaviour of the in-memory backend. But whether this is desirable is another thing and remains to be discussed as part of this overall issue.
The example was inspired by confusion we encountered in @farmaazon's solution to 08.08 bookclub.

@jdunkerley jdunkerley added p-high Should be completed in the next sprint and removed p-low Low priority labels Sep 5, 2023
@jdunkerley
Copy link
Member

Agreed that we will match database behaviour here. So closing this.

@jdunkerley jdunkerley closed this as not planned Won't fix, can't repro, duplicate, stale Jan 2, 2024
@radeusgd
Copy link
Member

radeusgd commented Jan 2, 2024

Superseded by #5156

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-libs Libraries: New libraries to be implemented l-join p-high Should be completed in the next sprint x-new-feature Type: new feature request
Projects
Archived in project
Development

No branches or pull requests

3 participants