Handling Nulls in Database Join Conditions #5241

wdanilo · 2023-02-05T22:54:34Z

This task is automatically imported from the old Task Issue Board and it was originally created by Radosław Waśko.
Original issue is here.

We planned to implement operations in the Table library so that Nothing == Nothing evaluates to True (unlike SQL preferred approach which would yield NULL/Undefined for NULL-NULL comparisons).

We also want the Database.Table to be consistent with that. We can achieve that by replacing the a = b operator with one of:

a IS NOT DISTINCT FROM b (works in Postgres) or a IS b (only works in SQLite),
a <=> b (only works in MySQL)
COALESCE(a = b, a IS NULL AND b IS NULL) - should work everwhere.

This however has some performance implications - all approaches I tried in Postgres were making it unable to exploit indexes set on the column on which the join was performed - which can be very bad for performance.
It seems that using the preferred operator (<=> for MySQL, IS for SQLite) did allow the database to use the existing index. But for Postgres I could find no way to make it use the index and handle nulls the way we want.

We need to discuss how we want to approach this null-handling - as there is a potential performance price for NULL=NULL, at least in some databases (Postgres). We also need to make sure that In-Memory table operations (both as Join_Condition.Equals and as Column Operation ==) are consistent with what we do for the database (or if we are not consistent, we need to ensure it is very clearly described to the user so that there are no surprises).

Comments:

References: the db fiddle I used for most of the testing
https://www.db-fiddle.com/f/sehxaqWyQroFmghUE4A4Pm/1
(Radosław Waśko - Dec 27, 2022)

The text was updated successfully, but these errors were encountered:

radeusgd · 2023-03-22T19:50:09Z

Probably related to this is that we may want stuff like Filter_Condition.Not_Equal 100 to be consistent between in-memory and Database.

Currently however, if a column contains a Nothing, in in-memory Not_Equal 100 will evaluate to True on that Nothing in memory, but to Undefined/NULL in Database.

So

t = table_builder [["X", [100, 50, Nothing]]]
r = t.filter "X" (Filter_Condition.Not_Equal 100)

will have different results in both backends (r.at "X" . to_vector will be [50, Nothing] in in-memory but just [50] in the Database backends).

However, since != does not take part in joins, we can probably hack it around by changing [X] != ? into [X] IS NULL OR [X] != ? for ? != Nothing.

But first I'd like to understand if we want this. For now I'm adding tests requiring that but marking them as pending and linking to this issue.

I encountered this a bit randomly when adapting some tests.

radeusgd · 2023-08-09T15:06:46Z

Note that allowing NULL = NULL can result in some unexpected results.

Let's say we have two tables:

t1:

id	col1
1	A
2	B
Nothing	C
Nothing	D

t2:

id	col2
3	X
2	Y
Nothing	Z
Nothing	W

Then t1.join t2 Join_Kind.Inner on="id", will result in:

id	col1	col2
2	B	Y
Nothing	C	Z
Nothing	D	Z
Nothing	C	W
Nothing	D	W

Because the NULL rows will all match with each other. This may sometimes be desirable, although it seems not necessarily good behaviour, as it can create a lot of matches when there are lots of NULLs in the key column.

Also note that this is the current behaviour of the in-memory backend. But whether this is desirable is another thing and remains to be discussed as part of this overall issue.
The example was inspired by confusion we encountered in @farmaazon's solution to 08.08 bookclub.

jdunkerley · 2024-01-02T16:51:34Z

Agreed that we will match database behaviour here. So closing this.

radeusgd · 2024-01-02T16:52:50Z

Superseded by #5156

wdanilo removed the State: unscheduled label Feb 6, 2023

wdanilo added this to the Beta Release milestone Feb 6, 2023

wdanilo removed the x-deprecated-release label Feb 6, 2023

jdunkerley self-assigned this Feb 14, 2023

jdunkerley added p-high Should be completed in the next sprint and removed p-low Low priority labels Sep 5, 2023

jdunkerley closed this as not planned Won't fix, can't repro, duplicate, stale Jan 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling Nulls in Database Join Conditions #5241

Handling Nulls in Database Join Conditions #5241

wdanilo commented Feb 5, 2023

radeusgd commented Mar 22, 2023

radeusgd commented Aug 9, 2023 •

edited

Loading

jdunkerley commented Jan 2, 2024

radeusgd commented Jan 2, 2024

Handling Nulls in Database Join Conditions #5241

Handling Nulls in Database Join Conditions #5241

Comments

wdanilo commented Feb 5, 2023

Comments:

radeusgd commented Mar 22, 2023

radeusgd commented Aug 9, 2023 • edited Loading

jdunkerley commented Jan 2, 2024

radeusgd commented Jan 2, 2024

radeusgd commented Aug 9, 2023 •

edited

Loading