SORTED Constraint #8444

Mytherin · 2021-11-05T13:57:32Z

Mytherin
Nov 5, 2021
Maintainer

We had the idea of allowing a SORTED constraint to be specified on a column, e.g.:

CREATE TABLE sensor_data(
   ts TIMESTAMP,
   measurement INTEGER,
   ...,
   SORTED(ts)
);

The sorted constraint enforces that the data is inserted in sorted order w.r.t. any columns that have a sorted constraint on them (i.e. inserting data in unsorted order throws an error). Updating the columns is not allowed.
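
To make the intended semantics concrete, here is a minimal Python sketch of an append-only column that rejects out-of-order inserts. It is a toy model, not DuckDB code: the class name `SortedColumn` and the method `append_batch` are hypothetical, and allowing equal neighbouring values (non-strict order) is an assumption the proposal does not spell out.

```python
class SortedColumn:
    """Toy append-only column that rejects out-of-order inserts (hypothetical)."""

    def __init__(self):
        self.values = []

    def append_batch(self, batch):
        # Validate the whole batch before mutating anything, so a failed
        # insert leaves the column unchanged.
        last = self.values[-1] if self.values else None
        for value in batch:
            if last is not None and value < last:
                raise ValueError(f"SORTED constraint violated: {value} < {last}")
            last = value
        self.values.extend(batch)
```

Appending `[1, 3]` and then `[3, 5]` succeeds, while appending `[6, 2]` raises and leaves the stored values untouched.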

The sorted constraint can then be propagated through the query plan as part of statistics propagation and used to optimize e.g. window functions and merge joins (avoiding an unnecessary sort of already sorted data), and potentially even aggregates (when grouping by a coarser granularity of the sorted column, e.g. YEAR(ts), MONTH(ts)).
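
As an illustration of the aggregate case, the Python sketch below groups rows by (year, month) in a single streaming pass. It assumes the input is already sorted by `ts`, so each group is contiguous and no hash table or sort is needed. The function name `monthly_sums` and the row layout `(ts, measurement)` are hypothetical, chosen to mirror the example table.

```python
from datetime import datetime
from itertools import groupby


def monthly_sums(rows):
    """Sum measurements per (year, month); rows must be sorted by ts."""
    def key(row):
        return (row[0].year, row[0].month)

    # groupby only merges adjacent rows with equal keys, which is sufficient
    # here because sortedness on ts makes every (year, month) group contiguous.
    return [(k, sum(m for _, m in group)) for k, group in groupby(rows, key=key)]
```

On unsorted input the same month could show up as several separate groups, which is exactly the case the constraint would rule out.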

We could also track "accidental" sortedness during insertion and propagate the same sortedness if columns happen to be sorted, but having a constraint to enforce this behavior and prevent surprises seems like a good idea.

Alex-Monahan · 2021-11-05T14:24:43Z

Alex-Monahan
Nov 5, 2021
Collaborator

This sounds awesome for time series data!

lnkuiper · 2021-12-07T14:03:37Z

lnkuiper
Dec 7, 2021
Collaborator

I might pick this up at some point. Just wondering about a few things:

(i.e. inserting data in unsorted order throws an error)

Can we not just sort the data that is inserted if it is not sorted already? Or do we want to force users to do this explicitly by doing e.g. INSERT INTO foo SELECT * FROM bar ORDER BY baz so that they are not surprised when insertion takes longer due to the sort?

If there is already data in the table, then we need to make sure that all the values being inserted are greater than the values that are already in there, which we can check with our statistics, then sort it.
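
A rough Python sketch of that idea, under stated assumptions: `max_key` stands in for the column's max statistic, the names `SortedTable` and `insert` are hypothetical, and values equal to the current maximum are accepted (the comment above says "greater than", so strictness is an open choice).

```python
class SortedTable:
    """Toy table: sorts each incoming batch, rejects batches that overlap old data."""

    def __init__(self):
        self.rows = []
        self.max_key = None  # stand-in for the column's max statistic

    def insert(self, batch):
        if not batch:
            return
        # One comparison against the statistic replaces a scan of the table.
        if self.max_key is not None and min(batch) < self.max_key:
            raise ValueError("batch contains values below the current maximum")
        self.rows.extend(sorted(batch))
        self.max_key = self.rows[-1]
```

Inserting `[3, 1]` into an empty table stores `[1, 3]`; a later batch `[2]` is rejected because placing it correctly would require moving existing rows.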

updating the columns is not allowed

Deletions should be fine, but disallowing all updates may be easier.

Additionally, what about ASC/DESC or NULLS FIRST/LAST, or multiple order clauses? Some people model date by multiple columns like the TPC-DS customer table i.e. c_birth_year, c_birth_month, c_birth_day (not saying this is a good idea, but it happens ...). Could the constraint perhaps look like this?

CREATE TABLE sensor_data(
  year INT,
  month INT,
  day INT,
  ...,
  SORTED(year ASC NULLS LAST, month, day) -- or ORDERED for more SQLness?
);

Of course, detecting sortedness of combinations of columns automatically during insertion is combinatorially difficult.

Mytherin · 2021-12-07T15:59:08Z

Mytherin
Dec 7, 2021
Maintainer Author

Can we not just sort the data that is inserted if it is not sorted already? Or do we want to force users to do this explicitly by doing e.g. INSERT INTO foo SELECT * FROM bar ORDER BY baz so that they are not surprised when insertion takes longer due to the sort?

That doesn't easily work, because if I have a table with the values [1, 3] and I want to insert the value 2, I need to start restructuring my entire table and moving tuples around. That is more akin to an index than a constraint, and more difficult to implement.

Deletions should be fine, but disallowing all updates may be easier.

Deletes are indeed fine; when I say update, I mean the SQL UPDATE.

Additionally, what about ASC/DESC or NULLS FIRST/LAST, or multiple order clauses? Some people model date by multiple columns like the TPC-DS customer table i.e. c_birth_year, c_birth_month, c_birth_day (not saying this is a good idea, but it happens ...). Could the constraint perhaps look like this?

Yes, why not. That should be fine, although of course a bit harder to use in a query.

ajzo90 · 2022-03-14T11:06:26Z

ajzo90
Mar 14, 2022

Consider using ORDER BY instead of SORTED for defining the constraint. It's consistent with the operator and can share syntax with it.
If it is intentionally different to avoid human interpretation issues, perhaps ORDER BY CONSTRAINT is more clear?

Related:
I would like to have this possibility when reading data from external sources/streams as well. In that case it would make sense to force/set the sorted property only for internal propagation, without validation.

fabianoliver · 2022-03-14T14:04:14Z

fabianoliver
Mar 14, 2022

This sounds great - as a small follow-up from #3207 , I thought I'd add two small ideas in here:

1. Rather than/in addition to saying "the entire table is sorted by column X", it could be useful to signify "the data for each subgroup matching on some key X is sorted by Y"
2. It would be useful to support inserting non-sorted data. And in particular, efficient inserts of almost sorted data

Right, I probably phrased this terribly, so maybe an example is better.

Let's say you have a table that stores prices for financial instruments. You have three columns: Instrument, Time, and Price.

For point no 1: Quite possibly, you'll be able to write data in a time-consecutive order per financial instrument, but not necessarily globally. (Imagine e.g. you're processing all market updates for Microsoft on Thread A, and all market events for Apple on Thread B; it'd be relatively easy to ensure you'll write updates for each security in ascending time-order, but you don't really want/need to synchronise these independent threads/securities; you might end up writing an update for Microsoft that is more recent than Microsoft's last price, but older than Apple's latest inserted price). And in practice, very many queries you'd run later on would be grouped by security as well, potentially benefiting from sort-awareness.
So a useful constraint could be: "When grouping this table by Instrument, then each group's Time-values are sorted".
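
A minimal Python sketch of what checking such a constraint could look like. It assumes the last Time per Instrument can be kept in a lookup structure (here a plain dict), which is an assumption of this sketch and not something the proposal specifies. The function name `check_grouped_sorted` and the row layout are hypothetical.

```python
def check_grouped_sorted(rows):
    """Return True if Time is non-decreasing within each Instrument.

    rows: iterable of (instrument, time, price) in insertion order.
    """
    last_time = {}  # instrument -> most recent time seen for it
    for instrument, time, _price in rows:
        if instrument in last_time and time < last_time[instrument]:
            return False
        last_time[instrument] = time
    return True
```

The rows `("MSFT", 1)`, `("AAPL", 5)`, `("MSFT", 3)`, `("AAPL", 6)` pass this check even though the Time column is not globally sorted.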

For point no 2: Staying with the same example, not all market data for financial instruments always arrives in order. But it almost does.
What I mean is, it's quite possible you'll receive an update that is a few, say, microseconds older than the most recent value you stored already. So the data isn't quite in order. But it's exceptionally unlikely to receive/insert an update that is several, say, minutes (or hours, days, ...) older than the most recent one. So generally, you may need to re-sort your table/index, but it's extremely likely you'd only have to re-sort the last few dozen rows or such at most.
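
A small Python sketch of the "almost sorted" case, assuming a plain in-memory list: the new value is placed by scanning backwards from the tail, so the number of comparisons depends on how late the value is, not on the table size. The function name `insert_nearly_sorted` is hypothetical.

```python
def insert_nearly_sorted(values, x):
    """Insert x into the sorted list `values`, scanning from the tail.

    Returns how many existing elements ended up after x, i.e. how far
    back the late value had to travel.
    """
    i = len(values)
    # Walk back only past elements that are strictly greater than x.
    while i > 0 and values[i - 1] > x:
        i -= 1
    values.insert(i, x)
    return len(values) - 1 - i
```

For in-order arrivals the loop body never runs and the return value is 0; a value that is slightly late only displaces the last few rows.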

hawkfish · 2022-03-15T19:57:56Z

hawkfish
Mar 15, 2022
Collaborator

Consider using ORDER BY instead of SORTED for defining the constraint. It's consistent with the operator and can share syntax with it. If it is intentionally different to avoid human interpretation issues, perhaps ORDER BY CONSTRAINT is more clear?

I think maybe the guiding principle here should be to suggest the semantics? For me, ORDER BY implies that the data is actively sorted, and SORTED implies that it is a passive process that fails when new, inserted data is not sorted.

hawkfish · 2022-03-15T20:03:37Z

hawkfish
Mar 15, 2022
Collaborator

Thanks for the suggestions @fabianoliver - we love to hear from you!

This sounds great - as a small follow-up from #3207 , I thought I'd add two small ideas in here:

Rather than/in addition to saying "the entire table is sorted by column X", it could be useful to signify "the data for each subgroup matching on some key X is sorted by Y"

I suspect this would be quite difficult to enforce in practice as the insert task would have to scan the entire table to find the previous examples of the key. I think one of the goals here is to leverage "found ordering", so I think this use case would be better served by performing the sort manually into a table after insertion?

It would be useful to support inserting non-sorted data. And in particular, efficient inserts of almost sorted data

This might be more tractable. I can see performing a local sort on the data before appending it. This should catch most cases and not be especially slow. We would have to figure out how to merge the occasional case when the backtracking occurred across a COMMIT boundary, but I think that should be doable.

2023-08-01T00:33:31Z

github-actions[bot]
bot Aug 1, 2023

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

snth · 2023-08-01T20:21:45Z

snth
Aug 1, 2023

I can't remove the stale label or comment so pinging to hopefully keep this issue alive.

hawkfish · 2023-08-01T20:22:57Z

hawkfish
Aug 1, 2023
Collaborator

I think this needs to be converted into a discussion (i.e., feature request). Issues are really bugs.

yiteng-guo · 2023-09-20T21:48:20Z

yiteng-guo
Sep 20, 2023

Just want to add a few related points to this:

- We can support multiple columns via SORTED BY col1, col2, col3.
- Parquet has sorting_columns metadata. We can translate this to duckdb sorted constraints.
- If we do both, we are able to asof join with Parquet data without sorting at all.
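
To show why known sortedness removes the sort from an asof join, here is a hedged Python sketch of a merge-style asof join over two inputs that are both already sorted by timestamp. It is illustrative only and not how DuckDB implements ASOF joins. The function name `asof_join` and the tuple layouts are hypothetical.

```python
def asof_join(left, right):
    """For each left row, attach the latest right value with right.ts <= left.ts.

    left:  list of (ts, payload) sorted by ts
    right: list of (ts, value) sorted by ts
    Returns a list of (ts, payload, value_or_None).
    """
    out = []
    j = 0
    current = None  # most recent right value seen so far
    for ts, payload in left:
        # Advance the right cursor; it never moves backwards because
        # both inputs are sorted, so the join is a single linear pass.
        while j < len(right) and right[j][0] <= ts:
            current = right[j][1]
            j += 1
        out.append((ts, payload, current))
    return out
```

Each input is read once, front to back, which is only valid because the sort order is guaranteed up front.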

adriens · 2024-03-07T20:02:49Z

adriens
Mar 7, 2024

🙏 Please make this wonder happen 🙌

FrancoisLepoutre · 2024-03-08T03:18:13Z

FrancoisLepoutre
Mar 8, 2024

Thanks Mark and everyone on this thread for this pleasant and instructive conversation! I am one of those who, in the past, have comfortably hacked SQL SELECT calls on pre-sorted tables for OLAP use. "Sortedness" certainly does not belong to the typical scenario for row-based, update-centric engines à la Oracle. Conversely, fully "SORTED" tables could be a clear advantage for OLAP engines such as our beloved duck!

That could significantly help the duckdb engine fly - birds do fly! - when the table is essentially a read-only one built on the fly from read-only data such as CSVs and/or potentially pre-ordered Parquet files. I understand this is a very common scenario for current duck breeders.

PS: great to hear parquet files may have sorted columns. I wasn't aware.

adriens · 2024-04-05T01:13:06Z

adriens
Apr 5, 2024

soerenwolfers · 2024-04-05T09:55:08Z

soerenwolfers
Apr 5, 2024

For just avoiding unnecessary sorting, couldn't that be achieved without constraints by an optimistic sorting algorithm (so you get an unnecessary O(n) check but not an unnecessary O(n log(n)) sort)?

4 replies

adriens Apr 5, 2024

Hmmm, would you share some code ? I could build some benchmark on top of these as they look quite promising 🙏

soerenwolfers Apr 5, 2024

My comment wasn't directed at you but at the duckdb maintainers. As such, I wasn't referring to anything that's currently available in duckdb (at least not that I know of). I was referring to things like https://en.m.wikipedia.org/wiki/Timsort#:~:text=Timsort%20is%20a%20hybrid%2C%20stable,in%20the%20Python%20programming%20language but obviously there are many more considerations going into the choice of sorting algorithm and I don't know which features take priority. (Really, any sorting algorithm can be made O(n) on presorted data by just prepending a sortedness check, so it's all about the constants.)
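
The "prepend a sortedness check" idea can be sketched in a few lines of Python. The function names are hypothetical. CPython's built-in sort is Timsort-derived and is already linear on presorted input, so the explicit check is here only to make the idea visible.

```python
def is_sorted(values):
    """O(n) check that values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def optimistic_sort(values):
    """Return (sorted_values, did_sort); skip the sort if input is already ordered."""
    if is_sorted(values):
        return values, False
    return sorted(values), True
```

The cost on presorted input is one linear pass; unsorted input pays for the check in addition to the full sort.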

hawkfish Apr 5, 2024
Collaborator

Laurens and I have talked about getting the sort code to notice ordered data on insert. It would be really helpful for windowing and range joins. Not the same as a constraint obviously, but probably useful in practice.

Another approach I have used in a previous system is to track partitioning and ordering metadata in the optimiser and leverage it when generating physical plans.

So there are lots of ideas here for "helping the duck fly"! But this should really get converted to a discussion as it is a proposed feature, and those don't get stale.

adriens Apr 5, 2024

track partitioning and ordering metadata in the optimiser and leverage it when generating physical plans.

That looks like an exciting approach @hawkfish 🤩

hawkfish · 2024-05-17T20:27:10Z

hawkfish
May 17, 2024
Collaborator

Partition metadata is naturally present for Hive-partitioned data sets.
