Replies: 16 comments 4 replies
-
This sounds awesome for time series data! |
Beta Was this translation helpful? Give feedback.
-
I might pick this up at some point. Just wondering about a few things:
Can we not just sort the data that is inserted if it is not sorted already? Or do we want to force users to do this explicitly by doing e.g. If there is already data in the table, then we need to make sure than all the values being inserted are greater than the values that are already in there, which we can check with our statistics, then sort it.
Deletions should be fine, but disallowing all updates may be easier. Additionally, what about ASC/DESC or NULLS FIRST/LAST, or multiple order clauses? Some people model date by multiple columns like the TPC-DS customer table i.e. CREATE TABLE sensor_data(
year INT,
month INT,
day INT,
...,
SORTED(year ASC NULLS LAST, month, day) -- or ORDERED for more SQLness?
); Of course, detecting sortedness of combinations of columns automatically during insertion is combinatorially difficult. |
Beta Was this translation helpful? Give feedback.
-
That doesn't easily work, because if I have a table with the values
Deletes are fine indeed, when I say update I mean the SQL
Yes, why not. That should be fine, although of course a bit harder to use in a query. |
Beta Was this translation helpful? Give feedback.
-
Consider using Related: |
Beta Was this translation helpful? Give feedback.
-
This sounds great - as a small follow-up from #3207 , I thought I'd add two small ideas in here:
Right, I probably phrased this terribly, so maybe an example is better. Let's say you have a table that stores prices for financial instruments. You have three columns: Instrument, Time, and Price. For point no 1: Quite possibly, you'll be able to write data in a time-consecutive order per financial instrument, but not necessarily globally. (Imagine e.g. you're processing all market updates for Microsoft on Thread A, and all market events for Apple on Thread B; it'd be relatively easy to ensure you'll write updates for each security in ascending time-order, but you don't really want/need to synchronise these independent threads/securities; you might end up writing an update for Microsoft that is more recent than Microsoft's last price, but older than Apple's latest inserted price). And in practice, very many queries you'd run later on would be grouped by security as well, potentially benefiting from sort-awareness. For point no 2: Staying with the same example, not all market data for financial instruments always arrives in order. But it almost does. |
Beta Was this translation helpful? Give feedback.
-
I think maybe the guiding principle here should be to suggest the semantics? For me, |
Beta Was this translation helpful? Give feedback.
-
Thanks for the suggestions @fabianoliver - we love to hear from you!
I suspect this would be quite difficult to enforce in practice as the insert task would have to scan the entire table to find the previous examples of the key. I think one of the goals here is to leverage "found ordering", so I think this use case would be better served by performing the sort manually into a table after insertion?
This might be more tractable. I can see performing a local sort on the data before appending it. This should catch most cases and not be especially slow. We would have to figure out how to merge the occasional case when the backtracking occurred across a |
Beta Was this translation helpful? Give feedback.
-
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days. |
Beta Was this translation helpful? Give feedback.
-
I can't remove the |
Beta Was this translation helpful? Give feedback.
-
I think this needs to be converted into a discussion (i.e., feature request). Issues are really bugs. |
Beta Was this translation helpful? Give feedback.
-
Just want to add a few related points to this:
|
Beta Was this translation helpful? Give feedback.
-
🙏 Please make this wonder happen 🙌 |
Beta Was this translation helpful? Give feedback.
-
Thanks Mark and everyone on this thread for this pleasant and instructive conversation! I belong to those who, in the past, have comfortably hacked SQL SELECT calls on tables that are pre-sorted tables for OLAP usages. "Sortedness" certainly does not belong to a typical scenario for row-based update-centric engines à la Oracle. But, conversely, sure enjoying fully "SORTED" tables would possibly be a clear advantage for OLAP engines such as our beloved duck! That would possibly significantly help the duckdb engine fly - birds do fly! - when the table is essentially a ready-only one built on the fly from read-only data such as csv-s and/or pre-ordered potentially parquet files. I understand this is a very common scenario for current duck breeders. PS: great to hear parquet files may have sorted columns. I wasn't aware. |
Beta Was this translation helpful? Give feedback.
-
For just avoiding unnecessary sorting, couldn't that be achieved without constraints by an optimistic sorting algorithm ( so you get an unnecessary O(n) but not an unnecessary O(log(n)) )? |
Beta Was this translation helpful? Give feedback.
-
Partition metadata is naturally present for hive partitioning data sets. |
Beta Was this translation helpful? Give feedback.
-
We had the idea of allowing a
SORTED
constraint to be specified on a column, e.g.:The sorted constraint enforces that the data is inserted in sorted order w.r.t. any columns that have a sorted constraint on them (i.e. inserting data in unsorted order throws an error), updating the columns is not allowed.
The sorted constraint can then be propagated through the query plan as part of the statistics propagation, and can be used to optimize e.g. window functions and merge joins (to avoid unnecessarily sorting already sorted data), and potentially even to optimize aggregates (in case of aggregates on a subset of the sorted column, such as e.g. grouping by
YEAR(ts), MONTH(ts)
).We could also track "accidental" sortedness during insertion and propagate the same sortedness if columns happen to be sorted, but having a constraint to enforce this behavior and prevent surprises seems like a good idea.
Beta Was this translation helpful? Give feedback.
All reactions