Refactor order by with non-selected fields #274
Comments
I'm in favor of dropping support for ordering by columns not present in the select statement. Like you said, Cilantro only lets you order by selected columns, and the imagined use case of a manually constructed view of this kind seems contrived to me. Since the most common use case is a view created by Cilantro, I think we should focus on that and remove support for ordering by columns not in the select statement unless a project (or preferably projects) has a strong case to make for maintaining support for it.
Dropping support would require the implementation in Varify to change, since it currently sorts the variant IDs based on the user-defined view sort order.
Check if any view facets are being sorted by and not viewed: https://github.com/chop-dbhi/serrano/blob/master/serrano/resources/preview.py#L46-L55
See explanation chop-dbhi/avocado#274 Signed-off-by: Byron Ruth <b@devel.io>
As noted in the above comment, the fix for this issue was implemented in Serrano. This is not ideal; however, a Harvest query is poorly encapsulated, as noted in #275.
For background, SQL engines generally do not allow `ORDER BY` columns that do not appear in the `SELECT` clause when `DISTINCT` is used. My assumption is that this is because `DISTINCT` is implemented as an aggregation using `GROUP BY`.

The chosen solution was to append the order by columns to the select and then trim them off when the data is passed back. The issue is that redundant rows could be returned if the ordering columns have a to-many relationship to any of the other columns. This redundancy is handled by the `Exporter.read` method, which filters records based on the subset of columns that are being formatted downstream.

This works, but has a fundamental problem when trying to apply a `LIMIT` or `OFFSET`. Since it is unknown how many redundant rows may be present, neither the limit nor the offset can actually be applied to the query, which means the entire result set needs to be returned by the database. The exporter manually iterates over rows until the offset is reached and then yields unique records until the limit is hit. As one can imagine, this is hugely wasteful.

A solution that one would assume to work is to order the subquery and return the distinct values in the outer query:
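The original query is not preserved in this thread. As a rough sketch of the shape being described, assuming a made-up `patient`/`visit` schema and using SQLite purely for illustration (none of these names come from Avocado or Serrano), the query could look like this:

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE visit (id INTEGER PRIMARY KEY, patient_id INTEGER, visit_date TEXT);
INSERT INTO patient VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO visit VALUES
    (1, 1, '2014-01-03'),
    (2, 2, '2014-01-01'),
    (3, 1, '2014-01-02');
""")

# The inner query sorts by a column (visit_date) that the outer query
# does not select; the outer query then asks for distinct names.
SUBQUERY_SQL = """
SELECT DISTINCT name
FROM (
    SELECT p.name AS name, v.visit_date AS visit_date
    FROM patient p
    JOIN visit v ON v.patient_id = p.id
    ORDER BY v.visit_date
) AS ordered
"""

names = [row[0] for row in conn.execute(SUBQUERY_SQL)]
# The rows are distinct, but nothing guarantees they come back in
# visit_date order, so no particular ordering is asserted here.
```

The query parses and returns the distinct names, which is what makes it look like a workable fix.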
Unfortunately, this does not work either. SQL engines may strip `ORDER BY` clauses from subqueries since the idea of a "table" in the SQL standard is a set of unordered rows. In this context, the subquery is acting as a table and therefore the rows are not guaranteed to be ordered. More information here.

There are a couple of directions we can go:
Drop support for ordering by columns that are not present in the select statement. In practice, this requirement is very rare since a view must be constructed manually to do this (Cilantro only supports ordering by already selected columns).
Optimize for the common case. This involves flagging a query when an order by column is not present in the select statement and choosing an execution route appropriate for the query. If all order by columns are in the select statement, the offset and limit can be pushed down to the database (as SQL). If not, the application must handle it.
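To make the second option concrete, here is a minimal sketch of the routing decision and of the application-side fallback. The function names (`can_push_down`, `choose_route`, `paginate_unique`) are hypothetical and are not part of Avocado or Serrano. The sketch assumes rows arrive as tuples in database sort order, with the selected columns first and the appended ordering columns last.

```python
from itertools import islice


def can_push_down(select_columns, order_by_columns):
    """Return True when every ORDER BY column is also selected."""
    return set(order_by_columns) <= set(select_columns)


def choose_route(select_columns, order_by_columns):
    """Pick where OFFSET/LIMIT are applied for a given query."""
    if can_push_down(select_columns, order_by_columns):
        return "database"      # emit LIMIT/OFFSET in the SQL itself
    return "application"       # fetch everything, paginate in Python


def paginate_unique(rows, width, offset=0, limit=None):
    """Application-side route: trim the appended ordering columns,
    drop duplicate records, then apply offset and limit.

    `width` is the number of leading values that were actually selected.
    """
    seen = set()

    def unique():
        for row in rows:
            key = tuple(row[:width])   # trim the extra ordering columns
            if key not in seen:        # skip redundant rows
                seen.add(key)
                yield key

    stop = None if limit is None else offset + limit
    # islice stops consuming rows as soon as the limit is reached.
    return list(islice(unique(), offset, stop))


# Usage: one selected column (name) plus one appended ordering column.
rows = [("Ann", "d1"), ("Bob", "d2"), ("Ann", "d3"), ("Cat", "d4")]
page = paginate_unique(rows, width=1, offset=1, limit=2)
```

Only queries routed to `"application"` pay the cost of scanning rows that will be discarded. The common Cilantro-built view, where every ordering column is selected, stays on the `"database"` route.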
@murphyke @naegelyd @tjrivera Thoughts?