Skip to content
This repository has been archived by the owner on Jan 18, 2020. It is now read-only.

Refactor order by with non-selected fields #274

Closed
Tracked by #41
bruth opened this issue Jan 1, 2015 · 4 comments
Closed
Tracked by #41

Refactor order by with non-selected fields #274

bruth opened this issue Jan 1, 2015 · 4 comments

Comments

@bruth
Copy link
Contributor

bruth commented Jan 1, 2015

For background, SQL engines do not allow ORDER BY columns that do not appear in the SELECT clause when DISTINCT is used. My assumption is because DISTINCT is implemented as an aggregation using GROUP BY:

-- this
SELECT DISTINCT foo, bar FROM baz

-- same as
SELECT foo, bar FROM baz GROUP BY foo, bar

The chosen solution was to append the columns in the order by and then trim them off when the data is passed back. The issue is that there could be redundant rows returned if the ordering columns have a many to relationship to any of the other columns. This redundancy is handled by the Exporter.read method which filters records based on the subset of columns that are being formatted downstream.

This works, but has a fundamental problem when trying to apply a LIMIT or OFFSET. Since it is unknown how many redundant rows may be present, neither the limit or nor offset can actually be applied to the query which means the entire result set needs to be returned by the database. The exporter manually iterates over rows until the offset is reach and then yields unique records until the limit is hit. As one can imagine this is hugely wasteful.

A solution that one would assume to work is to order the subquery and return the distinct values in the outer query:

SELECT DISTINCT foo, bar FROM (
    SELECT foo, bar, qux
    FROM baz
    ORDER BY qux
)

Unfortunately, this does not work either. SQL engines strip ORDER BY statements from subqueries since the idea of a "table" in the SQL standard is a set of unordered rows. In this context, the subquery is acting as a table and therefore the rows will not be ordered. More information here.

There are a couple directions we can go:

Drop support for ordering by columns that are not present in the select statement. In practice, this requirement is very rare since a view must be constructed manually to do this (Cilantro only supports ordering by already selected columns).

Optimize for the common case. This involves flagging a query when an order by column is not present in the select statement and choosing an execution route appropriate for the query. If all order by columns are in the select statement, the offset and limit can be pushed down to the database (as SQL). If not the application must handle it.

@murphyke @naegelyd @tjrivera Thoughts?

@naegelyd
Copy link
Collaborator

naegelyd commented Jan 1, 2015

I'm in favor of dropping support for ordering by columns not present in the select statement. Like you said, Cilantro only lets you order by selected columns and the imagined use case of a manually constructed view of this kind seems contrived to me. Since the most common use case is a view created by Cilantro, I think we should focus on that and remove support for ordering by columns not in the select statement unless a project(or preferably projects) has a strong case to make for maintaining support for it.

@bruth
Copy link
Contributor Author

bruth commented Jan 1, 2015

Dropping support would require the implementation in Varify to change since it currently sorts the variant IDs based on the user-defined view sort order..

@bruth
Copy link
Contributor Author

bruth commented Jan 5, 2015

Check if any view facets are being sorted by and not viewed: https://github.com/chop-dbhi/serrano/blob/master/serrano/resources/preview.py#L46-L55

bruth added a commit to chop-dbhi/serrano that referenced this issue Jan 6, 2015
See explanation chop-dbhi/avocado#274

Signed-off-by: Byron Ruth <b@devel.io>
bruth added a commit to chop-dbhi/serrano that referenced this issue Jan 6, 2015
See explanation chop-dbhi/avocado#274

Signed-off-by: Byron Ruth <b@devel.io>
@bruth
Copy link
Contributor Author

bruth commented Jan 7, 2015

As noted in the above comment the fix for this issue was implemented in Serrano. This is not ideal, however there is poor encapsulation of a Harvest query which is noted in #275.

@bruth bruth closed this as completed Jan 7, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants