Refactor order by with non-selected fields #274

bruth · 2015-01-01T17:03:07Z

For background, SQL engines do not allow ORDER BY columns that do not appear in the SELECT clause when DISTINCT is used. My assumption is because DISTINCT is implemented as an aggregation using GROUP BY:

-- this
SELECT DISTINCT foo, bar FROM baz

-- same as
SELECT foo, bar FROM baz GROUP BY foo, bar

The chosen solution was to append the columns in the order by and then trim them off when the data is passed back. The issue is that there could be redundant rows returned if the ordering columns have a many to relationship to any of the other columns. This redundancy is handled by the Exporter.read method which filters records based on the subset of columns that are being formatted downstream.

This works, but has a fundamental problem when trying to apply a LIMIT or OFFSET. Since it is unknown how many redundant rows may be present, neither the limit or nor offset can actually be applied to the query which means the entire result set needs to be returned by the database. The exporter manually iterates over rows until the offset is reach and then yields unique records until the limit is hit. As one can imagine this is hugely wasteful.

A solution that one would assume to work is to order the subquery and return the distinct values in the outer query:

SELECT DISTINCT foo, bar FROM (
    SELECT foo, bar, qux
    FROM baz
    ORDER BY qux
)

Unfortunately, this does not work either. SQL engines strip ORDER BY statements from subqueries since the idea of a "table" in the SQL standard is a set of unordered rows. In this context, the subquery is acting as a table and therefore the rows will not be ordered. More information here.

There are a couple directions we can go:

Drop support for ordering by columns that are not present in the select statement. In practice, this requirement is very rare since a view must be constructed manually to do this (Cilantro only supports ordering by already selected columns).

Optimize for the common case. This involves flagging a query when an order by column is not present in the select statement and choosing an execution route appropriate for the query. If all order by columns are in the select statement, the offset and limit can be pushed down to the database (as SQL). If not the application must handle it.

@murphyke @naegelyd @tjrivera Thoughts?

The text was updated successfully, but these errors were encountered:

naegelyd · 2015-01-01T18:48:12Z

I'm in favor of dropping support for ordering by columns not present in the select statement. Like you said, Cilantro only lets you order by selected columns and the imagined use case of a manually constructed view of this kind seems contrived to me. Since the most common use case is a view created by Cilantro, I think we should focus on that and remove support for ordering by columns not in the select statement unless a project(or preferably projects) has a strong case to make for maintaining support for it.

bruth · 2015-01-01T22:32:19Z

Dropping support would require the implementation in Varify to change since it currently sorts the variant IDs based on the user-defined view sort order..

bruth · 2015-01-05T18:26:45Z

Check if any view facets are being sorted by and not viewed: https://github.com/chop-dbhi/serrano/blob/master/serrano/resources/preview.py#L46-L55

See explanation chop-dbhi/avocado#274 Signed-off-by: Byron Ruth <b@devel.io>

bruth · 2015-01-07T18:46:15Z

As noted in the above comment the fix for this issue was implemented in Serrano. This is not ideal, however there is poor encapsulation of a Harvest query which is noted in #275.

bruth added type:refactor version:2.4 labels Jan 1, 2015

bruth mentioned this issue Jan 2, 2015

Implement formal query structure #275

Closed

bruth added the area:perf label Jan 5, 2015

bruth mentioned this issue Jan 5, 2015

Performance Retreat chop-dbhi/harvest#41

Open

5 tasks

bruth added a commit to chop-dbhi/serrano that referenced this issue Jan 6, 2015

Optimize the result endpoints for limit and offset

6086128

See explanation chop-dbhi/avocado#274 Signed-off-by: Byron Ruth <b@devel.io>

bruth added a commit to chop-dbhi/serrano that referenced this issue Jan 6, 2015

Optimize the result endpoints for limit and offset

e7e5c1d

See explanation chop-dbhi/avocado#274 Signed-off-by: Byron Ruth <b@devel.io>

bruth closed this as completed Jan 7, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor order by with non-selected fields #274

Refactor order by with non-selected fields #274

bruth commented Jan 1, 2015

naegelyd commented Jan 1, 2015

bruth commented Jan 1, 2015

bruth commented Jan 5, 2015

bruth commented Jan 7, 2015

Refactor order by with non-selected fields #274

Refactor order by with non-selected fields #274

Comments

bruth commented Jan 1, 2015

naegelyd commented Jan 1, 2015

bruth commented Jan 1, 2015

bruth commented Jan 5, 2015

bruth commented Jan 7, 2015