Handle sorted vs unsorted inputs #501

aswan · 2020-04-01T19:11:29Z

zq currently has some half-baked pieces for dealing with sorted inputs:

When scanner.Combiner is merging multiple files, it peeks at the timestamp of the next record from each file and returns the record with the lowest one. This only produces a useful result when all inputs have ascending timestamps.
The proc.Context object includes a Reverse flag to indicate that records are sorted with descending timestamps rather than ascending. This is a global property of a query and it has no way to indicate that a particular stream is unsorted.
When it is time-binned, the groupby proc assumes that timestamps are sorted (using the Reverse flag described above to distinguish ascending from descending). If this flag is wrong (either because the stream isn't sorted at all or because the flag isn't set correctly), groupby generates incorrect results.
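
To make the first point concrete, here is a minimal sketch of a peek-based merge. It is not the actual scanner.Combiner code; the `record` type, `mergeByPeek`, and `isAscending` are hypothetical names made up for illustration. It shows that the merged output is ordered only when every input is already ascending.

```go
package main

import "fmt"

// record is a hypothetical stand-in for a zng record; only the
// timestamp matters for this illustration.
type record struct {
	ts  int64
	src string
}

// mergeByPeek repeatedly peeks at the next record of every input and
// emits the one with the lowest timestamp, advancing only that input.
func mergeByPeek(inputs [][]record) []record {
	pos := make([]int, len(inputs))
	var out []record
	for {
		best := -1
		for i, in := range inputs {
			if pos[i] >= len(in) {
				continue // this input is exhausted
			}
			if best < 0 || in[pos[i]].ts < inputs[best][pos[best]].ts {
				best = i
			}
		}
		if best < 0 {
			return out // all inputs exhausted
		}
		out = append(out, inputs[best][pos[best]])
		pos[best]++
	}
}

// isAscending reports whether timestamps never decrease.
func isAscending(recs []record) bool {
	for i := 1; i < len(recs); i++ {
		if recs[i].ts < recs[i-1].ts {
			return false
		}
	}
	return true
}

func main() {
	a := []record{{1, "a"}, {4, "a"}, {7, "a"}}
	b := []record{{2, "b"}, {3, "b"}, {9, "b"}}
	c := []record{{9, "c"}, {3, "c"}, {2, "c"}} // descending input

	fmt.Println(isAscending(mergeByPeek([][]record{a, b}))) // true
	fmt.Println(isAscending(mergeByPeek([][]record{a, c}))) // false
}
```

In the second call the descending input is held back behind its first (largest) timestamp, so the merged stream is neither ascending nor descending.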

This is working well enough right now in the app since all.bzng is always sorted by descending timestamps during ingest and the app always passes "dir":-1 in its queries (incidentally, we should remove that from the zqd api since any other value will produce incorrect results).

The zql spec includes ordering hints which are directives that can be inserted in a (b)zng file to indicate if/how the contents of the file/stream are sorted. This issue is to generalize the current "Reverse" flag into a more complete solution that handles streams that can be in one of 3 states: sorted ascending, sorted descending, or unsorted. Note that this property might be different at different points in a query graph (e.g., if points are sorted by timestamp and then they enter a sort foo proc, they are no longer sorted by timestamp). For now, it is probably sufficient to just implement this for the timestamp field and ignore zng sorting hints if they indicate sorting by any other field.
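
As a rough illustration of the three states, the sketch below replaces the boolean Reverse flag with a tri-state value and classifies a run of timestamps. The `Order` type and `detectOrder` are hypothetical names, not anything in zq, and scanning the data like this is just one assumed way to arrive at the state; the ordering hints mentioned above would be another.

```go
package main

import "fmt"

// Order is a hypothetical tri-state replacement for a boolean Reverse flag.
type Order int

const (
	Unsorted Order = iota
	Ascending
	Descending
)

func (o Order) String() string {
	switch o {
	case Ascending:
		return "ascending"
	case Descending:
		return "descending"
	default:
		return "unsorted"
	}
}

// detectOrder classifies a sequence of timestamps. Empty or constant
// sequences are reported as Ascending, an arbitrary choice for this sketch.
func detectOrder(ts []int64) Order {
	asc, desc := true, true
	for i := 1; i < len(ts); i++ {
		if ts[i] < ts[i-1] {
			asc = false
		}
		if ts[i] > ts[i-1] {
			desc = false
		}
	}
	switch {
	case asc:
		return Ascending
	case desc:
		return Descending
	default:
		return Unsorted
	}
}

func main() {
	fmt.Println(detectOrder([]int64{1, 2, 2, 5})) // ascending
	fmt.Println(detectOrder([]int64{9, 4, 1}))    // descending
	fmt.Println(detectOrder([]int64{1, 7, 3}))    // unsorted
}
```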

When this was discussed in the past, there was some disagreement about exactly how to implement it. The controversial aspect was how to handle procs that might change the sorting order as described above.

@aswan implemented a solution in which sorting information was communicated from readers to procs and between procs dynamically at runtime. @nwt, @henridf, and @mccanne all disliked this and proposed a different solution: the introduction of a separate static analysis phase that would analyze an entire proc graph and determine the sorting properties at each point in the graph. It is unclear how this would work when the sortedness of a (b)zng input is unknown until the stream has been read.
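
For a sense of what the static approach could look like, here is a small sketch over a linear chain of procs. Everything in it (`Order`, `proc`, `preserve`, `always`, `analyze`) is hypothetical and not taken from the zq code or from the proposal itself. Each proc declares how it transforms the timestamp ordering of its input, and the analysis walks the chain once before any data is read. In this sketch, an input whose ordering isn't known up front would have to be passed in as Unsorted.

```go
package main

import "fmt"

// Order is a hypothetical tri-state timestamp ordering.
type Order int

const (
	Unsorted Order = iota
	Ascending
	Descending
)

func (o Order) String() string {
	return [...]string{"unsorted", "ascending", "descending"}[o]
}

// proc is a hypothetical description of one stage in a query: transfer
// maps the timestamp ordering of its input to that of its output.
type proc struct {
	name     string
	transfer func(in Order) Order
}

// preserve models procs such as filters that keep records in input order.
func preserve(in Order) Order { return in }

// always models procs whose output ordering is fixed regardless of input,
// for example a sort on a non-timestamp field (always Unsorted by timestamp).
func always(o Order) func(Order) Order {
	return func(Order) Order { return o }
}

// analyze returns the timestamp ordering after each proc in the chain.
func analyze(input Order, chain []proc) []Order {
	out := make([]Order, 0, len(chain))
	cur := input
	for _, p := range chain {
		cur = p.transfer(cur)
		out = append(out, cur)
	}
	return out
}

func main() {
	chain := []proc{
		{"filter", preserve},
		{"sort foo", always(Unsorted)},
		{"sort ts", always(Ascending)},
	}
	for i, o := range analyze(Descending, chain) {
		fmt.Printf("after %s: %v\n", chain[i].name, o)
	}
}
```

A real proc graph can branch and merge, so a full analysis would need a rule for combining orderings at merge points; the linear chain here sidesteps that.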


henridf · 2020-06-10T13:38:09Z

For future reference, @nwt added some ideas about the implementation of sortedness analysis here: #869 (comment)

alfred-landrum · 2020-07-20T19:41:05Z

The analysis mentioned above will likely happen in the query planner, though I don't know if it will happen in this initial epic for it: #1004

mccanne · 2021-11-09T02:54:33Z

Closing this since we decided zq doesn't try to take advantage of sorted data in any way and such optimization will all be handled at scale in the lake. Instead, zq just processes the values in order of the files.

aswan mentioned this issue Apr 1, 2020

Handle search Dir parameters properly #503

Closed

mattnibs self-assigned this Apr 29, 2020

philrz unassigned mattnibs Apr 30, 2020

henridf mentioned this issue Jun 2, 2020

groupby "every" assumes forward sorted data #847

Closed

henridf mentioned this issue Jun 10, 2020

zql: remove special-case timebin support #869

Closed

philrz added the needs discussion label Oct 2, 2020

mccanne closed this as completed Nov 9, 2021
