Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle sorted vs unsorted inputs #501

Closed
aswan opened this issue Apr 1, 2020 · 3 comments
Closed

Handle sorted vs unsorted inputs #501

aswan opened this issue Apr 1, 2020 · 3 comments

Comments

@aswan
Copy link

aswan commented Apr 1, 2020

zq currently has some half-baked pieces for dealing with sorted inputs:

  • when scanner.Combiner is merging multiple files it peeks at the timestamps of the next record from each file and returns the lowest one. This only produces a useful result when all inputs have ascending timestamps.
  • The proc.Context object includes a Reverse flag to indicate that records are sorted with descending timestamps rather than ascending. This is a global property of a query and it has no way to indicate that a particular stream is unsorted.
  • The groupby proc when it is time-binned assumes that timestamps are sorted (using the Reverse flag described above to distinguish ascending or descending). If this flag is wrong (either because the stream isn't sorted at all or because it isn't set correctly), groupby generates incorrect results.

This is working well enough right now in the app since all.bzng is always sorted by descending timestamps during ingest and the app always passes "dir":-1 in its queries (incidentally, we should remove that from the zqd api since any other value will produce incorrect results).

The zql spec includes ordering hints which are directives that can be inserted in a (b)zng file to indicate if/how the contents of the file/stream are sorted. This issue is to generalize the current "Reverse" flag into a more complete solution that handles streams that can be in one of 3 states: sorted ascending, sorted descending, or unsorted. Note that this property might be different at different points in a query graph (e.g., if points are sorted by timestamp and then they enter a sort foo proc, they are no longer sorted by timestamp). For now, it is probably sufficient to just implement this for the timestamp field and ignore zng sorting hints if they indicate sorting by any other field.

When this was discussed in the past, there was some disagreement about exactly how to implement it. The controversial aspect was how to handle procs that might change the sorting order as described above.

@aswan impelemented a solution in which sorting information was communicated from readers to procs and between procs dynamically at runtime. @nwt, @henridf, and @mccanne all disliked this and proposed a different solution: the introduction of a separate static analysis phase that would analyze an entire proc graph and determine the sorting properties at each point in the graph. It is unclear how this would work when the sortedness of a (b)zng input is unknown until the stream has been read.

@henridf
Copy link
Contributor

henridf commented Jun 10, 2020

For future reference, @nwt added some ideas about the implementation of sortedness analysis here: #869 (comment)

@alfred-landrum
Copy link
Contributor

The analysis mentioned above will likely happen in the query planner, though I don't know if it will happen in this initial epic for it: #1004

@mccanne
Copy link
Collaborator

mccanne commented Nov 9, 2021

Closing this since we decided zq doesn't try to take advantage of sorted data in any way and such optimization will all be handled at scale in the lake. Instead, zq just processes the values in order of the files.

@mccanne mccanne closed this as completed Nov 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants