Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

groupby "every" assumes forward sorted data #847

Closed
henridf opened this issue Jun 2, 2020 · 4 comments
Closed

groupby "every" assumes forward sorted data #847

henridf opened this issue Jun 2, 2020 · 4 comments
Labels
bug Something isn't working wontfix This will not be worked on

Comments

@henridf
Copy link
Contributor

henridf commented Jun 2, 2020

If records presented to a every X groupby are not in forward order, then that groupby will emit multiple groups for a given time bin. For example:

$ zq -t "every 1h count()" ~/Downloads/sampledata/wrccdc-year1/zng-logs/conn.zng | head -n 10
#0:record[ts:time,count:uint64]
0:[1490382000;88795;]
0:[1490382000;100;]
0:[1490382000;100;]
0:[1490382000;100;]
0:[1490382000;92;]
0:[1490382000;98;]
0:[1490382000;100;]
0:[1490382000;92;]
0:[1490382000;100;]
@henridf
Copy link
Contributor Author

henridf commented Jun 2, 2020

(For anyone trying to repro this: it requires a suitable large input. A small handcrafted input that fits into one zbuf.Batch will be processed correctly.).

@henridf henridf added the bug Something isn't working label Jun 2, 2020
@henridf
Copy link
Contributor Author

henridf commented Jun 2, 2020

I now notice that this is related to the broader issue filed at #501. In any case worth having as a standalone issue as it should be fixed by a forthcoming PR.

@philrz philrz changed the title groupby -every assumes forward sorted data groupby "every" assumes forward sorted data Jun 15, 2020
@philrz
Copy link
Contributor

philrz commented Jun 15, 2020

I've confirmed with @henridf that this issue is no longer necessary, so I'm closing it. What's happened is that the change in #893 makes sure this problem is avoided by not emitting any bins until the end.

Here's verification. The original issue still existed as of zq commit 905504e that came right before the #893 change:

$ zq -version
Version: v0.14.0-21-g905504e

$ zq -t "every 1h count()" ~/Downloads/sampledata/wrccdc-year1/zng-logs/conn.zng | head -n 10
#0:record[ts:time,count:uint64]
0:[1490382000;88795;]
0:[1490382000;100;]
0:[1490382000;100;]
0:[1490382000;100;]
0:[1490382000;92;]
0:[1490382000;98;]
0:[1490382000;100;]
0:[1490382000;92;]
0:[1490382000;100;]

Then with the commit 9af46a0 (that coincides with #893), now we have unique bins as desired:

$ zq -version
Version: v0.14.0-22-g9af46a0

$ zq -t "every 1h count()" ~/Downloads/sampledata/wrccdc-year1/zng-logs/conn.zng | head -n 10
#0:record[ts:time,count:uint64]
0:[1490428800;104544;]
0:[1490486400;10427;]
0:[1490414400;116826;]
0:[1490472000;13587879;]
0:[1490400000;3702075;]
0:[1490457600;892137;]
0:[1490385600;8155901;]
0:[1490443200;103451;]
0:[1490461200;1287525;]

While unique, note that they are not sorted by default, so a downstream sort may be justified depending on what the user intends to do with the data next.

@philrz philrz closed this as completed Jun 15, 2020
@philrz philrz added the wontfix This will not be worked on label Jun 15, 2020
@henridf
Copy link
Contributor Author

henridf commented Jun 16, 2020

What's happened is that the change in #893 makes sure this problem is avoided by not emitting any bins until the end.

This is all correct, but adding one additional piece of context just in case anyone came across this issue without other context on what we're doing with groupby: in general, #893 does indeed wait till EOF to emit all bins. But that PR also included a feature for early-emitting bins when the incoming stream is sorted in the primary grouping key. In the case of every, that primary grouping key is ts, and groupby will early-emit bins if it knows that the input is time-sorted (currently, this happens only in zqd).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working wontfix This will not be worked on
Projects
None yet
2 participants