Aggregations, Windowing and Distinct - in-memory #46

Merged
merged 22 commits into from
Nov 18, 2021

@dcmoura dcmoura commented Oct 7, 2021

New functionalities:

  • aggregation functions
  • GROUP BY clause: allow specifying aggregation groups
  • SELECT DISTINCT: filters out duplicate rows
  • SELECT PARTIALS: outputs the partial (intermediate) result of each aggregate for every input row, instead of outputting only the totals (the default behaviour). This emulates window functions where the window is always the input data in its original order, partitioned by the GROUP BY criteria
  • --unbuffered command-line argument to allow users to disable output buffering (useful, for instance, when running SELECT PARTIALS over streaming data)
  • fixes for minor bugs found while testing
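
The difference between totals, partials and distinct rows can be illustrated with a minimal Python sketch. This is not the code from this PR: the helper names `aggregate` and `distinct` and the sample rows are hypothetical, and the sketch only assumes that per-group state and already-seen rows live in plain in-memory containers.

```python
from collections import defaultdict


def aggregate(rows, key, value, partials=False):
    """Sum and count of `value` per `key`, with all group state held in memory.

    Hypothetical helper for illustration only, not part of spyql.
    """
    state = defaultdict(lambda: {"sum": 0, "count": 0})  # one entry per group
    out = []
    for row in rows:
        acc = state[row[key]]
        acc["sum"] += row[value]
        acc["count"] += 1
        if partials:
            # PARTIALS-like behaviour: emit the running aggregate for every input row
            out.append((row[key], acc["sum"], acc["count"]))
    if not partials:
        # default behaviour: emit only the final totals, one row per group
        out = [(k, acc["sum"], acc["count"]) for k, acc in state.items()]
    return out


def distinct(rows):
    """Yield each row the first time it is seen; every unique row is remembered."""
    seen = set()
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield row


rows = [
    {"dept": "a", "x": 1},
    {"dept": "b", "x": 10},
    {"dept": "a", "x": 2},
    {"dept": "a", "x": 2},
]
totals = aggregate(rows, "dept", "x")
running = aggregate(rows, "dept", "x", partials=True)
unique = list(distinct(rows))
```

Here `totals` holds one row per group, while `running` holds one row per input row, in input order, with the aggregate restarted for each group. Since partial results are available as soon as a row is read, this is the mode where disabling output buffering is useful.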

This is an in-memory implementation for providing basic aggregation functionality. The following limitations require follow-up reimplementations:

  • GROUP BY: groups are stored in memory. When calculating sums, counts, averages, etc. over a reasonable number of groups, this is not a problem, but memory consumption might become too large when the number of groups is very large
  • array_agg: this aggregate function collects all elements into memory
  • SELECT DISTINCT, set_agg, count_distinct_agg: these operations store all unique values in memory
  • The current mechanism for tracking aggregations is based on the order of aggregate function calls in the query. This might fail if the query contains flow control statements, which the parser does not check for! E.g. SELECT max_agg(x) if x>0 else 0, count_agg(*) produces unpredictable results because max_agg is executed in some rows and not in others. Therefore, count_agg is the second aggregate to be called in some rows and the first in others, mixing up cumulative values.
    An alternative approach should be discussed, either for this PR or for a follow-up reimplementation.
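
The call-order problem can be reproduced with a small Python sketch. It is only an assumption about how order-based tracking could work, not the implementation in this PR: the class `OrderTrackedAggs`, its methods and the `run` helper are hypothetical names, and each aggregate call is simply given the next state slot within the current row.

```python
class OrderTrackedAggs:
    """Gives each aggregate call a state slot based on its call position in the row.

    Hypothetical model of order-based tracking, for illustration only.
    """

    def __init__(self):
        self.slots = {}  # slot index -> cumulative value
        self.pos = 0     # position of the next aggregate call in the current row

    def start_row(self):
        self.pos = 0

    def _next_slot(self):
        slot = self.pos
        self.pos += 1
        return slot

    def max_agg(self, x):
        slot = self._next_slot()
        prev = self.slots.get(slot)
        self.slots[slot] = x if prev is None else max(prev, x)
        return self.slots[slot]

    def count_agg(self):
        slot = self._next_slot()
        self.slots[slot] = self.slots.get(slot, 0) + 1
        return self.slots[slot]


def run(xs):
    """Evaluate the two output columns for each x and return the last row."""
    aggs = OrderTrackedAggs()
    result = None
    for x in xs:
        aggs.start_row()
        # mirrors: SELECT max_agg(x) if x>0 else 0, count_agg(*)
        col1 = aggs.max_agg(x) if x > 0 else 0
        col2 = aggs.count_agg()
        result = (col1, col2)
    return result


all_positive = run([5, 1, 3])      # max_agg runs on every row: slots stay aligned
with_negative = run([5, -1, 3])    # max_agg is skipped once: slots get mixed up
```

With the input 5, -1, 3 the expected result would be a maximum of 5 and a count of 3. In this sketch the second row skips max_agg, so count_agg increments the slot that holds the maximum, and the final row reports a maximum of 6 and a count of 2. Keying the state on the position of the call in the query text, rather than on execution order, would be one option to discuss.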

Closes #44.
Closes #45.

@dcmoura dcmoura added the core Core feature label Oct 7, 2021
@dcmoura dcmoura requested a review from recharte October 7, 2021 22:05

codecov bot commented Oct 7, 2021

Codecov Report

Merging #46 (2b6af00) into master (d5aa490) will increase coverage by 0.61%.
The diff coverage is 98.99%.

@@            Coverage Diff             @@
##           master      #46      +/-   ##
==========================================
+ Coverage   95.25%   95.87%   +0.61%     
==========================================
  Files           9       10       +1     
  Lines         864     1018     +154     
==========================================
+ Hits          823      976     +153     
- Misses         41       42       +1     

Impacted Files             Coverage Δ
spyql/agg.py               98.07% <98.07%> (ø)
spyql/cli.py               98.98% <98.50%> (-0.36%) ⬇️
spyql/nulltype.py          89.70% <100.00%> (+0.93%) ⬆️
spyql/output_handler.py    98.83% <100.00%> (+0.92%) ⬆️
spyql/processor.py         95.91% <100.00%> (+0.10%) ⬆️
spyql/utils.py             100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d5aa490...2b6af00.

This was linked to issues Oct 18, 2021
Base automatically changed from mem_sort to master October 23, 2021 21:27
@dcmoura dcmoura merged commit afc4d8c into master Nov 18, 2021
@dcmoura dcmoura deleted the agg_mem branch November 18, 2021 23:17
Issues closed by merging this pull request: SELECT DISTINCT, Aggregations