Member

@Ted-Jiang Ted-Jiang commented Aug 15, 2022

Which issue does this PR close?

Closes #3142
Closes #3145

approx_percentile_cont now accepts an optional third argument, the number of centroids: approx_percentile_cont(col_name, percentile, number_of_centroids)

DataFusion CLI v10.0.0
❯ create external table order stored as parquet location '/Users/yangjiang/test-data/tpch-1g/orders';
0 rows in set. Query took 0.016 seconds.
❯ select o_orderstatus, approx_percentile_cont(o_totalprice, 0.5) from order group by o_orderstatus  order by 1 limit 3;
+---------------+-------------------------------------------------------+
| o_orderstatus | APPROXPERCENTILECONT(order.o_totalprice,Float64(0.5)) |
+---------------+-------------------------------------------------------+
| F             | 143415.09074751436                                    |
| O             | 143177.55673372833                                    |
| P             | 181407.62680871075                                    |
+---------------+-------------------------------------------------------+
3 rows in set. Query took 0.728 seconds.
❯ select o_orderstatus, approx_percentile_cont(o_totalprice, 0.5, 10000) from order group by o_orderstatus  order by 1 limit 3;
+---------------+--------------------------------------------------------------------+
| o_orderstatus | APPROXPERCENTILECONT(order.o_totalprice,Float64(0.5),Int64(10000)) |
+---------------+--------------------------------------------------------------------+
| F             | 143289.22569288642                                                 |
| O             | 143184.3388358634                                                  |
| P             | 181451.707578125                                                   |
+---------------+--------------------------------------------------------------------+
3 rows in set. Query took 1.168 seconds.
❯ 


Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the logical-expr (Logical plan and expressions) and physical-expr (Changes to the physical-expr crates) labels Aug 15, 2022
})?
.value();
let max_size = match lit {
ScalarValue::UInt8(Some(q)) => *q as usize,
Member Author

@Ted-Jiang Ted-Jiang Aug 15, 2022

For now, I think we cannot input unsigned ints from SQL 🤔
So the signed Int variants are needed here as well.

Contributor

Yeah, I think we should file a ticket to allow creating unsigned int types from SQL (like in CREATE TABLE) statements.

I ran into this limitation while I was fixing #3167 -- https://github.com/apache/arrow-datafusion/pull/3167/files#r946145318
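
A minimal sketch of the conversion discussed in this thread, assuming a simplified stand-in for `ScalarValue`. The `Literal` enum and `max_size_from_literal` are hypothetical names for illustration, not DataFusion's API, and rejecting zero is an assumption of the sketch. The point is that SQL integer literals arrive as signed values (the CLI output above shows `Int64(10000)`), so the signed variants need a sign check before they are converted to `usize`.

```rust
use std::convert::TryFrom;

/// Hypothetical, simplified stand-in for a scalar literal.
/// This is NOT DataFusion's `ScalarValue`; it only mirrors a few variants.
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    UInt8(Option<u8>),
    UInt64(Option<u64>),
    Int32(Option<i32>),
    Int64(Option<i64>),
    Utf8(Option<String>),
}

/// Convert a non-null integer literal into a positive `usize`.
/// Signed and unsigned variants are widened to `i128` first so that
/// one range check covers all of them.
fn max_size_from_literal(lit: &Literal) -> Result<usize, String> {
    let value: i128 = match lit {
        Literal::UInt8(Some(v)) => i128::from(*v),
        Literal::UInt64(Some(v)) => i128::from(*v),
        Literal::Int32(Some(v)) => i128::from(*v),
        Literal::Int64(Some(v)) => i128::from(*v),
        other => {
            return Err(format!(
                "expected a non-null integer literal, got {:?}",
                other
            ))
        }
    };
    // Assumption of this sketch: zero or negative sizes are rejected.
    if value <= 0 {
        return Err(format!("number of centroids must be positive, got {}", value));
    }
    usize::try_from(value).map_err(|_| format!("{} does not fit in usize", value))
}

fn main() {
    println!("{:?}", max_size_from_literal(&Literal::Int64(Some(10000))));
    println!("{:?}", max_size_from_literal(&Literal::Int64(Some(-1))));
    println!("{:?}", max_size_from_literal(&Literal::Utf8(Some("10".to_string()))));
}
```

Widening to `i128` keeps the match arms uniform; an alternative would be a per-variant `try_from`, at the cost of repeating the error handling.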

codecov-commenter commented Aug 15, 2022

Codecov Report

Merging #3146 (7d67a2f) into master (48f9b7a) will decrease coverage by 0.09%.
The diff coverage is 79.22%.

@@            Coverage Diff             @@
##           master    #3146      +/-   ##
==========================================
- Coverage   85.95%   85.85%   -0.10%     
==========================================
  Files         291      291              
  Lines       52382    52844     +462     
==========================================
+ Hits        45025    45370     +345     
- Misses       7357     7474     +117     
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...sical-expr/src/aggregate/approx_percentile_cont.rs | 81.43% <65.21%> (-3.53%) | ⬇️ |
| datafusion/core/tests/sql/aggregates.rs | 99.37% <100.00%> (+<0.01%) | ⬆️ |
| datafusion/expr/src/aggregate_function.rs | 93.43% <100.00%> (+1.12%) | ⬆️ |
| datafusion/physical-expr/src/aggregate/build_in.rs | 90.01% <100.00%> (+0.09%) | ⬆️ |
| ...on/physical-expr/src/expressions/binary/kernels.rs | 52.17% <0.00%> (-10.99%) | ⬇️ |
| ...ore/src/physical_plan/file_format/chunked_store.rs | 62.26% <0.00%> (-5.09%) | ⬇️ |
| ...tafusion/physical-expr/src/expressions/datetime.rs | 84.22% <0.00%> (-2.58%) | ⬇️ |
| datafusion/core/tests/path_partition.rs | 84.92% <0.00%> (-0.87%) | ⬇️ |
| datafusion/core/tests/sql/mod.rs | 96.94% <0.00%> (-0.85%) | ⬇️ |
| datafusion/expr/src/window_frame.rs | 92.43% <0.00%> (-0.85%) | ⬇️ |

... and 16 more

@github-actions github-actions bot added the core (Core DataFusion crate) label Aug 15, 2022
Member Author

Ted-Jiang commented Aug 15, 2022

@alamb PTAL 😊
Using a larger number of histogram bins gives higher accuracy on large amounts of data.
In Hive and Spark, the default is 10,000.
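
To make the accuracy claim concrete, here is a rough sketch using a plain equal-width histogram. This is a swapped-in technique chosen for brevity, not the t-digest that the linked issue refers to, and every name in it (`sample_data`, `exact_quantile`, `histogram_quantile`) is hypothetical. It assumes non-empty input with no NaN values. The estimate always falls in the same bucket as the exact nearest-rank value, so the error is bounded by the bucket width, and that width shrinks as the number of bins grows.

```rust
/// Deterministic, skewed sample data from a simple LCG (illustrative only).
fn sample_data(n: usize) -> Vec<f64> {
    let mut state: u64 = 42;
    (0..n)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            let u = (state >> 11) as f64 / (1u64 << 53) as f64; // uniform in [0, 1)
            u * u * 500_000.0
        })
        .collect()
}

/// Exact nearest-rank quantile, used as the reference value.
fn exact_quantile(data: &[f64], q: f64) -> f64 {
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((q * sorted.len() as f64).ceil() as usize).clamp(1, sorted.len());
    sorted[rank - 1]
}

/// Approximate quantile from an equal-width histogram with `bins` buckets.
/// Returns the estimate and the bucket width, which bounds the error.
fn histogram_quantile(data: &[f64], q: f64, bins: usize) -> (f64, f64) {
    let min = data.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = data.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let width = (max - min) / bins as f64;
    let mut counts = vec![0usize; bins];
    for &x in data {
        let idx = (((x - min) / width) as usize).min(bins - 1);
        counts[idx] += 1;
    }
    let rank = ((q * data.len() as f64).ceil() as usize).clamp(1, data.len());
    let mut seen = 0usize;
    for (i, &c) in counts.iter().enumerate() {
        if seen + c >= rank {
            // Interpolate linearly inside the bucket that contains the rank.
            let frac = (rank - seen) as f64 / c as f64;
            return (min + (i as f64 + frac) * width, width);
        }
        seen += c;
    }
    (max, width)
}

fn main() {
    let data = sample_data(100_000);
    let exact = exact_quantile(&data, 0.5);
    for bins in [10, 100, 10_000] {
        let (approx, width) = histogram_quantile(&data, 0.5, bins);
        println!(
            "bins={:>6} approx={:.2} abs_error={:.2} bucket_width={:.2}",
            bins,
            approx,
            (approx - exact).abs(),
            width
        );
    }
}
```

A t-digest places its centroids adaptively rather than at fixed widths, so the numbers will differ, but the trade-off is the same: more buckets cost more memory and time and give a tighter error bound.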

@Ted-Jiang Ted-Jiang changed the title from "Support number of histogram bins in approx_percentile_cont" to "Support number of centroids in approx_percentile_cont" Aug 15, 2022
@andygrove
Member

We should update the documentation at https://github.com/apache/arrow-datafusion/blob/master/docs/source/user-guide/sql/aggregate_functions.md as part of this PR

Contributor

@alamb alamb left a comment

Thank you @Ted-Jiang -- @domodwyer / @realno I wonder if you have any interest in reviewing this PR.

@Ted-Jiang
Member Author

> We should update the documentation at https://github.com/apache/arrow-datafusion/blob/master/docs/source/user-guide/sql/aggregate_functions.md as part of this PR

Thanks for the reminder @andygrove, added in 7d67a2f.

@Ted-Jiang Ted-Jiang requested a review from alamb August 16, 2022 08:19
Contributor

@alamb alamb left a comment

Nice work @Ted-Jiang -- I think the code and tests look quite nice.

Sorry for the delay in review

got
)))
};
let percentile = validate_input_percentile_expr(&expr[1])?;
Contributor

👍
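
For readers without the diff open, a sketch of what a percentile check like the one in the reviewed line might do. The body of `validate_input_percentile_expr` is not shown in this conversation, so `validate_percentile` below is a hypothetical stand-in, and the inclusive 0.0 to 1.0 range is an assumption of the sketch.

```rust
/// Hypothetical validator; the name, range, and message are illustrative
/// assumptions, not the function from the PR.
fn validate_percentile(value: f64) -> Result<f64, String> {
    // `contains` is false for NaN, so NaN is rejected as well.
    if !(0.0..=1.0).contains(&value) {
        return Err(format!(
            "percentile must be between 0.0 and 1.0 inclusive, got {}",
            value
        ));
    }
    Ok(value)
}

fn main() {
    println!("{:?}", validate_percentile(0.95));
    println!("{:?}", validate_percentile(1.5));
}
```

Doing this check once, when the aggregate expression is constructed, means an invalid argument is reported at planning time rather than per batch.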

let sql = "SELECT c1, approx_percentile_cont(c3, 0.95, 200) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1";
let actual = execute_to_batches(&ctx, sql).await;
let expected = vec![
"+----+--------+",
Contributor

I am not a statistics expert, but I verified that this is different from the output generated by SELECT c1, approx_percentile_cont_with_weight(c3, c2, 0.95) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1 a few lines above.

Member Author

@alamb Thanks for your review 😊
This SQL should produce the same result as
SELECT c1, approx_percentile_cont(c3, 0.95) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1
on line 867.

@alamb alamb merged commit 929eb6d into apache:master Aug 17, 2022
ursabot commented Aug 17, 2022

Benchmark runs are scheduled for baseline = 2aa0a98 and contender = 929eb6d. 929eb6d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels

core (Core DataFusion crate), logical-expr (Logical plan and expressions), physical-expr (Changes to the physical-expr crates)

Development

Successfully merging this pull request may close these issues.

Support number of histogram bins in approx_percentile_cont
Support create ApproxPercentileAccumulator with TDigest max_size

5 participants