Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Aug 6, 2022

Which issue does this PR close?

Closes #3049

Rationale for this change

Matching PostgreSQL, Spark and improve performance. Overall simplification.

What changes are included in this PR?

Moves most functions, except rpad, lpad and translate to use code points rather than grapheme clusters.
Also code simplification & avoiding some allocations.

Are there any user-facing changes?

@Dandandan Dandandan marked this pull request as ready for review August 6, 2022 12:48
@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Aug 6, 2022
@Dandandan Dandandan changed the title Calculate string length based on code points [WIP] Calculate string length based on code points Aug 6, 2022
@Dandandan Dandandan changed the title [WIP] Calculate string length based on code points Calculate string length based on code points Aug 6, 2022
@codecov-commenter
Copy link

codecov-commenter commented Aug 6, 2022

Codecov Report

Merging #3054 (12d7366) into master (a8ed874) will increase coverage by 0.03%.
The diff coverage is 79.31%.

@@            Coverage Diff             @@
##           master    #3054      +/-   ##
==========================================
+ Coverage   85.84%   85.88%   +0.03%     
==========================================
  Files         289      289              
  Lines       51862    51844      -18     
==========================================
+ Hits        44520    44525       +5     
+ Misses       7342     7319      -23     
Impacted Files Coverage Δ
...atafusion/physical-expr/src/unicode_expressions.rs 77.45% <76.92%> (+4.35%) ⬆️
datafusion/core/tests/sql/unicode.rs 100.00% <100.00%> (ø)
datafusion/physical-expr/src/functions.rs 92.72% <100.00%> (ø)
datafusion/expr/src/logical_plan/plan.rs 77.43% <0.00%> (-0.18%) ⬇️
datafusion/core/tests/sql/timestamp.rs 100.00% <0.00%> (ø)
datafusion/expr/src/window_frame.rs 93.27% <0.00%> (+0.84%) ⬆️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@Dandandan Dandandan changed the title Calculate string length based on code points Use code points instead of grapheme clusters for string functions Aug 6, 2022
@Dandandan Dandandan changed the title Use code points instead of grapheme clusters for string functions [WIP] Use code points instead of grapheme clusters for string functions Aug 6, 2022
@alamb
Copy link
Contributor

alamb commented Aug 6, 2022

cc @ovr

@github-actions github-actions bot added the core Core DataFusion crate label Aug 7, 2022
@Dandandan Dandandan changed the title [WIP] Use code points instead of grapheme clusters for string functions Use code points instead of grapheme clusters for string functions Aug 7, 2022
@Dandandan Dandandan added the performance Make DataFusion faster label Aug 7, 2022
@Dandandan
Copy link
Contributor Author

@alamb @andygrove this is reviewable now.

@andygrove
Copy link
Member

LGTM. I assume there should be a documentation update as well as part of this PR? Perhaps the rustdoc for fn character_length should explain how this is now implemented?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice simplification and improvement to me ❤️ Thank you @Dandandan

cc @ovr who I think contributed the original implementation

@ovr
Copy link
Contributor

ovr commented Aug 8, 2022

cc @ovr who I think contributed the original implementation

LGTM, but I am not an original author 😄

@Dandandan
Copy link
Contributor Author

I think this was @seddonm1

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @Dandandan

@andygrove andygrove merged commit 88f6548 into apache:master Aug 8, 2022
@ursabot
Copy link

ursabot commented Aug 8, 2022

Benchmark runs are scheduled for baseline = a4fa44f and contender = 88f6548. 88f6548 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate performance Make DataFusion faster physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Simplify / speed up implementation of character_length to unicode points

6 participants