[SPARK-41159][SQL] Optimize like any and like all expressions#38672

Closed
wankunde wants to merge 7 commits into apache:master from wankunde:likeany

wankunde commented Nov 16, 2022

What changes were proposed in this pull request?

Optimize like any and like all expressions by evaluating simple patterns with the startsWith, endsWith, contains, and equalTo methods.
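
As an illustration of the idea (not the code in this PR), here is a minimal Python sketch that assumes a pattern is "simple" when it is a literal with an optional leading and/or trailing `%`. Every name in it (`classify_like`, `match_like`, `like_any`, `like_all`, `like_to_regex`) is hypothetical, and the treatment of a trailing escape character is an assumption made only to keep the example short.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical names throughout; this is not Spark's implementation.
# Each simple form captures the literal part of the pattern.
_SIMPLE_FORMS = [
    ("startsWith", re.compile(r"([^_%]+)%")),
    ("endsWith", re.compile(r"%([^_%]+)")),
    ("contains", re.compile(r"%([^_%]+)%")),
    ("equalTo", re.compile(r"([^_%]*)")),
]

# Cheap string operations that replace regex matching for simple patterns.
_OPS = {
    "startsWith": lambda s, lit: s.startswith(lit),
    "endsWith": lambda s, lit: s.endswith(lit),
    "contains": lambda s, lit: lit in s,
    "equalTo": lambda s, lit: s == lit,
}


def classify_like(pattern: str, escape: str = "\\") -> Optional[Tuple[str, str]]:
    """Return (operation, literal) for a simple pattern, else None."""
    if escape in pattern:
        return None  # escaped wildcards are left to the general matcher
    for kind, form in _SIMPLE_FORMS:
        m = form.fullmatch(pattern)
        if m:
            return kind, m.group(1)
    return None


def like_to_regex(pattern: str, escape: str = "\\") -> "re.Pattern[str]":
    """General fallback: translate a LIKE pattern into a regular expression."""
    out = []
    chars = iter(pattern)
    for c in chars:
        if c == escape:
            # Assumption: a trailing escape character is treated as a literal.
            nxt = next(chars, None)
            out.append(re.escape(nxt if nxt is not None else c))
        elif c == "%":
            out.append(".*")
        elif c == "_":
            out.append(".")
        else:
            out.append(re.escape(c))
    return re.compile("".join(out), re.DOTALL)


def match_like(value: str, pattern: str) -> bool:
    """Use a cheap string operation when possible, a regex otherwise."""
    simple = classify_like(pattern)
    if simple is not None:
        kind, literal = simple
        return _OPS[kind](value, literal)
    return like_to_regex(pattern).fullmatch(value) is not None


def like_any(value: str, patterns: List[str]) -> bool:
    return any(match_like(value, p) for p in patterns)


def like_all(value: str, patterns: List[str]) -> bool:
    return all(match_like(value, p) for p in patterns)
```

With this sketch, `classify_like("abc%")` yields `("startsWith", "abc")`, while a pattern such as `a_c` or `a%c` yields `None` and falls back to the regex path.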

Why are the changes needed?

Currently, like any and like all expressions are very slow whether the LikeSimplification rule is enabled or disabled.
See org.apache.spark.sql.execution.benchmark.LikeAnyBenchmark:

OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1022-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
[info] Multi like query:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Query with multi like                              1439           1518          78          0.0     1438904.6       1.0X
[info] Query with LikeAny simplification                  1392           1427          30          0.0     1392103.7       1.0X
[info] Query without LikeAny simplification                368            374           5          0.0      368485.2       3.9X

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests.

@github-actions github-actions bot added the SQL label Nov 16, 2022
@wankunde wankunde force-pushed the likeany branch 2 times, most recently from 339217c to 351a584 Compare November 16, 2022 13:01
@AmplabJenkins

Can one of the admins verify this patch?

@wankunde wankunde changed the title [WIP][SPARK-41159][SQL] Optimize like any and like all expressions [SPARK-41159][SQL] Optimize like any and like all expressions Nov 18, 2022

wankunde commented Dec 7, 2022

Hi @beliefer @cloud-fan @wangyum, could you help review this PR? Thanks.

Contributor

LikeSimplification already has a similar optimization. Why is this class needed?

Contributor

The benchmark does not demonstrate a performance improvement. Could you test with and without MatchMultiHelper?

Contributor Author

Before this PR:

[info] Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
[info] Multi like query:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Query with multi like                              1393           1469         119          0.0     1392586.7       1.0X
[info] Query with LikeAny simplification                  1244           1309          97          0.0     1244382.5       1.1X
[info] Query without LikeAny simplification                400            407           8          0.0      399924.3       3.5X

[info] Multi like query:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Query with multi like                              1476           1576         149          0.0     1475710.1       1.0X
[info] Query with LikeAny simplification                  1387           1429          37          0.0     1386669.1       1.1X
[info] Query without LikeAny simplification                430            470          35          0.0      430435.8       3.4X

After this PR:

[info] Multi like query:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Query with multi like                              1441           1516          78          0.0     1441335.8       1.0X
[info] Query with LikeAny simplification                  1401           1431          44          0.0     1400743.9       1.0X
[info] Query without LikeAny simplification                357            369          10          0.0      357419.8       4.0X

[info] Multi like query:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Query with multi like                              1524           1628         117          0.0     1524119.6       1.0X
[info] Query with LikeAny simplification                  1405           1418          18          0.0     1405258.7       1.1X
[info] Query without LikeAny simplification                362            372          12          0.0      361654.4       4.2X
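
To make the before/after comparison easier to read, the following Python sketch averages the Best Time (ms) values from the two runs before and the two runs after this PR and reports their ratio. The numbers are copied from the tables above; the helper names are hypothetical, and with only two runs per side and standard deviations of up to 149 ms, small differences may be within noise.

```python
# Best Time (ms) copied from the benchmark tables above:
# two runs before the PR and two runs after it.
best_ms = {
    "Query with multi like": {"before": [1393, 1476], "after": [1441, 1524]},
    "Query with LikeAny simplification": {"before": [1244, 1387], "after": [1401, 1405]},
    "Query without LikeAny simplification": {"before": [400, 430], "after": [357, 362]},
}


def mean(values):
    return sum(values) / len(values)


def speedup(case: str) -> float:
    """Mean best time before the PR divided by mean best time after it.

    A value above 1 means the case got faster; below 1 means it got slower.
    """
    runs = best_ms[case]
    return mean(runs["before"]) / mean(runs["after"])


if __name__ == "__main__":
    for case in best_ms:
        print(f"{case}: {speedup(case):.2f}x")
```

On these numbers, only the "Query without LikeAny simplification" case comes out above 1; the other two come out slightly below 1.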

Contributor

Does this mean there is a regression?

wankunde commented Dec 7, 2022

After LikeSimplification, a combination of multiple like expressions joined with OR can be pushed down to the Parquet reader, while like any cannot.
So I am closing this PR.
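
The pushdown difference can be pictured with a toy model in Python. It is a sketch under stated assumptions, not Spark's planner: the classes `StartsWith`, `Or`, and `LikeAny` and the function `translate` are hypothetical stand-ins, and the only rule modeled is that a predicate can be pushed down when every node in it has a data source equivalent.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Toy expression nodes (hypothetical, for illustration only).
@dataclass(frozen=True)
class StartsWith:
    column: str
    prefix: str


@dataclass(frozen=True)
class Or:
    left: object
    right: object


@dataclass(frozen=True)
class LikeAny:
    column: str
    patterns: Tuple[str, ...]


def translate(expr: object) -> Optional[str]:
    """Return a source-side filter description, or None if not pushable."""
    if isinstance(expr, StartsWith):
        return f"StringStartsWith({expr.column}, {expr.prefix})"
    if isinstance(expr, Or):
        left = translate(expr.left)
        right = translate(expr.right)
        # An OR is pushable only if both of its sides are.
        if left is not None and right is not None:
            return f"Or({left}, {right})"
        return None
    # Assumption of this toy model: LikeAny has no source-side equivalent.
    return None


simplified = Or(StartsWith("c", "a"), StartsWith("c", "b"))
multi_like = LikeAny("c", ("a%", "b%"))
```

In this model, `translate(simplified)` produces a filter string, while `translate(multi_like)` returns `None`, so the scan would have to read the rows and evaluate the predicate afterwards.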

@wankunde wankunde closed this Dec 7, 2022