[SPARK-41132][SQL] Convert LikeAny and NotLikeAny to InSet if no pattern contains wildcards #38649

Closed
wankunde wants to merge 3 commits into apache:master from wankunde:like_set

@wankunde wankunde commented Nov 14, 2022

What changes were proposed in this pull request?

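Improve the performance of LikeAny and NotLikeAny.

Patterns that contain no wildcards can be matched with a set lookup instead of a regex. For example, the query SELECT * FROM tab WHERE trim(addr) LIKE ANY ('5001', '5002%', '%5003', '5004', '5001') can be optimized to SELECT * FROM tab WHERE trim(addr) IN ('5001', '5004') OR trim(addr) LIKE '5002%' OR trim(addr) LIKE '%5003'. The duplicate literal '5001' appears only once in the IN list.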
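The split described here can be sketched in plain Scala, outside of Catalyst. This is only an illustration of the idea, not the code in this PR: the object `LikeAnySplit` and its methods are hypothetical names, the escape character is assumed to be `\`, and any pattern containing `%`, `_`, or the escape character is conservatively kept as a LIKE pattern. The SQL text it renders does no quote escaping and is for display only.

```scala
object LikeAnySplit {
  // Conservative check: treat a pattern as a plain literal only if it has
  // no LIKE wildcard ('%' or '_') and no escape character (assumed '\').
  def isLiteral(pattern: String): Boolean =
    !pattern.exists(c => c == '%' || c == '_' || c == '\\')

  // Returns (distinct literal patterns, remaining wildcard patterns),
  // both in first-seen order.
  def split(patterns: Seq[String]): (Seq[String], Seq[String]) = {
    val (literals, wildcards) = patterns.partition(isLiteral)
    (literals.distinct, wildcards)
  }

  // Renders the rewritten predicate as SQL text, for illustration only.
  def rewrite(column: String, patterns: Seq[String]): String = {
    val (literals, wildcards) = split(patterns)
    val in =
      if (literals.nonEmpty) {
        Seq(literals.map(p => s"'$p'").mkString(s"$column IN (", ", ", ")"))
      } else {
        Nil
      }
    val likes = wildcards.map(p => s"$column LIKE '$p'")
    (in ++ likes).mkString(" OR ")
  }
}
```

Calling `LikeAnySplit.rewrite("trim(addr)", Seq("5001", "5002%", "%5003", "5004", "5001"))` reproduces the rewritten predicate from the example query.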

Before this PR

Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Multi like query:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Query with LikeAny simplification                  2335           2456         184          0.0      233496.6       1.0X

After this PR

Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Multi like query:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Query with LikeAny simplification                  1912           1966          50          0.0      191230.9       1.0X

Why are the changes needed?

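Matching against a set of values is faster than evaluating regular expressions. In the benchmark above, the best time drops from 2335 ms to 1912 ms, roughly 18% less.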
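As a rough illustration of why the set route is cheaper, the sketch below compares the two ways of checking a value against wildcard-free patterns. It is a standalone example and not Spark's implementation: the object `SetVsRegex`, its pattern list, and its method names are all hypothetical. Because the patterns contain no wildcards, each LIKE degenerates to equality, so both functions return the same answer; the regex route tries every compiled pattern in turn, while the set route does a single hash lookup.

```scala
import java.util.regex.Pattern

object SetVsRegex {
  // Hypothetical wildcard-free patterns; each LIKE here is plain equality.
  val patterns: Seq[String] = Seq("5001", "5004", "5007")

  // Regex route: one compiled pattern per literal, tried one after another.
  private val compiled: Seq[Pattern] =
    patterns.map(p => Pattern.compile(Pattern.quote(p)))

  def matchesAnyRegex(value: String): Boolean =
    compiled.exists(_.matcher(value).matches())

  // Set route: one hash lookup, independent of the number of literals.
  private val literalSet: Set[String] = patterns.toSet

  def matchesAnySet(value: String): Boolean = literalSet.contains(value)
}
```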

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests.

@github-actions github-actions bot added the SQL label Nov 14, 2022
@AmplabJenkins

Can one of the admins verify this patch?

if (remainPatterns.nonEmpty) And(and, l.copy(patterns = remainPatterns)) else and
case l: LikeAny =>
val or = buildBalancedPredicate(replacements, Or)
val equalPatterns = MutableHashset[Any]()
Member

Do you have any benchmark numbers?

Contributor Author

Added the benchmark results.

Benchmark code:

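import java.io.File

import scala.util.Random

import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark

object LikeAnyBenchmark extends SqlBasedBenchmark {
  import spark.implicits._

  // Number of rows in the generated table and number of string columns per row.
  private val numRows = 10000
  private val width = 5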

  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    try f finally tableNames.foreach(spark.catalog.dropTempView)
  }

  private def saveAsTable(df: DataFrame, dir: File): Unit = {
    val parquetPath = dir.getCanonicalPath + "/parquet"
    df.write.mode("overwrite").parquet(parquetPath)
    spark.read.parquet(parquetPath).createOrReplaceTempView("parquetTable")
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    withTempPath { dir =>
      withTempTable("parquetTable") {
        val selectExpr = (1 to width).map(i => s"CAST(value + 1000000 AS STRING) c$i")
        val df = spark.range(0, numRows, 1, 100)
          .map(_ => Random.nextLong).selectExpr(selectExpr: _*)
        saveAsTable(df, dir)

        val benchmark =
          new Benchmark("Multi like query", numRows, minNumIters = 3, output = output)

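        benchmark.addCase("Query with LikeAny simplification", numIters = 3) { _ =>
          // Build 300 shuffled patterns: 100 prefix patterns ('1000%' to '1099%'),
          // 100 suffix patterns ('%1100' to '%1199'), and 100 wildcard-free
          // literals ('1200' to '1299').
          val likeAnyExpr =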
            Random.shuffle(Range(1000, 1300).map(i =>
              if (i < 1100) s"'$i%'" else if (i < 1200) s"'%$i'" else s"'$i'"
            )).mkString("c1 like any(", ", ", ")")
          spark.sql(s"SELECT * FROM parquetTable WHERE $likeAnyExpr").noop()
        }
        benchmark.run()
      }
    }
  }
}

@wankunde wankunde requested a review from wangyum January 12, 2023 08:25
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Apr 23, 2023
@github-actions github-actions bot closed this Apr 24, 2023