[SPARK-38444][SQL]Automatically calculate the upper and lower bounds of partitions when no specified partition related params by caican00 · Pull Request #35764 · apache/spark

caican00 · 2022-03-08T08:47:16Z

What changes were proposed in this pull request?

When accessing an RDBMS such as MySQL, this patch automatically calculates the upper and lower bounds from the primary key to improve parallelism and speed up queries.

Why are the changes needed?

When accessing an RDBMS such as MySQL, if partitionColumn, lowerBound, upperBound and numPartitions are not specified, only one partition scans the database by default.

This makes loading data from the database slow, and users have to configure several parameters to improve parallelism.

This patch automatically calculates the upper and lower bounds from the primary key to improve parallelism and speed up queries.
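The idea of splitting a key range into parallel scans can be sketched as follows. This is a minimal illustration in plain Scala, not the code from this PR or from Spark: the object name `PartitionSketch`, the function `wherePredicates`, and the equal-width stride are assumptions, and it only covers integral keys.

```scala
object PartitionSketch {
  // Build one WHERE predicate per partition over the inclusive key range [lower, upper].
  // Hypothetical helper for illustration only.
  def wherePredicates(
      column: String,
      lower: Long,
      upper: Long,
      numPartitions: Int): Seq[String] = {
    require(lower <= upper, "lower bound must not exceed upper bound")
    require(numPartitions >= 1, "numPartitions must be positive")
    // Number of distinct key values in the range; BigInt avoids Long overflow.
    val span = BigInt(upper) - BigInt(lower) + 1
    // Never create more partitions than there are distinct values.
    val n = (BigInt(numPartitions) min span).toInt
    // Interior boundaries: lower + stride, lower + 2 * stride, ...
    val cuts = (1 until n).map(i => BigInt(lower) + span / n * i)
    if (cuts.isEmpty) {
      Seq("1 = 1") // a single partition scans everything
    } else {
      // The first partition also picks up NULL keys; the last one is open-ended.
      val first = s"$column < ${cuts.head} OR $column IS NULL"
      val middle = cuts.sliding(2).collect {
        case Seq(a, b) => s"$column >= $a AND $column < $b"
      }
      val last = s"$column >= ${cuts.last}"
      (first +: middle.toSeq) :+ last
    }
  }
}
```

With `lower` and `upper` taken from `min` and `max` of the primary key, each predicate would back one partition of the scan.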

Does this PR introduce any user-facing change?

Yes. It adds a new config, defaultNumPartitions, to JDBCOptions. It is used to set the default parallelism.

How was this patch tested?

New tests were added.


AmplabJenkins · 2022-03-08T09:54:20Z

Can one of the admins verify this patch?

caican00 · 2022-03-08T10:09:17Z

@MaxGekk Hi, could you help review this patch? Thanks.

HyukjinKwon · 2022-03-08T13:38:09Z

cc @maropu too, FYI, who has looked into these code paths.

caican00 · 2022-03-09T03:47:42Z

cc @cloud-fan

wangyum · 2022-03-12T14:16:38Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala

  val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt)

+  // the default number of partitions
+  val defaultNumPartitions = parameters.getOrElse(DEFAULT_NUM_PARTITIONS, "10").toInt


What is the difference between this and JDBC_NUM_PARTITIONS?

Same question here :)
Do we really need a default? If the users want to have multiple partitions, shouldn't they specify this explicitly?

What is the difference between this and JDBC_NUM_PARTITIONS?

@wangyum

partitionColumn, lowerBound, upperBound and numPartitions must be specified together. If the user specifies an unreasonable numPartitions, such as 1, the parallelism is still very low.

Therefore, we (JDBC) should re-derive the number of partitions from a separate config rather than JDBC_NUM_PARTITIONS, which is specified by users and may be very small.

Same question here :) Do we really need a default? If the users want to have multiple partitions, shouldn't they specify this explicitly?

@huaxingao

If users want to change the number of partitions, they only need to specify DEFAULT_NUM_PARTITIONS explicitly and no other parameters; JDBC will automatically calculate the upper and lower bounds of the partitions from the primary key.

If users do not specify any parameters, we (JDBC) need a default value to determine the base number of partitions.

I replied below. I think if this is about the default number of partitions, we can just give the JDBC_NUM_PARTITIONS flag a default value.
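The "specified together" constraint discussed in this thread can be sketched as a small check over the option map. This is an illustration under assumptions, not Spark's validation code: `JdbcOptionCheck` and `validate` are hypothetical names, keys are matched case-sensitively for simplicity, and the rule modelled is that the three range options are all-or-nothing and require numPartitions when present.

```scala
object JdbcOptionCheck {
  // The three options that describe the partitioning range.
  private val boundKeys = Seq("partitionColumn", "lowerBound", "upperBound")

  // Right(true): partitioned read; Right(false): single-partition read;
  // Left(message): the options are inconsistent.
  def validate(options: Map[String, String]): Either[String, Boolean] = {
    val present = boundKeys.filter(options.contains)
    if (present.isEmpty) Right(false)
    else if (present.size < boundKeys.size) {
      Left(s"missing: ${boundKeys.diff(present).mkString(", ")}")
    } else if (!options.contains("numPartitions")) Left("missing: numPartitions")
    else Right(true)
  }
}
```

Under this rule, a user who sets none of the range options falls into the single-partition case, which is the situation the PR is trying to improve.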

huaxingao · 2022-03-13T23:48:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala

+   * @return JDBCPartitioningInfo
+   */
+  def getPartitionBound(
+                         schema: StructType,


nit: 4 space indentation

Thanks, updated.

huaxingao · 2022-03-13T23:55:38Z

...rc/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala

+      sql(s"insert into h2.test.employee values($id, 'a')")
+    }
+    val df = sql("select id, name from h2.test.employee")
+    // default partition num is 15


Do you mean the default partition num is 10?

My mistake. Updated, thanks.

github-actions · 2022-09-26T00:26:05Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

cloud-fan · 2022-09-26T07:39:08Z

@sadikovi can you take a look?

sadikovi · 2022-09-27T05:40:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala

  }

+  /**
+   * get the min and max value by the column


nit: Get the min and max values for the column.

sadikovi · 2022-09-27T05:41:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala

+      resolver: Resolver,
+      timeZoneId: String,
+      jdbcOptions: JDBCOptions,
+      filters: Array[Filter] = Array.empty): JDBCPartitioningInfo = {


Shall we return Option[JDBCPartitioningInfo] instead?

sadikovi · 2022-09-27T05:43:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala

+      val dataType = prk.dataType
+      var lBound: String = null
+      var uBound: String = null
+      val sql = s"select min(${prk.name}) as lBound, max(${prk.name}) as uBound " +


Can you explain this logic in the javadoc for this method? Also, what happens if the table is empty?
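On the empty-table question: `min` and `max` over an empty table return a single row of NULLs, and JDBC's `getString` returns `null` for SQL NULL. One way to make that case explicit, and to fit the `Option` return type suggested above, is sketched below. `BoundsSketch`, `Bounds` and `fromRow` are hypothetical names, not part of this PR.

```scala
object BoundsSketch {
  // Hypothetical holder for the two bounds read from the min/max query.
  final case class Bounds(lower: String, upper: String)

  // lBound and uBound are the raw values from ResultSet.getString and may be null.
  // Returns None when either bound is missing, e.g. for an empty table.
  def fromRow(lBound: String, uBound: String): Option[Bounds] =
    for {
      l <- Option(lBound) // Option(null) is None
      u <- Option(uBound)
    } yield Bounds(l, u)
}
```

A caller could then fall back to a single-partition scan when the result is `None` instead of carrying `null` bounds forward.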

sadikovi · 2022-09-27T05:43:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala

+          statement.setQueryTimeout(jdbcOptions.queryTimeout)
+          val resultSet = statement.executeQuery()
+          while (resultSet.next()) {
+            lBound = resultSet.getString("lBound")


Would it work for primary keys that are integers or timestamps?

sadikovi · 2022-09-27T05:44:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala

+            uBound = resultSet.getString("uBound")
+          }
+        } catch {
+          case _: SQLException =>


Maybe it is worth at least logging the exception, but I would consider re-throwing it.
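A minimal sketch of the re-throw option, assuming a hypothetical wrapper `runBoundQuery` (not part of this PR) around whatever executes the min/max query. It keeps the original exception as the cause and adds the failing SQL to the message, rather than swallowing the error.

```scala
import java.sql.SQLException

object BoundQuerySketch {
  // Runs `body` and, on SQLException, re-throws with the query text attached.
  // The original exception is preserved as the cause.
  def runBoundQuery[A](sql: String)(body: => A): A =
    try body
    catch {
      case e: SQLException =>
        throw new SQLException(
          s"Failed to compute partition bounds with query: $sql", e)
    }
}
```

Logging before the `throw` would also be possible; the sketch omits it to stay free of any logging dependency.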

sadikovi · 2022-09-27T05:45:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala

+    filters.map(filter => filter.references.distinct.map(r => filterColumns.add(r)))
+    // primary keys used for partitioning
+    val prks = schema.fields.filter(
+      f => f.metadata.getBoolean("isIndexKey") &&


Does the code handle composite primary keys or any multi-column indexes, e.g. with 2 or more columns?

sadikovi · 2022-09-27T05:47:42Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala

      val columnType =
        dialect.getCatalystType(dataType, typeName, fieldSize, metadata).getOrElse(
          getCatalystType(dataType, fieldSize, fieldScale, isSigned))
+      list.contains(columnName) match {


Is it the same as:

metadata.putBoolean("isIndexKey", list.contains(columnName))

Also, can we make list a set?

sadikovi · 2022-09-27T05:50:41Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala

  // the number of partitions
  val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt)

+  // the default number of partitions


Can you update this comment? It is unclear which default number of partitions this is: is it the overall number of partitions in the RDD, or is it specific to primary keys in the table and pushed filters?

sadikovi

Can you update the PR title and description to reflect the changes? I think we should just have a flag to enable/disable partitioning based on available primary keys.

sadikovi · 2022-09-27T05:54:33Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala

  val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt)

+  // the default number of partitions
+  val defaultNumPartitions = parameters.getOrElse(DEFAULT_NUM_PARTITIONS, "10").toInt


I think the name of the config is misleading; this is essentially the default value of the numPartitions config:

val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt).getOrElse(10)
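The suggestion above can be tried out in isolation with a plain `Map` standing in for the options. This is a sketch under assumptions: `NumPartitionsSketch` and `resolve` are hypothetical names, and the default of 10 is the value proposed in this PR, not an existing Spark default.

```scala
object NumPartitionsSketch {
  // Key used by Spark's JDBC source for the number of partitions.
  val JDBC_NUM_PARTITIONS = "numPartitions"

  // Use the user's value when present, otherwise fall back to the default.
  def resolve(parameters: Map[String, String], default: Int = 10): Int =
    parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt).getOrElse(default)
}
```

With a single option and a fallback, a user-specified value always wins and there is no second config whose relationship to numPartitions has to be explained.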

github-actions · 2023-01-06T00:19:45Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

[SPARK-38444][SQL]Automatically calculate the upper and lower bounds …

137ebe7

…of partitions when no specified partition related params

github-actions bot added the SQL label Mar 8, 2022

Merge branch 'master' into optimize-jdbc-scan

1b78565

[SPARK-38444][SQL]update

8fffe8e

caican00 added 3 commits March 9, 2022 16:09

Merge branch 'master' into optimize-jdbc-scan

aafc990

[SPARK-38444][SQL]update

0ff4481

[SPARK-38444][SQL]update

94c9e51

wangyum reviewed Mar 12, 2022


huaxingao reviewed Mar 13, 2022


update

d9a6b1d

github-actions bot added the Stale label Sep 26, 2022

cloud-fan removed the Stale label Sep 26, 2022

sadikovi reviewed Sep 27, 2022


github-actions bot added the Stale label Jan 6, 2023

github-actions bot closed this Jan 7, 2023
