
[SPARK-16056] [SPARK-16057] [SPARK-16058] [SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source #13773

Closed
gatorsmile wants to merge 4 commits into apache:master from gatorsmile:jdbcPartitioning

Conversation

@gatorsmile (Member) commented Jun 19, 2016

What changes were proposed in this pull request?

This PR is to fix the following bugs:

Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning

```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 4,
  upperBound = 0,
  numPartitions = 3,
  connectionProperties = new Properties)
```

Before code changes:
The returned results are wrong and the generated partitions are wrong:

```
  Part 0 id < 3 or id is null
  Part 1 id >= 3 AND id < 2
  Part 2 id >= 2
```

After code changes:
An IllegalArgumentException is thrown:

```
Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
```

Issue 2: numPartitions is more than the number of key values between upper and lower bounds

```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 1,
  upperBound = 5,
  numPartitions = 10,
  connectionProperties = new Properties)
```

Before code changes:
Correct results were returned, but the generated partitions are very inefficient:

```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```

After code changes:
numPartitions is adjusted and the correct answers are returned (a minimal sketch of the partitioning logic follows the listing):

```
Partition 0: id < 2 or id is null
Partition 1: id >= 2 AND id < 3
Partition 2: id >= 3 AND id < 4
Partition 3: id >= 4
```
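For readers who want to see how these predicates come about, here is a minimal, self-contained sketch of stride-based partitioning with the partition count clamped to the key range. It is an editor's illustration, not the PR's actual code; the name `buildPartitionPredicates` is hypothetical.

```scala
// Editor's sketch, not the PR's implementation: derive stride-based WHERE clauses
// once the requested partition count has been clamped to the key range.
def buildPartitionPredicates(
    column: String,
    lowerBound: Long,
    upperBound: Long,
    requestedPartitions: Int): Seq[String] = {
  require(lowerBound <= upperBound,
    s"lower bound ($lowerBound) must not exceed upper bound ($upperBound)")
  // Never create more partitions than there are key values between the bounds.
  val numPartitions =
    if (upperBound - lowerBound >= requestedPartitions) requestedPartitions
    else (upperBound - lowerBound).toInt
  if (numPartitions <= 1) {
    Seq.empty // a single, unfiltered scan: column partitioning is effectively disabled
  } else {
    val stride = upperBound / numPartitions - lowerBound / numPartitions
    (0 until numPartitions).map { i =>
      val lower = lowerBound + i * stride
      val upper = lower + stride
      if (i == 0) s"$column < $upper or $column is null"
      else if (i == numPartitions - 1) s"$column >= $lower"
      else s"$column >= $lower AND $column < $upper"
    }
  }
}
```

Under these assumptions, `buildPartitionPredicates("id", 1, 5, 10)` produces exactly the four predicates listed above.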

Issue 3: java.lang.ArithmeticException when numPartitions is zero

```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 0,
  upperBound = 4,
  numPartitions = 0,
  connectionProperties = new Properties)
```

Before code changes:
The following exception was thrown:

```
  java.lang.ArithmeticException: / by zero
```

After code changes:
A correct answer is returned by disabling column partitioning when numPartitions is equal to or less than zero. A minimal illustration of the guard follows.
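As a rough illustration (an editor's sketch, not the PR's actual code): when fewer than two partitions are requested, the stride computation is skipped entirely, so a numPartitions of zero can no longer reach the division.

```scala
// Editor's sketch: number of JDBC scans issued for a requested partition count.
// numPartitions <= 1 (including 0 and negative values) degrades to a single,
// unpartitioned scan, so no expression of the form `upperBound / numPartitions`
// is ever evaluated with a zero divisor.
def numberOfScans(numPartitions: Int): Int =
  if (numPartitions <= 1) 1 // column partitioning disabled: one whole-table scan
  else numPartitions        // one scan per stride-based partition

// numberOfScans(0) == 1, numberOfScans(-5) == 1, numberOfScans(3) == 3
```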

How was this patch tested?

Added test cases to verify the results
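A hypothetical sketch of such a test, in ScalaTest style (not the PR's actual test case); it assumes the same `spark` session, `urlWithUserAndPass`, and TEST.seq fixtures used in the examples above:

```scala
import java.util.Properties

// Hypothetical test sketch: swapped bounds should now be rejected up front
// instead of silently producing wrong results (Issue 1).
test("jdbc: lower bound larger than upper bound is rejected") {
  val e = intercept[IllegalArgumentException] {
    spark.read.jdbc(
      url = urlWithUserAndPass,
      table = "TEST.seq",
      columnName = "id",
      lowerBound = 4,
      upperBound = 0,
      numPartitions = 3,
      connectionProperties = new Properties).collect()
  }
  assert(e.getMessage.contains(
    "the lower bound of partitioning column is larger than the upper bound"))
}
```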

```scala
val lowerBound = partitioning.lowerBound
val upperBound = partitioning.upperBound
if (lowerBound > upperBound) {
  throw new IllegalArgumentException("Operation not allowed: the lower bound of partitioning " +
```
Member commented:

Use require

gatorsmile (Member Author) replied:

I see. Let me change it.

@srowen (Member) commented Jun 19, 2016

I wouldn't open three JIRAs for what's basically one logical problem. The fact that you have one PR for all of them is indicative.

@gatorsmile (Member Author):

When opening the JIRAs, I did not know how to write a proper title covering all of them, and an obscure JIRA title makes it hard for others to track the issue in the future. That is why I opened three. I have seen us do the same in other JIRAs: https://issues.apache.org/jira/browse/SPARK-16032.

@SparkQA commented Jun 19, 2016

Test build #60814 has finished for PR 13773 at commit 2198f59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 19, 2016

Test build #60816 has finished for PR 13773 at commit 94bc09c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author) commented Jun 19, 2016

@srowen I will submit more PRs about JDBC. It seems the DataFrameReader and DataFrameWriter interfaces were not designed with JDBC data sources in mind, so Spark SQL beginners might hit various strange errors.

Anyway, I will try to create fewer JIRAs, but, to be honest, my previous team used a JIRA-like defect tracking system to record defects. We always created multiple defects when they had different external impacts, even if they were caused by multiple bugs in the same internal function; combining multiple issues into one was considered very bad practice. When each fixpack or release was published, our customers, L2, and L3 could use the list to know what was included in that specific fixpack. Here is an example: http://www-01.ibm.com/support/docview.wss?uid=swg21633303 — it contains a long list of defects. In Spark, all the JDBC-related JIRAs can be classified into the same group, but we should not combine multiple defects into one. In my previous team we always had to provide very clear titles for each JIRA/defect, since users might not be patient enough to click the link and read the details. I think the same logic applies to Spark.

@gatorsmile (Member Author):

@rxin @liancheng @clockfly @yhuai Could you please also review this PR? I see that all of you have recently reviewed JDBC-related PRs. Thanks!

@srowen (Member) commented Jun 20, 2016

Of course, you can always describe several impacts of one logical issue in one JIRA. Nobody disagrees that clear and complete descriptions are good, but that's orthogonal. Surely you provide a title and description in release notes?

The real test is: would you ever fix one but not the other? Apply a patch for one and not another? If so, then yes, they should probably be different issues, with different patches to address them. But here you made one change, so I can't take the fix for one without the other, which is why I don't quite get the argument. Also, if they're tracked as separate issues, users could miss the connection between them (though you can link JIRAs).

We also have a minor problem with some organizations rewarding JIRA and PR counts. I don't think that's an issue here, but I would like to set a certain standard about this practice.

It's up to your judgment here, but keep it in mind.

@gatorsmile (Member Author):

In the implementation, these three issues are fixed in different parts of the same function. If I cared about PR counts, I could submit and fix them one by one, right? If JIRA counts are your concern, you do not need to assign all of them to me; you can leave two of them unassigned. I am completely fine with that.

Regarding release notes, in my previous team I had to write an APAR for each defect I fixed. Each APAR has at least three parts: error description, problem summary, and problem conclusion. Customers read them, and if they think a defect is relevant to them, they might contact our L2 directly to learn more about the internal changes. I do not know who writes the release notes for Spark, but I think enterprise customers might have similar requirements.

I found these bugs while reading the source code. What is the best practice after finding bugs? Most developers are kind of lazy. : ) The easiest thing for a developer is to simply fix them and note what the bugs were, but that does not benefit end users. We need to understand the external impact of each line of changes, write separate test cases, and document them as best we can. This is how my previous team worked, and I think we should follow the same approach in Spark if we really care about conservative enterprise customers.

I joined the community less than a year ago, switching to open source development from mainframe software development. Let me know if my understanding is not right. Thanks!

@srowen (Member) commented Jun 20, 2016

No, I don't mean you're gaming anything; on the contrary. It's a minor, separate problem, though with a related implication for standards. Although the presentation of release notes is probably secondary, I agree there is some non-zero value in three entries instead of one in a huge list of changes. I actually think anyone who cares would search JIRA for details anyway to find all related issues in a release. Please go ahead as you are here; it's not worth splitting hairs about.

@gatorsmile (Member Author):

Thanks for your suggestions!

As Kurt Brown said at the summit, it is going to be harder to attract enterprise customers who do not want to be pioneers. Spark 2.0 is GREAT! There are so many new features, but, at the same time, we have to speed up improving quality to make it ready for enterprise customers. A lot of dirty work lies ahead of us. You might see me submit some PRs that contain only test cases; that is to improve test coverage. As I said above, I personally do not care about JIRA counts. When you merge the PRs, you can leave the assignee blank if you have any concern. : )

```scala
val upperBound = partitioning.upperBound
require(lowerBound <= upperBound,
  "Operation not allowed: the lower bound of partitioning column is larger than the upper " +
    s"bound. lowerBound: $lowerBound; higherBound: $upperBound")
```
Contributor commented:

Don't use "lowerBound" and "higherBound"; just say "Lower bound" and "Upper bound".

gatorsmile (Member Author) replied:

Sure, will do it.

- partitioning.lowerBound / numPartitions)
val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
// The automatic adjustment of numPartitions can ensure the following checking condition.
assert(stride >= 1, "The specified number of partitions should be greater than " +
Contributor commented:

Actually, I think this assert is pretty dangerous -- doesn't it make the read fail if the number of partitions is 10 and the total value difference is less than 10?

I'd argue there is nothing inherently wrong in that case, and this would break old code. I'd consider logging a warning instead (and including the actual values of the stride and the lower/upper bounds).

gatorsmile (Member Author) replied:

This assert is just extra checking and can never fire; the changes made in this PR already guarantee that the value is always larger than 0: https://github.com/gatorsmile/spark/blob/da3720b7a70949e09c3562e6f3a168a690243b6c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L66-L71

A stride value of zero is pretty dangerous because we would generate useless partitions. That is Issue 2, which this PR is trying to resolve:

```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```

I think your idea is good. We should log a warning message. Let me do it. Thanks!

gatorsmile (Member Author) replied:

BTW, a few years ago I wrote code for multi-column partitioning. It is pretty complicated, and I am not sure it is a good fit here. When the number of columns is large, the WHERE clause can become very long, and the major issue is that JDBC data sources might not always perform well when the WHERE clause is very complicated.

@SparkQA commented Jun 20, 2016

Test build #60870 has finished for PR 13773 at commit da3720b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author):

After the latest changes, we log a warning message when the specified numPartitions is more than the number of key values between upper and lower bounds:

```
18:37:08.848 WARN org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 4; Input number of partitions: 10; Lower bound: 1; Upper bound: 5.
```
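For illustration (an editor's addition, not part of the PR), the adjusted count reported in this warning follows from capping the requested partition count at the number of key values between the bounds:

```scala
// Worked example matching the log message above: 10 requested partitions,
// bounds 1 and 5, so only upperBound - lowerBound = 4 partitions are kept.
val lowerBound = 1L
val upperBound = 5L
val requested  = 10
val updated =
  if (upperBound - lowerBound >= requested) requested
  else (upperBound - lowerBound).toInt
// updated == 4, matching "Updated number of partitions: 4"
```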

@SparkQA commented Jun 21, 2016

Test build #60895 has finished for PR 13773 at commit 7d2ae9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jun 21, 2016

Merging in master/2.0.

@asfgit closed this in d9a3a2a Jun 21, 2016
asfgit pushed a commit that referenced this pull request Jun 21, 2016
[SPARK-16056] [SPARK-16057] [SPARK-16058] [SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13773 from gatorsmile/jdbcPartitioning.

(cherry picked from commit d9a3a2a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
zzcclp added a commit to zzcclp/spark that referenced this pull request Aug 19, 2016
