
[SPARK-16056] [SPARK-16057] [SPARK-16058] [SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source #13773

Closed
gatorsmile wants to merge 4 commits into apache:master from gatorsmile:jdbcPartitioning

Conversation

@gatorsmile (Member) commented Jun 19, 2016

What changes were proposed in this pull request?

This PR is to fix the following bugs:

Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning

```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 4,
  upperBound = 0,
  numPartitions = 3,
  connectionProperties = new Properties)
```

Before code changes:
The returned results are wrong and the generated partitions are wrong:

```
  Part 0 id < 3 or id is null
  Part 1 id >= 3 AND id < 2
  Part 2 id >= 2
```

After code changes:
An IllegalArgumentException is thrown:

```
Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
```

Issue 2: numPartitions is more than the number of key values between upper and lower bounds

```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 1,
  upperBound = 5,
  numPartitions = 10,
  connectionProperties = new Properties)
```

Before code changes:
Correct results were returned, but the generated partitions are very inefficient:

```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```

After code changes:
numPartitions is adjusted and the correct answers are returned (a minimal sketch of the partitioning logic follows the listing):

```
Partition 0: id < 2 or id is null
Partition 1: id >= 2 AND id < 3
Partition 2: id >= 3 AND id < 4
Partition 3: id >= 4
```
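For readers who want to see how these predicates come about, here is a minimal, self-contained sketch of stride-based partitioning with the partition count clamped to the key range. It is an editor's illustration, not the PR's actual code; the name `buildPartitionPredicates` is hypothetical.

```scala
// Editor's sketch, not the PR's implementation: derive stride-based WHERE clauses
// once the requested partition count has been clamped to the key range.
def buildPartitionPredicates(
    column: String,
    lowerBound: Long,
    upperBound: Long,
    requestedPartitions: Int): Seq[String] = {
  require(lowerBound <= upperBound,
    s"lower bound ($lowerBound) must not exceed upper bound ($upperBound)")
  // Never create more partitions than there are key values between the bounds.
  val numPartitions =
    if (upperBound - lowerBound >= requestedPartitions) requestedPartitions
    else (upperBound - lowerBound).toInt
  if (numPartitions <= 1) {
    Seq.empty // a single, unfiltered scan: column partitioning is effectively disabled
  } else {
    val stride = upperBound / numPartitions - lowerBound / numPartitions
    (0 until numPartitions).map { i =>
      val lower = lowerBound + i * stride
      val upper = lower + stride
      if (i == 0) s"$column < $upper or $column is null"
      else if (i == numPartitions - 1) s"$column >= $lower"
      else s"$column >= $lower AND $column < $upper"
    }
  }
}
```

Under these assumptions, `buildPartitionPredicates("id", 1, 5, 10)` produces exactly the four predicates listed above.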

Issue 3: java.lang.ArithmeticException when numPartitions is zero

```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 0,
  upperBound = 4,
  numPartitions = 0,
  connectionProperties = new Properties)
```

Before code changes:
The following exception was thrown:

```
  java.lang.ArithmeticException: / by zero
```

After code changes:
A correct answer is returned by disabling column partitioning when numPartitions is equal to or less than zero. A minimal illustration of the guard follows.
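As a rough illustration (an editor's sketch, not the PR's actual code): when fewer than two partitions are requested, the stride computation is skipped entirely, so a numPartitions of zero can no longer reach the division.

```scala
// Editor's sketch: number of JDBC scans issued for a requested partition count.
// numPartitions <= 1 (including 0 and negative values) degrades to a single,
// unpartitioned scan, so no expression of the form `upperBound / numPartitions`
// is ever evaluated with a zero divisor.
def numberOfScans(numPartitions: Int): Int =
  if (numPartitions <= 1) 1 // column partitioning disabled: one whole-table scan
  else numPartitions        // one scan per stride-based partition

// numberOfScans(0) == 1, numberOfScans(-5) == 1, numberOfScans(3) == 3
```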

How was this patch tested?

Added test cases to verify the results
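A hypothetical sketch of such a test, in ScalaTest style (not the PR's actual test case); it assumes the same `spark` session, `urlWithUserAndPass`, and TEST.seq fixtures used in the examples above:

```scala
import java.util.Properties

// Hypothetical test sketch: swapped bounds should now be rejected up front
// instead of silently producing wrong results (Issue 1).
test("jdbc: lower bound larger than upper bound is rejected") {
  val e = intercept[IllegalArgumentException] {
    spark.read.jdbc(
      url = urlWithUserAndPass,
      table = "TEST.seq",
      columnName = "id",
      lowerBound = 4,
      upperBound = 0,
      numPartitions = 3,
      connectionProperties = new Properties).collect()
  }
  assert(e.getMessage.contains(
    "the lower bound of partitioning column is larger than the upper bound"))
}
```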

```scala
val lowerBound = partitioning.lowerBound
val upperBound = partitioning.upperBound
if (lowerBound > upperBound) {
  throw new IllegalArgumentException("Operation not allowed: the lower bound of partitioning " +
```
Member commented:

Use require

gatorsmile (Member Author) replied:

I see. Let me change it.

@srowen (Member) commented Jun 19, 2016

I wouldn't open three JIRAs for what's basically one logical problem. The fact that you have one PR for all of them is indicative.

@gatorsmile (Member Author):

When opening the JIRAs, I did not know how to write a proper title covering all of them, and an obscure JIRA title makes it hard for others to track the issue in the future. That is why I opened three. I have seen us do the same in other JIRAs: https://issues.apache.org/jira/browse/SPARK-16032.

@SparkQA commented Jun 19, 2016

Test build #60814 has finished for PR 13773 at commit 2198f59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 19, 2016

Test build #60816 has finished for PR 13773 at commit 94bc09c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author) commented Jun 19, 2016

@srowen I will submit more PRs about JDBC. It seems the DataFrameReader and DataFrameWriter interfaces were not designed with JDBC data sources in mind, so Spark SQL beginners might hit various strange errors.

Anyway, I will try to create fewer JIRAs, but, to be honest, my previous team used a JIRA-like defect tracking system to record defects. We always created multiple defects when they had different external impacts, even if they were caused by multiple bugs in the same internal function; combining multiple issues into one was considered very bad practice. When each fixpack or release was published, our customers, L2, and L3 could use the list to know what was included in that specific fixpack. Here is an example: http://www-01.ibm.com/support/docview.wss?uid=swg21633303 — it contains a long list of defects. In Spark, all the JDBC-related JIRAs can be classified into the same group, but we should not combine multiple defects into one. In my previous team we always had to provide very clear titles for each JIRA/defect, since users might not be patient enough to click the link and read the details. I think the same logic applies to Spark.

@gatorsmile (Member Author):

@rxin @liancheng @clockfly @yhuai Could you please also review this PR? I see that all of you have recently reviewed JDBC-related PRs. Thanks!

@srowen (Member) commented Jun 20, 2016

Of course, you can always describe several impacts of one logical issue in one JIRA. Nobody disagrees that clear and complete descriptions are good, but that's orthogonal. Surely you provide a title and description in release notes?

The real test is: would you ever fix one but not the other? Apply a patch for one and not another? If so, then yes, they should probably be different issues, with different patches to address them. But here you made one change, so I can't take the fix for one without the other, which is why I don't quite get the argument. Also, if they're tracked as separate issues, users could miss the connection between them (though you can link JIRAs).

We also have a minor problem with some organizations rewarding JIRA and PR counts. I don't think that's an issue here, but I would like to set a certain standard about this practice.

It's up to your judgment here, but keep it in mind.

@gatorsmile (Member Author):

In the implementation, these three issues are fixed in different parts of the same function. If I cared about PR counts, I could submit and fix them one by one, right? If JIRA counts are your concern, you do not need to assign all of them to me; you can leave two of them unassigned. I am completely fine with that.

Regarding release notes, in my previous team I had to write an APAR for each defect I fixed. Each APAR has at least three parts: error description, problem summary, and problem conclusion. Customers read them, and if they think a defect is relevant to them, they might contact our L2 directly to learn more about the internal changes. I do not know who writes the release notes for Spark, but I think enterprise customers might have similar requirements.

I found these bugs while reading the source code. What is the best practice after finding bugs? Most developers are kind of lazy. : ) The easiest thing for a developer is to simply fix them and note what the bugs were, but that does not benefit end users. We need to understand the external impact of each line of changes, write separate test cases, and document them as best we can. This is how my previous team worked, and I think we should follow the same approach in Spark if we really care about conservative enterprise customers.

I joined the community less than a year ago, switching to open source development from mainframe software development. Let me know if my understanding is not right. Thanks!

@srowen (Member) commented Jun 20, 2016

No, I don't mean you're gaming anything; on the contrary. It's a minor, separate problem, though with a related implication for standards. Although the presentation of release notes is probably secondary, I agree there is some non-zero value in three entries instead of one in a huge list of changes. I actually think anyone who cares would search JIRA for details anyway to find all related issues in a release. Please go ahead as you are here; it's not worth splitting hairs about.

@gatorsmile (Member Author):

Thanks for your suggestions!

As Kurt Brown said at the summit, it is going to be harder to attract enterprise customers who do not want to be pioneers. Spark 2.0 is GREAT! There are so many new features, but, at the same time, we have to speed up improving quality to make it ready for enterprise customers. A lot of dirty work lies ahead of us. You might see me submit some PRs that contain only test cases; that is to improve test coverage. As I said above, I personally do not care about JIRA counts. When you merge the PRs, you can leave the assignee blank if you have any concern. : )

```scala
val upperBound = partitioning.upperBound
require(lowerBound <= upperBound,
  "Operation not allowed: the lower bound of partitioning column is larger than the upper " +
    s"bound. lowerBound: $lowerBound; higherBound: $upperBound")
```
Contributor commented:

Don't use "lowerBound" and "higherBound"; just say "Lower bound" and "Upper bound".

gatorsmile (Member Author) replied:

Sure, will do it.

- partitioning.lowerBound / numPartitions)
val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
// The automatic adjustment of numPartitions can ensure the following checking condition.
assert(stride >= 1, "The specified number of partitions should be greater than " +
Contributor commented:

Actually, I think this assert is pretty dangerous -- doesn't it make the read fail if the number of partitions is 10 and the total value difference is less than 10?

I'd argue there is nothing inherently wrong in that case, and this would break old code. I'd consider logging a warning instead (and including the actual values of the stride and the lower/upper bounds).

gatorsmile (Member Author) replied:

This assert is just extra checking and can never fire; the changes made in this PR already guarantee that the value is always larger than 0: https://github.com/gatorsmile/spark/blob/da3720b7a70949e09c3562e6f3a168a690243b6c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L66-L71

A stride value of zero is pretty dangerous because we would generate useless partitions. That is Issue 2, which this PR is trying to resolve:

```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```

I think your idea is good. We should log a warning message. Let me do it. Thanks!

gatorsmile (Member Author) replied:

BTW, a few years ago I wrote code for multi-column partitioning. It is pretty complicated, and I am not sure it is a good fit here. When the number of columns is large, the WHERE clause can become very long, and the major issue is that JDBC data sources might not always perform well when the WHERE clause is very complicated.

@SparkQA commented Jun 20, 2016

Test build #60870 has finished for PR 13773 at commit da3720b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author):

After the latest changes, we log a warning message when the specified numPartitions is more than the number of key values between upper and lower bounds:

```
18:37:08.848 WARN org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 4; Input number of partitions: 10; Lower bound: 1; Upper bound: 5.
```
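For illustration (an editor's addition, not part of the PR), the adjusted count reported in this warning follows from capping the requested partition count at the number of key values between the bounds:

```scala
// Worked example matching the log message above: 10 requested partitions,
// bounds 1 and 5, so only upperBound - lowerBound = 4 partitions are kept.
val lowerBound = 1L
val upperBound = 5L
val requested  = 10
val updated =
  if (upperBound - lowerBound >= requested) requested
  else (upperBound - lowerBound).toInt
// updated == 4, matching "Updated number of partitions: 4"
```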

@SparkQA commented Jun 21, 2016

Test build #60895 has finished for PR 13773 at commit 7d2ae9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jun 21, 2016

Merging in master/2.0.

@asfgit closed this in d9a3a2a Jun 21, 2016
asfgit pushed a commit that referenced this pull request Jun 21, 2016
[SPARK-16056] [SPARK-16057] [SPARK-16058] [SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13773 from gatorsmile/jdbcPartitioning.

(cherry picked from commit d9a3a2a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
zzcclp added a commit to zzcclp/spark that referenced this pull request Aug 19, 2016
