[SPARK-16056] [SPARK-16057] [SPARK-16058] [SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source #13773
gatorsmile wants to merge 4 commits into apache:master from gatorsmile:jdbcPartitioning
Conversation
```scala
val lowerBound = partitioning.lowerBound
val upperBound = partitioning.upperBound
if (lowerBound > upperBound) {
  throw new IllegalArgumentException("Operation not allowed: the lower bound of partitioning " +
```
I see. Let me change it.
I wouldn't open three JIRAs for what's basically one logical problem. The fact that you have one PR for all of them is indicative.
When opening the JIRAs, I did not know how to write one title that properly covered all of them; an obscure JIRA title makes it hard for others to track the issue in the future. That is why I opened three. We also did the same in other JIRAs: https://issues.apache.org/jira/browse/SPARK-16032.
Test build #60814 has finished for PR 13773 at commit
Test build #60816 has finished for PR 13773 at commit
@srowen Will submit more PRs about … Anyway, I will try to create fewer JIRAs, but, to be honest, in my previous team a JIRA-like defect-tracking system was used to record defects. We always created multiple defects when they had different external impacts, even if they were caused by multiple bugs in the same internal function. Combining multiple issues into one was considered very bad practice. When each fixpack or release is published, our customers, L2, and L3 might use it to see what is included in that specific fixpack. Here is an example: http://www-01.ibm.com/support/docview.wss?uid=swg21633303 There is a long list of defects. In Spark, all the JDBC-related JIRAs can be classified into the same group, but we should not combine multiple defects into one. In my previous team, we always had to provide a very clear title for each JIRA/defect; users might not be patient enough to click the link and read the details. I think the same logic also applies to Spark.
@rxin @liancheng @clockfly @yhuai Could you please also review this PR? I see that all of you recently reviewed JDBC-related PRs. Thanks!
Of course, you can always describe several impacts of one logical issue in one JIRA. Nobody disagrees that clear and complete descriptions are good, but that's orthogonal. Surely you provide a title and description in release notes? The real test is: would you ever fix one but not the other? Apply a patch for one and not another? If so, then yes, they should probably be different issues, with different patches to address them. But here you made one change, so I can't take the fix for one without the other, which is why I don't quite get the argument. Also, if they're tracked as separate issues, users could miss the connection between them (though you can link JIRAs). We also have a minor problem with some organizations rewarding JIRA and PR count. I don't think that's an issue here, but I would like to set a certain standard about this practice. It's up to your judgment here, but keep it in mind.
In the code implementation, these three issues are fixed in different parts, although they are in the same function. If I cared about PR counts, I could submit and fix them one by one, right? If JIRA counts are your concern, you do not need to assign all of them to me; you can leave two of them unassigned. I am completely fine with that. Regarding the release notes: in my previous team, I had to write an APAR for each defect I fixed. Each APAR had at least three parts: error description, problem summary, and problem conclusion. Customers read them; if they think an issue is relevant to them, they might contact our L2 directly to learn more about the internal changes. I do not know who writes the release notes for Spark, but I think enterprise customers might have similar requirements. I found these bugs while reading the source code. What is the best practice after finding bugs? Most developers are kind of lazy. : ) The easiest way is to simply fix them and note what the bugs were. However, that does not benefit end users. We need to identify the external impact of each change, write separate test cases, and document them as well as we can. This is how my previous team worked, and I think we should follow it in Spark if we really care about conservative enterprise customers. I joined the community less than a year ago, switching from Mainframe software development to open source development. Let me know if my understanding is not right. Thanks!
No, I don't mean you're gaming anything; on the contrary. It's a minor, separate problem, though with a related implication for standards. Although the presentation of release notes is probably secondary, I agree there is some non-zero value in three entries instead of one in a huge list of changes. I actually think anyone who cares would search JIRA anyway for details to find all related issues in a release. Please go ahead as you are here; it's not worth splitting hairs about.
Thanks for your suggestions! As Kurt Brown said at the summit, it is going to be harder to attract enterprise customers who do not want to be pioneers. Spark 2.0 is GREAT! So many new features, but, at the same time, we have to speed up on improving quality and make it ready for enterprise customers. A lot of dirty work lies ahead of us. You might see me submit some PRs that contain only test cases; that is for improving test coverage. As I said above, personally, I do not care about JIRA counts. When you merge the PRs, you can leave the assignee blank if you have any concern. : )
```scala
val upperBound = partitioning.upperBound
require(lowerBound <= upperBound,
  "Operation not allowed: the lower bound of partitioning column is larger than the upper " +
  s"bound. lowerBound: $lowerBound; higherBound: $upperBound")
```
Don't use "lowerBound" and "higherBound", Just say "Lower bound" and "Upper bound".
```
- partitioning.lowerBound / numPartitions)
val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
// The automatic adjustment of numPartitions can ensure the following checking condition.
assert(stride >= 1, "The specified number of partitions should be greater than " +
```
Actually, I think this assert is pretty dangerous -- doesn't it make this fail if the number of partitions is 10 and the total value difference is less than 10?
I'd argue there is nothing inherently wrong in that case, and the assert would break old code. I'd consider logging a warning instead (and including the actual values of the stride and the lower/upper bounds).
This assert is redundant; it is just extra checking. We already guarantee that the value is always larger than 0 through the changes made in this PR: https://github.com/gatorsmile/spark/blob/da3720b7a70949e09c3562e6f3a168a690243b6c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L66-L71
If the stride value is zero, it is pretty dangerous: we generate useless partitions. That is issue 2, which we are trying to resolve:
```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```
I think your idea is good. We should log a warning message. Let me do it. Thanks!
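(For illustration only: a minimal, self-contained sketch of the warning-based adjustment discussed above. The helper `adjustNumPartitions`, its placement, and the message wording are hypothetical, not the code that was merged.)

```scala
// Hypothetical helper illustrating the fix discussed above: instead of
// asserting, reduce numPartitions so that every partition covers at least
// one value, and warn when the user's setting is overridden.
def adjustNumPartitions(lowerBound: Long, upperBound: Long, requested: Int): Int = {
  require(lowerBound <= upperBound,
    s"Operation not allowed: the lower bound of partitioning column is larger " +
    s"than the upper bound. Lower bound: $lowerBound; Upper bound: $upperBound")
  val range = upperBound - lowerBound
  if (requested > range) {
    // A stride of 0 would generate mostly empty partitions (e.g. id >= 1 AND id < 1).
    Console.err.println(
      s"WARN: reducing the number of partitions from $requested to $range " +
      s"(lower bound: $lowerBound, upper bound: $upperBound).")
    math.max(range.toInt, 1)
  } else {
    requested
  }
}
```

With lowerBound = 1, upperBound = 5, and 10 requested partitions, this returns 4, matching the four partitions shown later in the PR description.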
BTW, a few years ago I wrote code for multi-column partitioning. It is pretty complicated, and I am not sure it is a good fit here. When the number of columns is large, the WHERE clause can become very long. The major issue is that JDBC data sources might not always perform well when the WHERE clause is very complicated (see the sketch below).
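(Purely illustrative, not from this PR: a sketch of a lexicographic range predicate over two hypothetical partitioning columns `c1` and `c2`. Databases without row-value comparisons need the expanded form, so each extra column roughly doubles the clause size.)

```scala
// Hypothetical sketch: disjoint range predicate over two partitioning columns,
// i.e. (c1, c2) >= (lo1, lo2) AND (c1, c2) < (hi1, hi2), spelled out for
// databases that do not support row-value comparisons.
def twoColumnRange(
    c1: String, c2: String,
    lo1: Long, lo2: Long,
    hi1: Long, hi2: Long): String = {
  s"(($c1 > $lo1) OR ($c1 = $lo1 AND $c2 >= $lo2)) AND " +
  s"(($c1 < $hi1) OR ($c1 = $hi1 AND $c2 < $hi2))"
}
```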
Test build #60870 has finished for PR 13773 at commit
After the latest changes, we log a warning message when the specified `numPartitions` is larger than the difference between the upper and lower bounds, and automatically reduce the number of partitions instead of failing.
Test build #60895 has finished for PR 13773 at commit
Merging in master/2.0.
[SPARK-16056] [SPARK-16057] [SPARK-16058] [SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source

#### What changes were proposed in this pull request?

This PR is to fix the following bugs:

**Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning**

```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 4,
  upperBound = 0,
  numPartitions = 3,
  connectionProperties = new Properties)
```

**Before code changes:**
The returned results are wrong and the generated partitions are wrong:
```
Part 0 id < 3 or id is null
Part 1 id >= 3 AND id < 2
Part 2 id >= 2
```

**After code changes:**
Issue an `IllegalArgumentException` exception:
```
Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
```

**Issue 2: numPartitions is more than the number of key values between upper and lower bounds**

```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 1,
  upperBound = 5,
  numPartitions = 10,
  connectionProperties = new Properties)
```

**Before code changes:**
Returned correct results but the generated partitions are very inefficient, like:
```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```

**After code changes:**
Adjust `numPartitions` and can return the correct answers:
```
Partition 0: id < 2 or id is null
Partition 1: id >= 2 AND id < 3
Partition 2: id >= 3 AND id < 4
Partition 3: id >= 4
```

**Issue 3: java.lang.ArithmeticException when numPartitions is zero**

```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 0,
  upperBound = 4,
  numPartitions = 0,
  connectionProperties = new Properties)
```

**Before code changes:**
Got the following exception:
```
java.lang.ArithmeticException: / by zero
```

**After code changes:**
Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero.

#### How was this patch tested?

Added test cases to verify the results.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13773 from gatorsmile/jdbcPartitioning.

(cherry picked from commit d9a3a2a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
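Taken together, a minimal self-contained sketch of the corrected behavior could look like the following. `wherePartitions` is a hypothetical stand-in for the real `JDBCRelation.columnPartition`; the messages and edge-case handling are simplified relative to the merged code.

```scala
// Hypothetical sketch of the corrected column-partitioning logic.
def wherePartitions(
    column: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int): Seq[String] = {
  // Issue 3: a non-positive numPartitions disables partitioning
  // instead of triggering a division by zero.
  if (numPartitions <= 1) return Seq("") // one partition, no WHERE clause

  // Issue 1: inverted bounds are rejected with an IllegalArgumentException
  // (which is what require throws on failure).
  require(lowerBound <= upperBound,
    s"Operation not allowed: the lower bound of partitioning column is larger " +
    s"than the upper bound. Lower bound: $lowerBound; Upper bound: $upperBound")

  // Issue 2: clamp numPartitions so the stride is at least 1,
  // avoiding empty partitions such as "id >= 1 AND id < 1".
  val count =
    if (upperBound - lowerBound >= numPartitions) numPartitions
    else math.max((upperBound - lowerBound).toInt, 1)
  if (count == 1) return Seq("") // degenerate range: single partition

  val stride = upperBound / count - lowerBound / count
  (0 until count).map { i =>
    val lower = lowerBound + i * stride
    val upper = lower + stride
    if (i == 0) s"$column < $upper or $column is null"
    else if (i == count - 1) s"$column >= $lower"
    else s"$column >= $lower AND $column < $upper"
  }
}
```

For example, `wherePartitions("id", 1, 5, 10)` yields the four predicates shown under Issue 2 above.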