[CARBONDATA-3618] Update query should throw exception if key has more than one value #3509
Conversation
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1172/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1181/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1191/
Force-pushed from 8244ada to c354187
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1187/
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1178/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1197/
@@ -135,7 +136,22 @@ private[sql] case class CarbonProjectForUpdateCommand(
    else {
      Dataset.ofRows(sparkSession, plan)
    }

    // If more than one value is present for the update key, the update should fail
    val ds = dataSet.select(CarbonCommonConstants.CARBON_IMPLICIT_COLUMN_TUPLEID)
dataSet here includes all the data that will be loaded; if the amount of data to be updated is huge, I think this check will impact update performance.
It would be better to check the count of updated rows in 'CarbonIUDAnalysisRule.processUpdateQuery', before parsing 'selectStmt' into 'selectPlan'. For example, at line 115 in CarbonAnalysisRules.scala, check the count:
if (sparkSession.sql(selectStmt).count() > 1) {
  // throw an exception
}
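The effect of this suggested early check can be sketched in plain Python (an illustration only; `count_check` and its exception type are hypothetical names, not the actual CarbonIUDAnalysisRule code):

```python
def count_check(subquery_rows):
    """Reviewer's proposal: reject the update as soon as the update
    subquery returns more than one row, before building the plan."""
    if len(subquery_rows) > 1:
        raise RuntimeError("update subquery returned more than one row")
    return subquery_rows

# A single-row subquery result passes the check unchanged.
count_check([(1, "Andy")])
```

As the later comments point out, this rejects any multi-row update result, not just the ambiguous ones, which is why it broke existing test cases.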
@jackylk @ravipesala what do you think about this?
@zzcclp: That is exactly why I am filtering only the tupleID column, and I am collecting only one row at the driver (the count of tupleIDs that occur more than once).
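The driver-side check described here can be sketched in plain Python (a hypothetical illustration of the logic only, not the Spark code; the function name is invented):

```python
from collections import Counter

def first_duplicate_tuple_id(tuple_ids):
    """Group the implicit tupleID column and return at most one tupleID
    that occurs more than once, mirroring a group-by / filter(count > 1)
    / limit(1) pipeline whose single result row is collected at the driver."""
    counts = Counter(tuple_ids)
    duplicates = [t for t, c in counts.items() if c > 1]
    return duplicates[:1]  # at most one row reaches the driver
```

Because only one (tupleID) row is ever collected, the driver-side memory cost stays constant regardless of table size; the cluster-side cost is the group-by itself, which is what the performance discussion below is about.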
Yeah, I know that, but I still worry about the update performance when updating huge amounts of data: even though it filters only the tupleID column, it still does a 'group by' operation.
The approach I mentioned above only needs a 'count' operation, and it can perform the check at an earlier step.
@zzcclp: I modified the code as per your solution, and in "UpdateCarbonTableTestCase" alone 40 test cases failed.
Here is why it fails: we don't want to throw an exception whenever the selectStmt has more than one result; we need to throw an exception only when it has duplicate results for the same key.
Example: src has id = 1, 2 and all the ids need to be updated, but the subquery returns 1, 1, 2. Now we don't know whether the first or the second 1 is the match, so only in this case do we need to throw an exception.
Tell me if you have any other solution for this problem.
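The distinction argued here can be sketched in plain Python (hypothetical helper names; this illustrates only the semantics, not the Spark implementation):

```python
from collections import Counter

def count_check_rejects(results):
    # The earlier proposal: reject whenever the subquery returns
    # more than one row.
    return len(results) > 1

def duplicate_check_rejects(results):
    # The behaviour this PR implements: reject only when the same
    # key appears more than once, i.e. the match is ambiguous.
    return any(c > 1 for c in Counter(results).values())

# src has id = 1, 2 and both must be updated: two rows are legitimate,
# but the plain count check would still reject the update.
assert count_check_rejects([1, 2]) is True
assert duplicate_check_rejects([1, 2]) is False
# The subquery returns 1, 1, 2: which 1 matches is ambiguous, so fail.
assert duplicate_check_rejects([1, 1, 2]) is True
```

This is why the plain count check broke 40 test cases: every multi-row update result was rejected, not just the ambiguous ones.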
OK, I got it. This approach is OK with me; can you test with some huge data to show how much this approach impacts update performance?
I will keep trying to find a way to check for duplicated data at the 'processUpdateQuery' step.
sql("insert into t2 select 1, 'Andy'")
sql("insert into t2 select 2, 'Andy'")
val exception = intercept[RuntimeException] {
  sql("update t1 set (age) = (select t2.age from t2 where t2.name = 'Andy') where t1.age = 1 ").show(false)
Please add more cases, for example:
update t1 set () = (select .. from t2 where t2.id = t1.id) where ...
OK, I will add the test later, and if required we can add a carbon property:
a. By default this property is true (this may degrade performance) and carbon validates each update.
b. The user can set this property to false if he doesn't want the validation and knows what he is doing.
What do you think?
OK.
Added the carbon property, updated the document and the test case.
Force-pushed from 0db0f2a to a6868f2
Build Failed with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1238/
Build Failed with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1248/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1257/
Force-pushed from a6868f2 to c362252
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1261/
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1251/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1272/
docs/configuration-parameters.md
Outdated
@@ -144,6 +144,7 @@ This section provides the details of all the configurations required for the Car
| carbon.heap.memory.pooling.threshold.bytes | 1048576 | CarbonData supports unsafe operations of Java to avoid GC overhead for certain operations. Using unsafe, memory can be allocated on Java Heap or off heap. This configuration controls the allocation mechanism on Java HEAP. If the heap memory allocations of the given size is greater or equal than this value,it should go through the pooling mechanism. But if set this size to -1, it should not go through the pooling mechanism. Default value is 1048576(1MB, the same as Spark). Value to be specified in bytes. |
| carbon.push.rowfilters.for.vector | false | When enabled complete row filters will be handled by carbon in case of vector. If it is disabled then only page level pruning will be done by carbon and row level filtering will be done by spark for vector. And also there are scan optimizations in carbon to avoid multiple data copies when this parameter is set to false. There is no change in flow for non-vector based queries. |
| carbon.query.prefetch.enable | true | By default this property is true, so prefetch is used in query to read next blocklet asynchronously in other thread while processing current blocklet in main thread. This can help to reduce CPU idle time. Setting this property false will disable this prefetch feature in query. |
| carbon.update.validate.key.to.value.mapping | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
This should be moved to the mutation-related configuration section.
OK, moved.
Force-pushed from c362252 to ce70cec
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1275/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1285/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1296/
@@ -135,7 +136,24 @@ private[sql] case class CarbonProjectForUpdateCommand(
    else {
      Dataset.ofRows(sparkSession, plan)
    }

    if (CarbonProperties.getValidateUpdateKeyValueMapping) {
Suggested change:
- if (CarbonProperties.getValidateUpdateKeyValueMapping) {
+ if (CarbonProperties.isUniqueValueCheckEnabled) {
done
docs/configuration-parameters.md
Outdated
@@ -152,6 +152,7 @@ This section provides the details of all the configurations required for the Car
| carbon.insert.storage.level | MEMORY_AND_DISK | Storage level to persist dataset of a RDD/dataframe. Applicable when ***carbon.insert.persist.enable*** is **true**, if user's executor has less memory, set this parameter to 'MEMORY_AND_DISK_SER' or other storage level to correspond to different environment. [See detail](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). |
| carbon.update.persist.enable | true | Configuration to enable the dataset of RDD/dataframe to persist data. Enabling this will reduce the execution time of UPDATE operation. |
| carbon.update.storage.level | MEMORY_AND_DISK | Storage level to persist dataset of a RDD/dataframe. Applicable when ***carbon.update.persist.enable*** is **true**, if user's executor has less memory, set this parameter to 'MEMORY_AND_DISK_SER' or other storage level to correspond to different environment. [See detail](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). |
| carbon.update.validate.key.to.value.mapping | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
Suggested change:
- | carbon.update.validate.key.to.value.mapping | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
+ | carbon.update.check.unique.value | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
done
      .select("count")
      .filter(col("count") > lit(1))
      .limit(1)
      .collect()
@zzcclp @ajantha-bhat One suggestion: try to move this check into the join subquery of the update operation itself; then we would not need to do this check separately.
@jackylk: Yes, it is now in the same place you are suggesting.
At line 135 we build Dataset.ofRows(sparkSession, plan), and on top of that I am collecting the tupleId.
Force-pushed from ce70cec to 2b44a83
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1312/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1335/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1322/
… than one value problem: Update query should throw exception if key has more than one value cause : Currently update command is adding multiple entries of key instead of throwing exception when update result key has more than one value to replace. This closes apache#3509
problem: Update query should throw an exception if a key has more than one value
cause: Currently the update command adds multiple entries for the key instead of throwing an exception when the update result has more than one value to replace for a key.
Be sure to complete all of the following checklist to help us incorporate your contribution quickly and easily:
Any interfaces changed? NA
Any backward compatibility impacted? NA
Document update required? NA
Testing done? Yes, added a UT.
For large changes, please consider breaking them into sub-tasks under an umbrella JIRA. NA