[CARBONDATA-3618] Update query should throw exception if key has more than one value #3509
Conversation
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1172/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1181/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1191/
Force-pushed from 8244ada to c354187
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1187/
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1178/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1197/
@@ -135,7 +136,22 @@ private[sql] case class CarbonProjectForUpdateCommand(
    else {
      Dataset.ofRows(sparkSession, plan)
    }

    // If more than one value is present for the update key, the update should fail
    val ds = dataSet.select(CarbonCommonConstants.CARBON_IMPLICIT_COLUMN_TUPLEID)
dataSet here includes all the data that will be loaded; if the amount of data to be updated is huge, I think this check will impact update performance.
It would be better to check the count of updated rows in 'CarbonIUDAnalysisRule.processUpdateQuery', before parsing 'selectStmt' into 'selectPlan'. For example, at line 115 in CarbonAnalysisRules.scala, check the count:
if (sparkSession.sql(selectStmt).count() > 1) {
  // throw an exception
}
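The effect of this suggested early check can be sketched in plain Python (an illustration only; `count_check` and its exception type are hypothetical names, not the actual CarbonIUDAnalysisRule code):

```python
def count_check(subquery_rows):
    """Reviewer's proposal: reject the update as soon as the update
    subquery returns more than one row, before building the plan."""
    if len(subquery_rows) > 1:
        raise RuntimeError("update subquery returned more than one row")
    return subquery_rows

# A single-row subquery result passes the check unchanged.
count_check([(1, "Andy")])
```

As the later comments point out, this rejects any multi-row update result, not just the ambiguous ones, which is why it broke existing test cases.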
@jackylk @ravipesala what do you think about this?
@zzcclp: That is exactly why I am filtering only the tupleID column, and I am collecting only one row at the driver (the count of tupleIDs that occur more than once).
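The driver-side check described here can be sketched in plain Python (a hypothetical illustration of the logic only, not the Spark code; the function name is invented):

```python
from collections import Counter

def first_duplicate_tuple_id(tuple_ids):
    """Group the implicit tupleID column and return at most one tupleID
    that occurs more than once, mirroring a group-by / filter(count > 1)
    / limit(1) pipeline whose single result row is collected at the driver."""
    counts = Counter(tuple_ids)
    duplicates = [t for t, c in counts.items() if c > 1]
    return duplicates[:1]  # at most one row reaches the driver
```

Because only one (tupleID) row is ever collected, the driver-side memory cost stays constant regardless of table size; the cluster-side cost is the group-by itself, which is what the performance discussion below is about.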
Yeah, I know that, but I still worry about the update performance when updating huge amounts of data: even though it filters only the tupleID column, it still does a 'group by' operation.
The approach I mentioned above only needs a 'count' operation, and it can perform the check at an earlier step.
@zzcclp: I modified the code as per your solution, and in "UpdateCarbonTableTestCase" alone 40 test cases failed.
Here is why it fails: we don't want to throw an exception whenever the selectStmt has more than one result; we need to throw an exception only when it has duplicate results for the same key.
Example: src has id = 1, 2 and all the ids need to be updated, but the subquery returns 1, 1, 2. Now we don't know whether the first or the second 1 is the match, so only in this case do we need to throw an exception.
Tell me if you have any other solution for this problem.
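The distinction argued here can be sketched in plain Python (hypothetical helper names; this illustrates only the semantics, not the Spark implementation):

```python
from collections import Counter

def count_check_rejects(results):
    # The earlier proposal: reject whenever the subquery returns
    # more than one row.
    return len(results) > 1

def duplicate_check_rejects(results):
    # The behaviour this PR implements: reject only when the same
    # key appears more than once, i.e. the match is ambiguous.
    return any(c > 1 for c in Counter(results).values())

# src has id = 1, 2 and both must be updated: two rows are legitimate,
# but the plain count check would still reject the update.
assert count_check_rejects([1, 2]) is True
assert duplicate_check_rejects([1, 2]) is False
# The subquery returns 1, 1, 2: which 1 matches is ambiguous, so fail.
assert duplicate_check_rejects([1, 1, 2]) is True
```

This is why the plain count check broke 40 test cases: every multi-row update result was rejected, not just the ambiguous ones.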
OK, I got it. This approach is OK with me; can you test with some huge data to show how much this approach impacts update performance?
I will keep trying to find a way to check for duplicated data at the 'processUpdateQuery' step.
sql("insert into t2 select 1, 'Andy'")
sql("insert into t2 select 2, 'Andy'")
val exception = intercept[RuntimeException] {
  sql("update t1 set (age) = (select t2.age from t2 where t2.name = 'Andy') where t1.age = 1 ").show(false)
Please add more cases, for example:
update t1 set () = (select .. from t2 where t2.id = t1.id) where ...
OK, I will add the test later, and if required we can add a carbon property:
a. By default this property is true (this may degrade performance) and carbon validates each update.
b. The user can set this property to false if he doesn't want the validation and knows what he is doing.
What do you think?
OK.
Added the carbon property, updated the document and the test case.
Force-pushed from 0db0f2a to a6868f2
Build Failed with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1238/
Build Failed with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1248/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1257/
Force-pushed from a6868f2 to c362252
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1261/
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1251/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1272/
docs/configuration-parameters.md
Outdated
@@ -144,6 +144,7 @@ This section provides the details of all the configurations required for the Car
| carbon.heap.memory.pooling.threshold.bytes | 1048576 | CarbonData supports unsafe operations of Java to avoid GC overhead for certain operations. Using unsafe, memory can be allocated on Java Heap or off heap. This configuration controls the allocation mechanism on Java HEAP. If the heap memory allocations of the given size is greater or equal than this value,it should go through the pooling mechanism. But if set this size to -1, it should not go through the pooling mechanism. Default value is 1048576(1MB, the same as Spark). Value to be specified in bytes. |
| carbon.push.rowfilters.for.vector | false | When enabled complete row filters will be handled by carbon in case of vector. If it is disabled then only page level pruning will be done by carbon and row level filtering will be done by spark for vector. And also there are scan optimizations in carbon to avoid multiple data copies when this parameter is set to false. There is no change in flow for non-vector based queries. |
| carbon.query.prefetch.enable | true | By default this property is true, so prefetch is used in query to read next blocklet asynchronously in other thread while processing current blocklet in main thread. This can help to reduce CPU idle time. Setting this property false will disable this prefetch feature in query. |
| carbon.update.validate.key.to.value.mapping | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
This should be moved to the mutation-related configuration section.
OK, moved.
Force-pushed from c362252 to ce70cec
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1275/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1285/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1296/
@@ -135,7 +136,24 @@ private[sql] case class CarbonProjectForUpdateCommand(
    else {
      Dataset.ofRows(sparkSession, plan)
    }

    if (CarbonProperties.getValidateUpdateKeyValueMapping) {
Suggested change:
- if (CarbonProperties.getValidateUpdateKeyValueMapping) {
+ if (CarbonProperties.isUniqueValueCheckEnabled) {
done
docs/configuration-parameters.md
Outdated
@@ -152,6 +152,7 @@ This section provides the details of all the configurations required for the Car
| carbon.insert.storage.level | MEMORY_AND_DISK | Storage level to persist dataset of a RDD/dataframe. Applicable when ***carbon.insert.persist.enable*** is **true**, if user's executor has less memory, set this parameter to 'MEMORY_AND_DISK_SER' or other storage level to correspond to different environment. [See detail](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). |
| carbon.update.persist.enable | true | Configuration to enable the dataset of RDD/dataframe to persist data. Enabling this will reduce the execution time of UPDATE operation. |
| carbon.update.storage.level | MEMORY_AND_DISK | Storage level to persist dataset of a RDD/dataframe. Applicable when ***carbon.update.persist.enable*** is **true**, if user's executor has less memory, set this parameter to 'MEMORY_AND_DISK_SER' or other storage level to correspond to different environment. [See detail](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). |
| carbon.update.validate.key.to.value.mapping | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
Suggested change:
- | carbon.update.validate.key.to.value.mapping | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
+ | carbon.update.check.unique.value | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
done
      .select("count")
      .filter(col("count") > lit(1))
      .limit(1)
      .collect()
@zzcclp @ajantha-bhat One suggestion: try to move this check into the join subquery of the update operation itself; then we would not need to do this check separately.
@jackylk: Yes, it is now in the same place you are suggesting.
At line 135 we build Dataset.ofRows(sparkSession, plan), and on top of that I am collecting the tupleId.
Force-pushed from ce70cec to 2b44a83
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1312/
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1335/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1322/
… than one value problem: Update query should throw exception if key has more than one value cause : Currently update command is adding multiple entries of key instead of throwing exception when update result key has more than one value to replace. This closes apache#3509
problem: Update query should throw an exception if a key has more than one value
cause: Currently the update command adds multiple entries for the key instead of throwing an exception when the update result has more than one value to replace for a key.
Be sure to complete all of the following checklist to help us incorporate your contribution quickly and easily:
Any interfaces changed? NA
Any backward compatibility impacted? NA
Document update required? NA
Testing done? Yes, added a UT.
For large changes, please consider breaking them into sub-tasks under an umbrella JIRA. NA