
[CARBONDATA-3618] Update query should throw exception if key has more than one value #3509

Closed
wants to merge 1 commit into from

Conversation

@ajantha-bhat (Member) commented Dec 12, 2019

Problem: an update query should throw an exception if a key has more than one value.

Cause: currently the update command adds multiple entries for the key, instead of throwing an exception, when the update result has more than one value to replace for a key.
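The intended behavior can be sketched in plain Python (this is an illustration of the rule the PR enforces, not the actual CarbonData/Spark code; the key here stands in for the internal tuple ID, and the error message is hypothetical):

```python
from collections import Counter

def validate_update_rows(tuple_ids):
    """Raise if any update key (tuple ID) is matched by more than one
    replacement row -- the behavior this PR adds. Plain-Python sketch,
    not the actual CarbonData implementation."""
    counts = Counter(tuple_ids)
    duplicates = [tid for tid, n in counts.items() if n > 1]
    if duplicates:
        # Hypothetical message; the real exception text may differ.
        raise RuntimeError(
            f"Update cannot be performed: more than one value present "
            f"for keys {duplicates}")
    return True

# One replacement row per key: the update is unambiguous and allowed.
validate_update_rows(["seg0/0", "seg0/1"])

# Key "seg0/0" is matched twice: the update must now fail.
try:
    validate_update_rows(["seg0/0", "seg0/0", "seg0/1"])
except RuntimeError as e:
    print("rejected:", e)
```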

Be sure to complete all of the following checklist items to help us incorporate your contribution quickly and easily:

  • Any interfaces changed? NA

  • Any backward compatibility impacted? NA

  • Document update required? NA

  • Testing done
    yes. Added UT

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1172/

@CarbonDataQA1

Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1181/

@CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1191/

@ajantha-bhat ajantha-bhat changed the title [WIP] Update query should throw exception if key has more than one value [CARBONDATA-3618] Update query should throw exception if key has more than one value Dec 13, 2019
@CarbonDataQA1

Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1187/

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1178/

@CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1197/

@@ -135,7 +136,22 @@ private[sql] case class CarbonProjectForUpdateCommand(
else {
Dataset.ofRows(sparkSession, plan)
}

// If more than one value present for the update key, should fail the update
val ds = dataSet.select(CarbonCommonConstants.CARBON_IMPLICIT_COLUMN_TUPLEID)
Contributor:
dataSet here includes all the data that will be loaded; if the amount of data to be updated is huge, I think this check will impact update performance.
It would be better to check the count of updated rows in 'CarbonIUDAnalysisRule.processUpdateQuery', before parsing 'selectStmt' into 'selectPlan'. For example, at line 115 in CarbonAnalysisRules.scala, check the count:

if (sparkSession.sql(selectStmt).count() > 1) {
  throw exception
}

Contributor:

@jackylk @ravipesala what do you think about this?

Member Author:

@zzcclp: That is exactly why I filter only the tupleID column, and I collect only one row at the driver (the count of tupleIDs that occur more than once).

Contributor:

Yeah, I know that, but I still worry about update performance when updating a huge amount of data; even though it filters only the tupleID column, it still does a 'group by' operation.
The way I mentioned above needs only a 'count' operation, and it can perform the check at an earlier step.

Member Author:

@zzcclp: I modified it as per your solution, and in "UpdateCarbonTableTestCase" alone 40 test cases failed.

I will explain why it fails:

We don't want to throw an exception when the selectStmt has more than one result; we need to throw an exception only when it has duplicate results for the same key.

Example:
src has id = 1, 2 and all the ids need to be updated, but the subquery returns 1, 1, 2. Now we don't know whether the 1 matches the first or the second row, so only in this case do we need to throw an exception.

Tell me if you have any other solution for this problem.
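The difference between the two checks discussed above can be shown with a small plain-Python sketch (an illustration only, not Spark code): a plain row count rejects perfectly valid multi-row updates, while a duplicate-key check rejects only the ambiguous case.

```python
def naive_check(rows):
    """The earlier suggestion: fail whenever the subquery returns
    more than one row. Returns True if the update would be rejected."""
    return len(rows) > 1

def duplicate_key_check(rows):
    """The approach in this PR: fail only when some key appears more
    than once. Returns True if the update would be rejected."""
    return len(rows) != len(set(rows))

valid_update = ["id=1", "id=2"]          # two keys, one value each
ambiguous    = ["id=1", "id=1", "id=2"]  # key 1 has two candidate values

print(naive_check(valid_update))          # True  -> would wrongly reject
print(duplicate_key_check(valid_update))  # False -> correctly allowed
print(duplicate_key_check(ambiguous))     # True  -> correctly rejected
```

This is why a plain count() > 1 check in processUpdateQuery broke 40 existing test cases: legitimate updates routinely touch more than one row.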

Contributor:

OK, I got it. This way is OK with me. Can you test with some huge data to show how much this approach impacts update performance?
I will continue to try to find a way to check for duplicated data in the 'processUpdateQuery' step.

sql("insert into t2 select 1, 'Andy'")
sql("insert into t2 select 2, 'Andy'")
val exception = intercept[RuntimeException] {
sql("update t1 set (age) = (select t2.age from t2 where t2.name = 'Andy') where t1.age = 1 ").show(false)
Contributor:

Please add more cases, for example:
update t1 set () = (select .. from t2 where t2.id = t1.id) where ...

Member Author:

OK, I will add tests later. If required, we can add a carbon property:
a. By default this property is true (which may degrade performance) and carbon validates each update.
b. The user can set this property to false to skip the validation if they know what they are doing.

What do you think?

Contributor:

OK.

Member Author:

Added the carbon property, updated the document and test case.

@ajantha-bhat ajantha-bhat force-pushed the issue_fix branch 2 times, most recently from 0db0f2a to a6868f2 Compare December 22, 2019 04:39
@ajantha-bhat (Member Author):
@jackylk, @zzcclp: the PR is ready.

@CarbonDataQA1

Build Failed with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1238/

@CarbonDataQA1

Build Failed with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1248/

@CarbonDataQA1

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1257/

@CarbonDataQA1

Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1261/

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1251/

@CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1272/

@@ -144,6 +144,7 @@ This section provides the details of all the configurations required for the Car
| carbon.heap.memory.pooling.threshold.bytes | 1048576 | CarbonData supports unsafe operations of Java to avoid GC overhead for certain operations. Using unsafe, memory can be allocated on Java Heap or off heap. This configuration controls the allocation mechanism on Java HEAP. If the heap memory allocations of the given size is greater or equal than this value,it should go through the pooling mechanism. But if set this size to -1, it should not go through the pooling mechanism. Default value is 1048576(1MB, the same as Spark). Value to be specified in bytes. |
| carbon.push.rowfilters.for.vector | false | When enabled complete row filters will be handled by carbon in case of vector. If it is disabled then only page level pruning will be done by carbon and row level filtering will be done by spark for vector. And also there are scan optimizations in carbon to avoid multiple data copies when this parameter is set to false. There is no change in flow for non-vector based queries. |
| carbon.query.prefetch.enable | true | By default this property is true, so prefetch is used in query to read next blocklet asynchronously in other thread while processing current blocklet in main thread. This can help to reduce CPU idle time. Setting this property false will disable this prefetch feature in query. |
| carbon.update.validate.key.to.value.mapping | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
Contributor:

This should be moved to the mutation-related configuration section.

Member Author:

OK, moved.

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1275/

@CarbonDataQA1

Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1285/

@CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1296/

@@ -135,7 +136,24 @@ private[sql] case class CarbonProjectForUpdateCommand(
else {
Dataset.ofRows(sparkSession, plan)
}

if (CarbonProperties.getValidateUpdateKeyValueMapping) {
Contributor @jackylk commented Dec 28, 2019:

Suggested change:
-    if (CarbonProperties.getValidateUpdateKeyValueMapping) {
+    if (CarbonProperties.isUniqueValueCheckEnabled) {
Member Author:

Done.

@@ -152,6 +152,7 @@ This section provides the details of all the configurations required for the Car
| carbon.insert.storage.level | MEMORY_AND_DISK | Storage level to persist dataset of a RDD/dataframe. Applicable when ***carbon.insert.persist.enable*** is **true**, if user's executor has less memory, set this parameter to 'MEMORY_AND_DISK_SER' or other storage level to correspond to different environment. [See detail](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). |
| carbon.update.persist.enable | true | Configuration to enable the dataset of RDD/dataframe to persist data. Enabling this will reduce the execution time of UPDATE operation. |
| carbon.update.storage.level | MEMORY_AND_DISK | Storage level to persist dataset of a RDD/dataframe. Applicable when ***carbon.update.persist.enable*** is **true**, if user's executor has less memory, set this parameter to 'MEMORY_AND_DISK_SER' or other storage level to correspond to different environment. [See detail](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). |
| carbon.update.validate.key.to.value.mapping | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
Contributor @jackylk commented Dec 28, 2019:

Suggested change:
- | carbon.update.validate.key.to.value.mapping | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |
+ | carbon.update.check.unique.value | true | By default this property is true, so update will validate key value mapping. This validation might have slight degrade in performance of update query. If user knows that key value mapping is correct, can disable this validation for better update performance by setting this property to false. |

Member Author:

Done.

.select("count")
.filter(col("count") > lit(1))
.limit(1)
.collect()
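The pipeline in this hunk (group by tupleId, keep counts greater than 1, limit 1, collect) brings at most one row back to the driver. A pure-Python rendering of the same logic, for illustration only (the real code runs lazily as a Spark Dataset across executors):

```python
from collections import Counter

def has_duplicate_tuple_id(tuple_ids):
    """Mimics: dataset.groupBy(tupleId).count()
                      .filter(col("count") > lit(1))
                      .limit(1).collect()
    Only the first offending group is materialized, mirroring limit(1)."""
    counts = Counter(tuple_ids).items()              # groupBy + count
    offending = (tid for tid, n in counts if n > 1)  # filter(count > 1)
    first = next(offending, None)                    # limit(1).collect()
    return first is not None

print(has_duplicate_tuple_id(["t1", "t2", "t3"]))  # False: update proceeds
print(has_duplicate_tuple_id(["t1", "t1", "t2"]))  # True: update must fail
```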
Contributor:

@zzcclp @ajantha-bhat One suggestion: try to move this check into the join subquery of the update operation itself; then we would not need to do this check separately.

Member Author:

@jackylk: yes, it is now in the same place you are suggesting.
At line 135 we create Dataset.ofRows(sparkSession, plan), and it is only on top of this that I collect the tupleId.

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1312/

@CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1335/

@CarbonDataQA1

Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1322/

@asfgit asfgit closed this in 01d0614 Dec 29, 2019
marchpure pushed a commit to marchpure/carbondata that referenced this pull request Dec 30, 2019
… than one value

problem: Update query should throw exception if key has more than one value

cause : Currently update command is adding multiple entries of key instead of throwing exception when update result key has more than one value to replace.

This closes apache#3509
niuge01 pushed a commit to niuge01/carbondata that referenced this pull request Dec 30, 2019 (same commit message; closes apache#3509)
MarvinLitt pushed a commit to MarvinLitt/carbondata that referenced this pull request Jan 3, 2020 (same commit message; closes apache#3509)