
[CARBONDATA-1305] Add limit for external dictionary values #1166

Closed
wants to merge 1 commit

Conversation

kunal642
Contributor

Analysis: During dictionary creation, the dictionary values are kept in a HashSet. When the size of the HashSet grows beyond 500000000, this exception is thrown.

Solution: Limit the number of external dictionary values to 10000000.
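The guard described in the solution can be sketched in Java as follows. This is an illustrative stand-in, not the actual CarbonData code: the class name and the plain `RuntimeException` are assumptions, while the constant name mirrors the one added in this patch.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the fix: cap the number of distinct dictionary
// values instead of letting the backing HashSet grow without bound.
public class DictionaryCollector {
    // Mirrors the constant introduced by this patch.
    public static final int MAX_EXTERNAL_DICTIONARY_SIZE = 10000000;

    private final Set<String> set = new HashSet<>();
    private final int limit;

    public DictionaryCollector(int limit) {
        this.limit = limit;
    }

    // Add a value only while the set is below the limit; otherwise fail
    // the load instead of exhausting memory. Note the check runs before
    // the add, so once the limit is reached even duplicates are rejected.
    public void parseValue(String input) {
        if (input == null) {
            return; // nulls carry no dictionary value
        }
        if (set.size() < limit) {
            set.add(input);
        } else {
            throw new RuntimeException(
                "Cannot provide more than " + limit + " dictionary values");
        }
    }

    public int size() {
        return set.size();
    }
}
```

A small limit is passed to the constructor here purely so the behavior is easy to exercise; the patch itself uses the fixed constant.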

@asfgit

asfgit commented Jul 12, 2017

Can one of the admins verify this patch?


@CarbonDataQA

Build Failed with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/462/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3050/

@kunal642 kunal642 changed the title [CARBONDATA-1305] Added limit for external dictionary values [CARBONDATA-1305] Add limit for external dictionary values Jul 12, 2017
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3051/

@CarbonDataQA

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/463/

/**
* Max number of dictionary values that can be given with external dictionary
*/
public static final int MAX_EXTERNAL_DICTIONARY_SIZE = 10000000;
Contributor

Move this constant to the specific file where it is used, since it is not configurable.

@@ -83,7 +84,12 @@ case class PrimitiveParser(dimension: CarbonDimension,

   def parseString(input: String): Unit = {
     if (hasDictEncoding && input != null) {
-      set.add(input)
+      if (set.size < CarbonCommonConstants.MAX_EXTERNAL_DICTIONARY_SIZE) {
Contributor

This check will be done for each value of each column. Please ensure there is no performance degradation in dictionary generation due to this check; try with around 100 columns, with and without this change.
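The reviewer's concern can be probed with a rough micro-benchmark along the lines below. The guard adds one `size()` call per value, which on `HashSet` is a field read, so the overhead is expected to be small; the harness, loop counts, and class name here are illustrative assumptions, not CarbonData code.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative micro-benchmark: insert the same values with and without
// the per-value size check and compare the elapsed time.
public class GuardOverheadCheck {
    static final int LIMIT = 10000000; // mirrors the patch's cap

    static long fill(int n, boolean guarded) {
        Set<String> set = new HashSet<>();
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            String value = Integer.toString(i);
            if (guarded) {
                // Same check the patch adds to parseString.
                if (set.size() < LIMIT) {
                    set.add(value);
                }
            } else {
                set.add(value);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Warm up the JIT, then time both variants on the same workload.
        fill(100000, true);
        fill(100000, false);
        long plain = fill(1000000, false);
        long withGuard = fill(1000000, true);
        System.out.println("plain:   " + plain / 1000000 + " ms");
        System.out.println("guarded: " + withGuard / 1000000 + " ms");
    }
}
```

A single-JVM timing like this is only indicative; the review asks for the real test, loading data with around 100 dictionary columns.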

@CarbonDataQA

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/471/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3060/


package org.apache.carbondata.processing.newflow.exception;

public class DictionaryLimitExceededException extends CarbonDataLoadingException {
Contributor

Use a generic no-retry exception instead of a specific exception class.

@@ -460,6 +467,10 @@ class CarbonGlobalDictionaryGenerateRDD(
           s"\n write sort info: $sortIndexWriteTime")
       }
     } catch {
+      case dictionaryException: DictionaryLimitExceededException =>
Contributor

Use a generic NoRetryException class instead of a specific exception class.
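The shape of the suggestion above can be sketched as one generic exception type that the load driver treats as non-retryable. This is a simplified stand-in: CarbonData's actual NoRetryException lives in its own package and may have a different signature, and the `runLoad` driver here is hypothetical.

```java
// Sketch of the review suggestion: one generic "do not retry" exception
// instead of a new exception class per failure cause (such as the
// DictionaryLimitExceededException proposed earlier in this PR).
public class NoRetryException extends RuntimeException {
    public NoRetryException(String message) {
        super(message);
    }

    // Hypothetical load driver: a NoRetryException fails the load
    // immediately, since retrying the task cannot change the outcome.
    static String runLoad(Runnable task) {
        try {
            task.run();
            return "SUCCESS";
        } catch (NoRetryException e) {
            return "FAILURE (no retry): " + e.getMessage();
        }
    }
}
```

The benefit is that every non-retryable failure, not just the dictionary limit, flows through one catch clause in the task scheduler.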

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3078/

@CarbonDataQA

Build Failed with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/489/

@CarbonDataQA

Build Failed with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/490/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3079/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3081/

@CarbonDataQA

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/492/

@kunal642 kunal642 force-pushed the dictionary_limit branch 2 times, most recently from 15d32db to 1dab411 Compare July 24, 2017 14:19
@CarbonDataQA

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/590/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3185/

@CarbonDataQA

Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/591/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3186/

@CarbonDataQA

Can one of the admins verify this patch?

@CarbonDataQA

SDV Build Success with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/327/

@CarbonDataQA

SDV Build Success with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/336/

@gvramana
Contributor

retest this please

@kunal642
Contributor Author

kunal642 commented Sep 3, 2017

retest this please

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/489/

      if (set.size < CarbonLoadOptionConstants.MAX_EXTERNAL_DICTIONARY_SIZE) {
        set.add(input)
      } else {
        throw new NoRetryException(s"Cannot provide more than ${
Contributor

Also handle NoRetryException for update and insert in the case of bad records.

@gvramana
Contributor

gvramana commented Sep 4, 2017

retest this please

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/501/

@kunal642
Contributor Author

kunal642 commented Sep 4, 2017

retest this please

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/508/

@gvramana
Contributor

gvramana commented Sep 4, 2017

The one SDV failure is common across other PRs.

@gvramana
Contributor

gvramana commented Sep 4, 2017

LGTM

@asfgit asfgit closed this in 2e04c35 Sep 4, 2017
xubo245 pushed a commit to xubo245/carbondata that referenced this pull request Sep 17, 2017
Analysis: During dictionary creation the dictionary values are kept in a HashSet. When the size of the HashSet grows beyond 500000000 this exception is thrown.

Solution: Limit the dictionary values to 10000000

This closes apache#1166
@kunal642 kunal642 deleted the dictionary_limit branch October 6, 2017 05:54

6 participants