[CARBONDATA-2633][BloomDataMap] Fix bugs in bloomfilter for dictionary/sort/date/TimeStamp column #2403
Conversation
retest this please
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5411/
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5321/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6490/
retest this please
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6508/
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5340/
retest this please
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6516/
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5348/
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5433/
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5351/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6520/
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5434/
@@ -175,6 +297,7 @@ public boolean isScanRequired(FilterResolverIntf filterExp) {
public void clear() {
bloomIndexList.clear();
bloomIndexList = null;
remove it
-  private BloomQueryModel(String columnName, DataType dataType, Object filterValue) {
+  private BloomQueryModel(String columnName, byte[] filterValue) {
please describe the filterValue
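As an illustration of the requested description, a sketch of what the javadoc on `filterValue` could say (hypothetical class, not the actual CarbonData `BloomQueryModel`):

```java
// Hypothetical sketch, not the real CarbonData class: shows the kind of
// documentation the review asks for on filterValue.
public class BloomQueryModelSketch {
  private final String columnName;
  // filterValue: the filter literal already converted to carbon's internal
  // byte representation (per this PR: surrogate-key bytes for dictionary/date
  // columns, plain bytes for sort columns and ordinary dimensions, value bytes
  // for measures), so it matches the bytes indexed in the bloom filter.
  private final byte[] filterValue;

  public BloomQueryModelSketch(String columnName, byte[] filterValue) {
    this.columnName = columnName;
    this.filterValue = filterValue;
  }

  public String getColumnName() { return columnName; }

  public byte[] getFilterValue() { return filterValue; }
}
```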
Force-pushed 6c33eb4 to 009e333 (compare)
@jackylk I refactored the commit based on our discussion, please check. Some tests were added to clarify the scenarios.
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5488/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6661/
SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5521/
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6664/
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5491/
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5524/
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5525/
retest this please
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6671/
Force-pushed 854598c to b9d0976 (compare)
retest this please
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5498/
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5529/
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6674/
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5501/
@@ -69,6 +86,27 @@
indexBloomFilters = new ArrayList<>(indexColumns.size());
initDataMapFile();
resetBloomFilters();

keyGenerator = segmentProperties.getDimensionKeyGenerator();
Can we optimize this instead of passing the whole SegmentProperties
into this Writer class? Please check @ravipesala
Here we use the keyGenerator, ColumnarSpitter and dimensions for this segment.
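For illustration, the narrowed-dependency alternative being discussed could look like this (all names hypothetical, not CarbonData APIs; the reply above notes the writer needs the key generator, the columnar splitter and the dimensions, which is why passing `SegmentProperties` was kept):

```java
import java.util.List;

// Hypothetical sketch of the reviewer's suggestion: pass the writer only the
// pieces it actually uses instead of the whole SegmentProperties.
public class BloomWriterDepsSketch {
  interface KeyGen { byte[] generateKey(int[] dictValues); }

  final KeyGen keyGenerator;         // stand-in for the dimension key generator
  final List<String> dimensionNames; // stand-in for List<CarbonDimension>

  BloomWriterDepsSketch(KeyGen keyGenerator, List<String> dimensionNames) {
    this.keyGenerator = keyGenerator;
    this.dimensionNames = dimensionNames;
  }
}
```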
datamap/bloom/pom.xml
Outdated
<version>${project.version}</version>
</dependency>
<!--note: guava 14.0.1 is omitted during assembly.
The compile scope here is for building and running test-->
you have not added the compile scope
Oh, this line should be removed. It was previously used for the guava cache. Will fix it.
row.update(1, index);
} else if (value.equals(nullFormat)) {
row.update(1, index);
return 1;
Suggest creating a constant for the null value (1)
OK~
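A minimal sketch of that suggestion (the constant name is hypothetical):

```java
// Hypothetical sketch: replace the magic number 1 returned for nulls with a
// named constant.
public class NullValueSketch {
  public static final int NULL_VALUE = 1; // hypothetical constant name

  // Returns NULL_VALUE when the raw value is null or equals the configured
  // null format string, 0 otherwise.
  public static int ordinalFor(String value, String nullFormat) {
    if (value == null || value.equals(nullFormat)) {
      return NULL_VALUE;
    }
    return 0;
  }
}
```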
For a dictionary column, carbon converts the literal value to a dict value, then converts the dict value to an MDK value, and finally stores the MDK value as the internal value in the carbon file. For sort columns and date columns, the value has also been encoded. In the bloomfilter datamap we therefore index on the encoded data, that is to say: for dictionary/date columns we use the surrogate key as the bloom index key; for sort columns and ordinary dimensions we use the plain bytes as the bloom index key; for measures we convert the value to bytes and use that as the bloom index key. Changes made:
1. FieldConverters were refactored to extract common value-convert methods.
2. BloomQueryModel was optimized to support converting literal values to internal values.
3. Fix bugs for int/float/date/timestamp as bloom index columns.
4. Fix bugs for dictionary/sort columns as bloom index columns.
5. Add tests.
6. Block (deferred) rebuild for the bloom datamap (it contains bugs that are not fixed in this commit; another PR has been raised).
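The keying rule above can be sketched roughly as follows (all names are hypothetical, not CarbonData APIs; surrogate keys are shown as 4-byte ints and measures as 8-byte longs purely for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the bloom index keying rule described above.
public class BloomKeySketch {
  enum ColumnKind { DICTIONARY_OR_DATE, PLAIN_DIMENSION, MEASURE }

  static byte[] indexKey(ColumnKind kind, Object value) {
    switch (kind) {
      case DICTIONARY_OR_DATE:
        // dictionary/date: index the surrogate key assigned during loading
        return ByteBuffer.allocate(4).putInt((Integer) value).array();
      case PLAIN_DIMENSION:
        // sort columns and ordinary dimensions: index the plain bytes
        return value.toString().getBytes(StandardCharsets.UTF_8);
      case MEASURE:
        // measures: convert the value to bytes and index those
        return ByteBuffer.allocate(8).putLong(((Number) value).longValue()).array();
      default:
        throw new IllegalArgumentException("unknown kind");
    }
  }
}
```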
Force-pushed b9d0976 to 569e738 (compare)
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6696/
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5523/
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5551/
LGTM
- [CARBONDATA-2587][CARBONDATA-2588] Local Dictionary Data Loading support. Added code to support local dictionary data loading for primitive and complex types. Manual testing was done in a 3-node setup; UTs will be raised in a different PR. This closes apache#2402
- [CARBONDATA-2647][CARBONDATA-2648] Add support for COLUMN_META_CACHE and CACHE_LEVEL in create table and alter table properties. Support for configuring COLUMN_META_CACHE and CACHE_LEVEL in create and alter table set properties DDL, plus describe formatted display support for both. Create table syntax: CREATE TABLE [dbName].tableName (col1 String, col2 String, col3 int, …) STORED BY 'carbondata' TBLPROPERTIES ('COLUMN_META_CACHE'='col1,col2,…', 'CACHE_LEVEL'='BLOCKLET'). Alter table set properties syntax: ALTER TABLE [dbName].tableName SET TBLPROPERTIES ('COLUMN_META_CACHE'='col1,col2,…', 'CACHE_LEVEL'='BLOCKLET'). This closes apache#2418
- [CARBONDATA-2549] Bloom: remove guava cache and use CarbonCache. The bloom cache was implemented with a guava cache, but carbon has its own LRU cache interfaces so that the system controls the cache as a whole instead of per feature. Replace the guava cache with the carbon LRU cache. This closes apache#2327
- [CARBONDATA-2608] Document update about the JSON writer, with examples. This closes apache#2409
- [CARBONDATA-2634][BloomDataMap] Add datamap properties to show datamap output. This closes apache#2404
- [CARBONDATA-2647][CARBONDATA-2648] Fix cache level display in the describe formatted command. 1. Correct the CACHE_LEVEL display, which always showed BLOCK even when the value was configured as BLOCKLET. 2. Correct the method arguments to pass dbName first and then tableName. 3. Added a test case for blocking column_meta_cache and cache_level on child datamaps. This closes apache#2426
- [CARBONDATA-2669] Local dictionary store size optimisation and other functional issues. Problems: when all column data is empty and the columns are not in sort columns, the local dictionary store was larger than the no-dictionary store; page-level dictionary merging missed some dictionary values in a blocklet because an AND operation was done on the bitset; null values were not added in LV, so new dictionary values were generated for nulls; the local dictionary generator was thread-specific. Solution: added RLE for unsorted dictionary values to reduce size; an OR operation is now performed while merging dictionary values; added LV for null values; made the local dictionary generator task-specific. This closes apache#2427
- [CARBONDATA-2585][CARBONDATA-2586][Local Dictionary] Local dictionary support for alter table, preaggregate, varchar datatype, and alter table set/unset commands, with all related validations. All tests were executed on a 3-node cluster; UTs and SDV test cases are added in the same PR. This closes apache#2401
- [HOTFIX] Fixed compilation issues and a bloom clear issue; fixed tests. This closes apache#2428
- [CARBONDATA-2635][BloomDataMap] Support different index datamaps on the same column. A user can create index datamaps from different providers on one column (for example a bloomfilter datamap and a lucene datamap), but not two bloomfilter datamaps on one column. This closes apache#2405
- [CARBONDATA-2646][DataLoad] Change the log level from ERROR to WARN for some expected tasks while loading data into a table with the 'sort_column_bounds' property. This closes apache#2407
- [CARBONDATA-2545] Fix some spelling errors in CarbonData. This closes apache#2419
- [CARBONDATA-2629] Support SDK carbon reader reading data from HDFS and S3 with filter functions. Previously the SDK carbon reader only supported reading local data with filters and threw an exception for HDFS and S3. This closes apache#2399
- [CARBONDATA-2644][DataLoad] Add an invalid-value check for the carbon.load.sortMemory.spill.percentage parameter. This closes apache#2397
- [CARBONDATA-2653][BloomDataMap] Fix incorrect blocklet number in bloomfilter. In the non-deferred rebuild scenario, the last bloomfilter index file has already been written in onBlockletEnd; there is no need to write it again, otherwise an extra blocklet number is generated in the bloom index file. This closes apache#2408
- [CARBONDATA-2674][Streaming] Streaming with merge index enabled does not consider the merge index file while pruning. This closes apache#2429
- [CARBONDATA-2606][Complex DataType Enhancements] Fix ComplexDataType projection pushdown. Problem 1: pushdown failed when the table schema contains column names in upper case; solution: change column names to lower case. Problem 2: if a struct contains an array, only the parent column should be pushed down; solution: check for ArrayType or GetArrayItem in the complex column and, if any ArrayType is found, push down the parent column. This closes apache#2421
- [CARBONDATA-2633][BloomDataMap] Fix bugs in bloomfilter for dictionary/sort/date/timestamp columns. For a dictionary column, carbon converts the literal value to a dict value, then converts the dict value to an MDK value, and finally stores the MDK value as the internal value in the carbon file. For other columns, carbon converts the literal value to an internal value using a field converter. Since the bloomfilter datamap stores internal values, at query time we must convert the literal value in the filter to the internal value in order to match what is stored in the datamap. Changes made: 1. FieldConverters were refactored to extract common value-convert methods. 2. BloomQueryModel was optimized to support converting literal values to internal values. 3. Fix bugs for int/float/date/timestamp as bloom index columns. 4. Fix bugs for dictionary/sort columns as bloom index columns. 5. Add tests. 6. Block (deferred) rebuild for the bloom datamap (it contains bugs that are not fixed in this commit). This closes apache#2403
- [HOTFIX][32K] Maintain proper mapping of varchar and no-dictionary columns across all dimensions when creating the sort data rows instance. Problem: the column mapping for varchar and no-dictionary columns among the existing dimensions was incorrect. Solution: remove an unwanted counter variable and map the correct index to varchar and no-dictionary columns based on the number of dimensions. This closes apache#2395
- [CARBONDATA-2650][Datamap] Fix negative number of skipped blocklets. Currently the default blocklet datamap is used to prune blocklets, then other index datamaps are applied. But the other index datamaps work at segment scope, so in some scenarios their pruned result is larger than that of the default datamap, producing a negative number of skipped blocklets in the explain output. We now intersect the results after pruning, and finish pruning early if the pruned result size is zero. This closes apache#2410
- [CARBONDATA-2654][Datamap] Optimize explain output for queries with datamaps. Previously, if a query hit multiple datamaps, the explain command only showed the first one. Now all datamaps hit by the query are shown. This closes apache#2411
- [CARBONDATA-2687][BloomDataMap][Doc] Update the bloomfilter datamap document. A previous PR changed the cache behaviour from guava-cache to carbon-cache; this PR updates the document accordingly and removes the cache description. This closes apache#2446
- [CARBONDATA-2684] Distinct count fails on complex columns. Fixes a code generator error thrown when a select filter contains more than one count-distinct of a complex column with a group by clause. This closes apache#2449
- [CARBONDATA-2645] Segregate block and blocklet cache. The driver caches metadata based on CACHE_LEVEL: if CACHE_LEVEL is BLOCK, only carbondata file metadata is cached in the driver; if BLOCKLET, metadata for (number of carbondata files × number of blocklets per file) is cached. This closes apache#2437
- [CARBONDATA-2675][32K] Support configuring long_string_columns when creating a datamap. When a datamap is created with a select statement, a long string column is defined as StringType in the result dataframe if selected. This PR allows setting the long_string_columns property in dmproperties. This closes apache#2432
- [CARBONDATA-2683][32K] Fix data conversion for varchar. Spark uses org.apache.spark.unsafe.types.UTF8String internally for the string datatype; in carbon, the varchar datatype should do the same conversion as the string datatype, otherwise it may throw an exception. This closes apache#2438
- [CARBONDATA-2657][BloomDataMap] Fix bugs in loading and querying with empty values on bloom index columns; convert null values to corresponding values. This closes apache#2413
- [CARBONDATA-2585][CARBONDATA-2586][Local Dictionary] Added test cases for local dictionary support for alter table, set, unset and preaggregate, covering all related validations. This closes apache#2422
- [CARBONDATA-2606][Complex DataType Enhancements] Fixed projection pushdown when a select filter contains a struct column. Problem: if the filter contains a struct column that is not in the projection list, only null values are stored for it and the select query result is null. Solution: push down the parent column of the corresponding struct type if any struct column is present in the filter list. This closes apache#2439
- [CARBONDATA-2642] Added a configurable lock path property. A new property, "carbon.lock.path", lets the user configure the lock path; refactored the code to create a separate implementation for S3CarbonFile. This closes apache#2642
- [CARBONDATA-2686] Implement left join on MV datamap. This closes apache#2444
- [CARBONDATA-2660][BloomDataMap] Add a test for querying on a longstring bloom index column. Filtering on longstring bloom index columns is already supported by PR apache#2403; this only adds a test for it. This closes apache#2416
- [CARBONDATA-2689] Added validations for complex columns in alter set statements. Issue: alter set statements did not validate complex datatype columns correctly. Fix: added a recursive method to validate string and varchar child columns of complex datatype columns. This closes apache#2450
- [CARBONDATA-2681][32K] Fix loading failure with global/batch sort when the table has long string columns. In SortStepRowHandler, global/batch sort uses convertRawRowTo3Parts instead of convertIntermediateSortTempRowTo3Parted; varcharDimCnt was not added to noDictArray, causing the error "Problem while converting row to 3 parts". This closes apache#2435
- [CARBONDATA-2658][DataLoad] Fix bugs in spilling in-memory pages. The carbon.load.sortMemory.spill.percentage parameter accepts values in the range 0-100; in-memory pages are merged and spilled to disk according to this configuration. This closes apache#2414
- [CARBONDATA-2666] Updated the rename command so that the table directory is not renamed; rename now only changes metadata. This closes apache#2420
- [CARBONDATA-2637][BloomDataMap] Fix bugs in deferred rebuild for the bloomfilter datamap. When ISSUE-2633 was implemented, deferred rebuild was disabled for the bloomfilter datamap due to unhandled bugs. This commit fixes those bugs and brings the feature back. Since the bloomfilter datamap indexes the carbon native raw bytes, the original literal values must be converted to carbon native bytes both in loading and querying. This closes apache#2425
- [CARBONDATA-2701] Refactor code to store only minimal required info in the block and blocklet cache. 1. Keep only minimal information in the block and blocklet cache. 2. Introduced a JVM-level segment properties holder: since it is a heavy object, a new segment properties object is created only when the schema or cardinality of a table changes. This closes apache#2454
- [CARBONDATA-2589][CARBONDATA-2590][CARBONDATA-2602] Local dictionary query support. Supports non-filter queries, filter queries, queries on complex columns for primitive-type local dictionary columns, local dictionary on varchar columns, and the vector reader on local dictionary. This closes apache#2447
- [CARBONDATA-2585][CARBONDATA-2586] Fix local dictionary support for preaggregate and set local dictionary info in the column schema. Sets the dict info of each column in column schema read and write for backward compatibility. This closes apache#2451
- [CARBONDATA-2711] carbonFileList is not initialized when updatetablelist is called. Bug fix: carbon is not initialized within the updatetablelist method when executing 'SELECT table_name FROM information_schema.tables WHERE table_schema = 'tmp_sbu_vadmdb'' from the command line. This closes apache#2468
- [CARBONDATA-2685][DataMap] Parallelize datamap rebuild processing for segments. Previously one spark job was started per segment and all jobs ran serially, so rebuilding many historical segments took a long time. Now one task is started per segment and all tasks run in parallel within one spark job. This closes apache#2443
- [CARBONDATA-2706][BloomDataMap] Clear bloom index files after the corresponding segment is deleted and cleaned. This closes apache#2461
- [CARBONDATA-2715][LuceneDataMap] Fix a bug in search mode with the lucene datamap on Windows. When comparing two paths, the file separator differs on Windows, producing empty pruned blocklets; this PR ignores the file separator. This closes apache#2470
- [CARBONDATA-2703][Tests] Clean up the environment after tests: 1. reset session parameters after tests; 2. clean up output after tests. This closes apache#2458
- [CARBONDATA-2607][Complex Column Enhancements] Complex primitive datatype adaptive encoding, to store complex types more effectively and make reading more efficient. Primitive types inside complex types are now separate pages: previously a complex column was a single byte-array column page, now all sub-levels are stored as separate pages with their respective datatypes. No-dictionary primitive datatypes inside complex columns are processed through adaptive encoding (previously only snappy compression was applied). For no-dictionary primitives inside complex columns, only the value is saved, except String and Varchar, which are saved as byte arrays; previously all sub-levels were saved in length-and-value format inside a single byte array. Currently only struct and array type column pages are saved as byte arrays; all other primitives except string and varchar are saved at their fixed datatype length. Added support in the safe and unsafe fixed-length column pages for a growing dynamic array, to support the array datatype. Co-authored-by: sounakr <sounakr@gmail.com> This closes apache#2417
[CARBONDATA-2587][CARBONDATA-2588] Local Dictionary Data Loading support What changes are proposed in this PR Added code to support Local Dictionary Data Loading for primitive type Added code to support Local Dictionary Data Loading for complex type. How this PR is tested Manual testing is done in 3 Node setup. UT will be raised in different PR This closes apache#2402 [CARBONDATA-2647] [CARBONDATA-2648] Add support for COLUMN_META_CACHE and CACHE_LEVEL in create table and alter table properties Things done as part of this PR Support for configuring COLUMN_META_CACHE in create and alter table set properties DDL. Support for configuring CACHE_LEVEL in create and alter table set properties DDL. Describe formatted display support for COLUMN_META_CACHE and CACHE_LEVEL Any interfaces changed? Create Table Syntax CREATE TABLE [dbName].tableName (col1 String, col2 String, col3 int,…) STORED BY ‘carbondata’ TBLPROPERTIES (‘COLUMN_META_CACHE’=’col1,col2,…’, 'CACHE_LEVEL'='BLOCKLET') Alter Table set properties Syntax ALTER TABLE [dbName].tableName SET TBLPROPERTIES (‘COLUMN_META_CACHE’=’col1,col2,…’, 'CACHE_LEVEL'='BLOCKLET') This closs apache#2418 [CARBONDATA-2549] Bloom remove guava cache and use CarbonCache Currently, bloom cache is implemented using guava cache, carbon has its own lru cache interfaces and complete sysytem it controls the cache intstead of controlling feature wise. So replace guava cache with carbon lru cache. This closes apache#2327 [CARBONDATA-2608]Document update about Json Writer with examples. Document update about Json Writer with examples. This closes apache#2409 [CARBONDATA-2634][BloomDataMap] Add datamap properties in show datamap outputs add datamap properties in show datamap outputs This closes apache#2404 [CARBONDATA-2647] [CARBONDATA-2648] Fix cache level display in describe formatted command 1. Correct CACHE_LEVEL display in describe formatted command. It was always displays BLOCK even though val was configured BLOCKLET. 2. 
Correct the method arguments to pass dbName first and then tableName. 3. Added test case for blocking column_meta_cache and cache_level on child dataMaps. This closes apache#2426 [CARBONDATA-2669] Local Dictionary Store Size optimisation and other function issues Problems Local dictionary store size issue. When all column data is empty and columns are not present in sort columns local dictionary size was more than no dictionary dictionary store size. Page level dictionary merging Issue While merging the page used dictionary values in a blocklet it was missing some of the dictionary values, this is because, AND operation was done on bitset Local Dictionary null values Null value was not added in LV because of this new dictionary values was getting generated for null values Local dictionary generator thread specific Solution: Added rle for unsorted dictionary values to reduce the size. Now OR operation is performed while merging the dictionary values Added LV for null values Local dictionary generator task specific This closes apache#2427 [CARBONDATA-2585][CARBONDATA-2586][Local Dictionary]Local dictionary support for alter table, preaggregate, varchar datatype, alter set and unset What changes were proposed in this pull request? In this PR, local dictionary support is added for alter table, preaggregate, varChar datatype, alter table set and unset command UTs are added for local dictionary load support All the validations related to above features are taken care in this PR How was this patch tested? All the tests were executed in 3 node cluster. 
UTs and SDV test cases are added in the same PR This closes apache#2401 [HOTFIX] Fixed compilation issues and bloom clear issue Fixed test This closes apache#2428 [CARBONDATA-2635][BloomDataMap] Support different index datamaps on same column User can create different provider based index datamaps on one column, for example user can create bloomfilter datamap and lucene datamap on one column, but not able to create two bloomfilter datamap on one column. This closes apache#2405 [CARBONDATA-2646][DataLoad]change the log level while loading data into a table with 'sort_column_bounds' property,'ERROR' flag change to 'WARN' flag for some expected tasks. change the log level while loading data into a table with 'sort_column_bounds' property,'ERROR' flag change to 'WARN' flag for some expected tasks. This closes apache#2407 [CARBONDATA-2545] Fix some spell error in CarbonData This closes apache#2419 [CARBONDATA-2629] Support SDK carbon reader read data from HDFS and S3 with filter function Now SDK carbon reader only support read data from local with filter function, it will throw exception when read data from HDFS and S3 with filter function This PR support it: Support SDK carbon reader read data from HDFS and S3 with filter function This closes apache#2399 [CARBONDATA-2644][DataLoad]ADD carbon.load.sortMemory.spill.percentage parameter invalid value check This closes apache#2397 [CARBONDATA-2653][BloomDataMap] Fix bugs in incorrect blocklet number in bloomfilter In non-deferred reuibuild scenario, the last bloomfilter index file has already been written onBlockletEnd, no need to write again, otherwise an extra blocklet number will be generated in the bloom index file. 
This closes apache#2408 [CARBONDATA-2674][Streaming]Streaming with merge index enabled does not consider the merge index file while pruning This closes apache#2429 [CARBONDATA-2606][Complex DataType Enhancements]Fix for ComplexDataType Projection PushDown Problem1: Fix for ComplexDataType Projection PushDown when Table Schema contains ColumnName in UpperCase Solution: Change ColumnName to Lowercase Problem2: If Struct contains Array, pushdown only parent column Solution: Check for ArrayType or GetArrayItem in the Complex Column, if any ArrayType is found, then pushdown parent column This closes apache#2421 [CARBONDATA-2633][BloomDataMap] Fix bugs in bloomfilter for dictionary/sort/date/TimeStamp column for dictionary column, carbon convert literal value to dict value, then convert dict value to mdk value, at last it stores the mdk value as internal value in carbonfile. for other columns, carbon convert literal value to internal value using field-converter. Since bloomfilter datamap stores the internal value, during query we should convert the literal value in filter to internal value in order to match the value stored in bloomfilter datamap. Changes are made: 1.FieldConverters were refactored to extract common value convert methods. 2.BloomQueryModel was optimized to support converting literal value to internal value. 3.fix bugs for int/float/date/timestamp as bloom index column 4.fix bugs in dictionary/sort column as bloom index column 5.add tests 6.block (deferred) rebuild for bloom datamap (contains bugs that does not fix in this commit) This closes apache#2403 [HOTFIX][32K]maintain proper mapping for varChar Columns and noDictionary Columns for all the dimensions while creating sort data rows instance Problem: when creating the column mapping for varChar columns and no dictionary columns for existing dimensions, the mapping is incorrect. 
Solution: remove unwanted variable counter and map correct index to varChar columns and noDictionary columns based on the number of dimensions This closes apache#2395 [CARBONDATA-2650][Datamap] Fix bugs in negative number of skipped blocklets Currently in carbondata, default blocklet datamap will be used to prune blocklets. Then other indexdatamap will be used. But the other index datamap works for segment scope, which in some scenarios, the size of pruned result will be bigger than that of default datamap, thus causing negative number of skipped blocklets in explain query output. Here we add intersection after pruning. If the pruned result size is zero, we will finish the pruning. This closes apache#2410 [CARBONDATA-2654][Datamap] Optimize output for explaining querying with datamap Currently if we have multiple datamaps and the query hits all the datamaps, carbondata explain command will only show the first datamap and all the datamaps are not shown. In this commit, we show all the datamaps that are hitted in this query. This closes apache#2411 [CARBONDATA-2687][BloomDataMap][Doc] Update document for bloomfilter datamap In previous PR, cache behaviour for bloomfilter datamap has been changed: changed from guava-cache to carbon-cache. This PR update the document for bloomfilter datamap and remove the description for cache. This closes apache#2446 Code Generator Error is thrown when Select filter contains more than one count of distinct of ComplexColumn with group by Clause [CARBONDATA-2684] [PR-2442] Distinct count fails on complex columns This PR fixes Code Generator Error thrown when Select filter contains more than one count of distinct of ComplexColumn with group by Clause This closes apache#2449 [CARBONDATA-2645] Segregate block and blocklet cache Things done as part of this PR Segregate block and blocklet cache. In this driver will cache the metadata based on CACHE_LEVEL. 
If CACHE_LEVEL is set to BLOCK, only the carbondata files' metadata is cached in the driver. If CACHE_LEVEL is set to BLOCKLET, metadata for (number of carbondata files * number of blocklets in each carbondata file) is cached in the driver. This closes apache#2437

[CARBONDATA-2675][32K] Support configuring long_string_columns when creating a datamap. When a datamap is created with a select statement, a long string column is defined with StringType in the result dataframe if that column is selected. This PR allows setting the long_string_columns property in dmproperties. This closes apache#2432

[CARBONDATA-2683][32K] Fix data conversion problem for varchar. Spark uses org.apache.spark.unsafe.types.UTF8String for the string datatype internally. In carbon, the varchar datatype should do the same conversion as the string datatype, otherwise it may throw an exception. This closes apache#2438

[CARBONDATA-2657][BloomDataMap] Fix bugs in loading and querying with empty values on bloom index columns. Convert null values to their corresponding values. This closes apache#2413

[CARBONDATA-2585][CARBONDATA-2586][Local Dictionary] Added test cases for local dictionary support for alter table, set, unset and pre-aggregate. All validations related to these features are taken care of in this PR. This closes apache#2422

[CARBONDATA-2606][Complex DataType Enhancements] Fixed projection pushdown when a select filter contains a Struct column. Problem: if the select filter contains a Struct column that is not in the projection list, only a null value is stored for the struct column given in the filter, and the select query result is null. Solution: push down the parent column of the corresponding struct type if any struct column is present in the filter list.
This closes apache#2439

[CARBONDATA-2642] Added configurable lock path property. A new property, "carbon.lock.path", is exposed to let the user configure the lock path. Refactored code to create a separate implementation for S3CarbonFile. This closes apache#2642

[CARBONDATA-2686] Implement left join on MV datamap. This closes apache#2444

[CARBONDATA-2660][BloomDataMap] Add test for querying on a longstring bloom index column. Filtering on a longstring bloom index column is already supported by PR apache#2403; here we only add a test for it. This closes apache#2416

[CARBONDATA-2689] Added validations for complex columns in alter set statements. Issue: alter set statements were not validating complex datatype columns correctly. Fix: added a recursive method to validate string and varchar child columns of complex datatype columns. This closes apache#2450

[CARBONDATA-2681][32K] Fix loading failure when using global/batch sort on a table with long string columns. In SortStepRowHandler, global/batch sort uses convertRawRowTo3Parts instead of convertIntermediateSortTempRowTo3Parted. varcharDimCnt was not added to noDictArray, causing the error "Problem while converting row to 3 parts". This closes apache#2435

[CARBONDATA-2658][DataLoad] Fix bugs in spilling in-memory pages. The parameter carbon.load.sortMemory.spill.percentage accepts values in the range 0-100; according to this configuration, in-memory pages are merged and spilled to disk. This closes apache#2414

[CARBONDATA-2666] Updated the rename command so that the table directory is not renamed. Rename no longer renames the table folder; it only changes metadata. This closes apache#2420

[CARBONDATA-2637][BloomDataMap] Fix bugs in deferred rebuild for the bloomfilter datamap. When ISSUE-2633 was implemented, deferred rebuild was disabled for the bloomfilter datamap due to unhandled bugs. In this commit, we fix those bugs and bring the feature back.
Since the bloomfilter datamap creates its index on carbon native raw bytes, we have to convert the original literal value to carbon native bytes both in loading and in querying. This closes apache#2425

[CARBONDATA-2701] Refactor code to store minimal required info in block and blocklet cache. 1. Refactored code to keep only minimal information in the block and blocklet cache. 2. Introduced a segment properties holder at JVM level to hold the segment properties. As it is a heavy object, a new segment properties object is created only when the schema or cardinality of a table changes. This closes apache#2454

[CARBONDATA-2589][CARBONDATA-2590][CARBONDATA-2602] Local dictionary query support. Supported non-filter queries on local dictionary. Supported filter queries on local dictionary. Supported queries on complex columns for primitive-type local dictionary columns. Local dictionary support on varchar columns. Supported vector reader on local dictionary. This closes apache#2447

[CARBONDATA-2585][CARBONDATA-2586] Fix local dictionary support for preaggregate and set local dictionary info in column schema. This PR fixes local dictionary support for preaggregate tables and sets the dictionary info of each column in the column schema read and write paths for backward compatibility.
This closes apache#2451

[CARBONDATA-2711] carbonFileList is not initialized when updatetablelist is called. Bug fix: carbon is not initialized within the updatetablelist method when we execute 'SELECT table_name FROM information_schema.tables WHERE table_schema = 'tmp_sbu_vadmdb' from the command line. This closes apache#2468

[CARBONDATA-2685][DataMap] Parallelize datamap rebuild processing for segments. Currently in carbondata, while rebuilding a datamap, one spark job is started for each segment and all the jobs are executed serially. If there are many historical segments, the rebuild takes a lot of time. Here we optimize the datamap rebuild procedure to start one task for each segment, so all tasks can run in parallel within one spark job. This closes apache#2443

[CARBONDATA-2706][BloomDataMap] Clear bloom index files after a segment is deleted. Bloom index files are cleared after the corresponding segment is deleted and cleaned. This closes apache#2461

[CARBONDATA-2715][LuceneDataMap] Fix bug in search mode with lucene datamap on Windows. When comparing two paths, the file separator differs on Windows, causing empty pruned blocklets. This PR ignores the file separator. This closes apache#2470

[CARBONDATA-2703][Tests] Clean up env after tests. 1. Reset session parameters after tests. 2. Clean up output after tests. This closes apache#2458

[CARBONDATA-2607][Complex Column Enhancements] Complex primitive datatype adaptive encoding. This PR stores complex types more effectively so that reading becomes more efficient. The changes are: primitive types inside complex types are stored as separate pages. Previously there was a single byte-array column page for a complex column; now all sub-levels inside the complex data types are stored as separate pages with their respective datatypes. No-dictionary primitive datatypes inside complex columns are processed through adaptive encoding; previously only snappy compression was applied.
For all primitive datatypes inside a complex column, if the column is no-dictionary, only the value is saved, except for String and Varchar which are saved as byte arrays. Previously all sub-levels were saved in length-and-value format inside a single byte array. Currently only Struct and Array type column pages are saved as byte arrays; all other primitives except String and Varchar are saved at their respective fixed datatype length. Added support for the safe and unsafe fixed-length column pages to back a growing dynamic-array implementation; this is done to support the Array datatype. Co-authored-by: sounakr <sounakr@gmail.com> This closes apache#2417
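The layout change above (one length-value byte array for all complex children vs. per-datatype fixed-width pages) can be sketched with a toy encoding comparison. This is illustrative only, under assumed names (`encode_lv`, `encode_fixed_ints`); it is not CarbonData's actual page implementation:

```python
import struct

def encode_lv(values):
    # Old layout: every child value packed as [4-byte length][value bytes]
    # inside one byte-array page, regardless of datatype.
    out = bytearray()
    for v in values:
        out += struct.pack(">i", len(v)) + v
    return bytes(out)

def encode_fixed_ints(values):
    # New layout for fixed-width primitives (e.g. int children):
    # just the values at their fixed datatype length, no length prefixes.
    return b"".join(struct.pack(">i", v) for v in values)

ints = [10, 20, 30, 40]
lv_page = encode_lv([struct.pack(">i", v) for v in ints])
fixed_page = encode_fixed_ints(ints)

print(len(lv_page), len(fixed_page))  # 32 16 -> length prefixes double the page size here
```

Besides the size saving, a fixed-width page lets a reader seek to row i at offset `i * 4` directly, which is what makes the per-datatype pages cheaper to decode than a single length-value blob.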
[CARBONDATA-2587][CARBONDATA-2588] Local dictionary data loading support. What changes are proposed in this PR: added code to support local dictionary data loading for primitive types, and added code to support local dictionary data loading for complex types. How this PR is tested: manual testing is done in a 3-node setup; UTs will be raised in a different PR. This closes apache#2402

[CARBONDATA-2647] [CARBONDATA-2648] Add support for COLUMN_META_CACHE and CACHE_LEVEL in create table and alter table properties. Things done as part of this PR: support for configuring COLUMN_META_CACHE in the create and alter table set properties DDL; support for configuring CACHE_LEVEL in the create and alter table set properties DDL; describe formatted display support for COLUMN_META_CACHE and CACHE_LEVEL. Any interfaces changed? Create table syntax: CREATE TABLE [dbName].tableName (col1 String, col2 String, col3 int, …) STORED BY 'carbondata' TBLPROPERTIES ('COLUMN_META_CACHE'='col1,col2,…', 'CACHE_LEVEL'='BLOCKLET'). Alter table set properties syntax: ALTER TABLE [dbName].tableName SET TBLPROPERTIES ('COLUMN_META_CACHE'='col1,col2,…', 'CACHE_LEVEL'='BLOCKLET'). This closes apache#2418

[CARBONDATA-2549] Bloom: remove guava cache and use CarbonCache. Currently the bloom cache is implemented using a guava cache, but carbon has its own LRU cache interfaces so that the system controls the cache as a whole instead of per feature. So replace the guava cache with the carbon LRU cache. This closes apache#2327

[CARBONDATA-2608] Document update about the JSON writer, with examples. This closes apache#2409

[CARBONDATA-2634][BloomDataMap] Add datamap properties in show datamap outputs. This closes apache#2404

[CARBONDATA-2647] [CARBONDATA-2648] Fix cache level display in the describe formatted command. 1. Correct the CACHE_LEVEL display in describe formatted: it always displayed BLOCK even when the value was configured as BLOCKLET. 2.
Correct the method arguments to pass dbName first and then tableName. 3. Added a test case for blocking column_meta_cache and cache_level on child datamaps. This closes apache#2426

[CARBONDATA-2669] Local dictionary store size optimisation and other functional issues. Problems: 1. Local dictionary store size: when all column data is empty and the columns are not present in sort columns, the local dictionary store size was larger than the no-dictionary store size. 2. Page-level dictionary merging: while merging the dictionary values used by the pages in a blocklet, some dictionary values were missed because an AND operation was done on the bitset. 3. Local dictionary null values: the null value was not added in LV format, so new dictionary values were generated for null values. 4. The local dictionary generator was thread specific. Solution: added RLE for unsorted dictionary values to reduce the size; an OR operation is now performed while merging the dictionary values; added LV for null values; made the local dictionary generator task specific. This closes apache#2427

[CARBONDATA-2585][CARBONDATA-2586][Local Dictionary] Local dictionary support for alter table, preaggregate, varchar datatype, alter set and unset. What changes were proposed in this pull request? In this PR, local dictionary support is added for alter table, preaggregate, the varchar datatype, and the alter table set and unset commands. UTs are added for local dictionary load support. All validations related to these features are taken care of in this PR. How was this patch tested? All tests were executed on a 3-node cluster.
UTs and SDV test cases are added in the same PR. This closes apache#2401

[HOTFIX] Fixed compilation issues and a bloom clear issue. Fixed tests. This closes apache#2428

[CARBONDATA-2635][BloomDataMap] Support different index datamaps on the same column. A user can create index datamaps from different providers on one column, for example a bloomfilter datamap and a lucene datamap, but cannot create two bloomfilter datamaps on one column. This closes apache#2405

[CARBONDATA-2646][DataLoad] Change the log level while loading data into a table with the 'sort_column_bounds' property: the 'ERROR' flag is changed to 'WARN' for some expected tasks. This closes apache#2407

[CARBONDATA-2545] Fix some spelling errors in CarbonData. This closes apache#2419

[CARBONDATA-2629] Support SDK carbon reader reading data from HDFS and S3 with a filter function. Previously the SDK carbon reader only supported reading data from local storage with a filter function, and threw an exception when reading from HDFS or S3 with a filter. This PR adds that support. This closes apache#2399

[CARBONDATA-2644][DataLoad] Add an invalid-value check for the carbon.load.sortMemory.spill.percentage parameter. This closes apache#2397

[CARBONDATA-2653][BloomDataMap] Fix incorrect blocklet number in bloomfilter. In the non-deferred rebuild scenario, the last bloomfilter index file has already been written in onBlockletEnd, so there is no need to write it again; otherwise an extra blocklet number is generated in the bloom index file.
This closes apache#2408
This closes apache#2439 [CARBONDATA-2642] Added configurable Lock path property A new property is being exposed which will allow the user to configure the lock path "carbon.lock.path" Refactored code to create a separate implementation for S3CarbonFile. This closes apache#2642 [CARBONDATA-2686] Implement Left join on MV datamap This closes apache#2444 [CARBONDATA-2660][BloomDataMap] Add test for querying on longstring bloom index column Filtering on longstring bloom index column is already supported in PR apache#2403, here we only add test for it. This closes apache#2416 [CARBONDATA-2689] Added validations for complex columns in alter set statements Issue: Alter set statements were not validating complex dataType columns correctly. Fix: Added a recursive method to validate string and varchar child columns of complex dataType columns. This closes apache#2450 [CARBONDATA-2681][32K] Fix loading problem using global/batch sort fails when table has long string columns In SortStepRowHandler, global/batch sort use convertRawRowTo3Parts instead of convertIntermediateSortTempRowTo3Parted. varcharDimCnt was not add up to noDictArray cause error: Problem while converting row to 3 parts. This closes apache#2435 [CARBONDATA-2658][DataLoad] Fix bugs in spilling in-memory pages the parameter carbon.load.sortMemory.spill.percentage configured the value range 0-100,according to configuration merge and spill in-memory pages to disk This closes apache#2414 [CARBONDATA-2666] updated rename command so that table directory is not renamed rename will not rename table folder but only changes metadata This closes apache#2420 [CARBONDATA-2637][BloomDataMap] Fix bugs for deferred rebuild for bloomfilter datamap Previously when we implement ISSUE-2633, deferred rebuild for bloom datamap is disabled for bloomfilter datamap due to unhandled bugs. In this commit, we fixed the bugs and bring this feature back. 
Since bloomfilter datamap create index on the carbon native raw bytes, we have to convert original literal value to carbon native bytes both in loading and querying. This closes apache#2425 [CARBONDATA-2701] Refactor code to store minimal required info in Block and Blocklet Cache 1. Refactored code to keep only minimal information in block and blocklet cache. 2. Introduced segment properties holder at JVM level to hold the segment properties. As it is heavy object, new segment properties object will be created only when schema or cardinality is changed for a table. This closes apache#2454 [CARBONDATA-2589][CARBONDATA-2590][CARBONDATA-2602]Local dictionary query Support Supported Non filter query for local dictionary Supported Filter query on local dictionary Supported Query on complex column for primitive type local dictionary columns Local Dictionary support on Varchar columns Supported Vector reader on local dictionary [CARBONDATA-2589][CARBONDATA-2590][CARBONDATA-2602]Local dictionary query Support Supported Non filter query for local dictionary Supported Filter query on local dictionary Supported Query on complex column for primitive type local dictionary columns Local Dictionary support on Varchar columns Supported Vector reader on local dictionary This closes apache#2447 [CARBONDATA-2585][CARBONDATA-2586]Fix local dictionary support for preagg and set localdict info in column schema This PR fixes local dictionary support for preaggregate and set the column dict info of each column in column schema read and write for backward compatibility. 
This closes apache#2451 [CARBONDATA-2711] carbonFileList is not initalized when updatetablelist call bug fix: carbon is not initalized within updatetablelist method when we execute 'SELECT table_name FROM information_schema.tables WHERE table_schema = 'tmp_sbu_vadmdb' from command line This closes apache#2468 [CARBONDATA-2685][DataMap] Parallize datamap rebuild processing for segments Currently in carbondata, while rebuilding datamap, one spark job will be started for each segment and all the jobs are executed serailly. If we have many historical segments, the rebuild will takes a lot of time. Here we optimize the procedure for datamap rebuild and start one start for each segments, all the tasks can be done in parallel in one spark job. This closes apache#2443 [CARBONDATA-2706][BloomDataMap] clear bloom index files after segment is deleted clear bloom index files after corresponding segment is deleted and cleaned This closes apache#2461 [CARBONDATA-2715][LuceneDataMap] Fix bug in search mode with lucene datamap in windows While comparing two pathes, the file separator is different in windows, thus causing empty pruned blocklets. This PR will ignore the file separator This closes apache#2470 [CARBONDATA-2703][Tests] Clear up env after tests 1.reset session parameters after test 2.clean up output after test This closes apache#2458 [CARBONDATA-2607][Complex Column Enhancements] Complex Primitive DataType Adaptive Encoding In this PR the improvement was done to save the complex type more effectively so that reading becomes more efficient. The changes are: Primitive types inside complex types are separate pages. Previously it was a single byte array column page for a complex column. Now all sub-levels inside the complex data types are stored as separate pages with their respective datatypes. No Dictionary Primitive DataTypes inside Complex Columns will be processed through Adaptive Encoding. Previously only snappy compression was applied. 
All Primitive datatypes inside complex if it is now dictionary, only value will be saved except String, Varchar which is saved as ByteArray. Previously all sub-levels are saved as Length And Value Format inside a single Byte Array. Currently only Struct And Array type column pages are saved in ByteArray. All other primitive except String and varchar are saved in respective fixed datatype length. Support for the Safe and Unsafe Fixed length Column Page to support growing dynamic array implementation. This is done to support Array datatype. Co-authored-by: sounakr <sounakr@gmail.com> This closes apache#2417
In carbondata:
For a dictionary column, carbon converts the literal value to a dictionary value, then
converts the dictionary value to an MDK value, and finally stores the MDK value as the
internal value in the carbon file.
For other columns, carbon converts the literal value to an internal value using the
field converter.
Since the bloomfilter datamap stores the internal value, during query we
must convert the literal value in the filter to the internal value in order to
match the value stored in the bloomfilter datamap.
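The mismatch described above can be sketched in a toy example. Note that `ToyBloomFilter`, `toInternalValue`, and the surrogate-key mapping below are hypothetical stand-ins invented for illustration, not CarbonData's actual classes: they only show why probing the bloom index with the raw literal bytes misses, while probing with the converted internal value matches.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class BloomConvertSketch {

    // Minimal bloom filter over a 1024-bit set with two hash probes.
    static class ToyBloomFilter {
        private final BitSet bits = new BitSet(1024);

        void put(byte[] value) {
            bits.set(hash(value, 17));
            bits.set(hash(value, 31));
        }

        boolean mightContain(byte[] value) {
            return bits.get(hash(value, 17)) && bits.get(hash(value, 31));
        }

        private int hash(byte[] value, int seed) {
            int h = seed;
            for (byte b : value) {
                h = h * 31 + b;
            }
            return h & 1023; // low 10 bits index into the 1024-bit set
        }
    }

    // Pretend dictionary: literal -> surrogate key.
    static final Map<String, Integer> DICT = new HashMap<>();
    static {
        DICT.put("beijing", 1);
        DICT.put("shanghai", 2);
    }

    // Stand-in for the load-side conversion: the literal is mapped to its
    // dictionary surrogate key, serialized as 4 big-endian bytes
    // (the "internal value" the bloom index is actually built on).
    static byte[] toInternalValue(String literal) {
        int surrogate = DICT.get(literal);
        return new byte[] {
            (byte) (surrogate >>> 24), (byte) (surrogate >>> 16),
            (byte) (surrogate >>> 8), (byte) surrogate };
    }

    public static void main(String[] args) {
        ToyBloomFilter bloom = new ToyBloomFilter();
        // Load side: index the internal value, not the literal bytes.
        bloom.put(toInternalValue("beijing"));

        // Query side: probing with the raw literal bytes misses ...
        System.out.println(bloom.mightContain("beijing".getBytes(StandardCharsets.UTF_8)));
        // ... while converting the literal to the internal value first matches.
        System.out.println(bloom.mightContain(toInternalValue("beijing")));
    }
}
```

This is the essence of the fix: the query side must run the filter literal through the same conversion the load side used before hashing it against the bloom index.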
Changes are made:
1. FieldConverters were refactored to extract common value-convert methods.
2. BloomQueryModel was optimized to support converting a literal value to the internal value.
3. Fix bugs for int/float/date/timestamp as bloom index columns.
4. Fix bugs for dictionary/sort columns as bloom index columns.
5. Add tests.
6. Block (deferred) rebuild for bloom datamap (contains bugs that are not fixed in this commit).
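For date columns in particular, the indexed internal value is a number rather than the literal string. The sketch below is a hypothetical stand-in for that field conversion (it assumes days-since-epoch as the internal representation, which is an illustration, not CarbonData's exact encoding): the point is only that a DATE literal must be converted to its numeric form before it can match what the bloom index stored.

```java
import java.time.LocalDate;

public class DateInternalValueSketch {
    // Hypothetical stand-in for a date field converter: the DATE literal
    // is indexed as an integer internal value (here: days since epoch),
    // not as its string bytes.
    static int toInternalDays(String literal) {
        return (int) LocalDate.parse(literal).toEpochDay();
    }

    public static void main(String[] args) {
        System.out.println(toInternalDays("1970-01-01")); // 0
        System.out.println(toInternalDays("1970-01-02")); // 1
    }
}
```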
Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:
Any interfaces changed?
No
Any backward compatibility impacted?
Yes, because the encoding of the index value has changed. Besides, deferred rebuild of the bloom datamap has been blocked in this PR and will be done later.
Document update required?
No
Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests are required?
Added tests
- How it is tested? Please attach test report.
Tested in local machine
- Is it a performance related change? Please attach the performance test report.
Query performance with bloomfilter may decrease, because it involves an extra value-conversion procedure
- Any additional information to help reviewers in testing this change.
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.