
WIP: test #2689

Closed

Conversation

xuchuanyin (Contributor)

  1. Add a zstd compressor for compressing column data (a minimal sketch follows this list).
  2. Add zstd support in thrift.
  3. Since zstd does not support zero-copy while compressing, off-heap memory will not take effect for zstd.
  4. The column compressor is configured through a system property and can be changed for each load. Before a load starts, CarbonData resolves the compressor and uses it throughout that load; during querying, CarbonData reads the compressor information from the metadata in the data files (see the configuration sketch after the size table below).
  5. Also support compressing streaming tables with zstd; the compressor info is stored in the FileHeader of the streaming file.
  6. This PR was also considered and verified against the legacy store and compaction.
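
For context, a minimal sketch of what a zstd-backed compressor can look like on top of the zstd-jni library; the class and method names here are illustrative assumptions, not necessarily the interface this PR adds:

```java
import com.github.luben.zstd.Zstd;

// Minimal sketch of a zstd-backed column compressor using zstd-jni
// (https://github.com/luben/zstd-jni). Names are illustrative, not
// CarbonData's actual Compressor interface.
public final class ZstdColumnCompressor {

  private static final int COMPRESS_LEVEL = 3; // zstd's default level

  public byte[] compressByte(byte[] unCompInput) {
    // zstd-jni compresses heap byte arrays; there is no zero-copy
    // (off-heap) path, which is why point 3 above rules out off-heap.
    return Zstd.compress(unCompInput, COMPRESS_LEVEL);
  }

  public byte[] unCompressByte(byte[] compInput) {
    // The zstd frame header records the original size.
    int originalSize = (int) Zstd.decompressedSize(compInput);
    return Zstd.decompress(compInput, originalSize);
  }
}
```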

A simple test with 1.2 GB of raw CSV data shows the size (in MB) of the final store with each compressor:

| Local dictionary | snappy (MB) | zstd (MB) | Size reduced |
| ---------------- | ----------- | --------- | ------------ |
| enabled          | 335         | 207       | 38.2%        |
| disabled         | 375         | 225       | 40%          |
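
To illustrate point 4, a hedged configuration sketch; the property key `carbon.column.compressor` is an assumption here and should be verified against `CarbonCommonConstants`:

```java
import org.apache.carbondata.core.util.CarbonProperties;

public final class CompressorConfigExample {
  public static void main(String[] args) {
    // Assumed property key; verify the exact constant in CarbonCommonConstants.
    CarbonProperties.getInstance().addProperty("carbon.column.compressor", "zstd");
    // Loads started after this point compress column pages with zstd.
    // Queries do not consult this property: they read the compressor
    // name from the metadata stored in each data file.
  }
}
```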

Be sure to complete all of the following checklist items to help us incorporate
your contribution quickly and easily:

  • Any interfaces changed?
Yes, only internally used interfaces are changed

  • Any backward compatibility impacted?
    Yes, backward compatibility is handled

  • Document update required?
    Yes

  • Testing done
    Please provide details on
    - Whether new unit test cases have been added or why no new tests are required?
    Added tests
    - How it is tested? Please attach test report.
Tested on a local machine
    - Is it a performance related change? Please attach the performance test report.
The size of the final store decreased by about 40% compared with the default snappy compressor
    - Any additional information to help reviewers in testing this change.
    NA

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
    NA


xuchuanyin commented Sep 4, 2018

This PR is a replacement for PR #2628 with no changes; the CI for the original PR had problems.

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8282/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/211/

@xuchuanyin xuchuanyin changed the title [CARBONDATA-2851][CARBONDATA-2852] Support zstd as column compressor in final store WIP:[CARBONDATA-2851][CARBONDATA-2852] Support zstd as column compressor in final store Sep 4, 2018
@xuchuanyin (Contributor, Author)

retest this please

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/231/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8301/

@xuchuanyin xuchuanyin changed the title WIP:[CARBONDATA-2851][CARBONDATA-2852] Support zstd as column compressor in final store WIP: test Sep 5, 2018
@xuchuanyin xuchuanyin force-pushed the 0813_read_compressor_from_datafiles branch from 10ccff8 to 343a57c on September 5, 2018 07:54
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8323/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/253/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/5/

1. Add a zstd compressor for compressing column data
2. Add zstd support in thrift
3. The legacy store is not considered in this commit
4. Since zstd does not support zero-copy while compressing, off-heap memory
   will not take effect for zstd
5. Support lazy loading of the compressor

During the query procedure we need to decompress the column pages. Previously
we got the compressor from a system property; now that we support new
compressors, we read the compressor information from the metadata in the data
files (a hedged lookup sketch follows this message). This PR also solves the
compatibility problems on the V1/V2 store, where only snappy is supported.

We resolve the column compressor before data loading/compaction starts, so
that all pages use the same compressor even if the configured compressor is
changed concurrently during the load. The column compressor is mandatory in
the carbon load model; otherwise the load will fail.

Also optimize the parameters for the column page: pass columnPageEncodeMeta
instead of its individual members.
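
A hedged sketch of the name-based, lazy compressor lookup described above; the registry shape below is an illustrative assumption, not CarbonData's actual factory:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a name-keyed compressor registry. At query time the compressor
// name comes from the data file's metadata rather than from the system
// property that was consulted at load time.
public final class CompressorRegistry {

  public interface Compressor {
    byte[] compressByte(byte[] input);
    byte[] unCompressByte(byte[] input);
  }

  private static final Map<String, Compressor> COMPRESSORS = new ConcurrentHashMap<>();

  public static void register(String name, Compressor compressor) {
    COMPRESSORS.put(name, compressor);
  }

  public static Compressor getCompressor(String name) {
    Compressor compressor = COMPRESSORS.get(name);
    if (compressor == null) {
      throw new IllegalArgumentException("Unsupported compressor: " + name);
    }
    return compressor;
  }
}
```

Resolving the compressor once before a load or compaction starts, and reusing that instance for every page, is what keeps a whole load on one compressor even if the property is changed concurrently.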
@xuchuanyin xuchuanyin force-pushed the 0813_read_compressor_from_datafiles branch from 67cccb1 to 81dd2b5 Compare September 6, 2018 01:35
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/103/

@CarbonDataQA

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.3/8341/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/271/

@xuchuanyin xuchuanyin closed this Sep 11, 2018