Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI #1011

Merged
merged 25 commits into from Mar 4, 2023

Conversation

jiangjiguang
Copy link
Contributor

The PR includes 3 aspects:

  1. Use java17 vector api to decode bit-packing , the performance gain is 4x ~ 8x according to the microbenchmark
  2. Upgrade the project to java17 to support java vector api
  3. Add ParquetReadRouter to compatible with different platform when computing engines(such as spark) read parquet.

Jira

Tests

  • Add unit tests org.apache.parquet.column.values.bitpacking.TestByteBitPackingVectorLE

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • The PR adds maven profile vector to enable generate java17 vector bit-packing decode opt. code, and run junit tests: mvn clean install -P vector
  • The PR needs Intel Ice Lake CPU to run junit tests
  • The contributors are jiangjiguang jiyu1021 guangzegu Fang-Xie , and co-contributor is jatin-bhateja

@jiangjiguang
Copy link
Contributor Author

@wangyum @gszadovszky This PR always fails to build, I do not know why. Is the reason of failure "1 workflow awaiting approval" ? please help me, this is my first PR to parquet-mr community, thanks!

@jiangjiguang
Copy link
Contributor Author

@wangyum Could you review the PR ? thanks

@gszadovszky
Copy link
Contributor

@jiangjiguang, Apache introduced a process to require approvals for test executions to keep resources under control. I've approved it so the tests are running. Not sure though when I will have time to properly review it.

@jiangjiguang
Copy link
Contributor Author

@gszadovszky I resubmitted the code, can you approve the workflow again ?

@jiangjiguang
Copy link
Contributor Author

@gszadovszky Could you review the PR ? thanks

@gszadovszky
Copy link
Contributor

@jiangjiguang, however I agree on upgrading to newer java versions (and to have performance benefits like this one) it is not always an easy thing to do. For example our main historical user Hadoop is still on java 1.8 which means they would not be able to upgrade parquet after releasing this one. I think it worth a discussion on the dev list and maybe even a formal vote on it.

@jiangjiguang
Copy link
Contributor Author

@gszadovszky I agree on it. As far as I know some products(such as hadoop presto flink ) are working towards upgrading to java17. Spark already supports java17, trino requires a minimum java version is java17.0.3, So I think parquet should support java17 as soon as possible and be compatible with java8, because of java17 vector can bring 4x ~ 8x performance gain for parquet encode/decode. The PR uses "maven profile -P vector" and "code gen" to support java17 and is compatible with java8

@gszadovszky
Copy link
Contributor

@jiangjiguang,

What do you mean by this change is compatible with java8? The vectorized encoding/decoding requires java17 runtime, right? If yes, we will need to have a way release artifacts for java17 and keep java8 compatibility with others. We need to have an agreement (at community level) how we want to achieve this.

By running through the change related to the compile it seems that if you have a jdk17 in the compile environment it will be compiled for a java17 runtime automatically. Since we do not have a specific environment for making a release (shame on us) it highly depends on the environment of the dev how creates the release. So, we shall either add a release configuration for 8 to the jdk7 profile (if it makes sense) or do not have the jdk17 profile activated automatically. Otherwise we might release artifacts that are not compatible < java17.

@jiangjiguang
Copy link
Contributor Author

@gszadovszky That what I said supports java17 and is compatible with java8 depends on compile environment. Can you start a discussion on how to support java17 ?

@gszadovszky
Copy link
Contributor

@jiangjiguang,

With you current implementation if parquet-mr is compiled with jdk17 than it will not be compatible with java8 while any jdk under 17 will generate a code compatible with java8 since release is set to 8. My problem is this exact behaviour. I would suggest keep generating java8 compatible artifacts with using any versions of jdks (including 17) by default and let it compile for java17 (and the vectorized stuff) when specific profile is activated manually.

About starting the discussion about java17 compatibility. I am currently not really active in the community and not even following the dev list. Maybe, @shangxinli or @ggershinsky would like to drive this topic?

@shangxinli
Copy link
Contributor

shangxinli commented Dec 3, 2022

Thank @gszadovszky a lot for helping with this PR!

+1 for what @gszadovszky said. The mainstream runtime JDK is still 1.8. Parquet is one of the underlying building blocks for many big data applications. The bare minimum, for now, is to keep java8 compatible. Otherwise forcing applications to upgrade to jdk17 because of Parquet is disruptive and impacts adoptions.

@jiangjiguang, I am very happy to see you have this PR to help the Parquet community. Would you mind starting an email discussion to dev@parquet.apache.org for this topic?

cc @ggershinsky @wgtmac

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This work looks promising! It would be great if you can add some micro-benchmark to parquet-benchmarks.


/**
* This class generates vector bit packers that pack the most significant bit first.
* The result of the generation is checked in. To regenerate the code run this class and check in the result.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is used to generate ByteBitPackingVectorLE class, which unpackValues/packValues with java17 vector api.
If use java17 vector api directly, errors will occur when compiling or developing with java8(<java17).

Co-authored-by: Fang-Xie <fang.xie@intel.com>
Co-authored-by: guangzegu <guangze.gu@intel.com>
Co-authored-by: jiyu1021 <yu.ji@intel.com>
Co-authored-by: jatin-bhateja <jatin.bhateja@intel.com>
@jiangjiguang
Copy link
Contributor Author

This work looks promising! It would be great if you can add some micro-benchmark to parquet-benchmarks.

@wgtmac I have add the micro-benchmark to parquet-benchmarks, this is the result:

image

@jiangjiguang
Copy link
Contributor Author

@wgtmac I have resubmitted the PR, can you review it again ?

@jiangjiguang
Copy link
Contributor Author

jiangjiguang commented Jan 23, 2023

Thank @gszadovszky a lot for helping with this PR!

+1 for what @gszadovszky said. The mainstream runtime JDK is still 1.8. Parquet is one of the underlying building blocks for many big data applications. The bare minimum, for now, is to keep java8 compatible. Otherwise forcing applications to upgrade to jdk17 because of Parquet is disruptive and impacts adoptions.

@jiangjiguang, I am very happy to see you have this PR to help the Parquet community. Would you mind starting an email discussion to dev@parquet.apache.org for this topic?

cc @ggershinsky @wgtmac

@gszadovszky @shangxinli @wgtmac I have started the discussion about how to upgrade java17 over a month, but nobody involved! So I have updated the PR, it does not involve how to upgrade java17.
The default compilation is java8
Just add maven build parameters -P java17-target -P vector and get the expected jars when people want to use java17 vector to speed up parquet decode

@gszadovszky
Copy link
Contributor

Sorry for the bad experience, @jiangjiguang. Unfortunately, I don't have too much time for the parquet community.
@wgtmac, @jatin-bhateja, since you were the ones who actively reviewed this PR I would like to ask for your approvals. I won't have time to properly review this but I am happy to push it if you agree.

@gszadovszky
Copy link
Contributor

@jiangjiguang, please check test failures.

@jiangjiguang
Copy link
Contributor Author

jiangjiguang commented Jan 28, 2023

@jiangjiguang, please check test failures.

@gszadovszky Thank you for your attention about the PR, I have fixed it and the test successes.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. I have left some comments and the implementation is overall looking good. Thanks @jiangjiguang for your effort!

My main concern is the extensibility to support other instruction sets. In addition, it seems to me that the java vector api is still incubating. As I am not a java expert, do we have the risk of unstable API?

…bitpacking/ParquetReadRouter.java

Co-authored-by: Gang Wu <ustcwg@gmail.com>
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. A minor comment mainly on naming.

pom.xml Outdated
@@ -659,5 +662,13 @@
</plugins>
</build>
</profile>

<profile>
<id>plugins</id>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plugins is a little bit generic to me. Rename it to encoding-plugin or vector-plugins? Any suggestion? @gszadovszky

In addition, please rename plugins folder to parquet-plugins to follow naming of other sub-directories.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac @gszadovszky vector-plugins +1, and it can show the feature of this PR.
I have renamed plugins to parquet-plugins.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. I've added a note for the README, though.

One more thing. We need to add testing of this profile automatically with github actions. (The current test results for this PR have no meaning since these tests are not executed.)

README.md Show resolved Hide resolved
@jiangjiguang
Copy link
Contributor Author

@gszadovszky @wgtmac This feature need avx512vbmi and avx512_vbmi2 instruction set, so it needs github action runners with intel ice lake. I do not know how to select runners with Intel Ice Lake ? So I have submitted the help (actions/runner#2467).

@jiangjiguang
Copy link
Contributor Author

@gszadovszky @wgtmac I have added a new workflow named Vector-plugins, can you run it ?

@wgtmac
Copy link
Member

wgtmac commented Mar 1, 2023

@gszadovszky @wgtmac I have added a new workflow named Vector-plugins, can you run it ?

It seems that this PR is closed. Could you please reopen it and see if it can run automatically?

@jiangjiguang jiangjiguang reopened this Mar 1, 2023
@jiangjiguang
Copy link
Contributor Author

@gszadovszky @wgtmac I have added a new workflow named Vector-plugins, can you run it ?

It seems that this PR is closed. Could you please reopen it and see if it can run automatically?

@wgtmac sorry, may be my wrong click, I have reopened it

@wgtmac
Copy link
Member

wgtmac commented Mar 1, 2023

@wgtmac sorry, may be my wrong click, I have reopened it

NP. Seems it is running. https://github.com/apache/parquet-mr/actions/runs/4300429406/jobs/7496638049

.github/workflows/vector-plugins.yml Outdated Show resolved Hide resolved
.github/workflows/vector-plugins.yml Outdated Show resolved Hide resolved
@jiangjiguang jiangjiguang changed the title PARQUET-2159: java17 vector parquet bit-packing decode optimization PARQUET-2159: Vectorized BytePacker decoder using Java VectorAPI Mar 2, 2023
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @jiangjiguang !

@wgtmac
Copy link
Member

wgtmac commented Mar 2, 2023

I'd request sign off from @gszadovszky @shangxinli

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some comments about testing. @jiangjiguang, do you have any info about selecting the proper CPU environment for these tests?

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me.
Thank you for working on it, @jiangjiguang!

@jatin-bhateja
Copy link
Contributor

jatin-bhateja commented Mar 4, 2023

Hi @gszadovszky , @wgtmac Thanks for your reviews and guidance during this long review process. May we request to you please merge this PR by adding following people as co-authors ( @jiangjiguang @jiyu1021 @guangzegu @Fang-Xie @jatin-bhateja) alternatively make @jiangjiguang a committer so that he can merge on our behalf.

@jiangjiguang
Copy link
Contributor Author

Hi @gszadovszky , @wgtmac Thanks for your reviews and guidance during this long review process. May we request to you please merge this PR by adding following people as co-authors ( @jiangjiguang @jiyu1021 @guangzegu @Fang-Xie @jatin-bhateja) alternatively make @jiangjiguang a committer so that he can merge on our behalf.

@jatin-bhateja I have added them as co-authors on the first commit.
image

@wgtmac wgtmac merged commit d730fa7 into apache:master Mar 4, 2023
@wgtmac
Copy link
Member

wgtmac commented Mar 4, 2023

I have merged it. Thanks @jiangjiguang @jatin-bhateja for the contribution and @gszadovszky for the review!

@LuciferYang
Copy link
Contributor

LuciferYang commented Apr 4, 2023

@jiangjiguang Sorry to bother you, I am not sure if the use of the AVX 512 will still make other CPU cores to downshift frequency. If so, is there a way to manually turn off this feature now? Otherwise, the new version may be difficult to promote and use in the enterprise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
9 participants