Skip to content

PARQUET-1644: Clean up some benchmark code and docs.#672

Merged
gszadovszky merged 9 commits intoapache:masterfrom
RyanSkraba:PARQUET-1644-benchmark-cleanup
Sep 24, 2019
Merged

PARQUET-1644: Clean up some benchmark code and docs.#672
gszadovszky merged 9 commits intoapache:masterfrom
RyanSkraba:PARQUET-1644-benchmark-cleanup

Conversation

@RyanSkraba
Copy link
Copy Markdown
Contributor

@RyanSkraba RyanSkraba commented Aug 29, 2019

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason: The benchmarking module is used for test purposes

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does


echo "Starting WRITE benchmarks"
java -jar ${SCRIPT_PATH}/target/parquet-benchmarks.jar p*Write* "$@"
java -jar ${SCRIPT_PATH}/target/parquet-benchmarks.jar org.apache.parquet.benchmarks.WriteBenchmarks "$@"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, NestedNullWritingBenchmarks won't be executed in this benchmark suite, but I think that's fine. @gszadovszky do you agree?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. When this script was created only ReadBenchmarks and WriteBenchmarks were exist. However, it might make sense to have a more descriptive name (e.g. runReadWrite.sh).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I combined the two existing run scripts into one with some very limited functionality to run predefined "suites" by keyword, but also to run all benchmarks.

What do you think?

@RyanSkraba
Copy link
Copy Markdown
Contributor Author

Thanks for taking a look! I'm going to make the run.sh script a bit more descriptive to run "feature sets" so that there's a standard way to run all or some benchmarks. I don't want to go crazy with bash, but I think there's some value to have one entry point to running benchmarks in a standard way.

JMH isn't behaving as I would like for setting up and cleaning resources... I have a couple more fixes to make and I'll push.

Ryan Skraba added 3 commits September 5, 2019 18:09
@RyanSkraba
Copy link
Copy Markdown
Contributor Author

For info, I made a change around general benchmarks setup/cleanup.

Before, the read files for ReadBenchmark were generated and handled differently than the read files for PageChecksumReadBenchmarks. I tried to make them consistent across benchmarks using the following logic:

  • When reading, ensure that any necessary file exists during @Setup but don't do any cleanup during the benchmark run.
  • When writing, ensure that the output file doesn't exist during @Setup
  • All cleanup needs to be done outside of the process running the benchmark.

As far as I can tell, this is the preferred way to use JMH when running "macro-benchmarks".

I didn't observe any actual functional differences after changing the setup strategy -- the total user time to run the PageChecksumReadBenchmarks went down, for example, but the measured operation time stayed the same.

Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I have some minor issues, but like it overall.

Copy link
Copy Markdown
Contributor

@nandorKollar nandorKollar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please address Gabor's comments.

@RyanSkraba
Copy link
Copy Markdown
Contributor Author

Hello! I git merged master into the PR instead of rebasing -- If I understand correctly, these'll be squashed at merge time.

Thanks for the review, and my apologies for the late fixes!

@gszadovszky gszadovszky merged commit 7c4d1ec into apache:master Sep 24, 2019
@RyanSkraba RyanSkraba deleted the PARQUET-1644-benchmark-cleanup branch September 24, 2019 12:29
@RyanSkraba RyanSkraba restored the PARQUET-1644-benchmark-cleanup branch September 24, 2019 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants