Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve BenchmarkParquetReader #6275

Closed
wants to merge 4 commits into from

Conversation

yiweiHeOSS
Copy link
Contributor

This PR is to fix #6247.

So far the Benchmark test has 4 types, this PR added more types to the BenchmarkParquetReader test:

  1. ShortDecimalType
  2. LongDecimalType
  3. VARCHAR

We should test them with different filter rates and null rates just like the previous tests in BenchmarkParquetReader.

Also, I noticed we have never done the filter test for the type HUGEINT (int128_t) before, which is the actual type of LongDecimalType. So this PR also implemented the code to generate the filter of the type HUGEINT (int128_t) and modify the code to generate the data of the type HUGEINT (int128_t) correctly.

While developing the code, we found there is another problem/enhancement we need to do, so I created an issue for it #6248 This could be the next step.

@netlify
Copy link

netlify bot commented Aug 25, 2023

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit 0b772e0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/652f37cb95e68d00082f03c0
😎 Deploy Preview https://deploy-preview-6275--meta-velox.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 25, 2023
@yingsu00 yingsu00 assigned yiweiHeOSS and unassigned yiweiHeOSS Aug 25, 2023
@yiweiHeOSS
Copy link
Contributor Author

My change to the below code cuase failures in another testcase:

return HugeInt::build(Random::rand32(gen), Random::rand32(gen));
to        return HugeInt::build(Random::rand64(gen), Random::rand64(gen));

[ FAILED ] E2EFilterTest.longDecimalDictionary
[ FAILED ] E2EFilterTest.longDecimalDirect

So I modify the code to add a temp to see the velue in the debugger:

class HugeInt {
 public:
  static constexpr FOLLY_ALWAYS_INLINE int128_t
  build(uint64_t hi, uint64_t lo) {
    // GCC does not allow left shift negative value.
    int128_t temp = (static_cast<__uint128_t>(hi) << 64) | lo;
    return temp;
  }

lo: 10936052917074306677
hi: 15573954114018521923
temp: -52993621163942803648006051651297827211

The error message later in the test:

[ RUN      ] E2EFilterTest.longDecimalDictionary
/root/velox/velox/dwio/common/tests/E2EFilterTestBase.cpp:122: Failure
Value of: resultBatch->equalValueAt(batches[batchIndex].get(), i, rowIndex)
  Actual: false
Expected: true
Content mismatch at resultBatch 0 at index 0: expected: {-5299362116394280364800605165.1297827211, {-10962914699381264738035368224.7093712783}} actual: {-502284656663603218480.5461940619, {-524318847941508198011.4504906639}}
Google Test trace:
/root/velox/velox/dwio/common/tests/E2EFilterTestBase.cpp:322: No row group skip
/root/velox/velox/dwio/common/tests/E2EFilterTestBase.cpp:235: Failure
Value of: result->equalValueAt(expectedColumn, i, expectedRow)
  Actual: false
Expected: true
Content mismatch at 0 column 0: expected: -5299362116394280364800605165.1297827211 actual: -502284656663603218480.5461940619
Google Test trace:
/root/velox/velox/dwio/common/tests/E2EFilterTestBase.cpp:322: No row group skip
/root/velox/velox/dwio/common/tests/E2EFilterTestBase.cpp:235: Failure
Value of: result->equalValueAt(expectedColumn, i, expectedRow)
  Actual: false
Expected: true

So the expected value looks good, but the actual value is corrupted. Maybe there is a problem in the code of E2EFilterTest related to write/read.

@yiweiHeOSS
Copy link
Contributor Author

expected:-52993621163942803648006051651297827211:
-100111110111100011001100111111011001011011000101001100101111000110100000111011010101011101000001001000111010010101100110001011
actual:-5022846566636032184805461940619:
-111111011001011011000101001100101111000110100000111011010101011101000001001000111010010101100110001011
The head bits are cut in actual 100111110111100011001100

@yiweiHeOSS
Copy link
Contributor Author

Created an issue for the issue above #6317

Copy link
Collaborator

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There shouldn't be any commits that just fix the format issues from previous commits. Please merge it with the previous one.

Add a separate commit to rewrite ColumnStats<StringView>::makeRangeFilter and add ColumnStats<StringView>::makeRandomFilter.

Separate the decimal benchmark and varchar benchmark into two commits.

Can you please add the benchmark result as a comment block at the end of the benchmark? Be sure to use release build, and include your hardware spec.

velox/dwio/common/tests/utils/BatchMaker.cpp Outdated Show resolved Hide resolved
velox/dwio/common/tests/utils/FilterGenerator.cpp Outdated Show resolved Hide resolved
velox/dwio/common/tests/utils/FilterGenerator.h Outdated Show resolved Hide resolved
velox/dwio/common/tests/utils/FilterGenerator.h Outdated Show resolved Hide resolved
@yingsu00 yingsu00 changed the title Add tests for type (Long/short)Decimal and VARCHAR into BenchmarkParquetReader. Add the filter test code for HUGEINT (int128_t). Improve BenchmarkParquetReader Oct 13, 2023
Copy link
Collaborator

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge the last commit "Change the callers of original makeRangeFilter()." into the first one.
Many of the commit message lines are too long. Please take a look at https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53 and update them accordingly.

@yiweiHeOSS
Copy link
Contributor Author

Simplified/shortened the commit messages and made the change according to the comments.

Copy link
Collaborator

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge "Merge branch 'main' into addTypes" into the proper commit of the 4 previous commits? It's not good to have a commit that just merge branch.

@yingsu00 yingsu00 requested a review from Yuhta October 19, 2023 23:09
@czentgr
Copy link
Collaborator

czentgr commented Nov 2, 2023

@yiweiHeOSS please rebase to the lastest. Looks like this is over 2 weeks old already. Any reason the 4 commits were not squashed? Best to squash and provide a comprehensive description for the final commit (as found in this PR description). Please let me know if you need help with that.
@Yuhta Would you please be able to review this PR? Thank you!

@yiweiHeOSS
Copy link
Contributor Author

@czentgr I intentionally made the 4 separate commits since they are required by @yingsu00.

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@yiweiHeOSS
Copy link
Contributor Author

Hi @Yuhta , just a reminder, the PR #6342 also needs to merge. It is a dependency of this PR, not sure if it is merged already.

@facebook-github-bot
Copy link
Contributor

@Yuhta merged this pull request in 060232f.

Copy link

Conbench analyzed the 1 benchmark run on commit 060232f8.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add tests for type (Long/short)Decimal and VARCHAR into BenchmarkParquetReader
4 participants