PARQUET-2242:record count for row group size check configurable #1024

Closed
wants to merge 1 commit into from

Conversation

@xjlem xjlem commented Feb 9, 2023

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

Member

@wgtmac wgtmac left a comment

Thanks for the fix @xjlem!

Could you please change the title to start with PARQUET-2242: ? Then this PR will be linked to the JIRA automatically.

Additionally, it would be better to add some test cases that cover different size-check configs. For example, what happens if conflicting configs are set for the page size check and the row group size check?

@@ -95,6 +98,8 @@ private ParquetProperties(WriterVersion writerVersion, int pageSize, int dictPag
this.enableDictionary = enableDict;
this.minRowCountForPageSizeCheck = minRowCountForPageSizeCheck;
this.maxRowCountForPageSizeCheck = maxRowCountForPageSizeCheck;
this.minRowCountForBlockSizeCheck =minRowCountForBlockSizeCheck;
Member

Suggested change
- this.minRowCountForBlockSizeCheck =minRowCountForBlockSizeCheck;
+ this.minRowCountForBlockSizeCheck = minRowCountForBlockSizeCheck;

@@ -142,6 +142,8 @@ public static enum JobSummaryLevel {
public static final String MAX_PADDING_BYTES = "parquet.writer.max-padding";
public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
public static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";
Member

Please add them to README.md in the parquet-hadoop directory.

Author

The apache:parquet-1.10.x branch doesn't have a README.md in the parquet-hadoop directory, so I have added a README.md on this branch.

Member

Why fix this on an old branch rather than on master?

Author

Because we use version 1.10, and I found that this issue has already been fixed in PARQUET-1920. Thanks for your review; it looks like this PR can be closed.
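
For readers following the thread, here is a minimal sketch of how the new key from the diff above might be looked up with a fallback. It uses a plain `Map` as a stand-in for a Hadoop-style configuration so that it runs on its own; the class name, the `getInt` helper, and the fallback value of 100 are assumptions for illustration and are not taken from the PR.

```java
import java.util.HashMap;
import java.util.Map;

final class BlockSizeCheckConfigSketch {
    // Key string copied from the diff in this PR.
    static final String MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = "parquet.block.size.row.check.min";

    // Assumed fallback, for illustration only; the PR text does not state the default.
    static final int ASSUMED_DEFAULT_MIN = 100;

    // Hypothetical stand-in for a getInt(key, default) style lookup on a configuration object.
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Key not set: the fallback is used.
        System.out.println(getInt(conf, MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK, ASSUMED_DEFAULT_MIN));

        // Key set explicitly: the configured value is used.
        conf.put(MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK, "1000");
        System.out.println(getInt(conf, MIN_ROW_COUNT_FOR_BLOCK_SIZE_CHECK, ASSUMED_DEFAULT_MIN));
    }
}
```

In a real job the value would presumably be set on the job's configuration object instead of a map; the lookup-with-default pattern is the only point of the sketch.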

@@ -147,12 +152,12 @@ private void checkBlockSizeReached() throws IOException {
LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
flushRowGroupToStore();
initStore();
- recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
+ recordCountForNextMemCheck = min(max(minRowCountForBlockSizeCheck, recordCount / 2), maxRowCountForBlockSizeCheck);
Member

The comment on line 146 says that checking the memory size is relatively expensive. Does this change affect the default behavior and introduce a regression in write time or file size for common cases? For example, does it check more frequently than before?

Author

Yes, it works like the configs 'parquet.page.size.row.check.min' and 'parquet.page.size.row.check.max'.
The row group check algorithm is like the page check algorithm with the config 'parquet.page.size.check.estimate' set to true.
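
To make the effect of the two bounds concrete, the clamping expression from the changed line can be sketched in isolation. The class and method names below are hypothetical, and the bounds of 100 and 10,000 are assumed values chosen for illustration, not defaults confirmed by this PR. The sketch only evaluates `min(max(min, recordCount / 2), max)` for a given record count and makes no claim about how the writer maintains `recordCount`.

```java
final class BlockSizeCheckSketch {
    // Hypothetical helper mirroring the expression in the diff:
    // min(max(minRowCountForBlockSizeCheck, recordCount / 2), maxRowCountForBlockSizeCheck)
    static long nextCheck(long minRowCount, long maxRowCount, long recordCount) {
        return Math.min(Math.max(minRowCount, recordCount / 2), maxRowCount);
    }

    public static void main(String[] args) {
        // Assumed bounds for illustration only.
        long min = 100;
        long max = 10_000;

        // Small, mid-range, and large record counts hit the lower bound,
        // the halved count, and the upper bound respectively.
        for (long recordCount : new long[] {50, 5_000, 1_000_000}) {
            System.out.println(recordCount + " -> " + nextCheck(min, max, recordCount));
        }
    }
}
```

With these assumed bounds, record counts of 50, 5,000, and 1,000,000 yield 100, 2,500, and 10,000. A smaller lower bound therefore allows the next check to be scheduled sooner, which is the cost the review question above is concerned with.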

@xjlem xjlem changed the title from "Parquet 2242:record count for row group size check configurable" to "PARQUET-2242:record count for row group size check configurable" on Feb 10, 2023