New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-869 Configurable min/max record counts for block size check #447

Open
wants to merge 2 commits into
base: master
from

Conversation

Projects
None yet
2 participants
@pradeepg26

pradeepg26 commented Jan 11, 2018

Make min/max record counts for block size check are no longer hard coded inside InternalParquetRecordWriter.

pradeepg26 added some commits Jan 10, 2018

@@ -102,6 +103,8 @@ private void initStore() {
MessageColumnIO columnIO = new ColumnIOFactory(validating).getColumnIO(schema);
this.recordConsumer = columnIO.getRecordWriter(columnStore);
writeSupport.prepareForWrite(recordConsumer);
System.out.println(String.format("Created ParquetWriter with [%d, %d] for block size checks. Estimation(%s). BlockSize(%d)",

This comment has been minimized.

@rdblue

rdblue Jan 31, 2018

Contributor

Please use SLF4J for logging. You can see examples in the rest of the library.

@rdblue

rdblue Jan 31, 2018

Contributor

Please use SLF4J for logging. You can see examples in the rest of the library.

This comment has been minimized.

@pradeepg26

pradeepg26 Jan 31, 2018

Oops... missed removing this statement.

@pradeepg26

pradeepg26 Jan 31, 2018

Oops... missed removing this statement.

@@ -142,13 +145,23 @@ private void checkBlockSizeReached() throws IOException {
LOG.info("mem size {} > {}: flushing {} records to disk.", memSize, nextRowGroupSize, recordCount);
flushRowGroupToStore();
initStore();
recordCountForNextMemCheck = min(max(MINIMUM_RECORD_COUNT_FOR_CHECK, recordCount / 2), MAXIMUM_RECORD_COUNT_FOR_CHECK);
if (estimateNextBlockSizeCheck) {

This comment has been minimized.

@rdblue

rdblue Jan 31, 2018

Contributor

Do you need this setting, or is it included to match the page size checking? I don't think we would ever want to set it to true because it is important to be under the max row group size.

@rdblue

rdblue Jan 31, 2018

Contributor

Do you need this setting, or is it included to match the page size checking? I don't think we would ever want to set it to true because it is important to be under the max row group size.

This comment has been minimized.

@pradeepg26
@pradeepg26

pradeepg26 Jan 31, 2018

Will do.

This comment has been minimized.

@pradeepg26

pradeepg26 Feb 2, 2018

Working on the comments @rdblue... do you think instead of a min/max record count for row group size checks, we should check every n rows?

@pradeepg26

pradeepg26 Feb 2, 2018

Working on the comments @rdblue... do you think instead of a min/max record count for row group size checks, we should check every n rows?

This comment has been minimized.

@pradeepg26

pradeepg26 Feb 2, 2018

I've taken a closer look at this code. The only change with the estimateNextBlockSizeCheck is that it uses a static min number of records before performing the next size check if disabled.

If estimateNextBlockSizeCheck is enabled, then existing behavior is preserved where we try to estimate when to perform the next size check by performing the check at the halfway point of the predicted number of records (unless it's larger than the max).

So, I added the estimateNextBlockSizeCheck to keep it consistent with page size checking.

@pradeepg26

pradeepg26 Feb 2, 2018

I've taken a closer look at this code. The only change with the estimateNextBlockSizeCheck is that it uses a static min number of records before performing the next size check if disabled.

If estimateNextBlockSizeCheck is enabled, then existing behavior is preserved where we try to estimate when to perform the next size check by performing the check at the halfway point of the predicted number of records (unless it's larger than the max).

So, I added the estimateNextBlockSizeCheck to keep it consistent with page size checking.

This comment has been minimized.

@rdblue

rdblue Feb 2, 2018

Contributor

Consistency here is fine. I'm not sure we will use it, but I can certainly think of valid use cases for it.

@rdblue

rdblue Feb 2, 2018

Contributor

Consistency here is fine. I'm not sure we will use it, but I can certainly think of valid use cases for it.

public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.min";
public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size.row.check.max";
public static final String ESTIMATE_PAGE_SIZE_CHECK = "parquet.page.size.check.estimate";
public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = "parquet.page.size_row_check_min";

This comment has been minimized.

@rdblue

rdblue Jan 31, 2018

Contributor

The page properties should not change, and the row group properties should match the naming convention for pages.

@rdblue

rdblue Jan 31, 2018

Contributor

The page properties should not change, and the row group properties should match the naming convention for pages.

This comment has been minimized.

@pradeepg26

pradeepg26 Jan 31, 2018

Oops... it was changed for internal reasons... forgot to change it back.

@pradeepg26

pradeepg26 Jan 31, 2018

Oops... it was changed for internal reasons... forgot to change it back.

public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_BLOCK_SIZE_CHECK = true;
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_PAGE_SIZE_CHECK = 100;
public static final int DEFAULT_MAXIMUM_RECORD_COUNT_FOR_PAGE_SIZE_CHECK = 10000;
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_BLOCK_SIZE_CHECK = 100;

This comment has been minimized.

@rdblue

rdblue Jan 31, 2018

Contributor

Instead of block, please use "row group". Block is an overused term that we are avoiding in the public API, except for those places where it already appears.

@rdblue

rdblue Jan 31, 2018

Contributor

Instead of block, please use "row group". Block is an overused term that we are avoiding in the public API, except for those places where it already appears.

This comment has been minimized.

@pradeepg26
@pradeepg26

pradeepg26 Jan 31, 2018

Will do.

private final ParquetFileWriter parquetFileWriter;
private final WriteSupport<T> writeSupport;
private final MessageType schema;
private final Map<String, String> extraMetaData;
private final long rowGroupSize;
private final boolean estimateNextBlockSizeCheck;

This comment has been minimized.

@rdblue

rdblue Jan 31, 2018

Contributor

Was rowGroupSize not used? Why was it removed?

@rdblue

rdblue Jan 31, 2018

Contributor

Was rowGroupSize not used? Why was it removed?

This comment has been minimized.

@pradeepg26

pradeepg26 Jan 31, 2018

I wrote this code over a year ago... I don't remember, but I'll check. I think it was an unused variable.

@pradeepg26

pradeepg26 Jan 31, 2018

I wrote this code over a year ago... I don't remember, but I'll check. I think it was an unused variable.

@rdblue

This comment has been minimized.

Show comment
Hide comment
@rdblue

rdblue Jan 31, 2018

Contributor

Thanks for doing this, @pradeepg26! We've occasionally had use cases where it would have been nice. My two main concerns are the unnecessary boolean controlling whether or not to estimate when to do the next check and the naming. Naming should use "row group" instead of "block".

Contributor

rdblue commented Jan 31, 2018

Thanks for doing this, @pradeepg26! We've occasionally had use cases where it would have been nice. My two main concerns are the unnecessary boolean controlling whether or not to estimate when to do the next check and the naming. Naming should use "row group" instead of "block".

@rdblue rdblue referenced a pull request that will close this pull request May 1, 2018

Open

PARQUET-869: Configurable record counts for block size checks #470

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment