-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test cases for parquetMetrics with multiple rowgroups #314
test cases for parquetMetrics with multiple rowgroups #314
Conversation
@@ -65,6 +70,180 @@ | |||
private final GenericFixed fixed = new GenericData.Fixed( | |||
org.apache.avro.Schema.createFixed("fixedCol", null, null, 4), | |||
"abcd".getBytes(StandardCharsets.UTF_8)); | |||
@Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: there should be an empty line between test cases and other code.
|
||
Record lastRecord = firstRecord; | ||
|
||
for (int i = 0; i < minimumRowGroupRecordCount * 2; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The process for creating the record set seems a bit too complicated to me. Why not base the records on i
instead of adding 1 to the last record's value? This should also be a helper method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue thanks for reviewing !
Now the class TestMetrics
has the test methods now.
So, will add these methods there.
// create parquet file with multiple row groups. by using smaller number of bytes | ||
File parquetFile = writeRecords( | ||
schema, | ||
ImmutableMap.of(PARQUET_ROW_GROUP_SIZE_BYTES, Integer.toString(minimumRowGroupRecordCount * Integer.BYTES)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this size related to the number of records? What about just using a small number, like 1mb and a few thousand records?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue thanks for the reviewing this !
I find it is interesting :)
Yes this value can be independent, but I kept it like this, to derive number of row groups based on the rows we added. ( if we see we added 200+ records, and we got 3 row groups)
Another point I find which is interesting is about the why records has to be > 100, if we want more than 1 row groups.
Looks like 100 is minimum value for minNumberOfRecords in each row group.
So, we use 100+ records to be written, and give rowGroupSize = (100*1 column size).
details about this value : ( I believe you would be aware of, so may be can skip lengthy part :) )
If we see this existing TestParquet
It's using the 100 as the number of records, and if we run this test with < 100 records (say 80) , and even though the here1 we pass say 20
still the it doesn't create more then 1 row groups.
what I understood, reason this startRowGroup
It calculates this value as
this.nextCheckRecordCount = min(max(recordCount / 2, 100), 10000);
so, even if the recordCount < 100
the nextCheckRecordCount
will be 100.
and this variable is used eventually when it writes records, from checkSize() which is called from add().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also in the case of testMetricsForTopLevelWithMultipleRG
, my understanding is.
it internally uses
ParquetWriter.java
which uses InternalParquetRecordWriter.java which also has MINIMUM_RECORD_COUNT_FOR_CHECK = 100
which is also making sure each row group will have minimum 100 records.
Thanks for pinging me, @manishmalhotrawork. I don't have any other comments other than what @rdblue already mentioned. Thanks for working on this! |
closing this one, as |
this PR is for issue #132.
-- tested firstLevel fields and nestedStructure with multipleRowGroups.
@aokolnychyi can you please review.