Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test cases for parquetMetrics with multiple rowgroups #314

Conversation

manishmalhotrawork
Copy link
Contributor

this PR is for issue #132.
-- tested firstLevel fields and nestedStructure with multipleRowGroups.

@aokolnychyi can you please review.

@@ -65,6 +70,180 @@
private final GenericFixed fixed = new GenericData.Fixed(
org.apache.avro.Schema.createFixed("fixedCol", null, null, 4),
"abcd".getBytes(StandardCharsets.UTF_8));
@Test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: there should be an empty line between test cases and other code.


Record lastRecord = firstRecord;

for (int i = 0; i < minimumRowGroupRecordCount * 2; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process for creating the record set seems a bit too complicated to me. Why not base the records on i instead of adding 1 to the last record's value? This should also be a helper method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue thanks for reviewing !
Now the class TestMetrics has the test methods now.
So, will add these methods there.

// create parquet file with multiple row groups. by using smaller number of bytes
File parquetFile = writeRecords(
schema,
ImmutableMap.of(PARQUET_ROW_GROUP_SIZE_BYTES, Integer.toString(minimumRowGroupRecordCount * Integer.BYTES)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this size related to the number of records? What about just using a small number, like 1mb and a few thousand records?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue thanks for the reviewing this !

I find it is interesting :)
Yes this value can be independent, but I kept it like this, to derive number of row groups based on the rows we added. ( if we see we added 200+ records, and we got 3 row groups)

Another point I find which is interesting is about the why records has to be > 100, if we want more than 1 row groups.
Looks like 100 is minimum value for minNumberOfRecords in each row group.
So, we use 100+ records to be written, and give rowGroupSize = (100*1 column size).

details about this value : ( I believe you would be aware of, so may be can skip lengthy part :) )
If we see this existing TestParquet

It's using the 100 as the number of records, and if we run this test with < 100 records (say 80) , and even though the here1 we pass say 20 still the it doesn't create more then 1 row groups.

what I understood, reason this startRowGroup

It calculates this value as
this.nextCheckRecordCount = min(max(recordCount / 2, 100), 10000);
so, even if the recordCount < 100 the nextCheckRecordCount will be 100.
and this variable is used eventually when it writes records, from checkSize() which is called from add().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also in the case of testMetricsForTopLevelWithMultipleRG, my understanding is.
it internally uses
ParquetWriter.java
which uses InternalParquetRecordWriter.java which also has MINIMUM_RECORD_COUNT_FOR_CHECK = 100 which is also making sure each row group will have minimum 100 records.

@aokolnychyi
Copy link
Contributor

Thanks for pinging me, @manishmalhotrawork. I don't have any other comments other than what @rdblue already mentioned. Thanks for working on this!

@manishmalhotrawork
Copy link
Contributor Author

closing this one, as TestParquetMetrics is refactored.
opened PR-365

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants