Core: Improve GenericReader performance #4218
Conversation
CC: @rbalamohan
@aokolnychyi: Could you please review when you have some time? I see that you created the other jmh tests, so you might be the best person to take a look.
return o -> {
  if (o == null) {
    return null;
  }

  List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o);
-  Record result = GenericRecord.create(type);
+  Record result = template.copy();
Can this result in NPE?
Then there would have been an NPE with the original code as well. The template is only null when the type is null.
private GenericRecord(StructType struct) {
  this.struct = struct;
  this.size = struct.fields().size(); // <-- NPE here when the struct is null
  this.values = new Object[size];
  this.nameToPos = NAME_MAP_CACHE.get(struct);
}
Indeed, the type should never be null here because the SchemaWithPartnerVisitor guarantees it. See:
iceberg/core/src/main/java/org/apache/iceberg/schema/SchemaWithPartnerVisitor.java
Lines 52 to 61 in 90225d6
for (Types.NestedField field : struct.fields()) {
  P fieldPartner = partner != null ? accessors.fieldPartner(partner, field.fieldId(), field.name()) : null;
  visitor.beforeField(field, fieldPartner);
  T result;
  try {
    result = visit(field.type(), fieldPartner, visitor, accessors);
  } finally {
    visitor.afterField(field, fieldPartner);
  }
  results.add(visitor.field(field, fieldPartner, result));
I think we can just add a Preconditions.checkNotNull at line 128 instead of checking the null case in an if-else block.
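The suggested fail-fast check can be sketched with the JDK's Objects.requireNonNull as a stand-in for Guava's Preconditions.checkNotNull. RecordReaderSketch and its String field are hypothetical simplifications, not the PR's actual classes:

```java
import java.util.Objects;

// Hypothetical stand-in for the reader constructor discussed above.
class RecordReaderSketch {
  private final String struct;  // stands in for the StructType field

  RecordReaderSketch(String struct) {
    // Fail fast at construction time with a clear message,
    // instead of branching on null later in an if-else block.
    this.struct = Objects.requireNonNull(struct, "Invalid struct: null");
  }

  String struct() {
    return struct;
  }
}
```

The point of the precondition is that a null schema surfaces immediately at reader construction, rather than as a confusing NPE deep inside a read.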
So here we use the GenericRecord(GenericRecord toCopy) constructor rather than GenericRecord(StructType struct) to eliminate the access to NAME_MAP_CACHE? I'm afraid a future PR could easily fall back to the other one, because this tiny difference is hard for the next developer and reviewer to notice. Is it possible to add some documentation?
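The distinction being discussed can be illustrated with a small self-contained sketch. Row and its NAME_MAP_CACHE are hypothetical simplifications of GenericRecord: creating a record from a schema consults a shared cache on every call, while copying an existing template reuses the already-resolved name-to-position map:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical simplification of GenericRecord.
class Row {
  private static final Map<List<String>, Map<String, Integer>> NAME_MAP_CACHE =
      new ConcurrentHashMap<>();

  final Map<String, Integer> nameToPos;
  final Object[] values;

  // Like GenericRecord.create(struct): consults the cache on every call.
  static Row create(List<String> fieldNames) {
    Map<String, Integer> nameToPos = NAME_MAP_CACHE.computeIfAbsent(
        fieldNames,
        names -> IntStream.range(0, names.size())
            .boxed()
            .collect(Collectors.toMap(names::get, i -> i)));
    return new Row(nameToPos, new Object[fieldNames.size()]);
  }

  private Row(Map<String, Integer> nameToPos, Object[] values) {
    this.nameToPos = nameToPos;
    this.values = values;
  }

  // Like the copy constructor: reuses the resolved map, skips the cache.
  Row copy() {
    return new Row(nameToPos, new Object[values.length]);
  }
}
```

Both paths produce an equivalent empty record; the copy path just avoids the per-record cache lookup, which is exactly the difference that is easy to miss without a comment in the real code.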
Done
There was an issue with positional deletes where the row was not present in the file. The code below still created a reader for the empty struct.
Types.StructType struct = iType != null ? iType.asStructType() : null;
return visitor.struct(struct, group, visitFields(struct, group, visitor));
Changed it to this; let's see if it breaks anything:
return iType != null ?
visitor.struct(iType.asStructType(), group, visitFields(iType.asStructType(), group, visitor)) : null;
Had to revert the changes.
In the Deserializer we do not have a null struct, so I added the Preconditions.checkNotNull() check, but I had to revert to the null checks in the readers because the Spark reader expects the visitors/readers to handle null structs when the readers are created.
I had failures with TestSparkReadProjection.testListOfStructsProjection() and TestGenericAppenderFactory.testPosDeleteWriterWithRowSchema().
Seems like @aokolnychyi is occupied otherwise. @RussellSpitzer or @shardulm94, any chance of taking a look?
@openinx: Fixed the null check where I was able to, added the comments, and the tests are green. If you have some time, I think the PR is ready for another review. Thanks again for taking the time to check this out!
iceberg/core/src/main/java/org/apache/iceberg/data/GenericRecord.java Lines 36 to 46 in c45b8b4
Note that this uses identity comparison for the keys. Otherwise it should perform similarly to a ConcurrentHashMap from the caller's perspective.
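The practical consequence of identity-keyed caching can be shown with the JDK's IdentityHashMap as a stand-in (the real cache implementation is not shown here): two keys that are equals() but not the same object get separate entries.

```java
import java.util.List;
import java.util.Map;

class IdentityCacheDemo {
  // Inserts under one key object, then looks up with an equal-but-distinct key.
  static boolean hitForEqualKey(Map<List<String>, String> cache) {
    List<String> key = List.of("id", "data");
    List<String> equalKey = List.of("id", "data");  // equals(key), different object
    cache.put(key, "cached");
    return cache.get(equalKey) != null;
  }
}
```

With a HashMap the second lookup hits; with an IdentityHashMap it misses. This is why callers of an identity-keyed cache should pass the same struct instance every time, or they will populate duplicate entries.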
RecordReader(List<Type> types,
             List<ParquetValueReader<?>> readers,
             StructType struct) {
  super(types, readers);
  this.structType = struct;
  this.template = struct != null ? GenericRecord.create(struct) : null;
Should these also have preconditions that validate the struct is not null?
My original change did not contain the null check, and there were failing tests (org.apache.iceberg.TestGenericAppenderFactory > testPosDeleteWriterWithRowSchema[FileFormat=parquet, Partitioned=true] FAILED) with the exception below.
In a nutshell, when we are reading the positional delete files, we either have a row in the data or we don't. I tried to fix this by returning a null reader instead of creating a reader for a null struct. But this caused issues with Spark projection, where we create the reader for the full schema, and having a null or missing reader caused problems when we tried to match the readers to the projected schema.
Fixing this is not trivial, so I decided to move forward with the null checks to keep backward compatibility:
- It should be possible to create a reader with a null schema
- Fail only when we try to actually read from this reader
java.lang.NullPointerException
at org.apache.iceberg.data.GenericRecord.<init>(GenericRecord.java:63)
at org.apache.iceberg.data.GenericRecord.create(GenericRecord.java:53)
at org.apache.iceberg.data.parquet.GenericParquetReaders$RecordReader.<init>(GenericParquetReaders.java:62)
at org.apache.iceberg.data.parquet.GenericParquetReaders.createStructReader(GenericParquetReaders.java:52)
at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:176)
at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:114)
at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:141)
at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitField(TypeWithSchemaVisitor.java:169)
at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitFields(TypeWithSchemaVisitor.java:183)
at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:47)
at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:67)
at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:58)
at org.apache.iceberg.data.parquet.GenericParquetReaders.buildReader(GenericParquetReaders.java:41)
at org.apache.iceberg.data.DeleteFilter.lambda$openDeletes$6(DeleteFilter.java:255)
at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:118)
at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:66)
at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:77)
at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:38)
at org.apache.iceberg.util.Filter.lambda$filter$0(Filter.java:35)
at org.apache.iceberg.io.CloseableIterable$2.iterator(CloseableIterable.java:73)
at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:152)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:143)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable.iterator(CloseableIterable.java:138)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable.iterator(CloseableIterable.java:129)
at java.lang.Iterable.forEach(Iterable.java:74)
at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:97)
at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:91)
at org.apache.iceberg.data.DeleteFilter.applyPosDeletes(DeleteFilter.java:231)
at org.apache.iceberg.data.DeleteFilter.filter(DeleteFilter.java:126)
at org.apache.iceberg.data.GenericReader.open(GenericReader.java:77)
at org.apache.iceberg.relocated.com.google.common.collect.Iterators$6.transform(Iterators.java:826)
at org.apache.iceberg.relocated.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:52)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:151)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:143)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable.iterator(CloseableIterable.java:138)
at org.apache.iceberg.data.GenericReader.open(GenericReader.java:65)
at org.apache.iceberg.data.TableScanIterable.iterator(TableScanIterable.java:41)
at org.apache.iceberg.data.TableScanIterable.iterator(TableScanIterable.java:29)
at java.lang.Iterable.forEach(Iterable.java:74)
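The compromise described above — allow constructing a reader with a null schema, and fail only on an actual read — can be sketched as follows. LazyReaderSketch is a hypothetical stand-in for the real reader classes:

```java
// Hypothetical reader: construction tolerates a null template, reads do not.
class LazyReaderSketch {
  private final Object[] template;  // null when the struct/schema is null

  LazyReaderSketch(Object[] template) {
    this.template = template;  // no null check here, by design
  }

  Object[] read() {
    if (template == null) {
      throw new IllegalStateException("Cannot read: reader was created with a null struct");
    }
    return template.clone();  // fresh container per record
  }
}
```

This mirrors the backward-compatible behavior: a reader built for a null struct (as Spark's full-schema projection does) is valid to construct, and the failure surfaces only if someone actually reads through it.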
Almost looks good to me now, except the .gitignore issue and @rdblue's Precondition issue.
Looks good to me now. Thanks @pvary for the contribution, and thanks all for reviewing!
Thanks everyone for the review and the merge!
(cherry picked from commit e75d732)
@rbalamohan identified that:
This makes running GenericRecord.create() for every record expensive.
Sometimes we cannot reuse the containers in readers, but we still need better performance. If we create a template record in the readers and copy this template record, then we can avoid the cache retrieval of the nameToPos map.
Created some basic jmh performance tests, and here are the results:
We can see that the change gained ~10-20% performance for the generic readers.
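The source of the gain can be made visible without JMH by counting cache lookups. CacheHitCounter is a hypothetical harness, not the PR's actual benchmark: the old path pays one lookup per record, the template path pays one lookup total.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness counting how often the name-map cache is consulted.
class CacheHitCounter {
  static final AtomicInteger LOOKUPS = new AtomicInteger();
  static final Map<List<String>, int[]> CACHE = new ConcurrentHashMap<>();

  // Old path: every record triggers a cache lookup (like GenericRecord.create).
  static Object[] createPerRecord(List<String> schema) {
    LOOKUPS.incrementAndGet();
    CACHE.computeIfAbsent(schema, s -> new int[s.size()]);
    return new Object[schema.size()];
  }

  // New path: one lookup for the template, then plain array copies.
  static Object[] copyFromTemplate(Object[] template) {
    return template.clone();
  }
}
```

For N records, the old path performs N cache lookups and the template path performs one, which is consistent with the ~10-20% improvement the JMH runs showed for the generic readers.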