-
Notifications
You must be signed in to change notification settings - Fork 476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1121: Fix column coversion check bug which causes column filters don't work #1055
Conversation
ping @guiyanakuang could you help to review this? |
LGTM (Pending CIs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making a PR, @PengleiShi . I'll review this and test with Apache Spark too during this weekend.
cc @stiga-huang , too |
@@ -126,6 +128,11 @@ public SchemaEvolution(TypeDescription fileSchema, | |||
} | |||
} | |||
buildConversion(fileSchema, this.readerSchema, positionalLevels); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this File to Reader id mapping should be part of buildConversion
@@ -38,6 +38,8 @@ | |||
public class SchemaEvolution { | |||
// indexed by reader column id | |||
private final TypeDescription[] readerFileTypes; | |||
// key: file column id, value: reader column id | |||
private final Map<Integer, Integer> typeIdsMap = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use an array just like readerFileTypes
.
BTW, for readability, I'd suggest renaming readerFileTypes
to readerIdToFileTypes
and renaming typeIdsMap
to fileIdToReaderIds
.
@@ -296,13 +303,13 @@ private boolean typesAreImplicitConversion(final TypeDescription fileType, | |||
|
|||
/** | |||
* Check if column is safe for ppd evaluation | |||
* @param colId reader column id | |||
* @param colId file column id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, can we also rename colId
to fileColId
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
return !(colId < 0 || colId >= ppdSafeConversion.length) && | ||
ppdSafeConversion[colId]; | ||
Integer readerTypeId = typeIdsMap.get(colId); | ||
return readerTypeId != null && ppdSafeConversion[readerTypeId]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we only use ppdSafeConversion[]
in this method, I think changing ppdSafeConversion[]
to be indexed by file column ids is a simpler solution, i.e. modifying the assignment in populatePpdSafeConversion
and populatePpdSafeConversionForChildren
to use file ids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we only use
ppdSafeConversion[]
in this method, I think changingppdSafeConversion[]
to be indexed by file column ids is a simpler solution, i.e. modifying the assignment inpopulatePpdSafeConversion
andpopulatePpdSafeConversionForChildren
to use file ids.
This looks better. Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @PengleiShi , @guiyanakuang , @pgaref , @stiga-huang !
I also tested with Apache Spark 3.3. Merged to master.
…don't work (#1055) ### What changes were proposed in this pull request? Add a map in `SchemaEvolution` which contains the mapping from the file column id to the reader column id, the mapping will be used in `SchemaEvolution.isPPDSafeConversion()` ### Why are the changes needed? `RecordReaderImpl.pickRowGroups()` calls `SchemaEvolution.isPPDSafeConversion()` with file column id rather than reader column id which is required, this causes column filters can't work effectively and recordReader can't skip row groups which are not matched, so we need find the corresponding reader column id via file column id to ensure `SchemaEvolution.isPPDSafeConversion()` can work correctly. ### How was this patch tested? UT (cherry picked from commit e22f537) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…don't work (#1055) ### What changes were proposed in this pull request? Add a map in `SchemaEvolution` which contains the mapping from the file column id to the reader column id, the mapping will be used in `SchemaEvolution.isPPDSafeConversion()` ### Why are the changes needed? `RecordReaderImpl.pickRowGroups()` calls `SchemaEvolution.isPPDSafeConversion()` with file column id rather than reader column id which is required, this causes column filters can't work effectively and recordReader can't skip row groups which are not matched, so we need find the corresponding reader column id via file column id to ensure `SchemaEvolution.isPPDSafeConversion()` can work correctly. ### How was this patch tested? UT (cherry picked from commit e22f537) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1593a9e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
I backported this to branch-1.7 and branch-1.6, too. |
@PengleiShi . I added you to the Apache ORC contributor group and assigned ORC-1121 to you. |
@dongjoon-hyun . Thanks! |
…don't work (apache#1055) ### What changes were proposed in this pull request? Add a map in `SchemaEvolution` which contains the mapping from the file column id to the reader column id, the mapping will be used in `SchemaEvolution.isPPDSafeConversion()` ### Why are the changes needed? `RecordReaderImpl.pickRowGroups()` calls `SchemaEvolution.isPPDSafeConversion()` with file column id rather than reader column id which is required, this causes column filters can't work effectively and recordReader can't skip row groups which are not matched, so we need find the corresponding reader column id via file column id to ensure `SchemaEvolution.isPPDSafeConversion()` can work correctly. ### How was this patch tested? UT
What changes were proposed in this pull request?
Add a map in
SchemaEvolution
which contains the mapping from the file column id to the reader column id, the mapping will be used inSchemaEvolution.isPPDSafeConversion()
Why are the changes needed?
RecordReaderImpl.pickRowGroups()
callsSchemaEvolution.isPPDSafeConversion()
with file column id rather than reader column id which is required, this causes column filters can't work effectively and recordReader can't skip row groups which are not matched, so we need find the corresponding reader column id via file column id to ensureSchemaEvolution.isPPDSafeConversion()
can work correctly.How was this patch tested?
UT