
ORC-1121: Fix column conversion check bug which causes column filters not to work #1055

Merged
merged 2 commits into from
Mar 8, 2022

Conversation

PengleiShi
Contributor

What changes were proposed in this pull request?

Add a map to SchemaEvolution that maps each file column id to its reader column id; the mapping is used in SchemaEvolution.isPPDSafeConversion()

Why are the changes needed?

RecordReaderImpl.pickRowGroups() calls SchemaEvolution.isPPDSafeConversion() with a file column id rather than the reader column id the method expects. As a result, column filters do not work effectively and the record reader cannot skip row groups that do not match. We therefore need to look up the corresponding reader column id for a given file column id so that SchemaEvolution.isPPDSafeConversion() works correctly.
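The id translation this PR proposes can be sketched as follows. This is a minimal standalone sketch, not the actual SchemaEvolution code: the constructor and mapFileToReader are hypothetical stand-ins for the mapping that buildConversion populates.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fix: isPPDSafeConversion receives a FILE column id, but
// ppdSafeConversion is indexed by READER column id, so we translate first.
public class SchemaEvolutionSketch {
    // key: file column id, value: reader column id
    private final Map<Integer, Integer> typeIdsMap = new HashMap<>();
    // indexed by reader column id
    private final boolean[] ppdSafeConversion;

    SchemaEvolutionSketch(boolean[] ppdSafeConversion) {
        this.ppdSafeConversion = ppdSafeConversion;
    }

    // Hypothetical helper standing in for the mapping built in buildConversion.
    void mapFileToReader(int fileColId, int readerColId) {
        typeIdsMap.put(fileColId, readerColId);
    }

    // Mirrors the patched lookup: an unmapped file id is treated as unsafe.
    boolean isPPDSafeConversion(int fileColId) {
        Integer readerTypeId = typeIdsMap.get(fileColId);
        return readerTypeId != null && ppdSafeConversion[readerTypeId];
    }

    public static void main(String[] args) {
        SchemaEvolutionSketch se =
            new SchemaEvolutionSketch(new boolean[]{true, false, true});
        se.mapFileToReader(5, 2);   // file column 5 maps to reader column 2
        System.out.println(se.isPPDSafeConversion(5)); // reader col 2 is safe
        System.out.println(se.isPPDSafeConversion(9)); // unmapped -> unsafe
    }
}
```

Without the translation, file column id 5 would be used to index ppdSafeConversion directly, reading the wrong entry (or failing the bounds check) and silently disabling the filter.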

How was this patch tested?

UT

@github-actions github-actions bot added the JAVA label Mar 4, 2022
@PengleiShi PengleiShi changed the title Fix column conversion check bug which causes column filters not to work ORC-1121: Fix column conversion check bug which causes column filters not to work Mar 4, 2022
@PengleiShi
Contributor Author

ping @guiyanakuang, could you help review this?

@guiyanakuang
Member

LGTM (Pending CIs)

@guiyanakuang
Member

cc @pgaref @dongjoon-hyun

Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for making a PR, @PengleiShi . I'll review this and test with Apache Spark too during this weekend.

@dongjoon-hyun
Member

cc @stiga-huang , too

@@ -126,6 +128,11 @@ public SchemaEvolution(TypeDescription fileSchema,
}
}
buildConversion(fileSchema, this.readerSchema, positionalLevels);
Contributor


I believe this File to Reader id mapping should be part of buildConversion

@@ -38,6 +38,8 @@
public class SchemaEvolution {
   // indexed by reader column id
   private final TypeDescription[] readerFileTypes;
+  // key: file column id, value: reader column id
+  private final Map<Integer, Integer> typeIdsMap = new HashMap<>();
Contributor


I think we can use an array just like readerFileTypes.

BTW, for readability, I'd suggest renaming readerFileTypes to readerIdToFileTypes and renaming typeIdsMap to fileIdToReaderIds.

@@ -296,13 +303,13 @@ private boolean typesAreImplicitConversion(final TypeDescription fileType,

/**
* Check if column is safe for ppd evaluation
-   * @param colId reader column id
+   * @param colId file column id
Contributor


For readability, can we also rename colId to fileColId?

Contributor Author


fixed

-    return !(colId < 0 || colId >= ppdSafeConversion.length) &&
-        ppdSafeConversion[colId];
+    Integer readerTypeId = typeIdsMap.get(colId);
+    return readerTypeId != null && ppdSafeConversion[readerTypeId];
Contributor


Since we only use ppdSafeConversion[] in this method, I think changing ppdSafeConversion[] to be indexed by file column ids is a simpler solution, i.e. modifying the assignment in populatePpdSafeConversion and populatePpdSafeConversionForChildren to use file ids.

Contributor Author


> Since we only use ppdSafeConversion[] in this method, I think changing ppdSafeConversion[] to be indexed by file column ids is a simpler solution, i.e. modifying the assignment in populatePpdSafeConversion and populatePpdSafeConversionForChildren to use file ids.

This looks better. Done
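The simpler approach adopted here, indexing ppdSafeConversion by file column id so no translation is needed at lookup time, might look like this minimal sketch. The constructor stands in for the real population logic in populatePpdSafeConversion; names other than isPPDSafeConversion and ppdSafeConversion are hypothetical.

```java
// Sketch of the adopted approach: ppdSafeConversion is populated indexed by
// FILE column id up front, so isPPDSafeConversion can index it directly.
public class PpdSafeConversionSketch {
    private final boolean[] ppdSafeConversion; // indexed by file column id

    // safeByFileId: hypothetical precomputed safety flags per file column,
    // standing in for populatePpdSafeConversion / ...ForChildren.
    PpdSafeConversionSketch(boolean[] safeByFileId) {
        this.ppdSafeConversion = safeByFileId;
    }

    boolean isPPDSafeConversion(int fileColId) {
        // same bounds guard as the original method, now over file column ids
        return !(fileColId < 0 || fileColId >= ppdSafeConversion.length)
            && ppdSafeConversion[fileColId];
    }

    public static void main(String[] args) {
        PpdSafeConversionSketch se =
            new PpdSafeConversionSketch(new boolean[]{true, false, true});
        System.out.println(se.isPPDSafeConversion(0)); // safe
        System.out.println(se.isPPDSafeConversion(1)); // unsafe conversion
        System.out.println(se.isPPDSafeConversion(7)); // out of range -> unsafe
    }
}
```

This keeps the original one-array lookup shape and avoids the extra HashMap, at the cost of changing how the array is populated rather than how it is read.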

Contributor

@stiga-huang stiga-huang left a comment


+1, LGTM.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @PengleiShi , @guiyanakuang , @pgaref , @stiga-huang !
I also tested with Apache Spark 3.3. Merged to master.

@dongjoon-hyun dongjoon-hyun merged commit e22f537 into apache:main Mar 8, 2022
dongjoon-hyun pushed a commit that referenced this pull request Mar 8, 2022
…don't work (#1055)

(cherry picked from commit e22f537)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Mar 8, 2022
…don't work (#1055)

(cherry picked from commit e22f537)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 1593a9e)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun dongjoon-hyun added this to the 1.6.14 milestone Mar 8, 2022
@dongjoon-hyun
Member

I backported this to branch-1.7 and branch-1.6, too.

@dongjoon-hyun
Member

@PengleiShi . I added you to the Apache ORC contributor group and assigned ORC-1121 to you.
Welcome to the Apache ORC community.

@PengleiShi
Contributor Author

@dongjoon-hyun . Thanks!

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…don't work (apache#1055)
