
ORC-1121: Fix column conversion check bug which causes column filters not to work #1055

Merged
merged 2 commits into from
Mar 8, 2022

Conversation

PengleiShi
Contributor

What changes were proposed in this pull request?

Add a map to SchemaEvolution that maps each file column id to its reader column id; the mapping is used in SchemaEvolution.isPPDSafeConversion()

Why are the changes needed?

RecordReaderImpl.pickRowGroups() calls SchemaEvolution.isPPDSafeConversion() with a file column id rather than the reader column id the method expects. As a result, column filters do not work effectively and the record reader cannot skip row groups that do not match. We therefore need to look up the corresponding reader column id for a given file column id so that SchemaEvolution.isPPDSafeConversion() works correctly.
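The id translation this PR proposes can be sketched as follows. This is a minimal standalone sketch, not the actual SchemaEvolution code: the constructor and mapFileToReader are hypothetical stand-ins for the mapping that buildConversion populates.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fix: isPPDSafeConversion receives a FILE column id, but
// ppdSafeConversion is indexed by READER column id, so we translate first.
public class SchemaEvolutionSketch {
    // key: file column id, value: reader column id
    private final Map<Integer, Integer> typeIdsMap = new HashMap<>();
    // indexed by reader column id
    private final boolean[] ppdSafeConversion;

    SchemaEvolutionSketch(boolean[] ppdSafeConversion) {
        this.ppdSafeConversion = ppdSafeConversion;
    }

    // Hypothetical helper standing in for the mapping built in buildConversion.
    void mapFileToReader(int fileColId, int readerColId) {
        typeIdsMap.put(fileColId, readerColId);
    }

    // Mirrors the patched lookup: an unmapped file id is treated as unsafe.
    boolean isPPDSafeConversion(int fileColId) {
        Integer readerTypeId = typeIdsMap.get(fileColId);
        return readerTypeId != null && ppdSafeConversion[readerTypeId];
    }

    public static void main(String[] args) {
        SchemaEvolutionSketch se =
            new SchemaEvolutionSketch(new boolean[]{true, false, true});
        se.mapFileToReader(5, 2);   // file column 5 maps to reader column 2
        System.out.println(se.isPPDSafeConversion(5)); // reader col 2 is safe
        System.out.println(se.isPPDSafeConversion(9)); // unmapped -> unsafe
    }
}
```

Without the translation, file column id 5 would be used to index ppdSafeConversion directly, reading the wrong entry (or failing the bounds check) and silently disabling the filter.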

How was this patch tested?

UT

@github-actions github-actions bot added the JAVA label Mar 4, 2022
@PengleiShi PengleiShi changed the title Fix column conversion check bug which causes column filters not to work ORC-1121: Fix column conversion check bug which causes column filters not to work Mar 4, 2022
@PengleiShi
Contributor Author

ping @guiyanakuang, could you help review this?

@guiyanakuang
Member

LGTM (Pending CIs)

@guiyanakuang
Member

cc @pgaref @dongjoon-hyun

Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for making a PR, @PengleiShi . I'll review this and test with Apache Spark too during this weekend.

@dongjoon-hyun
Member

cc @stiga-huang , too

@@ -126,6 +128,11 @@ public SchemaEvolution(TypeDescription fileSchema,
}
}
buildConversion(fileSchema, this.readerSchema, positionalLevels);
Contributor


I believe this File to Reader id mapping should be part of buildConversion

@@ -38,6 +38,8 @@
public class SchemaEvolution {
   // indexed by reader column id
   private final TypeDescription[] readerFileTypes;
+  // key: file column id, value: reader column id
+  private final Map<Integer, Integer> typeIdsMap = new HashMap<>();
Contributor


I think we can use an array just like readerFileTypes.

BTW, for readability, I'd suggest renaming readerFileTypes to readerIdToFileTypes and renaming typeIdsMap to fileIdToReaderIds.

@@ -296,13 +303,13 @@ private boolean typesAreImplicitConversion(final TypeDescription fileType,

/**
* Check if column is safe for ppd evaluation
-   * @param colId reader column id
+   * @param colId file column id
Contributor


For readability, can we also rename colId to fileColId?

Contributor Author


fixed

-    return !(colId < 0 || colId >= ppdSafeConversion.length) &&
-        ppdSafeConversion[colId];
+    Integer readerTypeId = typeIdsMap.get(colId);
+    return readerTypeId != null && ppdSafeConversion[readerTypeId];
Contributor


Since we only use ppdSafeConversion[] in this method, I think changing ppdSafeConversion[] to be indexed by file column ids is a simpler solution, i.e. modifying the assignment in populatePpdSafeConversion and populatePpdSafeConversionForChildren to use file ids.

Contributor Author


> Since we only use ppdSafeConversion[] in this method, I think changing ppdSafeConversion[] to be indexed by file column ids is a simpler solution, i.e. modifying the assignment in populatePpdSafeConversion and populatePpdSafeConversionForChildren to use file ids.

This looks better. Done
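The simpler approach adopted here, indexing ppdSafeConversion by file column id so no translation is needed at lookup time, might look like this minimal sketch. The constructor stands in for the real population logic in populatePpdSafeConversion; names other than isPPDSafeConversion and ppdSafeConversion are hypothetical.

```java
// Sketch of the adopted approach: ppdSafeConversion is populated indexed by
// FILE column id up front, so isPPDSafeConversion can index it directly.
public class PpdSafeConversionSketch {
    private final boolean[] ppdSafeConversion; // indexed by file column id

    // safeByFileId: hypothetical precomputed safety flags per file column,
    // standing in for populatePpdSafeConversion / ...ForChildren.
    PpdSafeConversionSketch(boolean[] safeByFileId) {
        this.ppdSafeConversion = safeByFileId;
    }

    boolean isPPDSafeConversion(int fileColId) {
        // same bounds guard as the original method, now over file column ids
        return !(fileColId < 0 || fileColId >= ppdSafeConversion.length)
            && ppdSafeConversion[fileColId];
    }

    public static void main(String[] args) {
        PpdSafeConversionSketch se =
            new PpdSafeConversionSketch(new boolean[]{true, false, true});
        System.out.println(se.isPPDSafeConversion(0)); // safe
        System.out.println(se.isPPDSafeConversion(1)); // unsafe conversion
        System.out.println(se.isPPDSafeConversion(7)); // out of range -> unsafe
    }
}
```

This keeps the original one-array lookup shape and avoids the extra HashMap, at the cost of changing how the array is populated rather than how it is read.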

Contributor

@stiga-huang stiga-huang left a comment


+1, LGTM.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @PengleiShi , @guiyanakuang , @pgaref , @stiga-huang !
I also tested with Apache Spark 3.3. Merged to master.

@dongjoon-hyun dongjoon-hyun merged commit e22f537 into apache:main Mar 8, 2022
dongjoon-hyun pushed a commit that referenced this pull request Mar 8, 2022
…don't work (#1055)

(cherry picked from commit e22f537)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Mar 8, 2022
…don't work (#1055)

(cherry picked from commit e22f537)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 1593a9e)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun dongjoon-hyun added this to the 1.6.14 milestone Mar 8, 2022
@dongjoon-hyun
Member

I backported this to branch-1.7 and branch-1.6, too.

@dongjoon-hyun
Member

@PengleiShi . I added you to the Apache ORC contributor group and assigned ORC-1121 to you.
Welcome to the Apache ORC community.

@PengleiShi
Contributor Author

@dongjoon-hyun . Thanks!

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…don't work (apache#1055)
