Skip to content

Conversation

@mderoy
Copy link
Contributor

@mderoy mderoy commented Jul 28, 2023

Problem:

Our application is using the following to read from the table

        IcebergGenerics.ScanBuilder scanBuilder = IcebergGenerics.read(iceberg_table);
        CloseableIterable<Record> records = scanBuilder.useSnapshot(snapshotId).build();
        List<List<String>> output = new ArrayList<List<String>>();
        for (Record record : records) {

however when I apply a case insensitive filter like

        	Expression filterExpr = ExpressionParser.fromJson(m_scanFilter);
        	scanBuilder = scanBuilder.where(filterExpr).caseInsensitive();

I get the below error despite setting the scan to caseInsensitive

2023-07-28 11:17:45.000208 EDT INFO: Scan of 'test.usethestats' using provided filter: '{"type":"eq","term":"V1","value":1}'
Records in test.usethestats :
Error processing the request: Cannot find field 'V1' in struct: struct<1: v1: optional int>

Solution:
The GenericReader is initializing a ParquetReader but not propagating the caseSensitivity argument down to that reader...the same is true for the Orc reader (and avro but I do not see that the avro reader has such an argument). I've tested the fix with Parquet file using our application...I don't have a good way to test with orc.

Testing:

  • added caseinsensitive scenarios to existing projection and filter tests
With Filter:
2023-07-28 11:19:16.000684 EDT INFO: Scan of 'test.usethestats' using provided filter: '{"type":"eq","term":"V1","value":1}'
Records in test.usethestats :
1

Output Without Filter (just to verify)
Records in test.usethestats :
1
20
200

let me know what additional things are needed to land this. I'm assuming that the CI will run all the standard tests...maybe I should add a test for this somewhere?

@github-actions github-actions bot added the data label Jul 28, 2023
@mderoy mderoy changed the title Fix issue where the genericreader does not propagate case sensitivity Java: Fix issue where the genericreader does not propagate case sensitivity Jul 28, 2023
@mderoy
Copy link
Contributor Author

mderoy commented Jul 28, 2023

Fixes issue #8178

@ConeyLiu
Copy link
Contributor

I think it would be better to add a UT for this.

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for fixing this @mderoy. The changes LGTM, but could you please add some unit tests that make sure this gets the expected results for Orc & Parquet?

@mderoy
Copy link
Contributor Author

mderoy commented Aug 2, 2023

Looks like there are no tests around GenericReader at the moment (unless you can point me to some). closest thing I can find is ./data/src/test/java/org/apache/iceberg/data/orc/TestGenericData.java which uses the GenericOrcReader. I'll look into adding a new file for testing the GenericReader when I have some time :)

@Fokko Fokko added this to the Iceberg 1.4.0 milestone Aug 2, 2023
@nastra
Copy link
Contributor

nastra commented Aug 3, 2023

I think a good place to add some tests would be TestLocalScan, as it performs scans with all file formats.

@mderoy
Copy link
Contributor Author

mderoy commented Aug 21, 2023

@nastra I added a new case insensitive test which covers both projection and filtering, but let me know if you'd like me to just put a filter scenario in those tests instead :)

@mderoy
Copy link
Contributor Author

mderoy commented Aug 21, 2023

There are formatting failures, and after some thought, I think I do prefer moving the caseinsensitive scenarios into the existing test cases for filtering and projection. I'll fix those now.

@mderoy mderoy force-pushed the fix-caseinsensitive-filter branch from e8e5aef to adafd72 Compare August 21, 2023 15:51
@mderoy mderoy requested a review from nastra August 21, 2023 17:11
@mderoy
Copy link
Contributor Author

mderoy commented Aug 21, 2023

ready for re-review :)

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @mderoy

@nastra nastra merged commit 1fbb5cb into apache:master Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants