feat: add compressed file support for ORCRecordReader #9884

etolbakov · 2022-11-30T23:01:37Z

This is WIP on #9847
The aim of this change is to support gz compressed files for ORCRecordReader and ParquetAvroRecordReader in order to achieve the feature parity with other types like csv/json format.

snleee

Can we add the test case for this? You can refer ORCRecordReaderTest. We can probably add the data.orc.gz to the resource (we already have data.orc). Also, are we going to cover ParquetAvroRecordReader from this PR?

snleee · 2022-12-01T03:08:41Z

...-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java

@@ -106,6 +110,19 @@ public void init(File dataFile, @Nullable Set<String> fieldsToRead, @Nullable Re
    _nextRowId = 0;
  }

+  private File unzipIfRequired(File dataFile) throws IOException {
+    if (dataFile.getName().endsWith(".gz")) {


I think that the better approach is to add a helper function in RecordReaderUtils to identify whether the file is gzipped instead of depending on the extension. I have seen the cases where people deal with gzipped files (but not end with .gz).

There are multiple approaches: https://stackoverflow.com/questions/30507653/how-to-check-whether-file-is-gzip-or-not-in-java

I think that we can try to open the file with GZIPInputStream and check for the exception.

etolbakov · 2022-12-05T21:29:04Z

hi @snleee,
I've made some adjustments. Could you please take a look when you have time?

While I was implementing the solution was a bit concerned about the code duplication, but not sure if extractingunpackIfRequired into, f.e. RecordReaderUtils would make the code clear. Please let me know what do you think about it.
Also regarding theisGZippedFile utility method as far as I understood we would like to support only gz and not archive files in general, right?

Regards, Eugene

snleee

Can you set up Pinot Code Style and apply it to pass the lint test?

Please refer to the following:
https://docs.pinot.apache.org/developers/developers-and-contributors/code-setup

snleee · 2022-12-05T21:57:14Z

...-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReader.java

@@ -106,6 +111,18 @@ public void init(File dataFile, @Nullable Set<String> fieldsToRead, @Nullable Re
    _nextRowId = 0;
  }

+  private File unpackIfRequired(File dataFile) throws IOException {


Let's move this to RecordReaderUtils class

done!
I decided to add extension as a function parameter explicitly, as you've mentioned above the archive will not necessarily have .gz suffix.

snleee · 2022-12-07T07:28:52Z

@etolbakov

Error:      src/test/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReaderTest.java
Error:          @@ -26,7 +26,6 @@
Error:           import·java.util.List;
Error:           import·java.util.Map;
Error:           import·java.util.zip.GZIPOutputStream;
Error:          -
Error:           import·org.apache.hadoop.conf.Configuration;
Error:           import·org.apache.hadoop.fs.Path;
Error:           import·org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
Error:  Run 'mvn spotless:apply' to fix these violations.

Let's try to run mvn spotless:apply to let maven clean up the formatting.

snleee

LTGM otherwise. I triggered the test again. Let's see how this goes :)

snleee · 2022-12-08T01:27:04Z

...mat/pinot-orc/src/test/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReaderTest.java

@@ -17,9 +17,15 @@
 * specific language governing permissions and limitations
 * under the License.
 */
+


(nit) remove line

codecov-commenter · 2022-12-08T02:05:35Z

Codecov Report

Merging #9884 (9d0ca7b) into master (041865a) will increase coverage by 1.75%.
The diff coverage is 62.06%.

@@             Coverage Diff              @@
##             master    #9884      +/-   ##
============================================
+ Coverage     68.65%   70.40%   +1.75%     
- Complexity     5049     5060      +11     
============================================
  Files          1973     1982       +9     
  Lines        106008   106490     +482     
  Branches      16060    16140      +80     
============================================
+ Hits          72775    74970    +2195     
+ Misses        28110    26279    -1831     
- Partials       5123     5241     +118

Flag	Coverage Δ
integration1	`25.05% <0.00%> (+0.01%)`	⬆️
integration2	`24.53% <0.00%> (?)`
unittests1	`67.94% <0.00%> (-0.01%)`	⬇️
unittests2	`15.84% <62.06%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...ache/pinot/spi/data/readers/RecordReaderUtils.java	`41.17% <0.00%> (-58.83%)`	⬇️
.../pinot/plugin/inputformat/orc/ORCRecordReader.java	`74.83% <80.00%> (-0.33%)`	⬇️
...n/inputformat/parquet/ParquetAvroRecordReader.java	`89.47% <100.00%> (+0.58%)`	⬆️
...inputformat/parquet/ParquetNativeRecordReader.java	`90.47% <100.00%> (+0.23%)`	⬆️
...lugin/inputformat/parquet/ParquetRecordReader.java	`86.95% <100.00%> (+0.59%)`	⬆️
...pinot/plugin/inputformat/parquet/ParquetUtils.java	`50.00% <100.00%> (ø)`
...ain/java/org/apache/pinot/spi/utils/LoopUtils.java	`50.00% <0.00%> (-50.00%)`	⬇️
...ery/aggregation/ObjectAggregationResultHolder.java	`60.00% <0.00%> (-25.72%)`	⬇️
...ery/aggregation/DoubleAggregationResultHolder.java	`60.00% <0.00%> (-15.00%)`	⬇️
...aggregation/groupby/ObjectGroupByResultHolder.java	`75.00% <0.00%> (-12.50%)`	⬇️
... and 224 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

feat: add compressed file support for ORCRecordReader

ca6bd21

snleee reviewed Dec 1, 2022

View reviewed changes

feat: fix code review remarks, add gz support for parquet files

01ff16a

snleee reviewed Dec 5, 2022

View reviewed changes

etolbakov added 2 commits December 6, 2022 20:29

feat: move unpackIfRequired method in RecordReaderUtils

e6c9766

feat: apply Pinot code style

265b95b

chore: fix code styles

9d0ca7b

etolbakov changed the title ~~WIP - feat: add compressed file support for ORCRecordReader~~ feat: add compressed file support for ORCRecordReader Dec 7, 2022

snleee approved these changes Dec 8, 2022

View reviewed changes

snleee merged commit 8a2fbf9 into apache:master Dec 8, 2022

etolbakov deleted the 9847-gz_compressed_file_support branch December 8, 2022 07:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add compressed file support for ORCRecordReader #9884

feat: add compressed file support for ORCRecordReader #9884

etolbakov commented Nov 30, 2022

snleee left a comment •

edited

snleee Dec 1, 2022 •

edited

etolbakov commented Dec 5, 2022

snleee left a comment

snleee Dec 5, 2022

etolbakov Dec 6, 2022

snleee commented Dec 7, 2022

snleee left a comment

snleee Dec 8, 2022

codecov-commenter commented Dec 8, 2022 •

edited

feat: add compressed file support for ORCRecordReader #9884

feat: add compressed file support for ORCRecordReader #9884

Conversation

etolbakov commented Nov 30, 2022

snleee left a comment • edited

Choose a reason for hiding this comment

snleee Dec 1, 2022 • edited

Choose a reason for hiding this comment

etolbakov commented Dec 5, 2022

snleee left a comment

Choose a reason for hiding this comment

snleee Dec 5, 2022

Choose a reason for hiding this comment

etolbakov Dec 6, 2022

Choose a reason for hiding this comment

snleee commented Dec 7, 2022

snleee left a comment

Choose a reason for hiding this comment

snleee Dec 8, 2022

Choose a reason for hiding this comment

codecov-commenter commented Dec 8, 2022 • edited

Codecov Report

snleee left a comment •

edited

snleee Dec 1, 2022 •

edited

codecov-commenter commented Dec 8, 2022 •

edited