Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add compressed file support for ORCRecordReader #9884

Merged
merged 5 commits into from Dec 8, 2022
Merged

feat: add compressed file support for ORCRecordReader #9884

merged 5 commits into from Dec 8, 2022

Conversation

etolbakov
Copy link
Contributor

This is WIP on #9847
The aim of this change is to support gz compressed files for ORCRecordReader and ParquetAvroRecordReader in order to achieve the feature parity with other types like csv/json format.

Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the test case for this? You can refer ORCRecordReaderTest. We can probably add the data.orc.gz to the resource (we already have data.orc). Also, are we going to cover ParquetAvroRecordReader from this PR?

@@ -106,6 +110,19 @@ public void init(File dataFile, @Nullable Set<String> fieldsToRead, @Nullable Re
_nextRowId = 0;
}

private File unzipIfRequired(File dataFile) throws IOException {
if (dataFile.getName().endsWith(".gz")) {
Copy link
Contributor

@snleee snleee Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the better approach is to add a helper function in RecordReaderUtils to identify whether the file is gzipped instead of depending on the extension. I have seen the cases where people deal with gzipped files (but not end with .gz).

There are multiple approaches: https://stackoverflow.com/questions/30507653/how-to-check-whether-file-is-gzip-or-not-in-java

I think that we can try to open the file with GZIPInputStream and check for the exception.

@etolbakov
Copy link
Contributor Author

hi @snleee,
I've made some adjustments. Could you please take a look when you have time?

While I was implementing the solution was a bit concerned about the code duplication, but not sure if extractingunpackIfRequired into, f.e. RecordReaderUtils would make the code clear. Please let me know what do you think about it.
Also regarding theisGZippedFile utility method as far as I understood we would like to support only gz and not archive files in general, right?


Regards, Eugene

Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you set up Pinot Code Style and apply it to pass the lint test?

Please refer to the following:
https://docs.pinot.apache.org/developers/developers-and-contributors/code-setup

@@ -106,6 +111,18 @@ public void init(File dataFile, @Nullable Set<String> fieldsToRead, @Nullable Re
_nextRowId = 0;
}

private File unpackIfRequired(File dataFile) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to RecordReaderUtils class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!
I decided to add extension as a function parameter explicitly, as you've mentioned above the archive will not necessarily have .gz suffix.

@snleee
Copy link
Contributor

snleee commented Dec 7, 2022

@etolbakov

Error:      src/test/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordReaderTest.java
Error:          @@ -26,7 +26,6 @@
Error:           import·java.util.List;
Error:           import·java.util.Map;
Error:           import·java.util.zip.GZIPOutputStream;
Error:          -
Error:           import·org.apache.hadoop.conf.Configuration;
Error:           import·org.apache.hadoop.fs.Path;
Error:           import·org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
Error:  Run 'mvn spotless:apply' to fix these violations.

Let's try to run mvn spotless:apply to let maven clean up the formatting.

@etolbakov etolbakov changed the title WIP - feat: add compressed file support for ORCRecordReader feat: add compressed file support for ORCRecordReader Dec 7, 2022
Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LTGM otherwise. I triggered the test again. Let's see how this goes :)

@@ -17,9 +17,15 @@
* specific language governing permissions and limitations
* under the License.
*/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) remove line

@codecov-commenter
Copy link

codecov-commenter commented Dec 8, 2022

Codecov Report

Merging #9884 (9d0ca7b) into master (041865a) will increase coverage by 1.75%.
The diff coverage is 62.06%.

@@             Coverage Diff              @@
##             master    #9884      +/-   ##
============================================
+ Coverage     68.65%   70.40%   +1.75%     
- Complexity     5049     5060      +11     
============================================
  Files          1973     1982       +9     
  Lines        106008   106490     +482     
  Branches      16060    16140      +80     
============================================
+ Hits          72775    74970    +2195     
+ Misses        28110    26279    -1831     
- Partials       5123     5241     +118     
Flag Coverage Δ
integration1 25.05% <0.00%> (+0.01%) ⬆️
integration2 24.53% <0.00%> (?)
unittests1 67.94% <0.00%> (-0.01%) ⬇️
unittests2 15.84% <62.06%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ache/pinot/spi/data/readers/RecordReaderUtils.java 41.17% <0.00%> (-58.83%) ⬇️
.../pinot/plugin/inputformat/orc/ORCRecordReader.java 74.83% <80.00%> (-0.33%) ⬇️
...n/inputformat/parquet/ParquetAvroRecordReader.java 89.47% <100.00%> (+0.58%) ⬆️
...inputformat/parquet/ParquetNativeRecordReader.java 90.47% <100.00%> (+0.23%) ⬆️
...lugin/inputformat/parquet/ParquetRecordReader.java 86.95% <100.00%> (+0.59%) ⬆️
...pinot/plugin/inputformat/parquet/ParquetUtils.java 50.00% <100.00%> (ø)
...ain/java/org/apache/pinot/spi/utils/LoopUtils.java 50.00% <0.00%> (-50.00%) ⬇️
...ery/aggregation/ObjectAggregationResultHolder.java 60.00% <0.00%> (-25.72%) ⬇️
...ery/aggregation/DoubleAggregationResultHolder.java 60.00% <0.00%> (-15.00%) ⬇️
...aggregation/groupby/ObjectGroupByResultHolder.java 75.00% <0.00%> (-12.50%) ⬇️
... and 224 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@snleee snleee merged commit 8a2fbf9 into apache:master Dec 8, 2022
@etolbakov etolbakov deleted the 9847-gz_compressed_file_support branch December 8, 2022 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants