
PARQUET-1822: Avoid requiring Hadoop installation for reading/writing #1111

Merged: 19 commits merged into apache:master from the avoid-hadoop-path branch on Jul 4, 2023

Conversation

@amousavigourabi (Contributor)

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

Add disk InputFile and OutputFile implementations
Add some Javadoc to OutputFile
@amousavigourabi marked this pull request as ready for review on June 12, 2023 20:46
@amousavigourabi force-pushed the avoid-hadoop-path branch 5 times, most recently from 4fa92ba to 680b8d9 on June 13, 2023 06:57
@wernerdaehn

I don't want to sound too greedy, but the next level of this feature would be if the classes in question had no Hadoop imports at all.
Something like: Parquet (with Hadoop) -> Parquet (with java.nio.File), with the lower-level classes in a jar of their own.

Just dreaming...

@amousavigourabi (Contributor, Author)

> I don't want to sound too greedy, but the next level of this feature would be if the classes in question had no Hadoop imports at all. Something like: Parquet (with Hadoop) -> Parquet (with java.nio.File), with the lower-level classes in a jar of their own.
>
> Just dreaming...

One day... The next step is to get rid of the tight coupling to the other Hadoop classes (mainly Configuration), as that shouldn't break anything and would at least allow users to drop hadoop-client-runtime. But first, this PR is about letting users avoid the bigger Hadoop issues more easily.
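For illustration, a minimal sketch of what the new classes enable: reading a local Parquet file through LocalInputFile with no org.apache.hadoop.fs.Path in sight. The file name and the Avro-based record handling are illustrative assumptions, not part of this PR; it assumes LocalInputFile lands next to LocalOutputFile in org.apache.parquet.io.

import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.LocalInputFile;

public class ReadWithoutHadoopPath {
  public static void main(String[] args) throws Exception {
    // a java.nio.file.Path, not an org.apache.hadoop.fs.Path
    Path path = Paths.get("example.parquet"); // placeholder file name
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(new LocalInputFile(path)).build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }
  }
}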

@wernerdaehn

And I greatly appreciate your work!

@amousavigourabi force-pushed the avoid-hadoop-path branch 2 times, most recently from cafa9d1 to 99c57ea on June 13, 2023 08:16
@wgtmac (Member)

wgtmac commented Jun 15, 2023

@gszadovszky @shangxinli @Fokko Do you have time to take a look? This has been discussed in the mailing list: https://lists.apache.org/thread/d33757j99xqn63hrfz415sq60v3x9hmy

@gszadovszky (Contributor) left a review:

Thanks a lot for working on this @amousavigourabi!

About the method comments: I would only add a comment to a method that overrides another if its behavior changes compared to the definition in the super method.

@gszadovszky (Contributor) left another review:

Sorry @amousavigourabi, the comments related to the naming "disk" were intended to be posted with my previous review. Somehow they got lost from it.

@gszadovszky (Contributor)

@amousavigourabi, please also update the class comments of LocalInputFile and LocalOutputFile accordingly.

@Override
public long getLength() throws IOException {
  RandomAccessFile file = new RandomAccessFile(path.toFile(), "r");
  long length = file.length();
A Member commented:

Should it be cached in case of repeated read?

Or would path.toFile().length() do the same thing?

@amousavigourabi (Contributor, Author) replied:

The length is now cached in a long after the first read.
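A minimal sketch of that caching pattern (the field name and sentinel are assumed here, not taken from the merged code):

// cache the length after the first call; -1 marks "not read yet"
private long length = -1;

@Override
public long getLength() throws IOException {
  if (length == -1) {
    // File#length() avoids opening a RandomAccessFile just to ask for the size
    length = path.toFile().length();
  }
  return length;
}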

return new PositionOutputStream() {

  private final BufferedOutputStream stream =
      new BufferedOutputStream(Files.newOutputStream(path), (int) buffer);
A Member commented:

Does this support overwrite?

A Member commented:

I would expect create to fail if a file with the same name exists.
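One way to get those semantics with java.nio, shown as a hedged sketch (path and payload are illustrative):

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CreateNewSemantics {
  public static void main(String[] args) throws Exception {
    Path path = Paths.get("out.parquet");
    // CREATE_NEW throws FileAlreadyExistsException when the file exists,
    // whereas the default options (CREATE + TRUNCATE_EXISTING) silently overwrite
    try (OutputStream out = Files.newOutputStream(path, StandardOpenOption.CREATE_NEW)) {
      out.write(0);
    }
  }
}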

@wgtmac (Member) left a review:

Thanks @amousavigourabi! Comments are mostly for testing.

@@ -117,19 +120,52 @@ public void testEmptyArray() throws Exception {
}
}

@Test
public void testEmptyArrayLocal() throws Exception {
A Member commented:

This duplicates testEmptyArray(). Can we parameterize them to share common code?

@amousavigourabi (Contributor, Author) replied:

Good idea, done
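For reference, one shape such a parameterization could take with JUnit 4.12+ (the class name, flag, and test body are assumptions, not the actual patch):

import java.util.Arrays;
import java.util.Collection;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestEmptyStructures {

  @Parameterized.Parameters(name = "useLocalFile={0}")
  public static Collection<Boolean> params() {
    return Arrays.asList(false, true);
  }

  private final boolean useLocalFile;

  public TestEmptyStructures(boolean useLocalFile) {
    this.useLocalFile = useLocalFile;
  }

  @Test
  public void testEmptyArray() throws Exception {
    // build the writer against a Hadoop Path or a LocalOutputFile
    // depending on the flag, then run the shared assertions
  }
}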

@@ -145,6 +181,39 @@ public void testEmptyMap() throws Exception {
}
}

@Test
public void testEmptyMapLocal() throws Exception {
A Member commented:

ditto


public class TestLocalInputOutput {

  Path pathFileExists = Paths.get("src/test/resources/disk_output_file_create_overwrite.parquet");
A Member commented:

It seems that we don't need to add a test file. What about using createTempFile() to create a path in your tests here?
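A sketch of that suggestion (names are illustrative): the fixture is created on the fly in a @Before method instead of being committed under src/test/resources.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.Before;

public class TestLocalInputOutput {

  Path pathFileExists;

  @Before
  public void setUp() throws IOException {
    // a throwaway file that already exists when the overwrite test runs
    pathFileExists = Files.createTempFile("local_output_file_create_overwrite", ".parquet");
    pathFileExists.toFile().deleteOnExit();
  }
}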

@wgtmac (Member)

wgtmac commented Jun 22, 2023

@amousavigourabi Do you have any update on this?

@amousavigourabi (Contributor, Author)

> @amousavigourabi Do you have any update on this?

I'm in crunch mode at the moment, so it took a bit longer, but everything has been addressed now.

@wgtmac (Member) left a review:

LGTM. Thanks!

@wgtmac (Member)

wgtmac commented Jun 25, 2023

@gszadovszky Do you want to take another pass?

@wgtmac wgtmac merged commit 7c4cb42 into apache:master Jul 4, 2023
9 checks passed
@eyeyar03

Hi @wgtmac, I'm just getting started with exploring and parsing Parquet in Java. I've been trying the sample code here, but I can't make it work because LocalInputFile is not yet in the current version published at https://mvnrepository.com/artifact/org.apache.parquet/parquet-common.

This PR seems to be merged to master already, so is there a reason I am not seeing the changes in the jar pushed to Maven?

Thanks for the help. :)

@wgtmac (Member)

wgtmac commented Aug 25, 2023

@eyeyar03 We haven't released the next major version 1.14.0 yet, which is why you cannot see it there. We usually don't backport new features to a minor release, so the next minor version 1.13.2 will not have it either.

I am not sure what the best time for the next major release is. Could you please advise? @gszadovszky @shangxinli

@gszadovszky (Contributor)

@wgtmac, in terms of semantic versioning we usually do minor releases (e.g. 1.14.0) for new features and bugfix/patch releases (e.g. 1.13.2) to fix regressions. (A major release would mean 2.0.0, which may contain breaking changes, but that is not planned any time soon.)
There are no exact rules for a release. We usually do a minor release when the community feels there are enough features/improvements in master since the last one. A bugfix/patch release is done when a serious regression has been introduced in the last minor release and we have the fix for it.

@amousavigourabi (Contributor, Author)

@gszadovszky @wgtmac if the next minor release is still far away (there were over two years between 1.12.0 and 1.13.0, and 1.13.0 was released only a few months ago), I wouldn't mind hosting the two relevant implementations in their own little Maven artifact in the meantime, as there does seem to be some demand.

@gszadovszky (Contributor)

@amousavigourabi, I would suggest joining the dev@parquet.apache.org mailing list and starting a discussion about a potential minor release in the near future.

@eyeyar03

eyeyar03 commented Nov 3, 2023

> @eyeyar03 We haven't released the next major version 1.14.0 yet, which is why you cannot see it there. We usually don't backport new features to a minor release, so the next minor version 1.13.2 will not have it either.
>
> I am not sure what the best time for the next major release is. Could you please advise? @gszadovszky @shangxinli

Thanks @wgtmac, noted. I guess we'll have to find an alternative solution for now while waiting for the next major release.

@amousavigourabi amousavigourabi deleted the avoid-hadoop-path branch November 23, 2023 14:06
@drealeed

drealeed commented Dec 1, 2023

Our project needs this feature as well. Is there a date for the next major release?

@amousavigourabi (Contributor, Author)

> Our project needs this feature as well. Is there a date for the next major release?

@drealeed if you just need to drop the Hadoop Path dependency, you might want to consider copying the InputFile and OutputFile implementations from this pull request until the next release is out. If you need to drop Hadoop entirely, that is still being worked on.

@drealeed

drealeed commented Dec 4, 2023

@amousavigourabi, that's actually what I did, and it's working for us now. Thanks!

@chadselph

I'm a big fan of avoiding Hadoop dependencies, but this change did cause an issue for us. Instantiating the LocalOutputFile creates an output stream with a 128 MB buffer, which was causing OOM exceptions on our lower-memory deployments. The fix was easy, just setting .withRowGroupSize to a smaller value, but the old code using the Hadoop Path did not require this much memory.
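A sketch of that workaround, assuming a LocalOutputFile target and the long overload of withRowGroupSize (the schema and sizes are illustrative):

import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.io.LocalOutputFile;

public class SmallRowGroupWriter {
  public static void main(String[] args) throws Exception {
    Schema schema =
        SchemaBuilder.builder().record("Record").fields().nullableBoolean("maybe", false).endRecord();
    Path temp = Files.createTempFile("parquet-small", ".parquet");
    Files.delete(temp); // let the writer create the file itself
    try (ParquetWriter<GenericData.Record> writer =
        AvroParquetWriter.<GenericData.Record>builder(new LocalOutputFile(temp))
            .withSchema(schema)
            .withRowGroupSize(8L * 1024 * 1024) // 8 MB buffer instead of the 128 MB default
            .build()) {
      // write records as usual; the LocalOutputFile stream buffer is now 8 MB
    }
    Files.delete(temp);
  }
}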

@wgtmac (Member)

wgtmac commented Mar 8, 2024

Hi @chadselph, thanks for reporting the issue. Did you try to debug what the root cause is? Does this relate to compression? This is unexpected. cc @amousavigourabi

@amousavigourabi (Contributor, Author)

Hi @chadselph, thanks for the report. Like @wgtmac said, this is unexpected. If you could set up an example that shows this behaviour, that would be a good start for us to address the issue.

@chadselph

chadselph commented Mar 13, 2024

@amousavigourabi @wgtmac sorry for the delay, here's an example:

import java.io.IOException;
import java.nio.file.Files;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.LocalOutputFile;

public class UseMemory {
  public static void main(String[] args) throws IOException {
    // trivial single-column schema
    var schema =
        SchemaBuilder.builder()
            .record("Record")
            .fields()
            .nullableBoolean("maybe", false)
            .endRecord();
    var memStart = Runtime.getRuntime().freeMemory();
    var temp = Files.createTempFile("parquet-mem", ".parquet");
    // delete the temp file so the writer can create it itself
    Files.delete(temp.toAbsolutePath());
    var writer =
        AvroParquetWriter.<GenericData.Record>builder(new LocalOutputFile(temp.toAbsolutePath()))
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withSchema(schema)
            .build();
    // merely building the writer allocates the full output buffer
    System.out.println(
        "Used " + (memStart - Runtime.getRuntime().freeMemory()) / (1024 * 1024) + " MBytes.");
    writer.close();
    Files.delete(temp.toAbsolutePath());
  }
}

I realize this is a crude way to display memory usage, but in case you're in doubt, you can run it in IntelliJ under the debugger and calculate the retained size of the objects: writer is 134 MB, and all of it is from the output stream created here.

[Screenshot, 2024-03-13: IntelliJ debugger view showing the retained size of the writer]

ParquetWriter sets the buffer size to the row group size, which defaults to 128 MB.

@wgtmac (Member)

wgtmac commented Mar 14, 2024

Thanks for the info! @chadselph

Do you have time to take an initial look? @amousavigourabi
