PARQUET-1822: Avoid requiring Hadoop installation for reading/writing #1111
Conversation
Add disk InputFile and OutputFile implementations
Add some Javadoc to OutputFile
I don't want to sound too greedy, but the next level of this feature would be if the classes in question have no imports of Hadoop in them. Just dreaming...
One day... The next step is to get rid of the tight coupling to the other Hadoop classes (mainly Configuration) as that shouldn't break anything, which would at least allow users to drop hadoop-client-runtime. But first, this PR is to allow users to avoid the bigger Hadoop issues more easily.
And I greatly appreciate your work!
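For readers skimming the thread: the core idea of the PR is an InputFile implementation backed by java.nio.file.Path instead of a Hadoop Path. A minimal, self-contained sketch of that shape — the interface below is a simplified stand-in (the real org.apache.parquet.io.InputFile returns a SeekableInputStream, not a RandomAccessFile), and all names here are illustrative, not the PR's actual code:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalInputFileSketch {

  // Simplified stand-in for org.apache.parquet.io.InputFile; the real
  // interface exposes a SeekableInputStream rather than a RandomAccessFile.
  interface SimpleInputFile {
    long getLength() throws IOException;
    RandomAccessFile newStream() throws IOException;
  }

  // A java.nio.file.Path-backed implementation: no Hadoop Path, FileSystem,
  // or Configuration needed to read a local file.
  static SimpleInputFile fromPath(Path path) {
    return new SimpleInputFile() {
      @Override
      public long getLength() throws IOException {
        return Files.size(path);
      }

      @Override
      public RandomAccessFile newStream() throws IOException {
        return new RandomAccessFile(path.toFile(), "r");
      }
    };
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("local-input", ".bin");
    Files.write(tmp, new byte[] {1, 2, 3, 4});
    SimpleInputFile in = fromPath(tmp);
    System.out.println(in.getLength()); // prints 4
    Files.delete(tmp);
  }
}
```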
@gszadovszky @shangxinli @Fokko Do you have time to take a look? This has been discussed in the mailing list: https://lists.apache.org/thread/d33757j99xqn63hrfz415sq60v3x9hmy
Thanks a lot for working on this @amousavigourabi!
About the method comments: I would only add a comment to an overriding method if its behavior differs from the definition in the super type.
See comments made by @gszadovszky under apache#1111
Sorry @amousavigourabi, the comments related to the naming "disk" were intended to be posted with my previous review. Somehow these comments got lost from it.
parquet-common/src/main/java/org/apache/parquet/io/DiskInputFile.java
parquet-common/src/main/java/org/apache/parquet/io/DiskOutputFile.java
See comments made by @gszadovszky under apache#1111
@amousavigourabi, please also update the class comments of …
See comments made by @gszadovszky under apache#1111
parquet-common/src/main/java/org/apache/parquet/io/LocalInputFile.java
@Override
public long getLength() throws IOException {
  RandomAccessFile file = new RandomAccessFile(path.toFile(), "r");
  long length = file.length();
Should it be cached in case of repeated reads? Or would path.toFile().length() do the same thing?
It is stored in a long after the first read now.
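The resolution above (read the size once, keep it in a long) can be sketched as below. The class and field names are illustrative, not the PR's actual code; note the caching assumes the file is not modified between calls:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CachedLengthFile {

  private final Path path;
  private long length = -1; // -1 means "not read yet"

  public CachedLengthFile(Path path) {
    this.path = path;
  }

  // Hits the filesystem only on the first call; later calls return the
  // cached value. Files.size(path) (like path.toFile().length()) avoids
  // opening a RandomAccessFile just to ask for the length.
  public long getLength() throws IOException {
    if (length < 0) {
      length = Files.size(path);
    }
    return length;
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("cached-length", ".bin");
    Files.write(tmp, new byte[10]);
    CachedLengthFile f = new CachedLengthFile(tmp);
    System.out.println(f.getLength()); // prints 10
    Files.delete(tmp);
  }
}
```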
return new PositionOutputStream() {

  private final BufferedOutputStream stream =
      new BufferedOutputStream(Files.newOutputStream(path), (int) buffer);
Does this support overwrite?
I would expect create to fail if a file with the same name exists.
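For context on this exchange: Files.newOutputStream with no options defaults to CREATE + TRUNCATE_EXISTING + WRITE, i.e. it silently overwrites an existing file. Passing StandardOpenOption.CREATE_NEW gives the fail-if-exists semantics the reviewer expects from create(). A minimal sketch of the two behaviors (not the PR's actual code):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CreateSemantics {
  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("create-semantics", ".bin");

    // Default options (CREATE + TRUNCATE_EXISTING): silently overwrites.
    try (OutputStream out = Files.newOutputStream(tmp)) {
      out.write(42);
    }

    // CREATE_NEW: fails atomically when the file already exists,
    // matching the expected create() behavior.
    try (OutputStream out =
        Files.newOutputStream(tmp, StandardOpenOption.CREATE_NEW)) {
      out.write(42);
    } catch (FileAlreadyExistsException e) {
      System.out.println("create refused to overwrite " + tmp.getFileName());
    }

    Files.delete(tmp);
  }
}
```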
parquet-common/src/main/java/org/apache/parquet/io/LocalOutputFile.java
Outdated
Show resolved
Hide resolved
…ile.java Co-authored-by: Gang Wu <ustcwg@gmail.com>
…File.java Co-authored-by: Gang Wu <ustcwg@gmail.com>
Thanks @amousavigourabi! Comments are mostly for testing.
@@ -117,19 +120,52 @@ public void testEmptyArray() throws Exception {
  }
}

@Test
public void testEmptyArrayLocal() throws Exception {
This duplicates testEmptyArray(). Can we parameterize them to share common code?
Good idea, done
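The sharing pattern suggested here can be sketched without a test framework: one shared check body that is run against each implementation variant. In the real PR this would be a JUnit parameterized test over the Hadoop-Path and java.nio.file.Path code paths; everything below (names, the byte-array stand-in for the Parquet fixture) is illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;

public class SharedTestSketch {

  // The shared body: runs the same checks against any way of producing
  // an output file, so the "local" and "Hadoop" variants don't duplicate it.
  static void runEmptyArrayCheck(Function<byte[], Path> writerUnderTest)
      throws IOException {
    Path written = writerUnderTest.apply(new byte[0]);
    // the "empty array" fixture should still produce a readable file
    if (Files.size(written) != 0) {
      throw new AssertionError("expected empty file, got " + Files.size(written));
    }
    Files.deleteIfExists(written);
  }

  // One variant: plain java.nio, standing in for the LocalOutputFile path.
  static Path writeViaNio(byte[] data) {
    try {
      Path tmp = Files.createTempFile("shared-test", ".bin");
      Files.write(tmp, data);
      return tmp;
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    // One shared body, many "parameters" -- here just the one stand-in.
    List<Function<byte[], Path>> variants = List.of(SharedTestSketch::writeViaNio);
    for (Function<byte[], Path> impl : variants) {
      runEmptyArrayCheck(impl);
    }
    System.out.println("all variants passed");
  }
}
```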
@@ -145,6 +181,39 @@ public void testEmptyMap() throws Exception {
  }
}

@Test
public void testEmptyMapLocal() throws Exception {
ditto
public class TestLocalInputOutput {

  Path pathFileExists = Paths.get("src/test/resources/disk_output_file_create_overwrite.parquet");
It seems that we don't need to add a test file. What about using createTempFile() to create a path in your tests here?
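The suggested replacement for the checked-in resource can be sketched like this; the helper name and prefix are illustrative, not the PR's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileFixture {

  // Instead of committing a file under src/test/resources, create (and
  // clean up) a throwaway file per test run.
  public static Path newExistingFile() throws IOException {
    Path tmp = Files.createTempFile("output_file_create_overwrite", ".parquet");
    tmp.toFile().deleteOnExit(); // safety net if a test forgets to delete it
    return tmp;
  }

  public static void main(String[] args) throws IOException {
    Path pathFileExists = newExistingFile();
    System.out.println(Files.exists(pathFileExists)); // prints true
    Files.delete(pathFileExists);
  }
}
```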
@amousavigourabi Will you have any update on this?
In crunch mode atm so it took a bit longer, but everything has been addressed now.
LGTM. Thanks!
@gszadovszky Do you want to take another pass?
Hi @wgtmac, I'm new to exploring and parsing Parquet in Java. I've been trying the sample code here, but I can't make it work because LocalInputFile is not yet in the current published version of https://mvnrepository.com/artifact/org.apache.parquet/parquet-common. This PR seems to be merged to master already though; any reason why I am not seeing the changes in the jar pushed to Maven? Thanks for the help. :)
@eyeyar03 We haven't released the next major version 1.14.0 yet, which is why you cannot see it there. Usually we don't backport new features to a minor release, so the next minor version 1.13.2 will not have it either. I am not sure what the best time for the next major release is. Could you please advise? @gszadovszky @shangxinli
@wgtmac, in terms of semantic versioning we usually do minor releases (e.g. …
@gszadovszky @wgtmac if the next minor release is still far away (1.12.0 and 1.13.0 had over two years between them, with 1.13.0 released a few months ago), I wouldn't mind hosting the two relevant implementations in their own little Maven artifact in the meantime, as there does seem to be some demand.
@amousavigourabi, I would suggest joining the mailing list dev@parquet.apache.org and starting a discussion about a potential minor release in the near future.
Thanks @wgtmac. This is noted. Guess we'll have to find an alternative solution for now while waiting for the next major release.
Our project needs this feature as well; is there a date for the next major release?
@drealeed if you just need to be able to drop the Hadoop Path dependency, you might want to consider copying the InputFile and OutputFile implementations from this pull request before the next release is out. If you need to fully drop Hadoop, this is still being worked on.
@amousavigourabi, that's actually what I did and it's working for us now. Thanks
I'm a big fan of avoiding hadoop dependencies but this change did cause an issue for us. Instantiating the …
Hi @chadselph, thanks for reporting the issue. Did you try to debug what the root cause is? Does this relate to compression? This should be unexpected. cc @amousavigourabi
Hi @chadselph, thanks for the report. Like @wgtmac said, this should be unexpected. If you could set up an example of where it shows this behaviour, that'd be a good start for us to address the issue.
@amousavigourabi @wgtmac sorry for the delay, here's an example:

import java.io.IOException;
import java.nio.file.Files;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.LocalOutputFile;

public class UseMemory {
  public static void main(String[] args) throws IOException {
    var schema =
        SchemaBuilder.builder()
            .record("Record")
            .fields()
            .nullableBoolean("maybe", false)
            .endRecord();
    var memStart = Runtime.getRuntime().freeMemory();
    var temp = Files.createTempFile("parquet-mem", ".parquet");
    Files.delete(temp.toAbsolutePath());
    var writer =
        AvroParquetWriter.<GenericData.Record>builder(new LocalOutputFile(temp.toAbsolutePath()))
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withSchema(schema)
            .build();
    System.out.println(
        "Used " + (memStart - Runtime.getRuntime().freeMemory()) / (1024 * 1024) + " MBytes.");
    writer.close();
    Files.delete(temp.toAbsolutePath());
  }
}

I realize this is a crude way to display memory usage, but in case you're in doubt, you can run it in IntelliJ in the debugger and calculate the retained size of the objects.
Thanks for the info, @chadselph! Do you have time to take an initial look, @amousavigourabi?