
PARQUET-2134: Fix type checking in HadoopStreams.wrap #951

Merged 1 commit into apache:master on Jul 24, 2022

Conversation

@7c00 (Contributor) commented Mar 9, 2022

Make sure you have checked all steps below.

Jira

Tests

  • does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@7c00 (Author) commented Mar 9, 2022

Related issue: prestodb/presto#17435

private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
  InputStream wrapped = stream.getWrappedStream();
  if (wrapped instanceof FSDataInputStream) {
    return isWrappedStreamByteBufferReadable((FSDataInputStream) wrapped);

Contributor commented on the diff:

Is there a corner case that can cause an infinite loop?

@7c00 (Author) replied Mar 16, 2022:

Yes, it could happen in principle, but it would be hard to construct such a case. As its code shows, FSDataInputStream is a wrapper around another InputStream. When we check the wrapped stream recursively, we eventually reach a stream whose type is not FSDataInputStream. A developer could override getWrappedStream to return this and cause an infinite loop, but doing so would make no sense.

Contributor replied:

I understand it would be a very rare case, but once it happens it would be hard to debug the 'hang'. Let's do two things: 1) check whether the wrapped stream is 'this', and throw an exception if it is; 2) add a debug log, so that when it hangs a developer can enable debug logging and see what parquet-mr is doing (see the sketch below).
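
A minimal sketch of the guarded check these two suggestions describe, assuming an SLF4J LOG field on HadoopStreams and using a plain instanceof in place of the class's reflection-based test (illustrative, not the exact merged code):

private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
  InputStream wrapped = stream.getWrappedStream();
  if (wrapped == stream) {
    // Guard against a pathological getWrappedStream() returning `this`,
    // which would otherwise recurse forever.
    throw new ParquetDecodingException("Illegal FSDataInputStream wraps itself");
  }
  LOG.debug("Checking {} for ByteBufferReadable", wrapped.getClass().getName());
  if (wrapped instanceof FSDataInputStream) {
    return isWrappedStreamByteBufferReadable((FSDataInputStream) wrapped);
  }
  return wrapped instanceof ByteBufferReadable;
}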

@7c00 (Author) replied:

Good suggestions! I have updated the PR accordingly.

@shangxinli (Contributor) commented:

Thanks for adding the check and debug log. LGTM! One more thing (sorry for not asking in the first-round review): do you think it makes sense to add tests?

@shangxinli (Contributor) commented:

@7c00 Do you have time to look into the last feedback?

@7c00 (Author) commented May 11, 2022

> @7c00 Do you have time to look into the last feedback?

@shangxinli Thanks for your comments, and sorry for the late reply. I think it's fine to add some unit tests; I plan to do that this weekend and will ping you when finished. Thank you in advance.

@steveloughran (Contributor) left a comment:

This would be a lot easier if HadoopStreams didn't use reflection to get at ByteBufferReadable. That's an API which came with Hadoop 2.0.2, so everything has it.
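
For illustration, the non-reflective version being suggested might look like this sketch (H1SeekableInputStream and H2SeekableInputStream are parquet-mr's existing wrappers; the method body is simplified, not the real HadoopStreams.wrap):

public static SeekableInputStream wrap(FSDataInputStream stream) {
  Objects.requireNonNull(stream, "Cannot wrap a null input stream");
  // ByteBufferReadable has shipped since Hadoop 2.0.2, so a plain
  // instanceof test could replace the Class.forName/reflection lookup.
  if (stream.getWrappedStream() instanceof ByteBufferReadable) {
    return new H2SeekableInputStream(stream);
  }
  return new H1SeekableInputStream(stream);
}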

@@ -51,7 +52,7 @@ public class HadoopStreams {
   public static SeekableInputStream wrap(FSDataInputStream stream) {
     Objects.requireNonNull(stream, "Cannot wrap a null input stream");
     if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
-        byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
+        isWrappedStreamByteBufferReadable(stream)) {

Contributor commented:

This is really going into the internals of the Hadoop classes, and it is potentially tricky if there is any dynamic decision making in the inner class. The good news is that I don't see anything doing that.

There is a way (Hadoop 3.2+) to ask whether a stream supports the API before calling it, using the StreamCapabilities interface:
https://issues.apache.org/jira/browse/HDFS-14111

if (stream.hasCapability("in:readbytebuffer")) {
  // stream is confident it has the API
} else {
  // fall back to checking the class of the wrapped stream
}

private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
  InputStream wrapped = stream.getWrappedStream();
  if (wrapped == stream) {
    throw new ParquetDecodingException("Illegal FSDataInputStream as wrapped itself");

Contributor commented:

This can't happen: the inner stream is set in the constructor, so it cannot take the not-yet-constructed class as an argument. No need to worry about recursion.

steveloughran added a commit to steveloughran/parquet-mr that referenced this pull request Jun 7, 2022
This extends apache#951

Since [HDFS-14111](https://issues.apache.org/jira/browse/HDFS-14111) all
input streams in the hadoop codebase which implement `ByteBufferReadable`
return true on the StreamCapabilities probe
`stream.hasCapability("in:readbytebuffer")`;
those which don't are forbidden to do so.

This means that on Hadoop 3.3.0+ the preferred way to probe for the API
is to ask the stream.

The StreamCapabilities probe was added in Hadoop 2.9. Along with
making all use of `ByteBufferReadable` non-reflective, this makes
the checks fairly straightforward.

Tests verify that if a stream implements `ByteBufferReadable` then
it will be bonded to H2SeekableInputStream, even if multiply wrapped
by FSDataInputStreams, and that if it doesn't, it won't.
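
Putting the pieces together, the check this commit message describes could look roughly like the following sketch (not the exact merged code; it assumes Hadoop 2.9+ at compile time so hasCapability is available, and uses a plain instanceof for the fallback):

private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
  // Hadoop 3.3.0+: the stream itself answers the probe, which is
  // passed down through any wrapped streams.
  if (stream.hasCapability("in:readbytebuffer")) {
    return true;
  }
  // Older releases: unwrap manually until a non-FSDataInputStream is found.
  InputStream wrapped = stream.getWrappedStream();
  if (wrapped instanceof FSDataInputStream) {
    return isWrappedStreamByteBufferReadable((FSDataInputStream) wrapped);
  }
  return wrapped instanceof ByteBufferReadable;
}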
@steveloughran (Contributor) commented:

I've taken this PR and added the changes I was suggesting, plus tests; see #971. If you take that extra commit and merge it in here, it should complete this PR.

@shangxinli (Contributor) commented:

> I've taken this PR and added the changes I was suggesting, plus tests; see #971. If you take that extra commit and merge it in here, it should complete this PR.

@7c00 Are you OK with that, since you originally created this PR?

7c00 pushed a commit to 7c00/parquet-mr that referenced this pull request Jun 13, 2022
@7c00 (Author) commented Jun 13, 2022

Thanks @steveloughran @shangxinli. I have cherry-picked the commit from #971.

Do I need to squash the two commits into one?

@steveloughran (Contributor) commented Jun 13, 2022

Whoever actually commits this can use the GitHub squash option to combine all commits into one before merging.

FYI, I've just started writing a shim library so that apps compiling against hadoop 3.2.0 will be able to invoke the 3.3+ API calls when present: HADOOP-18287.

First, parquet will need to be able to compile/link against hadoop 3.x: #976

 * @return true if it is safe to use a H2SeekableInputStream to access the data
 */
private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
  if (stream.hasCapability("in:readbytebuffer")) {

Contributor commented:

We don't have Hadoop 3.3.0 yet in Parquet. Does that mean we need to hold off on this PR?

Contributor replied:

No, the StreamCapabilities probe has been around since Hadoop 2. It is just that in 3.3.0 all streams which implement the API return true for this probe, a probe which gets passed down through the wrapped streams. It avoids looking at the wrapped streams, as you should be able to trust the response (put differently: if something lied, it is in trouble).

Member commented:

@steveloughran @shangxinli it looks like the API is not available in Hadoop 2.8.x, so it will create issues for projects that want to use the latest version of Parquet but still want to keep Hadoop 2.8.x.
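
One way to keep the probe safe on Hadoop 2.8.x (a sketch, not from this PR: hasCapability only exists from Hadoop 2.9, so a direct call could fail at runtime against 2.8 jars) is to locate the method reflectively, in the same spirit as HadoopStreams' existing reflection:

// Hypothetical helper; needs java.lang.reflect.Method.
private static boolean hasByteBufferReadCapability(FSDataInputStream stream) {
  try {
    // FSDataInputStream.hasCapability(String) exists from Hadoop 2.9 onwards.
    Method hasCapability = stream.getClass().getMethod("hasCapability", String.class);
    return (Boolean) hasCapability.invoke(stream, "in:readbytebuffer");
  } catch (NoSuchMethodException e) {
    return false; // pre-2.9 Hadoop: caller falls back to unwrapping the stream
  } catch (ReflectiveOperationException e) {
    return false;
  }
}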

@shangxinli (Contributor) commented:

@7c00 and @steveloughran, thank you both for the great contribution! This PR comes from two authors. Can @7c00 add @steveloughran as a co-author to this PR? This is an example.

HadoopStreams.wrap produces a wrong H2SeekableInputStream if the
passed-in FSDataInputStream wraps another FSDataInputStream.

Since [HDFS-14111](https://issues.apache.org/jira/browse/HDFS-14111) all
input streams in the hadoop codebase which implement `ByteBufferReadable`
return true on the StreamCapabilities probe
`stream.hasCapability("in:readbytebuffer")`;
those which don't are forbidden to do so.

This means that on Hadoop 3.3.0+ the preferred way to probe for the API
is to ask the stream.

The StreamCapabilities probe was added in Hadoop 2.9. Along with
making all use of `ByteBufferReadable` non-reflective, this makes
the checks fairly straightforward.

Tests verify that if a stream implements `ByteBufferReadable` then
it will be bonded to H2SeekableInputStream, even if multiply wrapped
by FSDataInputStreams, and that if it doesn't, it won't.

Co-authored-by: Steve Loughran <stevel@cloudera.com>

@7c00 (Author) commented Jul 4, 2022

@shangxinli Thank you for reminding me. I have squashed the PR and added @steveloughran as the co-author.

@steveloughran (Contributor) commented:

Thanks. Created HADOOP-18336: tag FSDataInputStream.getWrappedStream() @Public/@Stable, to make sure that Hadoop code knows external libraries may be calling the method.

@shangxinli (Contributor) commented:

LGTM

@shangxinli merged commit 3ed2dbb into apache:master on Jul 24, 2022