
Parquet-400: Fixed issue reading some files from HDFS and S3 when usi… #306

Closed
wants to merge 5 commits

Conversation

jaltekruse
Contributor

…ng Hadoop 2.x

The problem was not handling the case where a read request returns fewer
bytes than requested. FSDataInputStream lacks an equivalent of readFully
for ByteBuffers; readFully used to solve this problem when reading into
byte arrays. This has been fixed by adding a loop that keeps requesting
the remaining bytes until everything has been read.
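A minimal sketch of the kind of loop described here, assuming a Hadoop 2.x FSDataInputStream whose read(ByteBuffer) may return fewer bytes than requested (the class and method names below are illustrative, not the exact code in this PR):

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

public class ByteBufferReadLoop {
  /**
   * Reads until the buffer is full, looping because a single
   * read(ByteBuffer) call is allowed to fill only part of the buffer.
   */
  public static void readFully(FSDataInputStream in, ByteBuffer buf) throws IOException {
    while (buf.hasRemaining()) {
      int bytesRead = in.read(buf); // may return fewer bytes than buf.remaining()
      if (bytesRead < 0) {
        throw new EOFException(
            "Reached end of stream with " + buf.remaining() + " bytes left to read");
      }
    }
  }
}
```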

@danielcweeks

@jaltekruse I verified that this fixes the issue. However, I'm also seeing a very significant performance impact on the initial loading of row group data. I haven't done exhaustive testing, but simply using parquet-tools cat, the load time appears to have gone from ~1 second (commit 5a45ae3) to ~45 seconds against a file with a single row group and about 75MB of compressed data.

I think we need to explore this a little more to make sure we aren't regressing.

@danielcweeks

@jaltekruse I just added some logging, and what I suspected seems to be accurate: the while loop is thrashing for that 45-second period while reading the data.

@jaltekruse
Contributor Author

Hmmm, I believe the loop should be a valid way of recreating the old readFully() functionality. It looks like the S3 filesystem implementation does not handle repeated calls like this well?

For now I can detect the type of the filesystem and just use the old method, wrapping the data in a ByteBuffer after it has been read into a byte array.

Does that seem like a reasonable idea? @danielcweeks

@rdblue
Contributor

rdblue commented Jan 12, 2016

This looks fine to me, but is it possible to add some tests? Maybe a FileSystem wrapper that returns an FSDataInputStream variant that under-fills the buffer at random?
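One way such a test wrapper might look, as a rough sketch: an FSDataInputStream that caps each read(ByteBuffer) at a random size so the calling loop has to cope with short reads. The class name is hypothetical, and this only exercises the new API when the wrapped stream itself supports read(ByteBuffer):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Random;

import org.apache.hadoop.fs.FSDataInputStream;

/** Test helper that deliberately under-fills ByteBuffer reads. */
public class UnderFillingInputStream extends FSDataInputStream {
  private final Random random = new Random();

  public UnderFillingInputStream(FSDataInputStream wrapped) throws IOException {
    super(wrapped);
  }

  @Override
  public int read(ByteBuffer buf) throws IOException {
    if (!buf.hasRemaining()) {
      return 0;
    }
    // shrink the visible window to a random slice of the remaining space
    int cap = 1 + random.nextInt(buf.remaining());
    int oldLimit = buf.limit();
    buf.limit(buf.position() + cap);
    try {
      return super.read(buf); // delegates to the wrapped stream's read(ByteBuffer)
    } finally {
      buf.limit(oldLimit);
    }
  }
}
```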

@danielcweeks

@jaltekruse @rdblue Let me verify against HDFS and the S3A filesystem. If we can reproduce with either of those, it will be easier to diagnose.

@danielcweeks

Taking a closer look at FSInputStream behavior for readFully()->read(): this might be due to unusual seek behavior. If we're repeatedly calling readFully and the buffer is under-filled, then each successive call will result in a seek call, which is expensive for S3. I'll need to verify, but this might be the issue.

@jaltekruse
Contributor Author

@danielcweeks @rdblue Daniel has provided a workaround for the S3 issue, I haven't had time to write a unit test, but he has manually verified that the performance is fixed with S3. I can file a follow-up JIRA for adding a test if you are okay merging this as is.

@rdblue
Contributor

rdblue commented Feb 10, 2016

@jaltekruse, we just had another look at this problem and it isn't actually seeking. We tracked the problem to allocation. The underlying InputStream implementation for S3 doesn't expose read(ByteBuffer), so when we try to call it, this falls back to the old API, allocates a buffer for the read, and copies what is read into the ByteBuffer that was passed in. In the test case, that allocation is 75MB and the loop-based fix posted here keeps allocating a 75MB buffer (or progressively smaller) for each call to read. Those allocations and the GC runs that result are what take 50 seconds.

The immediate fix is to use readFully for the fall-back case, and to use hasArray to avoid needing to allocate and copy if the buffer is backed by an on-heap array.
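A rough sketch of what that immediate fix could look like, assuming the fallback path still uses the byte[]-based readFully (the helper name is made up for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

public class FallbackReads {
  /**
   * Fallback for streams that don't support read(ByteBuffer): use the
   * byte[]-based readFully, writing straight into the buffer's backing
   * array when it has one so no temporary allocation or copy is needed.
   */
  public static void readFullyCompat(FSDataInputStream in, ByteBuffer buf) throws IOException {
    if (buf.hasArray()) {
      // on-heap buffer: read directly into its backing array
      in.readFully(buf.array(), buf.arrayOffset() + buf.position(), buf.remaining());
      buf.position(buf.limit());
    } else {
      // direct buffer: one temporary array, then a single bulk put
      byte[] tmp = new byte[buf.remaining()];
      in.readFully(tmp, 0, tmp.length);
      buf.put(tmp);
    }
  }
}
```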

This has exposed a problem in the getBuf method's contract. The use of CompatibilityUtil.getBuf here replaced a call to readFully. This is backed by a call to read(ByteBuffer) that doesn't guarantee the entire buffer will be read. I think we need to fix this method and document its guarantees. We should also make sure that calls to readFully before the byte buffer patch landed are getting full buffers as expected.

Also, the maxSize argument is ignored for the byte buffer case. If the value passed in isn't equal to the buffer's remaining size, then you get different behavior depending on the file stream and Hadoop version at runtime. I think that argument should be removed and it should always guarantee the same behavior as read(ByteBuffer), which is documented.

Last, this also exposes a problem with the fallback logic. If the read(ByteBuffer) method is missing for any implementation, the compatibility utils never try it again. If a JVM reads a Parquet file from S3, it looks like it will no longer use the new API for HDFS. I think a wrapper class that provides this method using the old API would be a good solution.
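One shape such a wrapper class might take, as a hedged sketch rather than the agreed design (the class name is hypothetical): a stream that exposes read(ByteBuffer) by emulating it with the old byte[] API, so callers keep a single code path and a failure on one filesystem never disables the new API elsewhere.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

/** Exposes read(ByteBuffer) on top of a stream that only has byte[] reads. */
public class ByteBufferCompatStream extends FSDataInputStream {

  public ByteBufferCompatStream(FSDataInputStream wrapped) throws IOException {
    super(wrapped);
  }

  @Override
  public int read(ByteBuffer buf) throws IOException {
    // single-call semantics of read(ByteBuffer): read at most buf.remaining()
    // bytes in one pass through the old byte[] API
    int len = buf.remaining();
    if (buf.hasArray()) {
      int bytesRead = read(buf.array(), buf.arrayOffset() + buf.position(), len);
      if (bytesRead > 0) {
        buf.position(buf.position() + bytesRead);
      }
      return bytesRead;
    }
    byte[] tmp = new byte[len];
    int bytesRead = read(tmp, 0, len);
    if (bytesRead > 0) {
      buf.put(tmp, 0, bytesRead);
    }
    return bytesRead;
  }
}
```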

@julienledem
Member

Thanks @rdblue
@jaltekruse does this sound good to you?

@@ -96,6 +96,10 @@

public static String PARQUET_READ_PARALLELISM = "parquet.metadata.read.parallelism";

//URI Schemes to blacklist for bytebuffer read.
public static final String PARQUET_BYTEBUFFER_BLACKLIST = "parquet.bytebuffer.fs.blacklist";
public static final String[] PARQUET_BYTEBUFFER_BLACKLIST_DEFAULT = {"s3", "s3n", "s3a"};
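For context, a rough sketch of how a scheme blacklist like this might be consulted when deciding whether to use the ByteBuffer read path; the helper is hypothetical, and as the review comments below note, the blacklist approach was ultimately dropped once the underlying bug was found:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ByteBufferBlacklist {
  // hypothetical helper: skip the ByteBuffer read path for blacklisted schemes
  static boolean useByteBufferRead(FileSystem fs, Configuration conf) {
    String[] blacklist = conf.getStrings("parquet.bytebuffer.fs.blacklist", "s3", "s3n", "s3a");
    String scheme = fs.getUri().getScheme();
    for (String blocked : blacklist) {
      if (blocked.equalsIgnoreCase(scheme)) {
        return false; // fall back to the old byte[] read path
      }
    }
    return true;
  }
}
```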

If we're going with the blacklist approach, we should also handle pure HDFS filesystems as well, right?

Contributor

We found the underlying bug, so I don't think the plan is to blacklist filesystems anymore.


Ah right saw your other comment :-)

@piyushnarang

Ping @jaltekruse. We have a few changes in master we've been wanting to test out but we're running into this issue. Have you had a chance to rework based on Ryan / Daniel's comments?

@jaltekruse
Contributor Author

Hey @piyushnarang, sorry this has been outstanding for so long. I have not had a chance to work on this since my last update. If you would be willing to take a look I can answer any questions and review your changes.

Conflicts:
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/CompatibilityUtil.java
@piyushnarang

hey @jaltekruse sure no worries. I'll try and take a stab over the next few days and reach out if I have any questions and such. Will include you on the PR as well.

@jaltekruse
Contributor Author

@piyushnarang I looked back at my branch and saw I had a few changes I had not pushed. This doesn't address Ryan's last comment about how one instance of FileSystem that lacks the API prevents all reads from going through the new API for all instances. We discussed this after one of the hangouts; to make it efficient, we probably need to store our knowledge of which filesystems/input streams support the API in a cache. This will ensure we aren't relying on exceptions to test for a valid implementation of the API on every read, and also avoid the current problems with storing this information in a global static.

@piyushnarang

Thanks @jaltekruse, I'll start taking a look.

@piyushnarang

@jaltekruse - can you grant me write perms to your branch? I'll push my updates so that they show up in this PR (can also spin up a new one if needed)

I guess I'm not entirely clear on the approach we'd like to take to ensure that CompatUtils tries again if the first attempt fails (cc @rdblue ):

  1. Refactoring the code in CompatUtils to not be static and creating a CompatUtils per ParquetFileReader will ensure that even if we're on the same JVM, we first try the byteBuffer call for a file and fall back otherwise. The problem, though, is that in a setup where the byteBuffer call is never supported, we end up with an extra reflection call to read(byteBuffer) for each file. We could also provide a config flag for this so that users don't pay the perf hit when they know they don't support it.
  2. Cache approach - I think caching something like FSDataInputStream.class -> isV2Supported might not cut it. Some of the FSDataInputStream classes just delegate to the underlying inputStream, so we need to be able to see if that inner inputStream supports read(byteBuffer). Things get messier if the inner inputStream also delegates. (Not sure if I'm missing something here; a rough sketch of this idea follows below.)
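A minimal sketch of the cache idea under those caveats, keying on the wrapped stream's class (one level of delegation only) so an unsupported stream type stops paying for the exception probe without disabling the new API for other stream types; all names here are illustrative:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.fs.FSDataInputStream;

public class ByteBufferSupportCache {
  // remembers, per wrapped-stream class, whether read(ByteBuffer) worked
  private static final ConcurrentMap<Class<?>, Boolean> SUPPORT =
      new ConcurrentHashMap<Class<?>, Boolean>();

  public static int read(FSDataInputStream in, ByteBuffer buf) throws IOException {
    // FSDataInputStream delegates, so key on the underlying stream's class
    Class<?> key = in.getWrappedStream().getClass();
    Boolean supported = SUPPORT.get(key);
    if (supported == null || supported) {
      try {
        int bytesRead = in.read(buf);
        SUPPORT.putIfAbsent(key, Boolean.TRUE);
        return bytesRead;
      } catch (UnsupportedOperationException e) {
        SUPPORT.put(key, Boolean.FALSE);
      }
    }
    // old byte[]-based read for stream classes without ByteBuffer support
    byte[] tmp = new byte[buf.remaining()];
    int bytesRead = in.read(tmp, 0, tmp.length);
    if (bytesRead > 0) {
      buf.put(tmp, 0, bytesRead);
    }
    return bytesRead;
  }
}
```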

@julienledem
Member

@jaltekruse @piyushnarang do we still need this branch?

@piyushnarang

No I don't need it anymore.

@julienledem
Member

@jaltekruse can you close it if it's not needed anymore?
