Parquet-400: Fixed issue reading some files from HDFS and S3 when using Hadoop 2.x #306
Conversation
The problem was not handling the case where a read request returns fewer than the requested number of bytes. FSDataInputStream lacks an API equivalent to readFully when using ByteBuffers; readFully used to solve this problem when byte arrays were the destination. This has been fixed by including a loop that manually requests the remaining bytes until everything has been read.
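For context, a minimal sketch of the kind of loop the commit describes, assuming a Hadoop FSDataInputStream source (the class and method names here are illustrative, not the PR's actual code):

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

class ReadFullyCompat {
  // Emulate readFully for a ByteBuffer by looping, since a single
  // read(ByteBuffer) call may return fewer bytes than requested.
  static void readFully(FSDataInputStream in, ByteBuffer buf) throws IOException {
    while (buf.hasRemaining()) {
      int bytesRead = in.read(buf); // advances buf.position() by bytesRead
      if (bytesRead < 0) {
        throw new EOFException("stream ended before the buffer was filled");
      }
    }
  }
}
```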
@jaltekruse I verified that this fixes the issue. However, I'm also seeing a very significant performance impact on the initial loading of row group data. I haven't done exhaustive testing, but simply using parquet-tools cat, the load time appears to have grown from ~1 second (commit 5a45ae3) to ~45 seconds against a file with a single row group and about 75MB of compressed data. I think we need to explore this a little more to make sure we aren't regressing.
@jaltekruse I just added some logging, and what I suspected seems to be accurate: that while loop is thrashing for the whole 45-second period while reading the data.
Hmmm, I believe the loop should be a valid way of recreating the old readFully() functionality. It looks like the S3 filesystem implementation does not handle repeated calls like this well. For now I can detect the type of the filesystem and just use the old method, wrapping the data in a ByteBuffer after it has been read into a byte array (see the sketch below). Does that seem like a reasonable idea? @danielcweeks
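A minimal sketch of that fallback (names are illustrative): read with the byte-array readFully, which already handles short reads internally, then wrap the result so callers still get a ByteBuffer:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

class ByteArrayFallback {
  // Fallback path: readFully(byte[]) loops over short reads for us,
  // so read into an array and hand callers a ByteBuffer view of it.
  static ByteBuffer readFully(FSDataInputStream in, int length) throws IOException {
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return ByteBuffer.wrap(bytes);
  }
}
```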
This looks fine to me, but is it possible to add some tests? Maybe a FileSystem wrapper that returns an FSDataInputStream variant that will under-fill the buffer at random?
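One possible shape for such a test double (entirely hypothetical; a real FSDataInputStream wrapper would also need to implement Seekable and PositionedReadable): a stream that truncates every read to a random length:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;

// Hypothetical test helper: delegates reads but caps each one at a random
// length, so code under test must cope with under-filled buffers.
class UnderFillingInputStream extends InputStream {
  private final InputStream delegate;
  private final Random random = new Random();

  UnderFillingInputStream(InputStream delegate) {
    this.delegate = delegate;
  }

  @Override
  public int read() throws IOException {
    return delegate.read();
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    if (len == 0) {
      return 0;
    }
    // Serve at most a random, non-zero slice of the requested range.
    int capped = 1 + random.nextInt(len);
    return delegate.read(b, off, capped);
  }
}
```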
@jaltekruse @rdblue Let me verify against HDFS and the S3A filesystem. If we can reproduce with either of those, it will be easier to diagnose.
Taking a closer look at FSInputStream behavior for readFully() -> read(): this might be due to unusual seek behavior. If we're repeatedly calling readFully and the buffer is under-filled, then each successive call will result in a seek call, which is expensive for S3. I'll need to verify, but this might be the issue.
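A simplified illustration of the behavior Daniel describes (not Hadoop's actual code): if the fill loop re-seeks before each retry, every under-filled read costs another seek, and on S3 a seek can mean tearing down and reopening the connection:

```java
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;

class SeekingReadLoop {
  // Simplified sketch: a positioned fill loop that seeks before each retry,
  // so every short read triggers another (S3-expensive) seek.
  static void readFullyAt(FSDataInputStream in, long pos, byte[] buf) throws IOException {
    int done = 0;
    while (done < buf.length) {
      in.seek(pos + done); // expensive on S3
      int n = in.read(buf, done, buf.length - done);
      if (n < 0) {
        throw new EOFException();
      }
      done += n;
    }
  }
}
```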
@danielcweeks @rdblue Daniel has provided a workaround for the S3 issue. I haven't had time to write a unit test, but he has manually verified that the performance is fixed with S3. I can file a follow-up JIRA for adding a test if you are okay merging this as-is.
@jaltekruse, we just had another look at this problem and it isn't actually seeking. We tracked the problem to allocation. The underlying […]. The immediate fix is to use […]. This has exposed a problem in the […]. Also, the […]. Last, this also exposes a problem with the fallback logic: if the first FileSystem instance checked lacks the ByteBuffer read API, the current fallback stops all reads, on all instances, from going through the new API.
Thanks @rdblue
@@ -96,6 +96,10 @@
   public static String PARQUET_READ_PARALLELISM = "parquet.metadata.read.parallelism";
 
+  // URI Schemes to blacklist for bytebuffer read.
+  public static final String PARQUET_BYTEBUFFER_BLACKLIST = "parquet.bytebuffer.fs.blacklist";
+  public static final String[] PARQUET_BYTEBUFFER_BLACKLIST_DEFAULT = {"s3", "s3n", "s3a"};
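For illustration, a hypothetical helper showing how such a blacklist could be consulted before choosing the ByteBuffer read path (the PR's actual wiring may differ):

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;

class ByteBufferReadCheck {
  // Hypothetical helper: skip the ByteBuffer read path for any filesystem
  // whose URI scheme appears on the configured blacklist.
  static boolean useByteBufferRead(Configuration conf, URI fsUri) {
    String[] schemes = conf.getStrings(
        "parquet.bytebuffer.fs.blacklist", "s3", "s3n", "s3a");
    Set<String> blacklist = new HashSet<String>(Arrays.asList(schemes));
    return !blacklist.contains(fsUri.getScheme());
  }
}
```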
If we're going with the blacklist approach, we should also handle pure HDFS filesystems as well, right?
We found the underlying bug, so I don't think the plan is to blacklist filesystems anymore.
Ah right, saw your other comment :-)
Ping @jaltekruse. We have a few changes in master we've been wanting to test out, but we're running into this issue. Have you had a chance to rework based on Ryan's / Daniel's comments?
Hey @piyushnarang, sorry this has been outstanding for so long. I have not had a chance to work on this since my last update. If you would be willing to take a look, I can answer any questions and review your changes.
Conflicts: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/CompatibilityUtil.java
Hey @jaltekruse, sure, no worries. I'll try and take a stab at it over the next few days and reach out if I have any questions. Will include you on the PR as well.
@piyushnarang I looked back at my branch and saw I had a few changes I had not pushed. This doesn't address Ryan's last comment, that one FileSystem instance lacking the API prevents all reads, on all instances, from going through the new API. We discussed this after one of the hangouts: to make it efficient, we probably need to store our knowledge of which filesystems/input streams support the API in a cache (a sketch follows below). This will ensure we aren't relying on exceptions to test for a valid implementation of the API on every read, and will also avoid the current problems with storing this information in a global static.
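A sketch of the kind of cache described above (names and structure are assumptions, not the branch's actual code), keyed by the underlying stream class rather than a single global flag:

```java
import java.io.InputStream;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.fs.FSDataInputStream;

class ByteBufferReadSupport {
  // Hypothetical cache: remember per underlying stream class whether the
  // ByteBuffer read API worked, so only the first read of each stream type
  // has to rely on catching UnsupportedOperationException, and the answer
  // is never stored in one global flag shared by all filesystems.
  private static final ConcurrentHashMap<Class<?>, Boolean> SUPPORTED =
      new ConcurrentHashMap<Class<?>, Boolean>();

  static Boolean cachedSupport(FSDataInputStream in) {
    InputStream wrapped = in.getWrappedStream();
    return SUPPORTED.get(wrapped.getClass()); // null means "not yet known"
  }

  static void recordSupport(FSDataInputStream in, boolean supported) {
    SUPPORTED.put(in.getWrappedStream().getClass(), supported);
  }
}
```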
Thanks @jaltekruse, I'll start taking a look.
@jaltekruse - can you grant me write perms to your branch? I'll push my updates so that they show up in this PR (I can also spin up a new one if needed). I guess I'm not entirely clear on the approach we'd like to take to ensure that CompatUtils tries again if the first attempt fails (cc @rdblue):
@jaltekruse @piyushnarang do we still need this branch?
No, I don't need it anymore.
@jaltekruse can you close it if it's not needed anymore?