-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23457][SQL] Register task completion listeners first in ParquetFileFormat #20619
Conversation
Hi, @cloud-fan and @gatorsmile . |
|
||
// UnsafeRowParquetRecordReader appends the columns internally to avoid another copy. | ||
if (parquetReader.isInstanceOf[VectorizedParquetRecordReader] && | ||
enableVectorizedReader) { | ||
if (enableVectorizedReader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to merge this if-statement into the above if-statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. It looks possible. I'll update together after getting more reviews. Thanks, @kiszk .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea it seems more reasonable to merge this if-else now.
} | ||
|
||
val iter = new RecordReaderIterator(parquetReader) | ||
taskContext.foreach(_.addTaskCompletionListener(_ => iter.close())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the reported leakage, this is too late.
Test build #87482 has finished for PR 20619 at commit
|
It looks good to me that we move the registrations to the new (earlier) places. |
val vectorizedReader = new VectorizedParquetRecordReader( | ||
convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity) | ||
val recordReaderIterator = new RecordReaderIterator(vectorizedReader) | ||
// Register a task completion lister before `initalization`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could new VectorizedParquetRecordReader
or new RecordReaderIterator
fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those constructors didn't look heavy to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
can we provide a manual test like the OOM one in your ORC PR? |
Yep. I'll try for this, too. @cloud-fan . |
The reproducible test case is added into PR description and the code is updated according to @kiszk and @cloud-fan 's comments. |
LGTM |
Thank you for last-minute review before your vacation. I'm lucky. :) |
He is already on vacation. : ) |
Test build #87516 has finished for PR 20619 at commit
|
Oh.. It was a review from a vacation. |
Retest this please. |
The failure is irrelevant to this PR.
|
Test build #87518 has finished for PR 20619 at commit
|
retest this please |
Thank you for retriggering, @gatorsmile . |
Test build #87520 has finished for PR 20619 at commit
|
Would it be worth to add this JIRA number in a comment as we did for ORC? |
retest this please |
Test build #87523 has finished for PR 20619 at commit
|
Umm, we still see the following exception in the log ...
|
Yep. @kiszk . @mgaido91 also reports that, so I'm investigating that more. However, that doesn't mean this approach is not proper. You can see the manual test case example in previous ORC-related PR and this PR. This approach definitely reduces the number of point of failures. For the remaining issue, I think we may need a different approach in a different code path. |
For the following, I'll create another one.
|
LGTM |
Thank you for review, @mgaido91 . |
LGTM with one minor comment |
Thank you, @kiszk . I added SPARK-23390 in the PR description.
|
Retest this please. |
Oh, @kiszk . The following meat really
|
val vectorizedReader = new VectorizedParquetRecordReader( | ||
convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity) | ||
val iter = new RecordReaderIterator(vectorizedReader) | ||
// SPARK-23457 Register a task completion lister before `initialization`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, SPARK-23457
is added.
Test build #87527 has finished for PR 20619 at commit
|
Test build #87528 has finished for PR 20619 at commit
|
The final failure is irrelevant to this.
|
Retest this please. |
Test build #87533 has finished for PR 20619 at commit
|
retest this please |
Test build #87534 has finished for PR 20619 at commit
|
retest this please |
Test build #87535 has finished for PR 20619 at commit
|
retest this please. |
Test build #87537 has finished for PR 20619 at commit
|
thanks, merging to master! |
Thank you all! |
Hi, @cloud-fan . |
Yea, please go ahead. |
Thank you, @cloud-fan ! |
What changes were proposed in this pull request?
ParquetFileFormat leaks opened files in some cases. This PR prevents that by registering task completion listers first before initialization.
How was this patch tested?
Manual. The following test case generates the same leakage.