[SPARK-37088][PYSPARK][SQL] Writer thread must not access input after task completion listener returns #34245
Conversation
Kubernetes integration test starting

This looks correct from the point of view of avoiding races. It looks like there were some concerns about deadlocks mentioned in #24699 (comment). Did you take a look at that? It's definitely easy to have subtle deadlocks with any kind of synchronisation in callbacks. I took a look myself and it looks OK to me, but I am new to the code so may be missing something: TaskContextImpl.markTask{Completed,Failed} both seem to drop the TaskContext lock before invoking the listeners.

Kubernetes integration test status failure

Hmm, thanks for pointing this out.

We can fix this by releasing the TaskContext lock before invoking the listeners. I'll update the PR with that change and try to write a test to repro the deadlock.
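To make the lock-ordering concern above concrete, here is a minimal Python sketch (hypothetical names, not Spark's actual code) of the pattern being discussed: snapshot the listeners while holding the lock, then invoke them after releasing it, so that a listener which calls back into `isCompleted()` cannot deadlock.

```python
import threading

class TaskContextSketch:
    """Hypothetical stand-in for TaskContextImpl (illustration, not Spark code)."""

    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, like a plain monitor
        self._completed = False
        self._listeners = []

    def add_task_completion_listener(self, fn):
        with self._lock:
            self._listeners.append(fn)

    def is_completed(self):
        # A listener (e.g. the writer thread's exception handler) may call
        # this while listeners are being invoked; it must not block forever.
        with self._lock:
            return self._completed

    def mark_task_completed(self):
        with self._lock:
            self._completed = True
            # Snapshot in reverse registration order while holding the lock...
            to_run = list(reversed(self._listeners))
        # ...then invoke OUTSIDE the lock. If we invoked while still holding
        # it, the is_completed() call below would deadlock on this same lock.
        for fn in to_run:
            fn()

ctx = TaskContextSketch()
seen = []
ctx.add_task_completion_listener(lambda: seen.append(ctx.is_completed()))
ctx.mark_task_completed()
assert seen == [True]
```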
Hm, seems like the test is hanging. Would you mind retriggering https://github.com/ankurdave/spark/runs/3862170994?

Test build #144095 has finished for PR 34245 at commit

Looks like the PySpark tests …

I updated the PR to release the TaskContextImpl lock before invoking the listeners.
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #144157 has finished for PR 34245 at commit
    onCompleteCallbacks += listener
    false
      }
    }
The API doesn't intend to guarantee any ordering of when the task completion listeners are called, AFAICT. I think that before this change, the implementation ended up guaranteeing that the listeners are called sequentially, so it seems possible that some code could be accidentally depending on that.
This might be overengineering it, but we could have a scheme that avoided the deadlock issues and guaranteed sequential execution of callbacks. You would have at most one single thread at any point in time responsible for invoking callbacks. If another thread needs to invoke a callback, it either delegates it to the current callback invocation thread, or it becomes the callback execution thread itself. This means that the callback invocation thread needs to first invoke all of the current registered callbacks, but when it's done with those, check to see if any more callbacks have been queued.
I think we could do that by having the callback invocation thread taking ownership of the current callbacks list, but after invoking those callbacks checking to see if any more have been queued. We'd also need a variable to track if there's a current callback execution thread.
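The delegation scheme described above can be sketched in a few lines of Python (hypothetical names, not the actual Spark implementation): at most one thread is the callback invoker at any time; other threads enqueue their callback and return, and the invoker keeps draining the queue until no more callbacks have been queued.

```python
import threading

class SequentialListenerInvoker:
    """Sketch: at most one thread invokes callbacks at any time; other
    threads enqueue their callback and delegate to the current invoker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []             # callbacks queued but not yet invoked
        self._invoker_active = False   # is some thread currently invoking?

    def invoke(self, fn):
        with self._lock:
            self._pending.append(fn)
            if self._invoker_active:
                return  # delegate to the current invoker thread
            self._invoker_active = True
        # This thread is now the invoker: drain the queue, then re-check
        # whether more callbacks were queued while we were invoking.
        while True:
            with self._lock:
                if not self._pending:
                    self._invoker_active = False
                    return
                batch, self._pending = self._pending, []
            for f in batch:
                f()  # invoked outside the lock, so reentrant calls can't deadlock

order = []
inv = SequentialListenerInvoker()
inv.invoke(lambda: order.append(1))
inv.invoke(lambda: order.append(2))
# A reentrant callback is queued and run by the same invoker thread,
# after the current callback finishes:
inv.invoke(lambda: (order.append(3), inv.invoke(lambda: order.append(4))))
assert order == [1, 2, 3, 4]
```

This preserves sequential execution without holding a lock during callback invocation, at the cost of a callback possibly running on a different thread than the one that registered it.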
Hmm, good point that we'd be changing the behavior of this API. It would be nice to preserve the sequential execution behavior, but it does seem pretty complex. I can try implementing it and see whether it's worth it.
Either way, we should probably document and test the behavior more thoroughly. In the current state of the PR, I think the guarantee is something like the following: "Two listeners registered in the same thread will be invoked in reverse order of registration if the task finishes after both are registered. There are no ordering guarantees for listeners registered in different threads, and they may execute concurrently."
> There are no ordering guarantees for listeners registered in different threads
I agree. When there are multiple threads I don't think we can define an "order".
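The same-thread part of the guarantee discussed above can be illustrated with a short sketch (hypothetical names): listeners registered by one thread run in reverse order of registration, so a parent operator's cleanup runs before its child's.

```python
# Listeners registered by a single thread run in reverse registration order.
calls = []
listeners = []

def add_task_completion_listener(fn):
    listeners.append(fn)

def run_task_completion_listeners():
    for fn in reversed(listeners):  # LIFO: last registered runs first
        fn()

# Registration is bottom-up: the scan (child) registers before the runner (parent).
add_task_completion_listener(lambda: calls.append("scan: free off-heap buffers"))
add_task_completion_listener(lambda: calls.append("runner: stop writer thread"))
run_task_completion_listeners()

# The runner's listener runs first, so the scan's buffers are still valid
# while the writer thread is being stopped.
assert calls == ["runner: stop writer thread", "scan: free off-heap buffers"]
```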
@timarmstrong I implemented your suggestion to ensure sequential execution of listeners - it wasn't too complex after all. I also added tests to verify sequential execution, ordering, and liveness in case of reentrancy.
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #144377 has finished for PR 34245 at commit

retest this please

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #144389 has finished for PR 34245 at commit

also cc @JoshRosen @Ngone51

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #144414 has finished for PR 34245 at commit
LGTM
thanks, merging to master/3.2!
… task completion listener returns

### What changes were proposed in this pull request?

Python UDFs in Spark SQL are run in a separate Python process. The Python process is fed input by a dedicated thread (`BasePythonRunner.WriterThread`). This writer thread drives the child plan by pulling rows from its output iterator and serializing them across a socket.

When the child exec node is the off-heap vectorized Parquet reader, these rows are backed by off-heap memory. The child node uses a task completion listener to free the off-heap memory at the end of the task, which invalidates the output iterator and any rows it has produced. Since task completion listeners are registered bottom-up and executed in reverse order of registration, this is safe as long as an exec node never accesses its input after its task completion listener has executed.[^1]

The BasePythonRunner task completion listener violates this assumption. It interrupts the writer thread, but does not wait for it to exit. This causes a race condition that can lead to an executor crash:

1. The Python writer thread is processing a row backed by off-heap memory.
2. The task finishes, for example because it has reached a row limit.
3. The BasePythonRunner task completion listener sets the interrupt status of the writer thread, but the writer thread does not check it immediately.
4. The child plan's task completion listener frees its off-heap memory, invalidating the row that the Python writer thread is processing.
5. The Python writer thread attempts to access the invalidated row. The use-after-free triggers a segfault that crashes the executor.

This PR fixes the bug by making the BasePythonRunner task completion listener wait for the writer thread to exit before returning. This prevents its input from being invalidated while the thread is running. The sequence of events is now as follows:

1. The Python writer thread is processing a row backed by off-heap memory.
2. The task finishes, for example because it has reached a row limit.
3. The BasePythonRunner task completion listener sets the interrupt status of the writer thread and waits for the writer thread to exit.
4. The child plan's task completion listener can safely free its off-heap memory without invalidating live rows.

TaskContextImpl previously held a lock while invoking the task completion listeners. This would now cause a deadlock because the writer thread's exception handler calls `TaskContextImpl#isCompleted()`, which needs to acquire the same lock. To avoid deadlock, this PR modifies TaskContextImpl to release the lock before invoking the listeners, while still maintaining sequential execution of listeners.

[^1]: This guarantee was not historically recognized, leading to similar bugs as far back as 2014 ([SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019?focusedCommentId=13953661&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13953661)). The root cause was the lack of a reliably-ordered mechanism for operators to free resources at the end of a task. Such a mechanism (task completion listeners) was added and gradually refined, and we can now make this guarantee explicit. (An alternative approach is to use closeable iterators everywhere, but this would be a major change.)

### Why are the changes needed?

Without this PR, attempting to use Python UDFs while the off-heap vectorized Parquet reader is enabled (`spark.sql.columnVector.offheap.enabled true`) can cause executors to segfault.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A [previous PR](#30177) reduced the likelihood of encountering this race condition, but did not eliminate it. The accompanying tests were therefore flaky and had to be disabled. This PR eliminates the race condition, allowing us to re-enable these tests. One of the tests, `test_pandas_udf_scalar`, previously failed 30/1000 times and now always succeeds.

An internal workload previously failed with a segfault about 40% of the time when run with `spark.sql.columnVector.offheap.enabled true`, and now succeeds 100% of the time.

Closes #34245 from ankurdave/SPARK-33277-thread-join.

Authored-by: Ankur Dave <ankurdave@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dfca1d1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
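The fix described above can be sketched with plain Python threads and events (illustrative stand-ins, not Spark's classes): the runner's completion listener joins the writer thread before the scan's listener frees the memory the writer was reading.

```python
import threading
import time

stop_requested = threading.Event()   # analogous to interrupting the writer thread
buffers_freed = threading.Event()    # analogous to freeing off-heap memory

def writer_thread_main():
    # The writer keeps touching (simulated) off-heap rows until told to stop.
    while not stop_requested.is_set():
        assert not buffers_freed.is_set(), "use-after-free!"
        time.sleep(0.001)

writer = threading.Thread(target=writer_thread_main)
writer.start()

def python_runner_listener():
    stop_requested.set()
    writer.join()  # the fix: do not return until the writer has exited

def parquet_scan_listener():
    buffers_freed.set()  # safe: the writer is guaranteed to have exited

# Listeners run in reverse registration order: the runner's first, then the scan's.
for listener in (python_runner_listener, parquet_scan_listener):
    listener()
assert not writer.is_alive()
```

Without the `writer.join()`, the writer could still be inside its loop when `buffers_freed` is set, which is the race the PR eliminates.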
I didn't realize that the linked JIRA ticket is the old one. @ankurdave can you create a new JIRA ticket for this bug? thanks!
@cloud-fan Thanks! I created https://issues.apache.org/jira/browse/SPARK-37088. I noticed that …

I noticed it occurred on another recent PR as well: #34352 failed in … I was also able to repro this locally with:

    ./build/sbt -Phive package
    ./build/sbt test:compile
    seq 100 | parallel -j 8 --halt now,fail=1 'echo {#}; python/run-tests --testnames pyspark.sql.tests.test_udf'

Here's the location of the segfault:
I think I know why this is happening. The task completion listener that closes the vectorized reader is registered lazily in … I didn't realize this was the case, and it contradicts the assumption in this PR that task completion listeners are registered bottom-up. I'll submit a new PR to fix this.
Hm, the UDF tests (…):

https://github.com/apache/spark/runs/3981850943?check_suite_focus=true

Yeah, sorry about that. The flakiness should be fixed by #34369 when we merge that PR.
…ner lazily

### What changes were proposed in this pull request?

The previous PR #34245 assumed task completion listeners are registered bottom-up. `ParquetFileFormat#buildReaderWithPartitionValues()` violates this assumption by registering a task completion listener to close its output iterator lazily. Since task completion listeners are executed in reverse order of registration, this listener always runs before other listeners. When the downstream operator contains a Python UDF and the off-heap vectorized Parquet reader is enabled, this results in a use-after-free that causes a segfault.

The fix is to close the output iterator using FileScanRDD's task completion listener.

### Why are the changes needed?

Without this PR, the Python tests introduced in #34245 are flaky ([see details in thread](#34245 (comment))). They intermittently fail with a segfault.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Repeatedly ran one of the Python tests introduced in #34245 using the commands below. Previously, the test was flaky and failed after about 50 runs. With this PR, the test has not failed after 1000 runs.

```sh
./build/sbt -Phive clean package && ./build/sbt test:compile
seq 1000 | parallel -j 8 --halt now,fail=1 'echo {#}; python/run-tests --testnames pyspark.sql.tests.test_udf'
```

Closes #34369 from ankurdave/SPARK-37089.

Authored-by: Ankur Dave <ankurdave@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
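To illustrate why lazy registration breaks the bottom-up assumption, here is a minimal sketch (hypothetical names): under LIFO (reverse-registration-order) execution, the listener registered last runs first, so a lazily registered close-listener runs before every downstream listener.

```python
# Why a lazily registered listener runs first under LIFO execution.
calls = []
listeners = []

def add_task_completion_listener(fn):
    listeners.append(fn)

def run_task_completion_listeners():
    for fn in reversed(listeners):  # last registered runs first
        fn()

# Eager, bottom-up registration at plan-construction time:
add_task_completion_listener(lambda: calls.append("file scan: close iterator"))
add_task_completion_listener(lambda: calls.append("runner: join writer thread"))

# The bug: a listener registered lazily (on first row access) is registered
# LAST, so it runs FIRST and closes the reader while downstream still needs it.
add_task_completion_listener(lambda: calls.append("lazy: close vectorized reader"))

run_task_completion_listeners()
assert calls[0] == "lazy: close vectorized reader"
```

Registering the close logic with FileScanRDD's (eagerly registered) listener restores the intended ordering.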