Reduce python expansion service startup time#38611
Conversation
…ilure (apache#38572)" This reverts commit 930b94c.
When a SubprocessServer fails to start (e.g., due to a process exit or startup error), the server process could leak if standard purging is blocked by other active owners sharing the cached subprocess. To fix this: - Implement `_SharedCache.force_remove()` to immediately remove a key from the cache and run its destructor regardless of active owners. - Add `SubprocessServer.stop_force()` which calls `force_remove()` to completely terminate the server's process. - Call `stop_force()` in the `except` block of `SubprocessServer.start()`
This ensures we can download pre-built wheels for environment staging rather than relying on tarball building, which is sometimes slow.
|
r: @derrickaw |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses expansion service startup timeouts by optimizing dependency installation and improving subprocess lifecycle management. By allowing the use of pre-built wheels and implementing robust cleanup mechanisms for failed processes, the changes ensure more reliable and faster service initialization in CI environments. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the Python post-commit trigger version, changes the expansion service binding address to '0.0.0.0', and updates the manylinux platform detection logic for newer pip versions. Additionally, it refactors the SubprocessServer by adding a force_remove method to the shared cache and implementing a stop_force cleanup mechanism during startup failures. Review feedback highlights that removing the validation check for add_insecure_port may lead to silent failures and misleading logs. Furthermore, the removal of the retry logic in the start method could make the service less resilient to transient startup issues.
There was a problem hiding this comment.
Code Review
The pull request modifies the expansion service to bind to 0.0.0.0 and updates the SubprocessServer with a force_remove method and a stop_force cleanup routine to prevent resource leaks. It also adjusts the manylinux platform detection logic for Python SDK containers. The reviewer recommends restoring the gRPC port binding check to ensure the service starts correctly and handles ephemeral ports properly. Additionally, the reviewer suggests replacing bare except blocks with Exception to avoid catching system interrupts and to improve error logging.
This PR reverts the changes from #38572 and addresses the root cause of the expansion service timeouts by enabling pre-built wheel downloads and improving subprocess cleanup.
Further investigation into the expansion service test hangs revealed that the issue was not caused by network instability or port collisions. Instead, the service was timing out while downloading and building the numpy source tarball. This build process is heavily influenced by GitHub Runner load, leading to frequent timeouts during environment staging.
In this PR, we update Stager to support new manylinux tag, which allows the service to download pre-built wheels rather than compiling from source. This significanly accelerate the startup time of the expansion service. In local environments, these changes reduced the runtime of
MLTest::test_ml_preprocessing_yamlfrom 110 seconds to 8 seconds!Furthermore, we introduce
force_remove()andstop_force()to ensure that if a SubprocessServer fails to start, the process is immediately terminated and removed from the cache, preventing leaks even when shared by multiple owners.