[pull] master from tensorflow:master #350
Merged
pull[bot] merged 15 commits into barkpixels:master on May 30, 2025
PiperOrigin-RevId: 765209533
PiperOrigin-RevId: 765233541
…uild-arm64 container to us-docker.pkg.dev/ml-oss-artifacts-published/ml-public-container/ml-build-arm64. The containers are identical (same build script); they just live in different repositories. Also switch the remaining uses of the `ml-build` container over to the new one. PiperOrigin-RevId: 765237536
Calling the same subgraph recursively via a CALL_ONCE op creates an infinite recursion that causes a stack overflow. Added a check so that the same subgraph cannot call itself via a CALL_ONCE op. PiperOrigin-RevId: 765251354
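The check described in the commit above can be illustrated with a minimal sketch. This is not the TensorFlow Lite implementation: the `Op` and `Subgraph` classes and the `find_self_calling_subgraphs` helper are hypothetical stand-ins for a model's subgraph list, and the sketch only covers the direct case named in the commit message (a subgraph whose CALL_ONCE op targets its own index), not longer call cycles.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Op:
    # Hypothetical, simplified op record: an opcode name plus, for
    # CALL_ONCE ops, the index of the subgraph the op invokes.
    opcode: str
    init_subgraph_index: Optional[int] = None


@dataclass
class Subgraph:
    ops: List[Op] = field(default_factory=list)


def find_self_calling_subgraphs(subgraphs: List[Subgraph]) -> List[int]:
    """Return indices of subgraphs that invoke themselves via CALL_ONCE."""
    offenders = []
    for index, subgraph in enumerate(subgraphs):
        for op in subgraph.ops:
            # A CALL_ONCE op whose target is the enclosing subgraph would
            # recurse without bound, so flag the subgraph and move on.
            if op.opcode == "CALL_ONCE" and op.init_subgraph_index == index:
                offenders.append(index)
                break
    return offenders


model = [
    Subgraph([Op("ADD"), Op("CALL_ONCE", 1)]),  # calls subgraph 1: fine
    Subgraph([Op("CALL_ONCE", 1)]),             # calls itself: rejected
    Subgraph([Op("CALL_ONCE", 0)]),             # calls subgraph 0: fine
]
bad = find_self_calling_subgraphs(model)
```

A loader built along these lines would presumably refuse the model when the returned list is non-empty, rather than letting the recursion overflow the stack at run time.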
…imes. PiperOrigin-RevId: 765269354
…e us-docker.pkg.dev/ml-oss-artifacts-published/ml-public-container. The older container at `us-central1-docker.pkg.dev` is no longer maintained. PiperOrigin-RevId: 765283088
…d updates relevant scripts and configs PiperOrigin-RevId: 765284686
PiperOrigin-RevId: 765287094
PiperOrigin-RevId: 765302519
PiperOrigin-RevId: 765306410
PiperOrigin-RevId: 765308679
…_PLUGIN` PiperOrigin-RevId: 765312495
…ency PiperOrigin-RevId: 765325170
…s. This is most relevant for Async Jax PST training, where workers can reconnect on preemption and training continues on the other workers. The following scenario is addressed:

1. begin loop barrier
2. Run training steps
3. end loop barrier
4. some_other_barrier
5. Perform checkpointing etc.
6. Go back to 1.

If a task is restarted while step 2 is in progress, the restarted task waits on the begin loop barrier, while the other tasks wait on the end loop barrier or on some_other_barrier (depending on where they are). A task can wait on only one barrier at a time, so this creates a deadlock. To avoid it, the restarted task is ignored in the end loop barrier and in any other barrier until it is synced again at the begin loop barrier. This lets the other tasks proceed while the restarted task continues to wait on the begin loop barrier. When the other tasks reach the begin loop barrier again, after step 5, the restarted task is synced with them and can be removed from the `unsynced_tasks` set. This change allows the model to recover gracefully from preemption when some of the workers are slow and thus get preempted during training. PiperOrigin-RevId: 765331856
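The barrier behavior described in this commit message can be modeled with a small single-process sketch. It is an illustration under stated assumptions, not the coordination service's code: the `BarrierCoordinator` class, its `arrive` and `restart` methods, and the barrier names are hypothetical; only the `unsynced_tasks` set is named in the commit message.

```python
class BarrierCoordinator:
    """Toy model of barriers that skip restarted (unsynced) tasks."""

    def __init__(self, tasks, begin_barrier="begin_loop"):
        self.tasks = set(tasks)
        self.begin_barrier = begin_barrier
        self.unsynced_tasks = set()  # restarted tasks not yet re-synced
        self.waiting = {}            # barrier id -> tasks currently waiting

    def _try_release(self, barrier_id):
        waiters = self.waiting.get(barrier_id, set())
        if barrier_id == self.begin_barrier:
            # The begin loop barrier always needs every task, which is
            # what re-syncs a restarted task with the rest.
            required = self.tasks
        else:
            # Every other barrier ignores tasks that are still unsynced.
            required = self.tasks - self.unsynced_tasks
        if not required <= waiters:
            return False
        if barrier_id == self.begin_barrier:
            self.unsynced_tasks.clear()
        self.waiting[barrier_id] = set()
        return True

    def arrive(self, task, barrier_id):
        """Register a task at a barrier; True means the barrier released."""
        self.waiting.setdefault(barrier_id, set()).add(task)
        return self._try_release(barrier_id)

    def restart(self, task):
        """Mark a task as restarted; return barriers this unblocks."""
        for waiters in self.waiting.values():
            waiters.discard(task)
        self.unsynced_tasks.add(task)
        released = []
        for barrier_id in list(self.waiting):
            if self.waiting[barrier_id] and self._try_release(barrier_id):
                released.append(barrier_id)
        return released


coord = BarrierCoordinator(["a", "b", "c"])
first_loop = [coord.arrive(t, "begin_loop") for t in ("a", "b", "c")]
coord.restart("c")                           # "c" is preempted during step 2
c_waits = coord.arrive("c", "begin_loop")    # restarted task waits at step 1
end_loop = [coord.arrive(t, "end_loop") for t in ("a", "b")]
other = [coord.arrive(t, "some_other_barrier") for t in ("a", "b")]
next_loop = [coord.arrive(t, "begin_loop") for t in ("a", "b")]
```

In this model the end loop barrier and `some_other_barrier` release once "a" and "b" arrive, because "c" is in `unsynced_tasks`; the next begin loop barrier releases only when all three tasks are present, at which point the set is emptied.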
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.1)
Can you help keep this open source service alive? 💖 Please sponsor : )