
[pull] master from tensorflow:master #350

Merged

pull[bot] merged 15 commits into barkpixels:master from tensorflow:master
May 30, 2025
Conversation


pull[bot] commented May 30, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

mkuperst and others added 15 commits May 30, 2025 08:59
PiperOrigin-RevId: 765209533
PiperOrigin-RevId: 765233541
…uild-arm64 container to us-docker.pkg.dev/ml-oss-artifacts-published/ml-public-container/ml-build-arm64.

These containers are the same (same build script); they just live in different repositories. Also switch the remaining uses of the `ml-build` container over to the new one as well.

PiperOrigin-RevId: 765237536
Calling the same subgraph recursively via a CALL_ONCE op creates an infinite recursion that causes a stack overflow.
Added a check so that the same subgraph cannot call itself via a CALL_ONCE op.

PiperOrigin-RevId: 765251354
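The guard described in the commit above can be sketched as follows. This is an illustrative Python sketch, not TFLite's actual C++ implementation; the function and parameter names are hypothetical:

```python
# Hypothetical sketch of the CALL_ONCE guard described above: a subgraph
# must not invoke itself via CALL_ONCE, or execution recurses until the
# stack overflows. Names are illustrative, not TFLite's real API.

def validate_call_once(current_subgraph_index: int,
                       target_subgraph_index: int) -> bool:
    """Return True if the CALL_ONCE target is legal (no self-recursion)."""
    return target_subgraph_index != current_subgraph_index
```

Rejecting the op at model-validation time turns a runtime stack overflow into an explicit, diagnosable error.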
…e us-docker.pkg.dev/ml-oss-artifacts-published/ml-public-container.

The older container in `us-central1-docker.pkg.dev` is no longer maintained.

PiperOrigin-RevId: 765283088
…d updates relevant scripts and configs

PiperOrigin-RevId: 765284686
PiperOrigin-RevId: 765306410
When XProf removed Tensorflow as a dependency, we also renamed @local_xla back to @xla, and likewise @tsl. This broke compatibility with Tensorflow, so we are adding a mapping to mimic the old behavior.

PiperOrigin-RevId: 765326579
…s. This is most relevant for Async Jax PST training, where workers can reconnect on preemption and the training continues on other workers.

The following scenario is addressed:
1. begin loop barrier
2. Run training steps
3. end loop barrier
4. some_other_barrier
5. Perform checkpointing etc.
6. Go back to 1.

If a task is restarted while step 2 is in progress, the restarted task will wait on the begin-loop barrier, while the other tasks will wait on the end-loop barrier or on some_other_barrier (depending on where they are).

A task can wait on only one barrier at a time, so this creates a deadlock. To avoid it, we ignore the restarted task in the end-loop barrier (and any other barrier) until it is synced again at the begin-loop barrier. This lets the other tasks proceed, while the restarted task continues to wait on the begin-loop barrier. When the other tasks loop back to the begin-loop barrier (step 1), the restarted task is synced with them and can be removed from the unsynced_tasks set.

This change allows the model to gracefully recover from preemption when some of the workers are slow and thus get preempted during training.

PiperOrigin-RevId: 765331856
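The recovery rule above can be illustrated with a small sketch. This is not the actual coordination-service code; the class, method, and barrier names are hypothetical:

```python
# Illustrative sketch of the recovery rule described above: a restarted
# task is tracked in `unsynced_tasks` and ignored by every barrier except
# the begin-loop barrier, where it re-syncs with the other tasks.

class BarrierCoordinator:
    def __init__(self, tasks):
        self.tasks = set(tasks)
        self.unsynced_tasks = set()

    def task_restarted(self, task):
        # A restarted task must re-sync at the begin-loop barrier.
        self.unsynced_tasks.add(task)

    def participants(self, barrier_name):
        # Every barrier except begin_loop excludes unsynced tasks, so the
        # remaining tasks are not deadlocked waiting for the restarted one.
        if barrier_name == "begin_loop":
            return set(self.tasks)
        return self.tasks - self.unsynced_tasks

    def barrier_passed(self, barrier_name, task):
        # Once the restarted task reaches begin_loop with the others, it
        # is synced again and removed from unsynced_tasks.
        if barrier_name == "begin_loop":
            self.unsynced_tasks.discard(task)
```

Because every barrier other than begin_loop simply excludes unsynced tasks, the surviving workers can finish the current loop iteration and re-admit the restarted worker at the top of the next one.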
@pull pull bot added the ⤵️ pull label May 30, 2025
@pull pull bot merged commit 6fb7fa5 into barkpixels:master May 30, 2025
8 participants