Sac fix #96
Conversation
mighty/mighty_agents/base_agent.py
Outdated
# 3) optionally overwrite next_s on truncation
if self.handle_timeout_termination:
Not sure I like the naming. What does it mean to "handle_timeout_termination"? Should be more expressive. Also: when do we want this? Always? On specific envs? Specific algos? I would actually assume always, since we only want the next_s for next action prediction and always final obs in the replay. In that case we don't need a flag at all.
Removed the optional flag
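For reference, a minimal sketch of what the flag-free handling discussed above could look like, assuming a Gymnasium-style vectorized env that exposes the pre-reset observation under an infos key; the key name, buffer API, and helper names here are assumptions, not Mighty's actual interface:

```python
import numpy as np

def collect_step(env, policy, buffer, curr_s):
    """One environment step with truncation-aware bookkeeping (sketch)."""
    action = policy(curr_s)
    next_s, reward, terminated, truncated, infos = env.step(action)

    # The replay buffer should see the real final observation of a truncated
    # episode, not the auto-reset observation that follows it.
    buffer_next_s = next_s.copy()
    if np.any(truncated) and "final_observation" in infos:  # assumed info key
        for i in np.where(truncated)[0]:
            buffer_next_s[i] = infos["final_observation"][i]

    # "done" for the Bellman backup means true termination only; truncation
    # is just a time limit, so we still bootstrap from buffer_next_s.
    replay_dones = terminated

    buffer.add(curr_s, action, reward, buffer_next_s, replay_dones)  # assumed API
    return next_s  # the post-reset observation is what the policy acts on next
```

This matches the reviewer's point: the buffer always stores the real final observation, `done` means true termination only, and no flag is needed.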
next_s, reward, terminated, truncated, infos = self.env.step(action)

# 2) decide which samples are true "done"
replay_dones = terminated  # physics failure only
Comment here is env-specific. Also inconsistent: dones are always overwritten to real termination regardless of what the flag says.
Make this default
mighty/mighty_agents/sac.py
Outdated
# Pack transition
transition = TransitionBatch(curr_s, action, reward, next_s, dones)
# Pack transition
# `terminated` is used for physics failures in environments like `MightyEnv`
At least remove the weird AI comments
elif isinstance(out, tuple) and len(out) == 4:
    action = out[0]  # [batch, action_dim]

    print(f'Self Model : {self.model}')
remove print
print(f'Self Model : {self.model}')
log_prob = sample_nondeterministic_logprobs(
    z=out[1], mean=out[2], log_std=out[3], sac=self.algo == "sac"
    z=out[1], mean=out[2], log_std=out[3], sac=isinstance(self.model, SACModel)
Bad idea! What if I want to implement a different model class for SAC that, e.g., handles prediction differently? Then the policy stops functioning.
z: torch.Tensor,
mean: torch.Tensor,
log_std: torch.Tensor,
sac: bool = False
The flag is here to stay model agnostic. Now you make it impossible to add new model classes for SAC...
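To illustrate the reviewer's point, a sketch of how the boolean flag keeps the helper model-agnostic; the function body is an assumption based on the names in the diff, not the actual Mighty implementation:

```python
import torch

def sample_nondeterministic_logprobs(
    z: torch.Tensor,
    mean: torch.Tensor,
    log_std: torch.Tensor,
    sac: bool = False,
) -> torch.Tensor:
    """Log-prob of a Gaussian sample z, with optional tanh correction (sketch)."""
    dist = torch.distributions.Normal(mean, log_std.exp())
    log_prob = dist.log_prob(z).sum(dim=-1)
    if sac:
        # Change-of-variables term for a tanh-squashed action a = tanh(z).
        log_prob -= torch.log(1 - torch.tanh(z).pow(2) + 1e-6).sum(dim=-1)
    return log_prob
```

The call site can then stay model-agnostic, e.g. `sac=self.algo == "sac"` as in the original code, rather than checking `isinstance(self.model, SACModel)`.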
# 4-tuple case (Tanh squashing): (action, z, mean, log_std)
elif isinstance(model_output, tuple) and len(model_output) == 4:
    action, z, mean, log_std = model_output
    log_prob = sample_nondeterministic_logprobs(
I don't understand the reason for changing this: it's the same code, just longer, and it locks us into a specific model class?
return action.detach().cpu().numpy(), log_prob
else:
    weighted_log_prob = log_prob * self.entropy_coefficient
    weighted_log_prob = log_prob
This is strange, now both do the same?!
Reverted
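For context on why the two branches should differ, a minimal sketch of how the entropy-weighted log-prob usually enters the SAC actor objective; only `entropy_coefficient` and the weighting come from the diff, the rest of the function is an assumption:

```python
import torch

def sac_actor_loss(q1, q2, log_prob, entropy_coefficient):
    """SAC policy loss on freshly sampled actions (sketch).

    q1, q2:   critic estimates for the sampled actions, shape [batch]
    log_prob: log pi(a|s) for the same actions, shape [batch]
    """
    q_min = torch.min(q1, q2)
    weighted_log_prob = entropy_coefficient * log_prob
    # Minimizing this maximizes E[Q - alpha * log pi], i.e. return plus entropy.
    return (weighted_log_prob - q_min).mean()
```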
log_prob = sample_nondeterministic_logprobs(
    z=z, mean=mean, log_std=log_std, sac=self.algo == "sac"
)
if not isinstance(self.model, SACModel):
Same issue as above: an identical function, just longer and worse.
| """ | ||
| feats = self.feature_extractor(state) | ||
| x = self.policy_net(feats) | ||
| x = self.policy_net(state) |
Not in the mighty format. The separate feature extractor is there to have a predictable structure and access to a feature embedding. A "Mighty-er" format would be to have a feature extractor -> policy head and then a q_feature_extractor. No functional difference, but it's relevant for continuity between algos.
Updated -- performance similar
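A sketch of the structure described above: a shared feature extractor feeding a policy head, with a separate feature extractor reserved for the Q-networks. The class and layer sizes are illustrative, not the actual Mighty modules:

```python
import torch
import torch.nn as nn

class SACPolicySketch(nn.Module):
    """Feature extractor -> policy head, mirroring the described layout (sketch)."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        # Shared trunk keeps the feature embedding accessible.
        self.feature_extractor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
        )
        # Policy head produces mean and log_std of a Gaussian.
        self.policy_net = nn.Linear(hidden, 2 * action_dim)
        # Separate trunk for the critics, as suggested in the review.
        self.q_feature_extractor = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden), nn.ReLU(),
        )

    def forward(self, state: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        feats = self.feature_extractor(state)
        mean, log_std = self.policy_net(feats).chunk(2, dim=-1)
        return mean, log_std
```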
Updates to SAC
- `handle_timeout_termination`: when this is set to true, we treat the final states of `terminated` differently from `truncated`. This is related to [Bug] Infinite horizon tasks are handled like episodic tasks DLR-RM/stable-baselines3#284
- We now use the `policy_log_prob()` from the SAC model exclusively for the tanh correction instead of `sample_nondeterministic_logprobs`. The latter can potentially be made just for PPO
- Added `make_policy_head()` to separate the policy head functionality
- The SAC network forward method now handles action rescaling and log_prob resampling
- SAC update uses fresh samples for the alpha update, and exponentiates `log_alpha`
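For reference, a minimal sketch of the temperature update in the last bullet, assuming a standard SAC setup where `log_alpha` is a learnable scalar with its own optimizer and the policy exposes a `sample()` method returning actions and log-probs (both assumptions):

```python
import torch

def make_alpha_update(action_dim: int, lr: float = 3e-4):
    """Build a temperature (alpha) update step for SAC (sketch)."""
    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=lr)
    target_entropy = -float(action_dim)  # common heuristic: -|A|

    def update_alpha(policy, states):
        # Fresh samples from the *current* policy, not the replayed actions.
        with torch.no_grad():
            _, log_prob = policy.sample(states)  # assumed: returns (action, log_prob)
        alpha_loss = -(log_alpha * (log_prob + target_entropy)).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()
        return log_alpha.exp().item()  # exponentiate log_alpha wherever alpha is used

    return update_alpha
```

Using fresh samples keeps the entropy estimate on-policy, and exponentiating `log_alpha` guarantees the coefficient stays positive.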