PyTorch implementations of core Reinforcement Learning algorithms, built to understand the mathematical progression from Policy Gradients to Actor-Critic.
This repository is a "Lab Notebook" documenting the trade-offs between Bias, Variance, and Sample Efficiency in Deep RL.
I implemented three distinct variations to observe how changing the update target affects learning stability.
| Implementation | Type | Update Frequency | Key Characteristic | Pros/Cons |
|---|---|---|---|---|
| 1. REINFORCE | Monte Carlo | End of Episode | Uses full episode return | ✅ Unbiased ❌ High Variance (Noisy) |
| 2. Truncated PG | N-Step | Every 10 Steps | Experimental: Updates on partial returns without a Critic | ✅ Faster Updates ❌ Biased/Myopic (Ignores long-term future) |
| 3. A2C | Actor-Critic | Every 10 Steps | Uses a Critic to bootstrap the return | ✅ Low Variance & Low Bias ❌ Complex Implementation |
REINFORCE: the baseline implementation.
- Method: Uses full Monte Carlo returns (sketched below). The agent must finish an entire episode before learning anything.
- Optimization: Implemented Batching (updating every 4-8 episodes) to average out the noise in the gradients.
- Result: Solves CartPole-v1, but training is jittery due to high variance in episode lengths.
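A minimal sketch of this style of update, i.e. full-episode returns fed into the policy-gradient loss; function names and the normalization step are illustrative rather than the exact code in this repo:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Full Monte Carlo returns G_t for one finished episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick for REINFORCE.
    return (returns - returns.mean()) / (returns.std() + 1e-8)

def reinforce_loss(log_probs, returns):
    """Policy-gradient loss: -mean( log pi(a_t|s_t) * G_t )."""
    return -(torch.stack(log_probs) * returns).mean()
```

With batching, the losses from several episodes are accumulated before a single optimizer step, which is what averages out the gradient noise.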
Truncated Policy Gradient: an intermediate experiment between REINFORCE and A2C.
- Method: Updates the policy every N=10 steps using the immediate sum of rewards.
- The "Bias" Lesson: Unlike REINFORCE, this method cuts off the return calculation after 10 steps. Without a Value Function to estimate the rest of the future, the agent becomes short-sighted, prioritizing immediate survival over long-term balance.
- Key Feature: Implemented custom KL Divergence tracking to monitor how aggressively the policy updates.
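For the KL tracking mentioned above, a minimal sketch of one standard estimator of the divergence between the pre-update and post-update policies; the exact estimator used in this repo may differ:

```python
import torch

def approx_kl(old_log_probs: torch.Tensor, new_log_probs: torch.Tensor) -> torch.Tensor:
    """Approximate KL(old || new) from log-probs of the same sampled actions.

    Uses the (ratio - 1 - log ratio) estimator, which is low-variance and
    always non-negative; spikes signal overly aggressive policy updates.
    """
    log_ratio = new_log_probs - old_log_probs
    ratio = log_ratio.exp()
    return (ratio - 1.0 - log_ratio).mean()
```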
A2C (Advantage Actor-Critic): the robust solution.
- Method: Solves the "Short-sighted" problem of Method #2 by introducing a Critic Network.
- Bootstrapping: the Critic's value estimate stands in for all rewards beyond the n-step window:
  $G_t = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})$
- Shared Architecture: Uses a single backbone network with split heads (Actor vs. Critic) for computational efficiency (see the sketch after this list).
- Status: Currently migrating this agent to Continuous Control (Inverted Pendulum) using Gaussian Distributions.
- N-Step Trade-off: Increasing N generally reduces bias but increases variance.
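A minimal sketch of the two ideas above (shared backbone with split heads, and n-step bootstrapped targets), assuming a discrete-action CartPole setup; class and function names, layer sizes, and activations are illustrative:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Single shared backbone with split Actor (policy logits) and Critic (value) heads."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor_head = nn.Linear(hidden, n_actions)  # action logits
        self.critic_head = nn.Linear(hidden, 1)         # state value V(s)

    def forward(self, obs: torch.Tensor):
        features = self.backbone(obs)
        return self.actor_head(features), self.critic_head(features).squeeze(-1)

def bootstrapped_targets(rewards, dones, last_value, gamma=0.99):
    """n-step targets: roll back from V(s_{t+n}), cutting the bootstrap at episode ends."""
    g, targets = last_value, []
    for r, done in zip(reversed(rewards), reversed(dones)):
        g = r + gamma * g * (1.0 - done)
        targets.insert(0, g)
    return torch.tensor(targets)
```

The Actor's advantage signal is then $G_t - V(s_t)$, while the Critic regresses its prediction toward $G_t$.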
All experiments are tracked using Weights & Biases (WandB).
Key metrics tracked:
- Entropy: To monitor policy collapse (premature convergence).
- Approx KL: To ensure updates aren't destroying the policy.
- Grad Norm: To detect exploding gradients in the early phase.
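A sketch of how these three diagnostics can be computed right after a backward pass, assuming a discrete (Categorical) policy; `log_diagnostics` and its arguments are illustrative names, while `wandb.log` is the standard WandB call:

```python
import torch
import wandb
from torch.distributions import Categorical

def log_diagnostics(logits, old_log_probs, new_log_probs, policy, step):
    """Call after loss.backward() so gradients are populated."""
    entropy = Categorical(logits=logits).entropy().mean()      # policy-collapse check
    kl = (old_log_probs - new_log_probs).mean()                # simple approx KL
    per_param = [p.grad.norm() for p in policy.parameters() if p.grad is not None]
    grad_norm = torch.stack(per_param).norm()                  # total L2 gradient norm
    wandb.log({"entropy": entropy.item(),
               "approx_kl": kl.item(),
               "grad_norm": grad_norm.item()}, step=step)
```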
- Framework: PyTorch (MPS/Metal Acceleration enabled for Mac; see the device-selection sketch below).
- Environment: Gymnasium (CartPole-v1) & MuJoCo (InvertedPendulum-v4).
- Logging: WandB.
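For the MPS note above, a minimal device-selection snippet using the standard PyTorch backend checks, falling back to CUDA or CPU where Metal is unavailable:

```python
import torch

# Prefer Apple's Metal backend on Mac, then CUDA, then CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
```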
- Complete Sim-to-Real transfer tests on Inverted Pendulum (Domain Randomization).
- Implement PPO (Proximal Policy Optimization) to fix the trust-region issue in A2C.
Finding:
Contrary to standard theory, Batched REINFORCE outperformed A2C on CartPole-v1.
Analysis:
- Complexity Overhead: A2C suffers from "non-stationary targets"—the Actor learns from a Critic that is itself still learning. On a simple task like CartPole, this added complexity slows down initial convergence compared to the unbiased Monte Carlo returns of REINFORCE.
- The "Bias" of Truncated PG: The Truncated Policy Gradient (10-step updates without a Critic) struggled because it is myopic. It optimizes for immediate survival (next 10 steps) but lacks a value signal for long-term stability.
- Gradient Clipping: Aggressive clipping (Norm=2.0) degraded performance. In CartPole, recovering from a near-fail state requires large gradient updates. Clipping prevented the policy from making these urgent corrections quickly.
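For reference, the clipping call being discussed is PyTorch's standard gradient-norm clip; the tiny policy and dummy loss below are stand-ins so the snippet runs on its own:

```python
import torch
import torch.nn as nn

policy = nn.Linear(4, 2)                # stand-in for the actual policy network
loss = policy(torch.randn(8, 4)).sum()  # dummy loss so gradients exist
loss.backward()

# With max_norm=2.0, large corrective gradients (e.g. recovering from a
# near-fail state) are scaled down, which slowed recovery in these runs.
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=2.0)
```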